Debugging a Complete Data Pipeline: From Source APIs to Cloud SQL

The Moment I Realized Something Was Wrong

I had built what I thought was a complete end-to-end data pipeline - fetching rice prices from World Bank APIs, integrating market factors, and syncing everything to Google Cloud SQL. The pipeline ran, but when I checked the results, only partial data was flowing through. My "complete" pipeline wasn't so complete after all.

Discovering the First Issue: The Rice Price Extraction Failure

The first red flag appeared when I ran my pipeline and saw this:

$ python data-pipeline/rice_price_spread_analysis.py --start-year 2008
ERROR - Failed to extract rice price data: Could not identify date column in Pink Sheet data
ValueError: Could not identify date column in Pink Sheet data

Investigation Process

I needed to understand what the World Bank Excel file actually contained:

$ python -c "import pandas as pd; xls = pd.ExcelFile('data/raw/external_rice_market/worldbank_pink_sheet/CMO-Historical-Data-Monthly_20251009.xlsx'); print('Sheets:', xls.sheet_names)"
Sheets: ['AFOSHEET', 'Monthly Prices', 'Monthly Indices', 'Description', 'Index Weights']

Then I examined the header structure:

$ python -c "import pandas as pd; df = pd.read_excel('data/raw/external_rice_market/worldbank_pink_sheet/CMO-Historical-Data-Monthly_20251009.xlsx', 'Monthly Prices', header=4); print('Columns:', list(df.columns[:5]))"
Columns: ['Unnamed: 0', 'Crude oil, average', 'Crude oil, Brent', 'Crude oil, Dubai', 'Crude oil, WTI']

The problem became clear - the Excel file had multiple header rows with units in row 5, and the dates were stored in the first unnamed column as strings like '1960M01' (four-digit year, a literal 'M', two-digit month) rather than as standard datetime values.

Root Cause

The World Bank had changed their Excel format. My script expected dates in a standard column, but they were now in the first unnamed column with a special format. Additionally, the rice column names had slight variations - "Rice, Viet Namese 5%" instead of "Rice, Vietnamese 5%".
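
The fix was to parse that first column explicitly and match the rice columns loosely. Below is a minimal sketch of the defensive extraction, not my exact script - the file path mirrors the one above, and the rice-column matching is deliberately fuzzy so spelling variations like "Viet Namese" still qualify:

import pandas as pd

PINK_SHEET = "data/raw/external_rice_market/worldbank_pink_sheet/CMO-Historical-Data-Monthly_20251009.xlsx"

def extract_rice_prices(path=PINK_SHEET):
    # header=4 matches the layout found above; rows whose first cell is not
    # a 'YYYYMmm' date (e.g. the units row) are dropped by the coerced parse.
    df = pd.read_excel(path, sheet_name="Monthly Prices", header=4)

    # Dates live in the first, unnamed column as strings like '1960M01'.
    date_col = df.columns[0]
    df["date"] = pd.to_datetime(df[date_col].astype(str), format="%YM%m", errors="coerce")
    df = df.dropna(subset=["date"])

    # Match rice columns loosely so "Rice, Viet Namese 5%" and
    # "Rice, Vietnamese 5%" both qualify.
    rice_cols = [c for c in df.columns if isinstance(c, str) and c.lower().startswith("rice")]
    if not rice_cols:
        raise ValueError("Could not identify rice price columns in Pink Sheet data")

    return df[["date"] + rice_cols].reset_index(drop=True)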

The Second Issue: Matplotlib Chart Popups Blocking Pipeline

When I fixed the rice extraction and ran the market drivers script, something unexpected happened:

$ python data-pipeline/market_drivers.py
Series column chosen: Crude oil, Dubai
Rows: 198 | Date range: 2008-07-01 00:00:00 → 2024-12-01 00:00:00

A chart window popped up and froze my pipeline! The script was trying to display interactive charts instead of saving them to files.

Finding the Display Calls

I searched for the problematic code:

$ grep -n "plt.show()" data-pipeline/market_drivers.py
135:plt.show()
184:    plt.show()
244:    plt.show()
301:    plt.show()
363:    plt.show()
408:    plt.show()
453:    plt.show()
501:    plt.show()

Eight instances of plt.show() were blocking my pipeline execution.

The Fix That Led to Another Problem

I added a non-interactive backend at the top of the file and replaced all plt.show() calls with save operations. But then:

NameError: name 'PROCESSED_DATA_DIR' is not defined

My replacement code referenced PROCESSED_DATA_DIR without ever defining it, so one more fix was needed: adding the proper path definitions.
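
The combined fix looked roughly like the sketch below - the non-interactive Agg backend is selected before pyplot is imported, and the previously missing PROCESSED_DATA_DIR is defined (the exact constant names and output paths in market_drivers.py may differ):

from pathlib import Path

import matplotlib
matplotlib.use("Agg")          # non-interactive backend: never open a window
import matplotlib.pyplot as plt

# The path constant my replacement code referenced but initially forgot to define.
PROCESSED_DATA_DIR = Path("data/processed")
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

def save_chart(fig, name):
    # Save a figure to the processed-data folder instead of calling plt.show().
    out_path = PROCESSED_DATA_DIR / f"{name}.png"
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)             # free memory in long-running pipelines
    return out_path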

The Third Issue: GCP Connection Pool Exhaustion

With the data extraction working, I hit the final roadblock during GCP migration:

psycopg2.OperationalError: connection to server at "34.45.44.214", port 5432 failed: 
FATAL: remaining connection slots are reserved for non-replication superuser connections

Diagnosing the Connection Limit

I checked the Cloud SQL instance configuration:

$ gcloud sql instances describe rice-market-db --format="value(settings.maxConnections)"
# Empty output meant default of 25 connections for db-f1-micro

My pipeline was opening multiple connections for parallel processing, quickly exhausting the 25-connection limit.
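
Beyond raising the server-side limit (next section), the client side can be guarded by routing all workers through a small shared pool so they never exceed a handful of connections. Here is a sketch using psycopg2's ThreadedConnectionPool - the host, credentials, rice_prices table, and batch data are placeholders, not my actual schema:

from concurrent.futures import ThreadPoolExecutor

from psycopg2.pool import ThreadedConnectionPool

# Placeholder connection details - substitute the real Cloud SQL values.
pool = ThreadedConnectionPool(
    minconn=1,
    maxconn=5,  # stay well under the instance's 25 available slots
    host="34.45.44.214",
    port=5432,
    dbname="rice_market",
    user="pipeline_user",
    password="***",
)

def load_batch(rows):
    # Borrow a pooled connection, write one batch, then hand it back.
    conn = pool.getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO rice_prices (month, price_usd) VALUES (%s, %s)",
                rows,
            )
    finally:
        pool.putconn(conn)

# Illustrative batches only; the real pipeline chunks its own records.
batches = [[("2024-11-01", 536.0), ("2024-12-01", 514.0)]]
with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(load_batch, batches))
pool.closeall()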

The Solution

I increased the connection limit:

$ gcloud sql instances patch rice-market-db --database-flags=max_connections=100
WARNING: This patch modifies database flag values, which may require your instance to be restarted.
Do you want to continue (Y/n)? y
Patching Cloud SQL instance...done.

After waiting for the instance restart:

$ gcloud sql instances describe rice-market-db --format="value(state)"
RUNNABLE

Verification: The Complete Pipeline Finally Runs

After all fixes, I ran the complete pipeline:

$ python data-pipeline/pipeline_complete_from_source.py
######################################################################
# COMPLETE PIPELINE FROM SOURCE
# Started: 2025-10-09 18:09:12
######################################################################
✓ Rice prices fetched: 204 records
✓ Market factors: 7 indicators
✓ Rainfall: 198 monthly records
✓ Final dataset: 198 x 21
✓ Sync Airtable to Docker (13,818 records)
✓ Migrate to GCP Cloud SQL
# Duration: 993.8 seconds (16.6 minutes)
######################################################################

Success! The pipeline processed:

  • 204 months of rice price data from World Bank
  • 7 market indicators (oil, inflation, population, ENSO, fertilizer)
  • 198 integrated monthly records with 21 variables
  • 13,818 Airtable transaction records across 8 tables

Reflections on the Debugging Journey

This debugging session taught me valuable lessons about data pipeline development:

  1. External data sources change - I learned to write more defensive code that can handle format variations
  2. Silent failures are dangerous - Chart popups seemed harmless but completely blocked automation
  3. Resource limits matter - My local testing didn't reveal the connection pool exhaustion that occurred at scale
  4. Systematic investigation pays off - Using targeted CLI commands to examine each component saved hours of blind debugging

The most satisfying moment was seeing all green checkmarks after 16 minutes of processing. What started as a series of cryptic errors became a fully functional pipeline processing thousands of records from multiple sources.

CLI Commands Reference for Future Debugging

Excel/CSV Investigation

# Examine Excel structure
python -c "import pandas as pd; xls = pd.ExcelFile('file.xlsx'); print(xls.sheet_names)"

# Check DataFrame columns
python -c "import pandas as pd; df = pd.read_excel('file.xlsx', 'Sheet1', header=4); print(df.columns[:10])"

# Preview data with different header rows
python -c "import pandas as pd; df = pd.read_excel('file.xlsx', skiprows=5); print(df.head())"

Code Pattern Search

# Find function calls
grep -n "plt.show()" *.py

# Search for imports
grep -n "import matplotlib\|from matplotlib" *.py

# Find variable usage
grep -n "PROCESSED_DATA_DIR" *.py

GCP Cloud SQL Diagnostics

# Check instance status
gcloud sql instances describe INSTANCE_NAME --format="value(state)"

# View connection limits
gcloud sql instances describe INSTANCE_NAME --format="value(settings.maxConnections)"

# Update database flags
gcloud sql instances patch INSTANCE_NAME --database-flags=max_connections=100

# Test connection
psql -h IP_ADDRESS -U USERNAME -d DATABASE -c "SELECT 1;"

Process Monitoring

# Monitor script output (success/failure only)
python script.py 2>&1 | grep -E "(SUCCESS|FAILED|ERROR)"

# Check file creation
ls -la data/processed/*.png 2>/dev/null | tail -5

# Real-time log monitoring
tail -f pipeline.log | grep -E "ERROR|WARNING|SUCCESS"

Git Cleanup Operations

# Check uncommitted changes
git status --short

# Stash changes temporarily
git stash

# Move files to archive
mkdir -p archive/temp_files
mv *_fixed.py *_backup.* archive/temp_files/

These commands became my toolkit for systematically identifying and resolving each issue. Each command revealed another piece of the puzzle until the complete picture emerged.


If you enjoyed this article, you can also find it published on LinkedIn and Medium.