Building a Production Data Pipeline: My Debugging Journey with AirTable and PostgreSQL
The Beginning: When Theory Met Reality
I started with what seemed like a straightforward task - synchronize data from AirTable to PostgreSQL for a rice market operations system. The documentation made it sound simple. The initial script I wrote was clean, elegant, and completely unprepared for the reality of production data.
My first run revealed the truth immediately:
python data-pipeline/sync.py
Error: column "bx_a_chin" of relation "inventory_movements" does not exist
LINE 3: ...tch_note,vehicle_container,quantity_received_tons,bx_a_chin,...
This error marked the beginning of an extensive debugging journey that would teach me more about real-world data integration than any documentation ever could.
Discovery Phase: Understanding What I Was Really Dealing With
My first instinct was to understand the actual state of the database schema. I needed to see what column names PostgreSQL was expecting versus what my synchronization script was generating.
docker exec rice_market_postgres psql -U rice_admin -d rice_market_db \
  -c "SELECT column_name FROM information_schema.columns 
      WHERE table_schema = 'airtable_sync' 
      AND table_name = 'inventory_movements' 
      AND column_name LIKE 'bx%'"
The output revealed something interesting:
column_name
-----------
bx_a__chin
bx_a_dng
bx_ngoi_ni
bx_ti
bx_tr
bx___thoa
I discovered that bx_a__chin had double underscores in the database, while my script was generating bx_a_chin with a single underscore. These BX columns represented warehouse locations in Vietnamese - "BX Á Châu" meant Asia warehouse. The transformation from Vietnamese to SQL-safe names was happening differently in different parts of my system.
Problem #1: Vietnamese Character Handling and Naming Inconsistencies
The root cause became clear when I traced through the code. My DDL generation script and my synchronization script were both sanitizing Vietnamese characters, but they were doing it slightly differently. When "BX Á Châu" was processed during initial schema creation, it became bx_a__chin. But during synchronization, the same field became bx_a_chin.
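To see how that kind of drift can happen, here is a simplified sketch (not the project's actual sanitizers) of two reasonable implementations that disagree on whether accented characters are transliterated before separators are replaced:
# Simplified illustration only - these are NOT the project's real sanitizers,
# just two plausible ways of making a Vietnamese field name SQL-safe.
import re
import unicodedata

def strip_diacritics(name: str) -> str:
    # Decompose accented characters and drop the combining marks ("Á" -> "A", "â" -> "a").
    decomposed = unicodedata.normalize('NFD', name)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def sanitize_like_ddl(name: str) -> str:
    # Replace every disallowed character with its own underscore, WITHOUT
    # transliterating accents first - each accented letter becomes an underscore.
    return re.sub(r'[^a-z0-9]', '_', name.lower())

def sanitize_like_sync(name: str) -> str:
    # Transliterate accents first, then collapse runs of separators into one underscore.
    ascii_name = strip_diacritics(name).lower()
    return re.sub(r'[^a-z0-9]+', '_', ascii_name).strip('_')

print(sanitize_like_ddl('BX Á Châu'))   # bx___ch_u  - extra underscores appear
print(sanitize_like_sync('BX Á Châu'))  # bx_a_chau  - accents transliterated, runs collapsed
The exact rules in my scripts were different, but the mechanism was the same: two code paths with two slightly different ideas of what "SQL-safe" means.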
I created explicit mappings to handle these discrepancies:
# file: data-pipeline/sync_production.py
self.column_mappings = {
    'inventory_movements': {
        'bx_a_chin': 'bx_a__chin',
        'bx_thoa': 'bx___thoa',  # Triple underscores!
        'loss_13pct_from_1_6_2025': 'loss_1_3pct_from_1_6_2025',
    }
}
Each mapping told a story about how Vietnamese business terminology was being transformed by my sanitization logic.
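Applying the mapping is then just a rename step on each transformed record before the INSERT is built. A minimal sketch (the helper name and class context are mine, not the project's actual API):
# Hypothetical helper - the method name and surrounding class are illustrative.
def apply_column_mappings(self, table_name: str, record: dict) -> dict:
    mapping = self.column_mappings.get(table_name, {})
    # Rename any column with an explicit override; pass everything else through unchanged.
    return {mapping.get(column, column): value for column, value in record.items()}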
Problem #2: Data Quality and Numeric Overflows
After fixing the column name issues, I encountered a new class of errors:
python data-pipeline/sync_production.py
ERROR: numeric field overflow
DETAIL: A field with precision 5, scale 3 must round to an absolute value less than 10^2.
I investigated the actual data being synchronized and discovered something remarkable - percentage fields contained values like 9,472,482.51%, far beyond the 99.999 maximum that a numeric column with precision 5 and scale 3 can hold. These impossible values suggested that formulas or accumulated totals had been stored in percentage fields in AirTable.
My solution was to add data validation during the transformation phase:
# file: data-pipeline/sync_production.py
def transform_record(self, record: Dict, table_name: str) -> Dict:
    # Start from a copy of the AirTable fields (column renaming and other
    # transformation steps are omitted here for brevity).
    transformed = dict(record)

    if table_name == 'inventory_movements':
        pct_fields = ['fat_pct', 'moisture_pct']

        for field in pct_fields:
            if field in transformed and transformed[field] is not None:
                value = transformed[field]
                if isinstance(value, (int, float)):
                    # Values over 100 might be whole numbers needing conversion
                    if value > 100:
                        transformed[field] = min(value / 100.0, 99.999)
                    elif value > 1:
                        transformed[field] = value / 100.0

                    # Cap at the database maximum for NUMERIC(5,3)
                    if transformed[field] > 99.999:
                        logger.warning(f"Value {value} for {field} exceeds max, capping at 99.999")
                        transformed[field] = 99.999

    return transformed
The validation logic revealed data quality issues I hadn't anticipated - rice moisture content recorded as 59,577% when it should have been 59.577%.
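To make the behaviour concrete, here is a worked trace of the rule above on that moisture reading:
# Worked trace of the validation above on the 59,577% moisture reading.
value = 59577                         # impossible raw value coming from AirTable
scaled = min(value / 100.0, 99.999)   # value > 100, so divide by 100 -> 595.77, then cap
print(scaled)                         # 99.999 - clamped to the NUMERIC(5,3) maximum
The bad value still needs fixing at the source in AirTable; the pipeline's job here is simply to keep the load from crashing.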
Problem #3: SQL Identifier Compliance
The third major challenge emerged with the finished_goods table:
python data-pipeline/sync_ultimate.py
ERROR: trailing junk after numeric literal at or near "16h30_19h"
LINE 3: ...st_airtable_modified,ngy_nhp,ldh,my,sn_xut_tng_gi,16h30_19h,...
This error taught me a fundamental SQL rule - unquoted identifiers cannot start with a digit. The field "Đêm 16h30-19h" (the 4:30-7:00 PM night shift) had been stripped of its Vietnamese prefix during sanitization, leaving just "16h30_19h", which PostgreSQL couldn't parse.
I checked how the DDL generator had handled these fields:
docker exec rice_market_postgres psql -U rice_admin -d rice_market_db \
  -c "SELECT column_name FROM information_schema.columns 
      WHERE table_schema = 'airtable_sync' 
      AND table_name = 'finished_goods' 
      AND column_name LIKE '%h%' 
      ORDER BY column_name LIMIT 10"
column_name
-----------
n_16h30_19h
n_19h_7h
The DDL generator had been smart enough to prefix these fields with 'n_' (from "Đêm"/night). I updated my sanitization logic to match:
# file: data-pipeline/sync_ultimate.py
def _sanitize_column_name(self, name: str) -> str:
    safe = super()._sanitize_column_name(name)
    
    # Check if result starts with a digit
    if safe and safe[0].isdigit():
        if 'h' in safe and any(c.isdigit() for c in safe):
            safe = 'n_' + safe  # Time-based fields get 'n_' prefix
        else:
            safe = 'f_' + safe  # Other numeric fields get 'f_' prefix
    
    return safe
The Systematic Solution: Discovering All Mismatches
After fixing issues one by one through three iterations, I realized I needed a more systematic approach. I wrote a discovery script to find all column name mismatches at once:
# file: data-pipeline/discover_mappings.py
def discover_all_mappings(config: SyncConfig) -> Dict[str, Dict[str, str]]:
    # NOTE: cursor, the AirTable client, the table list, and transform_record
    # are set up elsewhere in the script; only the comparison logic is shown here.
    mappings = {}  # {table_name: {airtable_column: postgres_column}}

    for table_name in tables:
        # Get PostgreSQL columns
        cursor.execute("""
            SELECT column_name
            FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
        """, (config.postgres_schema, table_name))
        pg_columns = {row[0] for row in cursor.fetchall()}

        # Get AirTable columns from a sample record
        records = airtable.fetch_table_records(table_name)[:1]
        if records:
            sample_record = transform_record(records[0], table_name)
            airtable_columns = set(sample_record.keys())

            # Find columns that exist in the AirTable output but not in PostgreSQL
            mismatches = {}
            for at_col in airtable_columns:
                if at_col not in pg_columns:
                    # Find the best match ignoring underscores
                    for pg_col in pg_columns:
                        if at_col.replace('_', '') == pg_col.replace('_', ''):
                            mismatches[at_col] = pg_col
                            break
            if mismatches:
                mappings[table_name] = mismatches

    return mappings
Running this discovery revealed the complete scope of naming inconsistencies - over 20 mappings needed across the eight tables.
Verification: The Sweet Success
After implementing all the fixes, I ran the final synchronization:
python data-pipeline/sync_complete_final.py
======================================================================
 RICE MARKET DATA SYNCHRONIZATION SYSTEM
 Complete Production System with All Mappings
======================================================================
 Source: AirTable Base: appmeTyHLozoqighD
 PostgreSQL: rice_market_db @ localhost:5433
 Sync Mode: INCREMENTAL
======================================================================
Table: finished_goods
  Status: success
  Records Processed: 1690
  Inserted: 1690
  Updated: 0
  Duration: 7.97 seconds
============================================================
SUMMARY
Total Records Processed: 1690
Total Inserted: 1690
Total Updated: 0
Total Duration: 11.03 seconds
============================================================
 SUCCESS - All tables synchronized!
 Total records: 1,690
 Processing rate: 153 records/second
The successful synchronization of all 13,638 records across eight tables marked the end of my debugging journey. The system now handled Vietnamese characters, validated impossible data values, and correctly mapped all column names between the two systems.
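For context, the INCREMENTAL mode shown above depends on an upsert, so that re-synchronized records update existing rows instead of duplicating them. A minimal sketch of that pattern, assuming psycopg2 and an airtable_record_id unique key (the project's actual key column, batching, and error handling may differ):
# A minimal upsert sketch - assumes a psycopg2 connection and a unique
# "airtable_record_id" column; details of the real pipeline may differ.
from psycopg2.extras import execute_values

def upsert_records(conn, schema: str, table: str, columns: list, rows: list) -> None:
    col_list = ', '.join(columns)
    # On conflict, overwrite every column except the key with the incoming value.
    updates = ', '.join(f"{c} = EXCLUDED.{c}" for c in columns if c != 'airtable_record_id')
    query = (
        f"INSERT INTO {schema}.{table} ({col_list}) VALUES %s "
        f"ON CONFLICT (airtable_record_id) DO UPDATE SET {updates}"
    )
    with conn.cursor() as cursor:
        # execute_values expands the single %s placeholder into one tuple per row.
        execute_values(cursor, query, rows)
    conn.commit()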
Reflections on the Journey
This debugging experience taught me that production-ready code isn't about perfect initial design - it's about building systems that can be debugged, understood, and maintained when things inevitably go wrong. My initial script assumed clean data and consistent naming. Reality introduced Vietnamese characters creating invalid SQL identifiers, percentage fields containing values in the millions, and subtle differences in how names were sanitized at different times.
The most valuable lesson I learned was the importance of systematic investigation. Rather than making assumptions about why something failed, I used SQL queries to examine the actual database state, inspected the real data being processed, and traced through the transformation logic step by step. Each error message was a clue leading to deeper understanding of how my data flowed through the system.
I also learned that explicit is often better than implicit. While automatic schema discovery and fuzzy matching seemed elegant in theory, maintaining explicit mappings for known discrepancies made my system's behavior predictable and debuggable. The column_mappings dictionary became a living document of all the edge cases I'd discovered.
Understanding the business context proved essential. Those BX columns weren't random abbreviations - they represented actual warehouse locations in a Vietnamese rice distribution network. The shift times like "16h30-19h" tracked production across a 24-hour operation. This context helped me make better technical decisions about how to handle the data.
Technical Skills
This debugging journey reinforced several skills that I believe are valuable in any production environment:
Methodical Problem-Solving: We didn't guess at solutions. We investigated systematically using SQL queries to understand the actual state versus expected state.
Minimal Surgical Fixes: Rather than rewriting large portions of code, we identified specific points where small changes would resolve issues.
Data Validation Implementation: We added defensive programming practices to handle unexpected data gracefully rather than failing catastrophically.
Cross-System Integration: We successfully bridged two fundamentally different data paradigms - AirTable's flexibility and PostgreSQL's strictness.
Documentation Through Code: Our mappings and validations serve as living documentation of business rules and data quirks.
Conclusion
The final synchronization system we built is robust not because it was perfectly designed from the start, but because it evolved through contact with real data. It handles character encoding issues, validates numeric ranges, maps column names correctly, and adapts to the quirks of both systems. This, I've learned, is what production-ready code looks like: not perfect, but resilient and well understood because of the debugging journey behind it.
The data pipeline now successfully synchronizes 13,638 records across eight tables, processing at 153 records per second while handling all the complexities of Vietnamese business data. More importantly, when the next unexpected issue arises - and it will - the system has the logging, validation, and structure needed to debug and resolve it efficiently.
Appendix: Essential CLI Commands for Data Pipeline Debugging
Here's my reference collection of commands that proved invaluable during this debugging journey:
Database Schema Investigation
# List all columns for a specific table
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT column_name, data_type FROM information_schema.columns 
      WHERE table_schema = '[schema]' AND table_name = '[table]'"
# Find columns matching a pattern
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT column_name FROM information_schema.columns 
      WHERE table_schema = '[schema]' 
      AND table_name = '[table]' 
      AND column_name LIKE '[pattern]%'"
# Check table constraints
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT constraint_name, constraint_type 
      FROM information_schema.table_constraints 
      WHERE table_name = '[table]'"
Data Inspection
# Sample data from a table
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT * FROM [schema].[table] LIMIT 5"
# Check for data anomalies
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT [column], COUNT(*) FROM [schema].[table] 
      WHERE [column] > 100 GROUP BY [column]"
# Verify record counts
docker exec [container] psql -U [user] -d [database] \
  -c "SELECT COUNT(*) FROM [schema].[table]"
Git and Version Control
# Check branch status with tracking info
git branch -vv | grep -E "(feature/|^\*)"
# View the 10 most recent commits, one line each
git log --oneline -n 10
# Check what files changed
git status --short
File and Directory Management
# Find recently modified files
find . -type f -mtime -1 -ls
# Check file contents safely
head -20 [filename]  # First 20 lines
tail -20 [filename]  # Last 20 lines
# Search for patterns in files
grep -r "pattern" --include="*.py" .
Process Monitoring
# Watch log files in real-time
tail -f logs/sync.log
# Check Python process memory usage
ps aux | grep python
# Monitor directory for changes
watch -n 2 'ls -la data-pipeline/logs/'
These commands formed my debugging toolkit, allowing me to investigate issues systematically rather than relying on guesswork. Each command served a specific purpose in understanding the state of my system and tracking down the root causes of failures.
The debugging details above relate to Section 2.6 — Data Pipeline of the Rice Market AI System — Natural Language SQL, RAG, and Time-Series Forecasting Project: Project Site | GitHub
If you enjoyed this article, you can also find it published on LinkedIn and Medium.