My Journey Debugging Docker Networking Issues in a Production ERP System

When I first encountered the dreaded "Network Error" and "ERR_CONNECTION_REFUSED" messages in my PCVN ERP system, I knew I was in for a challenging debugging session. What started as a simple connectivity issue turned into a comprehensive investigation that taught me valuable lessons about Docker networking, nginx configuration, and the importance of systematic debugging.

The Initial Problem

My React frontend, served on localhost:3003, was completely unable to communicate with the Rails backend. The browser console was filled with connection-refused errors, and users couldn't log in. From a Docker perspective the system appeared to be running fine - every container was up, and all but the frontend reported as "healthy" - but the application was essentially non-functional.

// Browser console errors
Failed to load resource: net::ERR_CONNECTION_REFUSED
http://localhost:3000/api/v1/auth/login

My Investigation Approach

I started with the basics, systematically checking each layer of the application stack:

1. Verifying Docker Container Status

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

This revealed all six services were running:

  • pcvn-erp-frontend-prod: Port 3003 → 80 (unhealthy)
  • pcvn-erp-backend-prod: Port 3002 → 3002 (healthy)
  • pcvn-erp-nginx-prod: Port 8080 → 80 (healthy)
  • pcvn-erp-db-prod: PostgreSQL on 5432 (healthy)
  • pcvn-erp-redis-prod: Redis on 6379 (healthy)
  • pcvn-erp-sidekiq-prod: Background jobs (healthy)
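
The frontend's "unhealthy" flag stood out even though the container was running, so pulling the underlying healthcheck result is a sensible next step. This is a generic check, not specific to this stack:

docker inspect --format '{{.Name}}: {{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' $(docker ps -q)
# For the failing one, the last few probe results usually name the exact cause
docker inspect --format '{{json .State.Health.Log}}' pcvn-erp-frontend-prod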

2. Testing Container Networking

docker exec pcvn-erp-nginx-prod ping -c 2 backend
# PING backend (172.20.0.5): 56 data bytes
# 64 bytes from 172.20.0.5: seq=0 ttl=64 time=0.664 ms

The containers could communicate internally - this was promising.
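
Internal connectivity was fine, but the browser error was against a host port, so it is also worth listing what each container actually publishes to the host. Per the table above, nothing publishes host port 3000 at all, which by itself would explain a browser-side ERR_CONNECTION_REFUSED on localhost:3000:

docker port pcvn-erp-backend-prod    # 3002/tcp -> host 3002
docker port pcvn-erp-nginx-prod      # 80/tcp   -> host 8080
docker port pcvn-erp-frontend-prod   # 80/tcp   -> host 3003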

3. Checking Backend Process Status

docker exec pcvn-erp-backend-prod ps aux | grep -E "puma|rails"
# rails  1  0.7  1.3 241808 110516 ?  Ssl  puma 6.6.0 (tcp://0.0.0.0:3002)
# rails 18  0.0  1.2 297140 103856 ?  Sl   puma: cluster worker 0
# rails 20  0.0  1.3 297140 104396 ?  Sl   puma: cluster worker 1

The Rails backend was definitely running and listening on port 3002.

The Root Causes I Discovered

Through methodical investigation, I uncovered multiple interconnected issues:

Issue 1: Frontend Hardcoded to Wrong API URL

cat ./erp-frontend/.env.local
# VITE_API_URL=http://localhost:3002/api/v1

The frontend was built with localhost URLs baked into the JavaScript bundles. Inside a Docker container, "localhost" refers to the container itself, not the host machine or other containers.
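
The lasting fix is to stop baking an absolute localhost URL into the bundle. Here is a minimal sketch of what I mean, assuming a docker compose setup with a service named frontend and a Vite production env file (both assumptions, not taken from this project's actual files):

# Use a relative API base so whichever nginx serves the app can route /api/v1
echo "VITE_API_URL=/api/v1" > ./erp-frontend/.env.production
# Vite resolves env values at build time, so the image has to be rebuilt for the change to land
docker compose build frontend && docker compose up -d frontend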

Issue 2: Missing Nginx Configuration Include

docker exec pcvn-erp-nginx-prod grep -n "include /etc/nginx/conf.d" /etc/nginx/nginx.conf
# (empty output - no include directive found!)

The main nginx.conf file wasn't loading additional configurations from /etc/nginx/conf.d/, meaning my carefully crafted API proxy rules were being completely ignored.
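
For reference, a stock nginx.conf ships with include /etc/nginx/conf.d/*.conf; inside its http { } block, and the dumped configuration is the fastest way to see which files are actually being read rather than merely present on disk:

# Every file nginx loads appears as a "# configuration file ..." header in the -T dump
docker exec pcvn-erp-nginx-prod nginx -T 2>/dev/null | grep "configuration file"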

Issue 3: Incorrect Upstream Port Definition

docker exec pcvn-erp-nginx-prod nginx -T 2>/dev/null | grep -A 2 "upstream backend"
# upstream backend {
#     least_conn;
#     server backend:3000 max_fails=3 fail_timeout=30s;

The nginx upstream was pointing to port 3000, but Rails was actually running on port 3002.
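
Before changing anything, it was worth confirming the mismatch from the proxy's point of view with a plain request to each port; the exact response doesn't matter, only whether the connection is accepted:

docker exec pcvn-erp-nginx-prod wget -O- --timeout=2 http://backend:3000/   # connection refused expected
docker exec pcvn-erp-nginx-prod wget -O- --timeout=2 http://backend:3002/   # any HTTP response expected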

My Solution Strategy

I adopted a surgical approach, making minimal changes to fix each issue:

Step 1: Create Proper Nginx Proxy Configuration

I created a comprehensive nginx configuration that would properly route API requests:

# nginx-api-proxy-fixed.conf
server {
    listen 80;
    server_name localhost;
    
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }
    
    # Critical fix: Use full container name to bypass upstream definition
    location /api/v1/ {
        proxy_pass http://pcvn-erp-backend-prod:3002/api/v1/;
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
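
One detail worth noting in that block: because the proxy_pass URL carries a URI (/api/v1/), nginx substitutes it for the matched location prefix, so /api/v1/auth/login is forwarded to Rails unchanged; omitting the path there would alter what the backend receives. Once the block is live (Step 3 below), the routing can be exercised from inside the proxy itself:

docker exec pcvn-erp-nginx-prod wget -O- --timeout=5 http://localhost/api/v1/auth/login
# A 404 for this GET is fine; it proves the location block matched and the request reached Rails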

Step 2: Fix Nginx Configuration Loading

I discovered the main nginx.conf wasn't including files from conf.d:

# Check the structure of nginx.conf
sed -n '288,291p' nginx.conf.current | cat -n
#      1              return 200 "OK";
#      2              add_header Content-Type text/plain;
#      3          }
#      4      }

# Add include directive at the right position
head -n 290 nginx.conf.current > nginx.conf.complete
echo "    include /etc/nginx/conf.d/*.conf;" >> nginx.conf.complete
echo "}" >> nginx.conf.complete

Step 3: Apply Configuration to Running Container

# Copy configuration into container
docker cp nginx-api-proxy-fixed.conf pcvn-erp-nginx-prod:/etc/nginx/conf.d/api-proxy.conf
# Successfully copied 9.22kB

# Disable conflicting default configuration
docker exec pcvn-erp-nginx-prod mv /etc/nginx/conf.d/default.conf /etc/nginx/conf.d/default.conf.disabled

# Reload nginx
docker exec pcvn-erp-nginx-prod nginx -s reload
# 2025/08/17 05:52:30 [notice] 133#133: signal process started

Step 4: Handle File Lock Issues

When trying to update the main nginx.conf, I encountered persistent file lock issues:

docker cp nginx.conf.complete pcvn-erp-nginx-prod:/etc/nginx/nginx.conf
# Error response from daemon: unlinkat /etc/nginx/nginx.conf: device or resource busy

# Solution: Restart container to release locks
docker restart pcvn-erp-nginx-prod

# Then immediately copy the configuration
docker cp nginx.conf.complete pcvn-erp-nginx-prod:/etc/nginx/nginx.conf
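
One subtlety: the restart brought nginx up with the old file, and docker cp on its own never signals the running master process, so the freshly copied configuration still needs an explicit validate-and-reload to take effect:

docker exec pcvn-erp-nginx-prod nginx -t && docker exec pcvn-erp-nginx-prod nginx -s reload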

Verification and Testing

After implementing the fixes, I validated each change:

Testing Direct Backend Connectivity

docker exec pcvn-erp-nginx-prod wget -O- --timeout=5 http://backend:3002/api/v1/auth/login 2>&1
# Connecting to backend:3002 (172.20.0.5:3002)
# wget: server returned error: HTTP/1.1 404 Not Found

The 404 response was actually good news - it meant the connection was successful (login endpoints typically require POST, not GET).
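
To exercise the whole path (host → nginx → Rails) rather than just container-to-container reachability, a POST through nginx's published port (8080 here) is a useful follow-up; the credential fields below are placeholders:

curl -i -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"invalid"}'
# A JSON 401/422 from Rails confirms the proxy chain end to end, even with bad credentials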

Checking Rails Logs

docker logs pcvn-erp-backend-prod --tail 10
# [1] Puma starting in cluster mode...
# [1] * Listening on http://0.0.0.0:3002
# [1] - Worker 0 (PID: 18) booted in 0.0s, phase: 0
# [1] - Worker 1 (PID: 20) booted in 0.0s, phase: 0

Rails was confirmed to be listening on the correct port.

Key Lessons Learned

1. Docker Networking Fundamentals

I learned that "localhost" inside a Docker container refers to that container alone. Container-to-container communication must use service names or container names, which Docker's internal DNS resolves to the correct IP addresses.

2. Nginx Configuration Hierarchy

The presence of upstream definitions in nginx creates a namespace that takes precedence over DNS resolution. When I used proxy_pass http://backend:3002, nginx was using the upstream "backend" (pointing to port 3000) rather than resolving "backend" as a hostname.
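
In other words, the hostname in proxy_pass is matched against defined upstream names before any DNS lookup happens, which is why these two forms behaved so differently here; the dumped config makes the collision easy to spot:

# proxy_pass http://backend:3002;               -> matched the "backend" upstream (port 3000)
# proxy_pass http://pcvn-erp-backend-prod:3002; -> no matching upstream, resolved by Docker DNS
docker exec pcvn-erp-nginx-prod nginx -T 2>/dev/null | grep -A3 "upstream backend"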

3. File System Locks in Running Containers

Docker's overlay filesystem can create particularly stubborn file locks when processes have configuration files open. Sometimes a container restart is the only way to release these locks and update critical files.
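
A workaround I have since adopted for this case: overwrite the file in place through the container's shell instead of replacing it with docker cp (which tries to unlink the destination first), then validate and reload:

docker exec -i pcvn-erp-nginx-prod sh -c 'cat > /etc/nginx/nginx.conf' < nginx.conf.complete
docker exec pcvn-erp-nginx-prod nginx -t && docker exec pcvn-erp-nginx-prod nginx -s reload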

4. The Importance of Include Directives

A missing include /etc/nginx/conf.d/*.conf directive meant hours of configuration work was being completely ignored. Always verify that configuration files are actually being loaded, not just present in the filesystem.

The Professional Impact

This debugging experience reinforced my belief in methodical problem-solving. Each issue was isolated, diagnosed with specific commands, and resolved with minimal changes. The systematic approach I developed included:

  1. Verify the current state - Never assume, always check
  2. Test at each layer - Network, container, application, configuration
  3. Make surgical changes - One fix at a time, verify each works
  4. Document findings - Every command and output teaches something

The experience highlighted how modern containerized applications require understanding of multiple abstraction layers - from Docker networking to nginx configuration hierarchies to build-time environment variable injection. Mastering these intersections proved crucial for maintaining production systems.

Final Reflection

What started as a frustrating connection error became an intensive learning experience in Docker networking and nginx configuration. Through systematic debugging, I not only solved the immediate problem but gained deep insights into how containerized services communicate.

The investigation taught me that in production systems, the obvious error (connection refused) might be several layers removed from the actual cause (a missing include directive in nginx.conf). My methodical approach - testing each component in isolation before examining their interactions - proved invaluable in untangling this complex web of issues.

Most importantly, I learned that persistence and systematic thinking can overcome even the most stubborn technical challenges. Each "failed" attempt taught me something new about the system, ultimately leading to a complete understanding and resolution.


This debugging session was part of developing a full-stack ERP system for a 450+ employee manufacturing company, demonstrating problem-solving and systematic debugging methodologies in production environments.


If you enjoyed this article, you can also find it published on LinkedIn and Medium.