Docker Stack Recovery: A Troubleshooting Journey
The Challenge
I discovered a partially failed Docker deployment with critical services down. The frontend container was trapped in a restart loop while the backend had mysteriously stopped running. This guide documents my systematic approach to diagnosing and recovering the entire stack to full operational status.
Initial Assessment
When checking the container status, I found a concerning situation that required immediate attention:
$ docker ps -a
Output revealed:
- Frontend: Continuously restarting every 50 seconds
- Backend: Exited 11 hours ago with code 127
- PostgreSQL: Running healthy on port 5433
- Redis: Running healthy on port 6379
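A trimmed view makes this kind of status snapshot easier to scan; docker ps accepts a Go-template format string, so you can limit the output to names, status, and ports:
$ docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"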
Step 1: Diagnosing the Frontend Restart Loop
I investigated why the frontend kept crashing by examining its logs:
$ docker logs pcvn-erp-frontend --tail 20
Output:
[emerg] host not found in upstream "backend" in /etc/nginx/nginx.conf:56
The nginx configuration couldn't resolve the "backend" hostname because the backend container was down.
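To confirm this was a name-resolution problem rather than a typo in the config, you can check whether "backend" resolves on the stack's network. Docker's embedded DNS only answers for running containers, so the lookup fails while the backend is down. The network name below is a guess based on typical Compose naming; substitute your own from docker network ls:
$ docker network ls
$ docker run --rm --network pcvn-erp_default busybox nslookup backend   # network name is a placeholder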
Step 2: Investigating Backend Failure
Exit code 127 typically indicates "command not found." I checked the backend logs:
$ docker logs pcvn-erp-backend --tail 20
Output:
ActionController::RoutingError (No route matches [POST] "/api/apm/transactions")
{"level":"INFO","msg":"Request","path":"/api/apm/transactions","status":404}
The Rails application had been running successfully earlier, just returning 404s for monitoring routes it doesn't define; nothing in the logs pointed to an application crash.
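Because the logs only show what the application was doing, not why the process exited, a useful cross-check is to ask Docker for the container's recorded state, which includes the exit code, whether it was OOM-killed, and when it finished:
$ docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.FinishedAt}}' pcvn-erp-backend
With no crash in sight, the sensible next move was simply to start the container again.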
Step 3: Reviving the Backend
I restarted the stopped backend container:
$ docker start pcvn-erp-backend
$ docker ps | grep backend
Result: Backend started successfully and showed healthy status on port 3000.
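Since the image defines a health check (docker ps reports "healthy"), the health state can also be queried directly instead of grepping the process list:
$ docker inspect --format '{{.State.Health.Status}}' pcvn-erp-backend
Output: healthy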
Step 4: Fixing the Frontend
With the backend running, I restarted the frontend to resolve the nginx upstream error:
$ docker restart pcvn-erp-frontend
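To make sure the restart actually cleared the [emerg] error instead of kicking off another loop, it's worth tailing the fresh log output for a few seconds:
$ docker logs -f --since 1m pcvn-erp-frontend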
Step 5: Verifying Full Recovery
I confirmed all containers were healthy:
$ docker ps
All services showing healthy:
- Frontend: 0.0.0.0:8080->80/tcp
- Backend: 0.0.0.0:3000->80/tcp
- PostgreSQL: 0.0.0.0:5433->5432/tcp
- Redis: 6379/tcp
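The same verification can be expressed as a filter; if all four container names come back, the stack is green (this relies on every container defining a health check, which the docker ps output above suggests is the case):
$ docker ps --filter "health=healthy" --format '{{.Names}}'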
Step 6: Testing Application Accessibility
I verified the frontend was responding:
$ curl -I http://localhost:8080
Output: HTTP/1.1 200 OK
I tested the backend health endpoint:
$ curl -I http://localhost:3000/up
Output: HTTP/1.1 200 OK
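The 200 from port 8080 only proves that nginx is serving again; it doesn't exercise the proxy hop to the backend that failed in the first place. Requesting any API route through the frontend covers that path; /api/status below is a placeholder, so substitute a route your backend actually serves. Even a 404 from Rails confirms the upstream is reachable, whereas a 502 from nginx would mean it still isn't:
$ curl -i http://localhost:8080/api/status   # /api/status is a placeholder route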
Key Learnings
Container Dependencies Matter
The frontend's nginx configuration depended on the backend being available. When the backend stopped, the frontend couldn't resolve the hostname and entered a restart loop.
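If the stack is defined in a Compose file (this recovery used only raw docker commands, so treat this as an aside), that dependency can be made explicit with depends_on and condition: service_healthy, and a single command will start everything in order and block until the health checks pass:
$ docker compose up -d --wait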
Exit Codes Tell Stories
- Exit 0: Clean shutdown
- Exit 127: Command not found
- Exit 1: General errors
Docker DNS Resolution
Containers communicate using service names through Docker's internal DNS. When a container stops, its hostname becomes unresolvable.
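You can see the embedded DNS server that provides this from inside any container on a user-defined network (the default for Compose projects); it shows up as 127.0.0.11 in resolv.conf:
$ docker exec pcvn-erp-backend cat /etc/resolv.conf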
Health Checks Are Essential
Health status indicators helped me quickly identify which services were truly operational versus just "running."
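If an image doesn't ship its own health check, one can be attached at run time. This is only a sketch: the image name, internal port, path, and the presence of curl inside the image are all assumptions to adapt to your setup:
$ docker run -d --name backend-demo \
    --health-cmd "curl -fsS http://localhost/up || exit 1" \
    --health-interval 30s --health-timeout 5s --health-retries 3 \
    your-backend-image   # image name, port, and path are placeholders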
Recovery Checklist
When encountering Docker deployment failures:
1. Assess the situation: docker ps -a
2. Check the logs of failed containers: docker logs [container-name] --tail 50
3. Identify dependencies: map which containers depend on others
4. Start stopped containers: docker start [container-name]
5. Restart dependent containers: docker restart [dependent-container]
6. Verify health status: docker ps
7. Test connectivity: curl -I http://localhost:[port]/health
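For unattended checks, the last two steps can be combined into a small wait loop; the container name and endpoint below match this stack, so adjust them for yours:
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' pcvn-erp-backend)" = "healthy" ]; do sleep 5; done
$ curl -I http://localhost:3000/up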
Conclusion
What appeared as a complex deployment failure was actually a simple cascade effect. The backend stopped, causing the frontend to lose its upstream connection and enter a restart loop. By methodically checking each component and understanding the dependency chain, I recovered the entire stack without data loss or configuration changes.
This experience reinforced the importance of understanding container orchestration, service dependencies, and systematic troubleshooting. Sometimes the solution is as simple as starting a stopped container and letting the system heal itself.
Tech Stack: Docker, nginx, Ruby on Rails, PostgreSQL, Redis
Environment: WSL 2 Ubuntu on Windows 11
If you enjoyed this article, you can also find it published on LinkedIn and Medium.