Debugging Docker Build Failures in CI/CD: A Real-World Case Study

When I encountered multiple failing workflows in my GitHub Actions pipeline, I discovered that what appeared to be TypeScript compilation errors were actually masking a more fundamental Docker configuration issue. This case study demonstrates how I systematically debugged and resolved a Docker build failure that was preventing my CI/CD pipeline from functioning.

The Initial Problem

My GitHub Actions workflows were failing consistently, showing numerous red X marks across different workflow runs. The initial assumption was that TypeScript compilation errors were the culprit, as suggested by the commit messages attempting to fix these issues. However, the real problem lay deeper in the Docker build process.

Discovery Process

Step 1: Verifying Repository State

I first checked my local repository status to understand what had been committed and what was pending:

git status
git branch -vv
git log --oneline -5

This revealed that my local develop and main branches were synchronized with their remote counterparts at commit 477cefc, which claimed to fix TypeScript compilation errors. However, the GitHub Actions were still failing on this exact commit.
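The same sync check can be scripted so it doesn't depend on eyeballing branch output. This is a sketch, assuming a remote named origin and a local branch of the same name; the helper name is mine, not part of the project:

```shell
#!/bin/sh
# Sketch: succeed only when a local branch has no commits ahead of or
# behind its origin counterpart.
branch_in_sync() {
    branch="$1"
    # --left-right --count prints "<ahead><TAB><behind>"
    counts=$(git rev-list --left-right --count \
        "$branch...origin/$branch" 2>/dev/null) || return 1
    set -- $counts
    [ "$1" -eq 0 ] && [ "$2" -eq 0 ]
}
```

Running git fetch origin first keeps the remote-tracking ref current; branch_in_sync develop then exits 0 only when both sides match.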

Step 2: Investigating the Actual Failure

Rather than assuming the problem based on commit messages, I used the GitHub CLI to fetch the actual error logs:

gh auth status  # Verify GitHub CLI authentication
gh run list --workflow="Basic CI" --limit=1 --json status,conclusion,databaseId,headBranch,event,displayTitle
gh run view $(gh run list --workflow="Basic CI" --limit=1 --json databaseId --jq '.[0].databaseId') --log-failed

This approach revealed the true error:

#20 ERROR: process "/bin/sh -c addgroup -g 101 -S nginx && adduser -S -D -H -u 101 -h /var/cache/nginx -s /sbin/nologin -G nginx -g nginx nginx" did not complete successfully: exit code: 1
addgroup: group 'nginx' in use
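The two gh invocations above can be folded into a small reusable helper. A sketch, assuming the GitHub CLI is installed and authenticated; the function name is mine, and the workflow name is whatever your pipeline defines:

```shell
#!/bin/sh
# Sketch: print the failed-step logs from the most recent run of a
# named workflow, using only documented gh subcommands.
latest_failed_logs() {
    workflow="$1"
    run_id=$(gh run list --workflow="$workflow" --limit=1 \
        --json databaseId --jq '.[0].databaseId') || return 1
    gh run view "$run_id" --log-failed
}
```

latest_failed_logs "Basic CI" reproduces the one-liner above in a single call.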

Understanding the Root Cause

The error message "addgroup: group 'nginx' in use" indicated that the nginx:alpine base image already contained an nginx user and group. My Dockerfile was attempting to create these again, causing the build to fail. This is a common issue when base images are updated over time but Dockerfiles aren't adjusted accordingly.
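One way to catch this kind of collision before it breaks a build is to ask the base image directly which accounts it already ships. A sketch, assuming Docker is available locally (the helper name is hypothetical):

```shell
#!/bin/sh
# Sketch: print any /etc/passwd and /etc/group entries inside a base
# image that match a given account name.
image_has_account() {
    image="$1"; name="$2"
    docker run --rm "$image" sh -c "grep \"^$name:\" /etc/passwd /etc/group"
}
```

Against nginx:alpine, image_has_account nginx:alpine nginx shows the pre-existing user and group lines, which is exactly why the RUN step failed.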

Examining the Problematic Code

I examined the failing section of the Dockerfile:

grep -B 5 -A 5 "addgroup" erp-frontend/Dockerfile

This revealed:

# Create non-root user
RUN addgroup -g 101 -S nginx && \
    adduser -S -D -H -u 101 -h /var/cache/nginx -s /sbin/nologin -G nginx -g nginx nginx

The Solution: Defensive Programming

I applied a defensive programming approach similar to checking if a file exists before creating it. The solution was to verify if the nginx user exists before attempting to create it:

# Check if the user exists before creating it -- the same idea as
# checking if a directory exists before mkdir in Node.js
RUN id -u nginx >/dev/null 2>&1 || adduser -S -D -H -u 101 -h /var/cache/nginx -s /sbin/nologin -G nginx -g nginx nginx

This pattern uses:

  • id -u nginx to check whether the user exists
  • >/dev/null 2>&1 to suppress both stdout and stderr (the bash shorthand &>/dev/null is avoided because RUN executes under /bin/sh, which may not support it)
  • || (logical OR) so adduser runs only when the check fails
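The same short-circuit guard works in any POSIX shell, not only inside a Dockerfile. A minimal sketch, with echo standing in for adduser so it can run anywhere without privileges:

```shell
#!/bin/sh
# Sketch of the create-only-if-missing guard: the command after || runs
# only when the id lookup fails, i.e. when the user does not exist.
ensure_user() {
    id -u "$1" >/dev/null 2>&1 || echo "would create user $1"
}
```

ensure_user root prints nothing on a typical system, while a name that doesn't exist triggers the creation branch.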

Testing the Fix

Before pushing to CI/CD, I tested the fix locally:

cd erp-frontend && docker build -t test-build . && cd ..

The build completed successfully in 67 seconds, confirming the fix worked.
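Building is only half the verification; it is also worth confirming that the account the guard protects actually exists inside the finished image. A sketch, reusing the test-build tag from the command above and assuming Docker is available:

```shell
#!/bin/sh
# Sketch: succeed only if the given image contains an nginx user.
image_has_nginx_user() {
    docker run --rm "$1" id -u nginx >/dev/null 2>&1
}
```

image_has_nginx_user test-build && echo "nginx user present" gives a quick yes/no check before pushing.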

Deployment and Results

I committed and pushed the fix with a descriptive message:

git add erp-frontend/Dockerfile
git commit -m "fix: resolve Docker build failure in CI/CD pipeline

The nginx:alpine base image already includes nginx user/group, causing
our RUN addgroup/adduser commands to fail with 'group nginx in use' error.

Implemented defensive check using 'id -u nginx' before user creation,
similar to checking file existence before creation in Node.js.
This ensures compatibility with both older and newer nginx:alpine versions."

git push origin develop

Monitoring the Outcome

After pushing, I monitored the workflow results:

gh run list --limit=5

The results showed mixed outcomes, indicating that while the Docker build issue was resolved, other problems remained in the pipeline. This is typical in complex CI/CD systems where fixing one issue often reveals others.
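For later pushes, gh can also follow a run live rather than polling gh run list repeatedly. A sketch, again assuming an authenticated GitHub CLI (the wrapper name is mine):

```shell
#!/bin/sh
# Sketch: attach to the most recently started run and stream its
# progress; gh run watch exits when the run completes.
watch_latest_run() {
    gh run watch "$(gh run list --limit=1 \
        --json databaseId --jq '.[0].databaseId')"
}
```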

Key Takeaways

  1. Don't assume the cause of an error - The commit messages suggested TypeScript issues, but the actual problem was in the Docker configuration.

  2. Use proper debugging tools - The GitHub CLI's ability to fetch actual error logs was crucial in identifying the real issue.

  3. Apply defensive programming - Just as we check for file existence in application code, we should apply similar patterns in Docker configurations.

  4. Test locally first - Building the Docker image locally before pushing saved time and confirmed the fix.

  5. Incremental progress is normal - In complex CI/CD pipelines, fixing one issue often reveals others. Each fix brings you closer to a fully functional pipeline.

Conclusion

This debugging journey demonstrates that successful troubleshooting requires looking beyond surface-level symptoms to find root causes. By using the right tools to gather actual error data and applying defensive programming principles, I transformed a failing Docker build into a robust configuration that works across different environments. The fix not only solved the immediate problem but also made the Dockerfile more resilient to future base image updates.


If you enjoyed this article, you can also find it published on LinkedIn and Medium.