Fixing a Broken CI/CD Pipeline: From 24 Failures to Green Builds
The Problem
Imagine walking into work and discovering your CI/CD pipeline is showing red across the board. That's exactly what happened with our React/TypeScript enterprise application. Out of 24 GitHub Actions checks, only 3-5 were passing. That's a failure rate of roughly 79% to 87%, which is essentially a non-functional pipeline.
What made this situation particularly dangerous wasn't just the failures themselves, but how previous attempts to "fix" them had actually made things worse. Instead of addressing root causes, workarounds had been implemented that masked the real problems, creating a false sense of security.
The Investigation Process
When facing a complex problem like this, systematic investigation is crucial. Here's the discovery process I followed to understand what was really happening:
# 1. Examine GitHub workflows
ls -la .github/workflows/
# 2. Identify error masking patterns
grep -n -e "continue-on-error" -e "|| echo" -e "|| true" .github/workflows/*.yml
# 3. Check for existing error logs
ls -la erp-frontend/*.txt
# 4. Analyze TypeScript errors
head -30 erp-frontend/typecheck-errors.txt
# 5. Review lint issues
head -30 erp-frontend/lint-output.txt
# 6. Audit security vulnerabilities (on npm 8+, --omit=dev replaces the deprecated --production)
npm audit --production
Each command revealed a piece of the puzzle. The workflow examination showed multiple CI configurations, some conflicting with others. The grep search exposed a troubling pattern – failures were being systematically hidden rather than fixed.
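The grep step above can also be written as a small reusable check. The following is a minimal sketch, not the script used in this project: the names findMaskedFailures, MaskedLine, and MASKING_PATTERNS are hypothetical, and it scans plain text line by line, so commented-out lines would be flagged too.

```typescript
// Hypothetical helper: flags workflow lines that hide a failing exit status.
interface MaskedLine {
  line: number;    // 1-based line number within the workflow text
  pattern: string; // which masking pattern matched
  text: string;    // the offending line, trimmed
}

// The same three patterns the grep command searches for.
const MASKING_PATTERNS: ReadonlyArray<[string, RegExp]> = [
  ["continue-on-error", /continue-on-error:\s*true/],
  ["|| echo", /\|\|\s*echo\b/],
  ["|| true", /\|\|\s*true\b/],
];

function findMaskedFailures(workflowText: string): MaskedLine[] {
  const hits: MaskedLine[] = [];
  workflowText.split(/\r?\n/).forEach((raw, index) => {
    for (const [pattern, regex] of MASKING_PATTERNS) {
      if (regex.test(raw)) {
        hits.push({ line: index + 1, pattern, text: raw.trim() });
      }
    }
  });
  return hits;
}

// Illustrative workflow fragment, not copied from the real repository.
const sample = [
  "steps:",
  '  - run: npm run lint || echo "Linting completed with warnings"',
  "  - run: npm run typecheck",
  "    continue-on-error: true",
].join("\n");

console.log(findMaskedFailures(sample));
```

Returning the line number along with the matched pattern keeps the output close to what grep -n prints, which makes it easy to cross-check one against the other.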
Root Causes Identified
1. Masked Failures in CI
The most alarming discovery was how extensively failures were being hidden:
# Bad: hiding failures
- run: npm run lint || echo "Linting completed with warnings"
- run: npm run typecheck || echo "Type checking completed"
  continue-on-error: true
This pattern is like putting tape over your car's check engine light. The problems don't go away – you just can't see them anymore. Every || echo statement and continue-on-error: true flag was allowing broken code to appear successful.
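To see why a step with || echo always looks green, it helps to model how a shell combines exit statuses. This is a toy model only; orExitStatus and ECHO_STATUS are hypothetical names, and the sketch does not run any real commands.

```typescript
// Toy model of how a shell combines exit statuses for `first || fallback`.
// Exit status 0 means success; anything else means failure.
function orExitStatus(first: number, fallback: number): number {
  // The fallback only runs when the first command fails,
  // and its status then becomes the status of the whole line.
  return first === 0 ? 0 : fallback;
}

const ECHO_STATUS = 0; // assumption: `echo` succeeds, as it does in practice

const lintFailed = 1; // e.g. a linter exiting non-zero after finding errors
console.log(orExitStatus(lintFailed, ECHO_STATUS)); // 0: the step looks green
console.log(lintFailed);                            // 1: what CI should have seen
```

Because CI only sees the status of the whole line, the failure never reaches the runner.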
2. Outdated Test Dependencies
The testing framework had fallen behind:
"@testing-library/user-event": "^13.5.0" // Missing setup() method
Version 13 predates the userEvent.setup() API, so the tests were stuck on older patterns. Version 14 introduced breaking changes: its interactions are asynchronous, and calling setup() is the recommended way to simulate events. Without updating, tests would either fail or, worse, pass incorrectly.
3. TypeScript Configuration Issues
The TypeScript errors file was a staggering 20KB of compiler output, with errors including:
- Missing type annotations on function parameters
- Implicit any types throughout the codebase
- Incomplete mock implementations in tests
- Mismatched prop types in React components
These weren't just pedantic compiler complaints. Each error represented a potential runtime crash or unexpected behavior in production.
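As an illustration of the first two error categories, here is a minimal sketch of an implicit any parameter and its annotated replacement. The names LineItem and orderTotal are hypothetical and do not come from the project's codebase.

```typescript
// Hypothetical example: the names below are illustrative, not from the project.
interface LineItem {
  price: number;
  quantity: number;
}

// Before (rejected under `noImplicitAny`, silently typed as `any` without it):
//   function orderTotal(items) { ... }
// With `any`, a call like orderTotal(undefined) type-checks and only fails at runtime.

// After: the parameter and return type are explicit, so a bad call site
// is reported by the compiler instead of surfacing in production.
function orderTotal(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.price * item.quantity, 0);
}

console.log(orderTotal([{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }])); // 25
```

The annotation costs one line, and in exchange every caller is checked against the LineItem shape at compile time.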
4. Security Vulnerabilities
The security audit revealed:
- 10 total vulnerabilities
- 4 moderate severity
- 6 high severity
- Most stemming from outdated react-scripts dependencies
These vulnerabilities weren't theoretical – they represented real attack vectors that could compromise the application.
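One way to turn these counts into a pass/fail decision is sketched below. The SeverityCounts shape is an assumption modeled on the per-severity counts that recent npm versions report under metadata.vulnerabilities in npm audit --json, so check it against your own output. The helper countAtOrAbove is hypothetical.

```typescript
// Assumed shape: a subset of what `npm audit --json` reports under
// `metadata.vulnerabilities` in recent npm versions. Check your own output.
interface SeverityCounts {
  info: number;
  low: number;
  moderate: number;
  high: number;
  critical: number;
}

// Severities from least to most serious.
const SEVERITY_ORDER: ReadonlyArray<keyof SeverityCounts> = [
  "info", "low", "moderate", "high", "critical",
];

// Hypothetical policy helper: how many findings sit at or above a threshold.
function countAtOrAbove(counts: SeverityCounts, threshold: keyof SeverityCounts): number {
  const start = SEVERITY_ORDER.indexOf(threshold);
  return SEVERITY_ORDER.slice(start).reduce((sum, level) => sum + counts[level], 0);
}

// The counts reported in this article: 4 moderate and 6 high.
const audit: SeverityCounts = { info: 0, low: 0, moderate: 4, high: 6, critical: 0 };

console.log(countAtOrAbove(audit, "moderate")); // 10
console.log(countAtOrAbove(audit, "high"));     // 6
```

npm also has a built-in --audit-level flag that sets the severity at which npm audit exits non-zero, which may be enough without any custom code.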
The Fix Strategy
My first instinct might have been to run npm audit fix --force, but this would have been catastrophic. When I tested this approach in a separate branch, it downgraded react-scripts to version 0.0.0 – completely breaking the project. This taught me an important lesson: automatic fixes without understanding can be worse than the original problem.
Instead, I developed a methodical approach:
1. Restored Clean State
git restore package.json package-lock.json
Starting from a known good state prevents cascading failures from previous fix attempts.
2. Updated Specific Packages
npm install --save-dev @testing-library/user-event@^14.5.2
Rather than bulk updates, I updated packages individually, testing after each change to ensure nothing broke.
3. Fixed CI Workflows
I systematically removed all error masking:
- Deleted all || echo patterns
- Removed continue-on-error: true flags
- Ensured each step properly reported its exit status
This made failures visible again – a necessary step before real fixes could begin.
4. Addressed Root Causes
With failures now visible, I could address each systematically:
- Added proper TypeScript annotations where missing
- Updated test implementations to match new testing library patterns
- Fixed genuine code issues that linting revealed
- Updated vulnerable dependencies with compatible versions
Lessons Learned
This experience reinforced several critical principles:
Visibility is crucial: You can't fix what you can't see. Masking errors creates technical debt that compounds over time.
Understand before fixing: The npm audit fix --force disaster showed why understanding the problem matters more than quick fixes.
Incremental progress: Fixing one test at a time, one TypeScript error at a time, led to steady progress without overwhelming changes.
CI/CD is your safety net: A properly configured pipeline catches issues before they reach production. Disabling checks removes this protection.
Key Takeaway
Never mask CI failures. The || echo pattern and continue-on-error: true create false confidence. These "fixes" are technical debt disguised as solutions. Fix the actual problems, even if it takes longer. Your future self, your team, and your users will thank you.
The path from 3/24 passing checks to a green pipeline wasn't just about making tests pass – it was about building confidence that our code was genuinely production-ready. Each fixed error, each resolved vulnerability, each passing test added to that confidence.
This is part of a series on fixing legacy codebases properly. Follow for more real-world debugging stories and practical DevOps solutions.
If you enjoyed this article, you can also find it published on LinkedIn and Medium.