Fixing a Broken CI/CD Pipeline: From 24 Failures to Green Builds

The Problem

Imagine walking into work and discovering your CI/CD pipeline is showing red across the board. That's exactly what happened with our React/TypeScript enterprise application. Out of 24 GitHub Actions checks, only 3 to 5 were passing. At the low end that's 21 failures out of 24, a failure rate of about 87%: essentially a non-functional pipeline.

What made this situation particularly dangerous wasn't just the failures themselves, but how previous attempts to "fix" them had actually made things worse. Instead of addressing root causes, workarounds had been implemented that masked the real problems, creating a false sense of security.

The Investigation Process

When facing a complex problem like this, systematic investigation is crucial. Here's the discovery process I followed to understand what was really happening:

# 1. Examine GitHub workflows
ls -la .github/workflows/

# 2. Identify error masking patterns
grep -nE 'continue-on-error|\|\| (echo|true)' .github/workflows/*.yml

# 3. Check for existing error logs
ls -la erp-frontend/*.txt

# 4. Analyze TypeScript errors
head -30 erp-frontend/typecheck-errors.txt

# 5. Review lint issues
head -30 erp-frontend/lint-output.txt

# 6. Audit security vulnerabilities in production dependencies
#    (newer npm versions spell this flag --omit=dev)
npm audit --production

Each command revealed a piece of the puzzle. The workflow examination showed multiple CI configurations, some conflicting with others. The grep search exposed a troubling pattern – failures were being systematically hidden rather than fixed.

Root Causes Identified

1. Masked Failures in CI

The most alarming discovery was how extensively failures were being hidden:

# Bad: Hiding failures
run: npm run lint || echo "Linting completed with warnings"
run: npm run typecheck || echo "Type checking completed"
continue-on-error: true

This pattern is like putting tape over your car's check engine light. The problems don't go away – you just can't see them anymore. Every || echo statement and continue-on-error: true flag was allowing broken code to appear successful.
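To see why this pattern hides failures, recall that a shell list of the form `a || b` reports the exit status of whichever command ran last. The sketch below is only an illustration: `failing_step` is a hypothetical stand-in for a failing `npm run lint`, not something from the original workflows.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a failing CI step such as `npm run lint`.
failing_step() {
  return 1
}

# Masked: the `|| echo` branch runs, and its exit status (0) replaces the failure.
masked_status() {
  failing_step || echo "Linting completed with warnings" >/dev/null
  echo "$?"
}

# Unmasked: the step's own exit status is what the CI runner would see.
unmasked_status() {
  failing_step
  echo "$?"
}

echo "masked:   $(masked_status)"    # prints "masked:   0"
echo "unmasked: $(unmasked_status)"  # prints "unmasked: 1"
```

The CI runner only looks at the final exit status of a step, so the masked variant is indistinguishable from a clean lint run.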

2. Outdated Test Dependencies

The testing framework had fallen behind:

"@testing-library/user-event": "^13.5.0"  // Missing setup() method

This version mismatch meant tests were using deprecated patterns. Version 13 has no setup() method at all, while version 14 introduced userEvent.setup() as the recommended entry point and made every interaction asynchronous, both breaking changes for existing tests. Without updating, tests would either fail or, worse, pass incorrectly.

3. TypeScript Configuration Issues

The TypeScript errors file was a staggering 20KB, hundreds of lines of type errors, including:

  • Missing type annotations on function parameters
  • Implicit any types throughout the codebase
  • Incomplete mock implementations in tests
  • Mismatched prop types in React components

These weren't just pedantic compiler complaints. Each error represented a potential runtime crash or unexpected behavior in production.
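With a log that size, it can help to group the errors by TypeScript error code before fixing anything, so the most common categories are handled first. The following is a minimal sketch, assuming the log contains plain `tsc` output lines of the form `file(line,col): error TSxxxx: message`; the function name `summarize_ts_errors` is made up for this example.

```shell
#!/usr/bin/env bash
# Count TypeScript errors per error code, most frequent first.
# Assumes plain `tsc` output lines such as:
#   src/App.tsx(12,5): error TS7006: Parameter 'x' implicitly has an 'any' type.
summarize_ts_errors() {
  local log_file="$1"
  grep -oE 'error TS[0-9]+' "$log_file" \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $3, $1 }'   # uniq -c emits "<count> error <code>"; print "<code> <count>"
}
```

For example, `summarize_ts_errors erp-frontend/typecheck-errors.txt | head` would list the ten most frequent error codes with their counts.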

4. Security Vulnerabilities

The security audit revealed:

  • 10 total vulnerabilities
  • 4 moderate severity
  • 6 high severity
  • Most stemming from outdated react-scripts dependencies

These vulnerabilities weren't theoretical – they represented real attack vectors that could compromise the application.
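One way to keep findings like these from creeping back is to turn the audit into a gate that fails on a chosen severity. The sketch below is an assumption about how that could look, not the setup used in this project: `audit_gate` is a hypothetical helper, and it relies on a recent npm where `npm audit` accepts `--omit=dev` and `--audit-level`.

```shell
#!/usr/bin/env bash
# Fail when production dependencies have vulnerabilities at or above a given level.
# Hypothetical helper; assumes an npm version that supports --omit=dev.
audit_gate() {
  local level="${1:-high}"
  if npm audit --omit=dev --audit-level="$level"; then
    echo "No production vulnerabilities at or above '$level'"
  else
    echo "Audit failed: vulnerabilities at or above '$level' (or the audit itself errored)" >&2
    return 1
  fi
}
```

Calling `audit_gate high` in a CI step would let moderate findings through as a report while still failing the build on high-severity ones.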

The Fix Strategy

My first instinct was to run npm audit fix --force, but that would have been catastrophic. When I tested this approach in a separate branch, it downgraded react-scripts to version 0.0.0, completely breaking the project. This taught me an important lesson: automatic fixes applied without understanding can be worse than the original problem.

Instead, I developed a methodical approach:

1. Restored Clean State

git restore package.json package-lock.json

Starting from a known good state prevents cascading failures from previous fix attempts.
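The same idea can be wrapped around any risky command, such as the forced audit fix described earlier. The sketch below is an alternative to using a separate branch, not what was done here: a hypothetical `try_and_revert` helper snapshots the manifests, runs the command, reports whether package.json changed, and then puts the originals back.

```shell
#!/usr/bin/env bash
# Run a risky command, show what it did to package.json, then restore the manifests.
# Hypothetical helper; must be called from the directory that contains package.json.
try_and_revert() {
  local backup_dir status=0
  backup_dir="$(mktemp -d)"
  cp package.json "$backup_dir/"
  if [ -f package-lock.json ]; then cp package-lock.json "$backup_dir/"; fi

  # Run the command under test, recording (not hiding) its exit status.
  "$@" || status=$?

  # diff exits 0 when the files are identical and 1 when they differ.
  if diff -u "$backup_dir/package.json" package.json; then
    echo "package.json unchanged"
  else
    echo "package.json was modified (diff above)"
  fi

  # Put the original manifests back and clean up the snapshot.
  cp "$backup_dir/package.json" package.json
  if [ -f "$backup_dir/package-lock.json" ]; then
    cp "$backup_dir/package-lock.json" package-lock.json
  fi
  rm -rf "$backup_dir"
  return "$status"
}
```

A call such as `try_and_revert npm audit fix --force` would print the manifest diff and leave the files as they were. The helper does not touch node_modules, so running `npm ci` afterwards is a reasonable way to bring installed packages back in line with the restored lockfile.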

2. Updated Specific Packages

npm install --save-dev @testing-library/user-event@^14.5.2

Rather than bulk updates, I updated packages individually, testing after each change to ensure nothing broke.
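The update-then-test loop can be sketched as a small script. Everything here is illustrative: `install_cmd`, `verify_cmd`, and `update_one_at_a_time` are hypothetical names, and the verify step assumes a react-scripts project where setting `CI=true` makes `npm test` run once instead of watching.

```shell
#!/usr/bin/env bash
# Hypothetical hooks; replace them to change how a package is installed or verified.
install_cmd() { npm install --save-dev "$1"; }
verify_cmd()  { CI=true npm test; }

# Update packages one at a time and stop at the first update that breaks something.
update_one_at_a_time() {
  local spec
  for spec in "$@"; do
    echo "Updating $spec"
    if ! install_cmd "$spec"; then
      echo "Install failed for $spec" >&2
      return 1
    fi
    if ! verify_cmd; then
      echo "Tests failed after updating $spec; stop and investigate" >&2
      return 1
    fi
  done
}
```

Because the loop stops at the first failing package, the culprit is always the most recent update rather than one of many changes made at once.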

3. Fixed CI Workflows

I systematically removed all error masking:

  • Deleted all || echo patterns
  • Removed continue-on-error: true flags
  • Ensured each step properly reported its exit status

This made failures visible again – a necessary step before real fixes could begin.
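To keep the masks from returning, the same search used during the investigation can be turned into a check that fails when a masking pattern is present. This is a minimal sketch with a hypothetical `check_no_masking` function; the pattern is deliberately strict, so a team with a legitimate use of `|| true` would need to adjust it.

```shell
#!/usr/bin/env bash
# Fail if any workflow file still contains a common error-masking pattern.
# Hypothetical helper; defaults to the standard GitHub Actions workflow directory.
check_no_masking() {
  local dir="${1:-.github/workflows}"
  if [ ! -d "$dir" ]; then
    echo "Workflow directory not found: $dir" >&2
    return 2
  fi
  # -r: recurse, -n: show line numbers, -E: extended regular expressions.
  if grep -rnE 'continue-on-error:[[:space:]]*true|\|\|[[:space:]]*(echo|true)' "$dir"; then
    echo "Error masking found in $dir" >&2
    return 1
  fi
  echo "No error masking found in $dir"
}
```

Running `check_no_masking` as an early pipeline step prints each offending file and line number and then fails the build.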

4. Addressed Root Causes

With failures now visible, I could address each systematically:

  • Added proper TypeScript annotations where missing
  • Updated test implementations to match new testing library patterns
  • Fixed genuine code issues that linting revealed
  • Updated vulnerable dependencies with compatible versions

Lessons Learned

This experience reinforced several critical principles:

Visibility is crucial: You can't fix what you can't see. Masking errors creates technical debt that compounds over time.

Understand before fixing: The npm audit fix --force disaster showed why understanding the problem matters more than quick fixes.

Incremental progress: Fixing one test at a time, one TypeScript error at a time, led to steady progress without overwhelming changes.

CI/CD is your safety net: A properly configured pipeline catches issues before they reach production. Disabling checks removes this protection.

Key Takeaway

Never mask CI failures. The || echo pattern and continue-on-error: true create false confidence. These "fixes" are technical debt disguised as solutions. Fix the actual problems, even if it takes longer. Your future self, your team, and your users will thank you.

The path from 3/24 passing checks to a green pipeline wasn't just about making tests pass – it was about building confidence that our code was genuinely production-ready. Each fixed error, each resolved vulnerability, each passing test added to that confidence.


This is part of a series on fixing legacy codebases properly. Follow for more real-world debugging stories and practical DevOps solutions.


If you enjoyed this article, you can also find it published on LinkedIn and Medium.