Case Study

Over 5,000 EKS Nodes Patched in Less Than a Week to Prevent Attacks

How a Fortune 100 platform team patched a critical Kubernetes vulnerability across its fleet in days, not the standard month.

Challenge

When a critical vulnerability drops, can you patch your entire Kubernetes fleet in days, across every team, without taking down production? For one Fortune 100 technology company, that question became real.

The affected infrastructure provides the foundation for business-critical services used by hundreds of millions of customers worldwide.

With exploit code public and the vulnerability under active exploitation, the platform team had to apply the patch quickly, with no margin for error and no service disruption.

The standard remediation window is 30 days, but given the severity of the vulnerability, management asked for the fleet to be patched that week.

OpsWerks Ready for the Challenge

OpsWerks regularly runs planned EKS upgrades and ongoing vulnerability remediation across this fleet. The engineers who responded are the same ones who operate it day to day, so there was no ramp-up and no relearning a system from scratch. That familiarity with the environment, the approval paths, and the OpsWerks-built tooling made an accelerated response possible.

Experience with prior large-scale EKS upgrades let the team take on patching 5,000+ nodes across multiple locations on a very compressed timeline.

OpsWerks teams are certified across a number of technologies including Kubernetes and AWS:

Certified Kubernetes Administrator (CKA)

Certified Kubernetes Application Developer (CKAD)

AWS Certified Solutions Architect - Associate

AWS Certified Cloud Practitioner

OpsWerks Approach

OpsWerks delivered the entire accelerated response inside the existing engagement, compressing a 30-day remediation window to under a week.

No SOW amendment. No change order tickets. No added cost.

End-to-end Ownership

OpsWerks owned coordination, communications, approvals, pre-checks, execution, monitoring, and post-checks. The customer's FTE manager set cluster sequence and patching priority.

Phased Rollout to Minimize Risk

The rollout started in the lower environments (dev, qa, perf), then moved to cert, staging, and production. US East and US West ran on separate days to minimize risk.
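As a rough illustration of the ordering described above, the sketch below sorts clusters by environment stage and then by region. It is not the OpsWerks tooling: the cluster names, the region labels, and the order_clusters function are hypothetical, and the actual calendar (which stage ran on which day) is not given in this case study, so it is left out.

```python
# Stage order follows the rollout described in the text.
STAGE_ORDER = ["dev", "qa", "perf", "cert", "staging", "production"]
# Region labels are hypothetical stand-ins for US East and US West.
REGION_ORDER = ["us-east", "us-west"]


def order_clusters(clusters):
    """Sort (name, environment, region) tuples into rollout order.

    Lower environments come first. Within an environment, clusters are
    grouped by region so each region can be scheduled on its own day.
    An unknown environment or region raises ValueError.
    """
    def sort_key(cluster):
        name, environment, region = cluster
        return (STAGE_ORDER.index(environment), REGION_ORDER.index(region), name)

    return sorted(clusters, key=sort_key)


# Hypothetical cluster inventory, for illustration only.
clusters = [
    ("payments-prod-w", "production", "us-west"),
    ("payments-dev-e", "dev", "us-east"),
    ("payments-prod-e", "production", "us-east"),
    ("payments-qa-w", "qa", "us-west"),
]

for name, environment, region in order_clusters(clusters):
    print(f"{environment:<10} {region:<8} {name}")
```

Keeping the stage and region order in two small lists makes the sequence easy to review and change before a rollout starts.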

OpsWerks Built Proven Tooling

Deployment ran on OpsWerks-built scripts, and each batch was validated by pre-check and post-check health scripts.
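The batch-and-validate pattern described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the OpsWerks scripts: patch_in_batches and its callback parameters (pre_check, patch_node, post_check) are hypothetical names, and the health checks here are stand-in lambdas rather than real cluster probes.

```python
def patch_in_batches(nodes, batch_size, pre_check, patch_node, post_check):
    """Patch nodes batch by batch, gating each batch on health checks.

    Returns (validated_nodes, error). The rollout stops at the first
    failed check; error is None when every batch passes.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    validated = []
    for start in range(0, len(nodes), batch_size):
        batch = list(nodes[start:start + batch_size])

        # Do not touch a batch that is unhealthy before patching.
        if not pre_check(batch):
            return validated, f"pre-check failed for batch starting at {batch[0]}"

        for node in batch:
            patch_node(node)

        # Stop before the next batch if this one did not come back healthy.
        if not post_check(batch):
            return validated, f"post-check failed for batch starting at {batch[0]}"

        validated.extend(batch)

    return validated, None


# Hypothetical usage: five nodes, batches of two, and a post-check
# that is rigged to fail for the batch containing node-4.
nodes = [f"node-{i}" for i in range(5)]
patch_log = []
validated, error = patch_in_batches(
    nodes,
    batch_size=2,
    pre_check=lambda batch: True,
    patch_node=patch_log.append,
    post_check=lambda batch: "node-4" not in batch,
)
print(validated)
print(error)
```

Stopping at the first failed check limits a bad patch to a single batch, which is the point of validating between batches rather than at the end.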

Live Monitoring at High-Risk Windows

Grafana and Prometheus dashboards, tuned across prior engagements, tracked every batch. Manual watch during high-risk windows let the team take action before alerts fired.

Coordination and Approvals

Open conference calls and dedicated Slack threads, grouped by team owner, kept teams aligned. OpsWerks managed approvals, escalating delays through the FTE project manager.

Earned Expertise

An OpsWerks engineer caught that the customer's traffic engineering stacks were running a different EKS version and had to be upgraded first. Had the upgrade sequence been wrong, regional traffic failover would have broken at cutover, on services used by hundreds of millions of users. The team raised it, replanned, and executed without disruption.
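A sequencing constraint like this one can be made explicit as a dependency graph, so the ordering is checked rather than remembered. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the stack names and the upgrade_order function are hypothetical, and the case study does not say how the team actually recorded the dependency.

```python
from graphlib import TopologicalSorter


def upgrade_order(prerequisites):
    """Return stacks in an order that respects 'must be upgraded first'.

    prerequisites maps each stack to the set of stacks that have to be
    upgraded before it. A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(prerequisites).static_order())


# Hypothetical stack names: the application clusters depend on the
# traffic engineering stacks being upgraded first.
prerequisites = {
    "traffic-engineering": set(),
    "application-clusters": {"traffic-engineering"},
}

print(upgrade_order(prerequisites))
# -> ['traffic-engineering', 'application-clusters']
```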

Saying nothing would have been easier, and a failure at cutover would only have meant more work for OpsWerks. The team raised it because it was right for the customer.

Results

Speed at scale

Over 5,000 EKS nodes patched in less than a week

Zero disruption

Zero P0/P1 incidents caused by the rollout

Weeks to days

Compressed 30-day compliance cycle to under 7 days

Proven reliability

Hundreds of millions of customers continued to use services with zero unplanned outages

No added cost

Delivered inside the existing SOW: no amendment, no change order tickets, no added cost

Impact

Without OpsWerks, the nodes serving hundreds of millions of customers could have been exposed to a public, actively exploited vulnerability for up to a month.

With OpsWerks, the response started immediately: no delayed kickoff, no waiting for resources to be assigned. The exposure window closed in under a week, the failover risk was caught and removed before rollout, and zero P0 or P1 incidents touched production.

OpsWerks absorbed the accelerated request inside the existing engagement, so the customer's cloud engineering, traffic engineering, and application teams kept focus on their roadmaps instead of dropping everything to firefight. This is outcome ownership in practice: OpsWerks is measured by problems removed, not by hours billed or tickets left open.

"Thanks a lot for the amazing work this week. It was no easy feat to patch over 5,000 nodes within a week. Thank you for coming together as a team to make this happen on such short notice and within such a short time frame. Really appreciate it."

- Cloud Engineering leader, Fortune 100 technology company

About OpsWerks

For the past decade, OpsWerks has been the trusted partner to some of the world's most demanding platform, infrastructure, DevOps, and SRE teams, delivering managed services that operate and support mission-critical systems at scale. You define the outcomes. We own the delivery.

If your team runs Kubernetes at scale, the same readiness applies to your fleet.

Discover how we can help. Schedule a discovery call: partnerwithus@opswerks.com