From Cloud Infrastructure to AI Infrastructure: What Transfers and What Doesn't
After twenty years building cloud platforms, here's what I've learned about bridging into AI infrastructure, and which skills matter more than you'd expect.
Sekou M. Doumbouya
The views expressed here are my own and do not represent those of any current or former employer.
There’s a narrative floating around that AI infrastructure is a completely new discipline, that you need a PhD in machine learning or five years of CUDA programming to be relevant. I don’t buy it.
I’ve spent twenty years building cloud infrastructure at scale. Multi-region networks, disaster recovery systems, cost optimization at hyperscale. And the more I look at what teams building LLM serving platforms and training pipelines actually need, the more I see the same problems wearing different clothes.
Let me be specific about what transfers and what doesn’t.
What Transfers Directly
Capacity Planning Is Capacity Planning
Whether you’re sizing EC2 fleets or GPU clusters, the fundamental question is the same: how much compute do we need, when do we need it, and what happens when we’re wrong?
I’ve spent years building IPAM tooling, VPC visibility systems, and network capacity models. The skills behind that work (forecasting demand, building headroom without overspending, creating visibility into utilization before it becomes a crisis) apply directly to GPU capacity planning. The resource is different. The math is the same.
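To make the "same math" claim concrete, here's a minimal sketch of the sizing calculation that applies equally to EC2 fleets and GPU clusters. All of the numbers are illustrative assumptions, not real forecasts.

```python
import math

# Toy capacity model: identical whether the unit is a VM or a GPU.
# Forecast peak, per-unit throughput, and headroom are assumptions.
def gpus_needed(forecast_peak_qps: float,
                qps_per_gpu: float,
                headroom: float = 0.3) -> int:
    """GPUs required to serve forecast peak load with headroom."""
    base = forecast_peak_qps / qps_per_gpu
    return math.ceil(base * (1 + headroom))

# e.g. 120 req/s forecast peak, 2.5 req/s per GPU, 30% headroom:
print(gpus_needed(120, 2.5))  # 63
```

The interesting work, as always, is in the inputs: how good is the forecast, and how much headroom can you afford before finance starts asking questions.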
Platform Engineering Principles Don’t Change
The best AI infrastructure teams I’ve talked to are essentially platform engineering teams with a GPU budget. They’re solving the same problems I’ve been solving for two decades: how do you make a complex capability accessible to teams that shouldn’t need to understand the underlying infrastructure?
When I built Terraform modules that let service teams deploy into a Cloud WAN network without understanding segment policies or egress routing, that was platform engineering. When an AI platform team builds abstractions that let ML engineers deploy a model without understanding Kubernetes node affinity for GPU scheduling, that’s the same discipline. Golden paths, self-service, guardrails, documentation. The pattern is identical.
Cost Optimization Is Even More Critical
GPU compute is expensive. Really expensive. And the cost optimization instincts that infrastructure engineers develop over years of managing cloud spend are exactly what AI teams need.
I’ve driven multi-million dollar savings by questioning inherited architecture decisions and finding the right tradeoff between cost and capability. That same mindset, “are we using the most expensive resource for something a cheaper resource could handle?”, is even more valuable when the expensive resource costs ten times as much as an equivalent CPU instance.
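That question can be a literal back-of-envelope calculation. The hourly rates below are illustrative assumptions, but the shape of the check is the one worth internalizing: a job that runs slower on cheaper hardware can still cost far less.

```python
# Illustrative rates; real prices vary by provider and instance type.
GPU_HOURLY = 4.00   # a large accelerated instance, assumed
CPU_HOURLY = 0.40   # an equivalent CPU instance, ~10x cheaper

def cheaper_on_cpu(gpu_hours: float, cpu_slowdown: float) -> bool:
    """True if running the job slower on CPU still costs less overall."""
    gpu_cost = gpu_hours * GPU_HOURLY
    cpu_cost = gpu_hours * cpu_slowdown * CPU_HOURLY
    return cpu_cost < gpu_cost

# A preprocessing job that's only 3x slower on CPU:
print(cheaper_on_cpu(10, 3))  # True: $12 on CPU vs $40 on GPU
```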
Reliability Engineering Applies Directly
LLM serving has SLOs. Training pipelines fail. Model inference has tail latency problems. The tools are different (you’re monitoring token throughput instead of HTTP response times), but the discipline of SRE — defining what “healthy” looks like, building observability before you need it, designing for graceful degradation — transfers completely.
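The unit changes, the pattern doesn't. A sketch of what an SLO check looks like when the metric is tokens rather than requests; the target and sample data are illustrative.

```python
# Same SRE pattern, new unit: define "healthy", check it continuously.
def tokens_per_second(token_counts: list[int], wall_seconds: float) -> float:
    """Aggregate generation throughput over a measurement window."""
    return sum(token_counts) / wall_seconds

def slo_healthy(observed_tps: float, target_tps: float = 500.0) -> bool:
    """Target is an assumed example, not a recommended value."""
    return observed_tps >= target_tps

obs = tokens_per_second([1200, 950, 1100], wall_seconds=5.0)
print(obs, slo_healthy(obs))  # 650.0 True
```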
What Doesn’t Transfer (And Requires Humility)
This is the section that matters more. What transfers is reassuring. What doesn’t is where the real learning lives.
The Workload Characteristics Are Fundamentally Different
Traditional web services handle many small requests with predictable resource needs. You can model capacity with percentile-based latency targets and horizontal scaling. The relationship between load and resources is roughly linear and well-understood.
LLM inference breaks those assumptions in several ways. A single request’s resource consumption varies dramatically depending on context length, model size, and generation parameters. A 200-token prompt to a 7B model and a 100K-token prompt to a 405B model are the same “request” from a load balancer’s perspective but consume orders of magnitude different resources. Capacity planning tools built for web services don’t account for that variance.
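The variance is easy to quantify with a rough KV-cache sizing formula (keys and values cached per token, per layer, per attention head). The model shapes below are illustrative stand-ins, not exact specs for any particular model, but the orders-of-magnitude gap between the two "requests" is the point.

```python
# Rough KV-cache sizing: two requests, wildly different footprints.
# Layer/head counts are illustrative, not real model specs.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per: int = 2) -> float:
    """Keys + values (factor of 2) cached per token, fp16 by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
    return per_token * seq_len / 1e9

small = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=200)
large = kv_cache_gb(n_layers=126, n_kv_heads=8, head_dim=128, seq_len=100_000)
print(f"{small:.3f} GB vs {large:.1f} GB")  # 0.105 GB vs 51.6 GB
```

Same load balancer, same "one request", a ~500x difference in memory footprint. That's the variance percentile-based web capacity models were never built for.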
Training jobs are a different beast entirely. They’re long-running (hours to weeks), stateful, failure-sensitive, and consume GPU clusters in ways that make traditional batch scheduling look simple. When a training run fails at hour 47, you need to resume from a checkpoint, not restart from scratch. That recovery pattern doesn’t exist in the web services world I came from.
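The recovery pattern itself is simple to sketch, even if production implementations are not. A minimal version of resume-from-checkpoint, with hypothetical paths and a stand-in for the actual training step:

```python
# Sketch of resume-from-checkpoint: reload the latest saved state
# instead of restarting from step 0. Path and state are hypothetical.
import os
import pickle

CKPT = "/tmp/train_state.pkl"

def save_checkpoint(state: dict) -> None:
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint() -> dict:
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0}  # fresh start only if no checkpoint exists

state = load_checkpoint()           # hour-47 failure? resume here
for step in range(state["step"], 1000):
    state["step"] = step + 1        # stand-in for one training step
    if state["step"] % 100 == 0:    # periodic checkpoint
        save_checkpoint(state)
```

In a real system the state includes model weights, optimizer state, and data-loader position, the writes are tens of gigabytes, and making them fast without stalling training is its own storage problem.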
I’m still learning the details of GPU memory management, model parallelism strategies, and the inference optimization techniques that turn a model from “fits on one GPU” to “needs to be sharded across eight.” Pretending otherwise would be dishonest. But I’ve noticed that having strong systems fundamentals makes learning these specifics faster — you already understand the underlying compute, memory, and networking constraints. The concepts aren’t alien. The magnitudes are.
The Data Movement Problem Is Different in Kind
This is the one that caught me most off guard. In traditional infrastructure, data movement is about network throughput and latency. You’re optimizing for packets per second, bandwidth utilization, jitter. I’ve spent years building networks to move data efficiently. The patterns are well-established.
AI infrastructure introduces data movement problems that traditional networking doesn’t prepare you for:
Model weights are hundreds of gigabytes. Loading a model onto a GPU cluster isn’t a network request — it’s a data migration. When you’re serving multiple models and swapping between them based on request type, the time to load weights becomes a first-order scheduling concern. Imagine if your web server took ninety seconds to boot every time you deployed a new version. That’s the model loading problem.
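The ninety-second figure isn't rhetorical; it falls out of simple bandwidth arithmetic. The sizes and link speed below are illustrative assumptions:

```python
# Why weight loading is a first-order scheduling concern:
# plain bandwidth arithmetic at assumed sizes and link speeds.
def load_seconds(weights_gb: float, link_gbps: float) -> float:
    """Seconds to move `weights_gb` over a `link_gbps` link."""
    return weights_gb * 8 / link_gbps  # GB -> gigabits, then / rate

# 300 GB of weights over a 25 Gbit/s link:
print(round(load_seconds(300, 25)))  # 96 seconds of "cold start"
```

Any scheduler that swaps models based on request type has to amortize that cost, which is why model placement and cache-aware routing show up so early in serving-platform designs.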
Training datasets can be petabytes. Moving them to the compute isn’t a one-time operation — data pipelines need to feed training at a rate that keeps GPUs from starving. If GPUs sit idle waiting for data, you’re burning the most expensive compute available on nothing. The networking requirement isn’t throughput in the traditional sense — it’s sustained, uninterrupted throughput to a specific set of devices. The networking patterns I’ve applied at hyperscale (minimize cross-region transfer, use VPC peering for same-region traffic) still apply, but the tolerance for interruption is much lower.
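The "sustained throughput to a specific set of devices" requirement is also just arithmetic, and it's worth doing before the cluster exists. Consumption rates here are illustrative assumptions:

```python
# The pipeline must out-run the consumers or GPUs starve.
# Per-GPU consumption rate is an illustrative assumption.
def pipeline_keeps_up(gpus: int, gb_per_gpu_per_s: float,
                      pipeline_gbps: float) -> bool:
    """True if sustained pipeline throughput covers aggregate demand."""
    demand_gbps = gpus * gb_per_gpu_per_s * 8  # GB/s -> Gbit/s
    return pipeline_gbps >= demand_gbps

# 64 GPUs each consuming 0.5 GB/s need a sustained 256 Gbit/s feed:
print(pipeline_keeps_up(64, 0.5, pipeline_gbps=200))  # False: starving
```

Note what the check doesn't tolerate: averages. A pipeline that delivers 256 Gbit/s on average but stalls for thirty seconds an hour still idles the most expensive compute you own.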
Checkpoint/restore is the one that has no real analogue in traditional infrastructure. Training runs periodically save their state so they can resume after failures. These checkpoints can be tens of gigabytes each, written at regular intervals, to storage that needs to be both fast (don’t slow down training) and durable (don’t lose the checkpoint). The storage architecture for checkpointing is a hybrid of high-performance scratch space and reliable persistent storage that traditional web services never need.
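Why "fast" matters as much as "durable" for checkpoint storage: if checkpoint writes block training, slower storage translates directly into lost GPU time. Sizes, intervals, and write speeds below are illustrative assumptions:

```python
# Training time lost to synchronous checkpoint writes,
# at assumed checkpoint sizes, intervals, and storage speeds.
def ckpt_overhead(ckpt_gb: float, write_gbps: float,
                  interval_s: float) -> float:
    """Fraction of wall time spent blocked on checkpoint writes."""
    write_s = ckpt_gb * 8 / write_gbps
    return write_s / (interval_s + write_s)

# 40 GB checkpoints every 30 minutes to 10 Gbit/s storage:
print(f"{ckpt_overhead(40, 10, 1800):.1%}")  # 1.7% of wall time
```

A percent or two sounds small until you price it in GPU-hours across a multi-week run, which is exactly why the fast-scratch-plus-durable-tier hybrid exists.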
I keep running into situations where my networking instincts are directionally correct but magnitude-wrong. “Minimize data transfer costs” is right, but the cost isn’t measured in cents per gigabyte — it’s measured in GPU-hours wasted while data moves.
The Ecosystem Is Young and Shifting
Cloud infrastructure has mature tooling. Terraform, Kubernetes, and the CNCF ecosystem have been battle-tested for years. You can make architectural commitments with reasonable confidence that the tools will still exist and be supported in three years.
AI infrastructure tooling is younger, more fragmented, and changing fast. The serving framework that’s best practice today might be obsolete in six months. Training orchestrators are still competing for dominance. Even the hardware landscape shifts — what you optimize for H100s may not apply to the next generation.
That uncertainty requires a different approach to architectural decisions. In cloud infrastructure, I optimized for standardization: pick a tool, invest in it, build expertise. In AI infrastructure, I’m learning to optimize for replaceability: make the abstractions clean enough that swapping the underlying tool doesn’t require a rewrite. That’s a genuinely different architectural posture, and it’s one I’m still developing instincts for.
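"Optimize for replaceability" has a concrete code shape: callers depend on a narrow interface, and the framework of the month lives behind an adapter. A minimal sketch; the backend names and method signature are hypothetical, not any real framework's API.

```python
# Replaceability as an architectural posture: callers see only this
# interface, never the serving framework behind it.
from typing import Protocol

class ServingBackend(Protocol):
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class EchoBackend:
    """Stand-in implementation; a real adapter would wrap a framework."""
    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt[:max_tokens]

def handle_request(backend: ServingBackend, prompt: str) -> str:
    # Swapping the backend means writing one new adapter,
    # not rewriting every caller.
    return backend.generate(prompt, max_tokens=100)

print(handle_request(EchoBackend(), "hello"))  # hello
```

The discipline is keeping the interface honest: the moment callers reach around it for a framework-specific feature, you've quietly re-committed to the tool you were trying to stay independent of.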
The Bridge
Here’s what I think infrastructure engineers underestimate about our position: we’ve already solved the hard organizational problems that AI teams are just now encountering.
How do you get a hundred teams to adopt a new platform pattern? How do you build self-service infrastructure that doesn’t become a bottleneck? How do you optimize costs at scale without sacrificing reliability? How do you build standards that people actually follow?
Those are infrastructure problems dressed up as AI problems. And the people who’ve solved them before are exactly who AI teams need.
The technical specifics of GPU scheduling or model serving, those can be learned. The judgment about how to build platforms that work at organizational scale, that takes years to develop. If you have it, the AI infrastructure world needs you. Don’t let the hype cycle convince you otherwise.