Questioning Architectural Defaults: A Transit Gateway Cost Story

The views expressed here are my own and do not represent those of any current or former employer. Technical details have been generalized to share patterns without disclosing proprietary information.

I want to tell you about the most expensive “best practice” I’ve ever seen.

Early in a role leading cloud infrastructure at a large-scale consumer platform, I pulled up the networking cost breakdown for the first time. I’d expected some big numbers. What I didn’t expect was to find that a single architectural decision, one that nobody had questioned in years, was quietly burning through a staggering amount in AWS data transfer fees.

The culprit? AWS Transit Gateway. A service that every architecture blog recommends. A service that AWS themselves promote as the standard. A service that was absolutely the wrong choice for our traffic patterns.

How We Got Here

Here’s the thing about Transit Gateway. It’s genuinely good. It acts as a centralized hub connecting multiple VPCs, simplifies routing, gives you a single point of control. For most organizations, it’s the right call.

But “most organizations” is doing a lot of heavy lifting in that sentence.

Transit Gateway was adopted years before I arrived. The team that set it up made a reasonable decision with the information they had. Traffic was lower. The VPC count was smaller. The centralized model was clean and manageable.

Then traffic grew. And grew. And kept growing. Nobody revisited the decision because it was working. The routing was fine. The connections were stable. Nothing was “broken.” The architecture just happened to be hemorrhaging money in a way that didn’t show up in any operational dashboard.

The Math That Changed Everything

Let me break down why Transit Gateway gets expensive, because this catches a lot of engineers off guard.

Transit Gateway charges you for two things: an hourly attachment fee for each VPC connection, and a per-GB data processing charge for every gigabyte that flows through the gateway.

That per-GB charge? That’s the killer. When you have hundreds of services talking across VPCs, and all of that traffic routes through a central gateway, you’re paying a processing fee on every single byte. At low traffic volumes, it’s negligible. At hyperscale volumes, it compounds into something genuinely alarming.

Direct VPC peering, on the other hand, has no data processing charge. You still pay for cross-AZ transfers, but the gateway toll disappears completely. The tradeoff is management complexity. Peering is point-to-point, so you need more connections and routing gets more distributed.

That tradeoff sounds scary until you actually do the math.

”What Are We Getting for This?”

The first step wasn’t proposing a fix. It was understanding the problem deeply enough to have an honest conversation about it.

We ran a comprehensive analysis on AZ balances and data transfer patterns. Where was traffic actually flowing? Not where we assumed it flowed. Not where the architecture diagram said it flowed. Where the packets were actually going.

Then we asked the uncomfortable question: what exactly are we getting in return for the centralized model? The answer was centralized routing and simplified management. Real benefits. Genuinely useful capabilities.

But when you put a price tag next to those benefits, the conversation changes. Centralized routing is worth a lot. It’s not worth what we were paying.

Making the Switch Without Breaking Everything

This is the part where I have to be honest: I was nervous. Rearchitecting the networking layer of a system serving hundreds of millions of users is not something you do casually over a weekend.

We broke it into careful phases:

Routing complexity. Transit Gateway gives you centralized route tables. Peering means managing routes per connection. We leaned hard on Terraform CI/CD pipelines to manage peering connections declaratively. If your IaC game isn’t strong, this migration gets much harder.

Security boundaries. Centralized connectivity simplifies security policies. Moving to peering forced us to be more intentional about which VPCs could talk to each other. Honestly? This turned out to be a good thing. Explicit boundaries are better than implicit ones, even if they take more work to set up.

Operational risk. We validated each new peering connection before decommissioning the corresponding Transit Gateway route. No big bang cutover. No “flip the switch and pray.” Just patient, methodical migration, one connection at a time.

Team readiness. Could our teams actually operate in a more distributed model? This is where years of investing in IaC standards paid off. The tooling abstracted enough complexity that teams didn’t need to become networking experts overnight.

The Bigger Lesson

The specific savings will vary by organization. They’re proportional to traffic volume, and the numbers I saw reflected hyperscale traffic patterns. Your situation will be different.

But the lesson is universal:

Question your defaults regularly. The architecture decisions that were correct two years ago might not be correct today. Traffic patterns change. Pricing models change. Your team’s capabilities change. Schedule periodic reviews of your core infrastructure assumptions, especially the ones that feel settled. Especially those.

Follow the money. A growing cost line is your architecture trying to tell you something. Listen to it. When the bill keeps climbing and nobody can explain why, there’s usually an architectural assumption underneath that’s expired.

IaC is your safety net. We could not have made this transition safely without mature Terraform pipelines. The ability to declare, review, and test networking changes in code is what made a migration of this scale possible without keeping anyone up at night. Well, not too many nights.

Centralization is a spectrum. The answer wasn’t “centralized bad, distributed good.” It was “centralize what benefits from centralization, distribute what benefits from distribution.” Egress controls? Centralize those. Inter-service chatter? Peer directly. The right answer depends on what you’re actually doing with the traffic.

What I Keep Coming Back To

Every architecture decision has a shelf life. The question is whether you notice when it expires, or whether you keep paying for expired assumptions until someone finally pulls up the cost dashboard and says, “Wait, how much?”

That moment of questioning the default was the highest-leverage thing I did that year. Not the migration itself. Not the cost savings. The act of looking at something everyone accepted as “just how it works” and asking, “But does it have to be?”

That’s the habit worth building.

Co-authored with AI, based on the author's working sessions, dictations, and notes.