It wasn’t the first test run that bothered me, it was the second.
I was still a new developer, building one of my first production backends from the ground up. We were using a cloud provider for Postgres, and my early smoke tests looked clean. Single queries behaved. Endpoints worked. The system felt solid.
Then I wired up the frontend analytics dashboard, and the moment it started fanning out async requests, everything went sideways. My terminal barked “PrepareStmt: false” pgx errors at me with the cadence of improvisational jazz. The same request could pass or fail on different runs of the exact same smoke test, which was even more maddening. I did what any new developer would do and hunted for graybeard insights on Stack Overflow, changing pooler settings and adding options to the connection string, but the problem remained sticky.
Eventually I learned the provider had PgBouncer enabled by default. My application pool and the provider pool were effectively fighting over the connection lifecycle, a pool-on-pool conflict.
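I won’t pretend to remember the exact settings that finally calmed things down, but the usual way out of this particular conflict is to stop depending on server-side prepared statements when a transaction-pooling PgBouncer sits between your pool and Postgres. Here is a minimal sketch with pgx v5; DATABASE_URL is a placeholder for the provider’s pooled endpoint:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// DATABASE_URL is a placeholder; point it at the pooled (PgBouncer) endpoint.
	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}

	// PgBouncer in transaction-pooling mode can hand each transaction a different
	// server connection, so a prepared statement created on one connection may not
	// exist on the next. Using the simple query protocol avoids relying on them.
	cfg.ConnConfig.DefaultQueryExecMode = pgx.QueryExecModeSimpleProtocol

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer pool.Close()

	var now string
	if err := pool.QueryRow(context.Background(), "SELECT now()::text").Scan(&now); err != nil {
		log.Fatalf("query: %v", err)
	}
	log.Println("connected at", now)
}
```

pgx v5 also accepts this as a connection string option (default_query_exec_mode=simple_protocol), which is handy when a framework builds the pool for you. The point is less the specific knob and more that the knob exists because two pools are sharing one connection lifecycle.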
There was nothing wrong with the code I wrote. The service behaved exactly the way I designed it, deterministically. The cloud database did the same. The failure lived in the handoff between them. That’s where distributed systems tend to surprise you, and that experience permanently changed how I think about the gaps between components.
When we plan software, we usually start by drawing boxes.
A service here. A scheduler there. A repository off to the side. We connect them with lines to show how information moves through the system. The control plane talks to the scheduler. The scheduler talks to the repository. The diagram makes sense. The responsibilities feel clear. Then we build it.
Those boxes become binaries. Or containers. Or functions deployed somewhere we don’t fully control. The lines turn into network calls, often crossing machines, namespaces, subnets, tunnels, and eventually physical hardware. We ship faster. We scale. We automate deployments. On paper, everything looks cleaner than ever. And yet, when systems fail, they almost never fail inside the boxes.
They fail in the space between them.
We tend to focus intensely on what happens within a component: function boundaries, data structures, control flow, performance characteristics. We reason carefully about how data moves through our code. But the lines we draw between components are often treated as if they were inert. As if they simply “exist.” They are not.
That line on the diagram is not just a connection. It is an abstraction over an entire world of moving parts, many of which are volatile, ephemeral, and outside our direct control. That line represents network interfaces, routing tables, firewalls, kernel behavior, packet queues, retries, timeouts, buffers, cables, switches, and physical infrastructure living in places you will never see.
No matter how high-level your programming language is, the moment you send data from one process to another over a network, your system becomes low-level again. That message eventually traverses kernel space. It touches iptables or equivalent enforcement layers. It becomes packets. Those packets may become electrical impulses on copper, flashes of light through fiber, or radio waves in the air. And all of that infrastructure is ephemeral.
Cables break. Routes change. Configurations drift. Hardware is replaced. Tunnels flap. Clocks skew. Someone reboots the wrong device. I think we are all acutely aware of the dangers of sharks chewing on undersea cables! None of this is hypothetical, and none of it is reachable from your application code.
From the perspective of a programmer sitting comfortably behind layers of abstraction, this can be easy to forget. The line looks simple on the diagram. But in reality, it is often the most fragile and failure-prone part of the entire system. This is where assumptions break.
If we don’t think carefully about what those lines represent, we implicitly assume they are reliable, ordered, timely, and safe. We assume connectivity. We assume availability. We assume intent. When those assumptions hold, everything feels easy. When they don’t, systems behave in ways that feel surprising, even chaotic.
Writing robust software means taking responsibility for that space between components. Not by trying to control it completely (we can’t) but by designing systems that survive it. Systems that expect partial failure. Systems that preserve integrity when availability degrades. Systems that fail in ways that humans can understand and recover from.
If you only reason about what happens inside your components, you are only reasoning about half of the system. The other half lives in the gaps.
Systems fail between responsibilities
Most failures don’t happen because a component doesn’t do what it was designed to do.
They happen because responsibility was assumed, but not owned.
In early systems, responsibility is usually clear. One service does one thing. One database owns a set of records. One process handles a request from start to finish. When something breaks, there’s a relatively small surface area to inspect, and it’s often obvious where to look.
As systems grow, that clarity erodes, not because engineers become careless, but because scale forces separation. Responsibilities are divided so teams can move independently. Components are isolated to improve reliability. Interfaces are introduced so work can proceed in parallel.
This is all good engineering, but separation creates seams, and seams are where responsibility becomes ambiguous. When a request crosses a boundary, who is responsible for what happens next?
- Is the caller responsible for retries? Or the callee?
- Is authentication validated once, or revalidated downstream?
- If data arrives late, duplicated, or out of order, who is expected to notice?
- If a dependency is unavailable, should the system block, degrade, or redirect human behavior?
These questions are rarely answered explicitly. Instead, they’re answered implicitly through defaults, conventions, and assumptions that accumulate over time.
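To make that concrete, here is what explicitly answering the first of those questions might look like. This is an invented sketch, not a recommendation: the caller owns retries, the budget is bounded, only failures the callee marks as transient are retried, and the whole thing respects the caller’s deadline, so none of it is left to library defaults.

```go
package payments

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrUnavailable is the hypothetical error the callee returns for transient failures.
var ErrUnavailable = errors.New("payments service unavailable")

// ChargeWithRetry makes the retry decision explicit: the caller owns it,
// it is bounded, and it stops as soon as the caller's deadline expires.
func ChargeWithRetry(ctx context.Context, charge func(context.Context) error) error {
	const maxAttempts = 3
	backoff := 100 * time.Millisecond

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = charge(ctx)
		if lastErr == nil {
			return nil
		}
		// Only retry failures the callee has marked as transient.
		if !errors.Is(lastErr, ErrUnavailable) {
			return lastErr
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up after %d attempts: %w", attempt, ctx.Err())
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}
```

The value is not the loop itself; it is that the answer to “who retries, and how hard?” is now written down in one place where the next engineer can find it.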
Most of the systems I’ve worked on didn’t fail because a component misbehaved. They failed because two components behaved correctly according to their own rules, while the system as a whole behaved incorrectly.
One service assumed the network was reliable. Another assumed failures would be surfaced synchronously. A third assumed that a downstream system was immutable once acknowledged. None of those assumptions were unreasonable on their own. Together, they created a failure mode that no single component could see. This is why system bugs feel so different from code bugs.
A code bug usually lives in a place you can point to. A line of logic. A missing condition. An incorrect transformation. Once you find it, the fix is often satisfying and discrete.
A system bug lives in the interaction between correct pieces. There is no single line to change. Fixing it means deciding where responsibility should live, and that decision almost always involves tradeoffs.
- Should we retry here, or let it bubble up?
- Should we fail fast, or attempt recovery?
- Should this system preserve availability, or preserve correctness?
- Should humans be involved when automation breaks down?
These are not questions your compiler can answer for you, nor are they questions that belong cleanly to any one component. They belong to the space between them.
Over time, I’ve noticed a pattern: the more distributed a system becomes, the more likely failures are to emerge at the boundaries where responsibility is split, deferred, or silently duplicated. And the more regulated or compliance-sensitive the domain, the higher the cost of getting those boundaries wrong.
This is where design stops being about correctness in isolation and starts being about coordination under uncertainty.
A system that works perfectly when everything is healthy can still be fragile. A system that anticipates partial failure, ambiguous ownership, and delayed recovery often feels more complex at first, but tends to survive the moments that actually matter. The hard part is not writing components that behave correctly, it is deciding what happens when none of them are fully in charge.
Boundaries are where assumptions break
Boundaries exist to simplify thinking.
We introduce them so we can reason locally. This service does X. That service does Y. The contract between them is small and well-defined. If everyone respects the boundary, the system stays understandable.
The problem is that boundaries don’t just separate responsibilities. They also hide assumptions.
At a boundary, we usually assume things like:
- the network is available
- requests arrive in order
- responses are timely
- identities are stable
- permissions were already checked
- data has not been altered in transit
Most of the time, these assumptions hold well enough that we stop noticing them. They become part of the background. We internalize them without ever writing them down. Until one day, they don’t hold.
When that happens, the boundary stops being a clean interface and starts behaving like a fault line.
I’ve seen systems where every individual component was correct, tested, and well-behaved, yet the system still failed in ways that were hard to diagnose. In almost every case, the failure traced back to an assumption that lived at a boundary and was never made explicit.
A service assumed that authentication was handled upstream, but upstream assumed downstream would revalidate. A process assumed that a successful write meant the data was durable, but durability depended on a network hop that had quietly degraded. A retry mechanism assumed idempotency that was never actually guaranteed.
These aren’t mistakes made by careless engineers. They’re the natural result of abstraction doing its job too well. What makes this especially tricky is that assumptions don’t break cleanly. They degrade.
Latency creeps up. Retries overlap. Partial failures start to look like success from one side and failure from another. Humans intervene in ways that weren’t anticipated. Automation does exactly what it was told, even when the situation has changed.
This is why some of the most consequential design decisions I’ve made didn’t involve adding new functionality. They involved deciding where assumptions should be enforced, where they should be validated, and where they should be treated as untrusted.
Sometimes that meant moving checks closer to the edge of the system. Sometimes it meant adding explicit failure modes where none existed before. Sometimes it meant accepting a little more friction in exchange for clarity and safety.
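As one invented example of moving a check to the edge: instead of assuming retried requests are idempotent, the receiving side can record which request IDs it has already applied and acknowledge duplicates without re-applying them. Everything here (Applied, requestID) is named for the sketch; a real system would keep this state in durable storage behind a unique constraint rather than in memory.

```go
package intake

import (
	"context"
	"sync"
)

// Applied tracks request IDs that have already been processed. In production
// this belongs in durable storage with a unique constraint, not in memory.
type Applied struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewApplied() *Applied { return &Applied{seen: make(map[string]bool)} }

// Handle applies a request at most once per request ID. A duplicate delivery,
// caused by a retry somewhere upstream, is acknowledged without being re-applied.
func (a *Applied) Handle(ctx context.Context, requestID string, apply func(context.Context) error) error {
	a.mu.Lock()
	if a.seen[requestID] {
		a.mu.Unlock()
		return nil // already applied: the assumed idempotency is now enforced
	}
	a.mu.Unlock()

	if err := apply(ctx); err != nil {
		return err
	}

	a.mu.Lock()
	a.seen[requestID] = true
	a.mu.Unlock()
	return nil
}
```

Concurrent duplicates can still race between the check and the record here; the durable version closes that gap with a transaction or a unique index. The friction is real, but so is the assumption it replaces.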
Boundaries don’t fail because they exist. They fail because we forget to ask what they’re protecting us from, and what they’re hiding from us.
Abstraction leaks live in the gaps
Abstraction is not the enemy. It’s the reason we can build anything at all.
Every useful system depends on layers of abstraction: languages, runtimes, operating systems, containers, networks, orchestration, hardware. Each layer removes details so we can think at a higher level. The mistake is believing that abstraction removes responsibility. In reality, abstraction redistributes it.
When everything is working, abstractions feel solid. When something goes wrong, the details they hide have a way of reappearing, often all at once. This is what people mean when they say abstraction leaks, but the leak rarely happens inside a component. It happens between them.
A timeout that was “reasonable” at one layer becomes catastrophic when combined with retries at another. A permission model that was sufficient inside a service becomes dangerous when requests are replayed or proxied elsewhere. A deployment that was safe in isolation causes cascading restarts when orchestrated at scale. These are not edge cases. They are emergent behavior.
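The timeout-plus-retries interaction is worth making concrete with some hypothetical numbers. If three layers each independently decide that three attempts with a two-second timeout is reasonable, one user request can fan out into dozens of attempts, and the inner retry budget alone can outlast the outer deadline:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical, locally "reasonable" settings chosen by three independent layers.
	const attemptsPerLayer = 3
	const layers = 3
	perAttemptTimeout := 2 * time.Second
	callerDeadline := 5 * time.Second

	// Worst case, retries multiply across layers: 3 * 3 * 3 = 27 attempts
	// against the innermost dependency for a single user request.
	totalAttempts := 1
	for i := 0; i < layers; i++ {
		totalAttempts *= attemptsPerLayer
	}

	// The innermost layer alone may spend 3 * 2s = 6s before reporting failure,
	// which already exceeds the caller's 5s deadline: the caller gives up (and
	// possibly retries) while work is still in flight underneath it.
	innerBudget := time.Duration(attemptsPerLayer) * perAttemptTimeout

	fmt.Printf("worst-case attempts against the innermost dependency: %d\n", totalAttempts)
	fmt.Printf("innermost retry budget %v vs caller deadline %v\n", innerBudget, callerDeadline)
}
```

No single layer chose bad numbers; the problem only exists in the composition.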
What I’ve learned over time is that the most resilient systems are not the ones with the fewest abstractions, but the ones that treat the gaps between abstractions as first-class design concerns.
That means asking questions like:
- What happens when this dependency is slow but not down? (one possible answer is sketched after this list)
- What happens when this message arrives twice?
- What happens when automation succeeds but intent has changed?
- What happens when end-users are forced to intervene?
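Taking the first question above as an example: “slow but not down” is usually handled by treating slowness as failure after an explicit deadline, and by making the degraded path loud enough to notice. This is a sketch under invented names (Quote, fetchQuote, cachedQuote), not a prescription:

```go
package quotes

import (
	"context"
	"errors"
	"log"
	"time"
)

// Quote and the function parameters below are invented for this sketch.
type Quote struct{ Price float64 }

// GetQuote treats "slow" the same as "down" after 300ms, and makes the
// degraded path explicit and observable instead of silently hanging.
func GetQuote(ctx context.Context, fetchQuote func(context.Context) (Quote, error), cachedQuote func() (Quote, bool)) (Quote, error) {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel()

	q, err := fetchQuote(ctx)
	if err == nil {
		return q, nil
	}

	// The dependency is slow or failing: fall back, but say so loudly so the
	// degradation is visible to operators rather than hidden behind a default.
	if cached, ok := cachedQuote(); ok {
		log.Printf("quote service degraded (%v); serving cached quote", err)
		return cached, nil
	}
	return Quote{}, errors.Join(errors.New("quote unavailable and no cached fallback"), err)
}
```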
It also means designing failure paths that preserve meaning. A system that fails loudly but clearly is often preferable to one that degrades silently and corrupts state. A manual fallback that is slightly awkward but explicit can be safer than an automated path that no one fully understands under pressure.
One of the most validating moments of my career came when a system I had built failed exactly the way it was designed to. Automation stopped. Humans were guided through a fallback path. Compliance was preserved. The business continued operating. I was not available at the time, and that was the point. That experience fundamentally changed how I think about success.
Success is not just correctness when everything is healthy. Success is preserving intent when parts of the system become unreliable, when assumptions stop holding, and when the abstraction layers you depend on start to show their seams. The space between components is where that intent is either upheld or lost.
You can ignore that space for a long time. Many systems do. But once a system grows beyond a certain point, that space becomes the system. Learning to reason about the gaps is how you avoid getting surprised later.
If the space between components is where systems fail, then the most useful thing we can do is learn how to reason about that space.
There isn’t a universal set of best practices for this. The right choices depend on the system, the domain, the constraints, and the cost of failure. What matters more than the answers is developing the habit of asking the right questions early and revisiting them often.
Treat boundaries as something to interrogate
Not just lines between components
When you introduce a boundary, it’s worth slowing down and asking what assumptions are crossing it along with the data. Not just what the interface looks like, but what is being trusted implicitly.
- What does this component assume about the world when it sends a request?
- What does the receiving side assume has already been validated?
- What happens if those assumptions are wrong, incomplete, or only mostly true?
Another useful question is where responsibility actually lives when something goes wrong.
- If a request fails halfway through a workflow, who is expected to notice?
- If data is duplicated, dropped, or delayed, who is responsible for reconciling it?
- If a dependency becomes unreliable but not fully unavailable, what behavior is preferred?
These questions don’t have universally correct answers. But if they’re never asked, the answers will be supplied by defaults, retries, and timeouts chosen in isolation.
It can also be helpful to think about failure before thinking about success.
- If this boundary fails, how will that failure surface?
- Will it be obvious, or silent?
- Will it preserve intent, or corrupt state? (a small contrast is sketched after this list)
- Will an end-user be able to intervene without understanding the entire system?
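A tiny, invented contrast for the intent-versus-corruption question: when a durable write fails, the version that refuses to report success preserves intent, while the version that swallows the error tells the rest of the system something that is no longer true.

```go
package orders

import (
	"context"
	"fmt"
)

// Store is a minimal interface invented for this sketch.
type Store interface {
	SaveOrder(ctx context.Context, id string) error
}

// ConfirmOrder preserves intent: if the write fails, the failure surfaces and
// the order is never reported as confirmed.
func ConfirmOrder(ctx context.Context, s Store, id string) error {
	if err := s.SaveOrder(ctx, id); err != nil {
		return fmt.Errorf("order %s not confirmed: %w", id, err)
	}
	return nil
}

// confirmOrderSilently corrupts state: the error is swallowed, the caller is
// told nothing went wrong, and downstream systems now disagree with the store.
func confirmOrderSilently(ctx context.Context, s Store, id string) {
	_ = s.SaveOrder(ctx, id) // failure ignored; intent is lost right here
}
```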
Designing failure paths often feels like pessimism, but in practice it’s an act of respect for the people who will eventually have to operate the system under pressure.
Another lens that has proven useful to me is to ask whether, and how often, responsibility is duplicated across a boundary.
- Are multiple components making the same decision independently?
- Are checks being performed in more than one place, and if so, are they consistent?
- If one side changes behavior, how quickly does the other side notice?
Duplication can increase resilience, but it can also hide drift. Over time, duplicated responsibility tends to diverge unless there is a clear reason for it to exist.
Finally, it’s worth asking which parts of the system are being protected by abstraction, and which parts are being hidden by it.
- Which details are intentionally removed to simplify reasoning?
- Which details are missing simply because they’re inconvenient to think about?
- If the abstraction breaks, will the system fail in a way that reveals what went wrong?
These questions don’t require deep infrastructure knowledge to ask. They require curiosity about how software behaves once it leaves the whiteboard and starts interacting with reality.
The goal isn’t to eliminate boundaries or abstractions. It’s to become fluent in what they cost, what they buy, and where they are most likely to surprise you.
Over time, this way of thinking tends to surface patterns that are specific to your systems and your domain. Those patterns become conventions. Those conventions become guardrails. And those guardrails tend to hold best when they were chosen deliberately, rather than inherited by accident.
You don’t need to answer all of these questions at once. But if you make a habit of asking them, the space between components starts to feel less mysterious, and failures start to feel less arbitrary.
So the next time you’re illustrating an architecture with boxes and you draw a line between components, it’s worth pausing for a moment and asking what that line actually represents, and how it will behave once it leaves the diagram and meets reality.
If you catch yourself doing that, it’s usually a sign that you’re reasoning about the system, not just the code.