I learned to program on a calculator
In middle school, we were required to buy TI-83 calculators for math class. Somewhere along the way, I discovered that they could be programmed.
The language was TI-BASIC, an adaptation of BASIC that even 25 years ago was considered a fossil. You had a small screen, a keypad that fought you at every step, and a very constrained model of computation. There was no filesystem. No processes. No operating system in any meaningful sense.
There was memory, though, all 24KB of it!
Specifically, there was a predefined set of named locations: A through Z, plus θ. If you wanted to store data, you put it in one of those using the strangely esoteric STO→ button (short for "store"), which showed up on screen as a → symbol.
The Disp function was my favorite. You could calculate something, call Disp, see the result, and the program would exit with a resounding Done. Stateless. Ephemeral. Clean.
That was my first mental model of data.
Data lived somewhere simple. You touched it. You saw it. And touching it was the point.
Nothing about that environment suggested that how you accessed data mattered. A read was a read. A write was a write. The rest was just syntax.
That model stuck with me longer than I realized.
Touching the data felt like the work
As I moved into more serious programming, the surface area grew, but the intuition stayed the same.
You read bytes. You parsed them. You transformed them. You wrote results somewhere else.
If something was slow, you assumed you were doing too much work with the data. Inefficient algorithms. Unnecessary allocations. Extra copies you could optimize away if you were careful enough.
Performance problems felt like local problems.
Then I learned about pointers.
The first time you really get references, it feels like you’ve discovered a trap door under the language.
Before that, everything is values. You pass things around, they get copied, you mutate something and it changes here but not there. It’s all concrete. You can almost see it.
Pointers break that.
Suddenly you can hold a thing that points to another thing. You can have two names for the same underlying data. You can pass something “by reference” and mutate it from somewhere else. You can build linked structures and avoid copies. You can make APIs that feel fast and elegant.
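A tiny sketch of that shift, in Go terms (illustrative only; the function is made up):

// Two names, one piece of data: a mutation through one name is visible through the other.
func aliasing() (int, int) {
    scores := []int{10, 20, 30}
    view := scores // not a copy of the data: both slices share one backing array
    view[0] = 99   // observable through scores as well

    n := 1
    p := &n // a pointer: a second name for n's storage
    *p = 2

    return scores[0], n // 99, 2
}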
It was the first time I felt like I had challenged the calculator model and won.
Oh. The data doesn’t have to move. The name can move.
And for a while, that felt like the whole secret.
I remember having that very specific kind of early confidence: not the bootcamp kind where you’re excited to learn, but the kind where you think you’ve finally uncovered the real rules. The hidden layer. The part that separates people who can code from people who understand what’s happening.
And it did open a door.
It just wasn’t the last door.
Because pointers taught me that “touching the data” wasn’t always necessary.
But they also quietly reinforced the next assumption: that the important boundary was inside the language.
As long as I understood references and allocation, I thought I understood the cost.
A read is not just a read
The first time that assumption broke for me, nothing crashed.
The system worked. It was correct. Throughput plateaued, CPU usage climbed, and profiling results pointed to places that felt… insulting. Time spent “just reading.” Time spent “just copying.”
It was tempting to dismiss this as noise, or to assume there was a trick I hadn’t learned yet. Some clever buffering strategy. Some concurrency pattern that would unlock the next tier of performance.
What I was missing was more fundamental.
In the calculator world, a read was a read. There was no operating system. No protection boundaries. No distinction between where data lived and who was allowed to touch it.
Pointers were my first hint that the location of data mattered.
But on a real machine, location isn’t just a matter of heap vs stack or “did this allocate.” It’s a matter of privilege boundaries, address spaces, and what the kernel has to do to let you see anything at all.
- A read from a socket is not the same thing as a read from a file.
- A read that copies data into your address space is not the same thing as a read that hands ownership elsewhere.
- A read that wakes the kernel, switches privilege levels, and disrupts the CPU’s execution context is not interchangeable with one that doesn’t.
They all look like “reading bytes” in code. They are not equivalent events in the system.
Once you see that, it becomes hard to unsee. And it forces a much less comfortable question:
What if the slow part isn’t what I’m doing with the data… but the fact that I insist on touching it at all?
Pointers taught me that data didn’t always have to move.
What they did not teach me was that seeing data has a cost even when it stays where it is.
That distinction didn’t click for a long time, mostly because nothing in everyday programming forces you to confront it directly. Languages work hard to make access feel uniform. You call a function. You read some bytes. You keep going.
The machine does not experience it that way.
On a real system, the act of reading data often begins with an interruption.
Not a metaphorical one. A literal one.
When user code asks the operating system to do something on its behalf, the CPU has to stop what it is doing, save its current state, and cross a protection boundary. Registers are written out. Execution context is swapped. Privilege levels change. The instruction pipeline is disrupted so completely that the processor has to rebuild its momentum from scratch.
This happens before a single byte of your payload is touched.
If the data lives somewhere the kernel controls (a socket buffer, the page cache, a device), the CPU cannot simply reach out and grab it. It has to ask. And asking is expensive in ways that don’t show up as obvious bugs or incorrect results.
One of the more subtle costs shows up in the places programmers rarely look.
Modern CPUs rely heavily on cached translations between virtual and physical memory. Those translations live in the TLB (translation lookaside buffer), a small, extremely fast structure that assumes execution stays within a relatively stable context. When that context changes abruptly, some of that cached knowledge becomes invalid.
The processor loses its bearings.
Addresses it resolved a moment ago now require extra work. Page tables have to be consulted. Memory that used to be quick to reach suddenly isn’t. The system still works, but it does so with more hesitation.
None of this means you should avoid system calls. That would be absurd. It means that awareness itself carries a tax.
Every time your program insists on being involved in the movement of data, it asks the machine to stop, switch gears, and pay that tax again.
And that tax is not just time.
When you pull bytes into user space, you are competing for the most constrained resource in the system: the CPU caches.
Those caches are supposed to hold your program’s hot instructions and working data: the loops, counters, and state that let the code make forward progress. When large volumes of payload data stream through user space, they don’t politely wait their turn. They evict.
Touching the data doesn’t just cost cycles. It pushes your application’s hot data and instructions out of the cache hierarchy to make room for bytes you’re just going to throw away.
CPU caches are the most contended shared resource in the system, and streaming payload data through them is one of the fastest ways to lose the very locality your program depends on.
Zero-copy as restraint
The first time I encountered zero-copy techniques, they were presented as optimizations.
Faster networking. Higher throughput. Fewer allocations.
All of that is true, but it undersells what is actually happening.
Zero-copy is not about cleverness. It is about restraint.
Instead of pulling data into your own address space so you can look at it, you tell the operating system what should be connected to what, and then you step aside. Data moves without ever becoming yours.
This is a fundamentally different posture.
You are no longer transforming data. You are delegating its movement.
A baseline: the obvious loop
Most of us start with some version of this. It’s not wrong. It’s just involved.
// io.Copy is the canonical version of this.
// The structure matters: read into user memory, then write out.
func proxyCopy(dst io.Writer, src io.Reader) error {
    buf := make([]byte, 32*1024)
    for {
        n, rerr := src.Read(buf)
        if n > 0 {
            if _, werr := dst.Write(buf[:n]); werr != nil {
                return werr
            }
        }
        if rerr != nil {
            if errors.Is(rerr, io.EOF) {
                return nil
            }
            return rerr
        }
    }
}
That buffer lives in your address space. Every chunk of data crosses into user space, touches your caches, and becomes your problem.
Sometimes that’s exactly what you want because you need to inspect, redact, compress, encrypt, validate, or log.
And sometimes… you don’t.
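As an aside: the standard library will happily write that loop for you and let you bring your own buffer. A sketch (proxyCopyStd is a made-up name):

// io.CopyBuffer is the stdlib spelling of the loop above, with a caller-supplied buffer.
func proxyCopyStd(dst io.Writer, src io.Reader) error {
    buf := make([]byte, 32*1024)
    _, err := io.CopyBuffer(dst, src, buf)
    return err
}

The caveat is the interesting part: if dst implements io.ReaderFrom or src implements io.WriterTo, CopyBuffer ignores your buffer entirely and lets the endpoints negotiate a better path. That better path is where this is going.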
sendfile: file → socket without copying into user space
If your source is a file and your sink is a socket, Linux can do the handoff directly:
// sendFile delegates the copy from file -> socket to the kernel.
// It avoids pulling the payload into user space.
func sendFile(outFd int, inFile *os.File) (int64, error) {
    // Note: sendfile arguments are (out, in, offset, count).
    // Using a nil offset means the file's own offset is used and advanced.
    var sent int64
    for {
        n, err := unix.Sendfile(outFd, int(inFile.Fd()), nil, 1<<20) // 1MiB chunks
        if n > 0 {
            sent += int64(n)
        }
        if err == nil {
            if n == 0 {
                // EOF: sendfile reports end of input as n == 0 with a nil error.
                return sent, nil
            }
            continue
        }
        if errors.Is(err, unix.EINTR) {
            continue
        }
        if errors.Is(err, unix.EAGAIN) {
            // Non-blocking socket: caller should wait for writable and retry.
            return sent, err
        }
        return sent, err
    }
}
This is one of those moments where code looks almost boring, and that’s the point. You’re not “processing” bytes anymore. You’re specifying a path.
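In everyday Go you rarely have to call sendfile yourself. On Linux, *net.TCPConn implements io.ReaderFrom, and its ReadFrom takes the sendfile path when the source is a plain *os.File, so the delegation can be as boring as this (a sketch, assuming a TCP connection and a regular file):

// io.Copy lets the destination pull data itself via io.ReaderFrom;
// on Linux a *net.TCPConn will use sendfile(2) when src is an *os.File.
func serveFile(conn *net.TCPConn, f *os.File) error {
    _, err := io.Copy(conn, f) // the payload never enters this process's buffers
    return err
}

You opt into the kernel path mostly by not getting in its way.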
splice: plumbing between file descriptors
splice is more general. It uses a pipe as an in-kernel handoff point.
That pipe is not a byte buffer in the usual sense.
It is a circular buffer of page references.
The kernel is not copying your payload into the pipe. It is enqueueing pointers to physical memory pages into a ring buffer that describes ownership and ordering. The bytes themselves stay where they already live.
The key detail is that the pipe is not a “user buffer.” It’s a kernel mechanism for moving references to pages around.
// spliceLoop moves data from srcFd -> pipe -> dstFd without copying into user space.
// This is Linux-specific and intentionally low-level.
func spliceLoop(dstFd, srcFd int) error {
    // Create an in-kernel pipe to act as the handoff point.
    var p [2]int
    if err := unix.Pipe2(p[:], unix.O_CLOEXEC); err != nil {
        return err
    }
    defer unix.Close(p[0])
    defer unix.Close(p[1])

    const chunk = 1 << 20 // 1MiB
    for {
        // 1) splice src -> pipe
        // SPLICE_F_MOVE: a hint to the kernel that we want to move pages, not copy them.
        n, err := unix.Splice(srcFd, nil, p[1], nil, chunk, unix.SPLICE_F_MOVE)
        if n > 0 {
            // 2) splice pipe -> dst, draining everything we just queued.
            left := n
            for left > 0 {
                m, werr := unix.Splice(p[0], nil, dstFd, nil, int(left), unix.SPLICE_F_MOVE)
                if m > 0 {
                    left -= m
                    continue
                }
                if werr != nil {
                    if errors.Is(werr, unix.EINTR) {
                        continue
                    }
                    return werr
                }
            }
        }
        if err != nil {
            if errors.Is(err, unix.EINTR) {
                continue
            }
            if errors.Is(err, unix.EAGAIN) {
                // Non-blocking descriptor: caller should poll and retry.
                return err
            }
            return err
        }
        if n == 0 {
            // EOF: splice reports end of input as n == 0 with a nil error.
            return nil
        }
    }
}
This is the first time you feel the posture shift.
You’re no longer thinking “what do I do with these bytes?”
You’re thinking “how do I avoid making them my problem?”
But the real cost shows up elsewhere.
When you choose not to look at the data, you give something up.
The price of blindness
Delegation comes with blindness.
Once data moves through the system without entering your address space, you cannot see it. You cannot log it. You cannot redact it. You cannot enforce policies that depend on inspecting payloads.
In regulated or security-sensitive environments, this matters immediately. Masking, auditing, validation, all of these require awareness. You cannot partially inspect a stream you never touch.
This is why zero-copy is not a universal best practice. It is incompatible with entire classes of logic.
The decision is not “fast or slow.” It is “control or momentum.”
In practice, most systems end up with a mix. Some paths are designed for inspection and correctness. Others are designed for throughput and minimal interference. Knowing which is which is an architectural decision, not a micro-optimization.
Doing the same thing to your own heap
At some point, the same idea shows up closer to home.
Even when data lives entirely in user space, touching it repeatedly has a cost. Allocation creates work for the garbage collector. Poor locality causes cores to fight over cache lines. Memory that could have stayed warm gets churned out and pulled back in.
Pooling memory changes the relationship.
Instead of constantly allocating and discarding buffers, you keep a small working set alive. Memory stays mapped. Cache lines stay relevant. The processor sees familiar patterns instead of a stream of novelties.
A practical pool: keeping a hot set of buffers
var bufPool = sync.Pool{
    New: func() any {
        // Pick a size that matches your hot path.
        b := make([]byte, 32*1024)
        return &b
    },
}

func pooledCopy(dst io.Writer, src io.Reader) error {
    bp := bufPool.Get().(*[]byte)
    buf := *bp
    defer func() {
        // Optional: zero out if buffers can hold sensitive data.
        // for i := range buf { buf[i] = 0 }
        bufPool.Put(bp)
    }()
    for {
        n, rerr := src.Read(buf)
        if n > 0 {
            if _, werr := dst.Write(buf[:n]); werr != nil {
                return werr
            }
        }
        if rerr != nil {
            if errors.Is(rerr, io.EOF) {
                return nil
            }
            return rerr
        }
    }
}
This is not glamorous, and it’s not magic.
It’s just a way of saying: these allocations are not part of my product.
The runtime is free to drop pooled objects under memory pressure — sync.Pool is a cache, not a contract — but under steady throughput it tends to keep a warm working set around.
And that warm working set is the point.
Under the hood, Go goes to surprising lengths to make this safe and fast. Pools are sharded so that cores mostly interact with their own local state.
Just as importantly, the runtime uses careful cache alignment and padding so that different processors are not fighting over the same 64‑byte cache line.
This avoids false sharing — a pathological case where two cores modify unrelated variables that happen to live on the same cache line, forcing constant invalidation traffic even though the data itself is independent.
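You can borrow the same trick in your own code. A minimal sketch, assuming a 64-byte cache line (typical on x86-64) and illustrative type names; it relies on sync/atomic:

// Per-shard counters spaced a full cache line apart so neighbouring shards do not contend.
type paddedCounter struct {
    n uint64
    _ [56]byte // pad the struct out to 64 bytes (line size is an assumption)
}

type shardedCounter struct {
    shards [16]paddedCounter // e.g. one per worker goroutine
}

func (c *shardedCounter) add(shard int, delta uint64) {
    atomic.AddUint64(&c.shards[shard%len(c.shards)].n, delta)
}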
The effect is the same as zero-copy, just applied inward instead of outward.
You are not trying to be clever with memory. You are trying not to disturb it unnecessarily.
When delegation collides with lifecycle
There is a moment where this approach stops feeling elegant and starts feeling dangerous.
Delegated work continues even when your code is not running.
If a goroutine blocks while the kernel moves data on its behalf, your program may be alive without being aware. Signals can arrive. Shutdown can begin. From the outside, the process looks healthy.
Inside, nothing is observing intent.
This is where lifecycle discipline matters.
If you want to shut down cleanly, you need a way to periodically regain control. Time bounds. Deadlines. Checkpoints where execution returns to user space and intent can be evaluated.
The problem: blocked in the kernel
In Go, graceful shutdown is often modeled as a context.Context that gets cancelled on SIGTERM.
That works beautifully as long as your goroutines are actually running.
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// Somewhere else:
select {
case <-ctx.Done():
    // begin shutdown
}
But if your hot path is sitting inside a syscall — read, splice, sendfile, anything that can block — you don’t get to run that select until the kernel returns.
So the question becomes: how do you force the kernel to return often enough that “intent” can be observed?
Deadlines: turning a blind loop into periodic checkpoints
For network connections, Go gives you a very practical lever: deadlines.
A deadline is a time budget you attach to the connection. If a call would block longer than that budget, the runtime returns control to you with a timeout error.
That means your program periodically wakes up, checks whether shutdown has been requested, and either continues or exits.
// readLoopWithDeadline periodically returns to user space so ctx cancellation is observable.
func readLoopWithDeadline(ctx context.Context, c net.Conn) error {
    buf := make([]byte, 4096) // reuse one buffer instead of allocating per iteration
    for {
        // Force a return every few seconds even if no data arrives.
        _ = c.SetReadDeadline(time.Now().Add(2 * time.Second))
        _, err := c.Read(buf)
        if err != nil {
            // Timeout is expected: it is our checkpoint.
            if ne, ok := err.(net.Error); ok && ne.Timeout() {
                select {
                case <-ctx.Done():
                    return ctx.Err()
                default:
                    continue
                }
            }
            return err
        }
        // ... process bytes or forward them ...
    }
}
There’s a larger lesson here: deadlines are a control plane.
They are how you reassert intent in a system that is otherwise moving forward without you.
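They also work in reverse: you don't have to wait for the next checkpoint. A sketch, assuming a net.Conn (the helper name is made up): a watcher goroutine expires the deadline the moment the context is cancelled, and any blocked Read or Write returns immediately with a timeout error.

// cancelOnDone forces pending Reads/Writes on c to fail fast once ctx is cancelled.
// net.Conn deadlines may be set concurrently with in-flight calls, which is what makes this safe.
func cancelOnDone(ctx context.Context, c net.Conn) (stop func()) {
    done := make(chan struct{})
    go func() {
        select {
        case <-ctx.Done():
            _ = c.SetDeadline(time.Now()) // already expired: pending I/O returns with a timeout error
        case <-done:
        }
    }()
    return func() { close(done) }
}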
Draining and exit discipline
If you use a pipe-based zero-copy path (like splice), shutdown has one more sharp edge.
Data can be “in flight” inside the kernel — sitting in a pipe buffer that your code never sees — right when you decide to exit.
A clean shutdown needs a final discipline:
- stop producers
- stop accepting new work
- drain what’s already in the plumbing
- then exit
It’s the same principle as graceful HTTP shutdown. You don’t stop the world in the middle of a request.
Zero-copy just makes it easier to forget that requests can exist in places you don’t have visibility.
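Concretely, the tail end of that sequence might look something like this (a sketch; the names and the timeout are illustrative):

// drainAndExit follows the order above: producers are assumed to have been told
// to stop already (e.g. via a cancelled context); this stops new work, drains, then exits.
func drainAndExit(ln net.Listener, drained <-chan struct{}, timeout time.Duration) error {
    _ = ln.Close() // stop accepting new work
    select {
    case <-drained: // closed once every copy/splice loop has returned and its pipe is empty
        return nil
    case <-time.After(timeout):
        return errors.New("shutdown: drain timed out") // exit anyway, but loudly
    }
}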
Delegation does not absolve responsibility. It shifts where responsibility must be reasserted.
Closing perspective
Learning not to touch data was not about becoming faster.
It was about learning when involvement adds value, and when it only adds friction.
Early on, touching the data was the work. Then references taught me that names could move instead of bytes. Later still, systems work taught me that sometimes even awareness is too expensive.
The fastest systems I’ve worked on were not the ones with the most insight. They were the ones that knew when insight was unnecessary.
That kind of restraint is not obvious, and it is not taught early. It comes from watching systems behave under load, under shutdown, and under failure, and noticing that the quietest paths are often the most reliable.
The fastest way to process data is to never look at it.
The hard part is knowing when you still need to.