On Stopping: Writing Software That Runs Under Orchestration

Graceful shutdown and lifecycle awareness in containerized services

2/13/2026 · systems

I spend a lot of time thinking about what happens when a system stops.

Not in a dramatic sense. Most of the time nothing is on fire. A deploy rolls out. A process gets replaced. A node is drained. The system keeps going, and from the outside everything looks fine.

That’s the whole point.

Modern container-orchestrated CD workflows sit on top of distributed systems. Work is split across processes, machines, and network boundaries, and progress is made through handoffs rather than single, continuous execution. In this model, processes are expected to stop while work is still in progress, and other parts of the system are expected to continue.

None of this is especially unusual. If you’ve worked in systems like this, you’ve already seen it. What tends to matter is not that processes end, but that they rarely do so at a clean boundary.

A connection that was open a moment ago may be cut off mid-request. A database transaction may be left incomplete. A piece of work that already passed validation may never reach the point where it is committed or explicitly rolled back.

Some systems are designed to absorb that interruption cleanly. Others are not. In those cases, the interruption changes what happens next. Work is duplicated. State becomes inconsistent. A retry occurs without enough information to know whether it is safe.

The effects are rarely immediate. They tend to surface later, often far from where the process actually stopped, and usually in ways that are harder to trace back to the original decision.

And that’s where application code enters the picture.

Why container lifecycle matters inside application code

Containers are intentionally ephemeral. They are created, scheduled, rescheduled, restarted, and terminated as part of normal operation. This is not an edge case. It is the steady state.

From the application’s perspective, lifecycle events are not abstract platform behavior. They surface directly as process signals, routing changes, deadlines enforced externally, and resources disappearing underneath active work.

Treating lifecycle as someone else’s problem creates implicit assumptions. Processes exit only on fatal error. Connections live indefinitely. Work finishes because it started. These assumptions often hold during development and quietly fail once orchestration is involved.

Lifecycle is part of the runtime contract. Application code that acknowledges this tends to fail smaller, fail cleaner, and recover faster.

SIGTERM and SIGINT: What actually happens during shutdown

When a container is asked to stop, the process is not immediately killed. The orchestrator follows a sequence:

  1. Traffic is stopped or drained via readiness changes and endpoint removal
  2. A termination signal is sent, typically SIGTERM
  3. A grace period begins
  4. If the process has not exited, a SIGKILL is sent when the grace period expires

SIGINT is common during local development, while SIGTERM is the signal of record in production orchestration. Applications should usually treat them similarly.

The important detail is not the signal itself, but the implication.

You have time, but not unlimited time, to shut down intentionally.

If the process ignores the signal or blocks indefinitely, the environment will eventually force termination. That forced termination is where dropped requests, partial writes, and corrupted state tend to originate.

Kubernetes does not shut containers down instantly. When a Pod is terminated, the kubelet sends SIGTERM to each container's main process and waits for a configurable grace period before forcefully killing anything still running:

spec:
  terminationGracePeriodSeconds: 30

During this period:

  • The process is expected to begin shutdown on SIGTERM
  • In-flight work can continue
  • Readiness should change so that no new traffic is routed to the instance

If the process is still running when the grace period expires, Kubernetes sends SIGKILL, which cannot be handled or deferred.

Two failure modes are common here:

  • The grace period is shorter than the application’s real shutdown time, leading to forced termination mid-request or mid-transaction
  • The application never attempts to shut down at all, assuming it will be restarted cleanly

There is no universally correct value. Longer grace periods reduce the risk of data loss but slow down rollouts and failure recovery. Shorter periods improve responsiveness at the cost of tighter shutdown discipline.

The important point is alignment. Application shutdown behavior should be designed with an explicit time budget in mind, not an implicit hope that the process will finish in time.
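
One way to make that budget explicit is to derive it from the same number the Pod spec declares. A minimal sketch, assuming the 30-second grace period above; the shutdownBudget name and the amount of headroom are illustrative, not prescriptive:

// terminationGracePeriodSeconds is 30 in the spec above; budget a little
// less so cleanup can finish before SIGKILL arrives.
const shutdownBudget = 25 * time.Second

// On SIGTERM, bound the whole shutdown sequence by that budget.
shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownBudget)
defer cancel()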

Health checks: readiness vs liveness

Health checks are one of the main coordination points between application code and the scheduler.

Common patterns include:

  • Readiness probes, which control whether the platform routes traffic to the instance
  • Liveness probes, which control whether the platform restarts the process

Problems arise when these are conflated.

A liveness check that fails during transient overload can cause unnecessary restarts. A readiness check that never reflects shutdown intent can keep routing traffic to an instance that is already trying to exit.

A useful mental model: readiness is about whether this instance should be receiving work right now; liveness is about whether the process is healthy enough to keep running at all.

During shutdown, readiness should usually flip before the process stops accepting work. Liveness often remains true until the process exits.
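
A minimal sketch of that split in Go, assuming an atomic shuttingDown flag that gets set when the termination signal arrives; the /readyz and /healthz paths are conventions, not requirements:

var shuttingDown atomic.Bool // set to true once SIGTERM is received

// Readiness: stop advertising the instance as routable once shutdown begins.
http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
    if shuttingDown.Load() {
        http.Error(w, "shutting down", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

// Liveness: stays healthy until the process actually exits.
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})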

Misconfigured health checks do not just cause restarts. They can amplify failures by repeatedly removing and reintroducing instances at the worst possible time.

Graceful shutdown patterns

Graceful shutdown is not a single mechanism. It is a sequence of coordinated decisions.

1. Stop accepting new work

The first goal is to prevent new requests from entering the system once shutdown has begun.

This usually means:

  • Marking readiness as failed so the router stops sending traffic
  • Closing listeners or otherwise refusing new connections
  • Stopping consumption of new jobs or messages from queues

2. Drain in-flight requests

Work that has already started should usually be allowed to finish, within a bounded time.

This requires:

  • Tracking which requests or jobs are still in flight
  • Waiting for them to complete before releasing shared resources
  • Enforcing a deadline so draining cannot block shutdown indefinitely (see the worker sketch below)
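
For work that is not behind an HTTP server, a sync.WaitGroup gives the same shape: workers stop picking up new jobs once the context is cancelled, and shutdown waits for them within the deadline. A minimal sketch; the jobs channel, process function, workerCount, and the two contexts are placeholders:

var wg sync.WaitGroup

for i := 0; i < workerCount; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for {
            select {
            case <-ctx.Done():
                return // stop picking up new jobs once shutdown begins
            case job, ok := <-jobs:
                if !ok {
                    return // queue closed
                }
                process(job) // work already started is allowed to finish
            }
        }
    }()
}

// Wait for the workers, but never longer than the shutdown deadline.
done := make(chan struct{})
go func() {
    wg.Wait()
    close(done)
}()

select {
case <-done:
case <-shutdownCtx.Done():
    log.Println("drain deadline exceeded; exiting with work still in flight")
}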

3. Close database connections safely

Connection pools deserve explicit attention during shutdown.

Key concerns include:

  • Finishing or rolling back open transactions rather than abandoning them mid-flight
  • Returning borrowed connections to the pool before closing it
  • Closing the pool within the same shutdown deadline as everything else

Data integrity failures often surface here. Not because the database is unreliable, but because shutdown ignored transactional boundaries.
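
One way to keep shutdown and transactional boundaries aligned is to tie each transaction to a context, so cancellation rolls it back instead of leaving it half-applied. A minimal sketch using database/sql; the saveOrder function and its query are hypothetical:

func saveOrder(ctx context.Context, db *sql.DB, id string) error {
    // BeginTx ties the transaction to ctx: if ctx is cancelled,
    // the sql package rolls the transaction back rather than leaving it open.
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit has succeeded

    if _, err := tx.ExecContext(ctx,
        "UPDATE orders SET status = 'confirmed' WHERE id = $1", id); err != nil {
        return err
    }

    // The work is either committed here or not visible at all.
    return tx.Commit()
}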

4. Respect timeouts

Graceful shutdown exists within a deadline. Applications should cooperate with it.

This means:

  • Propagating the shutdown deadline, usually as a context, into every cleanup step
  • Bounding each step so a single slow dependency cannot consume the whole budget
  • Exiting deliberately when the deadline is reached instead of hanging

Hanging indefinitely is not safer than exiting. It simply defers failure to a less controlled point.
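
A common complement is a last-resort timer that forces the process out just before the external deadline would, so even a stuck cleanup path exits under the application's control. A sketch, reusing the illustrative shutdownBudget from earlier; the extra margin is an assumption:

// If graceful shutdown has not finished shortly after the budget,
// exit deliberately instead of waiting for SIGKILL.
forceExit := time.AfterFunc(shutdownBudget+2*time.Second, func() {
    log.Println("shutdown deadline exceeded; forcing exit")
    os.Exit(1)
})
defer forceExit.Stop()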

Minimal Go examples

The following examples illustrate progression, not completeness.

Handling SIGTERM with context.Context

ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

go func() {
    <-ctx.Done()
    log.Println("shutdown signal received")
    // trigger shutdown sequence
}()

This establishes a single cancellation source that can be shared across servers, workers, and background routines.

Graceful HTTP server shutdown

srv := &http.Server{
    Addr:    ":8080",
    Handler: handler,
}

go func() {
    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Fatalf("server error: %v", err)
    }
}()

<-ctx.Done()

shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("server shutdown incomplete: %v", err)
}

Shutdown stops accepting new connections and waits for in-flight requests to complete, bounded by the timeout.

Basic connection pool draining

func shutdownDB(ctx context.Context, db *sql.DB) error {
    done := make(chan error, 1)
    go func() {
        // Close prevents new queries and waits for in-progress ones to finish.
        done <- db.Close()
    }()

    select {
    case <-ctx.Done():
        return ctx.Err()
    case err := <-done:
        return err
    }
}

This pattern ensures database cleanup participates in the same shutdown deadline as the rest of the service.

Logging: good, better, best

Logging during shutdown is often the only record of what was happening in the system at the moment it stopped. The difference between “something happened” and “we understand what happened” is usually context.

Good

log.Println("shutting down")
2026/02/13 14:02:11 shutting down

This confirms that shutdown began, but it does not explain why, how urgently, or what the process intends to do next.

Better

log.Println("shutdown initiate: stopping new requests")
2026/02/13 14:02:11 shutown initiated: stopping new requests

This communicates intent. A reader can infer that readiness is changing and that the process is transitioning out of active service.

Best

log.WithFields(log.Fields{
    "signal":       "SIGTERM",
    "grace_period": "10s",
    "phase":        "drain",
}).Info("shutdown initiated")
time=2026-02-13T14:02:11Z level=info msg="shutdown initiated" signal=SIGTERM grace_period=10s phase=drain

This captures why shutdown started, what constraints apply, and which stage of shutdown is in progress.
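
The same structure is available in the standard library. A rough equivalent using log/slog (Go 1.21 and later), shown as an alternative rather than what the examples above use:

slog.Info("shutdown initiated",
    "signal", "SIGTERM",
    "grace_period", "10s",
    "phase", "drain",
)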

Shutdown logs are not about verbosity. They are about making intent explicit at the moment the system is most constrained.

Failure modes when shutdown is ignored

When lifecycle is not handled intentionally, failures tend to cluster:

  • Requests dropped mid-flight when the process is force-killed
  • Partial writes and transactions left neither committed nor rolled back
  • Duplicated work after retries that cannot tell whether the original attempt succeeded
  • Connections severed without cleanup, leaving peers to discover the failure through timeouts

These failures rarely show up in unit tests. They appear under load, during deploys, or when unrelated systems are already degraded.

A note on agentic systems and intent

Some modern systems use agent-style execution to coordinate multi-step behavior, even in domains that could be modeled deterministically. Whether this is an appropriate design choice is a separate question.

From a lifecycle perspective, the important detail is that agentic execution often carries implicit intent across time. The process may be reasoning, planning, or invoking tools over multiple steps, with intermediate state held in memory.

In some designs, restarting the process resets the interaction and begins a new execution. In others, the agent is expected to continue or resume work that has already affected external systems.

The distinction matters during shutdown. Terminating an agent mid-execution is not equivalent to cancelling a stateless request. Partial intent may already have been applied, and restarting does not necessarily restore the world to a known state.

This does not make agentic systems uniquely fragile. It places them in the same category as background workers, job processors, or transaction coordinators. The state may be ephemeral, but the effects are not.

Systems that adopt this execution model need to decide explicitly what guarantees exist around interruption, abandonment, and restart, especially when graceful shutdown deadlines are involved.
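
What that looks like varies, but one option is to checkpoint progress when the shutdown signal arrives, so a successor can tell "never started" from "partially applied". An entirely hypothetical sketch; none of these types or helpers come from a real framework:

// Hypothetical: Checkpoint, execute, and saveCheckpoint stand in for
// whatever the agent runtime actually provides.
type Checkpoint struct {
    RunID          string
    CompletedSteps []string
    PendingStep    string
}

func runSteps(ctx context.Context, cp *Checkpoint, steps []string) error {
    for _, step := range steps {
        select {
        case <-ctx.Done():
            // Record what has and has not been applied before exiting, so a
            // restart does not blindly repeat external effects.
            cp.PendingStep = step
            return saveCheckpoint(cp)
        default:
        }
        if err := execute(ctx, step); err != nil {
            return err
        }
        cp.CompletedSteps = append(cp.CompletedSteps, step)
    }
    return nil
}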

Blast radius and intent preservation

Graceful shutdown is fundamentally about limiting blast radius.

Intent preservation means:

  • Knowing, after a restart, whether a given piece of work completed, never started, or was interrupted partway (one technique is sketched below)
  • Leaving interrupted work in a state that is safe to retry or resume
  • Ensuring external effects are either applied, reversed, or at least visible
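
One concrete technique is to record intent before performing an external effect and mark it complete afterwards, so a restart can reconcile anything left pending. A sketch with an assumed intents table and Postgres-style placeholders:

// applyWithIntent is hypothetical: the table, columns, and effect callback
// are illustrations, not a prescribed schema.
func applyWithIntent(ctx context.Context, db *sql.DB, key string, effect func(context.Context) error) error {
    // Record intent first; a restart can see that this key was attempted.
    if _, err := db.ExecContext(ctx,
        "INSERT INTO intents (key, state) VALUES ($1, 'pending') ON CONFLICT (key) DO NOTHING", key); err != nil {
        return err
    }

    if err := effect(ctx); err != nil {
        return err
    }

    // Mark completion; anything still 'pending' after a restart needs reconciliation.
    _, err := db.ExecContext(ctx, "UPDATE intents SET state = 'done' WHERE key = $1", key)
    return err
}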

This does not require perfection. It requires acknowledging that shutdown is a normal execution path, not an exceptional one.

Designing for shutdown is designing for controlled loss.

Observability considerations

Lifecycle behavior should be visible.

Useful signals include:

  • Logs that mark when shutdown began, which phase it is in, and when it completed
  • Time spent draining relative to the configured grace period
  • How much work was still in flight when the process finally exited
  • Restarts triggered by failed liveness checks or by SIGKILL after the grace period

Observability is not just about performance. It is about understanding why the system behaved the way it did when it was under constraint.
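
Most of this takes very little machinery. A sketch that reuses names from the earlier examples (srv, db, shutdownDB, shutdownCtx, and the illustrative shutdownBudget) and records how long each phase took:

shutdownStart := time.Now()

logPhase := func(name string, fn func() error) {
    t := time.Now()
    err := fn()
    log.Printf("shutdown phase=%s duration=%s err=%v",
        name, time.Since(t).Round(time.Millisecond), err)
}

logPhase("drain_http", func() error { return srv.Shutdown(shutdownCtx) })
logPhase("close_db", func() error { return shutdownDB(shutdownCtx, db) })

log.Printf("shutdown complete total=%s budget=%s",
    time.Since(shutdownStart).Round(time.Millisecond), shutdownBudget)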

Closing perspective

Containerized environments make failure and replacement routine. Application code that assumes permanence will eventually collide with that reality.

Writing services that shut down correctly is not an infrastructure concern and not optional polish. It is part of writing correct software.

Graceful shutdown is not about avoiding failure. It is about choosing how failure happens.