Operations runbook — DistMemory¶
This document is for operators running the pkg/backend.DistMemory
distributed backend in production. It assumes the design background in
distributed.md. Sections are deliberately short — each
one stands on its own and links to code.
At a glance¶
| Concern | First place to look |
|---|---|
| Node not receiving traffic | dist.members.alive, /health |
| Writes failing | dist.write.quorum_failures, sentinel.ErrDraining, sentinel.ErrQuorumFailed |
| Replicas falling behind | dist.hinted.queued, dist.hinted.replayed, dist.hinted.dropped |
| Bandwidth pressure | DistHTTPLimits.CompressionThreshold |
| Spurious peer flapping | dist.heartbeat.indirect_probe.refuted, WithDistIndirectProbes |
| Slow rebalance | dist.rebalance.throttle, dist.rebalance.last_ns |
| Anti-entropy backlog | dist.merkle.last_diff_ns, dist.auto_sync.last_ns |
Live metric values come from DistMemory.Metrics() (Go struct),
/dist/metrics (JSON, when wrapped in hypercache.HyperCache), or
the OpenTelemetry pipeline you wired via WithDistMeterProvider.
The OTel names use the dist. prefix.
Wiring observability¶
Three opt-in entry points, all defaulting to no-op:
- Logging — backend.WithDistLogger(*slog.Logger) routes background loops (heartbeat, hint replay, rebalance, merkle sync) and operational errors into your logger. Records are pre-bound with component=dist_memory and node_id=<id>.
- Tracing — backend.WithDistTracerProvider(trace.TracerProvider) opens spans on Get/Set/Remove plus per-peer dist.replicate.* child spans. Cache key values are never put on spans (they can be PII); only cache.key.length.
- Metrics — backend.WithDistMeterProvider(metric.MeterProvider) exposes every field on DistMetrics as an observable instrument.
Wire all three to the same otel.SetTracerProvider /
otel.SetMeterProvider your application uses; the logger inherits via
slog.Default() if you want a one-liner.
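A minimal wiring sketch is below. The three With* options are the entry points described above; the constructor name NewDistMemory, the package name, and the import path are assumptions for illustration, so check pkg/backend/dist_memory.go for the actual surface.

```go
package ops

import (
	"log/slog"

	"go.opentelemetry.io/otel"

	// Import path assumed for illustration; adjust to your module layout.
	"github.com/hyp3rd/hypercache/pkg/backend"
)

func newDistBackend() (*backend.DistMemory, error) {
	// NewDistMemory is a hypothetical constructor name; the options are real.
	return backend.NewDistMemory(
		backend.WithDistLogger(slog.Default()),                   // heartbeat, hint replay, rebalance, merkle sync logs
		backend.WithDistTracerProvider(otel.GetTracerProvider()), // Get/Set/Remove spans + dist.replicate.* children
		backend.WithDistMeterProvider(otel.GetMeterProvider()),   // DistMetrics fields as observable instruments
	)
}
```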
Failure mode — split-brain¶
Symptom. Two subsets of the cluster lose connectivity to each other. Each subset elects local primaries for the keys it owns. Writes from clients on subset A land on A-side primaries; writes from B-side clients land on B-side primaries. When the partition heals, the versions diverge.
Detection. dist.heartbeat.failure rises on both sides during the
partition. After healing, dist.version.conflicts increments as
anti-entropy reconciles.
Resolution. DistMemory uses last-write-wins by (version, origin)
ordering — the higher version wins, ties broken by origin string. This
is automatic. Anti-entropy via SyncWith (manual) or
WithDistMerkleAutoSync (background) closes the gap. There is no
manual reconciliation step today.
Mitigation. Run an odd number of nodes with quorum writes
(WithDistWriteConsistency(ConsistencyQuorum)); a partition that
isolates a minority leaves only the majority side accepting writes
because the minority cannot reach quorum. The minority returns
ErrQuorumFailed (sentinel.ErrQuorumFailed) on Set.
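When quorum writes are enabled, clients should treat the two sentinel errors differently. A sketch under the assumption that Set wraps these sentinels so errors.Is matches, and that the sentinel import path looks like the one below:

```go
package ops

import (
	"errors"

	"github.com/hyp3rd/hypercache/pkg/sentinel" // import path assumed for illustration
)

// classifyWriteError maps the documented sentinel errors to an operator action.
func classifyWriteError(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, sentinel.ErrQuorumFailed):
		return "minority side of a partition: retry against the majority or fail fast"
	case errors.Is(err, sentinel.ErrDraining):
		return "node is draining: drop it from the client pool and retry elsewhere"
	default:
		return "other transport or backend error"
	}
}
```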
Failure mode — hint queue overflow¶
Symptom. A peer is unreachable for a long time. Every replicated
write to that peer turns into a queued hint. Eventually the queue
hits WithDistHintMaxPerNode or WithDistHintMaxBytes and new hints
get dropped.
Detection. dist.hinted.bytes (gauge) climbs steadily.
dist.hinted.global_dropped increments when caps are exceeded.
dist.hinted.dropped (a different metric — replay errors) also rises
if the peer is reachable but rejecting writes (auth, schema mismatch).
Resolution.
- Restore the unreachable peer; the replay loop drains automatically (dist.hinted.replayed rises).
- If the peer is permanently gone, remove it from membership (DistMemory.RemovePeer(addr)); queued hints expire on the WithDistHintTTL timer.
- If hints are dropping faster than they replay, raise WithDistHintMaxPerNode / WithDistHintMaxBytes — but understand that the cap exists to bound process memory under sustained failure. Raising it without fixing the underlying peer just delays the bound.
Phase B note. Migration failures during rebalance now also funnel
through the hint queue (Phase B.2). A surge in dist.hinted.queued
during a rolling deploy is expected; it should drain as the new node
becomes reachable.
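A coarse pressure check helps you alert before the caps bite. The sketch below assumes Metrics() returns a DistMetrics value with the HintedBytes field mentioned in the capacity notes; the helper, its signature, and the import path are illustrative.

```go
package ops

import "github.com/hyp3rd/hypercache/pkg/backend" // import path assumed

// hintPressure reports queued-hint bytes as a fraction of the cap you
// passed to WithDistHintMaxBytes. Alert around 0.8 so you see pressure
// before dist.hinted.global_dropped starts incrementing.
func hintPressure(dm *backend.DistMemory, maxBytes int64) float64 {
	if maxBytes <= 0 {
		return 0
	}
	m := dm.Metrics() // point-in-time DistMetrics snapshot
	return float64(m.HintedBytes) / float64(maxBytes)
}
```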
Failure mode — rebalance under load¶
Symptom. Adding a node triggers a rebalance scan that migrates
keys to their new primary. Under sustained write load the migration
saturates and dist.rebalance.throttle increments — batches queue
behind the configured concurrency cap.
Detection. dist.rebalance.last_ns (gauge — last full scan
duration) climbs. dist.rebalance.throttle (counter) increments when
the concurrency limit blocks a batch dispatch. dist.rebalance.batches
should still climb steadily.
Resolution.
- Raise WithDistRebalanceMaxConcurrent (default 1) if CPU and network headroom allow.
- Lower WithDistRebalanceBatchSize (default 64) so individual batches finish faster and concurrency slots cycle more often — counter-intuitively, smaller batches sometimes win on throughput.
- Pause writes (drain a subset of clients via your LB) until the scan finishes. The dist backend has no built-in write throttling — that's the application's job.
Phase C note. Drain (POST /dist/drain) does not trigger an
expedited rebalance today; the next scheduled
WithDistRebalanceInterval tick does the work. If you need to force
a faster ownership transfer, call Stop after Drain to cancel
in-flight work and let restart-time rebalance handle migration.
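If you turn the first two knobs above, it happens at construction time. The option names and defaults come from this runbook; the constructor name, import path, and concrete values below are assumptions.

```go
package ops

import (
	"time"

	"github.com/hyp3rd/hypercache/pkg/backend" // import path assumed
)

func newRebalanceTuned() (*backend.DistMemory, error) {
	return backend.NewDistMemory( // hypothetical constructor name
		backend.WithDistRebalanceMaxConcurrent(4),         // default 1; raise only with CPU/network headroom
		backend.WithDistRebalanceBatchSize(32),            // default 64; smaller batches cycle concurrency slots faster
		backend.WithDistRebalanceInterval(30*time.Second), // value illustrative
	)
}
```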
Failure mode — replica loss¶
Symptom. A replica node dies hard (kernel panic, hardware
failure). Its keys still have other replicas (when replication >= 2),
but until membership notices, writes try to fan out to it and
silently retry via the hint queue.
Detection. dist.heartbeat.failure increments steadily for the
lost peer. After WithDistHeartbeat's deadAfter window, the peer
is pruned (dist.nodes.removed increments) and ring lookups stop
including it.
Resolution.
- Wait for the heartbeat to detect the dead peer. With default timing, this is on the order of seconds.
- Spin up a replacement node with the same membership (or let gossip discover it).
- The new node's rebalance scan pulls its assigned keys from surviving replicas via Merkle anti-entropy.
Indirect probes. WithDistIndirectProbes(k, timeout) filters
caller-side network blips that would otherwise mark a healthy peer
suspect. dist.heartbeat.indirect_probe.refuted rising indicates
indirect probes are saving you from spurious flapping; rising
dist.heartbeat.indirect_probe.failure indicates the peer is
genuinely unreachable from multiple vantage points.
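Failure-detection tuning also happens at construction time. WithDistIndirectProbes(k, timeout) matches the shape described above; the WithDistHeartbeat argument list (probe interval plus deadAfter) and the constructor name are assumptions, so verify against dist_memory.go before copying.

```go
package ops

import (
	"time"

	"github.com/hyp3rd/hypercache/pkg/backend" // import path assumed
)

func newFailureDetectionTuned() (*backend.DistMemory, error) {
	return backend.NewDistMemory( // hypothetical constructor name
		backend.WithDistHeartbeat(1*time.Second, 10*time.Second), // probe interval, deadAfter window (shape assumed)
		backend.WithDistIndirectProbes(3, 2*time.Second),         // ask 3 peers to refute a suspect before marking it dead
	)
}
```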
Operational tasks¶
Drain a node¶
Trigger a drain with POST /dist/drain. After drain:
- /health returns 503; load balancers should stop routing.
- New Set / Remove calls return sentinel.ErrDraining. Get continues to serve until the process exits.
Drain is one-way. Restart the process to clear it.
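A small helper for automation, using only the endpoints this runbook names (POST /dist/drain, /health returning 503 once draining). The function name and polling cadence are illustrative.

```go
package ops

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// drainNode posts to /dist/drain, then polls /health until the node reports
// 503, i.e. load balancers should have stopped routing to it.
func drainNode(ctx context.Context, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/dist/drain", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()

	for {
		hr, err := http.Get(baseURL + "/health")
		if err != nil {
			return err
		}
		hr.Body.Close()
		if hr.StatusCode == http.StatusServiceUnavailable {
			return nil // drained; Get still serves until the process exits
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("node still healthy after drain request: %w", ctx.Err())
		case <-time.After(time.Second):
		}
	}
}
```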
Inspect cluster state¶
# Membership snapshot.
curl http://node-A:8080/cluster/members
# Key enumeration (paginated, shard-by-shard since Phase C.2).
curl 'http://node-A:8080/internal/keys'
curl 'http://node-A:8080/internal/keys?cursor=1'
# ... follow next_cursor until empty.
Force anti-entropy sync¶
WithDistMerkleAutoSync(interval) runs the Merkle diff-and-repair on a timer; manual SyncWith calls are useful for debugging.
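A manual sync sketch. SyncWith is the entry point named in the split-brain section; its signature (context plus peer address) is an assumption here, so check pkg/backend/dist_memory.go before wiring this into tooling.

```go
package ops

import (
	"context"
	"log/slog"
	"time"

	"github.com/hyp3rd/hypercache/pkg/backend" // import path assumed
)

// forceSync runs one Merkle reconciliation pass against a single peer.
func forceSync(dm *backend.DistMemory, peerAddr string) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := dm.SyncWith(ctx, peerAddr); err != nil { // signature assumed
		slog.Error("manual anti-entropy sync failed", "peer", peerAddr, "err", err)
	}
}
```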
Capacity planning notes¶
- Each shard mutex is independent — write throughput scales with shard count up to CPU saturation.
- Hint queue memory is approximately HintedBytes + 64 bytes of bookkeeping per queued hint. Cap via WithDistHintMaxBytes to bound total process memory under partition.
- Merkle tree storage scales O(N/chunk) for N keys at WithDistMerkleChunkSize (default 128). For a million keys, the default chunk gives ~8K leaf hashes per node — negligible.
- Replication factor 3 with quorum reads/writes tolerates 1 failure; raise to 5 to tolerate 2 failures, at 5× the storage cost.
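The two formulas above translate to a back-of-envelope calculation; the helper names are illustrative.

```go
package ops

// hintQueueBytes: queued payload bytes plus ~64 bytes of bookkeeping per hint.
func hintQueueBytes(queuedHints, avgHintPayload int64) int64 {
	return queuedHints * (avgHintPayload + 64)
}

// merkleLeaves: N keys divided by the chunk size (default 128).
// 1_000_000 keys / 128 ≈ 7_813 leaves, i.e. the ~8K figure above.
func merkleLeaves(keys, chunkSize int64) int64 {
	if chunkSize <= 0 {
		chunkSize = 128
	}
	return (keys + chunkSize - 1) / chunkSize
}
```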
Where things are¶
| Concern | File |
|---|---|
| Public surface | pkg/backend/dist_memory.go |
| Transport interface | pkg/backend/dist_transport.go |
| HTTP transport | pkg/backend/dist_http_transport.go |
| HTTP server | pkg/backend/dist_http_server.go |
| Membership / ring | internal/cluster/ |