A reference manual for self-healing distributed coordination.
Built & illustrated by Abhishek Aditya.
Ed. 0.1 · alpha · 2026
Every Kubernetes control loop, every Cassandra ring, every distributed lock rests on a coordination service. These services are battle-tested, yet operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and endless consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag the incidents that produced them.
Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners. Consensus protocols (Paxos, Raft, ZAB) rest on a few narrow correctness invariants and are acutely sensitive to non-determinism. A non-deterministic language model in the commit path breaks the guarantee that two replicas applying the same log entry compute the same state. That guarantee is the safety property.
The right place for an LLM agent is the control plane: observing, recommending, and documenting, never deciding. AEGIS is the open-source artifact that makes this argument concrete and testable.
The data plane is deterministic Raft: Apache Ratis on Java 25, exposed over gRPC, with leader-stamped wall-clock time so that TTL math is reproducible across replicas.
The control plane is an agentic Python sidecar (LangGraph plus the Anthropic SDK) that observes telemetry, recommends config changes as GitHub PRs, and drafts postmortems as GitHub Issues.
The two share a telemetry surface. They do not share a mutation surface. A human holds every merge bit.
An open loop says "anomaly → PR → hope it helped." AEGIS closes it without ever touching the data-plane invariant.
Before a PR is opened, a verifier replays the exact chaos trace in an ephemeral sandbox cluster under both the current and proposed config, and embeds the before/after delta in the PR. A static safety envelope rejects any patch that violates a consensus-safety constraint, so unsafe configs are structurally impossible to propose.
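A static envelope check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the M12 module: the field names, the three rules, and the 3x heartbeat margin are all assumptions chosen for the example, and none of them are AEGIS or Ratis settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftTuning:
    """Hypothetical tuning knobs; names are illustrative only."""
    heartbeat_interval_ms: int
    election_timeout_min_ms: int
    election_timeout_max_ms: int


# Assumed margin: followers should see several heartbeats before timing out.
MIN_HEARTBEATS_PER_TIMEOUT = 3


def envelope_violations(cfg: RaftTuning) -> list[str]:
    """Return every rule the proposed config breaks (empty list means OK)."""
    violations = []
    if cfg.heartbeat_interval_ms <= 0:
        violations.append("heartbeat interval must be positive")
    if cfg.election_timeout_min_ms > cfg.election_timeout_max_ms:
        violations.append("election timeout min exceeds max")
    if cfg.election_timeout_min_ms < MIN_HEARTBEATS_PER_TIMEOUT * cfg.heartbeat_interval_ms:
        violations.append("election timeout too close to heartbeat interval")
    return violations


def may_open_pr(cfg: RaftTuning) -> bool:
    """A patch is proposable only if it breaks no envelope rule."""
    return not envelope_violations(cfg)
```

Because the check is a pure function of the proposed values, it runs before any PR is drafted and gives the same verdict on every run.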
M11 counterfactual verify · M12 safety envelope · M13 RAG root-cause · the closed-loop modules, in build.
Five nodes, one elected leader, on-disk log + snapshot. The leader replicates an append-only log to its followers; a quorum acknowledgement commits each entry. Kill the leader and a new term elects a successor.
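The quorum rule can be made concrete with a small Python sketch. It is an illustration of the majority arithmetic only, not Apache Ratis code; the function names and the `match_index` list are hypothetical, and real Raft additionally requires that a committed entry belong to the leader's current term.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of the cluster."""
    return cluster_size // 2 + 1


def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on at least a quorum of nodes.

    match_index holds, per node (leader included), the last log index
    known to be replicated on that node.
    """
    ranked = sorted(match_index, reverse=True)
    # The quorum-th highest value is present on at least a quorum of nodes.
    return ranked[quorum_size(len(ranked)) - 1]
```

With five nodes the quorum is three, so for replication progress `[10, 9, 7, 4, 3]` the sketch reports index 7 as committed: three nodes hold it, while indexes 8 and above are on at most two.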
Every command carries leader-stamped wall-clock time in its proto envelope, so lease and TTL math is identical on every replica: determinism by construction.
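A minimal Python sketch of why the stamped time matters, assuming a hypothetical command shape (`key`, `owner`, `ttl_ms`, `leader_time_ms`) rather than the project's actual proto envelope: the state machine never reads its own clock, so replaying the same log always yields the same lease table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquireLease:
    """Hypothetical log command; the leader fills in leader_time_ms once."""
    key: str
    owner: str
    ttl_ms: int
    leader_time_ms: int


class LeaseTable:
    """Deterministic state machine: expiry uses only the stamped time."""

    def __init__(self) -> None:
        self._leases: dict[str, tuple[str, int]] = {}  # key -> (owner, expires_at_ms)

    def apply(self, cmd: AcquireLease) -> bool:
        held = self._leases.get(cmd.key)
        # A lease blocks other owners only while it is unexpired at the stamped time.
        if held is not None and held[0] != cmd.owner and held[1] > cmd.leader_time_ms:
            return False
        self._leases[cmd.key] = (cmd.owner, cmd.leader_time_ms + cmd.ttl_ms)
        return True

    def snapshot(self) -> dict[str, tuple[str, int]]:
        return dict(self._leases)


def replay(log: list[AcquireLease]) -> tuple[list[bool], dict[str, tuple[str, int]]]:
    """Apply a log to a fresh replica and return per-command results plus final state."""
    table = LeaseTable()
    results = [table.apply(cmd) for cmd in log]
    return results, table.snapshot()
```

If each replica called its local clock inside `apply`, two replicas with skewed clocks could disagree on whether a lease had expired; taking the timestamp from the log entry removes that source of divergence.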
Ten modules shipped (M1–M10, real code plus a 99-test agent suite); three in build to close the loop (M11–M13).
Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+. Build, verify, bring up a live 5-node cluster, then run the agent pipeline against it.
# 1 · clone
$ git clone https://github.com/abhishek-aditya/aegis && cd aegis

# 2 · build + test the Java reactor (Ratis · locks · KV · telemetry)
$ mvn -B verify

# 3 · bring up the 5-node cluster + observability stack
$ docker compose up --build
→ grafana   http://localhost:3000
→ dashboard http://localhost:4400

# 4 · run the control-plane agent pipeline (dry-run, no GitHub call)
$ cd agents && pip install -e ".[dev]"
$ aegis-classifier --once | aegis-proposer --dry-run
✓ 99 tests pass · 7/7 fixture traces classify correctly
The contribution is the architectural invariant, plus the open-source artifact that makes it concrete. Every config the agent proposes is logged with rationale, including the bad ones. Negative results are documented, not hidden.
Counterfactual verification, safety red-team, and root-cause accuracy benchmarks are coming soon.
Yes. The reasoning earns its keep where it helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with rationale and rollback, and writing readable postmortems from a tool-bounded view. The agent simply never gets to mutate consensus. That separation is the contribution.
Apache Ratis is battle-tested in Apache Ozone and IoTDB. The novelty is the agentic ops layer and the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.
A human closes the PR, and the cluster is unchanged. The safety envelope (M12) rejects unsafe patches before a PR is even opened. The agent's only mutation pathway is the review queue; bad proposals become logged evidence for the paper.
Determinism. The replay test that anchors M6 needs the same answer every run. The LLM genuinely earns its keep in diagnosis and postmortem narration (M13, M8): comparison, lesson extraction, and ranked root cause. That path is opt-in.