Coordination, self-healed

AEGIS

Raft-grade locks, leases & config with an LLM operations agent in the control plane — never the data plane.

Built & illustrated by Abhishek Aditya.

Ed. 0.1 · alpha · 2026

Chaos in, postmortems out. The LLM lives in the control plane, never the data plane.

Read the paper (PDF) → Quickstart → Benchmark results ↓

§ 01 · The Problem

Coordination is the load-bearing wall of the cloud.

Chubby · ZooKeeper
etcd · Consul
and the on-call rota behind them.

Every Kubernetes control loop, every HBase cluster, every distributed lock rests on a coordination service. These services are battle-tested, yet operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and endless consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag the incidents that produced them.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners. Consensus protocols (Paxos, Raft, ZAB) rest on a few narrow correctness invariants and are acutely sensitive to non-determinism. A non-deterministic language model in the commit path breaks the guarantee that two replicas applying the same log entry compute the same state. That guarantee is the safety property itself.
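The determinism argument can be made concrete with a toy replicated state machine. This is a minimal sketch, not AEGIS code: `apply_log`, `deterministic`, and `make_sampler` are hypothetical names, and the seeded random generator only stands in for a sampling model.

```python
import random

def apply_log(log, decide):
    """Apply every entry in order; `decide` turns an entry's value into stored state."""
    state = {}
    for key, value in log:
        state[key] = decide(value)
    return state

def deterministic(value):
    # Pure function of the log entry: every replica computes the same result.
    return value.upper()

def make_sampler(seed):
    # Stand-in for a sampling model: each replica draws from its own random stream.
    rng = random.Random(seed)
    return lambda value: f"{value}-{rng.randrange(10**9)}"

log = [("lock/a", "held"), ("cfg/timeout", "150ms")]

# Same log, deterministic apply: the replicas agree.
replica_1 = apply_log(log, deterministic)
replica_2 = apply_log(log, deterministic)
same_state = replica_1 == replica_2

# Same log, sampled apply: the replicas drift apart.
noisy_1 = apply_log(log, make_sampler(seed=1))
noisy_2 = apply_log(log, make_sampler(seed=2))
diverged = noisy_1 != noisy_2

print(same_state, diverged)
```

Nothing in the log changed between the two runs; only the apply function did. That is why the model has to stay out of the commit path.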

The right place for an LLM agent is the control plane: observing, recommending, and documenting, never deciding. AEGIS is the open-source artifact that makes this argument concrete and testable.

§ 02 · How it works

Two planes.
One invariant.

The data plane is deterministic Raft: Apache Ratis on Java 25, exposed over gRPC, with leader-stamped wall-clock time so that TTL math is reproducible across replicas.

The control plane is an agentic Python sidecar (LangGraph, OpenRouter-backed) that observes telemetry, recommends config changes as GitHub PRs, and drafts postmortems as GitHub Issues.

The two share a telemetry surface. They do not share a mutation surface. A human holds every merge bit.

Data plane → Apache Ratis · Java 25
Control plane → Python · LangGraph · OpenRouter
Mutation path → GitHub PRs + Issues
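One way to picture the shared telemetry surface and the separate mutation surface is a sketch in which the agent receives a read-only view plus a review queue, and nothing else. Every name here (`Proposal`, `ReviewQueue`, `agent_step`, the `follower_lag_ms` field, the 500 ms threshold) is an illustrative assumption, not the AEGIS API.

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class Proposal:
    title: str
    config_patch: dict
    rationale: str

class ReviewQueue:
    """Hypothetical stand-in for the GitHub PR queue: proposals wait for a human."""
    def __init__(self):
        self.pending = []

    def submit(self, proposal):
        self.pending.append(proposal)

def agent_step(telemetry, queue):
    # The agent is handed a read-only view and a queue. It holds no reference
    # to the cluster config, so the only thing it can do is file a proposal.
    if telemetry["follower_lag_ms"] > 500:
        queue.submit(Proposal(
            title="Raise election timeout",
            config_patch={"election_timeout_min_ms": 800},
            rationale="Follower lag exceeded 500 ms in the observed window.",
        ))

cluster_config = {"heartbeat_ms": 100}                    # data-plane state
telemetry = MappingProxyType({"follower_lag_ms": 900})    # read-only view
queue = ReviewQueue()

agent_step(telemetry, queue)
print(len(queue.pending), cluster_config)
```

The proposal sits in the queue and the config is untouched until a human merges it.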

[ The two planes ]

[ The safe closed loop ]

§ 03 · The closed loop

Propose, and prove.

An open loop says "anomaly → PR → hope it helped." AEGIS closes it without ever touching the data-plane invariant.

The safety envelope ships today: a static check rejects any patch that violates a consensus-safety constraint before a PR is opened, so unsafe configs are structurally impossible to propose. The counterfactual verifier is the planned next step: it will replay the exact chaos trace in an ephemeral sandbox cluster under both the current and proposed config, and embed the before/after delta in the PR.

Shipped → safety envelope & RAG root-cause · On the roadmap → counterfactual sandbox verification.
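A static envelope of this kind can be sketched as a pure function over the candidate config. The three rules below are illustrative assumptions only; this page does not list the constraints AEGIS actually enforces, and the parameter names are hypothetical.

```python
def check_envelope(config):
    """Return a list of violations; an empty list means the patch may be proposed."""
    violations = []
    heartbeat = config["heartbeat_ms"]
    timeout_min = config["election_timeout_min_ms"]
    timeout_max = config["election_timeout_max_ms"]
    if heartbeat <= 0:
        violations.append("heartbeat_ms must be positive")
    if timeout_min >= timeout_max:
        violations.append("election timeout range must be non-empty")
    if timeout_min < 3 * heartbeat:
        # Illustrative margin: followers should see several heartbeats
        # before they are allowed to start an election.
        violations.append("election timeout must be at least 3x the heartbeat")
    return violations

def propose(current, patch):
    # Check the merged config, not the patch alone: a patch can be unsafe
    # only in combination with values it leaves unchanged.
    candidate = {**current, **patch}
    violations = check_envelope(candidate)
    return (None, violations) if violations else (candidate, [])

current = {"heartbeat_ms": 100,
           "election_timeout_min_ms": 500,
           "election_timeout_max_ms": 1000}

safe, _ = propose(current, {"election_timeout_min_ms": 400})
unsafe, reasons = propose(current, {"heartbeat_ms": 300})
print(safe is not None, unsafe, reasons)
```

Because the check runs before anything is filed, a rejected patch never reaches the review queue at all.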

§ 04 · Consensus mechanics

Raft, leader-elected.

Five nodes, one elected leader, on-disk log + snapshot. The leader replicates an append-only log to its followers; an entry commits once a quorum (three of the five nodes) has acknowledged it. Kill the leader and the survivors elect a successor in a new term.
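The quorum rule reduces to a small calculation. This is a simplified sketch of standard Raft commit-index logic, not Ratis code: the node names and indexes are made up, and it omits Raft's extra condition that a leader only commits entries from its own term by counting replicas.

```python
def majority(cluster_size):
    # Smallest number of nodes that forms a quorum.
    return cluster_size // 2 + 1

def commit_index(match_index):
    """Highest log index stored on a majority of nodes (leader included)."""
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[majority(len(ranked)) - 1]

# Highest log index known to be replicated on each node.
match_index = {"n1": 12, "n2": 12, "n3": 11, "n4": 9, "n5": 4}

print(majority(5))                 # 3
print(commit_index(match_index))   # 11
```

Index 11 is the highest entry held by three nodes, so it commits even though two followers are behind.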

Every command carries leader-stamped wall-clock time in its proto envelope, so lease and TTL math is identical on every replica: determinism by construction.
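A minimal sketch of why the stamp matters, assuming a hypothetical `Command` envelope and `LeaseTable` state machine (neither is the AEGIS proto or class): the apply path reads only the timestamp carried in the command and never the replica's local clock.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    op: str               # "grant" or "sweep" (hypothetical operation names)
    leader_time_ms: int   # stamped once by the leader, replicated in the log
    lease_id: str = ""
    ttl_ms: int = 0

class LeaseTable:
    def __init__(self):
        self.expiry = {}

    def apply(self, cmd):
        # Only cmd.leader_time_ms is consulted, so skewed local clocks
        # cannot change the outcome.
        if cmd.op == "grant":
            self.expiry[cmd.lease_id] = cmd.leader_time_ms + cmd.ttl_ms
        elif cmd.op == "sweep":
            self.expiry = {lease: at for lease, at in self.expiry.items()
                           if at > cmd.leader_time_ms}

log = [
    Command("grant", leader_time_ms=1000, lease_id="a", ttl_ms=5000),
    Command("grant", leader_time_ms=2000, lease_id="b", ttl_ms=1000),
    Command("sweep", leader_time_ms=4000),
]

replicas = [LeaseTable(), LeaseTable()]
for replica in replicas:
    for cmd in log:
        replica.apply(cmd)

print(replicas[0].expiry, replicas[0].expiry == replicas[1].expiry)
```

Lease `b` expires at 3000 and is swept at 4000 on both replicas; lease `a` survives on both.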

Wrapping Apache Ratis 3.x
Production-proven in Apache Ozone & IoTDB.

[ Raft leader election ]

§ 05 · Capabilities

What AEGIS provides.

A complete coordination service plus an operations agent confined to the control plane — backed by a 99-test agent suite and a reproducible benchmark.

Data plane · deterministic

▪ Raft consensus core (Apache Ratis · 5 nodes)
▪ Distributed locks + session leases with fencing tokens
▪ Versioned key–value config store with watches
▪ Mirror-image Java + Python client SDKs
▪ Telemetry pipeline (Prometheus + Redis Streams)
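Fencing tokens from the list above follow a well-known pattern: each lock grant carries a strictly increasing number, and the protected resource refuses writes that carry an older one. The sketch below illustrates the pattern in general; `LockService` and `FencedStore` are hypothetical and do not mirror the AEGIS SDK.

```python
class LockService:
    """Issues a strictly increasing token with every grant."""
    def __init__(self):
        self._token = 0
        self.holder = None

    def acquire(self, client):
        # Simplification: granting to a new client models the previous
        # holder's lease having expired.
        self._token += 1
        self.holder = client
        return self._token

class FencedStore:
    """Protected resource that remembers the highest token it has seen."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False   # stale holder, write refused
        self.highest_token = token
        self.data[key] = value
        return True

locks, store = LockService(), FencedStore()

token_a = locks.acquire("client-a")   # A gets the lock, then stalls
token_b = locks.acquire("client-b")   # A's lease lapses, B takes over

accepted_b = store.write(token_b, "leader", "client-b")
accepted_a = store.write(token_a, "leader", "client-a")   # A wakes up late

print(accepted_b, accepted_a, store.data)
```

The late write from the stalled client is rejected, so a paused process cannot overwrite the current holder's state.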

Control plane · agentic

▪ Deterministic anomaly classifier
▪ Safety-enveloped config proposer (GitHub PRs)
▪ LLM-with-retrieval root-cause diagnosis
▪ Tool-using postmortem drafter (GitHub Issues)
▪ Chaos harness + live operator dashboard

On the roadmap → counterfactual sandbox verification that embeds an empirical before/after proof in every proposed config change.

§ 06 · Quickstart

One script to a cluster.

Docker + Compose v2; Python 3.10+ for the agents. The aegis-*.sh scripts bring up a live 5-node cluster with observability and the dashboard, run the test suites, and tear it all down. ./aegis.sh is the interactive launcher.

Open the repo ↗

aegis · zsh · ~/code/AEGIS

# 1 · clone
$ git clone https://github.com/Abhishek-Aditya-bs/Aegis && cd Aegis

# 2 · bring up a 5-node cluster + observability + dashboard
$ ./aegis-up.sh --cluster locks   # or: --cluster kv
  → dashboard  http://localhost:4400
  → grafana    http://localhost:3000  (admin/admin)
  → prometheus http://localhost:9090

# 3 · run the test suites (offline, $0)
$ ./aegis-test.sh --fast   # agents + benchmark + chaos
  ✓ 99 agent tests pass · benchmark baseline 33/33

# 4 · score ConsensusOps-Bench with a real model (~$0.01)
$ agents/.venv/bin/python benchmark/run.py --diagnoser rag \
    --provider openrouter --model google/gemini-3.1-flash-lite
  ✓ classification 1.000 · root-cause top-1 0.970 · top-3 1.000

# tear down (add --volumes to wipe Raft state)
$ ./aegis-down.sh --cluster all

interactive launcher → ./aegis.sh · chaos → chaos/slow-follower.sh · tail -f chaos/events.jsonl

§ 07 · Evaluation

Results, honestly.

AEGIS ships ConsensusOps-Bench — 33 labelled Raft incidents scoring anomaly classification and ranked root-cause. The deterministic baseline is already near-perfect, and a one-cent LLM-with-retrieval run on a cheap Gemini Flash model matches it exactly. The determinism lives in the free classifier; the metered model is confined to advisory diagnosis.

Full run · 33 incidents

model · google/gemini-3.1-flash-lite

wall-clock · 51.4 s (~1.6 s / incident)

API spend · $0.0088 (under one cent)

ConsensusOps-Bench · scorecard

Metric                    Baseline · $0    LLM · $0.01
Anomaly classification    1.000            1.000
Root-cause · top-1        0.970            0.970
Root-cause · top-3        1.000            1.000

Perfect classification across all six anomaly classes. Root-cause top-1 misses exactly once — a cascade where a proximate election storm masks a disk-bound follower — and recovers it at top-3. Both columns share the same deterministic classifier; the LLM never classifies.
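The top-1 and top-3 figures are ordinary top-k accuracy over ranked root-cause lists. The sketch below shows the arithmetic on synthetic data shaped like the result described above (32 of 33 incidents correct at rank 1, one recovered at rank 2); the label strings and the `top_k_accuracy` helper are illustrative, not the benchmark harness.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of incidents whose true cause appears in the first k ranked guesses."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, labels)
               if truth in ranked[:k])
    return hits / len(labels)

# Synthetic stand-in for 33 incidents: 32 ranked correctly at position 1 ...
labels = ["slow_follower"] * 32 + ["disk_bound_follower"]
ranked = [["slow_follower", "election_storm", "lease_starvation"]] * 32
# ... and one cascade where the proximate symptom outranks the true cause.
ranked = ranked + [["election_storm", "disk_bound_follower", "slow_follower"]]

top1 = top_k_accuracy(ranked, labels, k=1)
top3 = top_k_accuracy(ranked, labels, k=3)
print(f"{top1:.3f} {top3:.3f}")   # 0.970 1.000
```

One miss in 33 gives 32/33, which rounds to 0.970; widening to the top three recovers it.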

Read the paper (PDF) → Benchmark harness ↗

§ 08 · Frequently doubted

Honest answers.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The reasoning earns its keep where it helps: diagnosing root causes from noisy multi-signal telemetry, drafting tuning PRs with rationale and rollback, and writing readable postmortems from a tool-bounded view. The agent simply never gets to mutate consensus. That separation is the contribution.

Why not write your own Raft?

Apache Ratis is battle-tested in Apache Ozone and IoTDB. The novelty is the agentic ops layer and the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent proposes a bad config?

A human closes the PR, and the cluster is unchanged. The safety envelope rejects unsafe patches before a PR is even opened. The agent's only mutation pathway is the review queue; bad proposals become logged evidence for the paper.

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors the classifier needs the same answer every run. The LLM genuinely earns its keep in diagnosis and postmortem narration — comparison, lesson-extraction, and ranked root-cause — and that path is opt-in.
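A replay test of this shape is only possible when classification is a pure function of the telemetry window. The sketch below assumes hypothetical field names, thresholds, and class labels; the page does not name AEGIS's six anomaly classes or their rules.

```python
def classify(window):
    """Rule-based classifier: the first matching rule wins, with no randomness."""
    if window["leader_changes"] >= 3:
        return "election_storm"
    if window["max_follower_lag_ms"] > 1000:
        return "slow_follower"
    if window["lease_wait_p99_ms"] > 2000:
        return "lease_starvation"
    return "healthy"

# A recorded trace of telemetry windows (synthetic values).
trace = [
    {"leader_changes": 0, "max_follower_lag_ms": 40,   "lease_wait_p99_ms": 12},
    {"leader_changes": 0, "max_follower_lag_ms": 1800, "lease_wait_p99_ms": 30},
    {"leader_changes": 4, "max_follower_lag_ms": 2500, "lease_wait_p99_ms": 90},
    {"leader_changes": 0, "max_follower_lag_ms": 60,   "lease_wait_p99_ms": 3500},
]

# Replay the same trace twice and require identical answers.
first_run = [classify(w) for w in trace]
second_run = [classify(w) for w in trace]
print(first_run == second_run, first_run)
```

A sampled model offers no such guarantee between runs, which is why it is kept for diagnosis and narration rather than classification.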