Coordination, self-healed

AEGIS

Raft-grade locks, leases & config with an LLM operations agent in the control plane — never the data plane.

Built & illustrated by Abhishek Aditya.

Ed. 0.1 · alpha · 2026

Chaos in, postmortems out. The LLM lives in the control plane, never the data plane.
§ 01 · The Problem

Coordination is the load-bearing wall of the cloud.

Chubby · ZooKeeper
etcd · Consul
and the on-call rota behind them.

Every Kubernetes control loop, every Cassandra ring, every distributed lock rests on a coordination service. They are battle-tested, yet operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and endless consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag the incidents that produced them.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners. Consensus protocols (Paxos, Raft, ZAB) rest on a few narrow correctness invariants and are acutely sensitive to non-determinism. A non-deterministic language model in the commit path breaks the guarantee that two replicas applying the same log entry compute the same state. That guarantee is the safety.
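
The determinism argument can be made concrete with a minimal Python sketch. This is not AEGIS code: the two `apply_*` functions, the log entry shape, and the use of `random.Random` as a stand-in for a sampled model are all assumptions made for illustration. It contrasts a pure state-machine step with one whose result depends on replica-local randomness.

```python
import random

def apply_deterministic(state: dict, entry: dict) -> dict:
    """Pure function of (state, entry): every replica computes the same result."""
    new_state = dict(state)
    new_state[entry["key"]] = entry["value"]
    return new_state

def apply_with_sampling(state: dict, entry: dict, rng: random.Random) -> dict:
    """Hypothetical stand-in for a sampled model in the commit path: the result
    depends on replica-local randomness, not only on the log entry."""
    new_state = dict(state)
    new_state[entry["key"]] = rng.choice(entry["candidates"])
    return new_state

# One committed log entry, replayed by several replicas.
log = [{"key": "owner", "value": "client-a",
        "candidates": ["client-a", "client-b", "client-c"]}]

def replay(apply_fn, *extra) -> dict:
    state: dict = {}
    for entry in log:
        state = apply_fn(state, entry, *extra)
    return state

# Deterministic apply: two replicas agree.
replica_1 = replay(apply_deterministic)
replica_2 = replay(apply_deterministic)

# Sampled apply: each "replica" has its own random source, so outcomes diverge.
outcomes = {replay(apply_with_sampling, random.Random(seed))["owner"]
            for seed in range(50)}
```

Here `replica_1 == replica_2` holds by construction, while `outcomes` contains more than one distinct owner for the same log entry, which is exactly the divergence a consensus protocol cannot tolerate.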

The right place for an LLM agent is the control plane: observing, recommending, and documenting, never deciding. AEGIS is the open-source artifact that makes this argument concrete and testable.

§ 02 · How it works

Two planes.
One invariant.

The data plane is deterministic Raft: Apache Ratis on Java 25, exposed over gRPC, with leader-stamped wall-clock time so that TTL math is reproducible across replicas.

The control plane is an agentic Python sidecar (LangGraph, OpenRouter-backed) that observes telemetry, recommends config changes as GitHub PRs, and drafts postmortems as GitHub Issues.

The two share a telemetry surface. They do not share a mutation surface. A human holds every merge bit.

Data plane → Apache Ratis · Java 25
Control plane → Python · LangGraph · OpenRouter
Mutation path → GitHub PRs + Issues
[ The two planes ] Figure: the two planes of AEGIS. Top: the deterministic, LLM-free data plane, where gRPC clients (lock, lease, get) feed five Apache Ratis Raft nodes on Java 25 (node3 is the elected leader; the others are followers), exposing locks, leases, and a key-value store with watches. A telemetry-only boundary (Prometheus scrape + Redis Streams) separates it from the advisory-only control plane below, where a Python agent (LangGraph) opens only GitHub pull requests for config and GitHub Issues for postmortems, and a human reviewer holds the merge bit. Read-only telemetry flows up; pull requests and issues flow down; the agent has no gRPC, no lock, no key. The LLM never crosses the boundary line.
[ The safe closed loop ] Figure: the safe closed loop. A left-to-right pipeline of five stages with a return arrow from the last stage back to the first: detect (heuristic classifier), diagnose (retrieval-augmented root-cause), propose-with-proof (verifier, counterfactual sandbox replay), constrain (safety envelope), and human merges (PR / Issue gate). The verifier proves the fix helps before the PR; a merged config heals the cluster, which is then re-observed. Chaos in (chaos harness), postmortem out (drafter → Issue).
§ 03 · The closed loop

Propose, and prove.

An open loop says "anomaly → PR → hope it helped." AEGIS closes it without ever touching the data-plane invariant.

In the full design, before a PR is opened, a verifier replays the exact chaos trace in an ephemeral sandbox cluster under both the current and the proposed config, and embeds the before/after delta in the PR. Independently of that replay, a static safety envelope rejects any patch that violates a consensus-safety constraint, so unsafe configs are structurally impossible to propose.
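
A static safety envelope of this kind can be sketched as a pure validation function that runs before any PR is created. The sketch below is an illustration under stated assumptions, not the AEGIS implementation: the `RaftTuning` fields, the `min_ratio` factor, and the three rules are hypothetical, chosen only because Raft guidance keeps the election timeout well above the heartbeat interval. The constraints AEGIS actually enforces are not listed on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RaftTuning:
    # Hypothetical config keys, for illustration only.
    heartbeat_interval_ms: int
    election_timeout_min_ms: int
    election_timeout_max_ms: int

def envelope_violations(cfg: RaftTuning, min_ratio: int = 3) -> list[str]:
    """Return every violated rule; an empty list means the patch is inside the envelope."""
    violations = []
    if cfg.heartbeat_interval_ms <= 0:
        violations.append("heartbeat interval must be positive")
    if cfg.election_timeout_min_ms > cfg.election_timeout_max_ms:
        violations.append("election timeout min exceeds max")
    if cfg.election_timeout_min_ms < min_ratio * cfg.heartbeat_interval_ms:
        violations.append(
            f"election timeout min must be at least {min_ratio}x the heartbeat interval"
        )
    return violations

def may_open_pr(cfg: RaftTuning) -> bool:
    """The proposer only reaches the PR step when no rule is violated."""
    return not envelope_violations(cfg)

safe_patch = RaftTuning(heartbeat_interval_ms=100,
                        election_timeout_min_ms=1000,
                        election_timeout_max_ms=2000)
unsafe_patch = RaftTuning(heartbeat_interval_ms=500,
                          election_timeout_min_ms=600,
                          election_timeout_max_ms=400)
```

Because the check is static and deterministic, the same patch is accepted or rejected on every run, and a rejected patch never becomes a PR for a human to review.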

Status: the safety envelope and RAG root-cause diagnosis are shipped; counterfactual sandbox verification is on the roadmap.

§ 04 · Consensus mechanics

Raft, leader-elected.

Five nodes, one elected leader, on-disk log + snapshot. The leader replicates an append-only log to its followers; a quorum acknowledgement commits each entry. Kill the leader and a new term elects a successor.
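
The quorum rule can be illustrated with a few lines of Python. This is a simplified sketch of the standard Raft commit calculation, not Ratis code: the function names are hypothetical, and the sketch omits Raft's additional rule that a leader only commits entries from its own term by counting replicas.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the cluster: 3 of 5, 2 of 3, and so on."""
    return cluster_size // 2 + 1

def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a quorum of nodes.

    match_index holds the last replicated index per node, including the
    leader's own last index.
    """
    ordered = sorted(match_index, reverse=True)
    return ordered[quorum_size(len(match_index)) - 1]

# Leader at index 11; followers at 11, 10, 9, and 7.
five_node_commit = commit_index([11, 11, 10, 9, 7])
```

With five nodes the quorum is three, so index 10 is committed in this example: three nodes hold it, while index 11 is held by only two.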

Every command carries leader-stamped wall-clock time in its proto envelope, so lease and TTL math is identical on every replica: determinism by construction.
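
A minimal sketch of what leader-stamped time buys, assuming a hypothetical `LeaseCommand` envelope and `LeaseStateMachine` (only the `leader_timestamp_ms` field name comes from the figure on this page; the op names and everything else are illustrative). The state machine never reads a local clock, so replaying the same log yields the same lease table on every replica.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeaseCommand:
    op: str                   # "grant" or "renew" (hypothetical op names)
    lease_id: str
    ttl_ms: int
    leader_timestamp_ms: int  # stamped once by the leader before replication

class LeaseStateMachine:
    def __init__(self) -> None:
        self.expiry_ms: dict[str, int] = {}

    def apply(self, cmd: LeaseCommand) -> bool:
        now = cmd.leader_timestamp_ms  # never the replica's local clock
        # Expire leases using the same stamped time on every replica.
        self.expiry_ms = {k: v for k, v in self.expiry_ms.items() if v > now}
        if cmd.op == "grant" and cmd.lease_id in self.expiry_ms:
            return False  # lease is still held
        if cmd.op == "renew" and cmd.lease_id not in self.expiry_ms:
            return False  # lease already expired
        self.expiry_ms[cmd.lease_id] = now + cmd.ttl_ms
        return True

log = [
    LeaseCommand("grant", "lease-1", 5_000, 1_000),  # live until 6_000
    LeaseCommand("renew", "lease-1", 5_000, 4_000),  # live until 9_000
    LeaseCommand("renew", "lease-1", 5_000, 9_500),  # too late: expired at 9_000
]

def replay(commands: list[LeaseCommand]):
    sm = LeaseStateMachine()
    results = [sm.apply(c) for c in commands]
    return results, sm.expiry_ms

replica_a = replay(log)
replica_b = replay(log)
```

Both replays return the same results and the same final table. Had `apply` called the local wall clock instead, a replica replaying the log minutes later would have expired the lease at a different point than its peers.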

Wrapping Apache Ratis 3.x
Production-proven in Apache Ozone & IoTDB.
[ Raft leader election ] Figure: Raft cluster and leader election. Five Raft nodes arranged in a ring; node3 is the leader at term 42, replicating its append-only log to followers node1, node2, node4, and node5. The leader stamps leader_timestamp_ms on every log entry, which makes TTLs replica-deterministic. A side schematic shows the append-only log (entries 7 to 11) with the commit-index boundary, alongside the chaos-in, postmortem-out motif.
§ 05 · Capabilities

What AEGIS provides.

A complete coordination service plus an operations agent confined to the control plane — backed by a 99-test agent suite and a reproducible benchmark.

Data plane · deterministic
Raft consensus core (Apache Ratis · 5 nodes)
Distributed locks + session leases with fencing tokens
Versioned key–value config store with watches
Mirror-image Java + Python client SDKs
Telemetry pipeline (Prometheus + Redis Streams)
Control plane · agentic
Deterministic anomaly classifier
Safety-enveloped config proposer (GitHub PRs)
LLM-with-retrieval root-cause diagnosis
Tool-using postmortem drafter (GitHub Issues)
Chaos harness + live operator dashboard
On the roadmap → counterfactual sandbox verification that embeds an empirical before/after proof in every proposed config change.
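
Fencing tokens, listed above alongside locks and session leases, are usually consumed on the resource side. The sketch below shows the general technique under assumptions: `FencedResource` and its fields are hypothetical and do not reflect the AEGIS SDK. The resource remembers the highest token it has seen and rejects writes that carry an older one.

```python
class FencedResource:
    """A protected resource that refuses writes from stale lock holders."""

    def __init__(self) -> None:
        self.highest_token = -1
        self.value = None

    def write(self, token: int, value: str) -> bool:
        # A token lower than one already seen means the caller's lock or
        # lease was superseded, for example after a long pause.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True

resource = FencedResource()
first = resource.write(33, "from client-a")    # client-a holds token 33
second = resource.write(34, "from client-b")   # lock re-granted with token 34
stale = resource.write(33, "late write by a")  # client-a wakes up too late
```

The late write is rejected and the value written under token 34 survives, which is the property that makes a lock safe to use even when a holder pauses past its lease.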
§ 06 · Quickstart

One script to a cluster.

Requirements: Docker with Compose v2, plus Python 3.10+ for the agents. The aegis-*.sh scripts bring up a live 5-node cluster with observability and the dashboard, run the test suites, and tear it all down; ./aegis.sh is the interactive launcher.

# 1 · clone
$ git clone https://github.com/Abhishek-Aditya-bs/Aegis && cd Aegis

# 2 · bring up a 5-node cluster + observability + dashboard
$ ./aegis-up.sh --cluster locks   # or: --cluster kv
   dashboard  http://localhost:4400
   grafana    http://localhost:3000  (admin/admin)
   prometheus http://localhost:9090

# 3 · run the test suites (offline, $0)
$ ./aegis-test.sh --fast   # agents + benchmark + chaos
   99 agent tests pass · benchmark baseline 33/33

# 4 · score ConsensusOps-Bench with a real model (~$0.01)
$ agents/.venv/bin/python benchmark/run.py --diagnoser rag \
    --provider openrouter --model google/gemini-3.1-flash-lite
   classification 1.000 · root-cause top-1 0.970 · top-3 1.000

# tear down (add --volumes to wipe Raft state)
$ ./aegis-down.sh --cluster all
interactive launcher → ./aegis.sh · chaos → chaos/slow-follower.sh · tail -f chaos/events.jsonl
§ 07 · Evaluation

Results, honestly.

AEGIS ships ConsensusOps-Bench — 33 labelled Raft incidents scoring anomaly classification and ranked root-cause. The deterministic baseline is already near-perfect, and a one-cent LLM-with-retrieval run on a cheap Gemini Flash model matches it exactly. The determinism lives in the free classifier; the metered model is confined to advisory diagnosis.

Full run · 33 incidents
model · google/gemini-3.1-flash-lite
wall-clock · 51.4 s  (~1.6 s / incident)
API spend · $0.0088  (under one cent)
ConsensusOps-Bench · scorecard

Metric                    Baseline · $0    LLM · $0.01
Anomaly classification    1.000            1.000
Root-cause · top-1        0.970            0.970
Root-cause · top-3        1.000            1.000

Perfect classification across all six anomaly classes. Root-cause top-1 misses exactly once — a cascade where a proximate election storm masks a disk-bound follower — and recovers it at top-3. Both columns share the same deterministic classifier; the LLM never classifies.
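
The scorecard figures follow from simple arithmetic: one top-1 miss in 33 incidents gives 32/33 ≈ 0.970, and 51.4 s over 33 incidents is about 1.6 s each. A short sketch of top-k scoring reproduces this, using synthetic rankings that mirror the reported counts. The label strings and the `top_k_accuracy` helper are hypothetical, and the data here is not the benchmark's own.

```python
def top_k_accuracy(ranked: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of incidents whose true root cause is among the first k candidates."""
    hits = sum(truth in candidates[:k] for candidates, truth in zip(ranked, truths))
    return hits / len(truths)

# Synthetic stand-in: 32 incidents ranked correctly, one cascade where the
# proximate election storm outranks the disk-bound follower.
truths = ["disk_bound_follower"] * 33
ranked = [["disk_bound_follower", "election_storm"]] * 32
ranked.append(["election_storm", "disk_bound_follower"])

top1 = top_k_accuracy(ranked, truths, k=1)
top3 = top_k_accuracy(ranked, truths, k=3)
seconds_per_incident = 51.4 / 33

print(f"top-1 {top1:.3f} · top-3 {top3:.3f} · {seconds_per_incident:.1f} s / incident")
# prints: top-1 0.970 · top-3 1.000 · 1.6 s / incident
```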

§ 08 · Frequently doubted

Honest answers.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The reasoning earns its keep where it helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with rationale and rollback, and writing readable postmortems from a tool-bounded view. The agent simply never gets to mutate consensus. That separation is the contribution.

Why not write your own Raft?

Apache Ratis is battle-tested in Apache Ozone and IoTDB. The novelty is the agentic ops layer and the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent proposes a bad config?

A human closes the PR, and the cluster is unchanged. The safety envelope rejects unsafe patches before a PR is even opened. The agent's only mutation pathway is the review queue; bad proposals become logged evidence for the paper.

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors the classifier needs the same answer every run. The LLM genuinely earns its keep in diagnosis and postmortem narration — comparison, lesson-extraction, and ranked root-cause — and that path is opt-in.
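
What a replay test demands of the classifier can be sketched as follows. Everything in this example is an assumption made for illustration: the signal names, thresholds, class labels, and trace format are hypothetical and are not taken from the AEGIS classifier. The point is only that a rule-based function of a telemetry window returns the same labels on every replay.

```python
import json

def classify(window: dict) -> str:
    """Rule-based classification: a pure function of one telemetry window.
    Thresholds and signal names are hypothetical."""
    if window["leader_changes_per_min"] >= 3:
        return "election_storm"
    if window["max_follower_lag_entries"] >= 1000:
        return "slow_follower"
    return "healthy"

# A recorded trace, one JSON object per line.
trace = "\n".join(json.dumps(w) for w in [
    {"leader_changes_per_min": 0, "max_follower_lag_entries": 12},
    {"leader_changes_per_min": 0, "max_follower_lag_entries": 4200},
    {"leader_changes_per_min": 5, "max_follower_lag_entries": 4800},
])

def replay(recorded: str) -> list[str]:
    return [classify(json.loads(line)) for line in recorded.splitlines()]

first_run = replay(trace)
second_run = replay(trace)
assert first_run == second_run  # the replay test: same trace, same labels
```

A sampled model offers no such guarantee across runs, which is why it is kept to the advisory diagnosis and narration steps.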