ADK Agent Playground
Seven multi-agent systems on Google ADK, plus seven research primitives layered on top: vendor-neutral observability, a failure-mode classifier with a Reflexion loop, a paraconsistent debate judge, capability-based tool security, reflective prompt evolution, structured cognitive memory, and a Lean 4 + Z3 capability-closure proof layer. Python, model-agnostic via OpenRouter. In a final round the primitives compose into two ideas from 2026 agent research — measuring whether each harness component earns its place (an ablation with confidence intervals), and letting the system safely improve its own configuration (a governed self-evolution loop). Interactive demos run live on HuggingFace.
The agents started as a way to learn ADK. The seven primitives layered on top are what kept the project interesting. Each one targets a real gap in mainstream agent tooling: flat session memory, ML-only permission filters, classical reasoning that collapses contradictions, prompt sets that never improve. They compose, and each produces a measurable artifact. The final round turns that lens on the harness itself and adds a discipline the field often skips: every synthetic number is labelled as synthetic, and an improvement counts only when it survives a statistically significant test on held-out data.
Architecture
Each layer pairs with at least one earlier layer. Mathematics older than 2024 (Hebbian, Rescorla-Wagner, Curry-Howard, Belnap-Dunn) does the structural work; LLM calls handle the noisy interface to language.
The Seven Agents
Each agent is a SequentialAgent or ParallelAgent composing leaf LlmAgent instances. The earlier hierarchical-delegation orchestrator was retired in favor of deterministic pipelines plus heterogeneous model routing: cheap models for simple roles, capable models for hard ones. Provider swap is one line via LiteLLM.
| Agent | Stages | Pattern |
|---|---|---|
| deep_research | Plan, research, compress, [fact-check ∥ analyze], critique, write | Sequential pipeline |
| writing_assistant | Outline, draft, edit, fact-check, revise | Sequential pipeline |
| summarizer | Extract content (web, PDF, YouTube, raw text), summarize | Sequential pipeline |
| market_research | [competitor ∥ trend ∥ market sizer], report | Sequential pipeline |
| debate | [pro ∥ con], classical moderator or Belnap 4-valued judge | Sequential pipeline |
| code_reviewer | LLM orchestrator + parallel review team (security ∥ performance ∥ style) | Parallel review |
| hello_world | Minimal smoke test, verifies ADK + OpenRouter setup | Single LlmAgent |
The Seven Layers
1. OpenTelemetry GenAI observability
Vendor-neutral tracing via the OTel GenAI semantic conventions. Self-hosted Langfuse via Docker Compose, split web and worker. Decorator-based instrumentation, idempotent init_tracing() with a NoOp-safe fallback. Every later layer consumes traces from this substrate.
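A minimal sketch of what an idempotent init_tracing() with a NoOp-safe fallback and a decorator could look like. Only the name init_tracing() comes from this project; the traced decorator, the _NoOp classes, the service name, and the summarize function are hypothetical. The opentelemetry-api calls used (trace.get_tracer, start_as_current_span, set_attribute) are real, and gen_ai.operation.name is an attribute from the GenAI semantic conventions.

```python
import functools

try:
    # Real package: opentelemetry-api. With no SDK configured, the tracer
    # it hands back already behaves as a no-op.
    from opentelemetry import trace as _otel_trace
except ImportError:
    _otel_trace = None


class _NoOpSpan:
    """Stand-in span used when OpenTelemetry is not installed."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # never swallow exceptions from the wrapped call

    def set_attribute(self, key, value):
        pass


class _NoOpTracer:
    def start_as_current_span(self, name):
        return _NoOpSpan()


_tracer = None  # module-level cache is what makes init_tracing() idempotent


def init_tracing(service_name="adk-playground"):
    """Return the process-wide tracer, creating it on the first call only."""
    global _tracer
    if _tracer is None:
        if _otel_trace is not None:
            _tracer = _otel_trace.get_tracer(service_name)
        else:
            _tracer = _NoOpTracer()
    return _tracer


def traced(operation):
    """Decorator: run the wrapped function inside a span named `operation`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with init_tracing().start_as_current_span(operation) as span:
                span.set_attribute("gen_ai.operation.name", operation)
                return fn(*args, **kwargs)
        return wrapper
    return decorate


@traced("summarize")
def summarize(text):
    return text[:20]
```

Calling init_tracing() from several modules is safe under this design because later calls return the cached tracer, and instrumented code keeps working unchanged when no tracing backend is present.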
2. MAST failure classifier + Reflexion
Berkeley Sky's 14-mode failure taxonomy as an LLM-as-judge classifier emitting mas.failure_label events on offending child spans. A Reflexion verbal-reinforcement loop writes reflections to episodic memory and re-injects them via a {prior_reflections} template placeholder.
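The re-injection step can be pictured as plain template filling. The sketch below assumes a bounded in-memory store; only the {prior_reflections} placeholder comes from this project, while the class, the template wording, and the failure label string are illustrative.

```python
from collections import deque

PROMPT_TEMPLATE = (
    "You are the research agent.\n"
    "Lessons from earlier attempts:\n{prior_reflections}\n"
    "Task: {task}"
)


class EpisodicReflections:
    """Keeps the most recent verbal reflections; the oldest are dropped first."""

    def __init__(self, max_items=3):
        self._items = deque(maxlen=max_items)

    def add(self, failure_label, reflection):
        self._items.append(f"- [{failure_label}] {reflection}")

    def render(self):
        return "\n".join(self._items) if self._items else "- (none yet)"


def build_prompt(memory, task):
    """Fill the placeholder with whatever reflections are currently stored."""
    return PROMPT_TEMPLATE.format(prior_reflections=memory.render(), task=task)


memory = EpisodicReflections(max_items=2)
first_prompt = build_prompt(memory, "Compare two vector databases")
memory.add("step_repetition", "Stop re-running the same search query.")
second_prompt = build_prompt(memory, "Compare two vector databases")
```

Bounding the store keeps the injected block from growing without limit across retries; the limit of two used here is arbitrary.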
3. Belnap paraconsistent debate
Replaces the classical Pro/Con/Moderator collapse with a Belnap-Dunn 4-valued lattice judge {T, F, B, N}. Each debater votes independently on each atomic claim. The lattice retains contradictions (B = dialetheia) and gaps (N = neither side commits) instead of forcing collapse. Vote isolation via fresh sessions per call prevents leakage. Opt-in via BELNAP_DEBATE=1.
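As an illustration of the four-valued verdict, the sketch below combines independent votes on one claim using the standard Belnap-Dunn reading: a claim is T if it was only asserted, F if only denied, B if both, N if neither. The vote encoding (True, False, None) and the example claims are assumptions, not this project's data model.

```python
from enum import Enum


class Belnap(Enum):
    T = "true only"
    F = "false only"
    B = "both (contradiction)"
    N = "neither (gap)"


def judge(votes):
    """Combine independent votes on one atomic claim.

    Each vote is True (asserts the claim), False (denies it),
    or None (does not commit).
    """
    told_true = any(v is True for v in votes)
    told_false = any(v is False for v in votes)
    if told_true and told_false:
        return Belnap.B
    if told_true:
        return Belnap.T
    if told_false:
        return Belnap.F
    return Belnap.N


# (pro vote, con vote) per claim; made-up examples.
claims = {
    "Remote work raises output": (True, False),   # pro asserts, con denies
    "Office costs fall": (True, None),            # con does not commit
    "Commute time is unaffected": (None, None),   # nobody commits
}
verdicts = {claim: judge(votes) for claim, votes in claims.items()}
```

A classical moderator would have to pick a winner for the first claim; here it stays B, and the third claim stays N rather than defaulting to false.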
4. CaMeL capability security
Two-LLM architecture from Google DeepMind. The Privileged LLM emits a restricted-Python plan and never sees tool outputs; the Quarantined LLM parses untrusted text into Pydantic schemas with no tool access. An AST whitelist plus a runtime resolver block the classic Python sandbox-escape vectors, verified against 23 attack snippets. Capability metadata propagates through every Assign / Call / JoinedStr / Subscript. The PolicyEngine default-denies any tool call with untrusted args. On an AgentDojo subset reproduction, attack success rate (ASR) drops from 77.5% to 10.0%.
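To make the whitelist idea concrete, here is a small sketch built on Python's standard ast module. It is not this project's validator: the allowed node set, the tool names, the Tainted marker class, and policy_allows are all assumptions chosen to keep the example short.

```python
import ast

# Node types a plan may contain. Anything else (Import, Attribute, Lambda,
# While, FunctionDef, ...) is rejected outright.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.Expr, ast.Call, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.JoinedStr, ast.FormattedValue, ast.Subscript, ast.keyword,
)
ALLOWED_TOOLS = {"fetch_page", "send_email"}  # hypothetical tool names


def validate_plan(source):
    """Raise ValueError unless every AST node is on the whitelist."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError(f"disallowed name: {node.id}")
        if isinstance(node, ast.Call):
            named = isinstance(node.func, ast.Name)
            if not (named and node.func.id in ALLOWED_TOOLS):
                raise ValueError("calls must target a whitelisted tool by name")


class Tainted(str):
    """Marks a value that originated in untrusted content."""


def policy_allows(tool, args):
    """Default-deny: one untrusted argument is enough to block the call."""
    return tool in ALLOWED_TOOLS and not any(isinstance(a, Tainted) for a in args)


def is_rejected(source):
    try:
        validate_plan(source)
    except ValueError:
        return True
    return False
```

Because attribute access is not whitelisted, escapes that walk `__class__` or call a method on an imported module fail at parse-check time. The Tainted marker is deliberately naive: ordinary string operations on it return a plain str and lose the mark, which is one reason to track capabilities at the interpreter level as described above.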
5. GEPA self-optimizing prompts
Reflective Pareto prompt evolution via DSPy 3.x. Per-agent dev sets (24 train, 12 val, 12 holdout, seven agents). A WeightedScoreMetric adapter wraps the MAST evaluator into DSPy's ScoreWithFeedback. A holdout-regression gate writes any v2 that drops below v1 on holdout to prompts/v2_rejected/. Cost cap aborts at $5 per run.
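The holdout-regression gate reduces to one comparison and one routing decision. In the sketch below, only the prompts/v2_rejected/ directory is taken from this project; the function name, the prompts/v2/ destination, the file extension, and the scores are assumptions.

```python
import tempfile
from pathlib import Path


def gate_prompt(name, v2_text, v1_holdout, v2_holdout, root):
    """Route a candidate prompt by its holdout score.

    A v2 that scores below v1 on holdout goes to prompts/v2_rejected/;
    otherwise it goes to prompts/v2/ (an assumed location).
    """
    accepted = v2_holdout >= v1_holdout
    target_dir = Path(root) / "prompts" / ("v2" if accepted else "v2_rejected")
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{name}.txt"
    path.write_text(v2_text, encoding="utf-8")
    return accepted, path


# Demonstration in a throwaway directory with made-up scores.
with tempfile.TemporaryDirectory() as root:
    rejected_ok, rejected_path = gate_prompt("summarizer", "new prompt", 0.80, 0.74, root)
    rejected_dir = rejected_path.parent.name
    rejected_text = rejected_path.read_text(encoding="utf-8")
    accepted_ok, accepted_path = gate_prompt("summarizer", "new prompt", 0.80, 0.85, root)
    accepted_dir = accepted_path.parent.name
```

Writing rejected candidates to disk instead of discarding them keeps a record of what the optimizer tried and why it was not promoted.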
6. HippoCycle (cognitive memory)
Working, episodic, and semantic memory layers with a sleep-cycle consolidation loop. NREM strengthens co-retrieved nodes via a saturating delta rule (Rescorla-Wagner) and applies global synaptic downscaling. REM abstracts correlated episode clusters into semantic invariants with provenance pointers to source episodes. A forget pass prunes low-value, low-co-occurrence nodes with a min-confidence floor.
Contradicting episodes produce semantic nodes tagged value=B (Belnap integration). In secure mode, episodic content is treated as untrusted and HippoCycle is rejected with a warning to preserve the CaMeL trust boundary.
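The NREM step described above can be sketched in a few lines. The update below is the usual saturating delta rule, w ← w + α(λ − w), followed by a uniform multiplicative downscale; the learning rate, ceiling, downscaling factor, and edge names are assumptions for illustration only.

```python
def nrem_strengthen(weight, alpha=0.3, ceiling=1.0):
    """Delta rule: move a fraction alpha of the remaining distance to the ceiling."""
    return weight + alpha * (ceiling - weight)


def downscale(weights, factor=0.9):
    """Global synaptic downscaling: shrink every weight by the same factor."""
    return {edge: w * factor for edge, w in weights.items()}


def sleep_cycle(weights, co_retrieved):
    """One NREM pass: strengthen co-retrieved edges, then downscale everything."""
    strengthened = {
        edge: nrem_strengthen(w) if edge in co_retrieved else w
        for edge, w in weights.items()
    }
    return downscale(strengthened)


weights = {("a", "b"): 0.2, ("a", "c"): 0.2}
for _ in range(3):
    weights = sleep_cycle(weights, co_retrieved={("a", "b")})
```

The saturating form means repeated co-retrieval yields diminishing gains and never pushes a weight past the ceiling, while downscaling lets unused edges decay toward the point where a forget pass can prune them.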
7. FormalGuard (compositional safety proofs)
Third safety layer above CaMeL. Capability closure verification via Z3 (set-membership v1) and Lean 4 proof obligation generation. Reproduces the Spera Theorem 9.2 result: two individually-safe agents compose into a forbidden state with a concrete Z3 witness. Lake project pinned to leanprover/lean4:v4.15.0 with Mathlib v4.15.0; the proofs compile with lake build. Three-way result handling: a sat result maps to Rejected, unsat to Verified, and unknown to Inconclusive.
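The composition failure is easy to see on a toy example. The sketch below swaps Z3 for a plain-Python least-fixpoint computation over sets, so it shows the closure idea but not this project's encoding; the capability names and the single derivation rule are made up.

```python
def closure(capabilities, rules):
    """Least fixpoint: keep firing rules whose premises are all held."""
    held = set(capabilities)
    changed = True
    while changed:
        changed = False
        for premises, derived in rules:
            if premises <= held and derived not in held:
                held.add(derived)
                changed = True
    return held


def verify(capabilities, rules, forbidden):
    """Return ("Rejected", witness) if a forbidden capability is reachable."""
    reachable = closure(capabilities, rules) & forbidden
    return ("Rejected", sorted(reachable)) if reachable else ("Verified", [])


# Hypothetical rule: reading secrets plus network access yields exfiltration.
rules = [(frozenset({"read_secrets", "network_send"}), "exfiltrate")]
forbidden = {"exfiltrate"}

agent_a = {"read_secrets"}
agent_b = {"network_send"}

result_a = verify(agent_a, rules, forbidden)
result_b = verify(agent_b, rules, forbidden)
result_composed = verify(agent_a | agent_b, rules, forbidden)
```

Each agent passes alone, yet their union reaches the forbidden capability, and the returned list plays the role of the witness. A set fixpoint always terminates, so this toy version never produces the Inconclusive outcome that a solver returning unknown would require.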
Capstone — Harness science + governed self-evolution
The seven primitives are the foundation; the final round composes them into two ideas from 2026 agent research. The premise (Harness-Bench): an agent's capability is a property of its harness configuration — reflexion, secure mode, memory, prompt version, model route — not the model alone. So the harness becomes a tunable object you can measure and search.
8. Within-harness ablation
A frozen, hashable HarnessConfig genome unifies every runner flag. A factorial sweep measures each primitive's marginal contribution and every pairwise interaction with cell-clustered bootstrap confidence intervals — the cell, not the example, is the independent unit, so correlated per-example scores don't inflate significance. Non-significant effects are reported as such: a null result is a result.
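A sketch of the two ingredients named above: a frozen dataclass as a hashable genome, and a percentile bootstrap that resamples whole cells. The field names, the sample effects, and the choice of a percentile interval are assumptions. Averaging cell means matches pooling all examples only when cells are equal in size.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class HarnessConfig:
    """Frozen, so instances are hashable and can key a results table."""
    reflexion: bool = False
    secure_mode: bool = False
    memory: bool = False


def cluster_bootstrap_ci(cell_effects, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean effect, resampling cells, not examples.

    cell_effects: one list of per-example score differences per cell.
    """
    rng = random.Random(seed)
    cell_means = [mean(cell) for cell in cell_effects]
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(cell_means) for _ in cell_means]
        stats.append(mean(resample))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return mean(cell_means), low, high


# Made-up per-example effects, grouped by cell.
consistent = [[0.1, 0.2], [0.15, 0.05], [0.2, 0.1], [0.12, 0.18]]
mixed = [[0.3], [-0.3], [0.2], [-0.2]]

est_c, low_c, high_c = cluster_bootstrap_ci(consistent)
est_m, low_m, high_m = cluster_bootstrap_ci(mixed)

results = {HarnessConfig(reflexion=True): (est_c, low_c, high_c)}
```

Resampling at the cell level keeps examples from the same cell together, so within-cell correlation widens the interval instead of being counted as extra evidence. An interval that straddles zero, as in the mixed case, is reported as non-significant.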
9. Governed self-evolution
A population of harness genomes mutates across generations — a Darwin Gödel Machine-style lineage archive with parent pointers, plus a Group-Evolving-Agents-style shared-experience archive. A candidate is admitted only when its paired improvement over its parent is CI-significant; the winner is re-scored on a held-out validation partition to defuse the winner's curse; and every candidate is vetted through CaMeL secure mode, a filesystem denylist, and a fail-closed FormalGuard Z3 closure proof before it can enter the archive.
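The admission logic can be sketched as a chain of gates in front of a parent-pointer archive. Everything named below is hypothetical: the Archive class, the admit signature, and the check names. Requiring a positive held-out gain is an assumed reading of how the held-out re-score is used.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Lineage archive: each admitted genome id maps to its parent id."""
    parents: dict = field(default_factory=dict)

    def admit(self, child_id, parent_id, ci_low, holdout_gain, safety_checks):
        """Admit only if every gate passes; return (admitted, reason)."""
        if ci_low <= 0:
            return False, "improvement over parent not CI-significant"
        if holdout_gain <= 0:
            return False, "gain did not survive held-out re-scoring"
        for name, check in safety_checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # fail closed: an erroring check counts as a failure
            if ok is not True:
                return False, f"safety gate failed: {name}"
        self.parents[child_id] = parent_id
        return True, "admitted"

    def lineage(self, genome_id):
        """Follow parent pointers back to the seed genome."""
        chain = [genome_id]
        while self.parents.get(chain[-1]) is not None:
            chain.append(self.parents[chain[-1]])
        return chain


def solver_timeout():
    raise TimeoutError("closure proof did not finish")


archive = Archive(parents={"g0": None})
passing = {"camel": lambda: True, "denylist": lambda: True, "closure": lambda: True}
failing = {"camel": lambda: True, "closure": solver_timeout}

admitted_g1 = archive.admit("g1", "g0", ci_low=0.02, holdout_gain=0.01, safety_checks=passing)
admitted_g2 = archive.admit("g2", "g1", ci_low=0.03, holdout_gain=0.02, safety_checks=failing)
admitted_g3 = archive.admit("g3", "g1", ci_low=-0.01, holdout_gain=0.02, safety_checks=passing)
```

Treating an exception as a failed check is what makes the gate fail closed: a proof that times out blocks admission just as a counterexample would.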
10. Bayesian harness-fitness surrogate
A Bayesian linear regression over effect-coded genome features plus all fifteen pairwise interactions predicts a configuration's fitness without running it, carrying a posterior mean and variance that drive Expected-Improvement search. Its coefficients map directly onto the ablation's quantities (marginal contribution and difference-in-differences), so the predictive model and the statistical study become one object. An optional online-Bayesian-optimization proposer uses it to choose the next genome to evaluate.
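Two pieces of the surrogate can be written with the standard library alone: the feature map and the acquisition function. The sketch assumes six binary harness flags, which is what fifteen pairwise interactions implies; the flag order is arbitrary. The regression fit itself is omitted, so the posterior mean and standard deviation are passed in as numbers.

```python
import math
from itertools import combinations


def encode(genome):
    """Effect-code booleans as +1/-1, then append every pairwise product."""
    main = [1.0 if flag else -1.0 for flag in genome]
    pairs = [a * b for a, b in combinations(main, 2)]
    return main + pairs


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximisation under a Gaussian posterior."""
    if sigma <= 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf


features = encode((True, False, True, True, False, False))
ei_certain = expected_improvement(mu=0.5, sigma=0.0, best=0.4)
ei_at_best = expected_improvement(mu=0.4, sigma=0.1, best=0.4)
ei_uncertain = expected_improvement(mu=0.4, sigma=0.3, best=0.4)
```

Under ±1 coding, a main-effect coefficient is half the on-minus-off marginal contribution, and an interaction coefficient is a quarter of the corresponding difference-in-differences, which is the sense in which the regression and the ablation share quantities. Expected Improvement grows with posterior uncertainty, so a proposer built on it favours untested configurations as well as promising ones.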
Live demos & artifacts
Five artifacts run on HuggingFace, all deterministic, free, and offline. Every number is seeded-synthetic and labelled as such; a provenance ledger in the repo separates demonstration from measurement, and real numbers require a cost-capped live run.
- Interactive Space — three Plotly tabs: the ablation forest plot, the governed-evolution fitness trace, and a surrogate explorer.
- Harness-fitness surrogate — the Bayesian model above, with a card that stays honest about being synthetic-trained until run on a live ablation sweep.
- Datasets: belnap-contested-questions (110 propositions), mast-failure-mode-gold (25 labelled traces), swe-bench-mini (22 tasks).