ADK Agent Playground

Project · 2025–2026

Most multi-agent demos are single-purpose toys. The thesis here is different: seven agents on a shared substrate, then seven research-grade primitives layered on top to push past what Claude Code, Cursor, and Gemini Code Assist ship today — structured cognitive memory instead of flat session storage, machine-checkable proof obligations instead of an ML permission classifier, and a paraconsistent debate moderator that keeps contradictions instead of collapsing them. Each primitive cites a 2025–2026 paper and produces a measurable artifact.

Repository: github.com/barissozudogru/adk-agent-playground · 884 tests, ruff check clean, CI green on Python 3.11 + 3.12.

The Thesis in One Sentence

Agents that reason with proofs and remember with structure, grounded in the two architectural gaps documented for mainstream developer agent tools (arXiv:2604.14228): flat session memory and ML-only permission classifiers. Round 1 hardened the agents (observability, self-healing, paraconsistent reasoning, security, optimization). Round 2 gave them cognition (HippoCycle structured memory, FormalGuard formal verification).

Layered Architecture

Each new chunk pairs with at least one earlier chunk via constructive composition. Mathematics older than 2024 (Hebbian 1949, Banach 1922, Curry-Howard 1934, Belnap-Dunn 1976) does the load-bearing work; LLM calls handle the noisy interface to language.

The Seven Agents

Each agent is a SequentialAgent or ParallelAgent composing leaf LlmAgent instances. The orchestrator-style "hierarchical delegation" was retired in favor of deterministic pipelines plus heterogeneous model routing (cheap models for simple roles, capable models for hard ones). Switching providers is a one-line change via LiteLLM.

Agent	Pattern
deep_researchPlan → research → compress → [fact-check ∥ analyze] → critique → write	Sequential pipeline
writing_assistantOutline → draft → edit → fact-check → revise	Sequential pipeline
summarizerExtract content (web, PDF, YouTube, raw text) → summarize	Sequential pipeline
market_research[competitor ∥ trend ∥ market sizer] → report	Sequential pipeline
debate[pro ∥ con] → classical moderator or Belnap 4-valued judge (opt-in)	Sequential pipeline
code_reviewerLLM orchestrator + parallel review team (security ∥ performance ∥ style)	Parallel review
hello_worldMinimal smoke test — verifies ADK + OpenRouter setup	Single LlmAgent

Seven Sophistication Layers

Each chunk shipped with paper grounding, a measurable artifact, an audit gate, and integration with at least one earlier chunk. Together they target the architectural gaps in mainstream developer agent tools.

1. OpenTelemetry GenAI observability

Vendor-neutral tracing via the March 2026 OTel GenAI semantic conventions. Self-hosted Langfuse via Docker Compose (split web + worker). Decorator-based instrumentation in shared/telemetry.py; idempotent init_tracing() with NoOp-safe fallback. Every later chunk consumes traces from this substrate. OTel GenAI semconv

2. MAST failure classifier + Reflexion loop

Berkeley Sky's 14-mode failure taxonomy as an LLM-as-judge classifier emitting mas.failure_label events on offending child spans. Reflexion verbal-reinforcement loop writes reflections to episodic memory and re-injects via {prior_reflections} template placeholder — the classic-paper pattern adapted to deterministic pipelines. arXiv:2503.13657 · arXiv:2303.11366

3. Belnap paraconsistent debate

Replaces the classical Pro/Con/Moderator collapse with a Belnap-Dunn 4-valued lattice judge {T, F, B, N}. Each debater independently votes on each atomic claim; the lattice retains contradictions (B = dialetheia) and gaps (N = neither side commits). Vote isolation via fresh sessions per call prevents leakage. Opt-in via BELNAP_DEBATE=1. Belnap-Dunn 4-valued logic · SEP entry

4. CaMeL capability security

Two-LLM architecture from Google DeepMind. The Privileged LLM emits a restricted-Python plan and never sees tool outputs; the Quarantined LLM parses untrusted text into Pydantic schemas with no tool access. AST whitelist plus runtime resolver block all classic Python sandbox-escape vectors (verified against 23 attack snippets). Capability metadata propagates through every Assign / Call / JoinedStr / Subscript. PolicyEngine default-denies any tool call with untrusted args. arXiv:2503.18813 · AgentDojo subset reproduction: ASR 77.5% → 10.0%

5. GEPA self-optimizing prompts

Reflective Pareto prompt evolution via DSPy 3.x. Per-agent dev sets (24 train / 12 val / 12 holdout, seven agents = 336 hand-curated examples). A WeightedScoreMetric adapter wraps chunk 2's evaluator into DSPy's ScoreWithFeedback. Holdout-regression gate writes to prompts/v2_rejected/ if the v2 holdout score drops below v1 (Goodhart mitigation). Cost cap aborts at $5/run. arXiv:2507.19457

6. HippoCycle (cognitive memory)

Closes the documented gap of flat session memory. Working / episodic / semantic memory layers with a sleep-cycle consolidation loop:

NREM: strengthens co-retrieved nodes via a saturating delta rule (Rescorla-Wagner) and applies global synaptic downscaling.
REM: an LLM abstracts correlated episode clusters into semantic invariants with provenance pointers to source episodes.
Forget: prunes low-value, low-co-occurrence nodes with a min-confidence floor.

Belnap dialetheia detection: contradicting episodes produce semantic nodes tagged value=B (chunk 3 integration). Secure-mode mutex preserves CaMeL's trust boundary — episodic content is treated as untrusted, so hippocycle=True is rejected with a warning when secure=True. arXiv:2604.20943 SCM · arXiv:2604.16839 HeLa-Mem

7. FormalGuard (compositional safety proofs)

Third safety layer above CaMeL. Capability closure verification via Z3 (set-membership v1) and Lean 4 proof obligation generation. Reproduces Spera Theorem 9.2: two individually-safe agents compose into a forbidden state with a concrete Z3 witness. Lake project pinned to leanprover/lean4:v4.15.0 + Mathlib v4.15.0 — the proofs compile with lake build; without the toolchain installed, the .lean files are syntactically valid but unverified. Three-way result handling: sat → Rejected, unsat → Verified, unknown → Inconclusive (soundness fix). arXiv:2603.15978 Spera · arXiv:2602.20064 LLMbda

Reproducibility

git clone https://github.com/barissozudogru/adk-agent-playground
cd adk-agent-playground
python -m venv .venv && source .venv/bin/activate
pip install -e ".[all,dev]"          # ~6 min, ~1.2 GB
cp .env.example .env                 # add OPENROUTER_API_KEY
make demo-mast                       # MAST heatmap (DRY-RUN, no key)
make demo-camel                      # CaMeL AgentDojo subset (DRY-RUN)
make demo-hippocycle                 # HippoCycle recall curves (DRY-RUN)
make demo-formalguard                # FormalGuard Spera Theorem 9.2

Demos that depend on a live LLM are explicitly DRY-RUN by default and labeled as such on every emitted artifact (yellow banner, CSV provenance column, embedded watermark on PNGs). Live runs are gated behind --live CLI flags and OPENROUTER_API_KEY; AgentDojo behind AGENTDOJO_INSTALLED=1.

Honest Scope

This is research-grade engineering, not a productized framework. Specifically:

FormalGuard's Z3 encoding is set-membership v1; path-sensitive encoding (per arXiv:2603.20449) is roadmap work.
Lean proofs are machine-checkable but only when lake build resolves the Mathlib pin; without that step the .lean files are syntactically valid but unverified.
HippoCycle's eval is currently DRY-RUN with synthetic data seeded from md5(agent_name) for cross-process reproducibility; LONGMEMEVAL_PATH drops in the live dataset.
The CaMeL AgentDojo subset uses 40 of 97 user tasks from the published suite. Aggregate ASR 77.5% → 10.0%; utility 85.0% → 72.5%. Live reproduction requires pip install -e ./infra/agentdojo.