TECHNICAL WHITE PAPER
Mermaid Diagram Driven Development: Verifiable Semantic Networks for Agentic Software Engineering
Abbreviation: MDD
Date: May 2026
Classification: Open Technical Standard Proposal
Target Audience: AI Systems Architects, Agent Runtime Teams, Formal Methods Engineers, Software Engineering Researchers
Abstract
Current LLM-based software engineering workflows rely heavily on ambiguous Markdown documents, loosely structured prompts, and post-hoc tests. These artifacts are readable but weakly verifiable. They allow semantic drift between requirements, diagrams, agent plans, code, tests, and explanations.
Mermaid Diagram Driven Development (MDD) is a graph-first methodology for transforming source documents into auditable, verifiable development networks. MDD treats Mermaid diagrams as the first structured abstraction above source material. Mermaid nodes and edges preserve provenance back to original text so humans and agents can audit architectural, behavioral, and explanatory claims.
Mermaid is not the final formal system. Mermaid is compiled into a typed intermediate representation, MDD-IR, which becomes the canonical substrate for deterministic verification:
- LLMs extract semantics, propose finite domains, identify ambiguity, explain counterexamples, and suggest graph repairs.
- TLA+ verifies state integrity, temporal behavior, safety/liveness, deadlock freedom, and legal transitions.
- Z3 checks bounded paths, reachability, forbidden-route UNSAT results, and coverage witnesses.
- Lean4 checks semantic extrapolations, refinement claims, and explanation rules.
mdd-engineperforms heavyweight compilation and verification.mddis the lightweight deterministic traversal CLI for agents using prebuilt verified bundles.
MDD reframes autonomous software development as a pipeline of source-backed graph abstraction, state-space verification, path solving, semantic proof, proof-carrying compilation, and deterministic runtime traversal.
1. Core Thesis
Mermaid diagrams are the first auditable abstraction from source documents. LLMs provide semantic interpretation. Deterministic tools verify the consequences of that interpretation.
The core split is:
LLM = semantics
Mermaid = auditable graph abstraction
MDD-IR = typed canonical graph
TLA+ = state integrity
Z3 = verifiable paths
Lean4 = semantic extrapolation/proof
mdd-engine = heavyweight compiler/verifier
mdd = lightweight runtime traversal CLI
Compactly:
LLMs give states meaning.
Mermaid makes meaning visible.
MDD-IR makes meaning typed.
TLA+ keeps states honest.
Z3 proves paths exist or do not.
Lean4 checks extrapolated meaning.
mdd keeps agents inside the verified design space.
MDD should not claim to prove a whole implementation correct. It verifies a design-level abstraction and emits artifacts that guide tests, assertions, explanations, and agent traversal.
2. System Architecture
Source docs / code / issues
↓
LLM semantic extraction with provenance
↓
Mermaid audit graph
↓
Typed MDD-IR
↓
TLA+ state integrity checks
↓
Z3 path witnesses / forbidden-route checks
↓
Lean4 semantic proof obligations
↓
Verified .mddbundle
↓
mdd runtime traversal for AI agents
The pipeline has two CLIs:
mdd-engine = heavy semantic/formal compiler
mdd = lightweight verified graph traversal VM
The split is essential. Full verification may require LLMs, Java/TLA+, Apalache, Z3, Lean4, Lake, theorem-proving libraries, and expensive model-checking runs. Many AI agent environments cannot install or run all of that. They can often run a small CLI and sometimes a Python Z3 package. Therefore mdd-engine performs verification once and emits a portable .mddbundle; mdd lets many downstream agents safely reuse it.
3. Mermaid as Auditable Abstraction
Mermaid is valuable because it is compact, readable, diffable, and LLM-friendly. It gives humans a reviewable surface between prose and formal models.
However:
Valid Mermaid syntax is not semantic correctness.
Every graph element should preserve provenance:
node:
id: Authenticated
label: Authenticated Session
provenance:
file: docs/auth.md
lines: [42, 47]
status: source_explicit
edge:
id: refresh_invalid_logout
from: TokenExpired
to: Unauthenticated
event: Refresh token invalid, revoked, replayed, or malformed
guard: refresh_case != valid
provenance:
file: docs/auth.md
lines: [52, 55]
status: source_inferred
Required provenance labels:
source_explicit = directly supported by source
source_inferred = inferred from source, not directly stated
llm_hypothesis = proposed by LLM; needs review/checking
human_confirmed = confirmed by reviewer/user
4. MDD-IR
MDD-IR is the canonical typed representation. It separates Mermaid syntax from formal semantics.
MDD-IR should contain:
- graph kind: state machine, workflow, dependency graph, protocol, explanation graph
- stable node IDs and edge IDs
- labels and human descriptions
- source spans and provenance status
- actors/events
- guards, preconditions, postconditions
- finite semantic domains
- environment actions
- safety invariants
- liveness candidates and fairness assumptions
- forbidden transitions
- coverage obligations
- backend projection hints for TLA+, Z3, Lean4, tests, and runtime
Example finite domain:
refresh_case:
values: [valid, expired, revoked, replayed, malformed]
properties:
- revoked token never reaches Authenticated
- replayed token reaches Unauthenticated
- valid token may re-enter Authenticating
This finite-domain step is crucial. Vague prose such as “handle auth edge cases” is not checkable. MDD turns it into bounded state/path obligations.
5. LLM Role
LLMs are semantic assistants, not proof engines.
Good LLM responsibilities:
- extract candidate states, events, actors, and edges
- map source text to graph elements
- identify ambiguity and missing cases
- propose finite edge-case domains
- propose safety/liveness properties
- explain model-checker counterexamples
- suggest graph repairs
- generate human-facing documentation from checked artifacts
Forbidden LLM responsibilities:
- inventing unsupported requirements without provenance labels
- treating Mermaid syntax as proof
- declaring a path valid without deterministic checking
- treating a semantic extrapolation as verified without Lean4 or another checker
- bypassing
mddwhen a transition is rejected
Prompt rule:
Your output is a semantic proposal, not proof.
Preserve provenance and produce typed obligations for deterministic checking.
6. TLA+ for State Integrity
TLA+ is the state-space integrity layer. MDD projects MDD-IR into TLA+ when the graph describes behavior over time.
Mapping:
Mermaid nodes → finite States
Mermaid edges → actions in Next
current node → state variable
previous node → prev variable when needed
guards/domains → constants and variables
failure events → environment actions
safety claims → invariants
liveness claims → temporal properties/fairness
TLA+ checks questions like:
- Can the system deadlock?
- Can a forbidden state be reached?
- Can a terminal state have outgoing transitions?
- Can a retry loop continue forever under stated assumptions?
- Can an agent deploy without test success?
- Does every allowed path preserve the declared invariant?
TLA+ should not model everything. Following the AWS lesson, the model should capture the smallest dangerous state space that can expose design bugs.
7. Z3 for Verifiable Paths
Z3 handles bounded path satisfiability and route constraints.
Z3 answers:
- Is there a path from A to B under these guards?
- Is this forbidden route UNSAT?
- What witness path covers this semantic case?
- What minimal path set covers all domain values?
- Can an agent reach the requested target without violating constraints?
Example result:
{
"query": "cover refresh_case = replayed",
"result": "sat",
"witness_path": ["Authenticated", "TokenExpired", "Unauthenticated"],
"edges": ["ttl_exceeded", "refresh_invalid_logout"]
}
Z3 is also useful in mdd when lightweight runtime path checks are needed and the certified bundle allows bounded queries.
8. Lean4 for Semantic Extrapolation
LLMs often generalize from examples:
All non-valid refresh token cases log out.
MDD treats such statements as proof obligations, not conclusions. Lean4 checks whether the generalized claim follows from formalized definitions.
Example obligation:
inductive RefreshCase where
| valid | expired | revoked | replayed | malformed
deriving DecidableEq, Repr
inductive AuthState where
| unauthenticated | authenticating | authenticated | tokenExpired
deriving DecidableEq, Repr
def refreshTransition (r : RefreshCase) : AuthState :=
match r with
| RefreshCase.valid => AuthState.authenticating
| _ => AuthState.unauthenticated
theorem non_valid_refresh_logs_out
(r : RefreshCase)
(h : r ≠ RefreshCase.valid) :
refreshTransition r = AuthState.unauthenticated := by
cases r <;> simp [refreshTransition] at *
Lean4 is not a semantic oracle. It checks only formalized definitions and theorems. The LLM may propose those definitions, but deterministic tooling must check them.
9. AWS-Derived Principles
The AWS paper “Use of Formal Methods at Amazon Web Services” (Newcombe et al., CACM 2015, https://cacm.acm.org/research/how-amazon-web-services-uses-formal-methods/) teaches that formal methods work in industry when framed as debugging designs.
AWS found that prose, diagrams, reviews, tests, and fault injection were necessary but not sufficient. Subtle bugs often lived in the design itself and appeared only under rare interleavings of failures, retries, crashes, recoveries, message reorderings, and operator actions.
MDD adopts this lesson:
MDD is not diagram drawing.
MDD is source-backed, exhaustively testable design debugging for agentic systems.
9.1 Start with “what must go right?”
Before drawing the happy path, MDD prompts should ask for:
- safety invariants
- liveness candidates
- forbidden transitions
- recovery obligations
- environment assumptions
- source provenance for each claim
Prompt block:
Before producing Mermaid, identify what must go right.
List safety properties, liveness candidates, forbidden transitions, and assumptions.
Mark every item as source_explicit, source_inferred, llm_hypothesis, or human_confirmed.
9.2 Model the environment with the system
A model containing only the application happy path is misleading. Environment actions must be first-class.
For agentic software, the environment includes:
network_timeout
message_loss
message_duplicate
message_reorder
process_crash
process_restart
external_service_error
operator_action
user_correction
permission_denial
tool_failure
malformed_tool_output
stale_memory
conflicting_instruction
llm_invalid_intent
retry/backoff
repair/recovery
Prompt block:
Expand the graph with environment actions.
Do not model only the happy path.
For each environment action, specify actor, trigger, affected state, guard, postcondition, and provenance.
9.3 Prefer small bounded dangerous models
AWS found serious bugs in relatively small models. MDD should not reproduce every implementation detail. It should capture the essential state space where correctness can fail.
Prompt block:
Build the smallest finite model that can expose the dangerous behavior.
Omit implementation details unless they affect a stated property.
State every abstraction decision and omission.
9.4 Counterexamples are product artifacts
MDD should translate verifier results into useful artifacts:
TLA+/Z3 counterexample
→ Mermaid sequence/state trace
→ violated-property explanation
→ graph repair proposal
→ regression test
→ runtime assertion
→ updated bundle after re-verification
Counterexamples should be classified as:
product_design_bug
graph_bug
formalization_bug
abstraction_gap
9.5 Generate tests and assertions
Formal models do not prove the implementation matches the model. They do help generate better assertions and edge-case tests.
MDD should emit:
- tests from Z3 witnesses
- tests from TLA+ counterexamples
- tests from coverage gaps
- assertions from verified invariants
- runtime checks for transition legality
9.6 Know the scope
MDD is strongest for:
state validity
path validity
forbidden transitions
workflow correctness
recovery logic
agent traversal constraints
concurrency/interleaving risks
semantic edge-case coverage
source-backed design documentation
MDD is weaker unless separately modeled:
emergent performance collapse
queueing/load dynamics
hard real-time latency
probabilistic availability
economic/adversarial incentives
implementation correctness without tests/assertions/instrumentation
10. Reusable LLM Prompt Preamble
You are assisting with Mermaid Diagram Driven Development (MDD).
Your job is to convert source material into auditable graph semantics that can be checked by deterministic tools.
Do not treat your output as proof. Your output is a semantic proposal that must preserve provenance and be compiled into typed MDD-IR, then checked by TLA+, Z3, Lean4, or deterministic CLI logic.
Prefer small bounded models that capture dangerous state/path behavior over large implementation-level models.
Always model the system together with its environment: failures, retries, crashes, restarts, external services, operator actions, user corrections, tool errors, and LLM decision points.
Start from “what must go right?” before enumerating “what might go wrong?”
Extraction prompt:
Given the source material, extract an MDD graph proposal.
Return:
1. Mermaid diagram with stable node/edge IDs.
2. Node table with type, label, terminal flag, and source provenance.
3. Edge table with event, actor, guard, postcondition, and source provenance.
4. Environment actions.
5. Safety invariants.
6. Liveness candidates and fairness assumptions.
7. Forbidden transitions.
8. Finite semantic domains.
9. Ambiguities and assumptions.
10. Suggested TLA+/Z3/Lean obligations.
Rules:
- Do not invent unsupported requirements.
- Mark unsupported proposals as llm_hypothesis.
- Prefer finite domains and explicit guards over vague prose.
- Model the environment together with the system.
- Mermaid is audit syntax, not proof.
Runtime prompt for agents using mdd:
You are an LLM agent operating under MDD runtime constraints.
Use the verified .mddbundle as the authority for legal traversal.
You may propose an intent, but mdd determines whether the transition is allowed.
If mdd rejects a transition, do not bypass it. Explain the rejection, inspect allowed alternatives, and choose a legal next step or ask for human review.
Before each action:
1. State current graph node/state.
2. State intended transition and why.
3. Ask mdd whether the transition is allowed.
4. If allowed, proceed and record trace.
5. If rejected, use the rejection reason and choose a legal path.
11. mdd-engine
mdd-engine is the heavyweight compiler/verifier.
Responsibilities:
- ingest source docs, code, tickets, and diagrams
- call LLMs for semantic extraction
- generate Mermaid audit graphs
- enforce provenance requirements
- parse Mermaid deterministically
- compile typed MDD-IR
- generate TLA+, Z3, and Lean4 artifacts
- run model checking and solver queries
- collect counterexamples and witnesses
- render counterexamples back into Mermaid
- emit tests and assertions
- sign and package
.mddbundle
Example:
mdd-engine ingest docs/auth.md --out graphs/auth.mmd
mdd-engine compile graphs/auth.mmd --out build/auth.ir.json
mdd-engine verify build/auth.ir.json --tla --z3 --lean
mdd-engine bundle build/auth.ir.json --out dist/auth.mddbundle
mdd-engine sign dist/auth.mddbundle --cert dist/auth.mddcert
The output is a proof-carrying bundle containing compiled graph data, verification metadata, transition tables, path constraints, provenance, tests, assertions, and optional audit artifacts.
12. mdd
mdd is the lightweight runtime traversal CLI. It is not a full verifier. It consumes already-verified bundles and enforces deterministic traversal.
Responsibilities:
- load
.mddbundle - validate bundle hash/signature
- expose current states, nodes, edges, and allowed actions
- check proposed transitions
- reject invalid paths
- explain rejection reasons
- record runtime traces
- optionally run bounded Z3 queries if available and certified
- return machine-readable JSON to agents
Example commands:
mdd state dist/auth.mddbundle
mdd next dist/auth.mddbundle \
--state TokenExpired
mdd check-transition dist/auth.mddbundle \
--from TokenExpired \
--to Authenticated \
--event refresh_token_valid
mdd explain dist/auth.mddbundle \
--edge refresh_invalid_logout
mdd trace dist/auth.mddbundle \
--input runtime-trace.jsonl
Runtime rule:
The LLM proposes intent.
mdd computes legal transitions.
The LLM chooses among legal transitions.
mdd records the trace.
If mdd rejects a transition, an agent must not bypass it. It should inspect allowed alternatives, explain the rejection, repair the graph through mdd-engine, or ask for human review.
13. .mddbundle
A .mddbundle is the portable verified artifact emitted by mdd-engine and consumed by mdd.
Suggested contents:
bundle.json metadata, hashes, versions
source-map.json source spans and provenance
mdd-ir.json canonical typed graph
transition-table.json deterministic allowed transitions
guards.json guard expressions or compiled predicates
path-constraints.json Z3-ready constraints or precomputed witnesses
invariants.json checked safety properties
liveness.json declared liveness/fairness assumptions
coverage.json semantic coverage obligations/results
counterexamples/ traces and explanations
tests/ generated regression tests
assertions/ generated runtime assertions
certs/ verification certificates / signatures
audit/ optional Mermaid, TLA+, Lean4, solver logs
Runtime-critical files should be small and deterministic. Heavy formal artifacts may be included for audit but are not required for mdd traversal.
14. Proof-Carrying Semantic Coverage
MDD introduces proof-carrying semantic coverage: each meaningful edge case is tracked from source text through graph representation, formal checks, generated tests, and runtime traces.
Example:
{
"semantic_case": "refresh_case = replayed",
"source": "docs/auth.md:52-55",
"mermaid_edge": "refresh_invalid_logout",
"tla_status": "safe",
"z3_witness_path": ["Authenticated", "TokenExpired", "Unauthenticated"],
"lean_theorem": "non_valid_refresh_logs_out",
"generated_test": "refresh_replayed_logs_out.test.ts",
"runtime_assertion": "no_authenticated_after_replayed_refresh",
"status": "covered"
}
This is stronger than ordinary code coverage. It connects meaning to source, verification, tests, and runtime behavior.
15. Related Work and Differentiation
MDD sits at the intersection of several existing areas.
Spec-driven AI development
Projects such as GitHub Spec Kit, OpenSpec, Kiro Specs, and spec-workflow-mcp validate the demand for structured AI-native development. They organize requirements, designs, and tasks, but generally remain Markdown/workflow systems rather than graph-compiled formal verification systems.
MDD difference:
Spec tools organize intent.
MDD verifies and executes intent as state/path constraints.
Agent graph runtimes
LangGraph, XState, and Stately Agent show the value of graph/state-machine control for agents and applications.
MDD difference:
Graph runtimes execute workflows.
MDD verifies source-provenanced semantic graphs before agents traverse them.
Formal methods
TLA+, Apalache, P, FizzBee, Alloy, Lean4, and related tools provide the verification backbone. AWS’s industrial use of TLA+ demonstrates that formal design models find important bugs missed by conventional methods.
MDD difference:
Formal tools require formal authoring.
MDD provides a source-doc → Mermaid → MDD-IR → formal backend → runtime bundle pipeline.
LLMs for formalization
Research on LLM-assisted formal specification, TLA+ proof generation, Lean theorem proving, and proof-carrying code completions shows that LLMs can help produce formal artifacts.
MDD difference:
LLM formalization work often targets specs/proofs directly.
MDD inserts an auditable graph layer and produces runtime traversal constraints for agents.
GraphRAG and knowledge graphs
GraphRAG extracts document graphs for retrieval and synthesis.
MDD difference:
GraphRAG answers questions from document graphs.
MDD verifies and executes state/path semantics from document graphs.
16. Contribution Claims
MDD contributes:
- Source-provenanced Mermaid abstraction: a human-auditable layer between natural language and formal models.
- Typed MDD-IR: a canonical graph representation with guards, domains, invariants, coverage obligations, and provenance.
- Multi-backend formal verification: TLA+ for state integrity, Z3 for path witnesses, Lean4 for semantic extrapolation.
- Proof-carrying semantic coverage: edge cases linked from source text to verification, tests, assertions, and runtime traces.
- Engine/runtime split:
mdd-engineperforms expensive verification;mddenforces lightweight deterministic traversal. - Agent traversal protocol: LLMs may propose intent, but deterministic runtime logic approves or rejects transitions.
- Environment-aware modeling discipline: failures, retries, tool errors, user corrections, operator actions, and LLM invalid intents are modeled with the system.
Novelty statement:
Unlike spec-driven development tools, MDD does not stop at structured Markdown artifacts. Unlike graph runtimes, MDD does not merely execute workflows. Unlike formal methods tools, MDD does not require engineers to directly author raw formal specifications. MDD introduces a source-provenanced Mermaid layer and typed graph IR that compile into formal backends and lightweight runtime bundles for AI agents.
17. Limitations
MDD has important limits:
- A verified model is not the real system.
- Source documents may be incomplete or wrong.
- LLM extraction may mislabel semantics.
- Formal checks only cover encoded assumptions and bounded domains.
- Lean4 checks formalized definitions, not informal intent.
mddverifies transitions under a bundle, not global reality.- Implementation correctness still needs code review, tests, assertions, monitoring, and integration discipline.
- Performance collapse, queueing dynamics, probabilistic availability, and economic/adversarial behavior must be explicitly modeled or marked out of scope.
MDD’s mature claim is:
MDD verifies design-level state/path semantics and constrains agent traversal under declared assumptions.
Not:
MDD proves the entire software system correct.
18. Implementation Roadmap
Phase 1: Mermaid ingestion and deterministic parsing
Phase 2: typed MDD-IR schema
Phase 3: provenance and audit UI
Phase 4: TLA+ backend for state integrity
Phase 5: Z3 backend for path witnesses
Phase 6: Lean4 proof obligations for semantic extrapolation
Phase 7: .mddbundle packaging and signatures
Phase 8: lightweight mdd runtime traversal
Phase 9: counterexample-to-Mermaid rendering
Phase 10: generated tests, assertions, and CI integration
CI gate example:
mdd-engine ci
It should fail if:
- Mermaid cannot parse
- graph elements lack provenance
- MDD-IR schema validation fails
- TLA+ invariants fail
- forbidden paths are SAT
- required coverage obligations lack witnesses
- Lean4 proof obligations fail
- generated bundle is stale relative to source hashes
19. Conclusion
MDD transforms ambiguous prose into auditable graphs, graphs into typed formal models, models into verified tests and obligations, verified artifacts into runtime bundles, and runtime traversal into logic-guided agent behavior.
The essential idea is:
mdd-engine compiles proof-carrying Mermaid diagrams.
mdd lets LLM agents traverse those diagrams without violating verified state/path constraints.
MDD is best understood as:
Mermaid-based, source-provenanced, exhaustively testable design debugging for AI-assisted software engineering.
The goal is not prettier diagrams. The goal is to find design bugs that prose, reviews, tests, and ordinary diagrams miss — and then give agents a verified design space they can traverse without inventing invalid paths.