We've Been Building an AI OS. Here's How It Compares to What Else Exists.
One year building an autonomous AI operating system — a benchmark across eight agent frameworks and an honest account of where we stand on memory, self-improvement, and blockchain ownership.
Most AI agent frameworks are brilliant libraries.
LangGraph gives you stateful graphs with production audit trails. CrewAI gives you role-based crews in 20 lines of Python. Google ADK is the reference implementation for agent-to-agent communication. AutoGen brings conversational multi-agent coordination. Microsoft Agent Framework (AutoGen + Semantic Kernel, RC February 2026), Mastra, OpenAI Agents SDK, and AWS Strands round out the eight frameworks I benchmarked. All of them are production-grade. All of them are genuinely useful.
But they share a core assumption: the agent is a runtime. You spin it up, it executes, it shuts down.
I spent the last year building from a different assumption.
Otto is not a framework. It is an AI operating system.
It runs continuously on a server. Two autonomous heartbeats per hour — one for execution, one for self-reflection. It reviews what needs to be done, creates tasks, dispatches specialist sub-agents, tracks outcomes, and refines its own behavior over time. I set the direction. It handles the steps.
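The heartbeat pattern itself is simple to sketch. Here is a minimal, hypothetical version in Python; the function names and the state dict are illustrative stand-ins, not Otto's actual internals:

```python
HEARTBEAT_INTERVAL_S = 1800  # two heartbeats per hour

def run_execution_cycle(state: dict) -> dict:
    # Review pending work, create tasks, dispatch sub-agents (stubbed).
    state["cycles"] = state.get("cycles", 0) + 1
    return state

def run_reflection_cycle(state: dict) -> dict:
    # Score recent outcomes and refine behavior (stubbed).
    state["reflections"] = state.get("reflections", 0) + 1
    return state

def heartbeat_loop(state: dict, ticks: int, sleep=lambda s: None) -> dict:
    """Alternate execution and reflection beats; `sleep` is injectable
    so the loop can be exercised without waiting."""
    for tick in range(ticks):
        if tick % 2 == 0:
            state = run_execution_cycle(state)
        else:
            state = run_reflection_cycle(state)
        sleep(HEARTBEAT_INTERVAL_S)
    return state
```

The key design property is that state survives between beats: each cycle reads and writes the same accumulating state rather than starting fresh.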
The first live product built on top of it is WebAssist, a web agency automation service. It is revenue-generating and running now, with a second product in the pipeline. Otto manages it end to end.
The critical architectural difference from every framework I have benchmarked is memory.
Most frameworks have one or two memory layers. Otto has six.
Semantic memory (vector search with importance-weighted decay). Episodic memory (timestamped events, 7-day narrative consolidation). Procedural memory (trust-scored skill records with outcome tracking). Working memory (fast key-value state across cycles). A Neo4j knowledge graph for entity relationships over time. Per-agent specialist memory files for each of the 22 active agent types.
On retrieval: multi-strategy search (semantic + keyword + graph, merged and re-ranked), embedding bias removal, and a paging hierarchy that keeps 12,000 tokens always-resident in context across sessions.
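Merging ranked results from heterogeneous retrievers is a well-studied problem, and reciprocal-rank fusion is one standard way to do it. A sketch under that assumption (the doc IDs, the fusion constant, and the choice of RRF are illustrative; the post does not specify Otto's actual re-ranker):

```python
from collections import defaultdict

def merge_and_rerank(result_lists, k: int = 60):
    """Reciprocal-rank fusion: a document's score is the sum of
    1/(k + rank) across every ranked list it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["m42", "m17", "m03"]   # vector-search hits
keyword  = ["m17", "m99"]          # keyword-search hits
graph    = ["m03", "m17"]          # knowledge-graph neighbors

print(merge_and_rerank([semantic, keyword, graph]))
# → ['m17', 'm03', 'm42', 'm99']
```

Documents surfaced by multiple strategies ("m17" here) float to the top even when no single retriever ranked them first, which is the point of fusing strategies instead of trusting any one of them.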
No benchmarked framework has more than two of these simultaneously.
This is not a feature list. It is the difference between a system that forgets between sessions and one that compounds.
The self-improvement loop is the piece I find most interesting.
RL2F: every decision scored against actual outcomes. MARS: adversarial reflection after each reasoning cycle — the system argues against its own conclusions before committing. AutoEvolve: prompts and workflows mutate between runs based on performance scores, with fitness tracked across generations.
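The post gives no implementation details for AutoEvolve, but the shape of a prompt-mutation loop with fitness tracking is standard evolutionary search. A toy sketch with an invented fitness function and invented mutation operators (nothing here is Otto's actual scoring):

```python
import random

def fitness(prompt: str) -> float:
    # Invented stand-in: in a real system this would be an
    # outcome-weighted score from completed tasks.
    return prompt.count("step") + 0.1 * len(prompt.split())

def mutate(prompt: str, rng: random.Random) -> str:
    additions = ["Think step by step.", "List each step.", "Check your work."]
    return prompt + " " + rng.choice(additions)

def evolve(seed_prompt: str, generations: int = 5, pop: int = 4, seed: int = 0):
    """Mutate prompts between runs, keep the fittest, track fitness
    per generation."""
    rng = random.Random(seed)
    population = [seed_prompt]
    history = []
    for _ in range(generations):
        candidates = population + [mutate(p, rng) for p in population]
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:pop]            # survivors
        history.append(fitness(population[0]))   # fitness across generations
    return population[0], history
```

Because each generation's candidates include the previous survivors, best-of-generation fitness is monotone non-decreasing, which is what a fitness-tracking curve should show.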
I benchmarked eight frameworks across 13 dimensions. On structured self-improvement loops — feedback loops, adversarial reflection, prompt mutation, fitness tracking, outcome-weighted scoring — every external framework scores zero. Otto scores five of five. That is a structural gap, not a tuning difference.
There is a category I left out of the first version of this piece.
Readers flagged it immediately: AI harnesses. Not orchestration frameworks — a distinct architectural pattern. Tools built around persistent scheduling, messaging-native design, and always-on personal agents rather than developer APIs.
OpenClaw is the clearest example. Built by Peter Steinberger (PSPDFKit founder), it has accumulated more than 150,000 GitHub stars since late 2025 — one of the fastest-growing AI projects ever. Its architecture: a Gateway routing messages from 50+ channels, a ReAct brain, flat file memory in Markdown and SQLite, a heartbeat scheduler for background tasks, and a marketplace of 5,700+ community-contributed skills.
That architecture is the closest thing to Otto that exists publicly. Same heartbeat-first design. Same always-on posture. Same multi-channel messaging foundation.
The difference is in the memory and the learning loop. OpenClaw memory is flat files — persistent, but not structured for retrieval at scale. There is no self-improvement mechanism. The skills ecosystem is vast, but the agent itself does not compound. It is an excellent personal agent harness. It is not a cognitive system.
OpenAI Symphony, released March 2026, takes a different approach: an Elixir/BEAM runtime for fault-tolerant parallel execution, with each coding task running as a deterministic, isolated job. It is optimized for those isolated runs, not designed around continuous autonomous operation.
The harness category matters because it represents the same architectural bet I made — that agents should run continuously, not be invoked. The frameworks (LangGraph, CrewAI, Google ADK) are building toward that. The harnesses started there.
Where we fall short, honestly.
OpenTelemetry — our prior gap — is now shipped. telemetry.py, FastAPIInstrumentor, structured trace exporters. The internal management dashboard surfaces MARS scores, reasoning chain history, and task outcomes. The Tier-1 frameworks (Google ADK, AWS Strands, Microsoft Agent Framework) ship OTel natively; we have caught up on this dimension.
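For readers unfamiliar with the pattern, the standard OpenTelemetry-plus-FastAPI setup looks roughly like this. This is a generic sketch of the public OTel API, not Otto's actual telemetry.py:

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Wire a tracer provider to an exporter (Console here; in production
# this would be an OTLP exporter pointed at a collector).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-traces every request
```

Swapping the exporter is the only change needed to move from local debugging to a hosted tracing backend.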
What remains: multi-LLM routing. Otto runs primarily on Claude. The architecture supports multiple providers but does not switch dynamically between models based on task type. In an environment where different models lead on different benchmarks, that is a ceiling we will need to raise.
Community ecosystem scale is the other honest gap. OpenClaw has 5,700+ community-contributed skills. Our agent catalog has 182 entries. The infrastructure is different — our agents carry full reasoning context and memory access that skills plugins do not — but the raw ecosystem breadth comparison is real.
The full comparison matrix across 13 dimensions is available — DM me and I will send it directly.
The blockchain roadmap is what I have not talked about publicly yet.
Every major AI agent framework treats ownership as someone else's problem. The framework belongs to the company. Memory belongs to the API provider. Compute belongs to the cloud. Whoever runs the orchestration layer owns the intelligence layer.
We are building toward something architecturally different.
The roadmap includes agent identity on-chain — each agent carries a portable, verifiable EVM-compatible identity not stored in our database. A contribution attestation mechanism via smart contracts records every task completed, creating an immutable record of who built what and how well. Governance weight that accumulates through verified contribution rather than capital position.
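Setting the on-chain mechanics aside, the attestation data model can be sketched off-chain as an append-only, hash-linked log. A real implementation would live in an EVM smart contract; this Python sketch shows only the tamper-evidence property, and every field name is an assumption:

```python
import hashlib
import json

GENESIS = "0" * 64

def attest(log: list, agent_id: str, task: str, score: float) -> dict:
    """Append a record whose hash covers its content and the
    previous record's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    record = {"agent": agent_id, "task": task, "score": score, "prev": prev}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record

def verify(log: list) -> bool:
    """Recompute every link; any edit to any record breaks the chain."""
    prev = GENESIS
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if body["prev"] != prev:
            return False
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if digest != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

log = []
attest(log, "agent-7", "deploy-site", 0.92)
attest(log, "agent-3", "write-copy", 0.81)
assert verify(log)
log[0]["score"] = 1.0   # tamper with history
assert not verify(log)
```

A contract gives the same guarantee plus a neutral host: no single party, including us, can quietly rewrite who contributed what.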
The goal: an AI OS that no single entity owns, where the agents and contributors who do the work earn governance over the system they built.
This is Phase 3 of the roadmap. What we are running now is Phase 1 — single-server, operational, improving. The architecture is designed for what comes after.
The short version:
The 2026 AI agent landscape splits into three categories. Frameworks — excellent at giving developers tools to build with, but stateless by default. Harnesses — always-on agents built around scheduling and messaging, persistent but not compounding. And a third thing: a decentralized OS with no single owner, where the agents and contributors who do the work govern the system they built. That third category does not exist yet at scale. It is what we are building toward.
Frameworks. Harnesses. A decentralized OS with no single owner. We are two of three.
I share the technical specifics because the most consequential AI architecture decisions happening right now are largely invisible. Most of what is being built is powerful. Most of it will also be owned by a small number of entities in ten years.
The ownership question is worth thinking through before the architecture is locked. What are you building on?
→ If this resonated: follow for the technical build log.