2026-06-04 · locomo10_full_report.json
LoCoMo-10 benchmark report
Condensate was scored on the public LoCoMo long-memory suite alongside transcript replay, observation-list baselines, and published leaderboard references. This page summarizes the June 2026 full run: 10 conversations, about 2,000 questions, one ingest method per conversation.
What LoCoMo measures
Failure modes the suite targets:
- Missed facts: the retrieved context never contains the answer, or holds an older version.
- Stale contradictions: the user updates a detail; both old and new values stay in memory.
- Token load: sending the full transcript each turn scales cost and latency with thread length.
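The token-load point can be made concrete with a small sketch. It assumes a hypothetical thread of equal-length turns and a fixed memory budget per turn; the turn length and the function names are illustrative and do not come from the benchmark.

```python
def transcript_replay_tokens(turn_tokens):
    """Total tokens sent when the whole transcript so far is attached on every turn."""
    total = 0
    running = 0
    for t in turn_tokens:
        running += t      # the transcript grows by this turn
        total += running  # the entire transcript is resent
    return total


def fixed_memory_tokens(turn_tokens, memory_budget):
    """Total tokens sent when each turn carries its own text plus a bounded memory block."""
    return sum(t + memory_budget for t in turn_tokens)


# Hypothetical thread: 400 turns of 50 tokens each (not benchmark data).
long_thread = [50] * 400
replay_long = transcript_replay_tokens(long_thread)
bounded_long = fixed_memory_tokens(long_thread, memory_budget=1750)

# Hypothetical short thread: 10 turns of 50 tokens each.
short_thread = [50] * 10
replay_short = transcript_replay_tokens(short_thread)
bounded_short = fixed_memory_tokens(short_thread, memory_budget=1750)

print(replay_long, bounded_long)
print(replay_short, bounded_short)
```

Replay cost grows roughly with the square of thread length, while the bounded-memory cost grows linearly. On short threads the fixed memory block can cost more than replay, so the advantage only shows up as conversations get long.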
Targets and June 2026 scores
Internal target: at least 85% retrieval recall with under 7,000 tokens of memory per question.
Published reference (LoCoMo leaders): about 92.5% recall, about 6,956 tokens per question.
Condensate recall: 82.0%, 3.0 points below the 85% target; transcript baseline 80.4%.
Condensate memory: ~1,749 tokens per question, under the 7k cap; transcript average ~20,476.
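The target check is simple arithmetic on figures reported on this page (Condensate at 82.0% recall and ~1,749 tokens per question, transcript replay at ~20,476 tokens). A minimal sketch, with variable names chosen here for illustration:

```python
RECALL_TARGET = 85.0  # percent, internal target
TOKEN_CAP = 7000      # tokens of memory per question, internal cap

condensate = {"recall": 82.0, "tokens": 1749}
transcript = {"recall": 80.4, "tokens": 20476}

# Distance to the recall target, in percentage points.
recall_gap = round(RECALL_TARGET - condensate["recall"], 1)

# Whether the memory budget stays under the cap.
under_cap = condensate["tokens"] < TOKEN_CAP

# How many times more memory tokens transcript replay sends per question.
token_ratio = transcript["tokens"] / condensate["tokens"]

print(recall_gap, under_cap, round(token_ratio, 1))
```

The ratio comes out a little under 12, which is where the "about 12x lower" token figure comes from.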
Baseline definitions
| Approach | Mechanism | Recall | Tokens per question |
|---|---|---|---|
| Full transcript | Attach the full chat log on each turn | 80.4% | ~20,500 |
| Observation list | Append-only extracted bullets | 69.2% | ~6,300 |
| Structured notes | Organized store without retiring stale rows | 80.4% | ~22,100 |
| Industry reference | Figures from published LoCoMo leaderboard entries | ~92.5% | ~7,000 |
| Condensate | Dated facts with supersession; fair per-conversation ingest | 82.0% | ~1,749 |
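The difference between an append-only observation list and dated facts with supersession can be sketched as follows. This is an illustration of the general idea only, not Condensate's implementation; the `Fact` fields, class names, and example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    key: str     # what the fact is about, e.g. "user.city" (hypothetical key scheme)
    value: str
    as_of: date  # when the user stated it


class AppendOnlyList:
    """Keeps every extracted fact; old and new values coexist."""

    def __init__(self):
        self.rows = []

    def add(self, fact):
        self.rows.append(fact)

    def lookup(self, key):
        return [f.value for f in self.rows if f.key == key]


class SupersedingStore:
    """Keeps one current fact per key; replaced facts are retired, not returned."""

    def __init__(self):
        self.current = {}
        self.retired = []

    def add(self, fact):
        old = self.current.get(fact.key)
        if old is None or fact.as_of >= old.as_of:
            if old is not None:
                self.retired.append(old)  # newer statement supersedes the old one
            self.current[fact.key] = fact
        else:
            self.retired.append(fact)     # late-arriving older fact never becomes current

    def lookup(self, key):
        fact = self.current.get(key)
        return [fact.value] if fact else []


updates = [
    Fact("user.city", "Berlin", date(2023, 1, 10)),
    Fact("user.city", "Lisbon", date(2023, 6, 2)),
]
append_only, store = AppendOnlyList(), SupersedingStore()
for u in updates:
    append_only.add(u)
    store.add(u)

print(append_only.lookup("user.city"))  # both values are returned
print(store.lookup("user.city"))        # only the latest value is returned
```

Keeping retired facts in a separate list rather than deleting them is one way to preserve an audit trail while still returning a single current value.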
Per-category recall
- Token use: about 12x lower than transcript at similar overall recall on this run.
- Open-domain: 93.3% (reference 76.0%).
- Temporal: 93.5% (reference 92.8%).
- Supersession and provenance are covered separately in ContradictionBench.
| Question type | Industry reference | Condensate | Full transcript |
|---|---|---|---|
| Open-domain | 76.0% | 93.3% | 98.3% |
| Temporal | 92.8% | 93.5% | 86.0% |
| Single-hop | 92.3% | 81.2% | 92.9% |
| Multi-hop | 93.3% | 70.8% | 55.2% |
| Adversarial | n/a | 55.4% | 40.1% |
Remaining work on this dataset: multi-hop 70.8%, adversarial 55.4%, plus 3.0 points to the 85% overall recall target.
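The per-category gaps to the published reference can be read off the table with a few lines of arithmetic. The sketch below uses only the percentages reported above; adversarial is left out because no reference figure is given.

```python
# (industry reference %, Condensate %) per question type, from the table above.
rows = {
    "Open-domain": (76.0, 93.3),
    "Temporal": (92.8, 93.5),
    "Single-hop": (92.3, 81.2),
    "Multi-hop": (93.3, 70.8),
}

# Positive gap: Condensate ahead of the reference. Negative: behind.
gaps = {name: round(cond - ref, 1) for name, (ref, cond) in rows.items()}

# Categories where Condensate trails the reference, largest deficit first.
behind = sorted((name for name, gap in gaps.items() if gap < 0), key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
print("Behind reference:", behind)
```

On these figures multi-hop has the largest deficit, and single-hop also trails the reference by about 11 points.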
Operational notes
- Inference cost: ~1,750 tokens of memory per question at 82.0% recall, vs ~20,500 tokens for transcript replay at 80.4% recall.
- Mutable user state: supersession removes facts the user has replaced.
- Long threads: temporal and open-domain categories led this run; multi-hop is still behind the reference.
- Audit trail: raw JSON and scripts live in the Condensates repo; regenerate with the Makefile targets below.
Native end-to-end answer match (not just recall-in-context) was 76.4% on this run. Leaderboard references may use different cost accounting; compare token columns only when methodologies match.
Regenerate the report
```shell
# WSL, from Condensates repo root
make test-locomo-full     # Full LoCoMo-10 + HTML/JSON artifacts
make test-contradiction   # 50 supersession cases
make test-all             # Unit, integration, benchmarks, contradiction
```
Recall means the benchmark answer text appeared in memory returned before the model answered. Ignore older QA-only partial runs (~50.5%); they used a different scope.
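The recall definition above can be sketched as a containment check. The exact matching rule used by the benchmark scripts is not stated on this page, so the normalization here (lowercasing, dropping punctuation, whole-word phrase match) is an assumption, and the function names and example data are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def answer_in_context(answer, memory_chunks):
    """True if the normalized answer appears as a whole-word phrase in the returned memory."""
    needle = normalize(answer)
    if not needle:
        return False
    haystack = normalize(" ".join(memory_chunks))
    # Pad with spaces so "art" does not match inside "party".
    return f" {needle} " in f" {haystack} "


def retrieval_recall(examples):
    """Fraction of (answer, memory_chunks) pairs where the answer text was retrieved."""
    hits = sum(answer_in_context(answer, chunks) for answer, chunks in examples)
    return hits / len(examples)


# Hypothetical examples, not taken from LoCoMo.
examples = [
    ("Lisbon", ["User moved to Lisbon in June."]),
    ("a golden retriever", ["User has a cat."]),
]
print(retrieval_recall(examples))
```

A check like this scores what was retrieved, not what the model said, which is why recall-in-context and end-to-end answer match are reported as separate figures.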
Summary table
| Approach | Right info in context | Memory per question | Updates on correction |
|---|---|---|---|
| Full transcript | 80.4% | ~20k tokens | No |
| Observation list | 69.2% | ~6k tokens | Add-only |
| Industry leaderboard | ~92.5% | ~7k tokens | Varies |
| Condensate (June 2026) | 82.0% | ~1.7k tokens | Yes |