Condensate

2026-06-04 · locomo10_full_report.json

LoCoMo-10 benchmark report

Condensate was scored on the public LoCoMo long-memory suite alongside transcript-replay and observation-list baselines and published leaderboard references. This page summarizes the June 2026 full run: 10 conversations, about 2,000 questions, and one ingest method per conversation.

What LoCoMo measures

Assistants that run for weeks must recall prior user statements without attaching the full chat history on every request.

Failure modes the suite targets:

Targets and June 2026 scores

Internal target: at least 85% retrieval recall with under 7,000 tokens of memory per question.

Published reference (LoCoMo leaders): about 92.5% recall, about 6,956 tokens per question.

Condensate recall: 82.0% (3.0 points below the 85% target; transcript baseline 80.4%)

Memory tokens per question: 1,749 (under the 7,000-token cap; transcript average ~20,476)
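The target check above is simple arithmetic, and a short sketch makes the pass/fail logic explicit. The function and constant names here are illustrative and not part of the Condensate codebase; the numbers are the ones reported on this page.

```python
# Figures copied from this report; all names are illustrative assumptions.
RECALL_TARGET = 0.85   # internal target: at least 85% retrieval recall
TOKEN_CAP = 7_000      # internal target: under 7,000 memory tokens per question


def check_targets(recall: float, tokens_per_question: float) -> dict:
    """Compare one run against the internal recall and token targets."""
    return {
        "recall_met": recall >= RECALL_TARGET,
        "tokens_met": tokens_per_question < TOKEN_CAP,
        # A positive gap means the run is still short of the recall target.
        "recall_gap_points": round((RECALL_TARGET - recall) * 100, 1),
    }


condensate = check_targets(recall=0.820, tokens_per_question=1_749)

# Memory footprint of the full transcript relative to Condensate.
transcript_ratio = 20_476 / 1_749

print(condensate)
print(f"transcript uses {transcript_ratio:.1f}x the memory tokens")
```

On the June 2026 figures this reports the token target as met and the recall target as missed by 3.0 points.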

Baseline definitions

Approach | Mechanism | Recall | Tokens per question
Full transcript | Attach the full chat log on each turn | 80.4% | ~20,500
Observation list | Append-only extracted bullets | 69.2% | ~6,300
Structured notes | Organized store without retiring stale rows | 80.4% | ~22,100
Industry reference | Figures from published LoCoMo leaderboard entries | ~92.5% | ~7,000
Condensate | Dated facts with supersession; fair per-conversation ingest | 82.0% | ~1,749
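To illustrate what "dated facts with supersession" could look like, here is a minimal sketch. It is not Condensate's implementation: the `Fact` and `FactStore` names, the key scheme, and the rule of one active fact per key are all assumptions made for the example, and facts are assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str                               # hypothetical subject key, e.g. "user.city"
    value: str
    stated_on: date
    superseded_on: Optional[date] = None   # set when a newer fact replaces this one


class FactStore:
    """Hypothetical store that keeps at most one active fact per key."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def add(self, key: str, value: str, stated_on: date) -> None:
        # Retire the active fact for this key rather than deleting it,
        # so the history stays available while retrieval sees only the latest.
        for fact in self._facts:
            if fact.key == key and fact.superseded_on is None:
                fact.superseded_on = stated_on
        self._facts.append(Fact(key, value, stated_on))

    def active(self) -> list[Fact]:
        """Facts that have not been superseded."""
        return [f for f in self._facts if f.superseded_on is None]

    def history(self, key: str) -> list[Fact]:
        """All facts ever stated for a key, oldest first."""
        return [f for f in self._facts if f.key == key]


store = FactStore()
store.add("user.city", "Lisbon", date(2026, 1, 10))
store.add("user.pet", "a cat named Miso", date(2026, 1, 12))
store.add("user.city", "Porto", date(2026, 3, 2))   # correction supersedes Lisbon

active_values = {f.key: f.value for f in store.active()}
print(active_values)
```

An append-only observation list would return both "Lisbon" and "Porto" for the same question. Retiring the stale row is what keeps the memory returned per question small in this sketch.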

Per-category recall

Question type | Industry reference | Condensate | Transcript
Open-domain | 76.0% | 93.3% | 98.3%
Temporal | 92.8% | 93.5% | 86.0%
Single-hop | 92.3% | 81.2% | 92.9%
Multi-hop | 93.3% | 70.8% | 55.2%
Adversarial | n/a | 55.4% | 40.1%

Remaining work on this dataset: raise multi-hop recall (70.8%) and adversarial recall (55.4%), and close the 3.0-point gap to the 85% overall recall target.
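The per-category figures can be ranked mechanically to see where Condensate trails and where it leads. The sketch below only re-derives differences from the per-category table; the variable names are illustrative.

```python
# Per-category recall in percent, copied from the per-category table.
# Tuples are (industry reference, Condensate, transcript); None means not reported.
recall = {
    "Open-domain": (76.0, 93.3, 98.3),
    "Temporal": (92.8, 93.5, 86.0),
    "Single-hop": (92.3, 81.2, 92.9),
    "Multi-hop": (93.3, 70.8, 55.2),
    "Adversarial": (None, 55.4, 40.1),
}

# Condensate minus transcript, in percentage points (positive = Condensate ahead).
vs_transcript = {
    name: round(condensate - transcript, 1)
    for name, (_, condensate, transcript) in recall.items()
}

# Condensate minus industry reference, skipping categories with no reference.
vs_industry = {
    name: round(condensate - industry, 1)
    for name, (industry, condensate, _) in recall.items()
    if industry is not None
}

# The two categories with the lowest Condensate recall.
weakest = sorted(recall, key=lambda name: recall[name][1])[:2]

print(vs_transcript)
print(vs_industry)
print(weakest)
```

The two weakest categories come out as adversarial and multi-hop, which matches the remaining-work list. Note that Condensate already leads the transcript baseline in both, and the larger deficit is against the industry reference on multi-hop.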

Operational notes

Native end-to-end answer match (not just recall-in-context) was 76.4% on this run. Leaderboard references may use different cost accounting; compare token columns only when methodologies match.

Regenerate the report

# WSL, from the Condensate repo root
make test-locomo-full    # Full LoCoMo-10 + HTML/JSON artifacts
make test-contradiction  # 50 supersession cases
make test-all            # Unit, integration, benchmarks, contradiction

Recall here means that the benchmark answer text appeared in the memory returned to the model before it answered. Ignore older QA-only partial runs (~50.5%); they used a different scope and are not comparable.
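As a rough illustration of that recall definition, the sketch below counts a question as a hit when the gold answer text occurs in any returned memory snippet. The matching rule (case-insensitive, whitespace-normalized substring) and the function names are assumptions; the report does not specify how the benchmark harness normalizes text.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def answer_in_memory(answer: str, memory_snippets: list[str]) -> bool:
    """True if the gold answer text appears in any returned memory snippet."""
    needle = normalize(answer)
    return any(needle in normalize(snippet) for snippet in memory_snippets)


def retrieval_recall(cases: list[tuple[str, list[str]]]) -> float:
    """Fraction of (gold answer, returned memory) cases that are hits."""
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(answer_in_memory(answer, memory) for answer, memory in cases)
    return hits / len(cases)


# Toy cases, invented for illustration only.
cases = [
    ("Porto", ["2026-03-02: user moved to Porto"]),
    ("a cat named Miso", ["2026-01-12: user has a cat  named Miso"]),
    ("vegetarian", ["2026-02-01: user started a new job"]),
]

recall_score = retrieval_recall(cases)
print(f"{recall_score:.1%}")
```

This metric says nothing about whether the model's final answer was correct, which is why the end-to-end answer match is reported separately.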

Summary table

Approach | Right info in context | Memory per question | Updates on correction
Full transcript | 80.4% | ~20k tokens | No
Fact list | 69.2% | ~6k tokens | Add-only
Industry leaderboard | ~92.5% | ~7k tokens | Varies
Condensate (June 2026) | 82.0% | ~1.7k tokens | Yes