2026-06-10 · locomo10_condensate_v53_fair.json

LoCoMo-10 benchmark report

Condensate is scored on the public LoCoMo long-memory suite alongside transcript replay, observation-list baselines, and published leaderboard references. This page summarizes the latest fair full run: 10 conversations, 1,986 questions, a fresh ingest per conversation, and session-scoped retrieval.

What LoCoMo measures

Assistants that run for weeks must recall prior user statements without attaching the full chat history on every request.

Missed facts: the retrieved context never contains the answer, or holds an older version.
Stale contradictions: the user updates a detail; both old and new values stay in memory.
Token load: sending the full transcript each turn scales cost and latency with thread length.
Adversarial traps: questions with false premises must not surface trap answers from plausible but wrong context.
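A minimal sketch of how the first two failure modes can register in a recall-style check, assuming "recall" means the gold answer appears somewhere in the retrieved context. The function names, the substring-match rule, and the example data are illustrative assumptions, not the benchmark's actual scorer.

```python
def context_has_answer(context: str, gold_answer: str) -> bool:
    """Case-insensitive substring match (an assumed, simplified scoring rule)."""
    return gold_answer.strip().lower() in context.lower()


def retrieval_recall(items: list[tuple[str, str]]) -> float:
    """Fraction of (retrieved_context, gold_answer) pairs that contain the answer."""
    if not items:
        return 0.0
    hits = sum(context_has_answer(ctx, gold) for ctx, gold in items)
    return hits / len(items)


# Hypothetical examples, not LoCoMo data.
examples = [
    ("2024-03-02: user moved to Lisbon", "Lisbon"),  # answer present: hit
    ("2023-11-10: user lives in Porto", "Lisbon"),   # only the older version: miss
    ("user likes hiking", "Lisbon"),                 # fact never retrieved: miss
]
print(f"recall = {retrieval_recall(examples):.3f}")
```

A substring rule is deliberately crude; a real scorer would likely need normalization or a judge model, so treat this only as a way to picture what a "hit" means.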

Targets and June 2026 scores

Internal target: at least 85% retrieval recall with under 7,000 tokens of memory per question.

Published reference (LoCoMo leaders): about 92.5% recall, about 6,956 tokens per question.

Condensate recall: 83.6%. This is 1.4 pts below the 85% goal; the transcript baseline is 80.4%. Peak fair run: 85.4% (June 2026).

Memory tokens per question: 1,647. This is under the 7k cap; the transcript average is ~20,476 (~12× Condensate).

Native end-to-end answer match (not recall-only): 72.6%.
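The headline figures can be checked with a few lines of arithmetic. This is a sketch that uses only numbers quoted on this page; the variable names are illustrative.

```python
RECALL_GOAL = 85.0          # internal target, percent
TOKEN_CAP = 7_000           # internal target, memory tokens per question

condensate_recall = 83.6    # percent, latest fair run
condensate_tokens = 1_647   # memory tokens per question
transcript_tokens = 20_476  # transcript average, tokens per question

gap_to_goal = round(RECALL_GOAL - condensate_recall, 1)
token_ratio = transcript_tokens / condensate_tokens
under_cap = condensate_tokens < TOKEN_CAP

print(f"gap to recall goal: {gap_to_goal} pts")
print(f"transcript / Condensate tokens: {token_ratio:.1f}x")
print(f"under token cap: {under_cap}")
```

The ratio comes out a little above 12, which is where the "~12×" figure comes from.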

Baseline definitions

Approach	Mechanism	Recall	Tokens per question
Full transcript	Attach the full chat log on each turn	80.4%	~20,500
Observation list	Append-only extracted bullets	69.2%	~6,300
Structured notes	Organized store without retiring stale rows	80.4%	~22,100
Industry reference	Published LoCoMo leaderboard entries	~92.5%	~7,000
Condensate (fair v5.3)	Dated facts with supersession; per-conversation ingest	83.6%	~1,647

Per-category recall

Token use: about 12× lower than transcript at competitive overall recall.
Open-domain: 92.5% (reference 76.0%).
Temporal: 95.6% (reference 92.8%).
Adversarial: 58.3%, improved via query-only entity-swap trap suppression (LOC-018, in progress).
Supersession and provenance covered separately in ContradictionBench.

Type	Industry	Condensate	Transcript
Open-domain	76.0%	92.5%	98.3%
Temporal	92.8%	95.6%	86.0%
Multi-hop	93.3%	81.2%	55.2%
Single-hop	92.3%	84.0%	92.9%
Adversarial	n/a	58.3%	40.1%

Remaining work: raise multi-hop from 81.2%, raise adversarial from 58.3% to the 75%+ target, close the 1.4 pt gap to the 85% overall goal, and reach the 95% GTM gate (LOC-020).
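To see where the gaps sit, the per-category table can be loaded into a small structure and compared against the industry column. This is a sketch; the dictionary layout and helper name are assumptions, and the values are copied from the table above.

```python
from typing import Optional

# Per-category recall in percent; None marks the "n/a" industry cell.
per_category = {
    "Open-domain": {"industry": 76.0, "condensate": 92.5, "transcript": 98.3},
    "Temporal":    {"industry": 92.8, "condensate": 95.6, "transcript": 86.0},
    "Multi-hop":   {"industry": 93.3, "condensate": 81.2, "transcript": 55.2},
    "Single-hop":  {"industry": 92.3, "condensate": 84.0, "transcript": 92.9},
    "Adversarial": {"industry": None, "condensate": 58.3, "transcript": 40.1},
}

ADVERSARIAL_TARGET = 75.0  # percent, from the remaining-work line


def delta_vs_industry(row: dict) -> Optional[float]:
    """Condensate minus industry reference, in points; None when no reference."""
    if row["industry"] is None:
        return None
    return round(row["condensate"] - row["industry"], 1)


deltas = {name: delta_vs_industry(row) for name, row in per_category.items()}
behind = [name for name, d in deltas.items() if d is not None and d < 0]
adversarial_gap = round(
    ADVERSARIAL_TARGET - per_category["Adversarial"]["condensate"], 1
)

for name, d in deltas.items():
    print(f"{name:12s} {'n/a' if d is None else f'{d:+.1f} pts'}")
print("behind industry reference:", behind)
print(f"adversarial gap to target: {adversarial_gap} pts")
```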

Operational notes

Fair ingest contract: each LoCoMo conversation ingested fresh; retrieve scoped by session; benchmark env verified via check_benchmark_mode.sh.
Inference cost: ~1,647 tokens of memory per question vs ~20,500 for transcript replay.
Mutable user state: supersession removes facts the user has replaced.
Astrocyte retrieval: recall gate, source-turn hydration, entity-swap trap filtering - see Astrocyte Memory.
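The supersession behavior described above can be pictured with a small store that keeps one active fact per subject and attribute. This is a minimal sketch under stated assumptions: the `Fact` and `SupersedingStore` names, the keying scheme, and the "newest date wins" rule are hypothetical and are not taken from Condensate's implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    stated_on: date  # when the user said it


class SupersedingStore:
    """Keeps one active fact per (subject, attribute); newer dates win."""

    def __init__(self) -> None:
        self._active: dict[tuple[str, str], Fact] = {}

    def add(self, fact: Fact) -> None:
        key = (fact.subject, fact.attribute)
        current = self._active.get(key)
        # Replace only when the new statement is at least as recent.
        if current is None or fact.stated_on >= current.stated_on:
            self._active[key] = fact

    def active_facts(self) -> list[Fact]:
        return list(self._active.values())


# Hypothetical example data, not taken from LoCoMo.
store = SupersedingStore()
store.add(Fact("user", "city", "Porto", date(2023, 11, 10)))
store.add(Fact("user", "city", "Lisbon", date(2024, 3, 2)))
print([f.value for f in store.active_facts()])
```

An append-only list would return both cities here, which is the stale-contradiction case; keying on subject and attribute is one simple way to avoid it, at the cost of needing reliable attribute extraction.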

Regenerate the report

# WSL, from Condensates repo root
make test-locomo-v53-fair   # Fair full run (force-recreates API with bench env)
make test-locomo-report     # Merge + comparative MD/HTML + failure analysis
make test-contradiction     # 50 supersession cases

Artifacts: benchmarks/results/locomo10_condensate_v53_fair.json, locomo10_comparative_report.html (also mirrored here). Do not cite QA-only partial runs (~50.5%) or runs without RETRIEVE_BENCHMARK_MODE=1.
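One possible pre-citation guard for the rule above, as a sketch. It assumes a run is citable when RETRIEVE_BENCHMARK_MODE is set to 1 and the run covers the full suite of 10 conversations and 1,986 questions; the `is_citable` helper and its parameters are hypothetical and do not reflect the schema of the result JSON.

```python
from typing import Mapping

FULL_RUN_CONVERSATIONS = 10  # from this report
FULL_RUN_QUESTIONS = 1_986   # from this report


def is_citable(env: Mapping[str, str], conversations: int, questions: int) -> bool:
    """Assumed rule: benchmark mode is on and the run covers the full suite."""
    benchmark_mode = env.get("RETRIEVE_BENCHMARK_MODE") == "1"
    full_run = (
        conversations == FULL_RUN_CONVERSATIONS
        and questions == FULL_RUN_QUESTIONS
    )
    return benchmark_mode and full_run


# Hypothetical inputs for illustration.
print(is_citable({"RETRIEVE_BENCHMARK_MODE": "1"}, 10, 1_986))
print(is_citable({}, 10, 1_986))
```

In practice the environment mapping could be `os.environ` at run time, with the counts read from wherever the run records them.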

Summary table

Approach	Right info in context	Memory per question	Updates on correction
Full transcript	80.4%	~20k tokens	No
Observation list	69.2%	~6k tokens	Add-only
Industry leaderboard	~92.5%	~7k tokens	Varies
Condensate (June 2026 fair)	83.6%	~1.6k tokens	Yes