This is a well-structured breakdown of a problem most RAG teams discover the hard way: vector search recall degrades fast when document corpora grow large or semantically dense. The three-stage pipeline (vector → semantic ranker → GraphRAG prominence, fused via RRF) is a sound architecture, and the decision to build the graph from citation references rather than LLM-extracted entities is pragmatic engineering. It avoids the noise and cost of extraction while keeping the graph meaningful.
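For anyone who hasn't used it, the RRF fusion step at the end of a pipeline like this is only a few lines. A minimal sketch below — the doc IDs, the two-list setup, and k=60 (the constant from the original RRF paper) are illustrative, not from the post:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge multiple ranked lists of doc IDs.

    Each document's fused score is the sum of 1/(k + rank) over every
    list it appears in, so documents ranked well by several retrievers
    float to the top even if no single retriever ranked them first.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from two upstream stages
semantic = ["case_42", "case_7", "case_99"]
graph = ["case_7", "case_13", "case_42"]
fused = rrf_fuse([semantic, graph])
# → ["case_7", "case_42", "case_13", "case_99"]
```

Note that case_7 wins despite never being ranked first by either stage — that consensus-over-champions behavior is exactly why RRF works well for fusing retrievers with incomparable score scales.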
The recall jump from 40% (vector) to 70% (GraphRAG pipeline) is meaningful, though it's worth noting this dataset has an unusually clean graph signal. Legal citations are explicit, structured, and semantically load-bearing. Applying the same pattern to enterprise corpora (Confluence, SharePoint, JIRA) where link structure is noisy or shallow will require more investment in graph extraction quality upstream.
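To make the "clean graph signal" point concrete: when citations are explicit, graph construction and a prominence score collapse to almost nothing. A sketch under assumed data (the case IDs and edges are made up, and in-degree stands in for whatever richer prominence metric the actual pipeline uses):

```python
from collections import Counter

# Illustrative citation edges: each source document cites a list of targets.
# With legal corpora these edges come directly from the citation text,
# no LLM entity extraction required.
citations = {
    "case_42": ["case_7", "case_13"],
    "case_7": ["case_13"],
    "case_99": ["case_7"],
}

# Simple prominence proxy: citation in-degree. A production GraphRAG
# setup would likely use PageRank or community-level metrics instead.
in_degree = Counter(dst for cited in citations.values() for dst in cited)
prominent = [doc for doc, _ in in_degree.most_common()]
```

The enterprise-corpus caveat above is visible even here: if the `citations` dict has to be inferred from noisy wiki links or ticket mentions, everything downstream inherits that noise.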
Curious whether the multi-level entity summarization step from the MS Research GraphRAG library scales cleanly on the 0.5M-case dataset, specifically how token costs and latency behave at that volume during ingestion. That's usually where GraphRAG implementations hit friction in production.