The Essential RAG Book

Streaming RAG

Streaming Retrieval Augmented Generation (Streaming RAG) integrates continuously updating data sources with the retrieval step. Instead of indexing only static corpora, the system consumes append-only feeds, pub-sub topics, or event streams and maintains a near real-time index for retrieval. The generator can then answer questions with the freshest available context.

┌─────────────────────────────────┐
│ Producers: APIs, Webhooks, Logs │
└─────────────────────────────────┘
                 ↓
         ┌───────────────┐
         │ Stream Ingest │
         └───────────────┘
                 ↓
          ┌────────────┐
          │ Preprocess │
          └────────────┘
                 ↓
             ┌───────┐
             │ Embed │
             └───────┘
                 ↓
       ┌──────────────────┐
       │ ANN Index Update │
       └──────────────────┘
           // incremental
          ┌────────────┐
          │ User Query │
          └────────────┘
                 ↓
           ┌───────────┐
           │ Retriever │
           └───────────┘
                 ↓
             ┌───────┐
             │ Top-k │
             └───────┘
                 ↓
           ┌───────────┐
           │ Generator │
           └───────────┘
                 ↓
            ┌────────┐
            │ Answer │
            └────────┘
        ┌─────────────────┐
        │ Freshness Guard │
        └─────────────────┘
     TTL, watermark, late data
Figure 11: Streaming RAG ingests events continuously, updates the vector index incrementally, and enforces freshness at retrieval time.
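
The ingest path in the figure (stream ingest, preprocess, embed, index update) can be sketched in a few lines. This is a minimal, in-memory illustration, not a production design: `Event`, `toy_embed`, `InMemoryIndex`, `micro_batches`, and `ingest` are hypothetical names, the embedding is a hashed bag of words standing in for a real model, and the dictionary-backed index stands in for an ANN structure that supports upserts and deletes. Treating an empty `text` as a delete is also an assumption made for the example.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class Event:
    doc_id: str
    text: str               # assumption: empty text marks a delete (tombstone)
    ts: float               # event time, seconds since epoch
    source: str = "unknown"


def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class InMemoryIndex:
    """Dictionary stand-in for an ANN index with upsert and delete support."""
    rows: dict = field(default_factory=dict)

    def upsert(self, doc_id, vector, meta):
        self.rows[doc_id] = (vector, meta)

    def delete(self, doc_id):
        self.rows.pop(doc_id, None)


def micro_batches(events, max_size):
    """Group an event iterator into small batches to bound per-batch work.

    A real consumer would typically also flush on a timer; this sketch
    flushes on size only.
    """
    batch = []
    for ev in events:
        batch.append(ev)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(events, index, batch_size=2):
    """Apply events to the index in arrival order; return how many were applied."""
    applied = 0
    for batch in micro_batches(events, batch_size):
        for ev in batch:
            if not ev.text:
                index.delete(ev.doc_id)
            else:
                meta = {"ts": ev.ts, "source": ev.source}
                index.upsert(ev.doc_id, toy_embed(ev.text), meta)
            applied += 1
    return applied
```

Keeping the timestamp and source next to each vector is what later allows retrieval to filter or re-score by age.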

Ingestion. Use a streaming substrate such as Kafka, Kinesis, or Pub/Sub to capture new documents and deltas. Preprocess with lightweight parsers and chunkers that operate in micro-batches to bound latency. Persist raw events to object storage for replay and backfill to keep the index consistent.

Index maintenance. Maintain an incremental ANN pipeline (HNSW, IVF-Flat, PQ) that supports fast upserts and deletes. Keep metadata columns for timestamps, source, and ACL. Partition or time-slice large tables to accelerate pruning and TTL expiry.

Freshness controls. Enforce a time watermark on retrieval that drops stale chunks beyond a TTL. Add a recency prior to the score, e.g., score = sim - lambda * age_hours. For safety-critical answers, require at least one context updated within a freshness window.

Serving path. The retriever consults both the hot streaming index and a colder historical index. A policy decides which to use based on query type and freshness requirements. Answers include citations with timestamps to improve trust and traceability.

Backfill and reindex. When schemas or embeddings change, run a background reindex job while continuing incremental updates. Use versioned embeddings and a dual-read policy during migration to avoid downtime.

When to use. Streaming RAG is ideal for real-time monitoring, news summarization, fraud and risk signals, and operational analytics. It trades additional ops complexity for the ability to answer questions about the latest events.
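
The freshness controls described above can be illustrated with a small re-ranking function. This is a sketch under stated assumptions rather than a reference implementation: `Hit` and `rerank_with_freshness` are hypothetical names, the default TTL and `lam` (the lambda in score = sim - lambda * age_hours) are arbitrary illustrative values, and returning `None` when no context falls inside the freshness window is just one possible way to signal that the caller should refuse or fall back.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    sim: float         # similarity from the vector search, higher is better
    updated_at: float  # last update time, seconds since epoch


def rerank_with_freshness(hits, now, ttl_hours=24.0, lam=0.01, k=3,
                          fresh_window_hours=None):
    """Drop chunks older than the TTL, apply a recency prior, return the top k.

    If fresh_window_hours is set and none of the top-k chunks was updated
    within that window, return None so the caller can decline to answer.
    """
    scored = []
    for h in hits:
        age_hours = max(0.0, (now - h.updated_at) / 3600.0)
        if age_hours > ttl_hours:
            continue  # watermark: stale beyond the TTL
        scored.append((h.sim - lam * age_hours, age_hours, h))

    scored.sort(key=lambda t: t[0], reverse=True)
    top = scored[:k]

    if fresh_window_hours is not None:
        if not any(age <= fresh_window_hours for _, age, _ in top):
            return None

    return [(h.chunk_id, round(score, 4)) for score, _, h in top]


if __name__ == "__main__":
    now = 100 * 3600.0
    hits = [
        Hit("A", 0.90, now - 30 * 3600),  # older than the 24 h TTL, dropped
        Hit("B", 0.80, now - 10 * 3600),
        Hit("C", 0.75, now - 1 * 3600),
        Hit("D", 0.60, now),
    ]
    print(rerank_with_freshness(hits, now, k=2))
```

In this example the most similar chunk, A, is excluded by the TTL, and C outranks B because B's ten-hour age costs it more than its similarity advantage. How large `lam` should be depends on how quickly information loses value in a given domain.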
