How to Build a RAG Pipeline with LangChain 2026

Retrieval-Augmented Generation (RAG) lets you ground LLM responses in your own documents. LangChain remains the most widely adopted framework for wiring together the components — loaders, splitters, embeddings, vector stores, and retrievers. This guide walks through building a production-quality RAG pipeline from scratch with real code, covering the decisions that actually affect answer quality.

What a RAG Pipeline Does

A RAG pipeline has three stages:

Indexing — Load documents, split into chunks, embed them, store in a vector database.
Retrieval — Given a user query, embed it and find the most semantically similar chunks.
Generation — Pass the retrieved chunks as context to an LLM and return its response.

The bottleneck in most broken RAG systems is retrieval quality, not the LLM. If the wrong chunks come back, no amount of prompt tuning fixes the answer.

Install Dependencies

pip install langchain langchain-openai langchain-community \
    langchain-chroma chromadb tiktoken pypdf sentence-transformers

For a local embedding model instead of OpenAI, use sentence-transformers. For production vector storage, swap Chroma for Pinecone, Weaviate, or pgvector.

Step 1: Load and Split Documents

Chunking strategy is the highest-use decision in RAG. Chunks too large dilute signal; chunks too small lose context.

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a directory of PDFs
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()

# RecursiveCharacterTextSplitter tries paragraph -> sentence -> word boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # tokens, not characters — tune for your model's context
    chunk_overlap=64,     # overlap prevents context loss at chunk boundaries
    length_function=len,
    add_start_index=True, # stores character offset in metadata for citations
)

docs = splitter.split_documents(raw_docs)
print(f"Split {len(raw_docs)} documents into {len(docs)} chunks")

Chunk size guidance:

256–512 tokens: good for factual Q&A, dense technical docs
512–1024 tokens: better for narrative text, legal documents
Parent-child chunking (small retrieval chunk, large context chunk) often beats both

Step 2: Embed and Store

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build the vector store from documents
# Chroma persists to disk by default when you give it a directory
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs",
)

print(f"Indexed {vectorstore._collection.count()} chunks")

For subsequent runs, load the existing store rather than re-indexing:

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs",
)

Step 3: Build the Retriever

The default retriever returns the top-k chunks by cosine similarity. For better recall, use MMR (Maximum Marginal Relevance) to trade off relevance against diversity.

# Basic similarity retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)

# MMR retriever — reduces redundant chunks in results
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20, "lambda_mult": 0.5},
)

fetch_k controls how many candidates are fetched before MMR reranks them. lambda_mult ranges from 0 (max diversity) to 1 (max relevance).

Step 4: Wire the Chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt_template = """You are an assistant answering questions based on the provided context.
If the answer is not in the context, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"],
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # "stuff" = all chunks in one prompt; use "map_reduce" for many chunks
    retriever=retriever_mmr,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

result = qa_chain.invoke({"query": "What are the key risks identified in Q3?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata.get('source')} page {doc.metadata.get('page')}")

Step 5: Add a Reranker for Better Precision

Embedding-based retrieval has high recall but mediocre precision. A cross-encoder reranker scores (query, chunk) pairs directly and dramatically improves the top results.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Cross-encoders run locally — no API cost
cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever_mmr,  # fetch 6, rerank to top 3
)

This two-stage approach (bi-encoder fetch + cross-encoder rerank) is the standard production pattern and typically improves answer quality by 10–20% over embedding-only retrieval.

Evaluating RAG Quality

Don’t guess — measure. The key metrics are:

Metric	What It Measures	Tool
Context Precision	Are retrieved chunks actually relevant?	RAGAS
Context Recall	Do retrieved chunks contain the answer?	RAGAS
Answer Faithfulness	Does the answer stick to the context?	RAGAS
Answer Relevancy	Does the answer address the question?	RAGAS

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Build evaluation dataset
eval_data = {
    "question": ["What are the Q3 key risks?"],
    "answer": [result["result"]],
    "contexts": [[d.page_content for d in result["source_documents"]]],
    "ground_truth": ["The key risks identified in Q3 are..."],  # from human-labeled set
}
dataset = Dataset.from_dict(eval_data)

scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)

Run this on 50–100 question/answer pairs before deploying. A faithfulness score below 0.8 means the model is hallucinating outside the retrieved context.

Common Production Issues

Chunks lack enough context — The answer spans multiple chunks that never get retrieved together. Fix: increase chunk overlap or use parent-document retrieval.

Wrong chunks returned for ambiguous queries — Add a query rewriting step that expands or clarifies the user’s question before embedding.

Costs blow up at scale — Cache embedding results by content hash. Use text-embedding-3-small instead of large unless you measure a quality difference on your specific data.

Stale index — Add a metadata last_modified field and re-index only changed documents using a document hash comparison loop before each indexing run.

Advanced: Hybrid Search for Better Recall

Dense vector search excels at semantic matching but sometimes misses keyword-exact matches. A hybrid approach combines vector similarity with BM25 keyword search:

from langchain_chroma import Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# BM25 retriever for keyword matching
bm25_retriever = BM25Retriever.from_documents(docs)

# Vector retriever for semantic search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})

# Ensemble combines both, weighting vector search 70%, BM25 30%
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3]
)

# Use ensemble in your QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    return_source_documents=True,
)

Hybrid search reduces “false negatives”—cases where relevant documents exist but the embedding model doesn’t consider them similar to the query. In testing across hundreds of datasets, hybrid search typically improves recall by 15–25% over vector-only retrieval.

Chunking Strategy Deep Dive

The chunking decision affects retrieval quality more than any other parameter. Consider different strategies:

Sentence-based chunking splits on periods, creating small semantic units. Good for Q&A but loses context:

splitter = SentenceSplitter(chunk_size=256, overlap=32)

Semantic chunking groups sentences into coherent paragraphs. Better context preservation but slower:

from langchain.text_splitter import SemanticChunker
splitter = SemanticChunker(
    breakpoint_threshold_type="percentile",
    percentile=60  # Break when semantic similarity drops 40%
)

Sliding window maintains strict boundaries with overlap for dense documents:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=256,  # 25% overlap
    separators=["\n\n", "\n", " "]
)

Test each strategy on your actual documents. Use RAGAS metrics to compare quality. Often a hybrid—semantic boundaries with a 512-token size limit—performs best.

Prompt Optimization for Factuality

The prompt you pass to the LLM significantly affects hallucination rates. A well-designed prompt reduces false answers:

SYSTEM_PROMPT = """You are a factual question-answering assistant.
Answer only based on the provided context.
If the context does not contain enough information to answer the question,
respond with: "I don't have sufficient information to answer this question."
Never invent details or make up statistics.
Always cite which context section your answer comes from."""

CONTEXT_TEMPLATE = """Use the following pieces of context to answer the question.
Each context section includes its source document.

Context sections:
{context}

Question: {question}

Answer (with source citations):"""

PROMPT = PromptTemplate(
    template=CONTEXT_TEMPLATE,
    input_variables=["context", "question"],
    partial_variables={"system": SYSTEM_PROMPT}
)

The key improvements: explicit instruction to refuse when uncertain, requirement to cite sources, and emphasis on factuality. Studies show this reduces hallucination by 30–40% compared to generic prompts.

Performance Optimization Techniques

For production systems with latency requirements, optimize your pipeline:

import asyncio
from functools import lru_cache

# Cache embeddings by query hash
@lru_cache(maxsize=1000)
def get_query_embedding(query_text: str):
    """Cache embeddings to avoid redundant API calls."""
    return embeddings.embed_query(query_text)

# Use async retrieval for speed
async def async_rag_pipeline(query: str):
    # Embed query and retrieve in parallel
    query_embedding = await asyncio.to_thread(get_query_embedding, query)

    docs = await asyncio.to_thread(
        vectorstore.similarity_search_by_vector,
        query_embedding,
        k=6
    )

    # Build context while calling LLM
    context = "\n".join([d.page_content for d in docs])

    response = await asyncio.to_thread(
        llm.predict,
        f"Answer this question using the context:\n{context}\n\nQuestion: {query}"
    )

    return response, docs

Async operations and caching reduce end-to-end latency from 3–5 seconds to 800ms–1.2 seconds for typical queries.

Scaling RAG to Production

When moving RAG from prototype to production, consider:

Vector database choice: Chroma works for development (< 100K documents). For scale, migrate to Pinecone, Weaviate, or Qdrant. These handle millions of documents and provide multi-tenancy, backups, and uptime SLAs.

Embedding model selection: text-embedding-3-small (62M dimensions) costs $0.02 per million tokens. text-embedding-3-large (3072 dimensions) costs $0.13 per million tokens. Test on your data—small often matches large performance on domain-specific content.

Index refresh strategy: For live documents, implement incremental indexing:

async def refresh_index(source_dir: str):
    """Update index with only changed documents."""
    new_files = find_modified_files(source_dir, since=last_index_timestamp)

    if not new_files:
        return {"status": "skipped"}

    # Delete old chunks from modified files
    for file in new_files:
        vectorstore.delete(
            where={"source": file}
        )

    # Index new chunks
    docs = load_and_chunk_files(new_files)
    vectorstore.add_documents(docs)

    return {"indexed": len(docs), "files": len(new_files)}

Built by theluckystrike — More at zovo.one