How to Build a RAG Pipeline with LangChain and OpenAI in 2026

Retrieval-Augmented Generation (RAG) went from research paper to production staple in under two years. The idea is simple: instead of asking an LLM to recall facts from training data, you feed it relevant context at query time. The result is accurate, grounded answers over your own documents—without fine-tuning, without hallucination spirals. Here’s how to build a working RAG pipeline from scratch using LangChain and OpenAI’s API, with the patterns that actually hold up in production.

What You’re Building

By the end of this tutorial, you’ll have a pipeline that:

  • Ingests PDF or text documents and splits them into chunks
  • Embeds those chunks using OpenAI’s text-embedding-3-small model ($0.02/1M tokens)
  • Stores embeddings in a local Chroma vector database
  • Accepts user queries, retrieves the top-k relevant chunks, and passes them to GPT-4o for a grounded response

This is the architecture behind most production RAG systems. We’ll skip the complexity theater and build something that actually runs.

Setup: Install Dependencies

Start with a clean Python 3.11+ environment:

pip install langchain langchain-openai langchain-community chromadb pypdf python-dotenv

Create a .env file in your project root:

OPENAI_API_KEY=sk-your-key-here

LangChain’s modular design means you only install what you use. The langchain-openai package handles both chat models and embeddings; chromadb is the vector store we’ll use locally (no server needed for development).

Step 1: Load and Chunk Your Documents

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("your-document.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

The chunk_overlap=200 setting is important—it prevents context from being cut at chunk boundaries. I’ve seen RAG quality drop noticeably when overlap is set to zero. For a 50-page technical document, you’ll typically get 200–400 chunks at this chunk size.

If you’re loading plain text files, swap PyPDFLoader for TextLoader. LangChain has loaders for Notion, Confluence, GitHub, web URLs, and dozens of other sources—same interface, different loader class.

Step 2: Embed and Store in Chroma

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print("Vector store created and persisted.")

This step calls the OpenAI embeddings API once per chunk. For 300 chunks at ~500 tokens each, that’s roughly 150k tokens—about $0.003 at current pricing. The persist_directory saves the database to disk so you don’t re-embed on every restart.
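That back-of-envelope math is worth wrapping in a helper so you can sanity-check costs before embedding a large corpus. The default rate below is text-embedding-3-small's price at the time of writing; check current pricing before relying on it:

```python
def embedding_cost(num_chunks: int, avg_tokens_per_chunk: int,
                   price_per_million: float = 0.02) -> float:
    """One-time embedding cost in dollars at the given per-million-token rate."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million

print(f"${embedding_cost(300, 500):.4f}")  # the article's example: 300 chunks x ~500 tokens
```

Running this prints $0.0030, matching the estimate above.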

One thing worth getting right: text-embedding-3-small produces 1536-dimensional vectors and is 5x cheaper than the older text-embedding-ada-002 while being more accurate. There’s no reason to use ada-002 in new projects.

Step 3: Build the Retrieval Chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

prompt_template = (
    "You are a helpful assistant. Answer the question based only on the provided context. "
    "If the context does not contain the answer, say you do not have enough information.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
)

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What are the main conclusions?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Page {doc.metadata.get('page', 'N/A')}: {doc.page_content[:100]}...")

The k=5 retrieval setting fetches the 5 most similar chunks. Four to six is the sweet spot: fewer risks missing important context, more dilutes the prompt. Setting temperature=0 on the LLM makes responses as close to deterministic as the API allows (OpenAI doesn't guarantee bit-identical outputs even at zero temperature), favoring faithful extraction over creative improvisation. One caveat on the chain itself: RetrievalQA is considered a legacy interface in recent LangChain releases; it still runs, but expect deprecation warnings and check LangChain's migration guide for new projects.

Step 4: Production Considerations

This tutorial builds a working prototype. Before shipping, address these four things:

  • Vector store at scale: Chroma is fine for thousands of documents locally. For production, consider Pinecone, Weaviate, or pgvector. Pinecone’s serverless tier starts free and scales to billions of vectors.
  • Reranking: Add a reranker (Cohere Rerank or a cross-encoder model) after initial retrieval. This step alone improves answer accuracy by 15–25% on complex queries in my experience.
  • Metadata filtering: Tag chunks with document IDs, dates, or categories and filter at retrieval time. This is essential in multi-tenant or multi-domain applications.
  • Observability: Log every query, retrieved chunks, and final answer. LangSmith has a free tier and hooks in via a couple of environment variables (LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY) with no code changes.

Final Verdict

RAG is not complicated—the complexity is in tuning chunk size, retrieval k, and prompt design for your specific content. The pipeline above runs out of the box on most document types. Start here, measure answer quality against a set of 20–30 test questions, then iterate on the components that fail.

Your next step: Run the code above against a PDF you actually care about—an API doc, a research paper, your product handbook. You’ll have a working RAG system in under an hour. From there, the path to production is metadata filtering, a hosted vector store, and a reranker—in roughly that order of impact.
