Introduction

Imagine having a personal research assistant that can instantly answer questions about neurodevelopmental disorders by reading through hundreds of medical papers. That's exactly what we're building today!

RAG (Retrieval-Augmented Generation) is a powerful technique that combines two superpowers:

  1. Information Retrieval - Finding relevant information from your documents
  2. Text Generation - Using AI to craft intelligent answers based on that information

Think of it like having a super-smart librarian (the retrieval part) who not only finds the right books but also reads them and answers your questions in plain English (the generation part).

In this tutorial, we'll build a RAG system that can answer questions about neurodevelopmental disorders using research papers as its knowledge base. Don't worry if you're new to this - we'll break everything down into bite-sized pieces!


What We're Building

By the end of this tutorial, you'll have:

  • A system that reads PDF research papers
  • A searchable vector database of medical information
  • An interactive chat interface where you can ask questions
  • AI-powered answers based solely on your documents

Tech Stack: Our Toolkit

Before we dive in, let's understand the tools we'll use and why we need them:

| Library | Purpose | Why We Need It |
|---|---|---|
| LangChain | Framework for LLM applications | Simplifies connecting different AI components together |
| ChromaDB | Vector database | Stores our documents in a searchable format |
| Sentence Transformers | Creates embeddings | Converts text into numbers that computers can compare |
| Google Gemini | Large Language Model | Generates human-like answers to questions |
| Streamlit | Web framework | Creates our chat interface without complex web development |
| PyPDF | PDF processor | Extracts text from research papers |

Think of it like cooking:

  • PDFs are your ingredients
  • PyPDF is your knife (cuts/extracts text)
  • Sentence Transformers is your spice mix (adds flavor/meaning)
  • ChromaDB is your refrigerator (stores everything organized)
  • Gemini is your chef (creates the final dish)
  • Streamlit is your dining table (presents it beautifully)

System Architecture: The Big Picture

Here's how data flows through our RAG system:

mermaid
graph TB
    A[πŸ“„ PDF Documents in data/ folder] --> B[PyPDFDirectoryLoader]
    B --> C[πŸ“ Raw Text Extracted]
    C --> D[RecursiveCharacterTextSplitter]
    D --> E[🧩 Text Chunks<br/>chunk_size=500, overlap=50]
    E --> F[HuggingFace Embeddings<br/>all-MiniLM-L6-v2]
    F --> G[πŸ”’ Vector Embeddings<br/>numbers that represent meaning]
    G --> H[(ChromaDB Vector Store<br/>chroma_db/ folder)]
    I[πŸ‘€ User Question] --> J[Embedding Model<br/>same as above]
    J --> K[Question Vector]
    K --> H
    H --> L[πŸ” Similarity Search<br/>finds top 5 relevant chunks]
    L --> M[Retrieved Context]
    M --> N[πŸ€– Gemini LLM]
    I --> N
    N --> O[✨ AI-Generated Answer]
    style A fill:#e1f5ff
    style H fill:#fff4e1
    style N fill:#f0e1ff
    style O fill:#e1ffe1

Understanding the flow:

  1. Left side (Ingestion): We process PDFs into searchable chunks and store them
  2. Right side (Query): When you ask a question, we find relevant chunks and generate an answer

Step 1: Data Ingestion - Building Our Knowledge Base

The ingest.py script is like a librarian organizing books on shelves. Let's build it piece by piece!

1.1 Setting Up the Environment

First, we import our tools and load environment variables:

python
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

load_dotenv()

What's happening here?

  • os: Helps us work with file paths
  • load_dotenv(): Loads secret keys from a .env file (for API access)
  • The other imports are our specialized tools we discussed earlier

1.2 Loading PDF Documents

python
# 1. Load PDFs from data directory
data_path = os.path.join(os.path.dirname(__file__), "data")
loader = PyPDFDirectoryLoader(data_path)
documents = loader.load()

if not documents:
    print("No documents found!!")
    exit()

Breaking it down:

  • We point to a data/ folder in the same directory as our script
  • PyPDFDirectoryLoader reads ALL PDF files in that folder
  • Each PDF becomes a "document" object with text and metadata
  • We check if we actually found any files (safety first!)

πŸ’‘ Tip: Put all your research PDFs in the data/ folder before running this script.
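
Before moving on, it can help to peek at what the loader actually returned. Here is a quick, optional check (the metadata keys shown are what PyPDF typically provides; your exact output will vary):

python
# Optional: inspect what the loader produced
print(f"Loaded {len(documents)} pages")
first = documents[0]
print(first.metadata)            # typically includes 'source' and 'page'
print(first.page_content[:300])  # first 300 characters of extracted text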


1.3 Chunking: Breaking Text into Digestible Pieces

python
# 2. Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} text chunks.")

Why do we chunk?

Imagine trying to find a recipe in an entire cookbook versus in a recipe card. Smaller pieces are easier to search!

  • chunk_size=500: Each piece contains ~500 characters (about 1-2 paragraphs)
  • chunk_overlap=50: We overlap chunks by 50 characters to avoid cutting sentences awkwardly
  • This creates context-rich, searchable units

Example:

Original: "ADHD is a neurodevelopmental disorder. Symptoms include inattention..."

Chunk 1: "ADHD is a neurodevelopmental disorder. Symptoms include..."
Chunk 2: "...Symptoms include inattention and hyperactivity. Treatment..."
         ↑ overlap ensures continuity
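
If you want to see the effect of these settings for yourself, here is a minimal, standalone sketch. The sample text is made up purely for illustration:

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical sample text, repeated so it is long enough to split
base = (
    "ADHD is a neurodevelopmental disorder. Symptoms include inattention, "
    "hyperactivity, and impulsivity. "
)
sample_text = base * 20

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(sample_text)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:80]}...")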

1.4 Creating Embeddings: Teaching Computers to Understand Meaning

python
# 3. Create Embeddings on CPU
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},  # we can use cuda here if we use GPU
)

What are embeddings?

Embeddings are like GPS coordinates for words and sentences. They convert text into numbers that represent meaning.

For example:

  • "autism spectrum disorder" β†’ [0.23, -0.45, 0.78, ...] (384 numbers)
  • "ASD" β†’ [0.22, -0.44, 0.79, ...] (very similar numbers!)
  • "banana recipe" β†’ [-0.67, 0.12, -0.33, ...] (very different numbers)

The model all-MiniLM-L6-v2 is a pre-trained AI that knows how to create these meaningful number representations.
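
You can verify this intuition yourself with a small experiment. The sketch below embeds three phrases and compares them with cosine similarity; the exact numbers will differ on your machine, but the similar phrases should score noticeably higher than the unrelated one:

python
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

asd = embeddings.embed_query("autism spectrum disorder")
abbrev = embeddings.embed_query("ASD")
banana = embeddings.embed_query("banana recipe")

print(len(asd))             # 384 dimensions
print(cosine(asd, abbrev))  # relatively high similarity
print(cosine(asd, banana))  # much lower similarity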


1.5 Storing in Vector Database

python
# 4. Save to Chroma Vector DB
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="chroma_db"
)
print("Success! Database created in 'chroma_db' folder.")

What's happening:

  • ChromaDB takes each chunk and its embedding (those 384 numbers)
  • Stores them in a special database optimized for similarity searches
  • Saves everything to chroma_db/ folder on your hard drive
  • Now we can find similar content lightning-fast!
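
To sanity-check the database you just built, you could temporarily add a few lines like these to the end of ingest.py (the question is only an example):

python
# Quick sanity check: query the freshly built store directly
results = vector_store.similarity_search("What are common symptoms of ADHD?", k=3)

for doc in results:
    source = doc.metadata.get("source", "unknown")
    print(f"--- {source} ---")
    print(doc.page_content[:200])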

πŸŽ‰ Ingestion Complete! You now have a searchable knowledge base of all your PDFs.


Step 2: The Application - Answering Questions

The app.py script is where the magic happens. When you ask a question, it searches the database and generates an answer!

2.1 Setting Up the Foundation

python
import os
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
import streamlit as st

load_dotenv()

Similar to before, but now we're importing:

  • ChatGoogleGenerativeAI: To connect to Google's Gemini AI
  • streamlit: To create our web interface
  • chains: LangChain's way of connecting retrieval β†’ generation

2.2 Loading the Vector Database

python
@st.cache_resource
def get_resource():
    try:
        # 1. Load the embedding model (same one we used for ingestion!)
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            model_kwargs={"device": "cpu"},
        )

        # 2. Connect to the vector database we created
        chroma_db_path = os.path.join(os.path.dirname(__file__), "chroma_db")
        vector_store = Chroma(
            persist_directory=chroma_db_path,
            embedding_function=embeddings
        )
        return vector_store
    except Exception as e:
        st.error(f"{str(e)}")
        st.stop()

Key points:

  • @st.cache_resource: Loads the database once and reuses it (faster!)
  • We use the same embedding model as ingestion (critical for consistency!)
  • We connect to our existing chroma_db/ folder
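
If you want to confirm the store actually contains your chunks, a tiny check like this works (assuming vector_store was created as in get_resource() above, and using Chroma's get() method to pull back the stored records):

python
# Optional: confirm the store has content
stored = vector_store.get()  # returns the stored ids, documents, and metadatas
print(f"{len(stored['ids'])} chunks loaded from chroma_db")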

2.3 Building the RAG Chain

This is where retrieval meets generation:

python
def get_chain(vector_store):
    try:
        # 1. Connect to LLM (Gemini in this case)
        llm = ChatGoogleGenerativeAI(
            model="gemini-2.5-flash-lite",
            temperature=0.3
        )

        # 2. Turn the DB into a search engine
        retriever = vector_store.as_retriever(search_kwargs={"k": 5})

        # 3. The Prompt Template
        prompt = ChatPromptTemplate.from_template(
            """
            You are a helpful medical assistant.
            Answer the user's question based ONLY on the context provided below.
            If the answer is not in the context, reply:
            "I cannot find this information in the provided documents."

            <context>
            {context}
            </context>

            Question: {input}
            """
        )

        # 4. Create the thinking chain
        document_chain = create_stuff_documents_chain(llm, prompt)
        retrieval_chain = create_retrieval_chain(retriever, document_chain)

        return retrieval_chain
    except Exception as e:
        st.error(f"{str(e)}")
        st.stop()

Let's break down each part:

πŸ”Ή The LLM:

  • gemini-2.5-flash-lite: Google's fast, efficient AI model
  • temperature=0.3: Lower = more focused answers, higher = more creative (we want accuracy!)

πŸ”Ή The Retriever:

  • as_retriever(): Turns our database into a search tool
  • k=5: Fetch the 5 most relevant chunks for each question

πŸ”Ή The Prompt: This is crucial! We instruct the AI to:

  • Act as a medical assistant
  • Only use information from retrieved documents
  • Admit when it doesn't know (avoiding hallucinations)

πŸ”Ή The Chain:

  • document_chain: Combines the LLM + prompt
  • retrieval_chain: Adds the retriever to the mix
  • Now questions flow: Question β†’ Retrieve docs β†’ Generate answer
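
Before wiring this into the UI, you can exercise the chain from a plain Python session, which makes debugging retrieval much easier. A minimal sketch, assuming retrieval_chain was built exactly as in get_chain() above and using a made-up question:

python
# Assuming `retrieval_chain` was built as in get_chain() above
response = retrieval_chain.invoke({"input": "What are the main symptoms of ADHD?"})

print(response["answer"])         # the generated answer
for doc in response["context"]:   # the chunks the retriever pulled in
    print("-", doc.metadata.get("source", "unknown"))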

2.4 The Chat Interface

python
# --- UI ---
if "messages" not in st.session_state:
    st.session_state.messages = []

vector_store = get_resource()
rag_chain = get_chain(vector_store)

st.markdown("### πŸ’¬ Conversation")

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

st.markdown("<br>", unsafe_allow_html=True)

# Input Box
if user_input := st.chat_input("Ask about Neurodevelopmental Disorders..."):
    # 1. Show User Message
    st.session_state.messages.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)

    # 2. Generate AI Response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            try:
                response = rag_chain.invoke({"input": user_input})
                answer = response["answer"]
                st.markdown(answer)

                # 3. Save AI Message
                st.session_state.messages.append(
                    {"role": "assistant", "content": answer}
                )
            except Exception as e:
                error_message = f"❌ Error generating response: {str(e)}"
                st.error(error_message)
                st.session_state.messages.append(
                    {"role": "assistant", "content": error_message}
                )

What's happening:

  1. State Management: st.session_state.messages stores chat history
  2. Display History: Shows all previous messages when you reload
  3. User Input: The text box at the bottom
  4. The Magic Moment:
    python
    response = rag_chain.invoke({"input": user_input})
    • This searches the database
    • Retrieves relevant chunks
    • Feeds them to Gemini
    • Returns an answer!
  5. Error Handling: Gracefully handles any issues

Running Your RAG System

Step 1: Install Dependencies

Create a requirements.txt file:

txt
streamlit
langchain
langchain-community
langchain-google-genai
langchain-huggingface
langchain-chroma
chromadb
sentence-transformers
python-dotenv
pypdf
torch
pysqlite3-binary
langchain-classic

Install with:

bash
pip install -r requirements.txt

Step 2: Set Up Environment Variables

Create a .env file:

GOOGLE_API_KEY=your_gemini_api_key_here

Get your free API key from Google AI Studio.
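
If you want to confirm the key is actually being picked up before launching the app, a tiny optional check like this (the file name and variable are the same ones used above) does the job:

python
# Optional: verify the API key is visible to the app
import os
from dotenv import load_dotenv

load_dotenv()
assert os.getenv("GOOGLE_API_KEY"), "GOOGLE_API_KEY is missing from .env"
print("API key found.")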

Step 3: Add Your Documents

Put your PDF research papers in a data/ folder:

rag-app/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ adhd_research.pdf
β”‚   β”œβ”€β”€ autism_study.pdf
β”‚   └── neurodevelopment_paper.pdf
β”œβ”€β”€ ingest.py
β”œβ”€β”€ app.py
└── requirements.txt

Step 4: Ingest Your Documents

bash
python ingest.py

You'll see output like this (the exact chunk count depends on your documents):

Split into 487 text chunks.
Success! Database created in 'chroma_db' folder.

Step 5: Launch the App!

bash
streamlit run app.py

Your browser will open with a chat interface. Try asking:

  • "What are the main symptoms of ADHD?"
  • "How is autism spectrum disorder diagnosed?"
  • "What treatments are available for dyslexia?"

What You've Accomplished

Congratulations! πŸŽ‰ You've just built a working end-to-end RAG system! Here's what you now understand:

βœ… RAG Architecture - How retrieval and generation work together
βœ… Vector Embeddings - Converting text to searchable numbers
βœ… Chunking Strategies - Breaking documents into optimal pieces
βœ… Similarity Search - Finding relevant information lightning-fast
βœ… Prompt Engineering - Controlling AI behavior with instructions
βœ… LangChain Chains - Connecting components into workflows


Next Steps: Taking It Further

Now that you have the basics, here are some exciting enhancements:

  1. Add More Document Types: Support Word docs, websites, or YouTube transcripts
  2. Improve Chunking: Experiment with semantic chunking or larger sizes
  3. Add Citations: Show which documents the answer came from (see the sketch after this list)
  4. Better Embeddings: Try gte-large or bge-large for more accuracy
  5. Multiple Collections: Create separate databases for different topics
  6. Advanced Retrieval: Implement hybrid search (keyword + semantic)
  7. Deploy Online: Host on Streamlit Cloud, Hugging Face Spaces, or Render
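
For example, item 3 (citations) needs no extra retrieval work: the retrieval chain already returns the chunks it used in response["context"]. A minimal sketch of how you might surface them in app.py, placed right after the answer is rendered:

python
# Inside the assistant block in app.py, after st.markdown(answer):
with st.expander("πŸ“š Sources"):
    for doc in response["context"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        st.markdown(f"- `{source}` (page {page})")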

Key Takeaways

RAG is powerful because:

  • ✨ Your AI answers are grounded in your specific documents
  • πŸ”’ Data stays private (no uploading PDFs to random websites)
  • 🎯 Reduces AI "hallucinations" by limiting responses to known information
  • πŸ“š Scales to thousands of documents without retraining models
  • πŸ”„ Easy to update - just re-run ingest.py with new PDFs

Remember:

  • Use the same embedding model for ingestion and querying
  • Start with smaller chunk_size for precise answers
  • Prompt engineering is crucial - be specific about what you want
  • Always validate AI answers against source documents


Alternative Models: Customize Your RAG System

Want to experiment with different models? Here are some excellent alternatives to consider:

Embedding Models

The embedding model converts text into vectors. Different models offer different trade-offs between speed, accuracy, and size.

| Model Name | Size | Dimensions | Best For | Speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (Current) | 80MB | 384 | General purpose, fast | ⚑⚑⚑ |
| all-mpnet-base-v2 | 420MB | 768 | Better accuracy | ⚑⚑ |
| gte-large | 670MB | 1024 | High accuracy | ⚑ |
| bge-large-en-v1.5 | 1.34GB | 1024 | State-of-the-art English | ⚑ |
| instructor-xl | 4.96GB | 768 | Task-specific instructions | ⚑ |
| e5-large-v2 | 1.34GB | 1024 | Multilingual support | ⚑ |

How to Switch Embedding Models:

In both ingest.py and app.py, replace:

python
# Current model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

With your chosen model:

python
# Example: Using all-mpnet-base-v2 for better accuracy
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cpu"},
)

# Example: Using bge-large for state-of-the-art performance
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cpu"},
)

# Example: Using instructor-xl with instructions
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-xl",
    model_kwargs={"device": "cpu"},
    embed_instruction="Represent the medical document for retrieval: "
)

⚠️ Important: If you change the embedding model, you must re-run ingest.py to rebuild your database!


LLM Models

The LLM generates the final answers. Here are alternatives to Google Gemini:

1️⃣ OpenAI (ChatGPT)

python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",       # Fast and affordable
    # model="gpt-4o",          # Most capable
    # model="gpt-3.5-turbo",   # Cheapest option
    temperature=0.3
)

Setup:

bash
pip install langchain-openai

Add to .env:

OPENAI_API_KEY=your_openai_key

Pros: Excellent reasoning, widely used, great documentation
Cons: Costs money per token (though affordable)


2️⃣ Anthropic Claude

python
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",    # Best balance
    # model="claude-3-5-haiku-20241022",   # Fastest
    # model="claude-3-opus-20240229",      # Most powerful
    temperature=0.3
)

Setup:

bash
pip install langchain-anthropic

Add to .env:

ANTHROPIC_API_KEY=your_claude_key

Pros: Very safe, excellent for analysis, large context window
Cons: Paid service


3️⃣ Google Gemini (Current)

python
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-exp",     # Latest experimental
    # model="gemini-1.5-pro",         # Most capable
    # model="gemini-1.5-flash",       # Fast and free
    # model="gemini-2.5-flash-lite",  # Current choice
    temperature=0.3
)

Pros: Generous free tier, fast, multimodal capabilities
Cons: Slightly less accurate than GPT-4 or Claude


4️⃣ Local Models (Ollama)

Run LLMs on your own computer - completely free and private!

python
from langchain_community.llms import Ollama

llm = Ollama(
    model="llama3.2",    # Meta's Llama 3.2
    # model="mistral",   # Mistral AI
    # model="phi3",      # Microsoft Phi-3
    # model="gemma2",    # Google Gemma 2
    temperature=0.3
)

Setup:

  1. Install Ollama: Visit ollama.ai
  2. Pull a model: ollama pull llama3.2
  3. Install LangChain integration:
bash
pip install langchain-community

Pros: Free, private, no API keys, works offline
Cons: Requires good hardware (8GB+ RAM), slower than cloud APIs


5️⃣ Groq (Ultra-Fast Inference)

python
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.3-70b-versatile",  # Llama 3.3 70B
    # model="mixtral-8x7b-32768",     # Mixtral
    # model="gemma2-9b-it",           # Gemma 2
    temperature=0.3
)

Setup:

bash
pip install langchain-groq

Add to .env:

GROQ_API_KEY=your_groq_key

Pros: EXTREMELY FAST, free tier available, great for demos
Cons: Limited models compared to OpenAI


6️⃣ Hugging Face Models

python
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    # repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    # repo_id="google/flan-t5-xxl",
    temperature=0.3,
    max_new_tokens=512
)

Setup:

bash
pip install langchain-huggingface

Add to .env:

HUGGINGFACEHUB_API_TOKEN=your_hf_token

Pros: Access to thousands of open-source models, free inference API
Cons: Rate limits on free tier, variable quality


Quick Comparison Table

| Provider | Best Model | Cost | Speed | Privacy | Setup Difficulty |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | πŸ’°πŸ’° | ⚑⚑⚑ | ☁️ Cloud | ⭐ Easy |
| Anthropic | Claude 3.5 Sonnet | πŸ’°πŸ’° | ⚑⚑ | ☁️ Cloud | ⭐ Easy |
| Google Gemini | Gemini 1.5 Pro | πŸ’° Free tier | ⚑⚑⚑ | ☁️ Cloud | ⭐ Easy |
| Groq | Llama 3.3 70B | πŸ’° Free tier | ⚑⚑⚑⚑⚑ | ☁️ Cloud | ⭐ Easy |
| Ollama (Local) | Llama 3.2 | βœ… Free | ⚑ | πŸ”’ 100% Local | ⭐⭐ Medium |
| Hugging Face | Various | πŸ’° Free tier | ⚑⚑ | ☁️ Cloud | ⭐⭐ Medium |

Recommendations by Use Case

πŸ₯ Medical/Research (Accuracy Critical):

  • LLM: Claude 3.5 Sonnet or GPT-4o
  • Embeddings: bge-large-en-v1.5 or gte-large

⚑ Speed/Free Tier:

  • LLM: Groq (Llama 3.3) or Gemini Flash
  • Embeddings: all-MiniLM-L6-v2 (current)

πŸ”’ Privacy/Local:

  • LLM: Ollama with Llama 3.2 or Mistral
  • Embeddings: Any Sentence Transformer (runs locally)

πŸ’° Budget-Conscious:

  • LLM: Gemini 1.5 Flash (generous free tier)
  • Embeddings: all-MiniLM-L6-v2 (free, efficient)

🌍 Multilingual:

  • LLM: GPT-4o or Gemini 1.5 Pro
  • Embeddings: e5-large-v2 or multilingual-e5-large

Source Code

Basic RAG Application

bash
git clone -b simple https://github.com/nazmulshuvo03/neuroRAG.git

Experimentation Tips

  1. Start Simple: Begin with free models (Gemini, Ollama) to validate your approach
  2. Test Systematically: Keep a set of test questions and compare answers across models
  3. Monitor Costs: Use cloud provider dashboards to track API spending
  4. Benchmark Speed: Time your queries - faster models improve user experience (see the sketch after this list)
  5. Check Quality: Verify answers against source documents regularly
  6. Scale Gradually: Start with small datasets, then scale up
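
As a starting point for tips 2 and 4, here is a minimal, hypothetical harness. It assumes rag_chain is the retrieval chain built as in app.py, and the questions are placeholders you should replace with ones relevant to your own documents:

python
import time

# Hypothetical evaluation questions; adjust to your own documents
test_questions = [
    "What are the main symptoms of ADHD?",
    "How is autism spectrum disorder diagnosed?",
]

for q in test_questions:
    start = time.time()
    response = rag_chain.invoke({"input": q})
    elapsed = time.time() - start
    print(f"Q: {q}")
    print(f"A ({elapsed:.1f}s): {response['answer'][:200]}...\n")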

Final Thoughts

You've just built something genuinely useful! RAG systems are being used by companies worldwide for:

  • Customer support chatbots
  • Legal document analysis
  • Medical research assistance
  • Educational tutoring systems

The best way to learn is by experimenting. Try different models, tweak the prompts, adjust chunk sizes, and see what works best for your use case.

Happy building! πŸš€


Have questions or improvements? Feel free to reach out or contribute to the project!