Why RAG?
Imagine you have a huge library full of books, but instead of reading every book to find the answer to your question, you have a super-smart librarian who quickly finds the most relevant pages and summarizes them for you.
That’s how RAG (Retrieval-Augmented Generation) works in AI. Instead of just relying on what it already knows (which might be outdated or incomplete), a RAG system first retrieves fresh and relevant information from a database or the internet and then generates a response using that information.
This makes RAG much smarter and more accurate because it pulls in the latest, most relevant facts before answering—just like a good researcher!
A typical RAG system performs the following three steps:
- Retrieve: The original user query is used to retrieve relevant context from an external knowledge source, typically a vector database such as Weaviate or Qdrant. To do this, the user query is embedded with an embedding model into the same vector space as the additional (often proprietary) context stored in the database. Once the query lives in that shared vector space, a similarity search can be performed, and the closest data objects are retrieved from the vector database.
- Augment: The user query and the retrieved context are inserted into a prompt template. A prompt template provides a reusable pattern that can be filled in with specific information to generate useful prompts.
- Generate: Finally, the retrieval-augmented prompt is passed to the LLM, which processes it through layers of interconnected artificial neurons, i.e., a neural network. The embedding layer within the LLM translates the prompt into numerical vectors that capture its semantic meaning. Most modern LLMs use the transformer architecture, whose self-attention mechanism analyzes the relationships between the tokens of the input sequence. The original transformer paired an encoder with a decoder; most of today's LLMs are decoder-only variants that predict the next token directly from the encoded input. Several hidden layers perform computations on the input embeddings, progressively building a deeper representation of the prompt, and the final output layer then generates the response token by token, combining your proprietary knowledge with the generative capability of the LLM. A minimal code sketch of all three steps follows this list.
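Here is a minimal end-to-end sketch of the three steps in Python. It uses sentence-transformers for embeddings and a plain NumPy array as a stand-in for a vector database; the llm() function is a placeholder for whichever LLM client you actually use, and the model name and documents are purely illustrative.

```python
# Minimal retrieve -> augment -> generate sketch. The in-memory "vector
# store" and the llm() stub are simplified stand-ins, not a product's API.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Retrieve: embed documents and the query into the same vector space,
#    then rank documents by cosine similarity.
docs = [
    "Our refund window is 30 days from the date of purchase.",
    "Support is available Monday through Friday, 9am-5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "How long do customers have to request a refund?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec          # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]   # indices of the 2 closest documents
context = "\n".join(docs[i] for i in top_k)

# 2. Augment: fit the query and retrieved context into a reusable template.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)

# 3. Generate: pass the retrieval-augmented prompt to an LLM.
#    llm() is a placeholder for your client (OpenAI, local model, etc.).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# print(llm(prompt))
```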
RAG Techniques
Retrieval-Augmented Generation (RAG) relies on a combination of retrieval and generation techniques to improve the accuracy and relevance of AI-generated responses.
Here are some common RAG techniques:
1. Retrieval Techniques
These methods help fetch relevant documents or information before generating a response.
- Dense Retrieval (e.g., FAISS, DPR) – Uses deep learning-based embeddings to find the most relevant documents efficiently.
- Sparse Retrieval (e.g., BM25, TF-IDF) – Uses traditional keyword-based search to rank documents by relevance.
- Hybrid Retrieval – Combines dense and sparse retrieval methods to improve accuracy (a code sketch follows this list).
- Hierarchical Retrieval – First retrieves broad categories of information, then drills down into specific details.
- Re-ranking Models (e.g., Cross-Encoders, Rank-BERT) – After retrieving initial results, these models refine the ranking for better precision.
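As a concrete illustration of hybrid retrieval, here is a sketch that fuses BM25 scores (via the rank_bm25 package) with dense embedding scores using reciprocal rank fusion. The documents, the model name, and the k=60 fusion constant are illustrative choices, not fixed requirements.

```python
# Hybrid retrieval sketch: sparse BM25 ranking fused with dense embedding
# ranking via reciprocal rank fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrievers embed text so that similar meanings land close together.",
    "Hybrid search fuses keyword and semantic signals for better recall.",
]
query = "combining keyword search with semantic embeddings"

# Sparse ranking: keyword overlap scored by BM25.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = np.asarray(bm25.get_scores(query.lower().split()))

# Dense ranking: cosine similarity of normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
dense_scores = doc_vecs @ model.encode([query], normalize_embeddings=True)[0]

# Reciprocal rank fusion: each ranking contributes 1 / (k + rank).
def rrf(scores: np.ndarray, k: int = 60) -> np.ndarray:
    ranks = np.argsort(np.argsort(-scores))  # rank 0 = best score
    return 1.0 / (k + ranks + 1)

fused = rrf(sparse_scores) + rrf(dense_scores)
for i in np.argsort(-fused):
    print(f"{fused[i]:.4f}  {docs[i]}")
```

RRF is used here because it fuses rankings without having to normalize BM25 and cosine scores onto a common scale; weighted score averaging is a common alternative.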
2. Augmentation Techniques
These techniques improve the way retrieved information is used in response generation.
- Chunking – Breaks documents into smaller passages to improve retrieval relevance (see the sketch after this list).
- Context Expansion – Adds additional metadata or summaries to make retrieval more precise.
- Memory-Augmented Retrieval – Keeps track of past queries and retrieved data for better long-term context.
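To make chunking and context expansion concrete, here is a small sketch that splits a document into overlapping word windows and prefixes each chunk with metadata before indexing. The window sizes and the metadata tag are illustrative; real systems often chunk by sentences, tokens, or document structure instead.

```python
# A minimal chunking sketch: overlapping fixed-size windows of words.
def chunk(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

sample = "RAG systems split long documents into chunks before indexing. " * 30
chunks = chunk(sample, size=40, overlap=10)

# Context expansion: prefix each chunk with metadata (hypothetical tag shown)
# so the retriever can match on the source as well as the body text.
indexed = [f"[source: product-faq] {c}" for c in chunks]
print(len(chunks), "chunks; first starts:", indexed[0][:60])
```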
3. Generation Techniques
Once the right documents are retrieved, these methods help generate better responses.
- Fusion-in-Decoder (FiD) – Encodes each retrieved passage separately and fuses them in the decoder to improve response quality.
- Retrieval-Augmented Fine-Tuning – The model is fine-tuned on retrieved data to improve accuracy.
- Contrastive Learning – Helps the model distinguish between useful and irrelevant retrieved data.
- Chain-of-Thought (CoT) Prompting – Encourages step-by-step reasoning to improve complex responses (a prompt-template sketch follows this list).
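As an example of CoT prompting in a RAG setting, the template below asks the model to reason over the retrieved context step by step before committing to an answer. The exact wording is illustrative, and llm() again stands in for your own client.

```python
# A chain-of-thought prompt template for retrieval-augmented generation.
def cot_prompt(question: str, context: str) -> str:
    return (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "First, think step by step: list the relevant facts from the context "
        "and explain how they connect. Then give the final answer on a new "
        "line starting with 'Answer:'."
    )

# Example (retrieved_context comes from your retrieval step):
# print(llm(cot_prompt("How long is the refund window?", retrieved_context)))
```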
4. Post-Processing Techniques
These techniques refine the final output.
- Answer Verification – Cross-checks generated answers against retrieved documents (a toy sketch follows this list).
- Fact-checking & Consistency Checking – Uses additional models or logic rules to ensure accuracy.
- Human-in-the-Loop Feedback – Uses human reviewers to refine the retrieval and generation process over time.
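As a toy illustration of answer verification, the sketch below flags answer sentences whose words mostly do not appear in the retrieved context. This lexical-overlap heuristic is deliberately naive; production systems typically use an NLI model or a cross-encoder for this check.

```python
# Naive answer verification: flag answer sentences with low word overlap
# against the retrieved context.
import re

def unsupported_sentences(answer: str, context: str,
                          threshold: float = 0.5) -> list[str]:
    ctx_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"\w+", sent.lower())
        if words and sum(w in ctx_words for w in words) / len(words) < threshold:
            flagged.append(sent)
    return flagged

context = "The refund window is 30 days."
answer = "The refund window is 30 days. Shipping is always free."
print(unsupported_sentences(answer, context))  # flags the shipping claim
```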