Quote

"Between stimulus and response there is a space. In that space is our power to choose our response.
In our response lies our growth and freedom"


“The only way to discover the limits of the possible is to go beyond them into the impossible.”


Wednesday, 5 March 2025

Developing Scalable, Secure, Private & 'Context Aware' AI Applications

 


Context-augmented LLM applications use current, curated data from managed and trusted sources to leverage the power of large language models while overcoming their intrinsic limitations, such as hallucinations and training on stale data.

This approach also addresses proprietary data management and data privacy concerns, enabling enterprises to develop unique products and solutions while protecting proprietary assets and customer privacy.

The architecture that combines the generative power of LLMs with access to protected, accurate and up-to-date data sources is often referred to as Retrieval Augmented Generation (RAG). It combines the advantages of retrieval-based and generative AI models. The retrieval component of the solution serves as a search engine, fetching the most relevant data vectors for a given query. The generative component then uses these secured, context-aware data vectors to craft a response. The collaboration of these two technologies ensures that the AI solution provides responses enriched by real-world information rather than relying only on the model's training data, which is usually stale. For example, when you ask a RAG-based solution built on top of an LLM about recent events, it retrieves the latest information and incorporates it into its response, offering accurate and timely insights.

Typical RAG Solution


A typical RAG system performs the following three actions:


  1. Retrieve: The original user query is used to retrieve relevant context from an external knowledge source, typically a vector database such as Weaviate or Qdrant. The retrieval is performed via a similarity search of the embedded user query against the vector database. To do this, the user query is embedded with an embedding model into the same vector space as the additional (proprietary) context stored in the vector database. Having the user query in that context-aware vector space makes a similarity search possible, and the closest data objects are retrieved from the vector database.
  2. Augment: The user query and the retrieved additional context are fitted into a prompt template. A prompt template provides a reusable pattern that can be filled with specific information to generate useful prompts (a minimal sketch of this step follows this list).
  3. Generate: Finally, the retrieval-augmented prompt is passed to the LLM, which processes the enriched prompt through its layers of interconnected artificial neurons, i.e. a neural network. The embedding layer within the LLM translates the retrieval-augmented prompt into numerical vectors that capture its semantic meaning. Most modern LLMs employ the transformer architecture, which uses a "self-attention" mechanism to analyze relationships between tokens in the input sequence. The transformer relies on its encoder-decoder (or, in many modern LLMs, decoder-only) structure to predict the next output tokens based on the encoded input. Several hidden layers within the neural network perform complex calculations on the input embeddings, progressively building a deeper understanding of the input. Finally, the feedforward and output layers generate the final response from the retrieval-augmented prompt processed through the self-attention mechanism, combining proprietary knowledge with the generative capability of the LLM.
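
To make the Augment step concrete, here is a minimal, hypothetical sketch of a prompt template that stitches the retrieved context to the user query before it is sent to the LLM. The template wording, variable names and sample chunks are illustrative and not tied to any specific framework.

# A minimal, illustrative prompt template for the Augment step.
# The retrieved_chunks list is assumed to come from the similarity
# search performed in the Retrieve step.
RAG_PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you do not know.

Context:
{context}

Question:
{question}
"""

def build_rag_prompt(question: str, retrieved_chunks: list) -> str:
    """Fill the reusable template with the query and retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return RAG_PROMPT_TEMPLATE.format(context=context, question=question)

# Example usage with placeholder retrieved chunks
prompt = build_rag_prompt(
    "What changed in the Q3 pricing policy?",
    ["Q3 pricing introduced tiered discounts for enterprise plans.",
     "Enterprise plans moved to annual billing in Q3."],
)
print(prompt)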


Illustration of Representative Layers of LLM

Embeddings and Vector Database Implementation

Embeddings and vector databases are critical pieces of a RAG implementation. Embeddings are numerical representations of text (or other data) in a high-dimensional space, capturing semantic meaning. In RAG, embeddings help with:

  • Semantic Search – Instead of relying on exact keywords, embeddings allow retrieval based on meaning, improving relevance (see the short similarity sketch after this list).
  • Efficient Document Retrieval – Queries and documents are converted into embeddings, making it easier to find similar content.
  • Contextual Understanding – Embeddings capture relationships between words, improving retrieval accuracy.
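
As a small illustration of semantic search, the snippet below is a rough sketch that compares a query embedding against a couple of document embeddings using cosine similarity. The tiny three-dimensional vectors are placeholders; real embeddings (such as those produced later with OpenAIEmbedding) have hundreds or thousands of dimensions.

# Illustrative cosine-similarity ranking of documents against a query.
# The vectors are tiny placeholders standing in for real embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.1, 0.3, 0.5])
doc_embeddings = {
    "refund policy": np.array([0.1, 0.28, 0.52]),
    "office locations": np.array([0.9, -0.2, 0.1]),
}

# Rank documents by semantic closeness to the query, not by keyword overlap.
for name, emb in sorted(doc_embeddings.items(),
                        key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                        reverse=True):
    print(name, round(cosine_similarity(query_embedding, emb), 3))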

A vector database stores and searches through embeddings efficiently. In RAG, vector databases help by:

  • Fast Similarity Search – Uses algorithms like k-nearest neighbors (k-NN) or approximate nearest neighbors (ANN) to quickly find the most relevant documents (a rough Qdrant sketch follows this list).
  • Scalability – Can handle millions or billions of embeddings, making it ideal for large-scale AI applications.
  • Efficient Storage & Retrieval – Optimized for storing and searching high-dimensional embeddings, unlike traditional relational databases.
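
Since Qdrant was mentioned above as an example vector database, here is a rough sketch of storing and querying embeddings with its in-memory mode. It assumes the qdrant-client Python package is installed; the three-dimensional vectors and payloads are placeholders, not real embeddings.

# Rough sketch: store and search placeholder embeddings in an in-memory Qdrant.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # in-memory instance, handy for experimentation

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.28, 0.52], payload={"text": "refund policy"}),
        PointStruct(id=2, vector=[0.9, -0.2, 0.1], payload={"text": "office locations"}),
    ],
)

# Similarity search for the query embedding; returns the closest stored points.
hits = client.search(collection_name="docs", query_vector=[0.1, 0.3, 0.5], limit=1)
print(hits[0].payload)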

How They Work Together in RAG

  1. Convert Text to Embeddings – User queries and documents are transformed into vector representations.
  2. Store in a Vector Database – The document embeddings are indexed and stored in a vector database.
  3. Retrieve Relevant Documents – The query embedding is compared to stored embeddings, fetching the most relevant documents.
  4. Generate Response Using Retrieved Context – The retrieved documents are passed to an LLM, which generates an informed response (a compact end-to-end sketch follows this list).
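
Putting the four steps together, a compact sketch using LlamaIndex (the same library used in the embedding example below) might look like the following. It assumes an OpenAI API key is configured and uses two placeholder documents; the default in-memory vector store stands in for a production vector database.

# Compact end-to-end RAG sketch with LlamaIndex: embed and index documents,
# retrieve relevant chunks for a query, and generate a grounded response.
# Assumes OPENAI_API_KEY is set; the documents are placeholders.
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(text="The Q3 pricing policy introduced tiered discounts for enterprise plans."),
    Document(text="Support hours were extended to 24x5 starting in September."),
]

index = VectorStoreIndex.from_documents(documents)         # steps 1-2: embed documents, store vectors
query_engine = index.as_query_engine(similarity_top_k=2)   # wraps a retriever and the LLM

response = query_engine.query("What changed in the Q3 pricing policy?")  # steps 3-4: retrieve, then generate
print(response)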

If all of this is a lot of information, let's take a quick look at a simple example of generating embeddings for text using LlamaIndex. The idea is not to dive into code, but to understand how embedding length varies across models and what that implies for RAG.

# get API key and create embeddings with the larger model
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

embeddings = embed_model.get_text_embedding(
    "OpenAI's new embedding models are great."
)
print(embeddings[:5])
# [-0.011500772088766098, 0.02457442320883274, -0.01760469563305378, -0.017763426527380943, 0.029841400682926178]
print(len(embeddings))
# 3072 (note the embedding length)

# repeat with the smaller model
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

embeddings = embed_model.get_text_embedding(
    "OpenAI's new embedding models are awesome."
)
print(len(embeddings))
# 1536 (note the reduction in embedding length)

Using a vector database to store embeddings can significantly enhance the performance and scalability of applications that rely on high-dimensional data. Whether you are working on NLP, image processing, or recommendation systems, vector databases provide a robust solution for managing and querying embeddings efficiently, which is why they are a crucial component of the Retrieval Augmented Generation (RAG) pipeline.

Explainability of RAG Models

Retrieval-Augmented Generation (RAG) models are generally more explainable than standard Large Language Models (LLMs) because they explicitly retrieve external knowledge before generating responses. Here’s why:

Why RAG Models Are More Explainable

  1. Source Attribution – Unlike pure LLMs that rely only on pre-trained knowledge, RAG models can point to the specific documents or passages they retrieved before generating a response (see the short sketch after this list).
  2. Better Transparency – Users and developers can inspect retrieved documents to see what influenced the AI’s answer.
  3. Reduced Hallucination – Since RAG grounds its answers in retrieved, up-to-date information, it is less likely to make things up than a traditional LLM.
  4. Interpretable Retrieval Steps – The retrieval component can be analyzed separately to understand why certain documents were chosen.
  5. Easier Debugging – If an answer is incorrect, you can check whether the issue lies in the retrieval step (wrong documents) or the generation step (misinterpretation of facts).
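
As a concrete illustration of source attribution, LlamaIndex responses expose the retrieved chunks that informed the answer. The following is a rough, self-contained sketch; it assumes an OpenAI API key is configured and uses placeholder documents.

# Sketch: inspect which retrieved passages influenced the generated answer.
# Assumes OPENAI_API_KEY is set; the documents are placeholders.
from llama_index.core import Document, VectorStoreIndex

index = VectorStoreIndex.from_documents([
    Document(text="The 2024 handbook allows remote work on Fridays."),
    Document(text="Parking passes are issued by the facilities team."),
])
query_engine = index.as_query_engine()

response = query_engine.query("When is remote work allowed?")
print(response)                               # the generated answer
for node in response.source_nodes:            # the retrieved passages behind it
    print(node.score, node.node.get_content())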

Overall, RAG improves explainability compared to pure LLMs, but additional methods like attention visualization, counterfactual analysis, or human feedback loops can enhance it further.

Prevailing Retrieval Augmented Generation Techniques

Several RAG techniques continue to evolve that help AI solutions provide relevant, context-aware responses. Here are a few of the most common techniques, which have proven highly effective across several use cases.

  • Sentence-window retrieval: The Sentence-window Retrieval technique is based on the principle of optimizing both the retrieval and generation processes by tailoring the text chunk size to the specific needs of each stage. For retrieval, this technique emphasizes single sentences, taking advantage of small data chunks for potentially better retrieval. On the generation side, it adds more sentences around the initial one to offer the LLM extended context, aiming for richer, more detailed outputs. This decoupling is intended to increase the performance of both retrieval and generation, ultimately leading to better performance of the whole RAG system.
  • Document summary index: The Document Summary Index method enhances RAG systems by indexing document summaries for efficient retrieval, while providing LLMs with full text documents for response generation. This decoupling strategy optimizes retrieval speed and accuracy through summary-based indexing and supports comprehensive response synthesis by utilizing the original text.
  • HyDE: The Hypothetical Document Embedding technique enhances document retrieval by leveraging LLMs to generate a hypothetical answer to a query. HyDE exploits the ability of LLMs to produce context-rich answers, which, once embedded, serve as a powerful tool to refine and focus document retrieval efforts.
  • Multi-Query: The Multi-Query technique enriches document retrieval by expanding a single user query into multiple similar queries using an LLM. This process involves generating N alternative questions that echo the intent of the original query from different perspectives, thereby capturing a broader spectrum of potential answers. Each query, including the original, is then vectorized and subjected to its own retrieval process, which increases the chances of fetching a higher volume of relevant information from the document repository. To manage the resulting expanded dataset, a re-ranker is often employed, using machine learning models to sift through the retrieved chunks and prioritize those most relevant to the initial query. The MultiQueryRetriever automates this prompt-tuning process by using an LLM to generate multiple queries from different perspectives for a given user input query; for each query it retrieves a set of relevant documents and takes the unique union across all queries to obtain a larger set of potentially relevant documents (a brief sketch of this technique appears after this list).
  • Maximum Marginal Relevance: The Maximal Marginal Relevance (MMR) technique refines the retrieval process by balancing relevance and diversity in the documents retrieved. With MMR, the retrieval system evaluates candidate documents not only for their closeness to the query's intent but also for their uniqueness compared to documents already selected. This approach mitigates redundancy, ensuring that the set of retrieved documents covers a broader range of information.
  • Cohere Rerank: Rerankers aim to enhance the RAG process by refining the selection of documents retrieved in response to a query, with the goal of prioritizing the most relevant and contextually appropriate information for generating responses. This step employs ML algorithms (such as cross-encoder) to reassess the initially retrieved set, using criteria that extend beyond cosine similarity. Through this evaluation, rerankers are expected to improve the input for generative models, potentially leading to more accurate and contextually rich outputs. One tool in this domain is Cohere rerank, which uses a cross-encoder architecture to assess the relevance of documents to the query. This approach differs from methods that process queries and documents separately, as cross-encoders analyze them jointly, which could allow for a more comprehensive understanding of their mutual relevance.
  • LLM rerank: Following the introduction of cross-encoder based rerankers such as Cohere rerank, the LLM reranker offers an alternative strategy by directly applying LLMs to the task of reranking retrieved documents. This method prioritizes the comprehensive analytical abilities of LLMs over the joint query-document analysis typical of cross-encoders. Although less efficient in terms of processing speed and cost compared to cross-encoder models, LLM rerankers can achieve higher accuracy by leveraging the advanced understanding of language and context inherent in LLMs. This makes the LLM reranker suitable for applications where the quality of the reranked results is more critical than computational efficiency.
  • Knowledge Graph RAG: Integrating Knowledge Graphs (KGs) with RAG systems represents a promising direction for enhancing retrieval precision and contextual relevance. KGs offer a well-organized framework of relationship-rich data that could refine the retrieval phase of RAG systems. Although setting up such systems is resource-demanding, the potential for significantly improved retrieval processes justifies ongoing further research and development.
  • Auto-RAG: The idea of automatically optimizing RAG systems, much like AutoML's approach in traditional machine learning, is promising. Currently, selecting the optimal configuration of RAG components, e.g. chunking strategies, window sizes, and re-ranker parameters, relies on manual experimentation and the intuition of practitioners. An automated system could systematically evaluate RAG configurations and select the best-performing one.
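
The MultiQueryRetriever described above is available in LangChain; the following is a rough sketch of how it can be wired up. It assumes the langchain, langchain-openai, langchain-community and faiss-cpu packages are installed, an OpenAI API key is configured, and the indexed texts are placeholders.

# Rough sketch of the Multi-Query technique with LangChain's MultiQueryRetriever.
# Assumes OPENAI_API_KEY is set; the indexed texts are placeholders.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers.multi_query import MultiQueryRetriever

vectorstore = FAISS.from_texts(
    ["Q3 pricing introduced tiered discounts.",
     "Enterprise plans moved to annual billing.",
     "Support hours were extended to 24x5."],
    embedding=OpenAIEmbeddings(),
)

# The LLM rewrites the user query into several alternative questions; the
# retriever returns the unique union of documents retrieved for each of them.
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)

docs = retriever.invoke("How did pricing change recently?")
for doc in docs:
    print(doc.page_content)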

Continuous Evolution...

All of the techniques above, sourced from research papers and publications, are evolving through continuous research and development. The field of artificial intelligence, neural network based solutions and the application development layered on top of them have matured into enterprise-grade solutions since the arrival of large language models from vendors competing for dominance. Since I wrote our first AI solution in 2018 with team 'Prognosis Pundits', winning the Hackathon, the world and this domain have progressively evolved, opening opportunities that were unimaginable a decade ago. Techniques like federated model training, differential privacy, detecting and correcting biases in models, evaluating model fairness, and measuring the diversity of training data are continuously evolving and being perfected in competing labs, poised to change business landscapes forever.

1 comment:

  1. Great article. Choosing the right embedding model that suits the source data/documents is also crucial for better results during retrieval!
