Innovation is a process, not an event. Sharing should be fun so that it accelerates growth for all. These posts aim to share helpful knowledge and insights from working on various technologies, products, and domains.
Quote
"Between stimulus and response there is a space. In that space is our power to choose our response. In our response lies our growth and freedom"
“The only way to discover the limits of the possible is to go beyond them into the impossible.”
Modern LLMs have expanded in parameter size, incorporated reasoning capabilities, and are increasingly embedded in agentic AI workflows. As a result, they generate a far greater number of tokens during inference and require deployment in distributed environments, driving up costs. Therefore, optimizing inference-serving strategies to lower costs and support seamless scaling in distributed environments is crucial.
What Is an AI Inference Engine?
An AI inference engine is the component of an AI system that uses a trained model to make predictions or decisions based on new input data.
Training phase = learning (like studying for a test)
Inference phase = using what was learned to answer questions (like taking the test)
The inference engine does not learn or update the model — it simply runs the model with new data and gives you the output.
How It Works (Step-by-Step)
Let’s say you trained a model to recognize cats and dogs in photos.
1. Input
You give the model a new image (e.g., a photo of a dog).
2. Preprocessing
The image is resized or normalized so it fits the model’s input format.
3. Model Execution
The inference engine runs the trained model using this input and computes the output. For example, it might output:
{"cat": 0.12, "dog": 0.88}
It means there's an 88% chance this is a dog.
4. Postprocessing
The result is cleaned up or interpreted, so it gives a clear output like:
"This image is most likely a dog."
Where It’s Used
Inference happens any time you use an AI model in the real world, such as:
Chatbots generating answers
Recommendation engines suggesting videos
Self-driving cars making real-time decisions
Medical AI diagnosing X-rays
Translation apps turning speech into another language
What Makes Up an Inference Engine?
An inference engine includes:
Model Loader – Loads the trained model into memory
Input Preprocessor – Prepares input (e.g., text, image) to match the model format
Inference Core – Executes the model's layers and math to get the result
Output Postprocessor – Translates raw output into meaningful results
Accelerators (optional) – Uses GPUs/TPUs to speed things up
Optimized Inference
AI models, especially large ones like GPT or BERT, can be slow or expensive to run. Inference engines are optimized to make them faster and more efficient by:
Quantization (using lower precision like int8 instead of float32)
Pruning (removing parts of the model not needed)
Graph optimization (reordering computations for speed)
Hardware acceleration (using GPUs, TPUs, FPGAs)
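As a rough illustration of the first technique, here is a minimal sketch of symmetric int8 quantization in numpy; real engines such as TensorRT or ONNX Runtime typically quantize per channel using calibration data, so treat this as a toy example of the idea.

import numpy as np

def quantize_int8(weights):
    # Symmetric quantization: map float32 weights to int8 with a single scale factor.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("memory per value: 4 bytes ->", q.itemsize, "byte")
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())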
Some well-known inference engines and frameworks:
TensorRT (NVIDIA)
NVIDIA Dynamo (2025)
ONNX Runtime
OpenVINO (Intel)
TFLite (TensorFlow Lite)
DeepSparse (Neural Magic)
🎯 Why It's Critical for the AI Ecosystem
Inference is the actual deployment of AI. While training can take days or weeks, inference needs to happen in milliseconds — especially in real-time applications like:
Fraud detection
Voice assistants
Real-time translation
Stock prediction
Autonomous vehicles
The goal of an inference engine is to:
Be fast
Use less memory
Run on edge devices, not just in the cloud
Be scalable and low-latency
An AI inference engine is like the brain of an AI application at runtime — it takes a trained model and new data, processes it, and gives fast, reliable answers.
Importance of KV Cache Management, i.e., Model Memory Management
The KV Cache is like a memory for transformer models: it stores the key and value tensors from previous steps so the model doesn't have to "rethink" the whole sequence every time it generates a new token. It makes inference faster, just like how we remember what we said in a conversation instead of repeating it from the beginning. This makes KV Cache management a critical component of any inference framework.
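Here is a minimal sketch of that idea, assuming a single attention head and toy numpy tensors: instead of recomputing keys and values for the whole sequence at every step, only the new token's key and value are computed and appended to a cache, and attention runs over the cached tensors.

import numpy as np

d = 8                      # head dimension (toy size)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # the KV cache: one entry per generated token

def attend_next_token(x):
    """Process one new token embedding x, reusing cached keys/values."""
    q = x @ Wq
    k_cache.append(x @ Wk)          # compute K/V only for the NEW token
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)     # attention over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # context vector for the new token

for step in range(5):               # each step is O(seq_len), not O(seq_len^2)
    out = attend_next_token(np.random.randn(d))
print("cached positions:", len(k_cache))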
Diving into the latest inference framework in the market
NVIDIA boasts up to 30x inference performance (DeepSeek-R1 671B) through its latest inference framework, NVIDIA Dynamo. Let's take a closer look at what helps it deliver strong performance over its predecessor, the Triton Inference Server:
Performance published by NVIDIA is 30x for DeepSeek-R1 and 2.5x for Llama
I will take a closer look at what drives this performance improvement and update this page soon.
The curse of dimensionality is a problem that happens when you're working with data that has too many features or dimensions (like columns in a spreadsheet).
🧠 Think of it like this:
Imagine you want to find your friend's house in a small neighborhood with just 10 houses — pretty easy, right?
Now imagine your friend moves to a big city with a million houses — much harder to find them!
As the number of houses (or "dimensions") increases, it becomes harder to find patterns or make sense of the data.
📊 In terms of data:
When you have only a few features (like height and weight), finding patterns is straightforward.
But if you add too many features (like height, weight, age, eye color, zip code, favorite color, etc.), the data becomes very spread out and patterns are harder to spot.
The more dimensions you add, the harder it becomes for algorithms to work effectively because the data points are too far apart or scattered in a large "space."
🚀 Example:
If you try to find the closest pizza place using just 2 dimensions (latitude and longitude), it's easy.
If you add more dimensions like price, customer rating, delivery time, toppings variety, etc., finding the "best" pizza place becomes much more complicated because the "distance" between options increases in this higher-dimensional space.
👉 In short, more dimensions = harder to find patterns and data becomes sparse — that's the curse of dimensionality!
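A quick way to see this sparsity effect is to sample random points in increasing dimensions and watch the gap between the nearest and farthest neighbor shrink relative to the distances themselves. This is a toy simulation for intuition, not a formal proof.

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((500, dim))                       # 500 random points in a unit hypercube
    d = np.linalg.norm(points - points[0], axis=1)[1:]    # distances from the first point
    # Relative contrast: how different is the nearest neighbor from the farthest one?
    contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:5d}  nearest={d.min():.2f}  farthest={d.max():.2f}  contrast={contrast:.2f}")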
Resolving the Curse of Dimensionality
To handle the curse of dimensionality, you need to reduce the number of dimensions or make the data more manageable. Here are the most common techniques:
🔹 1. Feature Selection (Keep only the important features)
👉 Goal: Remove irrelevant or redundant features.
Analyze which features are actually contributing to the model’s performance and drop the others.
✅ Methods:
Filter methods – Use statistical tests (e.g., correlation) to remove unimportant features.
Wrapper methods – Try different combinations of features and keep the ones that improve performance.
Embedded methods – Use algorithms that select features automatically (e.g., Lasso Regression).
🔹 2. Feature Extraction (Create new, smaller sets of features)
👉 Goal: Combine existing features into fewer but more meaningful dimensions.
✅ Methods:
Principal Component Analysis (PCA) – Transforms the data into a smaller set of uncorrelated features.
Linear Discriminant Analysis (LDA) – Similar to PCA but maximizes class separability.
t-SNE – Useful for visualizing high-dimensional data in 2D or 3D.
Autoencoders – Neural networks that compress data into a smaller dimension.
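As a minimal sketch of feature extraction, here is PCA applied to scikit-learn's built-in digits dataset (chosen only for illustration): 64 pixel features are compressed into 10 components while retaining most of the variance.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 1797 samples x 64 pixel features
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("original shape:", X.shape)            # (1797, 64)
print("reduced shape: ", X_reduced.shape)    # (1797, 10)
print("variance kept: ", round(pca.explained_variance_ratio_.sum(), 2))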
🔹 3. Regularization (Force the model to ignore less important dimensions)
👉 Goal: Reduce the impact of unimportant features.
✅ Methods:
L1 regularization (Lasso) – Shrinks coefficients of less important features to zero.
Dropout (in neural networks) – Randomly ignores some features during training.
🔹 4. Manifold Learning (Discover the hidden structure in the data)
👉 Goal: Find a lower-dimensional structure that represents the data.
✅ Methods:
t-SNE – Maps high-dimensional data into 2D or 3D for visualization.
UMAP – Similar to t-SNE but faster and better at preserving global structure.
Isomap – Captures the geometric structure of data by preserving pairwise distances.
🔹 5. Clustering and Binning (Group similar data points)
👉 Goal: Reduce complexity by grouping similar data together.
✅ Methods:
K-Means – Reduces data to cluster centers.
Hierarchical Clustering – Builds a tree of clusters to simplify the data.
Quantization – Convert continuous data into discrete bins.
🔹 6. Dimensionality-Aware Models (Use models that handle high dimensions better)
👉 Goal: Use algorithms designed for high-dimensional data.
✅ Examples:
Tree-based models (e.g., Random Forest, XGBoost) – Handle high-dimensional data well.
Support Vector Machines (SVM) – Work well with high-dimensional data if properly tuned.
Deep Neural Networks – Can handle complex, high-dimensional patterns but require large amounts of data.
🚀 Best Approach = Combine Techniques
Start with feature selection to eliminate noise.
Use PCA or Autoencoders to compress data.
Apply a regularized model (like Lasso) or a tree-based model to improve performance.
👉 In most cases, a mix of feature selection + extraction + regularization gives the best results!
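A minimal sketch of that combination using scikit-learn, with an illustrative synthetic dataset and untuned parameters: filter-style feature selection followed by a Lasso model, chained in one pipeline (PCA, shown earlier, could be slotted in as an extra step).

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 200 features, only 15 of which actually carry signal
X, y = make_regression(n_samples=500, n_features=200, n_informative=15, noise=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=30)),   # feature selection: keep the 30 best features
    ("model", Lasso(alpha=0.5)),                   # regularization: shrink the rest toward zero
])
print("R^2 (5-fold CV):", cross_val_score(pipe, X, y, cv=5).mean().round(3))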
Okay! Imagine you have a robot friend that helps you with your homework and remembers things you tell it. The model context is like the robot’s memory during your conversation.
Here’s how it works:
Short-term memory: While you’re talking, the robot remembers what you said so it can keep track of the conversation. If you ask, "What's 5 + 3?" and then say, "Now add 2," it knows you're talking about the last answer (8) and gives you 10.
Long-term memory: If the robot has long-term memory (like writing things down in a notebook), it can remember things from past conversations. If you tell it your favorite color is blue, it might remember that next time you chat and say, "Hey, I found a cool blue picture for you!"
Rules of memory: The robot doesn’t remember everything. Some things it forgets on purpose (like secret information or things you might not want it to remember). Also, if you tell it to forget something, it will!
So, the model context protocol is just a fancy way of saying how the robot decides what to remember, what to forget, and how to use that memory to help you better. 😎
Let's Ask the Machine To Explain MCP
The model context protocol is essentially the set of rules and methods that a language model follows to manage and use context during a conversation. It controls how the model remembers things, how long it remembers them, and how it decides what to keep or forget. Let’s break it down step by step:
🔹 1. Context Window
The context window is like my short-term memory.
I can only remember a limited amount of conversation at once — usually a few thousand tokens (depending on the model).
If the conversation gets too long, older parts of the chat might "fall out" of my memory unless they are saved somewhere.
Example:
If you’ve been talking to me for a while and I seem to forget something from the beginning, it’s because that part of the chat got pushed out of my context window.
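A minimal sketch of the idea, assuming a hypothetical chat history stored as a list of strings and a crude word count standing in for a real tokenizer: once the budget is exceeded, the oldest turns are dropped, which is exactly how early details "fall out" of the window.

def trim_to_context_window(messages, max_words=50):
    """Keep the most recent messages that fit inside the word budget (stand-in for a tokenizer)."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk backwards from the newest message
        words = len(msg.split())
        if used + words > max_words:
            break                            # everything older than this is forgotten
        kept.append(msg)
        used += words
    return list(reversed(kept))

history = [f"turn {i}: " + "blah " * 10 for i in range(20)]
print(trim_to_context_window(history))      # only the last few turns survive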
🔹 2. State Management
My responses are based on the current state of the conversation.
I generate answers based on the context I've been given — both from the immediate chat and any saved memory.
If the context is clear and consistent, my responses will be more accurate.
If the context is vague or confusing, my responses might seem off.
Example:
If you ask, "What’s my favorite color?" but you never told me before, I can’t know unless I’ve saved that information.
🔹 3. Memory Types
There are two main types of memory I can use:
✅ Short-term memory (context window):
I use this for the current conversation.
Once the conversation ends, this memory is wiped unless you’ve told me to save it.
✅ Long-term memory (persistent memory):
If I’m allowed to remember things between conversations, I can save important facts (like your preferences or interests).
This lets me personalize my responses over time.
You can ask me to forget things anytime.
Example:
If you say, "My dog’s name is Max," I might remember that next time unless you tell me to forget it.
🔹 4. Saving and Forgetting Information
I follow specific rules about what I’m allowed to remember:
I won’t remember sensitive information (like passwords).
I won’t store information unless you give permission or it’s obviously useful for future conversations.
If you say, "Forget what I told you about my dog," I’ll erase it from memory.
🔹 5. Summarization and Compression
If the conversation gets long, I might "summarize" older parts to free up space in my context window.
This helps me keep track of the conversation without losing the main ideas.
Example:
If you tell me a long story about your trip to the zoo, I might compress it to "User visited the zoo and saw lions and tigers" instead of remembering every single detail.
🔹 6. Context Handling Across Sessions
If long-term memory is active, I can remember things from past sessions.
If not, I’ll treat each new conversation like a fresh start.
Example:
If you said, "I like pizza," and I have long-term memory, I might later suggest, "Hey, want to try a new pizza recipe?"
🔹 7. Privacy and Control
You control the memory:
You can ask me what I remember.
You can ask me to delete specific memories or clear everything.
I won’t share your personal information or use it for anything outside of our conversation.
🏆 Why It Matters
The context protocol helps me:
✅ Keep track of the conversation.
✅ Personalize responses over time.
✅ Stay within memory limits without getting confused.
✅ Respect your privacy and control over what I remember.
Cosine similarity and Euclidean distance play an important role in machine learning, helping machines systematically 'understand' data. Euclidean distance measures the straight-line distance between two points, while cosine similarity measures the angle between two vectors, focusing on their direction rather than magnitude. Euclidean distance is therefore sensitive to the magnitude of the vectors, while cosine similarity ignores magnitude and focuses on direction. This makes cosine similarity helpful for comparing documents, text data, or other situations where the direction of the vectors matters more than their length, and Euclidean distance useful for applications where the absolute distance between points is important, such as spatial or numerical data.
Let's take a quick look at both:
Cosine Similarity
In document vectors, attributes represent either the presence or absence of a word. It is possible to construct a more informational vector with the number of occurrences in the document, instead of just 1 and 0. Document datasets are usually long vectors with thousands of variables or attributes. For simplicity, consider the example of the vectors with X (1,2,0,0,3,4,0) and Y (5,0,0,6,7,0,0). The cosine similarity measure for two data points is given by:
cosine similarity(X, Y) = (X · Y) / (‖X‖ ‖Y‖)
where X · Y is the dot product of the two vectors. For this example,
X · Y = (1)(5) + (2)(0) + (0)(0) + (0)(6) + (3)(7) + (4)(0) + (0)(0) = 26
and
‖X‖ = √(1² + 2² + 0² + 0² + 3² + 4² + 0²) = √30 ≈ 5.48, ‖Y‖ = √(5² + 0² + 0² + 6² + 7² + 0² + 0²) = √110 ≈ 10.49,
giving cosine similarity(X, Y) ≈ 26 / (5.48 × 10.49) ≈ 0.45.
The cosine similarity measure is one of the most used similarity measures, but the determination of the optimal measure comes down to the data structures. The choice of distance or similarity measure can also be parameterized, where multiple models are created with each different measure. The model with a distance measure that best fits the data with the smallest generalization error can be the appropriate proximity measure for the data. Simple implementation using numpy:
import numpy as np

def cosine_similarity(a, b):
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    if a_norm == 0 or b_norm == 0:
        return 0
    return np.dot(a, b) / (a_norm * b_norm)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity}")
What is Euclidean distance?
It's the straight-line distance between two points in space — like measuring with a ruler.
Here's a simple explanation of Euclidean distance in mathematical terms:
In 2D (two dimensions)
If you have two points:
Point A: (x₁, y₁)
Point B: (x₂, y₂)
The Euclidean distance between them is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
This is based on the Pythagorean theorem — the legs of a right triangle are the differences in x and y, and the distance is the hypotenuse.
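For symmetry with the cosine similarity code above, here is the same pair of vectors measured with Euclidean distance using numpy:

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

point_a = np.array([1, 2, 3])
point_b = np.array([4, 5, 6])
print(f"Euclidean distance: {euclidean_distance(point_a, point_b):.3f}")   # ~5.196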
Context-augmented LLM applications use current data from managed, trusted sources to leverage the power of large language models while overcoming their intrinsic limitations, such as hallucinations and being trained on stale data.
It also resolves issues of proprietary data management and data privacy, enabling enterprises to develop unique products and solutions protecting proprietary assets and customer privacy.
The architecture that leverages the generative power of LLMs along with access to protected, accurate, and up-to-date data sources is often referred to as Retrieval Augmented Generation (RAG). It combines the advantages of retrieval-based and generative AI models. The retrieval component of the solution serves as a search engine, fetching the most relevant data vectors based on the query. The generative component then uses these secured, context-aware data vectors to craft a response. The collaboration of these two powerful technologies ensures that the AI solution provides responses enriched by real-world information rather than relying only on its training data, which is usually stale. For example, when you ask a RAG-based solution built over an LLM about recent events, it retrieves the latest information and incorporates it into its response, offering you accurate and timely insights.
Typical RAG Solution
Typical RAG systems perform the following three actions:
Retrieve: The original user query is used to retrieve relevant context from an external knowledge source (typically a vector database, such as Weaviate or Qdrant). The retrieval is performed using similarity search of the embedded user query in the vector database. To do this, the user query is embedded with an embedding model into the same vector space as the additional (proprietary) context in the vector database. Having the embedded user query in the same context-aware vector space allows a similarity search to be performed, and the closest data objects are retrieved from the vector database.
Augment: The user query and the retrieved additional context are fitted into a prompt template. A prompt template provides a reusable pattern that can be updated with specific information to generate useful prompts.
Generate: Finally, the retrieval-augmented prompt is passed to the LLM, which receives the enriched prompt and passes it through its layers of interconnected artificial neurons, a.k.a. the neural network. The embedding layer within the LLM translates the retrieval-augmented prompt into numerical vectors, capturing the semantic meaning. Most modern LLMs employ the transformer architecture, which uses a "self-attention" mechanism to analyze the relationships between words in the input sequence. The transformer relies on its encoder-decoder structure to predict the next output sequence based on the encoded input sequence. Several hidden layers within the neural network perform complex calculations on the input embeddings, progressively building a deeper understanding of the input. Finally, the feedforward layer generates the final output from the retrieval-augmented prompt processed through the self-attention mechanism, thus combining proprietary knowledge with the complex generative capability of LLMs.
Illustration of Representative Layers of LLM
Embeddings and Vector Database Implementation
Embeddings and vector databases are critical pieces of a RAG implementation. Embeddings are numerical representations of text (or other data) in a high-dimensional space, capturing semantic meaning. In RAG, embeddings help with:
✅ Semantic Search – Instead of relying on exact keywords, embeddings allow retrieval based on meaning, improving relevance.
✅ Efficient Document Retrieval – Queries and documents are converted into embeddings, making it easier to find similar content.
✅ Contextual Understanding – Embeddings capture relationships between words, improving retrieval accuracy.
A vector database stores and searches through embeddings efficiently. In RAG, vector databases help by:
✅ Fast Similarity Search – Uses algorithms like k-nearest neighbors (k-NN) or approximate nearest neighbors (ANN) to quickly find the most relevant documents.
✅ Scalability – Can handle millions or billions of embeddings, making it ideal for large-scale AI applications.
✅ Efficient Storage & Retrieval – Optimized for storing and searching high-dimensional embeddings, unlike traditional relational databases.
How They Work Together in RAG
1️⃣ Convert Text to Embeddings – User queries and documents are transformed into vector representations.
2️⃣ Store in a Vector Database – The document embeddings are indexed and stored in a vector database.
3️⃣ Retrieve Relevant Documents – The query embedding is compared to stored embeddings, fetching the most relevant documents.
4️⃣ Generate Response Using Retrieved Context – The retrieved documents are passed to an LLM, which generates an informed response.
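To make these four steps concrete, here is a minimal end-to-end sketch, assuming a hypothetical embed() function and an in-memory list standing in for the vector database; a real deployment would use a proper embedding model, a vector store such as Weaviate or Qdrant, and an actual LLM call in the final step.

import numpy as np

def embed(text):
    # Hypothetical embedding model: hash words into a small fixed-size vector.
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# 1. Convert documents to embeddings and 2. store them (toy in-memory "vector database")
documents = ["Dynamo is NVIDIA's inference framework.",
             "PCA reduces the dimensionality of data.",
             "RAG retrieves context before generating an answer."]
index = [(doc, embed(doc)) for doc in documents]

# 3. Retrieve the most relevant document for a query via cosine similarity
query = "How does retrieval augmented generation work?"
q = embed(query)
ranked = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
context = ranked[0][0]

# 4. Augment the prompt and hand it to an LLM (the generation call itself is omitted here)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)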
If all of this is a lot of information, let's take a quick look at a simple example of manipulating text and corresponding embeddings using LlamaIndex. The idea is not to dive into code, but to understand the variance in embedding length and its implications in RAG.
# get API key and create embeddings
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
)
print(len(embeddings))
3072 (Note the embedding length)
# get API key and create embeddings
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
)
embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is awesome."
)
print(len(embeddings))
1536 (Note the reduction in embedding length)
Using a vector database to store embeddings can significantly enhance the performance and scalability of applications that rely on high-dimensional data. Whether you're working on NLP, image processing, or recommendation systems, vector databases provide a robust solution for managing and querying embeddings efficiently. Vector databases are a crucial component in the pipeline of Retrieval Augmented Generation (RAG). They enable efficient storage, retrieval, and manipulation of high-dimensional vector representations, which are essential for various AI and machine learning applications.
Explainability of RAG Models
Retrieval-Augmented Generation (RAG) models are generally more explainable than standard Large Language Models (LLMs) because they explicitly retrieve external knowledge before generating responses. Here’s why:
Why RAG Models Are More Explainable
Source Attribution – Unlike pure LLMs that rely only on pre-trained knowledge, RAG models can point to specific documents or passages they retrieved before generating a response.
Better Transparency – Users and developers can inspect retrieved documents to see what influenced the AI’s answer.
Reduced Hallucination – Since RAG fetches real-time information, it is less likely to make things up compared to traditional LLMs.
Interpretable Retrieval Steps – The retrieval component can be analyzed separately to understand why certain documents were chosen.
Easier Debugging – If an answer is incorrect, you can check whether the issue lies in the retrieval step (wrong documents) or the generation step (misinterpretation of facts).
Overall, RAG improves explainability compared to pure LLMs, but additional methods like attention visualization, counterfactual analysis, or human feedback loops can enhance it further.
Several RAG techniques continue to evolve that help AI solutions provide relevant and context-aware responses. Here are a few of the most common RAG techniques, which have proven to be highly effective in several use cases.
Sentence-window retrieval: The Sentence-window Retrieval technique is based on the principle of optimizing both retrieval and generation processes by tailoring the text chunk size to the specific needs of each stage. For retrieval, this technique emphasizes single sentences to take advantage of small data chunks for potentially better retrieval capabilities. On the generation side, it adds more sentences around the initial one to offer the LLM extended context, aiming for richer, more detailed outputs. This decoupling is supposed to increase the performance of both retrieval and generation, ultimately leading to better performance of the whole RAG system.
Document summary index: The Document Summary Index method enhances RAG systems by indexing document summaries for efficient retrieval, while providing LLMs with full text documents for response generation. This decoupling strategy optimizes retrieval speed and accuracy through summary-based indexing and supports comprehensive response synthesis by utilizing the original text.
HyDE: The Hypothetical Document Embedding technique enhances document retrieval by leveraging LLMs to generate a hypothetical answer to a query. The HyDE technique leverages the ability of LLMs to produce context-rich answers, which, once embedded, serve as a powerful tool to refine and focus document retrieval efforts.
Multi-Query: The Multi-query technique enriches document retrieval by expanding a single user query into multiple similar queries using an LLM. This process involves generating N alternative questions that echo the intent of the original query but from different perspectives, thereby capturing a broader spectrum of potential answers. Each query, including the original, is then vectorized and subjected to its own retrieval process, which increases the chances of fetching a higher volume of relevant information from the document repository. To manage the resultant expanded dataset, a re-ranker is often employed, utilizing machine learning models to sift through the retrieved chunks and prioritize those most relevant to the initial query. The MultiQueryRetriever automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
Maximum Marginal Relevance: The Maximal Marginal Relevance (MMR) technique refines the retrieval process by balancing relevance and diversity in the documents retrieved. By using MMR, the retrieval system evaluates potential documents not only for their closeness to the query’s intent but also for their uniqueness compared to documents already selected. This approach mitigates the issue of redundancy, ensuring that the set of retrieved documents covers a broader range of information (a minimal code sketch of MMR follows this list).
Cohere Rerank: Rerankers aim to enhance the RAG process by refining the selection of documents retrieved in response to a query, with the goal of prioritizing the most relevant and contextually appropriate information for generating responses. This step employs ML algorithms (such as a cross-encoder) to reassess the initially retrieved set, using criteria that extend beyond cosine similarity. Through this evaluation, rerankers are expected to improve the input for generative models, potentially leading to more accurate and contextually rich outputs. One tool in this domain is Cohere rerank, which uses a cross-encoder architecture to assess the relevance of documents to the query. This approach differs from methods that process queries and documents separately, as cross-encoders analyze them jointly, which could allow for a more comprehensive understanding of their mutual relevance.
LLM rerank: Following the introduction of cross-encoder based rerankers such as Cohere rerank, the LLM reranker offers an alternative strategy by directly applying LLMs to the task of reranking retrieved documents. This method prioritizes the comprehensive analytical abilities of LLMs over the joint query-document analysis typical of cross-encoders. Although less efficient in terms of processing speed and cost compared to cross-encoder models, LLM rerankers can achieve higher accuracy by leveraging the advanced understanding of language and context inherent in LLMs. This makes the LLM reranker suitable for applications where the quality of the reranked results is more critical than computational efficiency.
Knowledge Graph RAG: Integrating Knowledge Graphs (KGs) with RAG systems represents a promising direction for enhancing retrieval precision and contextual relevance. KGs offer a well-organized framework of relationship-rich data that could refine the retrieval phase of RAG systems. Although setting up such systems is resource-demanding, the potential for significantly improved retrieval processes justifies ongoing research and development.
Auto-RAG: The idea of automatically optimizing RAG systems, just like Auto-ML’s approach in traditional machine learning, is promising. Currently, selecting the optimal configuration of RAG components (e.g., chunking strategies, window sizes, and parameters within rerankers) relies on manual experimentation and the intuition of data scientists. An automated system could systematically evaluate RAG configurations and select the best-performing one.
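As referenced in the Maximum Marginal Relevance item above, here is a minimal sketch of MMR selection over pre-computed, normalized embeddings; the lambda weight and random toy vectors are illustrative assumptions only.

import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    """Pick k documents balancing relevance to the query (lam) against redundancy (1 - lam)."""
    selected, remaining = [], list(range(len(doc_vecs)))
    sim_to_query = doc_vecs @ query_vec
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
docs = rng.random((10, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalize so dot product = cosine similarity
query = rng.random(8)
query /= np.linalg.norm(query)
print("MMR picks documents:", mmr_select(query, docs))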
Continuous Evolution...
All of the above, sourced from research papers and publications, continues to evolve through ongoing research and development. The field of artificial intelligence, neural network-based solutions, and the application development layered on top have matured into enterprise-grade solutions since the arrival of large language models from vendors competing for dominance. Since I wrote our first AI solution in 2018 as part of team 'Prognosis Pundits', winning the hackathon, the world and this domain have progressively evolved, opening opportunities that were unimaginable a decade ago. Techniques like federated model training, differential privacy, detecting and correcting biases in models, evaluating model fairness, and measuring the diversity of training data are continuously evolving and being perfected in competing labs, poised to change business landscapes forever.