Innovation is a process, not an event. Sharing should be fun so that it accelerates growth for all. These posts aim to share helpful knowledge and insights from working on various technologies, products, and domains.
Quote
"Between stimulus and response there is a space. In that space is our power to choose our response. In our response lies our growth and freedom"
“The only way to discover the limits of the possible is to go beyond them into the impossible.”
Modern LLMs have expanded in parameter size, incorporated reasoning capabilities, and are increasingly embedded in agentic AI workflows. As a result, they generate a far greater number of tokens during inference and require deployment in distributed environments, driving up costs. Therefore, optimizing inference-serving strategies to lower costs and support seamless scaling in distributed environments is crucial.
What Is an AI Inference Engine?
An AI inference engine is the component of an AI system that uses a trained model to make predictions or decisions based on new input data.
Training phase = learning (like studying for a test)
Inference phase = using what was learned to answer questions (like taking the test)
The inference engine does not learn or update the model — it simply runs the model with new data and gives you the output.
How It Works (Step-by-Step)
Let’s say you trained a model to recognize cats and dogs in photos.
1. Input
You give the model a new image (e.g., a photo of a dog).
2. Preprocessing
The image is resized or normalized so it fits the model’s input format.
3. Model Execution
The inference engine runs the trained model using this input and computes the output. For example, it might output:
{"cat": 0.12, "dog": 0.88}
It means there's an 88% chance this is a dog.
4. Postprocessing
The result is cleaned up or interpreted, so it gives a clear output like:
"This image is most likely a dog."
Where It’s Used
Inference happens any time you use an AI model in the real world, such as:
Chatbots generating answers
Recommendation engines suggesting videos
Self-driving cars making real-time decisions
Medical AI diagnosing X-rays
Translation apps turning speech into another language
What Makes Up an Inference Engine?
An inference engine includes:
Model Loader – Loads the trained model into memory
Input Preprocessor – Prepares input (e.g., text, image) to match the model format
Inference Core – Executes the model's layers and math to get the result
Output Postprocessor – Translates raw output into meaningful results
Accelerators (optional) – Uses GPUs/TPUs to speed things up
Optimized Inference
AI models, especially large ones like GPT or BERT, can be slow or expensive to run. Inference engines are optimized to make them faster and more efficient by:
Quantization (using lower precision like int8 instead of float32)
Pruning (removing parts of the model not needed)
Graph optimization (reordering computations for speed)
Hardware acceleration (using GPUs, TPUs, FPGAs)
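As a rough illustration of the first technique, here is a minimal sketch of symmetric int8 quantization in numpy; real engines such as TensorRT or ONNX Runtime typically quantize per channel using calibration data, so treat this as a toy example of the idea.

import numpy as np

def quantize_int8(weights):
    # Symmetric quantization: map float32 weights to int8 with a single scale factor.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("memory per value: 4 bytes ->", q.itemsize, "byte")
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())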
Some well-known inference engines and frameworks:
TensorRT (NVIDIA)
NVIDIA Dynamo (2025)
ONNX Runtime
OpenVINO (Intel)
TFLite (TensorFlow Lite)
DeepSparse (Neural Magic)
🎯 Why It's Critical for the AI Ecosystem
Inference is the actual deployment of AI. While training can take days or weeks, inference needs to happen in milliseconds — especially in real-time applications like:
Fraud detection
Voice assistants
Real-time translation
Stock prediction
Autonomous vehicles
The goal of an inference engine is to:
Be fast
Use less memory
Run on edge devices, not just in the cloud
Be scalable and low-latency
An AI inference engine is like the brain of an AI application at runtime — it takes a trained model and new data, processes it, and gives fast, reliable answers.
Importance of KV Cache Management, i.e., Model Memory Management
The KV Cache is like a memory for transformer models: it stores the key and value tensors from previous steps so the model doesn't have to "rethink" the whole sequence every time it generates a new token. It makes inference faster, just like how we remember what we said in a conversation instead of repeating it from the beginning. This makes KV Cache management a critical component of any inference framework.
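Here is a minimal sketch of that idea, assuming a single attention head and toy numpy tensors: instead of recomputing keys and values for the whole sequence at every step, only the new token's key and value are computed and appended to a cache, and attention runs over the cached tensors.

import numpy as np

d = 8                      # head dimension (toy size)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # the KV cache: one entry per generated token

def attend_next_token(x):
    """Process one new token embedding x, reusing cached keys/values."""
    q = x @ Wq
    k_cache.append(x @ Wk)          # compute K/V only for the NEW token
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)     # attention over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # context vector for the new token

for step in range(5):               # each step is O(seq_len), not O(seq_len^2)
    out = attend_next_token(np.random.randn(d))
print("cached positions:", len(k_cache))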
Diving into the latest inference framework in the market
NVIDIA boasts up to 30x inference performance (DeepSeek-R1 671B) through its latest inference framework, NVIDIA Dynamo. Let's take a closer look at what helps it deliver strong performance over its predecessor, the Triton Inference Server:
Performance published by NVIDIA is 30x for DeepSeek-R1 and 2.5x for Llama
I will take a closer look at what drives this performance improvement and update this page soon.
The curse of dimensionality is a problem that happens when you're working with data that has too many features or dimensions (like columns in a spreadsheet).
🧠 Think of it like this:
Imagine you want to find your friend's house in a small neighborhood with just 10 houses — pretty easy, right?
Now imagine your friend moves to a big city with a million houses — much harder to find them!
As the number of houses (or "dimensions") increases, it becomes harder to find patterns or make sense of the data.
📊 In terms of data:
When you have only a few features (like height and weight), finding patterns is straightforward.
But if you add too many features (like height, weight, age, eye color, zip code, favorite color, etc.), the data becomes very spread out and patterns are harder to spot.
The more dimensions you add, the harder it becomes for algorithms to work effectively because the data points are too far apart or scattered in a large "space."
🚀 Example:
If you try to find the closest pizza place using just 2 dimensions (latitude and longitude), it's easy.
If you add more dimensions like price, customer rating, delivery time, toppings variety, etc., finding the "best" pizza place becomes much more complicated because the "distance" between options increases in this higher-dimensional space.
👉 In short, more dimensions = harder to find patterns and data becomes sparse — that's the curse of dimensionality!
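A quick way to see this sparsity effect is to sample random points in increasing dimensions and watch the gap between the nearest and farthest neighbor shrink relative to the distances themselves. This is a toy simulation for intuition, not a formal proof.

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((500, dim))                       # 500 random points in a unit hypercube
    d = np.linalg.norm(points - points[0], axis=1)[1:]    # distances from the first point
    # Relative contrast: how different is the nearest neighbor from the farthest one?
    contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:5d}  nearest={d.min():.2f}  farthest={d.max():.2f}  contrast={contrast:.2f}")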
Resolving the Curse of Dimensionality
To handle the curse of dimensionality, you need to reduce the number of dimensions or make the data more manageable. Here are the most common techniques:
🔹 1. Feature Selection (Keep only the important features)
👉 Goal: Remove irrelevant or redundant features.
Analyze which features are actually contributing to the model’s performance and drop the others.
✅ Methods:
Filter methods – Use statistical tests (e.g., correlation) to remove unimportant features.
Wrapper methods – Try different combinations of features and keep the ones that improve performance.
Embedded methods – Use algorithms that select features automatically (e.g., Lasso Regression).
🔹 2. Feature Extraction (Create new, smaller sets of features)
👉 Goal: Combine existing features into fewer but more meaningful dimensions.
✅ Methods:
Principal Component Analysis (PCA) – Transforms the data into a smaller set of uncorrelated features.
Linear Discriminant Analysis (LDA) – Similar to PCA but maximizes class separability.
t-SNE – Useful for visualizing high-dimensional data in 2D or 3D.
Autoencoders – Neural networks that compress data into a smaller dimension.
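As a minimal sketch of feature extraction, here is PCA applied to scikit-learn's built-in digits dataset (chosen only for illustration): 64 pixel features are compressed into 10 components while retaining most of the variance.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 1797 samples x 64 pixel features
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("original shape:", X.shape)            # (1797, 64)
print("reduced shape: ", X_reduced.shape)    # (1797, 10)
print("variance kept: ", round(pca.explained_variance_ratio_.sum(), 2))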
🔹 3. Regularization (Force the model to ignore less important dimensions)
👉 Goal: Reduce the impact of unimportant features.
✅ Methods:
L1 regularization (Lasso) – Shrinks coefficients of less important features to zero.
Dropout (in neural networks) – Randomly ignores some features during training.
🔹 4. Manifold Learning (Discover the hidden structure in the data)
👉 Goal: Find a lower-dimensional structure that represents the data.
✅ Methods:
t-SNE – Maps high-dimensional data into 2D or 3D for visualization.
UMAP – Similar to t-SNE but faster and better at preserving global structure.
Isomap – Captures the geometric structure of data by preserving pairwise distances.
🔹 5. Clustering and Binning (Group similar data points)
👉 Goal: Reduce complexity by grouping similar data together.
✅ Methods:
K-Means – Reduces data to cluster centers.
Hierarchical Clustering – Builds a tree of clusters to simplify the data.
Quantization – Convert continuous data into discrete bins.
🔹 6. Dimensionality-Aware Models (Use models that handle high dimensions better)
👉 Goal: Use algorithms designed for high-dimensional data.
✅ Examples:
Tree-based models (e.g., Random Forest, XGBoost) – Handle high-dimensional data well.
Support Vector Machines (SVM) – Work well with high-dimensional data if properly tuned.
Deep Neural Networks – Can handle complex, high-dimensional patterns but require large amounts of data.
🚀 Best Approach = Combine Techniques
Start with feature selection to eliminate noise.
Use PCA or Autoencoders to compress data.
Apply a regularized model (like Lasso) or a tree-based model to improve performance.
👉 In most cases, a mix of feature selection + extraction + regularization gives the best results!
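A minimal sketch of that combination using scikit-learn, with an illustrative synthetic dataset and untuned parameters: filter-style feature selection followed by a Lasso model, chained in one pipeline (PCA, shown earlier, could be slotted in as an extra step).

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 200 features, only 15 of which actually carry signal
X, y = make_regression(n_samples=500, n_features=200, n_informative=15, noise=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=30)),   # feature selection: keep the 30 best features
    ("model", Lasso(alpha=0.5)),                   # regularization: shrink the rest toward zero
])
print("R^2 (5-fold CV):", cross_val_score(pipe, X, y, cv=5).mean().round(3))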
Okay! Imagine you have a robot friend that helps you with your homework and remembers things you tell it. The model context is like the robot’s memory during your conversation.
Here’s how it works:
Short-term memory: While you’re talking, the robot remembers what you said so it can keep track of the conversation. If you ask, "What's 5 + 3?" and then say, "Now add 2," it knows you're talking about the last answer (8) and gives you 10.
Long-term memory: If the robot has long-term memory (like writing things down in a notebook), it can remember things from past conversations. If you tell it your favorite color is blue, it might remember that next time you chat and say, "Hey, I found a cool blue picture for you!"
Rules of memory: The robot doesn’t remember everything. Some things it forgets on purpose (like secret information or things you might not want it to remember). Also, if you tell it to forget something, it will!
So, the model context protocol is just a fancy way of saying how the robot decides what to remember, what to forget, and how to use that memory to help you better. 😎
Let's Ask the Machine To Explain MCP
The model context protocol is essentially the set of rules and methods that a language model follows to manage and use context during a conversation. It controls how the model remembers things, how long it remembers them, and how it decides what to keep or forget. Let’s break it down step by step:
🔹 1. Context Window
The context window is like my short-term memory.
I can only remember a limited amount of conversation at once — usually a few thousand tokens (depending on the model).
If the conversation gets too long, older parts of the chat might "fall out" of my memory unless they are saved somewhere.
Example:
If you’ve been talking to me for a while and I seem to forget something from the beginning, it’s because that part of the chat got pushed out of my context window.
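A minimal sketch of the idea, assuming a hypothetical chat history stored as a list of strings and a crude word count standing in for a real tokenizer: once the budget is exceeded, the oldest turns are dropped, which is exactly how early details "fall out" of the window.

def trim_to_context_window(messages, max_words=50):
    """Keep the most recent messages that fit inside the word budget (stand-in for a tokenizer)."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk backwards from the newest message
        words = len(msg.split())
        if used + words > max_words:
            break                            # everything older than this is forgotten
        kept.append(msg)
        used += words
    return list(reversed(kept))

history = [f"turn {i}: " + "blah " * 10 for i in range(20)]
print(trim_to_context_window(history))      # only the last few turns survive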
🔹 2. State Management
My responses are based on the current state of the conversation.
I generate answers based on the context I've been given — both from the immediate chat and any saved memory.
If the context is clear and consistent, my responses will be more accurate.
If the context is vague or confusing, my responses might seem off.
Example:
If you ask, "What’s my favorite color?" but you never told me before, I can’t know unless I’ve saved that information.
🔹 3. Memory Types
There are two main types of memory I can use:
✅ Short-term memory (context window):
I use this for the current conversation.
Once the conversation ends, this memory is wiped unless you’ve told me to save it.
✅ Long-term memory (persistent memory):
If I’m allowed to remember things between conversations, I can save important facts (like your preferences or interests).
This lets me personalize my responses over time.
You can ask me to forget things anytime.
Example:
If you say, "My dog’s name is Max," I might remember that next time unless you tell me to forget it.
🔹 4. Saving and Forgetting Information
I follow specific rules about what I’m allowed to remember:
I won’t remember sensitive information (like passwords).
I won’t store information unless you give permission or it’s obviously useful for future conversations.
If you say, "Forget what I told you about my dog," I’ll erase it from memory.
🔹 5. Summarization and Compression
If the conversation gets long, I might "summarize" older parts to free up space in my context window.
This helps me keep track of the conversation without losing the main ideas.
Example:
If you tell me a long story about your trip to the zoo, I might compress it to "User visited the zoo and saw lions and tigers" instead of remembering every single detail.
🔹 6. Context Handling Across Sessions
If long-term memory is active, I can remember things from past sessions.
If not, I’ll treat each new conversation like a fresh start.
Example:
If you said, "I like pizza," and I have long-term memory, I might later suggest, "Hey, want to try a new pizza recipe?"
🔹 7. Privacy and Control
You control the memory:
You can ask me what I remember.
You can ask me to delete specific memories or clear everything.
I won’t share your personal information or use it for anything outside of our conversation.
🏆 Why It Matters
The context protocol helps me:
✅ Keep track of the conversation.
✅ Personalize responses over time.
✅ Stay within memory limits without getting confused.
✅ Respect your privacy and control over what I remember.
Cosine similarity and Euclidean distance play an important role in machine learning, helping machines systematically 'understand' data. Euclidean distance measures the straight-line distance between two points, while cosine similarity measures the angle between two vectors, focusing on their direction rather than magnitude. Euclidean distance is therefore sensitive to the magnitude of the vectors, while cosine similarity ignores magnitude and focuses on direction. This makes cosine similarity helpful for comparing documents, text data, or other situations where the direction of the vectors matters more than their length, and Euclidean distance useful for applications where the absolute distance between points is important, such as spatial or numerical data.
Let's take a quick look at both:
Cosine Similarity
In document vectors, attributes represent either the presence or absence of a word. It is possible to construct a more informational vector with the number of occurrences in the document, instead of just 1 and 0. Document datasets are usually long vectors with thousands of variables or attributes. For simplicity, consider the example of the vectors with X (1,2,0,0,3,4,0) and Y (5,0,0,6,7,0,0). The cosine similarity measure for two data points is given by:
cosine similarity(X, Y) = (X · Y) / (‖X‖ ‖Y‖)
where X · Y is the dot product of the two vectors. For this example,
X · Y = (1)(5) + (2)(0) + (0)(0) + (0)(6) + (3)(7) + (4)(0) + (0)(0) = 26
and
‖X‖ = √(1² + 2² + 0² + 0² + 3² + 4² + 0²) = √30 ≈ 5.48, ‖Y‖ = √(5² + 0² + 0² + 6² + 7² + 0² + 0²) = √110 ≈ 10.49,
giving cosine similarity(X, Y) ≈ 26 / (5.48 × 10.49) ≈ 0.45.
The cosine similarity measure is one of the most used similarity measures, but the determination of the optimal measure comes down to the data structures. The choice of distance or similarity measure can also be parameterized, where multiple models are created with each different measure. The model with a distance measure that best fits the data with the smallest generalization error can be the appropriate proximity measure for the data. Simple implementation using numpy:
import numpy as np

def cosine_similarity(a, b):
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    if a_norm == 0 or b_norm == 0:
        return 0
    return np.dot(a, b) / (a_norm * b_norm)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity}")
What is Euclidean distance?
It's the straight-line distance between two points in space — like measuring with a ruler.
Here's a simple explanation of Euclidean distance in mathematical terms:
In 2D (two dimensions)
If you have two points:
Point A: (x₁, y₁)
Point B: (x₂, y₂)
The Euclidean distance between them is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
This is based on the Pythagorean theorem — the legs of a right triangle are the differences in x and y, and the distance is the hypotenuse.
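For symmetry with the cosine similarity code above, here is the same pair of vectors measured with Euclidean distance using numpy:

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

point_a = np.array([1, 2, 3])
point_b = np.array([4, 5, 6])
print(f"Euclidean distance: {euclidean_distance(point_a, point_b):.3f}")   # ~5.196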
Context-augmented LLM applications use current data from managed, trusted sources to leverage the power of large language models while overcoming their intrinsic limitations, such as hallucinations and being trained on stale data.
It also resolves issues of proprietary data management and data privacy, enabling enterprises to develop unique products and solutions protecting proprietary assets and customer privacy.
The architecture that leverages the generative power of LLMs along with access to protected, accurate, and up-to-date data sources is often referred to as Retrieval Augmented Generation (RAG). It combines the advantages of retrieval-based and generative AI models. The retrieval component of the solution serves as a search engine, fetching the most relevant data vectors based on the query. The generative component then uses these secured, context-aware data vectors to craft a response. The collaboration of these two powerful technologies ensures that the AI solution provides responses enriched by real-world information rather than relying only on its training data, which is usually stale. For example, when you ask a RAG-based solution built over an LLM about recent events, it retrieves the latest information and incorporates it into its response, offering you accurate and timely insights.
Typical RAG Solution
Typical RAG systems perform the following three actions:
Retrieve: The original user query is used to retrieve relevant context from an external knowledge source (typically a vector database, such as Weaviate or Qdrant). The retrieval is performed using similarity search of the embedded user query in the vector database. To do this, the user query is embedded with an embedding model into the same vector space as the additional (proprietary) context in the vector database. Having the embedded user query in the same context-aware vector space allows a similarity search to be performed, and the closest data objects are retrieved from the vector database.
Augment: The user query and the retrieved additional context are fitted into a prompt template. A prompt template provides a reusable pattern that can be updated with specific information to generate useful prompts.
Generate: Finally, the retrieval-augmented prompt is passed to the LLM, which receives the enriched prompt and passes it through its layers of interconnected artificial neurons, a.k.a. the neural network. The embedding layer within the LLM translates the retrieval-augmented prompt into numerical vectors, capturing the semantic meaning. Most modern LLMs employ the transformer architecture, which uses a "self-attention" mechanism to analyze the relationships between words in the input sequence. The transformer relies on its encoder-decoder structure to predict the next output sequence based on the encoded input sequence. Several hidden layers within the neural network perform complex calculations on the input embeddings, progressively building a deeper understanding of the input. Finally, the feedforward layer generates the final output from the retrieval-augmented prompt processed through the self-attention mechanism, thus combining proprietary knowledge with the complex generative capability of LLMs.
Illustration of Representative Layers of LLM
Embeddings and Vector Database Implementation
Embeddings and vector databases are critical pieces of a RAG implementation. Embeddings are numerical representations of text (or other data) in a high-dimensional space, capturing semantic meaning. In RAG, embeddings help with:
✅ Semantic Search – Instead of relying on exact keywords, embeddings allow retrieval based on meaning, improving relevance.
✅ Efficient Document Retrieval – Queries and documents are converted into embeddings, making it easier to find similar content.
✅ Contextual Understanding – Embeddings capture relationships between words, improving retrieval accuracy.
A vector database stores and searches through embeddings efficiently. In RAG, vector databases help by:
✅ Fast Similarity Search – Uses algorithms like k-nearest neighbors (k-NN) or approximate nearest neighbors (ANN) to quickly find the most relevant documents.
✅ Scalability – Can handle millions or billions of embeddings, making it ideal for large-scale AI applications.
✅ Efficient Storage & Retrieval – Optimized for storing and searching high-dimensional embeddings, unlike traditional relational databases.
How They Work Together in RAG
1️⃣ Convert Text to Embeddings – User queries and documents are transformed into vector representations.
2️⃣ Store in a Vector Database – The document embeddings are indexed and stored in a vector database.
3️⃣ Retrieve Relevant Documents – The query embedding is compared to stored embeddings, fetching the most relevant documents.
4️⃣ Generate Response Using Retrieved Context – The retrieved documents are passed to an LLM, which generates an informed response.
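To make these four steps concrete, here is a minimal end-to-end sketch, assuming a hypothetical embed() function and an in-memory list standing in for the vector database; a real deployment would use a proper embedding model, a vector store such as Weaviate or Qdrant, and an actual LLM call in the final step.

import numpy as np

def embed(text):
    # Hypothetical embedding model: hash words into a small fixed-size vector.
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# 1. Convert documents to embeddings and 2. store them (toy in-memory "vector database")
documents = ["Dynamo is NVIDIA's inference framework.",
             "PCA reduces the dimensionality of data.",
             "RAG retrieves context before generating an answer."]
index = [(doc, embed(doc)) for doc in documents]

# 3. Retrieve the most relevant document for a query via cosine similarity
query = "How does retrieval augmented generation work?"
q = embed(query)
ranked = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
context = ranked[0][0]

# 4. Augment the prompt and hand it to an LLM (the generation call itself is omitted here)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)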
If all of this is a lot of information, let's take a quick look at a simple example of manipulating text and corresponding embeddings using LlamaIndex. The idea is not to dive into code, but to understand the variance in embedding length and its implications in RAG.
# get API key and create embeddings
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
)
print(len(embeddings))
3072 (Note the embedding length)
# get API key and create embeddings
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
)
embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is awesome."
)
print(len(embeddings))
1536 (Note the reduction in embedding length)
Using a vector database to store embeddings can significantly enhance the performance and scalability of applications that rely on high-dimensional data. Whether you're working on NLP, image processing, or recommendation systems, vector databases provide a robust solution for managing and querying embeddings efficiently. Vector databases are a crucial component in the pipeline of Retrieval Augmented Generation (RAG). They enable efficient storage, retrieval, and manipulation of high-dimensional vector representations, which are essential for various AI and machine learning applications.
Explainability of RAG Models
Retrieval-Augmented Generation (RAG) models are generally more explainable than standard Large Language Models (LLMs) because they explicitly retrieve external knowledge before generating responses. Here’s why:
Why RAG Models Are More Explainable
Source Attribution – Unlike pure LLMs that rely only on pre-trained knowledge, RAG models can point to specific documents or passages they retrieved before generating a response.
Better Transparency – Users and developers can inspect retrieved documents to see what influenced the AI’s answer.
Reduced Hallucination – Since RAG fetches real-time information, it is less likely to make things up compared to traditional LLMs.
Interpretable Retrieval Steps – The retrieval component can be analyzed separately to understand why certain documents were chosen.
Easier Debugging – If an answer is incorrect, you can check whether the issue lies in the retrieval step (wrong documents) or the generation step (misinterpretation of facts).
Overall, RAG improves explainability compared to pure LLMs, but additional methods like attention visualization, counterfactual analysis, or human feedback loops can enhance it further.
Several RAG techniques continue to evolve that help AI solutions provide relevant and context-aware responses. Here are a few of the most common RAG techniques, which have proven to be highly effective in several use cases.
Sentence-window retrieval: The Sentence-window Retrieval technique is based on the principle of optimizing both retrieval and generation processes by tailoring the text chunk size to the specific needs of each stage. For retrieval, this technique emphasizes single sentences to take advantage of small data chunks for potentially better retrieval capabilities. On the generation side, it adds more sentences around the initial one to offer the LLM extended context, aiming for richer, more detailed outputs. This decoupling is supposed to increase the performance of both retrieval and generation, ultimately leading to better performance of the whole RAG system.
Document summary index: The Document Summary Index method enhances RAG systems by indexing document summaries for efficient retrieval, while providing LLMs with full text documents for response generation. This decoupling strategy optimizes retrieval speed and accuracy through summary-based indexing and supports comprehensive response synthesis by utilizing the original text.
HyDE: The Hypothetical Document Embedding technique enhances document retrieval by leveraging LLMs to generate a hypothetical answer to a query. The HyDE technique leverages the ability of LLMs to produce context-rich answers, which, once embedded, serve as a powerful tool to refine and focus document retrieval efforts.
Multi-Query: The Multi-query technique enriches document retrieval by expanding a single user query into multiple similar queries using an LLM. This process involves generating N alternative questions that echo the intent of the original query but from different perspectives, thereby capturing a broader spectrum of potential answers. Each query, including the original, is then vectorized and subjected to its own retrieval process, which increases the chances of fetching a higher volume of relevant information from the document repository. To manage the resultant expanded dataset, a re-ranker is often employed, utilizing machine learning models to sift through the retrieved chunks and prioritize those most relevant to the initial query. The MultiQueryRetriever automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
Maximum Marginal Relevance: The Maximal Marginal Relevance (MMR) technique refines the retrieval process by balancing relevance and diversity in the documents retrieved. By using MMR, the retrieval system evaluates potential documents not only for their closeness to the query’s intent but also for their uniqueness compared to documents already selected. This approach mitigates the issue of redundancy, ensuring that the set of retrieved documents covers a broader range of information (a minimal code sketch of MMR follows this list).
Cohere Rerank: Rerankers aim to enhance the RAG process by refining the selection of documents retrieved in response to a query, with the goal of prioritizing the most relevant and contextually appropriate information for generating responses. This step employs ML algorithms (such as a cross-encoder) to reassess the initially retrieved set, using criteria that extend beyond cosine similarity. Through this evaluation, rerankers are expected to improve the input for generative models, potentially leading to more accurate and contextually rich outputs. One tool in this domain is Cohere rerank, which uses a cross-encoder architecture to assess the relevance of documents to the query. This approach differs from methods that process queries and documents separately, as cross-encoders analyze them jointly, which could allow for a more comprehensive understanding of their mutual relevance.
LLM rerank: Following the introduction of cross-encoder based rerankers such as Cohere rerank, the LLM reranker offers an alternative strategy by directly applying LLMs to the task of reranking retrieved documents. This method prioritizes the comprehensive analytical abilities of LLMs over the joint query-document analysis typical of cross-encoders. Although less efficient in terms of processing speed and cost compared to cross-encoder models, LLM rerankers can achieve higher accuracy by leveraging the advanced understanding of language and context inherent in LLMs. This makes the LLM reranker suitable for applications where the quality of the reranked results is more critical than computational efficiency.
Knowledge Graph RAG: Integrating Knowledge Graphs (KGs) with RAG systems represents a promising direction for enhancing retrieval precision and contextual relevance. KGs offer a well-organized framework of relationship-rich data that could refine the retrieval phase of RAG systems. Although setting up such systems is resource-demanding, the potential for significantly improved retrieval processes justifies ongoing research and development.
Auto-RAG: The idea of automatically optimizing RAG systems, just like Auto-ML’s approach in traditional machine learning, is promising. Currently, selecting the optimal configuration of RAG components (e.g., chunking strategies, window sizes, and parameters within rerankers) relies on manual experimentation and the intuition of data scientists. An automated system could systematically evaluate RAG configurations and select the best-performing one.
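As referenced in the Maximum Marginal Relevance item above, here is a minimal sketch of MMR selection over pre-computed, normalized embeddings; the lambda weight and random toy vectors are illustrative assumptions only.

import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    """Pick k documents balancing relevance to the query (lam) against redundancy (1 - lam)."""
    selected, remaining = [], list(range(len(doc_vecs)))
    sim_to_query = doc_vecs @ query_vec
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
docs = rng.random((10, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalize so dot product = cosine similarity
query = rng.random(8)
query /= np.linalg.norm(query)
print("MMR picks documents:", mmr_select(query, docs))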
Continuous Evolution...
All of the above, sourced from research papers and publications, continues to evolve through ongoing research and development. The field of artificial intelligence, neural network-based solutions, and the application development layered on top have matured into enterprise-grade solutions since the arrival of large language models from vendors competing for dominance. Since I wrote our first AI solution in 2018 as part of team 'Prognosis Pundits', winning the hackathon, the world and this domain have progressively evolved, opening opportunities that were unimaginable a decade ago. Techniques like federated model training, differential privacy, detecting and correcting biases in models, evaluating model fairness, and measuring the diversity of training data are continuously evolving and being perfected in competing labs, poised to change business landscapes forever.