This year has seen LLM innovation at a breakneck pace. Generative AI is now a boardroom topic and teams have been chartered to leverage it as a competitive advantage. Enterprises are actively exploring use cases and deploying their first GenAI applications into production.
One key factor influencing LLM performance in production is prompt drift — when the nature of user inputs changes significantly from what the model was originally trained or fine-tuned to handle. As prompts shift, LLMs may struggle to interpret and respond accurately, leading to incoherent outputs or reduced response quality.
Drift monitoring plays a crucial role in identifying these changes in prompts early. By tracking how prompts evolve and ensuring models adapt accordingly, businesses can catch potential issues before they escalate into major problems and maintain the reliability of their LLM applications.
In this post, we will dive into how LLM performance can be impacted, and how monitoring LLMs using the drift metric can help catch these issues before they become a problem.
How Enterprises are Deploying LLMs and Managing AI Drift

In a separate blog post, we dove into the four different approaches that enterprises are taking to jumpstart their LLM journey, as summarized below:
- Prompt Engineering with Context involves directly calling third-party AI providers like OpenAI, Cohere, or Anthropic with a prompt that is curated or “engineered” in a specific way to elicit the right response
- Retrieval Augmented Generation (RAG) involves augmenting prompts with externally retrieved data relevant to the query so that the LLM can correctly respond with that information (see the sketch after this list)
- Fine Tuned Model involves updating the model itself with a larger dataset of information which obviates the need for augmentation data in the prompt
- Trained Model involves building an LLM from scratch on large corpora of data, which can be domain-centric to produce a domain-focused LLM, e.g., BloombergGPT
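As a concrete illustration of the RAG pattern above, here is a minimal sketch of prompt augmentation. It assumes a hypothetical retrieve_documents helper (e.g., backed by a vector store) and the OpenAI Python client; the model name and prompt template are illustrative choices, not a prescribed implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(question: str, retrieve_documents) -> str:
    # retrieve_documents is a hypothetical callable returning relevant text chunks.
    # Splice the retrieved context into the prompt so the LLM can ground its answer.
    context = "\n\n".join(retrieve_documents(question, top_k=3))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```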
Regardless of which LLM deployment approach you take, LLM performance will degrade over time. It is critical for LLMOps teams to have a defined process for monitoring LLM performance and alerting on issues before they negatively impact the business and end users.
Common LLM Issues That Lead to Data Drift
While LLMs offer generalized conversational skills, enterprises are focused on targeted, domain-centric use cases. Teams deploying these LLMs care about the LLM’s performance on a finite set of test data that includes prompts representative of the use case and their expected responses. Performance problems occur when prompts or responses begin to deviate from the ones expected.
There are two reasons why this tends to happen:
1. New Kinds of Prompts
LLM solutions like chatbots are deployed to handle a focused set of queries — inputs that end users will commonly ask the LLM. These queries and their expected responses are documented to form the test data that the model is either fine-tuned or validated with. This helps ensure that the LLM has been quality tested for these prompts.
However, customer behavior can change over time. For example, customers might need information from a chatbot about new products or processes that were not around when the chatbot was built. Since the use case was not previously accounted for, the underlying LLM may not have been fine-tuned for it or the RAG solution may not find the right document to generate a response. This reduces the overall quality of the response and performance of the chatbot.
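One way to catch these new kinds of prompts early is to flag production prompts that are semantically far from everything in the test set. Below is a minimal sketch, assuming the sentence-transformers library; the embedding model, example prompts, and similarity threshold are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def unseen_prompts(baseline_prompts, production_prompts, threshold=0.5):
    # Embed both sets and find each production prompt's best match in the baseline;
    # prompts with no close match hint at a use case the solution never covered.
    base_emb = encoder.encode(baseline_prompts, convert_to_tensor=True)
    prod_emb = encoder.encode(production_prompts, convert_to_tensor=True)
    sims = util.cos_sim(prod_emb, base_emb)      # (num_prod, num_base) similarities
    best_match, _ = sims.max(dim=1)              # best baseline match per production prompt
    return [p for p, s in zip(production_prompts, best_match) if s < threshold]

flagged = unseen_prompts(
    ["How do I return a product?", "Where is my order?"],
    ["What is dropout in deep learning?", "How do I return my shoes?"],
)
print(flagged)  # likely flags the deep learning question as out of scope
```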
2. Different Responses to the Same or Similar Prompts
Robustness
Even when the LLM has been tested or fine-tuned with a base set of prompts, users might not phrase their prompts exactly as tested. For example, an eCommerce LLM will perform well if a user inputs the prompt “How do I return a product?” because the LLM was tested with that prompt. However, it might not do well if the prompt changes to “I’m confused about how to return my shoes” or “Can I get help on sending back the gift?”, since the model might not recognize them as the same question. As a result, the LLM will respond in a different, unexpected way. This property is called model robustness, and weaker robustness can result in different responses to the same question asked with different linguistic variations.
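A simple way to probe robustness is to send several phrasings of the same question to the LLM and check that the responses stay semantically consistent. The sketch below assumes the sentence-transformers library; ask_llm is a hypothetical stand-in for your chat completion call, and the threshold mentioned in the comment is illustrative.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def robustness_score(ask_llm, variants):
    # Ask the LLM each variant and compute the average pairwise cosine
    # similarity of its responses; a low score suggests weak robustness.
    responses = [ask_llm(v) for v in variants]
    emb = encoder.encode(responses, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    n = len(variants)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(sum(sims[i, j] for i, j in pairs) / len(pairs))

variants = [
    "How do I return a product?",
    "I'm confused about how to return my shoes",
    "Can I get help on sending back the gift?",
]
# score = robustness_score(my_chat_fn, variants)  # e.g. investigate if score < ~0.8
```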

Changes to Underlying Models
When using AI via third-party APIs, the LLMs behind the APIs can unexpectedly change. Like traditional ML models, LLMs can also be refreshed or tuned. An update might not be significant enough to warrant changing the major or minor version of the LLM itself, yet the LLM’s behavior on your set of prompts can still change. A recent paper that evaluated OpenAI’s GPT-3.5 and GPT-4 at two different points in time found greatly varying performance and behavior.

Drift Monitoring for LLMs: Why it Matters
Similar to model monitoring in the well-established MLOps lifecycle, LLM monitoring is a critical step in LLMOps to ensure high performance is maintained. Drift monitoring, for example, is needed to identify whether a model’s inputs and outputs are changing relative to a fixed baseline, typically a sample of the training set or a slice of production traffic, or, in the case of LLMs, the fine-tuning dataset or a prompt-response validation set.
If there is model drift, it means that the model is either seeing different data from what is expected or producing different responses from what is expected. Both can be leading indicators of degraded model performance. Similar to traditional model drift metrics, the drift itself is calculated as a statistical metric that measures the difference between the density distributions of the baseline and production data; for LLMs, that means the distributions of prompts and of responses.
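As a rough illustration, this kind of drift can be computed over embeddings of the baseline and production prompts. The sketch below uses cosine distance between the centroids of the two embedding sets as one simple proxy for the drift statistic; it assumes the sentence-transformers library, and the model name and example prompts are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def prompt_drift(baseline_prompts, production_prompts):
    # Embed both sets and return the cosine distance between their centroids
    # (0 = very similar distributions of prompt meaning, higher = more drift).
    base = encoder.encode(baseline_prompts)     # (n, d) array
    prod = encoder.encode(production_prompts)   # (m, d) array
    c_base, c_prod = base.mean(axis=0), prod.mean(axis=0)
    cos = np.dot(c_base, c_prod) / (np.linalg.norm(c_base) * np.linalg.norm(c_prod))
    return 1.0 - cos

baseline = ["How do I return a product?", "Where is my order?"]
production = ["Can I get help sending back the gift?", "What is dropout in deep learning?"]
print(f"prompt drift score: {prompt_drift(baseline, production):.3f}")
```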
The Role of Drift Monitoring in Catching LLM Data Drift
Let’s look at how LLM drift can be measured and how it can help identify performance issues.
Tracking Drift to Detect Shifts in LLM Outputs
To ensure an LLM use case is implemented correctly, you need to identify the types of prompts the model is expected to handle, along with their correct responses. These form the dataset that you can use to fine-tune the model, or use as a test dataset if you’re engineering prompts or deploying RAG. This dataset represents the expected reality for the model and serves as a baseline for detecting prompt drift—when the nature of user inputs deviates from what was originally anticipated. Additionally, monitoring output drift is crucial to ensure that the model’s responses align with user expectations and business objectives over time.
As production prompts evolve, prompt drift monitoring helps measure how different they are from the baseline. A significant shift in prompts can indicate changing user behavior, requiring updates to the model or retrieval strategies.
As an example, consider a chatbot answering technical questions about an ML training solution. We see a significant spike in drift, represented by the blue line in the timeline chart. By further diagnosing the traffic using Uniform Manifold Approximation and Projection (UMAP), a low-dimensional (here, 3D) projection of the data, we can see a new cluster of users asking about deep learning dropout and backpropagation, concepts the use case was not designed to handle. These types of prompts can now be added to the fine-tuning dataset or introduced into RAG as a new document.
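Below is a minimal sketch of this kind of diagnosis, assuming the sentence-transformers and umap-learn libraries; the embedding model and UMAP parameters are illustrative.

```python
import umap
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def project_prompts(baseline_prompts, production_prompts):
    # Embed all prompts and project them into 3D with UMAP; plotting the
    # coordinates colored by label makes new production-only clusters visible.
    prompts = baseline_prompts + production_prompts
    coords = umap.UMAP(n_components=3, random_state=42).fit_transform(
        encoder.encode(prompts)
    )
    labels = ["baseline"] * len(baseline_prompts) + ["production"] * len(production_prompts)
    return coords, labels
```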



Measuring the Impact of LLM Data Drift
LLM drift can significantly affect model reliability, resulting in inaccurate responses and a poor user experience. Data drift monitoring helps organizations identify shifts in model behavior early, allowing them to take corrective actions such as fine-tuning, updating RAG data, or refining prompt strategies. By proactively monitoring drift, teams can maintain model accuracy, consistency, and alignment with business objectives, ensuring LLMs deliver reliable performance over time.
Monitoring Data Drift in LLM Responses
We just reviewed how drift can help identify changes in prompts over time. However, as we saw earlier, LLM responses can also change across prompts that mean the same thing but are phrased with different linguistic variations. Monitoring prompt drift alone is insufficient to assess operational quality; we also need to track performance drift in responses to ensure accuracy and consistency.
If there is no drift in prompts but a noticeable performance drift in responses, this suggests the underlying model is returning different responses than expected. Addressing this requires improving the solution: engineering new prompt variations that elicit the desired response for RAG, and potentially fine-tuning the LLM with them.
When prompt and performance drift occur simultaneously, AI practitioners can calculate drift for the combined prompt-response tuple. This helps determine whether responses vary for stable prompts or if changes in both prompts and responses are driving the shift.
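One way to compute drift for the combined tuple is to embed prompts and responses separately, concatenate the vectors, and compare baseline against production. A minimal sketch, assuming the sentence-transformers library; the centroid-distance statistic is one illustrative choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def tuple_drift(baseline_pairs, production_pairs):
    """Each argument is a list of (prompt, response) tuples."""
    def embed(pairs):
        # Embed prompts and responses separately, then concatenate so the
        # drift statistic reflects the joint prompt-response behavior.
        prompts = encoder.encode([p for p, _ in pairs])
        responses = encoder.encode([r for _, r in pairs])
        return np.concatenate([prompts, responses], axis=1)  # (n, 2d)

    c_base = embed(baseline_pairs).mean(axis=0)
    c_prod = embed(production_pairs).mean(axis=0)
    cos = np.dot(c_base, c_prod) / (np.linalg.norm(c_base) * np.linalg.norm(c_prod))
    return 1.0 - cos  # compare against prompt-only and response-only drift
```

Comparing this combined score with the prompt-only and response-only scores helps distinguish whether responses are varying for stable prompts or whether both sides are shifting together.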

As enterprises bring more LLM solutions to production, it will be increasingly important to ensure high performance in those deployments in order to achieve their business objectives. Monitoring for drift allows teams deploying LLMs to stay ahead of any impact to their use case performance.