Tracking Drift to Monitor LLM Performance
In this episode, we discuss how to monitor the performance of Large Language Models (LLMs) in production environments. We explore common enterprise approaches to LLM deployment and examine why it's important to monitor the quality of LLM responses over time. We discuss strategies for "drift monitoring," which tracks changes in both input prompts and output responses, allowing for proactive troubleshooting and improvement via techniques like fine-tuning or augmenting data sources.
Read the article by Fiddler AI and explore additional resources on how AI observability can help developers build trust into AI services.
[00:00:00] Welcome back everyone to Safe and Sound AI. Today, we're going to take a deep dive into something that's, uh, pretty crucial when it comes to large language models: keeping those LLMs running smoothly, or safe and sound, as we like to say around here. Specifically, we're talking about performance monitoring. Think about it like this.
[00:00:19] You know how your car needs regular checkups to stay in tip top shape? Well, LLMs actually need the same kind of attention, you know, to make sure they're performing how we expect them to. And a key idea to keep in mind is something called drift. Drift is kind of like a canary in the coal mine for LLM performance.
[00:00:36] But we'll get to that a bit later. First, let's set the stage a little. I mean, we've all seen just the meteoric rise of generative AI and large language models. It feels like every company out there is rushing to deploy these LLMs, trying to get a leg up on the competition.
[00:00:51] Yeah, the excitement is definitely there, but it's really important to remember LLMs are just like any other machine learning model, meaning they can actually degrade over time.
[00:00:59] And that can have some pretty serious implications for reaching business goals.
[00:01:02] Yeah, exactly. Imagine you put all these resources into building a chatbot to provide great customer support, and then all of a sudden it starts giving out wrong information because of drift. That's a problem.
[00:01:14] Absolutely. So let's take a look at the four main approaches that companies are using to deploy LLMs.
[00:01:19] And I think each approach kind of presents its own monitoring challenges. So the first one is prompt engineering with context. And this one involves really carefully crafting prompts to get the right responses from third-party AI providers. Then there's retrieval-augmented generation. And in this case, we're actually adding external data to the prompts.
[00:01:39] This gives the LLMs more context so they can answer these really complicated questions. Third, we have a fine-tuned model. This means taking an existing LLM and training it further on a domain-specific dataset, so it gains specialized expertise for a specific industry or a specific task.
[00:01:55] And then finally, we have the most resource-intensive approach, the trained model. This one's actually building a brand new LLM from scratch, trained on a massive dataset. And a good example of this is BloombergGPT, which is tailored just for finance.
[00:02:08] Wow, okay, so we have four different approaches. I'm already kind of seeing how monitoring each of these could get really complex really fast.
[00:02:14] Yeah, and I think the key takeaway here is that monitoring is super important, no matter how you're deploying. Because LLMs aren't just static things, they change. Their performance can fluctuate over time. And so good LLMOps requires a robust process to find and address these performance issues before they become big problems.
[00:02:35] Right, so they're not a set it and forget it solution. So then what exactly makes their performance, you know, start to deteriorate?
[00:02:41] Well, there are two main things, I think. First, you have all these new kinds of prompts that can emerge. You know, customer behavior is dynamic. It's constantly changing. And this means an LLM might encounter prompts that it wasn't trained for.
[00:02:53] And so the responses might not be great. Imagine a chatbot that was trained on questions about smartphones, and then suddenly everyone starts asking about self-driving cars.
[00:03:01] It seems like the unpredictability of users can really mess things up.
[00:03:05] Yeah, for sure. And then the second thing is what we call different responses to the same or similar prompts.
[00:03:11] This basically boils down to model robustness. An LLM might give different answers to the same question just phrased differently. So, for example, with e-commerce, it might handle a question like "How do I return a product?" perfectly. But then if someone asks, "I'm confused about how to return my shoes," it fails, even though they mean the same thing.
[00:03:29] So even though the questions seem similar to us humans, the LLM gets tripped up on the little differences.
[00:03:35] It all comes down to how the model interprets those nuances. And then on top of that, you have to remember those third-party APIs that are providing the underlying LLMs. Well, they can change without warning.
[00:03:45] It's just like software updates. You know, LLMs go through revisions and tuning, and even small changes can really impact your applications. There was actually a research paper that found some pretty big performance differences between versions of OpenAI's GPT-3.5 and GPT-4 over time. So that means even if you're not making any changes, the underlying model could be shifting, and suddenly your prompts aren't as effective.
[00:04:08] Okay, so we've got these evolving prompts, model robustness issues, and even the AI models themselves changing. It sounds like keeping track of all this is really important. But how do we actually catch these problems, you know, before they become huge headaches?
[00:04:24] Well, that's where drift monitoring comes in.
[00:04:26] It's a technique that we've taken from traditional machine learning. And it's proving super valuable for LLMs. Basically, it helps us see those little performance shifts before they snowball into big problems.
[00:04:37] I like that. So we're not just waiting for things to break. We're being proactive. But break down drift monitoring for me.
[00:04:43] What exactly are we monitoring and how does it work?
[00:04:45] So imagine you go back to a place that you haven't seen in years. You take a new picture and you compare it to an old photo of that place. You'll probably notice some differences. Maybe there's a new building or some trees have gotten taller. Those differences are kind of like drift.
[00:04:58] They're showing us how things have changed. And when it comes to LLMs, we can actually apply this idea of drift monitoring to both the prompts that the model's receiving and the responses that it generates. It's all about finding shifts in patterns that could mean there's a problem.
[00:05:17] Okay, so like a before-and-after snapshot to see what's different. Yeah. But can you give me a real-world example?
[00:05:23] Yeah, sure. Let's say you have a chatbot on your website that's supposed to answer questions about your products and services. If you want to do drift monitoring, you'd start by making a baseline dataset, basically a snapshot of the usual prompts that the chatbot gets and the responses it gives under normal conditions.
[00:05:38] So we're capturing how the LLM should be acting and using that as our benchmark.
[00:05:43] Exactly. Then, as your chatbot's talking to customers, you keep monitoring the incoming prompts and the outgoing responses, and you compare this real-time data to your baseline dataset using some statistical analysis.
[00:05:57] Looking for anything that really stands out, anything different from that normal behavior.
[00:06:01] Exactly. And those things that stand out, those red flags? That's what we call drift. Maybe the chatbot's getting tons of questions about a brand new product, something it wasn't trained on. Or maybe it's giving different answers to the same question, which means there might be a problem with model robustness.
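To make that concrete, here's a minimal sketch of what that baseline-versus-live comparison could look like. It's an assumed illustration rather than anything from the article: it supposes you've logged prompts as plain strings, uses prompt word count as a stand-in for a richer feature like an embedding, and the datasets are made up.

```python
import numpy as np
from scipy import stats

def prompt_lengths(prompts):
    # Toy drift feature: word count per prompt. Real systems typically
    # compare embedding or topic distributions, but the mechanics are the same.
    return np.array([len(p.split()) for p in prompts])

# Baseline snapshot: prompts the chatbot handled under normal conditions.
baseline_prompts = [
    "How do I return a product?",
    "Where is my order?",
    "Do you ship internationally?",
    "What sizes do these shoes come in?",
]

# Live window: prompts collected from recent production traffic.
live_prompts = [
    "I'm confused about how to return my shoes, can you walk me through it?",
    "Does the new self-driving accessory you just launched work with my car?",
    "Can I get a refund if the package arrived damaged and the box was open?",
]

# Two-sample Kolmogorov-Smirnov test: how different are the two distributions?
result = stats.ks_2samp(prompt_lengths(baseline_prompts),
                        prompt_lengths(live_prompts))
if result.pvalue < 0.05:
    print(f"Possible prompt drift (KS statistic={result.statistic:.2f})")
else:
    print("No significant drift in this window")
```

In a real deployment the live window would hold thousands of prompts and the comparison would run on a schedule, but the shape of the check is the same: snapshot, compare, alert.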
[00:06:17] So it's like an alarm system, letting us know about potential problems before they get out of control.
[00:06:23] That's a great way to put it. But it's not just about finding the problem. Drift monitoring actually gives you insights that can help you figure out the root cause so you can fix it. It's really a tool for continuous improvement.
[00:06:34] So we're not just sitting back and watching. We're using these insights to actually make the LLMs better. But how do we do that? How do we turn those insights into action?
[00:06:43] Well, that's where it gets really interesting because by looking at the type of drift, you can make smart decisions. Maybe you need to fine-tune your LLM or update your training data.
[00:06:53] Maybe you need to adjust your prompts. Each situation kind of needs its own solution.
[00:06:58] It sounds like it's all about being proactive, like preventative care for our LLMs. We're constantly trying to optimize them, make sure they're working at their best.
[00:07:07] And that's becoming more and more important as LLMs are being used for these really critical things, from customer service and financial analysis to even medical diagnosis.
[00:07:17] The stakes are high. We need to make sure these systems are working reliably and responsibly.
[00:07:21] For sure. So we talked about why drift monitoring matters. But I'm curious about the how. What are the actual techniques that we use to measure this drift?
[00:07:30] That's a great question. Let's get a little technical here, but I'll try to keep it high level.
[00:07:35] There are all sorts of statistical methods that can quantify drift, but they all basically come down to this: comparing distributions of data. Remember those before-and-after pictures we talked about? Imagine you're analyzing the colors in those images to see how much they've changed. That's kind of what we're doing with drift monitoring.
[00:07:54] We're looking at these statistical properties of the prompts and responses to see how much they've deviated from that baseline.
[00:08:01] Okay, so we're looking for those shifts in the statistical patterns. Those tell us that something might be off.
[00:08:06] Yeah, and there are different ways to actually measure those shifts.
[00:08:09] One common approach is something called the Kullback–Leibler divergence. It's a way to measure the difference between two probability distributions.
[00:08:17] Okay, I'm going to need you to break that down a little. What does it mean to measure the difference between probability distributions?
[00:08:23] So think of it like this.
[00:08:25] We're comparing how likely it is for certain prompts or responses to show up in our baseline dataset, compared to how likely they are to show up in our real time data. If there's a big difference between those probabilities, that's a sign of drift.
[00:08:38] Got it. So, if the probability of seeing certain prompts or responses has changed a lot, that means the LLM might be dealing with something new.
[00:08:46] Exactly. And the Kullback–Leibler divergence gives us a number for that difference. So we can actually measure how much drift there is.
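As a rough illustration, and not something from the article, here's what that number could look like if you bucket prompts by topic and compare the topic frequencies in the live window against the baseline; the topics and frequencies here are made up.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical topic frequencies: the share of prompts falling into each bucket
# (returns, shipping, pricing, new product) in the baseline window versus the
# live window. Each distribution sums to 1 over the same bins.
baseline_dist = np.array([0.40, 0.35, 0.20, 0.05])
live_dist = np.array([0.25, 0.30, 0.15, 0.30])  # "new product" questions surged

# Kullback-Leibler divergence D_KL(live || baseline): 0 means the live
# distribution matches the baseline; larger values mean more drift.
kl_divergence = entropy(live_dist, baseline_dist)
print(f"KL divergence: {kl_divergence:.3f}")
```

In practice, teams often smooth zero counts or use a symmetric variant such as the Jensen-Shannon divergence, since plain KL divergence becomes infinite when a bin that shows up in live traffic has zero probability in the baseline.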
[00:08:53] So we're not just guessing, we have this mathematical tool to see these subtle changes.
[00:08:58] And that's just one method. The important thing is, we can apply these techniques to both the inputs, which are the prompts, and the outputs, which are the responses.
[00:09:07] So we get a complete picture of how the LLM is performing.
[00:09:11] Wow. So we're not just checking if the prompts have changed, we're also looking at whether the LLM is still giving us the right answers.
[00:09:16] Exactly. And that gives us a powerful way to troubleshoot problems. For example, let's say we see drift in the responses, but not in the prompts.
[00:09:26] That could mean that the LLM itself has changed. Maybe because of an update from the third-party provider?
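As a sketch of that troubleshooting logic, and again just an assumed illustration rather than anything from the article, you could compute a drift score for prompts and responses separately and branch on the combination; the threshold here is arbitrary.

```python
def diagnose_drift(prompt_drift: float, response_drift: float,
                   threshold: float = 0.1) -> str:
    # Toy triage: both scores are drift measurements against the baseline
    # (e.g., KL divergence); the threshold is an assumed, tunable cutoff.
    prompts_moved = prompt_drift > threshold
    responses_moved = response_drift > threshold
    if responses_moved and not prompts_moved:
        return "Responses drifted alone: suspect an upstream model change."
    if prompts_moved and not responses_moved:
        return "Prompts drifted: users are asking new things; add context or fine-tune."
    if prompts_moved and responses_moved:
        return "Both drifted: review recent traffic, then retrain or re-tune."
    return "No significant drift detected."

# Example: responses shifted while prompts stayed stable.
print(diagnose_drift(prompt_drift=0.02, response_drift=0.35))
```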
[00:09:32] It's like we have this detective on the case, helping us figure out not just what's wrong, but why it's wrong. This is super valuable. Especially as these LLMs get more complex.
[00:09:41] Absolutely. And the good news is, drift monitoring is getting easier to do.
[00:09:46] There are new tools and platforms that can automate a lot of the process, so LLMOps teams can integrate it into their workflow more easily.
[00:09:53] This has been so helpful. We went from the basics of drift monitoring to the nitty gritty of how it all works. But before we wrap up, I'm curious about the bigger picture. Where does drift monitoring fit in with the future of LLMOps? It feels like we're just getting started. It really feels like drift monitoring is that connection between like the theory of LLMs and how they actually work in the real world.
[00:10:16] Yeah, it's like the tool that turns those complex statistical insights into actual things you can do to improve how LLMs work. It helps us make sure AI is being deployed responsibly.
[00:10:25] And as LLMs become, you know, more and more part of our lives, we want to make sure they're not just working efficiently, but ethically too.
[00:10:34] That's a really important point. I mean, we've seen how AI systems can kind of inherit and even amplify biases that already exist in society, and monitoring for drift can be huge in finding and fixing these problems before they cause any harm.
[00:10:47] So drift monitoring isn't just about making things run better.
[00:10:50] It's also about building AI systems that are fair.
[00:10:53] Exactly. It helps make sure that this powerful technology is used responsibly and benefits everyone.
[00:11:00] This has been an amazing conversation. We've really gone deep into drift monitoring, explored why it matters, how it works, and what it means for the future.
[00:11:09] It's clear that this is an area that's constantly changing and we need to stay on top of things.
[00:11:14] I completely agree. As LLMs get more advanced and more intertwined with our lives, drift monitoring is going to be crucial to making sure they're deployed safely and responsibly.
[00:11:24] Absolutely. Well, thank you so much for joining us on this deep dive into the world of drift monitoring.
[00:11:29] I hope you all learned a lot, and remember, just like those regular checkups for your car, drift monitoring is key to keeping those LLMs running smoothly and ethically. This podcast was brought to you by Fiddler AI. For more on monitoring LLM performance, see the article in the description.