Metrics to Detect Hallucinations with Pradeep Javangula
In this episode, we’re joined by Pradeep Javangula, Chief AI Officer at RagaAI
Deploying LLM applications for real-world use cases requires a comprehensive workflow to ensure LLM applications generate high-quality and accurate content. Testing, fixing issues, and measuring impact are critical steps of the workflow to help LLM applications deliver value.
Pradeep Javangula, Chief AI Officer at RagaAI will discuss strategies and practical approaches organizations can follow to maintain high performing, correct, and safe LLM applications.
[00:00:00] Joshua Rubin:
[00:00:06] Joshua Rubin: Welcome and thank you for joining us on today's AI Explained, uh, on Metrics to Detect Hallucinations. This is kind of a fireside. Um, I'm Josh Rubin, Principal AI Scientist at Fiddler AI. I'll be your host today.
[00:00:18] Joshua Rubin: We have a very special guest today on AI Explained and that's uh, Pradeep Javangula, Chief AI Officer at RagaAI. Welcome, Pradeep.
[00:00:28] Pradeep Javangula: Hi, Josh. How's it going? Hi, everyone. Thank you so much. I appreciate the invitation.
[00:00:32] Pradeep Javangula: So looking forward to this conversation, so.
[00:00:36] Joshua Rubin: So, uh, maybe before we jump to intros, uh, I'm just looking at the poll questions here, um, I guess as advertised by the, the name of the fireside, hallucinations are really top of mind for people. Um, but, uh, I don't know if you're looking at the responses also, but it looks like, I'm actually, it's actually interesting to me that we have some coverage over all of the different, uh, all of the different questions that people are worried about.
[00:01:00] Joshua Rubin: So, latency, cost, privacy. Um, but I think I would have probably predicted a distribution like this, uh, with some, some significant concern about safety and toxicity. It's certainly top of mind for us right now. Um, so probably, maybe, maybe you can do a quick introduction of yourself.
[00:01:18] Pradeep Javangula: Yeah.
[00:01:19] Joshua Rubin: Yeah.
[00:01:20] Pradeep Javangula: Well, first of all, thank you, Josh.
[00:01:21] Pradeep Javangula: Uh, appreciate the invitation and thank you everyone for joining. So, so my name is Pradeep Javangula. I'm the Chief AI Officer for RagaAI. Um, we are a comprehensive testing validation and verification sort of like product and service that, um, we're a sort of like a fledgling company that's been around for roughly about two, two and a half years.
[00:01:44] Pradeep Javangula: And, uh, consisting of a variety of data scientists, machine learning engineers. My background is I've been in the, in the Bay Area for lots of years. Since 1996, I often like to call it the dawn of the internet. I've been a founder of different companies and I've been fortunate to be involved with many aspects of data, data science, machine learning platforms, primarily through Search engines.
[00:02:12] Pradeep Javangula: So I used to be at a company called Ink2Me, which was a big web search engine before there was a Google. We used to power Yahoo, Microsoft, and AOL. And after that, started a computational advertising company and spent time at Adobe. As the head of their ML and AI efforts, and it worked in a variety of different domains.
[00:02:31] Pradeep Javangula: So, enough of that, but, but yeah, like I said, it's like, you know, so if you'd said 20 years ago that like, oh, this is the level of, um, sort of interest and progress and progress that we would have achieved with AI, um, In 2024, I'd have been a bit surprised, but here we are, so it's all good.
[00:02:55] Joshua Rubin: Yeah, it's an amazing, amazing sort of resume for the stuff that you're doing now.
[00:02:59] Joshua Rubin: I often point at Adobe as, uh, you know, a sort of, uh, a really impressive, uh, You know, organization for having mobilized AI, especially generative AI, you know, for a lot of really valuable use cases, you know, I think maybe more than most applications, um, Adobe's products now are benefiting from, you know, years invested in, um, you know, generating really powerful, useful generative AI tools.
[00:03:29] Joshua Rubin: Um, but I guess that's not, not so much the topic of today. Uh, right, right. So, um, I, you know, we're, I think we're going to talk, at least start with, um, hallucinations today. And, uh, you know, I think as everyone is racing to roll out one kind of, you know, LLM or generative application or another, Um, you know, it's sort of this new, new paradigm of programming where, uh, you know, the output is not particularly deterministic.
[00:03:57] Joshua Rubin: So, uh, maybe, maybe we start with the obvious question, uh, you know, what, what's a hallucination anyway? You know, and maybe why should people be worried about it?
[00:04:05] Pradeep Javangula: Right. So, um, I guess maybe it's a good idea to start at the beginning in terms of sort of like, uh, defining. First of all, I'm not a huge fan of the word hallucination as a way to describe the phenomenon that occurs when leveraging, large language models or foundational models in terms of generating, content, right?
[00:04:30] Pradeep Javangula: And so the generative AI terminology is probably a better sort of like in a use case for what this thing is, right? You know, it is, primarily given that a large language model has been trained on a very large corpus of text, right? Where there's Some discerning and determination of what the underlying patterns are.
[00:04:49] Pradeep Javangula: It is attempting to predict sort of the next token, right? Or the sequence of tokens, right. So depending on what the question is, what the prompt is, there is content that is being generated and, and in the content that is being generated, it is maintaining a certain level of fidelity to facts, or actual truth, is something that a large language model has no sense of.
[00:05:17] Pradeep Javangula: It has no idea that it is actually spewing out something that is, sort of like, appears statistically, stochastically, as the next logical token to show up. So. Because of that, you would end up with text or images or multimodal sort of like output that is generated that doesn't jive with the truth.
[00:05:37] Pradeep Javangula: Therefore, it appears to be making stuff up, right? And that's what's sort of like basically referable as hallucination. And as you rightly pointed out, so the risks associated with producing something that is untruthful or something that is not sort of like grounded in reality have adverse implications that really do need to be addressed.
[00:05:58] Pradeep Javangula: So, which is I guess the reason why 70 percent of our audience says hallucinations is our number one problem, right? Um, and, um, yeah, so, so I would define hallucination as, as, as, as basically the, the generated content Not being grounded in truth, not being grounded in anything that is actually contextually relevant, right?
[00:06:23] Pradeep Javangula: Um, and so it'll often appear to be that this thing that's on the other side, the model that's on the other side is, um, is producing arbitrary content, so. Um, so we can call it chat. In some circles, it is actually considered a feature, not a bug, right? You know, because it's actually speculating, and it is sort of like generating things that are, that a normal human would probably not do.
[00:06:51] Pradeep Javangula: So it's not a scientific or mathematical definition, but there it is.
[00:06:57] Joshua Rubin: Yeah, right, right. I think, I think the kind of fundamental misunderstanding here is sort of that, uh, You know, a large language model or a generative AI application and it is by definition supposed to be doing something sort of factual and well constrained like computer software usually does.
[00:07:16] Joshua Rubin: Um, you know, and it's sort of like, it's sort of like making stuff up in one flavor or another. In a sort of statistically, uh, justified way is sort of the nature of the beast, right? So, you know, instead of programming where it's sort of this additive process where you're building things that behave in a well specified way, it's a little more like sculpting, right?
[00:07:37] Joshua Rubin: Where you start with something that has this behavior of spewing plausible statistically grounded stuff and then sort of carving away the bad behaviors. Uh, to get at, uh, something that's a little bit more, um, appropriate for a specific business application or, uh, whatever application. Um, you know, another kind of related concept that, you know, that I think is interesting are, um, You know, sometimes it can be factually correct, but sort of contextually inappropriate, right?
[00:08:08] Joshua Rubin: Like, you don't, if you're a business, you don't so much want, uh, you know, if you're an airline, you don't want your, uh, customer service chatbot giving medical advice, right? Even if, even if that's kind of baked into the brain of the chatbot. There's also this very related, interesting sort of issue of, you know, constraining the model to be sort of context appropriate or policy following.
[00:08:29] Joshua Rubin: Um, you know, some of those things we do with kind of fine tuning, uh, but, uh, you know, a lot of LLM development happens through, uh, through prompting, um, anyway, uh, you know, and basically asking the model for what you want.
[00:08:42] Pradeep Javangula: That's well put.
[00:08:43] Joshua Rubin: So, um.
[00:08:44] Pradeep Javangula: Really well put. Yes.
[00:08:45] Joshua Rubin: Yeah, yeah. So, uh, I don't know, do you want to, maybe, maybe you can say something about how, um, how RagaAI thinks about, uh, sort of, uh, evaluating these things, like, you know, how, how, how do you measure this and, uh, you know, sort of business specific way.
[00:09:02] Pradeep Javangula: Maybe it's like a brief introduction to RagaAI would be appropriate, right? Meaning that, first of all, we serve sort of the enterprise community, meaning that enterprises that are building applications that are AI enabled or AI driven, either for consumers or for other businesses. So it's essentially B2B, sort of like, is our primary orientation.
[00:09:26] Pradeep Javangula: So in that context, as you rightly pointed out, unless you are an OpenAI or Anthropic or someone else that is building sort of general purpose things that are aimed at everybody, most enterprises are leveraging these large language models of different stripes for the purposes of solving specific problems, conversational agents, or some sort of reasoning agents, or workflow agents, or insight deliverers, um, based off of their enterprise corpus.
[00:10:01] Pradeep Javangula: Whatever that corpus may be. It could be large volumes of text, it could be some structured data, it could be image data, and so on. So RagaAI is primarily involved with Attempting to figure out what your problems are in a machine learning or an AI app development cycle. And our primary focus is pre deployment, although there's some overlap with Fiddler and other, and other companies in terms of like what we end up doing on the inference side as well.
[00:10:32] Pradeep Javangula: Primarily, we look at what we deliver as a, as a platform to deal with data issues. Model issues and the operational issues, meaning that as you are contemplating on building a RAG application or building a much more of a classical machine learning application, what is the level of rigor that you apply with respect to sampling, with respect to the segmentation between the two.
[00:11:00] Pradeep Javangula: test and train, or train and test, and figuring out how well your model is performing with a certain level of all the metrics that data scientists care about. Precision, recall, AUC ROC curves, and, and, and all sorts of things, and think of sort of like a suite of tests that we deliver to to the developers or to the machine learning engineers or to data scientists that they wouldn't have to hand code but given data, given model, and given some sort of like inference input, we would assess and will tell you sort of like where things are, right?
[00:11:40] Pradeep Javangula: So, so that's the, that's the thrust of like what we're doing and as a consequence of like what we do, we kind of have to plug ourselves into the stream of the application development life cycle. So the ML or the AI app development lifecycle. So, so we end up often being asked to sort of like deliver on compliance and governance oriented sort of like solutions in specific verticals.
[00:12:06] Pradeep Javangula: And I should also point out that this problem of being able to detect all problems and point to the root causes of like what's causing those problems is highly non trivial, number one. Number two, it is. very verticalized in nature right. And so we by no means are claiming that we've solved all problems for all verticals for all times right.
[00:12:31] Pradeep Javangula: So and on top of that, our ecosystem continues to evolve very rapidly and very quickly. Our focus has been on, on, on, on specific verticals. So we've started out with computer vision and then expanded to sort of like large language models and text. And we also have support for structured data sets. So that's, that's what we are.
[00:12:52] Pradeep Javangula: So maybe I actually missed out on, on, on the question you asked. So, so yes. So it's a pretty...
[00:12:57] Joshua Rubin: I lost track of it already.
[00:13:01] Pradeep Javangula: So pre deployment is, is, uh, is our primary focus, so to probe into the application development lifecycle with respect to data and model and identify issues.
[00:13:13] Joshua Rubin: Nice. Um, I can relate to your sentiment about the challenges associated with, uh, you know, drilling down and finding problems and offering remediations.
[00:13:25] Joshua Rubin: I think, you know, sort of the first step is And maybe you would, I'd be interested in, you know, your perspective on this, but I think we sort of see a first step as observability, right? Like, can you characterize how well the thing works? You know, for us, we pay a lot of attention to, you know, production time.
[00:13:42] Joshua Rubin: Has the world changed in some way? Has some new topic emerged that's a problem? Uh, you know, the first step is like, you know, having some sort of comprehensive, uh, observability into the performance of the model by some number of dimensions. Um, You know, but of course that's not enough, right? A customer, you know, what we hear all the time is, okay, you've told us there's a, an issue, uh, but that only gets us halfway there.
[00:14:07] Joshua Rubin: Like, you know, where, where do we take this? Um, so I don't know if you have any, any, any thoughts around that, or we can maybe move on a little bit.
[00:14:16] Pradeep Javangula: Like I said, this is a hard problem, but as I said, ultimately, sort of like an application is evaluated in terms of how well it is or how satisfactory its functionality is as discerned by the user, right?
[00:14:33] Pradeep Javangula: So, you know, but we can know something about how well. It's being received or how satisfactory the responses are from sort of like inference data, right? Which is, which is what you guys do an amazing job off and try to point to sort of like other interpretability or explainability aspects of sort of like, what, what, what's going on?
[00:14:55] Pradeep Javangula: But then you want to take it all the way back, right? Meaning that, how do I re. So, you know, in what way should I change my training data set so that it's much more uniformly representative of the kind of model that I ought to be building? Is there how much of a drift is there? How much of, how much of sort of like failure mode analysis is actually occurring?
[00:15:17] Pradeep Javangula: So, how do you deal with things like active learning? So this is what we do. This is basically the community refers to as active learning. It is taking what's happening out in production and identifying the things that you ought to do, um, in terms of improvement or a continuous sort of like improvement in terms of trying to make that happen.
[00:15:40] Pradeep Javangula: And often it is gated by How much sort of like volume do you have right? It's like, it would be great if you, we could have sufficient traffic to be able to sort of distribute it across a swath of models. It could be like just an A/B test or it could be like an A/B/C/D test or even multivariate testings on for different sort of like champion challenger models out in production so that would be probably the best way to deal with it but often you are dealing with sort of like one at a time and then you are trying to come back to some corrective or remediation measures so uh so we given that we were involved in the core of their Sort of like development pipeline.
[00:16:22] Pradeep Javangula: So the development life cycle, uh, we have observed what they did, how they've labeled. For instance, if we do supervised learning use case of how well they have labeled things and what the labeling inaccuracies may be, and using very. A very methodical statistical measures about in and out of distribution of the training data set, right?
[00:16:46] Pradeep Javangula: And of looking at a time series evolution of the metrics, um, of, of the different types of model metrics, and even of multiple models to be trained. to be, to be tested on, on, on, on the specific data set and so on. So that's the kind of thing that we do. And given that we also follow sort of the lineage of the data as it gets processed through, uh, right.
[00:17:12] Pradeep Javangula: And, uh, so, and, you know, like I said, that's, that's why it's like, you know, uh, there's a, there's an overlap between what Fiddler does and what, what Raga does from the perspective of sort of like observability in general i would even argue that what we do should be called observability as well but then you know that it is right so
[00:17:33] Joshua Rubin: To my taste i'm happy to I'm happy to call you observability also Okay, my questions are online, I'm keeping notes, and they're branching in a very, many different branches in a non linear way, so I'm trying to, let me see how I want to slice this.
[00:17:51] Joshua Rubin: So, you know, one question that's sort of interesting to me is, you know, when we think about things like basic, you know, uh, classifier metrics, precision recall, very under the curve, that kind of stuff. Um, I'm curious how the large language model is different for you. So when, so, so, you know, often when we're thinking about things like active learning, uh, gosh, yesterday we were, uh, discussing a problem with a hard classification use case, and there's this old paper from, uh, Facebook Research, uh, this Focal Loss paper, where they figured out a bunch of stuff for computer vision models.
[00:18:29] Joshua Rubin: Uh, but basically by asking the model, during its training, to take the cases where it was unsure more seriously than the easy cases. So there's an adjustment to the lock function that causes it to sort of, uh, get really good at things it finds challenging or really focus on things it gets challenged. Um, so I guess maybe one, one question for you is how, how is the, um, generative AI or large language model world, how is that different from a sort of, you know, you may not be training these models, right?
[00:18:59] Joshua Rubin: You might not be fine tuning, um, you know, and, and, and what knobs do you have for, um, do you think about in terms of. You know, ways to improve an underperforming model.
[00:19:12] Pradeep Javangula: Right. So firstly, you know, in the, in the enterprise scenario, and I'll limit my comments to sort of like the enterprise, the non general, general types of use cases, right?
[00:19:22] Pradeep Javangula: So the most predominant of LLM adoptions are really happening In terms of retrieval, augmented generation use cases, right? And, and the LLM is largely being called upon to generate, summarize, to sort of like, uh, uh, to, to, to, to provide an answer that isn't a, a blue link, like in a web search scenario with some sort of like a snippet that surrounds it, but much more about a summarization of a co, of a, of a set of results that are appropriate to the prompt that has been given.
[00:19:56] Pradeep Javangula: So given that, that is the, that's the primary use case that we see, right? Um, in terms of sort of like these foundational models being deployed, you basically start the world with your context, right? Meaning your context DB or the corpus that is relevant to the kind of rag application that you want to develop, right?
[00:20:17] Pradeep Javangula: And you go through a variety of sort of like embedding generations of those documents and put them into a into a vector db and then when and and you know you go through yet another sequence of steps that allow for a user to be somewhat guided through a set of prompts that are relevant to the domain or the context in question right and hopefully you can apply a variety of some personalization techniques as well knowing who the user is or what they're attempting to do, and so on.
[00:20:47] Pradeep Javangula: And then, uh, uh, from, from a response perspective, you want to leverage the results of such retrieval mechanism to be summarized in a cohesive manner, depending on where your, what your style of choice is. Now, so what are the potential problems that could occur? So first is, I would call coherence, right? So, you know, we were just jumping into sort of like the metrics associated with that.
[00:21:12] Pradeep Javangula: What would cause hallucination, right? And so the things that could cause hallucination first This looks like, you know, hallucination needs to be expected, right? And so what you want to do is to try and figure out how much hallucination is likely to have happen, and that requires a reasonable amount of rigor as far as your quality assessment process is concerned.
[00:21:35] Pradeep Javangula: And, you know, to the extent possible, like you pointed out, have kind of hard guardrails about sort of like, I don't want to answer a question about politics or about healthcare if this is a customer service bot or, something to that effect. so, coherence is really sort of the measure of complexity of the generated text, right?
[00:21:57] Pradeep Javangula: Meaning that the hallucinated text may actually exhibit a high level of perplexity due to sort of like inconsistencies or kind of nonsensical sequences. Now, you know, The quote unquote notion of nonsense is really a human judgment thing, right, which is the reason why what you want to do is to surface something like a perplexity metric, and allow for a human to be able to figure out what's going on, right? So now, if in response to a given prompt, the probability of such perplexity arising is really high, then really sort of, you know, given that we don't often have sort of like the innards of, of a model, of an LLM that we are leveraging.
[00:22:44] Pradeep Javangula: We might just want to back off, right, saying it's like, look, you know, I am not trained to answer this question or that, you know, so you do some sort of like a boilerplate response in order to make that happen. Yes, it's not a particularly satisfactory answer, but like, you know, so there are potentially other ways of dealing with it in terms of maybe you meant this, right, in terms of, and since you, since we are dealing with these sort of like large dimensional vectors, we can always think about things that are within its neighborhood according to a, uh, and a metric.
[00:23:13] Pradeep Javangula: So one is sort of like perplexity. That's, you know, among the coherent sub bucket, I would put perplexity. Second is word overlap, right? Meaning comparing the generated text with reference corpus. I mean, I'll often go back to, so the reference corpus being sort of like, quote, unquote, the ground truth or us being able, being able to evaluate up front how closely it matches up with the corpus that is being used is one of the most sort of like foolproof ways of, of, of dealing with it.
[00:23:46] Pradeep Javangula: You know, I've seen less instances of this sort of the, uh, sort of third element inside of, uh, inside of coherence is basically grammar and syntax. So I think all of these large language models seem to be just almost impeccable. In terms of the grammar with the degenerate, it seems like, you know, really hard to detect cases where it is the case.
[00:24:08] Pradeep Javangula: But I did see some cases where it's like, no, I wouldn't write that. That's not sort of like English, you know, and not in that model. So that's sort of like one big bucket of coherence. The second bucket I would say is basically fact checking, right? Which is assuming that everything that you fed into your context DB is the truth.
[00:24:31] Pradeep Javangula: This is, this is, this is, this is, this is your facts and this is the enterprise's fact or the data scientist's facts. Knowing, uh, knowing that and, and, and, and understanding or analyzing that corpus sufficiently or prepping that corpus sufficiently with All sorts of techniques and I've seen people do named entity recognitions or knowledge graph building or other forms of sort of like embedding generations that are true to that specific domain.
[00:24:59] Pradeep Javangula: These are all things that are important and which will guide in terms of saying it's like of the response that was emitted. How do I actually effectively compare it against the context that was provided and, you know, in a pure mathematical geeky sense, you can think of it purely in terms of distance or in terms of, uh, of nearness or, uh, or, uh, or other ways of thinking of it from a classical sort of like NLP perspective, right?
[00:25:29] Pradeep Javangula: So that's sort of like fact checking. And so, you know, semantic coherence, semantic consistency is also another thing that I would point to as an important part, which is that there's, you know, uh, like, like both of us were pointing out, this is basically a stochastic pattern, right? It is attempting to generate larger and larger sequences of words that, that cohere with some level of Not to be confused with semantic discipline or linguistic discipline, but the semantic nature of the content that is being generated can often be weird.
[00:26:04] Pradeep Javangula: It's like, you know, I've seen responses where the first paragraph and second paragraph are completely unrelated to each other. It's like What are you doing? And, uh, and, you know, we're put in this sort of like this awkward place, Josh, right, where there's this thing called the foundational model. We don't know what all went into training and specifically not even mostly familiar with the types of things that go into the weights or the corpus that is associated with it.
[00:26:34] Pradeep Javangula: So we can control what we can control, and therefore the things that we can control are the corpus, And like you said, so like guardrails and other sorts of things that, uh, that make sense. So those are sort of like the, and, and there are a few others like contextual understanding and having some form of human evaluation, which is some level of human evaluation upfront in the development process is the thing that is, that's kind of clearly important, right?
[00:27:01] Pradeep Javangula: So it's like, you know, if you don't spend sufficient amount of time on that front, then. One can end up in a place where it's like, well, we don't know how it's going to behave in the wild, right? And that's when Fiddler is brought in saying, it's like, explain to me what's going on here.
[00:27:17] Joshua Rubin: Yeah, well, this is that, that question gets hard, uh, with the large, large models, of course.
[00:27:21] Joshua Rubin: Uh, and I think all the, uh, Frontier Labs would agree. Um, so I guess I, maybe you've already sort of touched on this, but, you know, uh, you know, one question I would ask about these kind of evaluations are, you know, human feedback, simple statistical models, uh, you know, uh, simple ish. You know, classifier models, do you evaluate with a large language model?
[00:27:49] Joshua Rubin: Is it models all the way down? Uh, you know, how, uh, I'd just be curious to hear about how you think about this, um, you know. Who's responsible for the eval? I know that, you know, we're always encouraging our customers to build human feedback into the loop as much as possible because it's sort of the gold standard, right?
[00:28:09] Joshua Rubin: If you can find out, uh, that human beings were actually disappointed by what happened, um, you know, that's treasure. Um, but not always so easy.
[00:28:20] Pradeep Javangula: So, so, so the question was, how would you
[00:28:22] Joshua Rubin: Oh, the question, the question was, where do you, so, you know, what, what kinds of tools do you use, you know, for, uh, is it really all, the whole spectrum?
[00:28:31] Joshua Rubin: Uh, maybe the answer is just yes. Um, from simple statistics. Some of those things like, uh, that you mentioned, like, um, you know, word overlap, right? Those are, those are nice metrics because you can compute them very quickly, right? I mean, a lot of other things. First, you know, it follows the keep it simple, uh, rule, uh, it's effective, but it's also fairly computationally, uh, um, inexpensive, right? But some of these problems we can address with other big models. Um, you know.
[00:29:04] Pradeep Javangula: Right yeah. So I, you know, uh, I think it's sort of like, you know, tread carefully would be my high level. And, and, and, and the thing is, it is, uh, um, the, the, um, evaluating, Your application, in concert with an LLM, um, is first, size and data sufficiency problem.
[00:29:32] Pradeep Javangula: Just in terms of like, what kind of corpus do you have? Does it have the types of things that can effectively answer the questions that you want, right? And, you know, back in the day, I used to run sort of like enterprise search engines, and this is very similar to that. Part of the use case, except that sort of like, you know, um, in that case, just like web search, we would just return documents with sort of like snippets.
[00:29:55] Pradeep Javangula: And it's a bit of an unsatisfactory thing where there is no state that's maintained, where there is no dialogue involved. There is no aspects of, there are no aspects of personalization that can be employed and so on. So I think we're in a different era of, uh, of, of, of enterprise search and enterprise retrieval.
[00:30:12] Pradeep Javangula: So, uh, I would, I would argue that. You know, that knowing what your data looks like and how well you prep the data for the purposes of the retrieval application that you're providing is, is going to go a long way in terms of eliminating sort of like this strange hallucinatory effects. That's number one.
[00:30:35] Pradeep Javangula: And I think, you know, and you guys do this too, sort of like, or other observability platforms do this just in terms of sort of like bias detection and mitigations on their classical approaches of trying to figure out sort of like, Whether there exists bias, sensitive variables, and PHI and PII data and analysis of that nature.
[00:30:54] Pradeep Javangula: And, you know, but when I first sort of like met the Fiddler founding team back in like 2019, I was thoroughly impressed with the sort of like tremendous focus that you guys had on explainability. I was at Workday at the time and uh, We've done multiple sort of like POCs with your team in terms of trying to arrive at explainability as it relates to compliance with employment law.
[00:31:22] Pradeep Javangula: Many of our lawyers and the security officers were sort of like mystified by like, you know, this, your application, your machine learning application appears to be doing reasonably well. However, So, what did you use, right? You know, what kind of data did you leverage in order to make that happen, and sort of the what if scenarios of if I took this stuff out, or if I suppressed this variable, or this feature, and so on, what would it do to the overall model?
[00:31:50] Pradeep Javangula: So that I think is actually important. Diversity of model evaluation is something that I would, uh, I would definitely take, and to your point, right, it is, Sometimes it's, it's helpful to sort of like simply say, look, if I was a, a human attempting to sort of like answer this question, is there a methodical process that I can actually write down?
[00:32:14] Pradeep Javangula: It might involve some mathematics. It might involve some computations and some statistical evaluations in terms of trying to make that happen. And attempting to. See if your overall AI application, whether it's reasoning or whether it is generating and so on, adheres to some such sort of like simpler, uh, simpler world would make sense.
[00:32:36] Pradeep Javangula: And what you pointed out is actually quite, uh, quite appropriate as well, meaning that just start with sort of like a linear regression. It's okay. You know, so perturb one of these features, right, sufficiently, and try to fit a linear model against it, and then figure out what your overall response, um, responses look like in relation to that sort of like smaller space, right?
[00:33:01] Pradeep Javangula: Um, and, um, yeah, so there's a, there's a bunch of other stuff, I mean, it's like, and, and the beauty of this stuff is like the, the, the level of research that's going on in this domain is. is, is, uh, is really strong and actually trying to keep up with what's going on is itself a hard problem, right? You know, like you pointed out, this, the, the meta paper, the Facebook paper has been around for a while, but there's a whole bunch of other stuff and, uh,
[00:33:25] Joshua Rubin: Oh, absolutely. It's, it's a firehose. I think a firehose is an understatement for the rate at which new stuff is happening.
[00:33:33] Joshua Rubin: Very challenging. Um, you know, just a, just a sort of a side comment that I think is related to, you know, I'm, I'm a bit of a measurement fanatic and we've talked about sort of rigor in, uh, sort of measurement and characterization.
[00:33:49] Joshua Rubin: Um, you know, the other end of the spectrum from that keeping it simple solutions to evaluation is depending on. Models of one kind or another, you know, maybe even having, in some cases, large language models, do some of that hallucination eval, ask the question, is this model response faithful to the context that was provided to it, right?
[00:34:11] Joshua Rubin: Um, these models can reason in very sophisticated ways. Um, and, you know, uh, to, to my taste, Um, if you're going to go to that length to evaluate in those kinds of complex, sophisticated ways, um, you know, then there's also a kind of a question of characterizing how well those tools work at evaluation. Like, there's this sort of meta evaluation level is, you know, if I'm going to bring in some large language model based system for, you know, So, this is a great example of a model that we're working on.
[00:34:39] Joshua Rubin: And, you know, if you're just evaluating my application, then, you know, whoever the provider of that is, whether it's developed in house or some third party application, you know, you might ask to, uh, see how well that model is characterized on some public data, where you can verify, you know, you have, you know, it might not be sufficient for the lawyers, uh, who are worried about, you know, the corner case, where it's going to do something bad, but if you can say something I think that goes a long way to building trust and confidence in a sophisticated set of tools.
[00:35:16] Pradeep Javangula: That's actually what we do. Yes. Uh huh. Our platform is really sort of like a, on the, on the modeling side of it is actually a model evaluator. And you can use sort of like an adversarial approach, right? So you have your model, which is, which can be a black box to, to our sort of like testing platform. And then we can challenge it with another model, To try and determine what the fidelity of the responses look like.
[00:35:42] Pradeep Javangula: And it is, it's a legitimate technique, right? It's like, you know, the, the, in terms of being able to champion challenge this or, or, and the other approach that you pointed out is also something that we use, which is that, you know, because of our testing platform and, and, and our goal is to be embedded in the development cycle, so we really don't follow the classical, sort of like SaaS model, right?
[00:36:06] Pradeep Javangula: In the sense that we. install or allow our applications to be deployed in a native environment for a specific client. Right. So, and given that that's the case, since we have to show stuff about like what we do, we often leverage open source datasets, right. You know, from AR, VR, or from sort of like publicly available sort of like corpus, um, um, medical data or medical imaging data, those are actually harder to come by, but, but like, you know, news feeds or Reddit or other sorts of things in order to show off what our, uh, what our actual platform does.
[00:36:42] Pradeep Javangula: And that's a good, good enough, in my opinion, way to evaluate whether your model evaluator or this meta level thing that you're talking about is going to be useful, uh, uh, to your purpose or not.
[00:36:54] Joshua Rubin: Nice. Yeah, yeah, no, I think we're in a similar boat in how we train and evaluate the, you know, the tools that we use.
[00:37:00] Joshua Rubin: Um, so probably in the next few minutes, we should cut over to questions, but I had a couple more little things that were interesting to me. Um, and so maybe we, uh, we try to address those in the next five minutes, and then we give our audience a chance to type some questions in if they haven't already, and then we'll jump over to the questions.
[00:37:21] Joshua Rubin: Um, so, so one question is, uh, you know, do you think about this hallucination evaluation from the perspective of, like, is it, is it a model developer problem, or do you think about the, the business user in, in, like, the, um, you know, there's all these different stakeholders and organizations, and one question we get often is, like, Who's sort of your end user, right?
[00:37:43] Joshua Rubin: Like, and to what degree are they worried about the same things? Do you think about, um, you know, is that a distinction that, that you think about? Like, uh, certain, certain metrics being of more interest to business users, or like sort of KPI oriented metrics versus, you know, the sort of strict science y metrics that a model developer might be interested in.
[00:38:07] Joshua Rubin: Does that factor into your thinking about
[00:38:10] Pradeep Javangula: No, that makes sense. That actually makes sense. And I think, you know, I don't know if it's necessarily about which metrics are more relevant to business users and so on. Ultimately, everyone cares about the quality of the application and the satisfactory thing that it is delivering, right?
[00:38:24] Pradeep Javangula: So, um, I would approach the sort of product management or the business team in terms of saying, it's like, how do you characterize success? for this application. You know, start at that highest level. And of course, they have other metrics in terms of, well, potentially we're, uh, we're making our overall organization that much more productive, or that we're enabling a certain level of transparency, right?
[00:38:49] Pradeep Javangula: Uh, and so people measure, for instance, these RAG oriented applications in terms of the diversity of questions that are being answered effectively. And, you know, to, uh, to some extent, sort of the productivity that this thing is actually enabling in terms of emitting an answer, right? Um, uh, uh, you know, uh, uh, I'm, I'm speaking from experience in terms of sort of what we are encountering in many of these application scenarios.
[00:39:14] Pradeep Javangula: So the product managers or the business executives are largely interested in, uh, The, the, the, the fidelity of the application and the usage of the application and the kind of satisfaction that is being derived from it, right?
[00:39:33] Pradeep Javangula: We, which are one set of metrics, right? You know? Mm-Hmm. of like, you know, even if you do sort of an inline survey about did this really answer your question or then you can also have a slightly deeper measure in terms of, in order to answer this question.
[00:39:46] Pradeep Javangula: I needed to retrieve a response from like three distinct datasets that were not always interlinked to each other, but you somehow correlated all of those things and emitted a coherent response. And that's, those are the types of things that, uh, that people are using to evaluate. On the data scientist side, and given that I'm a math geek like you.
[00:40:07] Pradeep Javangula: So we probably end up being much, much more rigorous in terms of actual quantitative things that are that, that are involved. I mean, whether it be distance metrics, based distance, or sort of like, you know, the levels of clustering that's involved are the types of models and algorithms that are being leveraged.
[00:40:27] Pradeep Javangula: You know how. Comprehensive were the number of models that you used to evaluate and then back on to sort of the things that you guys do in terms of sort of, you know, shapley values or integrated gradients or other forms of ways of explainability. Those are the types of things that I have seen data scientists and machine learning engineers obsess over more, which is, which is fine. I think, I think that's what you should do.
[00:40:51] Joshua Rubin: I would, for me, for me, I would say that kind of most commonly used, I mean, it really, In the, in this LLM world, more than the explainability metrics, since that's such a, a, a challenging problem right now. Um, one of the most useful things is, you know, uh, semantic similarity.
[00:41:08] Joshua Rubin: It's, you know, this, it's the same, the same thing that sort of undergirds this, the vector search, right? It's converting prompts and responses into, into a mathematical representation that, where, where proximity represents semantic similarity. Um, because that ends up helping to localize, um, common failure modes.
[00:41:31] Joshua Rubin: If you can identify where a bunch of problems are happening and you realize there's some semantic commonality, like this is, uh, you know, questions that have to do with this particular topic. often have a hallucination problem, or often our users thumbs down the response, right? That's an incredibly powerful diagnostic tool.
[00:41:51] Joshua Rubin: Um, so, um, okay, so I'm going to maybe jump to a question here, and I think you covered maybe some of this, but the first question is, there are many types of hallucination scores for evaluation and monitoring. How should each be used?
[00:42:06] Pradeep Javangula: Um, let's see, um, did, uh, the valuation metrics ought to be used, um, first in terms of, like I said, detecting, um, detecting things such as perplexity, you know, uh, determining, uh, a sense of sort of, like, how contextually relevant they are, how semantically relevant they are, um, and what level of semantic consistency is being achieved, right?
[00:42:37] Pradeep Javangula: So, uh, you know, we can characterize each LLM application with those metrics, uh, in tow, then, um, you can be reasonably confident about how well your thing is going to perform out in production. So I know it's a very general purpose 30, 000 foot level answer, but that's, uh, uh, That's what we are.
[00:43:02] Pradeep Javangula: Yeah, that's interesting. Um, let's see, what do we have also? What have you seen, by the way? I'm curious to hear yours on that front. Just in terms of, like, how have you seen sort of, like, these metrics being leveraged or utilized?
[00:43:19] Joshua Rubin: Yeah, um, I mean, people want hallucination detection. We hear that a lot. This is, again, this is not going to be a surprise given the, um, the title of this and the, the, uh, what do you call it?
[00:43:33] Joshua Rubin: The survey from the beginning, right? Yeah. Um, you know, we talk to a lot of, a lot of potential customers and our existing customers who, you know, really want something sort of ironclad with high classification, uh, capability, um, you know, and oftentimes we turn to things like, you know, in the literature now, people talk about, you know, Um, uh, you know, response faithfulness, you know, it's the, does this answer faithfully reflect the information that was in the context provided to the model, right?
[00:44:06] Joshua Rubin: Um, and that breaks down even further into, into questions like, um, you know, is there a conflict with the information inside the, uh,
[00:44:18] Pradeep Javangula: That's right.
[00:44:19] Joshua Rubin: The context, or is it, or is it just baseless, baseless made up additional stuff, right? So baselessness versus conflict is a, is a sort of a nuanced metric. And then there's also relevance.
[00:44:29] Joshua Rubin: And I think this, this touches on a lot of the things that you were, you know, mentioning. And some of these, you know, this is, uh, you know, you can use things like, um, uh, you know, traditional statistical metrics. There are things like blue and rouge that were developed like, uh, 20 years ago that are, that are sort of these.
[00:44:46] Joshua Rubin: You know, and, and these, so, so the question of, like, what, this response may be factual, it may be grounded, but didn't answer the user question. Um, you know, the other one that I think is hot is, um, LLM aside, did my retrieval mechanism pull up the information that was relevant? So, context relevance, right? Um, so like, you know, sometimes, you know, this, I think in some ways this gets into sort of production monitoring, right?
[00:45:17] Joshua Rubin: If there's a topic shift in the world, if all of a sudden one of your customers starts asking, you know, uh, they see something in the news and they start asking your chatbot a question that is, not present in your vector database.
[00:45:30] Pradeep Javangula: Yeah, that's right. Yes.
[00:45:30] Joshua Rubin: Um, suddenly you might on one random day start missing a lot of, uh, you know, opportunities to give a good answer just because you didn't know that the users were, you know, you know, Asking about the Taylor Swift concert in your town, or, you know, whatever, make up your, uh, your story, right?
[00:45:48] Joshua Rubin: Uh, so, you know, being able to quickly identify whether there's a problem in your, uh, in your content, either in your database or in the way that your database is configured to retrieve documents, like the chunking or the embedding model. Um, so, I don't know, this gets a little bit, a little bit nuanced, but, but I think all of those things are important.
[00:46:09] Joshua Rubin: All, you know, instrumenting all of the different pieces of this. LLM application. It's not just about the generative model.
[00:46:18] Pradeep Javangula: That's a really good point. You know, we used to run into this problem all the time, even back in sort of like the early 2000s and so on, where, first of all, this notion of sort of like, how well, how contextually relevant are you?
[00:46:33] Pradeep Javangula: Was a problem that that existed even before, and evaluating the quality of your application in the context of a changing universe, the types of questions are being asked, or the corpus itself morphing itself pretty dramatically. I think Google's done a a phenomenal job of this stuff. over over the years and so like the the the suite of techniques that they leverage from everything from knowledge graphs to sort of like page rank to uh old suite of select personalization approaches are all quite relevant.
[00:47:05] Joshua Rubin: Um, you know, what about malicious, what about identifying malicious intent from users? Is that something that you guys think about at all?
[00:47:11] Pradeep Javangula: That is something that we think about. Um, and, um, yeah, so that's what we're doing. We, we are really using sort of like an adversarial approach in order to make that happen. Um, and, and, and trying to determine sort of the, it's almost like an intent classification in the prompt. Right. Um, and to the extent possible, we, we attempt to detect it and point it and, and, and, and actually take an aggressive approach to it, right?
[00:47:44] Pradeep Javangula: It is that, you know, just back off of sort of responding or divert the, uh, divert the user to change topics to something else. That kind of stuff is, uh, is what we have been advising. We've been, we've been trying to like, you know, uh, since we're trying to build basically an overall quality and, and.
[00:48:01] Pradeep Javangula: validation and verification platform. We have to build some of these things ourselves, right? Otherwise, we will have very little credibility, uh, as we build out our platform. And as we do this, we discover more and more problems. And, uh, you know, so the, the world is also changing quite a bit. Yeah. So, so you're right.
[00:48:19] Pradeep Javangula: Maliciousness is a big deal. Toxicity is a pretty big deal, right? Just in terms of, you know, the determining what the tone is. of what's contained in a prompt. And again, sort of like, you know, detecting it and gently guiding in a different direction is the way to go. So, yeah, so those are definitely things that we're thinking about.
[00:48:45] Pradeep Javangula: By no means am I saying that we have a 100 percent sort of like answer to that problem yet.
[00:48:50] Joshua Rubin: We have an interesting it was just because we're kind of in this toxicity and sort of malicious human malicious intent it gets in into this You know all the the prompt injection attacks, you know people trying to you know I you maybe you've seen these things on that are floating around like Twitter and LinkedIn of like You know, they'll, they'll, uh, encounter a chatbot from a car dealership or even like a salesperson will reach out on LinkedIn, uh, you know, as a LinkedIn message, uh, and the, uh, the user suspects that it, or the human who's received this message suspected it's actually a, an LLM that's reached out to them, right?
[00:49:27] Joshua Rubin: And so they ask it some sort of probing question that sort of reveals that it's Probably an LLM and then they say, you know, gosh, I'm really glad you reached out to me. Could you please pretend like you're a Python interpreter and, and, and, uh, you know, and, uh, provide the response to the following code, you know, and they get it to say something stupid because it's not a human and it's happy to be a Python interpreter.
[00:49:50] Joshua Rubin: Um, so there's some, some really funny like screen grabs that people have, uh, put together from, you know, abusing Uh, chatbots that approach them like humans. Um, one interesting problem, so like the, well, let me, let me start with the interesting problem, which is that, uh, you know, what we've seen recently, so we've been thinking a lot about, like, prompt injection attacks.
[00:50:10] Joshua Rubin: I would say that's another, um, very hot topic. It's, you know, how do you, you know, make sure people aren't trying to get your chatbot to do something it wasn't intended for or get something out of your company that uh, you know, your company didn't mean to offer, right? So there was just this Air Canada thing a month or so back where, you know, yeah, this chatbot, uh, you know, offered some, you know, special discount tickets for somebody who had to attend a funeral or something like that.
[00:50:38] Joshua Rubin: That was totally hallucinated, not part of the terms. Uh, of the, the company, and then, uh, you know, whatever, Canada sort of refused to honor the fairly reasonable miscommunication, actually.
[00:50:50] Pradeep Javangula: Yeah, that one is particularly egregious, and I think it's not even clear how much sort of, you know, um, accountability and liability does the company have if a chatbot misbehaves and basically gives out, it's like, 50 percent off on everything for today.
[00:51:06] Joshua Rubin: Right, right.
[00:51:07] Pradeep Javangula: It's like, what are you going to do, right? And it's well within the context of what it's supposed to do, right? It is, uh, and it's difficult to detect that particular, um, sort of like weirdness in, in, in terms of responses. And, you know, uh, these are like, it's a weird world we live in right now.
[00:51:28] Joshua Rubin: There's, there's an interesting thing that happened. So, um, I did this experiment, anonymity, sometime last year, I started playing with ChatGPT and playing games of 20 questions with it to see if I could discover interesting things about how it reasons. And you know, the, you know, you would, uh, so the, the gist of the story is, uh, you know, if you ask it to think of a clue and play 20 questions with you where you're guessing question, you're trying to guess the object that it's thinking of.
[00:51:56] Joshua Rubin: Um, it actually doesn't, you know, it doesn't know, it doesn't have anything in mind, right? It's just playing the game and trying to produce a satisfactory answer. If you, uh, you know, you can rerun the same session with the same, uh, random seed and you can interrupt at an earlier level and it'll give you a different, different clue, right?
[00:52:12] Pradeep Javangula: Completely different answer.
[00:52:14] Joshua Rubin: And it really gets to this. So, you know, one of the ways that these models are fine tuned, like, maybe fine tuning isn't right, but, you know, the, the, the human reinforcement learning through human feedback mechanism that they use in order to make these really great at answering questions in a way that humans like, there's this theory that it also, there's this, that the model is doing this thing they call reward hacking, Basically, it's finding ways to make humans happier with its answers.
[00:52:43] Joshua Rubin: And there's a benefit to that, which it gets to be good at answering human, you know, questions in satisfying ways. But a side effect, uh, people think is that, you know, this sort of confident but wrong answers, uh, get enhanced, right? Because humans are often compelled by Confidence over factuality. The human doesn't know the difference, right?
[00:53:07] Pradeep Javangula: Yep.
[00:53:08] Joshua Rubin: And, and so, you know, and it leads to these interesting things like, um, you know, uh, a sort of bias to say yes versus say no, right? Like, so, so it's, it actually turns out that you can at home do these experiments yourself. It's, it's, it's fairly easy to get, to basically lead, um, an LLM to a kind of answer that you want or a kind of behavior you want by, uh, you know, offering it.
[00:53:34] Joshua Rubin: Yes versus no options, assuming it'll mostly take the yes path. But you can actually guide them quite a bit, uh, with, you know, fairly simple tricks where you're sort of exploiting the fact that they're trying to make humans happy um.
[00:53:48] Pradeep Javangula: I have been thoroughly unimpressed with any sort of like, you know, plannings that these generative models do, right?
[00:53:54] Pradeep Javangula: You know, and I've been reading a bunch of papers. It's a huge area of interest for me, right? And not just for Raga, it's been so for, for a while. I've always thought that sort of these, uh, the neural nets and the deep learning approaches Starting with so like ImageNets have a certain sort of like upper bound in terms of like how far they can go beyond a certain level.
[00:54:17] Pradeep Javangula: So I'm more of a sort of like a bit of a healthy skeptic of the approach itself essentially yielding all sorts of machine reasoning in some form or shape. I'm probably more of a fan of Gary Marcus. Right, uh, who, uh, thinks of sort of like, you know, knowledge crafts and knowledge representation and grounded in fact being combined together with statistical approaches.
[00:54:42] Pradeep Javangula: I'm not suggesting that the, the, the, the, the large language models are not doing sort of like just amazing things, just blow you away sometimes, isn't it? So, yeah. What it seems to be doing from a planning perspective about sort of like planning a trip or sort of, you know, doing some of these agentic stuff that they're, that they're attempting to do, but, but, but still sort of like in, in specific domains, whether it be sort of like an industrial domain or in a healthcare domain or much more of a regulated domain, uh, things are much more haywire.
[00:55:13] Joshua Rubin: Yeah I think the thing that I, I feel a lot of those opinions myself. Um, you know, I think a thing to be, uh, mindful of is that, uh, things that, that behave like humans, humans are sort of hardwired to perceive as humans, right? Like, and so, um, you know, in a way we're sort of hacking ourselves, right? There's a vulnerability when we make human like things, um, especially when we know that They do have these fairly significant limitations, um, so, um, I think we're kind of rolling up on the hour.
[00:55:48] Joshua Rubin: I don't know if, uh, you had any more, sort of, parting thoughts about, sort of, um, you know.
[00:55:54] Pradeep Javangula: Yeah, so I think, you know, uh, yeah, so I mean, my parting thought, first of all, is like, you know, this is a very great conversation, pleasant conversation, and it's, it's been, it's been fun to hang out and sort of like pontificate for the most part about sort of like what's going on with this sort of like LLM world.
[00:56:10] Pradeep Javangula: I think the, the, the, this particular evolution is going really fast and it is, uh, it's the pressure that is coming on data scientists and machine learning engineers from all quarters. Thank you very much. To be able to deploy these things, just at least to show off that we are generating GenAI is something that I think one should all have a healthy level of skepticism for, just as with every sort of software application or tool and so on.
[00:56:44] Pradeep Javangula: So this is, be careful with respect to sort of like how you go about doing this, what the right infrastructure choices are that you ought to deploy, you know. Careful selection of the context corpus that is, that is going to be needed and determining what things are important. Like from some of the questions that I'm seeing, it's like, is toxicity important?
[00:57:06] Pradeep Javangula: Yeah, I think it's toxicity is super important, especially if it's consumer facing. It can have a very adverse impact on your brand. and you don't want that to happen, right, and similarly sort of like as much as possible pay attention to sort of attributability or basically citations, right, when it is generating responses, you know, expose as much of sort of like how you have built the application without Sort of like intimidating the user, which are like, here's how I generated my response.
[00:57:38] Pradeep Javangula: Here's why I think this is truthful and so on, will go a long way toward establishing credibilities So, um, you know, and, and, and be willing to say, it's like, look, this is a chatbot. It's, it's, it's basic core nature is to be this stochastic parrot that will generate stuff. Therefore, there will be things that are, that are, that are, uh, that are not truthful, or not factual, or not, sort of like, are incoherent, and in which case, it's like, take the, sort of, you know, the, the, the, um, the caveat should actually be that, hey, I can make errors.
[00:58:15] Pradeep Javangula: It's a perfectly fine thing to do and, uh, you know, fall back on a, on a human to be able to answer the question if it is sort of like truly mission critical. So, and so that's my general refrain.
[00:58:27] Joshua Rubin: That's wonderful so thanks, thanks a lot, Pradeep. This was a super easy hour. I really appreciate it. Bye.
[00:58:32] Joshua Rubin: Thank you for the opportunity. Bye everyone. See ya.
[00:58:35] Joshua Rubin: