Season 1 | Episode 11

Inference, Guardrails, and Observability for LLMs with Jonathan Cohen

‍

In this episode of AI Explained, we are joined by Jonathan Cohen, VP of Applied Research at NVIDIA.

We will explore the intricacies of NVIDIA's NeMo platform and its components like NeMo Guardrails and NIMS. Jonathan explains how these tools help in deploying and managing AI models with a focus on observability, security, and efficiency. They also explore topics such as the evolving role of AI agents, the importance of guardrails in maintaining responsible AI, and real-world examples of successful AI deployments in enterprises like Amdocs. Listeners will gain insights into NVIDIA's AI strategy and the practical aspects of deploying large language models in various industries.

About the guest

Jonathan Cohen is a VP of Applied Research at NVIDIA, where he is the engineering leader of the Nemo platform. He focuses on incubating new AI technology into products, including NIMs, BioNemo, LLM alignment, speech language models, Nemo guardrails, foundation models for human biology, and genomics analysis. Jonathan spent a total of 14 years at NVIDIA in a variety of engineering and research roles, with a three-year stint at Apple as Director of Engineering in the middle. Earlier in his career, he specialized in computer graphics in the visual effect industry, winning a Scientific and Technical Academy Award in 2007.

Transcript

[00:00:00]

[00:00:06] Krishna Gade: Welcome, and thank you, everybody, for joining us today on the AI Explained on Inference, Guardrails, and Observability of Generative AI. I am Krishna Gade, Founder and CEO of Fiddler AI, and I'll be your host today.

[00:00:21] Krishna Gade: Without further ado, we have a very special guest on today's AI Explained. That is Jonathan Cohen, VP of Applied Research at NVIDIA. Welcome, Jonathan.

[00:00:31] Jonathan Cohen: Hi, good morning.

[00:00:33] Krishna Gade: Good morning. Thank you so much for having, uh, being on this show, Jonathan.

[00:00:37] Jonathan Cohen: Yeah, my pleasure.

[00:00:39] Krishna Gade: Awesome. So Jonathan, uh, is an engineering leader of the NeMo platform. Uh, he has incubated AI technology into products, including NIMS, which is NVIDIA Inference Microservice, and he's worked on a lot of foundation models for human biology and all together spent about 14 years in NVIDIA and various different roles.

[00:01:00] Krishna Gade: And prior to that at Apple as well.

[00:01:03] Krishna Gade: So Jonathan, maybe for our viewers, Uh, NVIDIA, for many people, comes across as a GPU and hardware company. What is NIM and NeMo? These seem like software tools. Could you, could you share, uh, what these are and how they fit into the NVIDIA AI strategy?

[00:01:20] Jonathan Cohen: Yeah, thanks, uh, that's a great question.

[00:01:22] Jonathan Cohen: So, NVIDIA has always been a hardware and software company. You know, we've, we call ourselves an accelerated computing platform company, and computing platform means not just a chip, not just a computer, but a platform, everything you need to do your computing. Um, so we've always included APIs and tools and acceleration, and in fact, the reason why our platform is so successful, has been so successful, um, and delivers so much value is precisely because optimizations and improvements you make at any level of the stack, they all multiply together.

[00:01:58] Jonathan Cohen: You know, if the hardware is twice as efficient, and then the software algorithm I use is twice as efficient, Then, cumulatively, I had a 4x improvement, right? And that's how we always think about it, these full stack improvements. So NeMo is just, you know, the latest in a long line of NVIDIA software platforms, um, kind of following this strategy.

[00:02:16] Jonathan Cohen: Um, NeMo is our platform for accelerating the creation and operation of modern AI systems. You could call them, you know, AI agents, or generative AI, or Large language models, you know, they're not, these aren't all synonyms exactly, but obviously they're all kind of in the same bucket. And I think of them as modern AI systems.

[00:02:40] Jonathan Cohen: Modern AI systems, uh, you know, touches a lot of things. So there's training large language models or foundation models. There's Um, customizing existing models, so there's many great community models out there, for example, the Llama models from Meta. Um, you, you might want to start with one of those models and then somehow customize it and fine-tune it based on your own data.

[00:03:03] Jonathan Cohen: There's deploying these models, there's managing the lifecycle of the deployment. I, I have a model, I need inferencing, I need to observe it in action, I need to log what it's doing, I need to take those logs and somehow, you know, um, maybe evaluate good and bad examples, maybe have a human correct it when it made a mistake, turn this back into training data, retrain, redeploy, you know, there's a complete, what we call, flywheel, typically, around any deployed AI system.

[00:03:30] Jonathan Cohen: Um, and NeMo is the software that manages all of that. And so NeMo covers both some open source Python components, That are basically PyTorch based, uh, tools for training, um, fine-tuning evaluation, and then also a, a microservices platform, um, which is, uh, at, at this point the, the only component of the microservices platform that's in general availability is nims, which, which stands for NVIDIA Inference Microservice.

[00:03:59] Jonathan Cohen: Um, and NIM is a very simple idea. The, the idea of NIM is just take a model, whatever model, um, package it up behind an uh, an API server. So, in the case of, uh, large language models, something that's, let's say, OpenAI, um, completion, endpoint, you know, compatible API, or Llama stack inference compatible, or whatever it is.

[00:04:23] Jonathan Cohen: So there's a little, there's a little API service that makes it easy to talk to this model. Um, behind that service, you have highly optimized, uh, inference. So, because it's NVIDIA, you know, we're going to always give you the best performance, best throughput, best latency, um, And so we'll use technology like TensorRT or whatever the best technology is, um, add all the uh, enterprise grade features like security, you know, up to date with the latest CVEs and security patches, um, connections into logging and observability platforms.

[00:04:58] Jonathan Cohen: And that whole thing then is a container. So we don't, this is not a managed service. NVIDIA is not operating this as a service. We take all this code, we package it up in a container, and you can get that container and run it yourself. And so if you want to have your own, you want to manage your own LLM inferencing endpoint, it's as simple as grab the container, docker pull, docker run.

[00:05:20] Jonathan Cohen: That's it, and you're up and running in, you know, a few minutes. Um, and because it's portable, you can host it in any cloud, you can host it on prem, uh, you have control, uh, you can, you can host it, you know, geographically, or in terms of network topology, you know, as close to whatever you need to host it as, as you want.

[00:05:40] Jonathan Cohen: You can host it in a VPC, you can host it in your own private infrastructure that's air gapped, whatever you want, it's just a container. Um, and that's, I think that's a very powerful idea, and that's something that our customers really appreciate, because if you think about, you know, what are we, what are these models actually being used for?

[00:05:58] Jonathan Cohen: Increasingly, they're, you know, to quote Jensen, our CEO, these are digital workers. They're, they're these agents that are, that are operating alongside humans at companies, um, doing things that humans do. You know, would be doing or could be doing or, or helping humans do things, which means they have access to all the very sensitive data that our employees have access to.

[00:06:19] Jonathan Cohen: Um, you know, if, if it's in a healthcare setting, it might have access to, or you'd want it to have access to medical data, or you'd want it to have access to proprietary company data, whatever it is. Um, and oftentimes for this most sensitive data, controlling where that data goes and where it gets sent and, and who might be looking at it and who's storing it is, is really important to customers.

[00:06:40] Jonathan Cohen: And so being able to have total control over how you deploy these things, where you deploy them, what you do with the data that gets sent, what you do with the logs, or maybe, maybe you don't log, or whatever it is, or you know, what, where you run your observability, all these things is, is I think very compelling to our customers.

[00:06:55] Jonathan Cohen: And so that's the idea behind NIM. Um.

[00:06:59] Krishna Gade: Absolutely. This is great. Uh, so you touched upon a bunch of things, right? One of the things that I heard, uh, Jensen and some of, and even you talk about is this NIM as model in a briefcase. Essentially, it's like a large language model packaged into this container, uh, and where you can go and run and deploy in your favorite cloud environment.

[00:07:19] Krishna Gade: So, how does it work? For example, You know, let's say, you know, if I'm a customer running, you know, my workloads are in Amazon or Google Cloud, you know, how, how do I, you know, what are, what are the advantages of using NIM, you know, how do I use NIMs for, for my, you know, large language model inferencing?

[00:07:38] Jonathan Cohen: Yeah, so it's all NIM needs.

[00:07:41] Jonathan Cohen: This is, uh, Kubernetes environment. So anywhere where you can, or, I mean, and you can probably make it work even in a non Kubernetes environment as long as it's some kind of container orchestration, but let's just say for simplicity Kubernetes. Um, so anywhere where you can orchestrate and launch containers and, and a NIM container.

[00:08:00] Jonathan Cohen: That's the, that's what makes it so simple. When people say, Oh, it's a model in a briefcase. I mean, what they mean is it's not our model that we're holding for you. Okay. We give it to you, you know, digitally, and you're free to move it to wherever you like. So, any cloud. Now, you know, you need a GPU, you need some accelerated computing hardware to run a NIM.

[00:08:22] Jonathan Cohen: Um, the software stack that's in a NIM today is GPU accelerated. Um, but as long as you have a Kubernetes, uh, environment that can orchestrate containers that has access to GPUs, You can run NIM. Not all NIMs will fit on all GPUs. They have memory requirements, you know, if you want to run, uh, Llama 405B, that's a very large model, uh, it's not going to fit on, you know, a small GPU in, in some cloud somewhere.

[00:08:51] Jonathan Cohen: So, so there's requirements like that, but other than the physical requirements around memory, You can do whatever you'd like with a NIM.

[00:08:59] Krishna Gade: And so does, do you offer like pre packaged NIMs with like, say, Llama or Mistral? Exactly. Where you can download the containers and, you know, run it on your, uh, Kubernetes?

[00:09:08] Jonathan Cohen: That's exactly right. So we have a catalog if you go to build.nvidia.com, you can see a complete catalog of all of the models that we have NIMified. Um, and that process, uh, we have something internally we call the NIMFactory. Um, and what NIMFactory does is we take these models when they, when they get launched or, you know, in some cases we know about them shortly before they're launched, and we put them through this process where we, um, we pre build, uh, optimized TensorRT engines.

[00:09:37] Jonathan Cohen: Uh, we measure them, we, we, so we do all the hardware level optimization, um, we make sure they work, we certify them, uh, across different hardware, uh, SKUs that, that we have. We package it and then we put it on this build. nvidia. com and so you can try it and you can interact with NIMS. NIMS are not just large language models, we have NIMS for, you know, the concept applies to any model, um, that you may want to inference.

[00:10:02] Krishna Gade: Even computer vision models and things like that.

[00:10:04] Jonathan Cohen: Absolutely. Computer vision, um, image generation models. We have speech recognition models as nims. We have biology models as nims, protein language models, molecular docking models. All kinds of things are, are as nims and, and for many of those nims you can download them, them, uh, download them yourself.

[00:10:21] Jonathan Cohen: So on build.nvidia.com, if it says run anywhere. That means that it's a model that's been, uh, you know, packaged and we've tested it well enough that we, uh, that you can actually download it and, you know, we guarantee some level of quality, uh, that it, that it should run with low latency and, you know, high throughput and, and accuracy.

[00:10:42] Krishna Gade: Awesome. So now, if, okay, so I have, like, gotten a hold of a NIM server, maybe like Llama server, so I've deployed a bunch of containers.

[00:10:51] Krishna Gade: Now, now I want to, as an enterprise, I want to make sure that I'm taking care of all the security issues and the things that you touched upon, you know, I have, you know, maybe my healthcare records and how do I, as an enterprise company, you know, solve some of these security challenges when deploying these LLMs, right?

[00:11:08] Krishna Gade: You know, how can guardrails be customized to address these and where would some of the NeMo framework help me?

[00:11:16] Jonathan Cohen: Yeah, so, um, you know, the answer depends a lot on how you're deploying it. So, not everything that you deploy, not every large language model, let's say, that you deploy is actually going to be used as a chatbot.

[00:11:29] Jonathan Cohen: Some of them may be very internal endpoints that are doing some very specific task. Um, and so the security, you know, the attack surface, or the, um, you know, anomaly detection, these kinds of things, they may be harder or easier depending on your actual use case and deployment scenario. But let's just pick an example.

[00:11:48] Jonathan Cohen: Let's say I have a, you know, the simplest, most obvious example, a kind of customer service, you know, customer facing customer service chatbot that I've built. Uh, okay, now I have a really, um, I have this probabilistic AI system that I'm exposing to the public. Uh, I think probably everybody has seen by now news stories, you know, that sort of trickle out about companies that have deployed these and, you know, the customer talks the chatbot into offering it a discount or, or something like that, right?

[00:12:17] Jonathan Cohen: And now suddenly I have, I have this problem where this digital representative, you know, digital representative of my company has done something against my company policy. And, and, and so this is a good example of the, you know, the challenges and the risks of deploying AI models. Um, you know, I always think about it, just the analogy with a human, so if I had a human customer service worker, I give them a book, I tell them, you know, what are the, uh, what are our policies, what's our refund policy, you know, how do you handle a rude customer, how do you handle an irate customer, right?

[00:12:51] Jonathan Cohen: We train our humans, we train people, we do all these things. Um, And sometimes people get it wrong, and it's, you know, quality assurance, and you, you know, this is why when you, whenever you call a line that says, you know, this call may be monitored for quality assurance purposes, that's what they're talking about.

[00:13:06] Jonathan Cohen: They do record these lines, and supervisors listen in, and they review calls, and they're constantly training their staff, right? Um, why would you expect a digital worker to be any different, right? So when I'm interacting with a digital worker, I certainly want to be able to monitor what it's saying, monitor what the people are saying.

[00:13:24] Jonathan Cohen: Um. That's on the monitoring side, but I also want to somehow control and constrain what this AI is doing. You know, for example, I say, you are a customer service, uh, chatbot that is only to talk about product quality issues with our products. Don't talk about politics. Don't offer your opinion about what's the best wine or, you know, whatever it may be.

[00:13:47] Jonathan Cohen: Um, especially, you know, when you're, when this chatbot is based on a modern large language model, these, these models are incredibly sophisticated, able to carry on conversations about all sorts of things, perform many tasks, but that's not what you want, right? You typically want to highly, highly constrain, um, the domain, what you might call the ODD, the operational design domain in which this chatbot is actually going to operate.

[00:14:11] Jonathan Cohen: Um, so we have a product, I mean, there's, there's many solutions out there. From NVIDIA, we've developed this technology we call NeMo Guardrails, which is specifically designed to do this. Um, and NeMo, the way NeMo Guardrails works is, um, it's actually built on top of a very sophisticated technology. It kind of hides a lot of the underlying sophistication, but there's a very sophisticated technology called Colang, um, not, not to be confused with Golang, so C -C O L A N G.

[00:14:38] Jonathan Cohen: And Colang is a dialogue modeling language. And Colang is incredibly powerful. Um, but essentially in Colang you can describe, using this, this um, specific modeling language, the structure of a dialogue, of a conversation. You can talk about, you know, the topics that, that you would want to cover, the tone, and, and you can write triggers.

[00:14:57] Jonathan Cohen: So, for example, you can say, you know, if If the, um, the customer says something in an angry tone, then trigger some rule, like, you know, make sure the bot responds in a, in a very conciliatory tone. If the customer says something rude, and let's say the customer says something rude, um, you know, five times in a row without making any substantive actual, uh, uh, request, then maybe your bot would go into some predefined, you know, hey, I can tell that you're really upset.

[00:15:26] Jonathan Cohen: Why don't I disconnect today and, you know, we can reconnect in the future, right? So whatever rule you may have, um, you can essentially, in Colang, you can, you can describe these very, very complicated and potentially sophisticated rules. We also have a lot of, um, pre built rules. So you have to know all this, right?

[00:15:43] Jonathan Cohen: And, and we're, one of the things that we're working towards, um, before we, we, uh, release it for general availability is, Um, sorry, the open source toolkit is available today, but we're working on a microservice version of this, and what we're working on there is really to have a bunch of these pre built best practices to make it easy to deploy.

[00:16:03] Jonathan Cohen: But guardrails also can, for example, look for prompt injection attacks, or different security vulnerabilities, or Um, um, you know, jailbreak attempts and all these things that are also kind of security issues, you know, it's this, it's this very interesting world where security and dialogue management are kind of overlapping for the first time ever.

[00:16:24] Jonathan Cohen: Uh, you know, one way I like to think about it is, in the past, when you think about computer security, you think about an API. I mean, really, security is all about APIs. I have an API. Have some, you know, way of communicating with the computing system.

[00:16:37] Krishna Gade: So it's much more structured, but now anything can be done through your...

[00:16:41] Jonathan Cohen: and security flaws, security holes, are some way of accessing that API in a way the developer didn't intend, that can get the system to do something the developer didn't intend it to be able to do through that API.

[00:16:54] Jonathan Cohen: You know, leaking information or, or granting permissions or whatever it is. This is kind of what computer security is really built around, this notion of APIs and, and uh, access. And now we're doing this, we're putting this conversational engine in front of an API. So I can talk to a chatbot that has a human like conversation with me, and then the chatbot's translating what I'm saying into calls to some more structured API behind the scenes.

[00:17:17] Jonathan Cohen: But now my conversation itself is the attack surface.

[00:17:22] Krishna Gade: Correct

[00:17:22] Jonathan Cohen: and an attack, um, might look like trying to convince a chatbot to do something it's not supposed to do.

[00:17:32] Krishna Gade: There are these do attacks and do anything attacks, DAN attacks.

[00:17:36] Jonathan Cohen: Yeah, exactly, exactly. You know, they don't look like computer security attacks of the past, right?

[00:17:42] Jonathan Cohen: And so, so there's a very interesting overlap between computer security and dialogue modeling. And NeMo Guardrails is really trying to embrace that and, and so that's why the core technology is actually a dialogue modeling system. And then we layer on top of that a number of computer security concepts and techniques.

[00:18:01] Krishna Gade: So, so on that, right, so, so the Guardrails seems like a rules engine framework, you know, it's almost like a

[00:18:06] Jonathan Cohen: It's a fuzzy rules engine, yeah.

[00:18:08] Krishna Gade: Fuzzy rules

[00:18:09] Jonathan Cohen: A dialogue based rules engine.

[00:18:11] Krishna Gade: So now, if I'm saying that, okay, if the user is abusing or if there is a jailbreaking attack, how is that detection capability happening?

[00:18:18] Krishna Gade: Are you actually using language models behind the scenes or other AI techniques to detect, you know, if this is a jailbreaking attack or something?

[00:18:26] Jonathan Cohen: So, Guardrails is actually relatively agnostic about, the software is relatively agnostic about this. And it relies on um, access to existing models. So, for example, you could have a jailbreak detection model, but, you know, given an utterance, is this likely to be a jailbreak or not?

[00:18:40] Jonathan Cohen: Um, we have topic models that, um, you know, look at, again, a conversation and decide what, you know, given a bunch of options, which topic most quickly matches or, or, um, uh, it also can sort of generally use large language models to, to do these sort of assessments, like, what is the tone of this conversation or whatever it is.

[00:19:00] Jonathan Cohen: Um, but the, the power of guardrails, in fact, is whatever. Whatever model you have that you want to use as part of your, um, guardrailing system, Nemo Guardrails can call. It can, it's very easy for Nemo Guardrails to become an orchestrator of existing techniques you have. So I have a jailbreak technique that I've jailbreak detection technique that I've developed or...

[00:19:22] Krishna Gade: so you can bring your own jailbreak detection model

[00:19:24] Jonathan Cohen: whatever

[00:19:25] Krishna Gade: other API said you could integrate

[00:19:27] Jonathan Cohen: That's right

[00:19:27] Krishna Gade: Guardrails

[00:19:28] Jonathan Cohen: and we've developed some but but in fact you know that the way I would recommend you to play NeMo guardrails is that you use Llama guard for this and there's a huge community of these Um, models out there.

[00:19:38] Jonathan Cohen: In fact, that was the, the original design of NEMA Guardrails was, was based on this idea like, NVIDIA's not gonna solve the guard railing problem. This is like immense problem. It's like saying, we're gonna solve computer security. Of course we're not. There's a huge community of people working on this.

[00:19:52] Jonathan Cohen: So, from the very beginning, we thought it was really important to be able to tap into that. And to provide a technology that, you know, maybe had some of our own tech, um, techniques inside, but fundamentally was more of an orchestrator of the community's techniques as they're being developed.

[00:20:07] Krishna Gade: Yeah like, for example, we have integrated within Guardrails so that Fiddler's intelligence techniques can be available for Guardrails. So

[00:20:15] Jonathan Cohen: And actually, just, just a comment about that, you know, the other, the other really important part of Guardrails is logging and monitoring. The NeMo Guardrails itself doesn't really do that, right?

[00:20:24] Jonathan Cohen: It it's kind of a rules engine, as you say, more sophisticated maybe than a simple rules engine, but it's a rules engine. But you, you, you also really want to be able to have something that's monitoring it, looking for anomalies, um, uh, you know, dashboards, all these kinds of things. That's not what NeMo Guardrails does.

[00:20:42] Jonathan Cohen: You can think of NeMo Guardrails as kind of more of a an endpoint monitoring like node, but you really are going to want to connect it into some larger platform like Fiddler.

[00:20:50] Krishna Gade: Right, right. Makes sense. And, and so, uh, When it comes to, like, uh, runtime deployments of AI, one of the design patterns that we are seeing is, um, in enterprises, they're building this gateway service, almost like a brokerage service, that calls into many of the LLM endpoints, so there's almost like a model garden that evolves You know, for different use cases.

[00:21:11] Krishna Gade: And the gateway service federates requests back in the day. So it seems like a guardrailing system framework could really help you build a really powerful gateway, right? So have you seen examples of customers?

[00:21:23] Jonathan Cohen: Yeah, you certainly can do things like that. Um, you know, you can also choose to use some other systems.

[00:21:29] Jonathan Cohen: So, you know, there's a lot of these, I think, what people would now call, like, agentic. AI systems, um, or, or compound AI systems, I think they're kind of synonyms in a lot of ways, where you, again, have multiple AI models that are interacting. NeMo Guardrails could be used that way, but you don't have to. You can, you know, build your, your, your software on, let's say, um, Langchain or something like that, or whatever platform you want, and still connect it into something like NeMo Guardrails, and connect it into NIMS.

[00:22:00] Jonathan Cohen: And, and again, I think that. The concept of NeMo, just the broader platform, is NeMo is not monolithic. There is no, it is very carefully designed to be a set of modular, independent microservices. Now, we designed them all so they work well together. But you can say, I'm going to use NIM, and I want to use my own monitoring, my own guardrails, my own fine-tuning, my own deployment, that's fine.

[00:22:26] Jonathan Cohen: NIM is just a microservice, right? You can connect it into whatever. Or you can say, I want to use NIM and NeMo Guardrails, but I want to use my own, you know, toxicity detection models, and I want to use my own fine-tuning, whatever. That's fine, too. It's, it's really designed, it's really Explicitly designed to be something where you don't have to embrace the whole platform. You can take...

[00:22:45] Krishna Gade: A bunch of Lego blocks and you can pick and choose whatever you want.

[00:22:48] Jonathan Cohen: Yeah, or you can embrace the whole platform because the whole platform has been designed with some kind of coherent vision about how you're going to build and deploy these systems. But if you have an existing agentic platform that you're using and you like this one model and you want to deploy it with a NIM, that's fine too.

[00:23:03] Jonathan Cohen: Or, you know, you know, I have, let's say many models interacting and I already have some guardrails that are working for me. But I want to add, you know, Llama 405B with guardrails, and you like the guardrails, the configurations that we provide, you can use that too. So it really is very flexible in that sense.

[00:23:17] Jonathan Cohen: And that was a very important design point for us.

[00:23:22] Krishna Gade: So maybe switching gears, right, so we talked a little bit about evaluation and observability of these generative applications. You know, do you see like these techniques of evaluation and monitoring differ from domains? Like, so there's a question from the audience, you know, how does it differ from healthcare LLM apps to like financial services LLM apps, you know?

[00:23:39] Krishna Gade: You know, what's like, what sort of advice or guidance that you can provide, you know?

[00:23:45] Jonathan Cohen: Well, I guess it depends what you mean by evaluation. So, um, Let's, let me answer that in two ways. So, one level of evaluation is this kind of monitoring. You know, I'm looking for a problem. The kinds of problems that you might have are extremely domain dependent.

[00:24:04] Jonathan Cohen: Right? So in healthcare, I might have a monitor that's specifically looking for PII leakage. Um, you know, I designed my system so that it should never happen, but I want to just add another layer of security there where, hey, if the guardrail, if, if, um, if the chatbot ever provides PII that's not linked to the current patient, then flag that.

[00:24:30] Jonathan Cohen: Um, that, that would be a, uh, a guardrail, you know, uh, uh, uh, an evaluation, an online evaluation that would probably make a lot of sense in like a healthcare setting, right? Um, or compliance, you know, in finance, I have a lot of rules about what information I can and can't include, or you know, this conversation is not allowed to talk about this, or, or we have an internal example in a video we've been working towards for, for a long time, which is a, um, an HR benefits chatbot that can answer questions about benefits.

[00:24:57] Jonathan Cohen: And there's a lot of things that you're just not allowed to answer, you know? If you say, in what stock should I invest my 401k, you can't, you know, your HR partner can't answer that question for you, right? So the HR benefits chatbot shouldn't answer questions like that. Um, and, and again, you'll program it that way, you know, you might.

[00:25:15] Jonathan Cohen: Fine-tune it and provide it, you know, some guidelines, whatever, not to do that. But you probably also want to add a guardrail and also a monitor to check for, you know, are people asking these kinds of questions? What percentage of the time that someone asks a question that we're not supposed to answer, are we actually answering it?

[00:25:31] Jonathan Cohen: You know, is there a sudden spike in people asking inappropriate questions? Whatever it is. So there's a There's a, there's like a real time evaluation that I think is very context specific. The other way I can answer that question, though, just in general, is, you know, you want to evaluate your system. How accurate is it?

[00:25:46] Krishna Gade: Yep.

[00:25:48] Jonathan Cohen: System evaluation, so this is more, like, like, offline. You know, I built a model, and before I deploy it, let's say, I want to just know how good is it. Um, and that is also extremely domain specific. I mean, it's a use case specific. And, and I think this is a great example of, you know, what the, how I think about AI is, is, you know, generic, general AIs are great, but I think most business use cases are not generic.

[00:26:14] Jonathan Cohen: I don't want an AI that is going to opine about, you know, religion and history. I want an AI that does this one task, you know, It takes receipts, and my reimbursement policy, and the requested reimbursement, and checks that this receipt matches this reimbursement policy, you know, this receipt with this requested reimbursement matches my policy.

[00:26:34] Jonathan Cohen: I want an AI that just does that. I don't want it to tell me about, you know, the history of the French Revolution or whatever, right? Um, and, and therefore my evaluation is also going to be extremely good. Task specific and the way you're gonna build your AI is you're gonna first like very clearly define your task.

[00:26:52] Jonathan Cohen: You're gonna probably collect a lot of like training samples. You have, you know, probably lots of human experts at doing that. 'cause a lot of people have been doing these broke tasks for a long time. Probably have a lot of data collected over the years that you could turn into training data. Some of it is going to be used to train, some of it is going to be used for evaluation.

[00:27:09] Jonathan Cohen: And so that's, that's why I said there's sort of two ways to answer this question. I think they're both very important, and they're both, in fact, domain specific.

[00:27:18] Krishna Gade: So this is an interesting thing, right? So when it comes to, like, I used to work on search before, and we would have human raters that evaluate if the search quality was, like, Good or bad.

[00:27:26] Krishna Gade: And, you know, and so now, uh, there are companies that employ red teams to do this, uh, to sort of, uh, give you the domain specific answers. But how do you think, like, the, the field will mature? So there is now this approach of LLM as a judge. to use as a, as an evaluator, you know, uh, what's like the future here?

[00:27:47] Krishna Gade: Like, how do you think like the evaluation, because there's scale that we're talking about, right? As GenAI hits scale, now you have lots of things to evaluate and observe, you know, what, what's like the future technology here?

[00:27:59] Jonathan Cohen: Well, I think LLM as a judge, as people call it, makes a ton of sense. Um, you know, and I think what you're doing there is you're, you're sort of saying, well, it's probably less accurate than a human, but I can evaluate way more.

[00:28:14] Jonathan Cohen: And so I'm going to trade off, you know, volume of evaluation for quality and that, on balance, that's probably a good trade. And I think there's a lot of evidence that that is true. Um, I think there's still kind of a gold standard of humans evaluating, and I don't think we have any AIs that are as good as humans at evaluating responses.

[00:28:32] Jonathan Cohen: Um, you know, you have, uh, these things like Chatbot Arena, which are fundamentally humans. The interesting thing about Chatbot Arena, as a good example, um, is, you know, humans have biases. So, so, you know, Chatbot Arena is where you can go and play with it, and you ask questions and a bunch of different chatbots answer, and you rate, you know, how well you like the answers.

[00:28:52] Jonathan Cohen: And so, this is kind of considered like the gold standard for how good is your chat. But, you know, humans have preferences. We like, you know, friendly tone, that's not overly pedantic, that's, you know, the answer's long enough, but not too long, it's helpful, all these kinds of things, right? And so, Chatbot Arena, you know, strongly selects for these things that people like.

[00:29:13] Jonathan Cohen: Is that a better chatbot? I don't know. You know, it's kind of subjective, right? So, so I think, I think there's some strengths and weaknesses to humans and LLMs as a judge. The other thing I would say is not LLM, not all large language models are equally good judges.

[00:29:29] Krishna Gade: Right.

[00:29:30] Jonathan Cohen: And you can build large language models that are specifically good judges, and you can fine-tune and improve your large language model as a judge for a particular task.

[00:29:39] Jonathan Cohen: Um, and so I think, and again, you know, if you think of the analogy of a large language model, it's like a person. Again, I'm not trying to anthropomorphize this technology, but I think it's helpful to think about it this way. Um, you know, some people are better teachers than other people. Some people are better judges.

[00:29:58] Jonathan Cohen: Uh, you know, some people are better at grading papers than other people. You know, like a high school history teacher that grades papers all day long is probably a lot better at grading history papers that I would be um, and so LLMs, you know, you can make them better or worse at things, and so I think, I think the future of LLM as a judge is, I mean, very, they're going to be used for lots of judging, but I think we will continue refining our techniques for building better AIs that are great at judges.

[00:30:29] Krishna Gade: So, so far we've talked about guardrailing and monitoring in the context of accuracy and security, right? You know, you want to make sure that your GenAI app is accurate. You know, you mentioned examples of HR, chatbot, you know, sticking to what it needs to do and filtering out other things. Or like security issues where, you know, do you move, you know, eliminate PII from showing up in responses.

[00:30:49] Krishna Gade: But then there's a whole, you know, gamut of things, you know, ethical considerations that, you know, people talk about, you know, I've seen Jensen go on stage, talk about Responsible AI and AI Safety quite a few times. And then there's the legal aspects of it. And as, you know, you know, countries and continents come up with regulations, there's this, you know, who owns the liability, you know, is the, is the developer of the large language model or deployer?

[00:31:12] Krishna Gade: There's like a lot of these considerations here. And then if you are providing the guardrails, You know, would the guardrail provider own the liability? So there's lots of these implications here, right? So how do you think about it? You know, when a company, especially in regulated sectors, Fiddler works with a lot of regulated customers, some of them might be on this call. How do they think about the responsible AI aspects of generative AI? And, and will some of these tools, you know, help with them?

[00:31:40] Jonathan Cohen: That's a, that's a complicated question. Um, you know, I think, how do you think about the responsibility of your employees? And in a future where you have human employees and digital employees, you're responsible, you're responsible for the behavior and the processes and the policies of both of them.

[00:32:02] Jonathan Cohen: Um, I think what's important is that you have some confidence that your digital employees are actually following your policies. That one, you know how to, that you have a technique for explaining to your digital employees, you know, I'll say in quotes, explaining to your digital employees what your policies are.

[00:32:21] Jonathan Cohen: So we have techniques for explaining to our human employees what our policies are, right? Training manuals, I know I think legally we all have to do our sexual harassment training every two years. NVIDIA has a code of conduct, and I have to retake my code of conduct training. You know, and it's like a little 30 minute, like, training course designed by, you know, some, someone in the legal department somewhere, right?

[00:32:40] Jonathan Cohen: I mean, this is like, there's a whole industry, there's companies whose, whose business is like helping enterprises train their employees in their own policies, right? And, and like we were talking earlier about in call centers, you have quality assurance and again, whole businesses whose business is helping companies enforce quality assurance or implement quality assurance, quality assurance plans, enforce the level of quality, and all this kind of stuff, right?

[00:33:06] Jonathan Cohen: The same thing has to happen for your AI workers. Um, how do I train my AI? How do I teach my AI my corporate values? How do I ensure that it's complying with all the relevant regulations? All these, these are difficult things to do.

[00:33:23] Jonathan Cohen: And, you know, we have some techniques for it today. Um, we will continue refining those techniques. Uh, you know, and I think it's always a trade off between, um, let's say, the ease with which you can constrain your AI versus the quality and intelligence of an AI. So let me give you an example. You know, I could have something that's not an AI at all.

[00:33:48] Jonathan Cohen: It's just a straight up traditional dialogue system with hard coded responses and dialogue trees. I am 100 percent confident it will never say anything that I don't want it to say, because I literally wrote every line that it will ever respond to, you know, this is like your old school, I mean, old school

[00:34:05] Krishna Gade: decision trees

[00:34:06] Jonathan Cohen: decision trees and state machines and all this like well understood technology, you know, um, typical dialogue management, uh, you know, you chat with like, Um, some companies online, you know, help a lot, and most of them are like these kind of hard coded things.

[00:34:22] Jonathan Cohen: So very easy to control and constrain.

[00:34:25] Krishna Gade: But sometimes extremely frustrating.

[00:34:27] Jonathan Cohen: Exactly. We'd all agree they're not going to be intelligent, and therefore, you know, they don't know how to fulfill their mission. On the other hand, I can have a totally open ended chatbot that talk about anything that seems like a human.

[00:34:36] Jonathan Cohen: It's so smart and can solve my, you know, do my math homework problems and write me a speech and give me customer service in the form of a sonnet, you know, whatever, right? Um, but really hard to constrain. And so, you know, there's like this spectrum in between, and obviously, we're not satisfied with AIs that are dumb but trivial to constrain.

[00:34:58] Jonathan Cohen: Because if we were, we wouldn't have invented large language models, and the world wouldn't be excited about this. So, it's really that, like, how do you take this thing that's very intelligent and very flexible, but also constrain it, and again, my, I just come back to this, like, we figured this out with humans.

[00:35:14] Jonathan Cohen: Companies employ lots of humans, and for the most part, you're pretty confident that your humans are, in fact, following your policies. You know, that's not a thing that's like, keep CEOs up at night. That, oh, my employees, what if my employees don't do what I tell them to do? I mean It's not easy, but it's tractable, and I don't see any reason why the AI version of that is any less tractable.

[00:35:37] Krishna Gade: Yeah, makes sense. So it starts with putting the controls and guardrails and observability in place to get

[00:35:43] Jonathan Cohen: These are all very important ingredients, absolutely.

[00:35:47] Krishna Gade: Yeah, absolutely. Awesome. So maybe we can switch gears a little bit.

[00:35:51] Krishna Gade: So today, in the last, I would say, maybe in the last six to nine months, there's has been a lot of talk about agentic AI, right?

[00:35:58] Krishna Gade: And so if you kind of think about the evolution of generative AI in the enterprise, um, there's people started a lot with RAG and then fine-tuning, but still I think a lot of 95 percent of genAI apps people are building are still RAG, but then there's no agentic workflows have come about. Could you talk about what you're seeing there and what is this agentic workflow? How does it differentiate with RAG and how do companies need to think about, you know?

[00:36:22] Jonathan Cohen: Well, they're very similar. Um, you know, I think what people call agentic workflows are just a generalization of RAG. Sorry, now I should say RAG, Retrieval Augmented Generation. So the idea, maybe I should explain what these all mean.

[00:36:36] Jonathan Cohen: So the idea of RAG, Retrieval Augmented Generation, um, is If I, what is an LLM? This is a very philosophical question. So LLM is some neural, fundamentally neural network, typically transformer based neural network, that's been trained on a large amount of text, human text, usually. Um, and what it does is it learns the structure of that human text.

[00:37:03] Jonathan Cohen: And, and the way it learns the structure of that human text is it memorizes a lot of things. A lot of, you know, that a, that a mother and a father are, are learning from each other. Uh, related, and, you know, they're different genders. Just like a king and a queen are related, but they're different genders.

[00:37:17] Jonathan Cohen: And, and that, um, uh, I don't know, France has a capital called Paris. And that, you know, like, there's just tons and tons of facts that are in our heads that allow us to communicate. Um. And these LLMs have memorized so many things, um, sometimes they memorize specific facts, a lot of times they memorize sort of conceptual facts.

[00:37:39] Jonathan Cohen: And so when you ask it a question, it'll, you know, in its sort of vast memory that's encoded in the weights of the neural network, it'll spit out an answer. And oftentimes those answers are plausible but totally made up. And people refer to this phenomenon as hallucination or confabulation. Um, and so that's not very useful.

[00:37:56] Jonathan Cohen: I have this thing that sounds really good, and speaks my language, and can't be relied on to actually tell me the truth. Because everything it says sounds plausible. And that's just kind of inherent in how large language models work. And so then people said, well, okay, but what if I actually have facts?

[00:38:12] Jonathan Cohen: You know, I have, I have like Wikipedia. I'll say, I'll consider Wikipedia to be the ground truth. And instead of just asking my large language model a question, I'm I will, um, first find relevant passages from Wikipedia, let's say, that are likely to have facts relevant to the answer, and I'll prime the large language model.

[00:38:30] Jonathan Cohen: I'll say, hey, hey, large language model, here's a bunch of facts, articles from Wikipedia, here's a question, now answer it. And what, what you find, it's like, intuitively makes sense, is that the large language model is much more likely to use the facts in the passage that you fed it rather than make them up.

[00:38:48] Jonathan Cohen: And so this led to this idea of retrieval augmented generation, where, um, when you interact with a large language model, the first thing it does is it goes and retrieves a bunch of, hopefully relevant information somehow, um, before it answers. And this makes a lot of sense. And, and it's most LLMs that I've seen deployed in production are retrieval augmented in this way.

[00:39:13] Jonathan Cohen: But then people started to think, well, if I have this large language model that's, like, retrieving information, you know, maybe the thing I ask it is, um, you know, what time is it in Paris? Well, that's not stored in a database. To find out what time it is in Paris, it needs to, like, first of all, realize that I'm asking it to go look on a clock somewhere, figure out what time zone am I in, what time is it right now, do the translation, and answer.

[00:39:39] Jonathan Cohen: So that's, like, on some level it's retrieving information, but it's not retrieving it from a database or text. It's using a tool, looking up, you know, your geolocation, looking up the current time, all these sorts of things, right? And so, what started as RAG systems kind of evolved into this more general notion of, what if I just allow my AI to use tools to call other systems, computer systems, to get information?

[00:40:06] Jonathan Cohen: And maybe those other computer systems themselves are AIs. You know, I have an AI that's specialized in figuring out what time it is. It's all this AI does. And it has access to clocks and whatever. And my, my master AI, I don't know what you'd call it, my, my, um, front level AI, uh, gets the user's question and says, Oh, I don't know how to answer this, but let me go ask the time AI, because it's good at this, right?

[00:40:28] Jonathan Cohen: And so now instead of one AI that, You know, go back many, many, many steps in my story. One AI that's just memorized a bunch of stuff and maybe makes things up. Um, I actually now have a network of AIs that are all experts in their different domains that themselves may have access to external tools or facts or databases or whatever.

[00:40:47] Jonathan Cohen: And this is now what we call like an agentic system because it's made up of these agents. And it's a very natural evolution, I think.

[00:40:54] Krishna Gade: And so the communication is happening through these unstructured interactions between these agents.

[00:40:58] Jonathan Cohen: Very interesting, yeah, I think that's right, you know.

[00:41:00] Krishna Gade: It's almost like humans speaking to each other in a workplace, basically.

[00:41:03] Jonathan Cohen: I just keep coming back to this idea of, um, you know, the future, uh, like, the history of computers is APIs, you know, Application Programmer Interface, uh, Application Programming Interface, where you have a structured way to communicate. of sending a request to a computer program to get a response. You know, I form my packets this way.

[00:41:25] Jonathan Cohen: JSON, you know, we have all these formats, and YAML, and HTTP, whatever it is, you know, all these structured ways of things communicating. And now we're kind of saying, well, It turns out that we have computers, we have automated systems that are pretty good at just speaking human languages like English. Um, so instead of formulating my request to it in a very structured way, I can just kind of talk to it in English.

[00:41:49] Jonathan Cohen: And it is funny to imagine you might have all these agents, you know, that are communicating, speaking English to each other.

[00:41:57] Jonathan Cohen: And that's super interesting because now all the problems I had in monitoring and security, um, and logging and guardrails, I actually probably want to start to monitor the communication between my agents using all the same tools.

[00:42:11] Jonathan Cohen: So, so my example earlier, I said, you know, I have a customer facing chatbot, customer service chatbot. So I have a human talking to an AI. And we want to monitor them. We want to say, what is a human saying, what's the AI saying back, you know, compliance, and on topic, and all these kinds of things we care about.

[00:42:28] Jonathan Cohen: Well, now I might have a human talk to an AI, and then that AI goes and talks to a bunch of other AIs, maybe in English. And maybe those AIs talk to other AIs in English, and maybe one of those AIs talks to a computer system in JSON, right? But I have many links in these communication chains. And I probably wanna monitor all of them because one, you need to

[00:42:46] Krishna Gade: put those safeguards and controls in place isn't

[00:42:48] Jonathan Cohen: That's right you know. What if this AI starts speaking in poetry to another ai or somehow, you know, it's plausible, right? One, this AI starts doing something anomalous and interacting with other AI in the system, and now I've got a whole bunch of AI doing weird anomalous things. I really wanna know that that happens, right?

[00:43:04] Jonathan Cohen: And, and it should be doable because you'd expect these agents, as I've described them, they're pretty specialized. You know, if this agent. If my front agent talks to my time telling agent and starts asking it about the weather, that probably doesn't make a lot of sense. Or if it starts speaking in rhyme, that probably doesn't make a lot of sense.

[00:43:28] Jonathan Cohen: Or I'm expecting it to send requests in JSON, and it starts sending them not in JSON. That's probably a weird thing, that I'd want to have an alert flag, and I'd want to go debug. Well, what happened? Right. So I think

[00:43:39] Krishna Gade: those connections are still being done by the humans and programmers today, right? You know, like, which, which question to route to which agent and whatnot. But in the future, that can get autonomous and things can be, yeah.

[00:43:50] Jonathan Cohen: Absolutely. All these things. And, and I think, you know, testing these systems is going to be more complicated, um, because you can have much more, uh, variety. In the kinds of data flow, you know, I send a request in and the communication pattern can have a lot of variety based on what the request is.

[00:44:07] Jonathan Cohen: Um, so kind of asserting that things look right, I think, will get harder and harder. Um, which means, again, that your monitoring needs to also be pretty flexible and, um, probably, you know, fuzzy. And so, again, I think that the technology that's kind of underlies NeMo Guardrails, which is dialogue modeling, I think it's probably the right paradigm for this future as well.

[00:44:26] Jonathan Cohen: Um, but again, you know, tools like Fiddler, observability, monitoring, super important, anomaly detection, these things are super important, and, and, and will be even more important in the future. There's this, um, concept in, uh, programming, people, software engineers, in the, in your audience might, might know, called, uh, pre and post conditions.

[00:44:43] Jonathan Cohen: So this is kind of like asserts in languages, where you say, you know, we call this function, And there's some conditions, just logically, that must be true. You know, my computer is a giant state machine, and the state of the computer at the time that this function could be called should have this, this, this, this, and this true.

[00:44:58] Jonathan Cohen: And I, and I typically code these as asserts. And at the end of this function, I should have these conditions should be true. And inside the function, you know, these conditions should be true, right? And this is, like, absolutely, like, a best practice from a software engineer perspective. Because most bugs, almost every bug, is your system is in a state that you didn't expect it to be in, and all these asserts help you find these.

[00:45:17] Jonathan Cohen: Um, I think these agentic systems, uh, I mean, they're computer systems. They also have states. Now, the states are much fuzzier and more amorphous and complicated, but I think that the future version of pre and post conditions is going to be some form of guardrails, monitoring. You know, alerts.

[00:45:40] Krishna Gade: Yeah, absolutely. So, so, I think, so it seems like there's a question from the audience, which, which I think is related to what you just told.

[00:45:47] Krishna Gade: The question is from Philippe, you know, finally, where do you see the role of AI agents evolving in the next five to 10 years? Do you envision them being fully automated in critical sectors and do you anticipate ongoing human oversight as a necessity for safety and alignment?

[00:46:01] Jonathan Cohen: You know, I, so these are, that's honestly a more policy question than anything else. I think, I mean, we already have computing systems. That are, you know, autonomous in some ways, right? Um, I mean, think of just like software that you run that, I don't know, like my spam filter. It's pretty autonomous. I'm not tweaking my spam filter.

[00:46:20] Jonathan Cohen: I just trust it. Now, sometimes it gets things wrong. So do I every once in a while go check my spam folder? Occasionally, you know, less and less. Ten years ago I would have done a lot more. I trust my spam filter a lot more now. So I kind of check it less and less, right? Um, but I think that's an example, you know, I, we're surrounded, like, I feel like we, we look towards this future as like, it's going to be wild, but a lot of these things we already live with, right?

[00:46:44] Jonathan Cohen: Um, but I think the question is getting, like, a deeper, a deeper question now, is, is more and more of the world going to be automated away from us in a way that will, you know, will be increasingly opaque to humans? I think that's, I think that's dangerous, right? I think there's, like, a lot of value in human oversight for all kinds of reasons.

[00:47:04] Jonathan Cohen: Um, and I think compliance is going to force that. You know, until you are really confident in the technology. You know, you're going to have humans overseeing things. And, you know, what does it mean to be really confident in technology? I don't even know. Um, you know, I guess I don't want to mention specific industries, but there's a lot of things that's hard to imagine, you know, any responsible company would allow an AI to go execute something without human oversight, you know, some critical business function without human oversight anytime soon, you know, will it happen eventually? I, gosh, I couldn't answer that question.

[00:47:40] Jonathan Cohen: But I, I think I think over the next, you know, let's say five years, I don't even want to give a time horizon, but in the medium term, five ish years, um, I think what we will see is a lot of the rote work that we do that's time consuming, um, be automated. And the role of humans will be overseeing a lot of this kind of work.

[00:47:59] Jonathan Cohen: And checking, like my example earlier on checking, you know, a reimbursement receipt, uh, you know, for compliance. I mean, that's something humans do today that requires kind of human cognition and, and, and brain power, or has required human cognition and brain power. But you could totally imagine that an AI would be pretty good at doing that task.

[00:48:15] Jonathan Cohen: Would it be 100%? No. Is there a risk of it not being 100%, it's low, you know, like, if that, if that AI gets it slightly wrong, well, there's probably an appeal, the person who filed for that reimbursement might say, hey, you know, the AI rejected me, but, and then a human would step in, you know, so I feel like we, you can imagine how a lot of these processes could start to be automated, and you can also imagine how human oversight, you know, isn't going away anytime soon, right? That's kind of the future I imagine in the next couple of years.

[00:48:45] Krishna Gade: Awesome. This is great. I think we've touched upon the past, the present, and the future a little bit.

[00:48:51] Krishna Gade: But maybe, maybe we can get back to a little bit of, uh, you know, success stories, and especially going back to the, the NIM and the NeMo stack that you're building.

[00:48:59] Krishna Gade: You know, could you share, like, a success story of some of these projects that have impacted, uh, customers AI capabilities in the recent few, recent past?

[00:49:07] Jonathan Cohen: Sure. Yeah, so, I mean, one of, one of our, um, earliest, um, Successful customers is Amdocs. So Amdocs is a very large company, um, that is a service provider to the telco industry. Um, so a lot of, like, the bills that you get from Verizon, or actually, maybe I shouldn't mention companies because I'm not actually sure who's their customer, but, you know, when you get your cell phone bill, a lot of times it's actually processed using Amdocs software.

[00:49:32] Krishna Gade: Yeah.

[00:49:33] Jonathan Cohen: Um, customer service, they, they do a lot of the a lot of the, the software that runs the telcom industry, um, is operated by Amdocs.

[00:49:42] Jonathan Cohen: Um, and so they, they've been a very early and great partner of ours adopting NIMS. We have something called NeMo Retriever, which is a collection of, um, retrieval models that, so, so in this RAG system, Retrieval Augmented Generation, there's this question of how do you retrieve information? How do you find documents that are likely to be relevant to the, to the conversation that you're having with your AI?

[00:50:04] Jonathan Cohen: And so there's AI models that do that. And so we have something called NeMo Retriever. And so they've used NeMo Retriever and NIMS and they built a bunch of different RAG systems and they've seen huge success. So I think the number they told us, we have a blog post about this on our technical blog on nvidia.com.

[00:50:22] Jonathan Cohen: But the numbers they told us was I think an 80 percent reduction in latency versus deploying a comparable system using all of the various managed services that were out there. So so what they really wanted was rather than hitting all these different endpoints managed by different service providers, you know, and different networks wherever they may be.

[00:50:41] Jonathan Cohen: They just wanted to take all the models and run them themselves um hit the endpoints on their own infrastructure. Um, you know, you reduce network latency. They could ensure they had, you know, optimal size models and They had control over what the models were, how big of, you know, there's always a trade off between a bigger model, which is higher latency and slower, is more accurate, versus a smaller model, but maybe it's overkill for this one thing, and, or I could take a smaller model and fine-tune it and make it really good at this one task, and even though it's smaller, it's actually more accurate than a bigger, more generic model.

[00:51:11] Jonathan Cohen: So all these kind of factors. And so they did a lot of this work with our help, um, and they saw an overall end to end reduction of 80 percent in latency, while, while no loss in accuracy or quality. Um, I think their cost, uh, I think it was like a 60 percent reduction in data preprocessing cost and a 40 percent reduction in inference time cost, if I'm getting this right.

[00:51:34] Jonathan Cohen: So, you know, and it, again, it just makes sense, right? Deploying all this, being able to optimize the models. Deployment on optimized infrastructure is going to have benefits. And I think they're a great example of a, of a company that has a real production system with, with, um, demanding customers, uh, and they deployed it and they're seeing some really great results.

[00:51:53] Krishna Gade: Awesome. This is great. Uh, you know, thank you so much, Jonathan, for, uh, being on this session. Um, I learned a lot, uh, and, uh, from all the way to you know, how the NVIDIA AI strategy is in building these Lego blocks to, you know, fine-tune, prompt engineer, build RAG applications, and also, you know, inference them at scale, and also the guardrailing aspect of it.

[00:52:17] Krishna Gade: And I think, you know, what we are seeing, honestly, in enterprise use cases, whether it's banking or financial services, people are building these internal search applications, Q&A applications, customer service. These are some of the dominant generative AI applications. And this aspect of the things that we talked about, you know, security, guardrailing, you know, inadvertent keywords showing up, or topics that the chatbot should not mention, um, that are coming through.

[00:52:42] Krishna Gade: And then just the sheer accuracy of these LLMs are, are really, you know, top of the mind issues. We are very excited about the partnership that we have. So if you're thinking about NeMo Guardrails, like many of our customers do, you could integrate Fiddler with NeMo Guardrails for monitoring, and also get the guardrailing and all the rules framework that NeMo offers today.

[00:53:04] Krishna Gade: Thank you so much, Jonathan. Thanks again.

[00:53:05] Jonathan Cohen: Thank you. That was a really enjoyable conversation.

[00:53:07] Krishna Gade:

‍