Building Generative AI Applications for Production with Chaoyu Yang
On this episode, we’re joined by Chaoyu Yang, Founder and CEO at BentoML.
AI-forward enterprises across industries are building generative AI applications to transform their businesses. While AI teams need to consider several factors, ranging from ethical and social considerations to overall AI strategy, technical challenges remain in deploying these applications to production.

Yang will explore key aspects of generative AI application development and deployment.
Krishnaram Kenthapadi: Welcome, and thank you for joining today's AI Explained on Building Generative AI Apps for Production. My name is Krishnaram Kenthapadi. I'm the Chief AI Officer and Chief Scientist at Fiddler AI, and I will be your host today.
Joining us is Chaoyu Yang, founder and CEO at BentoML. Chaoyu, would you like to go ahead and introduce yourself?
Chaoyu Yang: Absolutely. Thank you, Krishnaram, for having me. My name is Chaoyu, and I'm the founder and CEO of BentoML. We help AI teams deploy and host their AI models and workloads in the cloud. We maintain a number of open-source projects, and I'm very excited to be here.
Krishnaram Kenthapadi: Awesome. Let's start by discussing some challenges around deploying LLM prototypes into production. Can you highlight the key challenges that enterprises face when deploying such prototypes into production?
Chaoyu Yang: I think a lot of the time people start with understanding the quality of the machine learning model, or, these days, more often a large language model.
Chaoyu Yang: Being able to evaluate how the model performs and understand how it behaves in different scenarios is critical before you let it get into the hands of your users and customers. At a pretty high level, many organizations have concerns around things such as data privacy and security, and once past that stage, typically around performance, meaning the latency with which your users receive responses, and the cost of operating everything in the cloud.
Chaoyu Yang: And for us specifically, as a developer tools provider, developer experience and developer velocity are also very important factors to consider, because applications keep changing. You need a way for developers to iterate quickly and get updates into production.
Krishnaram Kenthapadi: Can you go a little deeper into these cost and performance considerations? Can you give some examples of the kinds of considerations customers have?
Chaoyu Yang: Absolutely. Especially for generative AI and large language models, you typically need GPU resources to run inference with those large models.
Chaoyu Yang: And GPU availability is a real challenge. The first thing is to make sure you pick a cloud vendor that offers the GPUs your workload needs. And because GPUs are so expensive, you should definitely consider optimizations around your cloud infrastructure and how you leverage those GPUs.
Chaoyu Yang: You want your workloads to auto-scale efficiently to handle large amounts of traffic, and when there isn't much traffic, to scale down to one or zero instances to reduce cost. There are also options such as utilizing spot instances for online traffic without downtime, or working with cloud vendors to purchase reserved instances.
Chaoyu Yang: All of those can generally give you more cost efficiency if you're looking to host the models yourself. A little more on performance: the first thing you need to understand is your performance goals. There is no silver bullet that optimizes your AI workload for everything.
Chaoyu Yang: For large language models, if you're looking at latency, people typically think about tokens per second, but you should also think about the time it takes to receive the first generated token from your service. And in the many cases where you're working with structured responses from a large language model, you may want to evaluate end-to-end latency, where the model generates the entire response.
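For illustration, here is a minimal sketch of measuring time-to-first-token and tokens-per-second against a streaming chat endpoint, using the OpenAI Python SDK. The base URL and model name are placeholders for whatever service you run, and one streamed chunk is treated as roughly one token.

```python
import time
from openai import OpenAI

# Works against any OpenAI-compatible endpoint; URL and model are placeholders.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
num_chunks = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        num_chunks += 1  # rough proxy: one streamed chunk ~ one token

end = time.perf_counter()
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"end-to-end latency:  {end - start:.2f}s")
print(f"throughput: ~{num_chunks / (end - first_token_at):.1f} tokens/sec")
```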
Krishnaram Kenthapadi: Somewhat related to that: how do you think LLM developers should decide between, say, using an open-source LLM versus querying OpenAI or other endpoints? What are the trade-offs?
Chaoyu Yang: Going back to quality first: GPT-4 from OpenAI, for example, probably offers the best performance out there among large language models. But you want to see whether you are solving a domain-specific problem. Let's say you're building a customer support chatbot; it doesn't need to answer complicated questions that require pulling knowledge from Wikipedia.
Chaoyu Yang: And there's already a lot of content out there on how to fine-tune an open-source model such as Llama 2 so that it can actually beat GPT-4 on some domain-specific tasks. So from a business requirements perspective, look at what quality you need and what problem you're trying to solve, and at least validate whether open-source models are a viable path forward.
Chaoyu Yang: One common question I hear from developers, and especially startups, is: if they're mostly building around OpenAI and ship the product to production, will there be any differentiation in the service they provide? That goes a bit beyond this discussion, but it's something people think about: how much room there is to build a competitive advantage over your competitors.
Chaoyu Yang: Another consideration, typically for larger enterprises, is data privacy. Using a vendor like OpenAI, you definitely run the risk of exposing sensitive data to a third-party vendor. With open-source models, you can self-host, it's possible to run them on premises, and you also avoid vendor lock-in.
Chaoyu Yang: In general, it gives you a lot more transparency and control over how you run your large language model apps, how they access your data, and how you manage your code and customizations on top of the model. But self-hosting might not be for everyone. First, you probably need in-house expertise to run all the complicated infrastructure.
Chaoyu Yang: It also tends to delay your time to market, because there's an initial rollout cost: developers need to spend time figuring out the whole setup. And you should consider the cost associated with running your own large language models, which is especially a problem at low usage, when you're just starting out.
Chaoyu Yang: At that point, you only have a small number of users requesting access to your service. Commercial large language models typically charge per token, so that's probably not a problem there. But for an open-source model, you're typically paying for the amount of compute used to host it.
Chaoyu Yang: So one of the requests we at BentoML run into a lot is the capability to scale a deployment to zero when it's not being actively used, but with instant startup when a request comes in. That's an area where we do a lot of optimization. But in the longer term, if you think your service will have very high usage and you'll be able to saturate the GPU resources, open-source large language models tend to be more cost-effective, assuming the quality is good enough for your use case.
Krishnaram Kenthapadi: Yeah, I think that's a really valuable perspective. Just to add to that, a sentiment I often hear at conferences like the one I mentioned, or from talking to customers, is that if you are building a new LLM application, initially focus on identifying product-market fit: get some basic functionality working end-to-end and gauge the feedback from potential customers. And for that, the path of least resistance might be leveraging an existing endpoint, right?
Krishnaram Kenthapadi: OpenAI or other endpoints. But once you have set that up and you see there's a lot of interest in what you're offering, then maybe start thinking about building the expertise to leverage an open-source LLM, for all the reasons you highlighted: keeping your data proprietary, meeting certain regulatory requirements, or not wanting to pay OpenAI a lot as your traffic gets larger and larger. And finally, going back to what you were mentioning at the beginning, right?
Krishnaram Kenthapadi: I'm hearing a similar sentiment: if you're building a chatbot to answer questions based on your documentation or the help articles you've developed, you don't need a really fancy GPT-4-class model to summarize the response in natural language. You can use information-retrieval-based approaches to get the relevant documents, and then all you need is a model with enough natural language ability to take those documents and summarize the relevant information.
Krishnaram Kenthapadi: We'll get more into some of these dimensions shortly, like retrieval-augmented generation, but it looks like you don't need a GPT-4-class model just for this. As long as you're not trying to leverage all the world knowledge it has captured, you may not need it.
Chaoyu Yang: That's a great point. A lot of the customers and open-source users we see successfully self-hosting their models went through that path: trying out prototypes with commercially available LLMs and then evaluating the business outcome. That usually helps justify the cost and the long-term investment of those large language model projects.
Chaoyu Yang: Otherwise, I've seen people go the other way around: they look at the GPU cost, it seems very intimidating, and they don't have the business metrics to back up and support that cost.
Krishnaram Kenthapadi: Before we go to the next topic, just out of curiosity: you mentioned that sometimes customers expect the following. They want to shut down the instance when there is no traffic, and they want to get the instance back up with very low latency.
Krishnaram Kenthapadi: If you're thinking of something like a 10-billion-parameter LLM, like one of the Llama models, do you have a sense of how long that takes, from the instance not running to up and running?
Chaoyu Yang: In general, there are a couple of trade-offs there. If you absolutely need the latency and you're running a very commonly available model, one option is shared instances among customers serving the same type of model. If you want a dedicated deployment, there are also ways to, for example, keep the instances warm, cache the container images on the node, or make the model load into memory faster.
Chaoyu Yang: So it's a trade-off between cost and the cold-start latency you can tolerate. As we build and adapt the infrastructure for developers, we tend to make all those options available, so they have more control and can decide what works best for their scenarios.
Krishnaram Kenthapadi: You mentioned having shared instances across multiple customers. In that scenario, are there ways to have a clean separation, making sure that prompts from one customer don't somehow influence the state when a different customer is querying the same language model?
Chaoyu Yang: For us, it's always a clean separation. In the case of BentoCloud, we do dedicated deployments for each customer, so one model instance won't be handling requests from two different customers at the same time.
Krishnaram Kenthapadi: Yes, I think that's a good point.
Krishnaram Kenthapadi: And that gets us into the dimension of security for LLM apps. Could we start by highlighting the different security and safety considerations, and are there any lessons we can draw from deploying predictive machine learning models?
Chaoyu Yang: Mm-hmm. I think the first is probably still data privacy: where you manage your data, how you do data and model governance, and understanding what datasets were used to train each specific model. From a data science workflow perspective, the same concepts still apply to large language models, especially in an enterprise setting.
Chaoyu Yang: LLMs come with greater capabilities, but also issues such as hallucination that can cause serious problems in production. So observability has never been more important, because large language models can do so much and are so unpredictable. Being able to understand how your prompt or your fine-tuned model behaves in different scenarios, and how your customers are engaging with those models, becomes very important, and those observations help you continue to improve your large language model application.
Krishnaram Kenthapadi: Yes, I think that's spot on. As we were starting to discuss before the webinar, one sentiment I see is that for predictive machine learning models, customers often wanted monitoring and observability of what's going on, but they were fine with a delay of maybe five minutes, 30 minutes, or an hour, depending on how the monitoring pipeline was set up.
Krishnaram Kenthapadi: But with LLMs, for all the reasons you alluded to, from hallucinations to generating toxic content or even leaking PII, customers are showing a lot of interest in an AI safety or proxy layer that detects these kinds of issues in real time and ensures that end users don't see toxic content generated by the LLM, PII leakage, bias, or other issues.
Krishnaram Kenthapadi: So they're really looking to address these security and safety considerations in real time. That's an interesting shift compared to customers using somewhat more predictable models; with all the issues around LLMs, this seems to be a sea change.
Chaoyu Yang: We're actually seeing some customers use more traditional ML, for example a fine-tuned BERT model, to help classify whether an output is toxic, or to detect whether the user is asking a genuine question in the context of the business problem.
Chaoyu Yang: But another thought in the back of my head is that a lot of the problems people try to solve with a large language model can actually be solved with a domain-specific machine learning model. I guess the lower barrier to entry and the easy setup make people reach for LLMs and think about how they can be impactful for their business.
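As a sketch of the screening pattern Chaoyu describes, the snippet below runs a small classifier over LLM output before it reaches the user. It assumes the Hugging Face transformers library; unitary/toxic-bert is one publicly available toxicity model, used purely as an example stand-in for your own fine-tuned classifier, and the threshold is arbitrary.

```python
from transformers import pipeline

# Small BERT-style classifier screening LLM output before it reaches users.
# "unitary/toxic-bert" is one public example; swap in your own fine-tuned model.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def screen(llm_output: str, threshold: float = 0.5) -> str:
    result = toxicity(llm_output[:512])[0]  # stay within the encoder's length limit
    if result["label"] == "toxic" and result["score"] >= threshold:
        return "Sorry, I can't share that response."
    return llm_output

print(screen("Here is the summary you asked for: ..."))
```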
Krishnaram Kenthapadi: Yeah, I think that's a great point. In fact, we are seeing this line of approaches where there's a benefit to using other language models, or even much smaller models, to evaluate some of these dimensions. Even in the process of building Fiddler Auditor, an open-source project we released a few months back to study robustness and other issues with LLMs, we noticed that leveraging other LLMs to generate variants of prompts, or to identify how similar the responses are, is quite valuable.
Krishnaram Kenthapadi: Do open-source LLM models specify their data sources for training, is that all clean data from a legal perspective? If you build a solution for a client leveraging and open-source LLM model, is the client at risk of getting sued over training data or is that fear not legitimate?
Chaoyu Yang: That's a great question. I think it really depends on the LLM provider.
Krishnaram Kenthapadi: Mm-hmm.
Chaoyu Yang: Some of them do disclose the data sources used for training; for many of them, you cannot find any reference at all. So definitely be careful when choosing an open-source large language model. Another pretty common thing to look at is the license for the open-source model:
Chaoyu Yang: whether the license specifies that commercial use is allowed. And if it's available, definitely consult your in-house legal team for help. From our perspective, we generally don't give any recommendations in terms of legal advice here.
Krishnaram Kenthapadi: Yeah, related to this question, a comment I've heard recently is that the large providers, the Microsofts, Amazons, Googles, and OpenAIs of the world, could potentially indemnify you as a customer against any such liability issues.
Krishnaram Kenthapadi: But the rhetorical question is: if you use an open-source LLM, you'd be on your own, right? If somebody sues you for, say, copyright violations or other issues arising from whatever was used as part of the training data.
Chaoyu Yang: That's right. And what makes this more complicated is that there are so many fine-tuned versions of Llama 2 available on HuggingFace. For example, if you need a Llama 2 that supports function calling, there are a bunch of fine-tuned versions available, and those tend to come from individual developers or smaller companies.
Chaoyu Yang: And it's very hard to tell whether those models come with a different type of legal risk.
Krishnaram Kenthapadi: Somewhat related to this question, with the caveat that perhaps neither of us is an expert in the legal domain: we both have technical backgrounds, but not legal expertise.
Krishnaram Kenthapadi: A related dimension I saw in a paper that came out a few months back is the question of whether, say, a diffusion model or a large language model is a machine learning model or a database. The context was that this paper showed that diffusion models, like Stable Diffusion, tend to memorize instances in the training data. Say there's a person's name paired with an image, and that image appears many times, likely because the person is a celebrity or well known; then at prompt time, if you query that person's name, the model generates an image that looks almost identical to the original.
Krishnaram Kenthapadi: So the paper brings up these kinds of questions: if this happens a lot, the model could be viewed as a database that is retrieving things from its data store and surfacing them. And if the LLM or diffusion model starts being viewed as a database, the copyright implications are very different compared to a model that synthesizes.
Krishnaram Kenthapadi: I'd encourage you to take a look at this paper if you get a chance; it's one of the papers we highlighted in the recent ICML and KDD tutorial that I presented along with some folks from other companies. So let me go to another question.
Krishnaram Kenthapadi: Is it wiser to develop in a manner that allows you to switch LLMs in and out?
Chaoyu Yang: Sorry, do you mind repeating the question?
Krishnaram Kenthapadi: I think the spirit of the question is: can you develop your application such that the LLM you use can be switched in and out?
Krishnaram Kenthapadi: You might start with OpenAI's APIs, and then, as the application matures, you may decide to switch that out for, say, an open-source LLM.
Chaoyu Yang: I see, I see. From an infrastructure perspective, it's really easy to switch the LLM behind an endpoint. For example, our open-source project OpenLLM offers a more sophisticated endpoint, but also offers an endpoint that's compatible with OpenAI's API.
Chaoyu Yang: So if your application is built around the OpenAI SDK, you can literally switch the endpoint and all the code should still run. The more challenging problem is that you wrote prompts around those models, and sometimes you have a fine-tuned model underneath. The cost of switching those and experimenting with how the prompts perform with the new model can be a lot more time-consuming.
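A minimal sketch of that swap with the OpenAI Python SDK: pointing base_url at a self-hosted, OpenAI-compatible server is all that changes, while the application code stays the same. The URL and model name below are placeholders.

```python
from openai import OpenAI

# Hosted OpenAI:
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Self-hosted, OpenAI-compatible server (placeholder URL and model name):
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-2-7b-chat",  # whatever model the server exposes
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```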
Krishnaram Kenthapadi: Yeah, thanks for that comment. Let me go to another question, which is somewhat related to the security discussion we had earlier: what licenses are safe to use for commercial products? Would you consider BSD, Apache, GPL, or possibly others?
Chaoyu Yang: I think most of the ones you mentioned are pretty friendly commercial licenses.
Chaoyu Yang: If you're referring to the software layer or the infrastructure layer, most of that software is generally pretty safe to use. We chose Apache 2.0 because it's widely acceptable for enterprises. But as we discussed earlier with large language models, it's still yet to be defined whether you treat, say, a diffusion model as a database, or how those software licenses apply to large models.
Krishnaram Kenthapadi: Yeah. Maybe just to elaborate or clarify: I think we're not so much concerned about those who trained these language models coming after you. We don't expect that, say, Meta is going to go after all the companies using Llama 2; that's just the opposite of their intent.
Krishnaram Kenthapadi: We're more concerned about those who have claims to the copyrights of the data used in training potentially arguing that the application violates their copyrights and wanting to sue those who leverage, say, an open-source model.
Krishnaram Kenthapadi: From that angle, just from a license perspective, I think anything that allows commercial use should be fine. That could be Apache, or licenses like the Elastic License that allow commercial use, which many organizations are gravitating toward because they don't want their own competitors to start offering their product as a managed service.
Krishnaram Kenthapadi: Otherwise, they're fine with customers using the tool for commercial purposes. So let me take another question: what testing tools do you use for prompt validation or review?
Chaoyu Yang: Sorry, which tools do we use for prompt validation and review?
Krishnaram Kenthapadi: Yeah, what do you use?
Chaoyu Yang: We're mostly building tools for developers, and we see our developers and customers use very different tools for this. I don't have a strong opinion on which tool to use. Typically, I find it most helpful if the tool provides an interface that's available through Jupyter Notebooks,
Chaoyu Yang: so you can easily experiment with a single prompt, but at the same time easily scale it to a larger dataset and then figure out ways to visualize and understand the results.
Krishnaram Kenthapadi: If I may add some thoughts: I think validating both the prompts and the responses is quite important.
Krishnaram Kenthapadi: And this validation can be along different dimensions. One dimension, of course, is against some kind of labeled benchmark: if there's a ground-truth dataset, you may want to validate whether the responses from the LLM application for the prompts in that dataset are close to the expected responses.
Krishnaram Kenthapadi: But I think we should not stop there. We should also think about what happens if the input prompts vary slightly: if they retain the semantic meaning but look syntactically different, do the responses change a lot? This is more like robustness. You can go one step further and see how robust the LLM application is to different types of security attacks, whether prompt injection attacks or other kinds of adversarial attacks.
Krishnaram Kenthapadi: And of course, you can go further beyond that. If the application or the domain demands it, it's always a great idea to stress-test for bias, PII leakage, or other responsible AI dimensions. And after doing all this testing, it's not enough to test just once; it's a good idea to set things up so these kinds of tests run regularly. Just because you vetted the LLM before deployment, there's no reason to expect that it will continue to remain robust or free of biases at runtime. So you have to think about both validating up front and continuing to monitor, with safety checks in place once deployed as well.
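As one concrete robustness check in this spirit (a sketch, not Fiddler Auditor's actual API): send semantically equivalent paraphrases of a prompt and flag responses that drift apart in embedding space. It assumes the sentence-transformers library; the model name, the threshold, and the ask_llm stub are all illustrative.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def ask_llm(prompt: str) -> str:
    # Placeholder: call your LLM application here.
    return "To reset your password, open Settings and choose 'Reset password'."

paraphrases = [
    "How do I reset my password?",
    "What are the steps to reset my password?",
    "I forgot my password; how can I change it?",
]
responses = [ask_llm(p) for p in paraphrases]

# Compare every response to the first; low similarity under a harmless
# rephrasing suggests the application is not robust.
base = embedder.encode(responses[0], convert_to_tensor=True)
for prompt, resp in zip(paraphrases[1:], responses[1:]):
    sim = util.cos_sim(base, embedder.encode(resp, convert_to_tensor=True)).item()
    if sim < 0.8:  # threshold is arbitrary; tune for your domain
        print(f"Unstable response for {prompt!r} (similarity {sim:.2f})")
```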
Chaoyu Yang: Absolutely. I think I misunderstood the question a bit. But one interesting thing related to this: we've seen quite a lot of LLM applications that output structured data, or whose output can be turned into structured data, where people apply some of the model evaluation techniques we've been developing in the MLOps space.
Chaoyu Yang: For example, in the function-calling case where the model returns structured JSON, you can compare the output and even compute accuracy, or, for unstructured data, compare similarity. In cases like SQL generation, you can perhaps run the SQL and compare the results to get a more accurate view of how the model is performing.
Chaoyu Yang: And you can set that up to automate the testing going forward as you change your application.
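A minimal sketch of that SQL-generation check: execute both the model-generated query and a reference query against a test database and compare result sets rather than SQL text, so differently written but equivalent queries still pass. The schema and queries here are invented for illustration.

```python
import sqlite3

# Toy database standing in for a real test fixture.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 120.0, "west"), (2, 80.0, "east"), (3, 95.5, "west")])

def results_match(generated_sql: str, reference_sql: str) -> bool:
    """Compare result sets, so equivalent but differently written SQL passes."""
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return False  # the generated SQL didn't even run
    expected = sorted(conn.execute(reference_sql).fetchall())
    return got == expected

# The "generated" query stands in for LLM output.
print(results_match(
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
))  # True: same results despite different SQL text
```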
Krishnaram Kenthapadi: Along those lines, have you seen instances where there are mitigation approaches as well? Let's say you instruct an LLM to generate the response in JSON format, but it only does that 90 percent of the time.
Krishnaram Kenthapadi: For the remaining 10 percent, it generates something close to JSON, but it doesn't quite obey the JSON syntax. Are people applying some kind of bandages, ways to fix the output, in those situations?
Chaoyu Yang: I think fine-tuning is a really good way of fixing that issue. In my experience, a fine-tuned model that has only seen generated JSON data tends to follow the format really well. If you're doing it purely from the prompt, you probably need a slightly larger model to follow this slightly more complicated instruction. Some of the magic words are along the lines of "The output must be valid JSON following this specification," together with a few examples. That tends to work pretty well if you're working with a model with a very large parameter count.
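One common bandage for the remaining 10 percent, sketched below: validate the output with a JSON parser and, on failure, retry with the parse error appended to the prompt. call_llm is a placeholder for however you invoke the model.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for your actual model call.
    return '{"name": "Ada", "plan": "pro"}'

def get_json(prompt: str, max_retries: int = 2) -> dict:
    """Ask for JSON; if parsing fails, feed the error back and retry."""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Append the parse error so the model can correct itself.
            prompt += (f"\nYour previous output was not valid JSON ({e}). "
                       "Respond with valid JSON only.")
    raise ValueError("Model never produced valid JSON")

print(get_json("Return the user profile as JSON."))
```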
Krishnaram Kenthapadi: Next question: how can we confirm whether the output generated by LLMs, specifically images, is correct? I guess the spirit of the question is: let's say there's a generative AI model that generates images, maybe a diffusion model or otherwise. How do we go about checking whether the generated image is correct?
Chaoyu Yang: I see. That's a very interesting question; I've never thought about that. One potential thing you could do is use another model to describe the image and then compare the similarity of the outputs, but I don't know if that gets to the accuracy you need. Otherwise, there's always the route of human labeling and evaluating the generated images.
Krishnaram Kenthapadi: One line of work I've seen or heard about is this: in many image generation settings, there may not be a notion of ground truth, of the one image to generate. If you ask the model to generate an image of a sunset on a beach with horses walking around, there's no single correct image. But what you can do is take the generated images and apply techniques in the reverse direction.
Krishnaram Kenthapadi: You can apply, say, object detection or object identification models and check whether the image contains, in this case, a horse, whether the image is about a sunset, whether it's in a beach setting, and so forth. That may be a way to test the generated image, essentially using models themselves to some extent.
Krishnaram Kenthapadi: And of course, human evaluation goes one step further and can address the subtle issues you may not be able to identify just with approaches like object detection. These are some ways you can test whether the model obeyed the instruction. But testing whether the model generated the best possible image becomes a little subjective; maybe you can use an image-to-text model to assess the quality of two generated images, or you may want to use human judgments on which of the images is the best output for the prompt.
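A sketch of that reverse-direction check, using an off-the-shelf object detector (facebook/detr-resnet-50 via the transformers pipeline is one example) to verify that a generated image contains the objects the prompt asked for. The file path, the required labels, and the confidence threshold are all illustrative.

```python
from transformers import pipeline

# Object detection in the "reverse direction": check whether the generated
# image contains the objects the prompt asked for. DETR is one example detector.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

detections = detector("generated.png")  # placeholder path to the generated image
found = {d["label"] for d in detections if d["score"] > 0.7}

for required in ["horse", "person"]:  # objects the prompt asked for
    status = "OK" if required in found else "MISSING"
    print(f"{required}: {status}")
```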
Krishnaram Kenthapadi: So let's go a little into the different ways LLM developers have been thinking about leveraging LLMs. These go all the way from prompt engineering to retrieval-augmented generation to fine-tuning, or even, if you have large enough proprietary data, training your own LLM.
Krishnaram Kenthapadi: In your view, how should LLM developers decide between these different options?
Chaoyu Yang: I think even today, fine-tuning and, of course, training your own LLM still require a lot of in-house expertise to get the right data prepared and to figure out the right recipe, especially for training your own LLM.
Chaoyu Yang: Fine-tuning has become a lot easier nowadays, and it works really well for some domain-specific use cases. But regardless, this type of setup has a pretty complex initial process, not only for training but also, as we discussed, for continuing to evaluate performance over time.
Chaoyu Yang: It's complexity you need to weigh: is it worth your team's time for your use case? RAG, on the other hand, is quite a bit easier to get started with. One main benefit I see with RAG is that it's a lot more transparent and gives you more control: you can see exactly what part of the dataset the RAG application is accessing.
Chaoyu Yang: If you want to implement something like role-based access control, where one user can only access certain subsets of data in your database, you can easily add that on top of a RAG application. It also makes it way easier to add or remove data, to limit the LLM from accessing certain data.
Chaoyu Yang: If you want to do data governance and data lineage, tracking what data is being used, which version of the dataset was accessed at a certain point in time, and how those datasets reflect real-world changes, that transparency makes it really easy for developers to stay in control.
Chaoyu Yang: And lastly, I think the most powerful LLM apps tend to combine some of these approaches. You could build a RAG application on top of a fine-tuned large language model, or even fine-tune the embeddings to improve retrieval performance, and then use prompt engineering to improve the overall performance.
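A minimal sketch of the role-based access control idea mentioned above: attach an access tag to each document and filter retrieval by the caller's roles before anything reaches the LLM. The toy keyword scoring stands in for vector search; a real system would use a vector store's metadata filters, and all names here are illustrative.

```python
# Each document carries an access tag; retrieval filters on the caller's roles
# *before* anything reaches the LLM.
DOCS = [
    {"text": "Q3 revenue was ...",        "allowed_roles": {"finance"}},
    {"text": "Refund policy: 30 days.",   "allowed_roles": {"support", "finance"}},
    {"text": "On-call escalation steps.", "allowed_roles": {"engineering"}},
]

def retrieve(query: str, user_roles: set, k: int = 2) -> list:
    visible = [d for d in DOCS if d["allowed_roles"] & user_roles]
    # Toy relevance score: count of query words appearing in the document.
    scored = sorted(visible,
                    key=lambda d: sum(w in d["text"].lower()
                                      for w in query.lower().split()),
                    reverse=True)
    return [d["text"] for d in scored[:k]]

context = retrieve("what is the refund policy?", user_roles={"support"})
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)  # a support user never sees finance-only documents
```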
Krishnaram Kenthapadi: Yes, I fully agree. There's often a perception that these approaches are mutually exclusive, but as you pointed out just now, they're not. You can combine, say, retrieval-augmented generation with fine-tuning, because the two serve slightly different purposes. I recently saw an analogy with a patient in a healthcare setting
Krishnaram Kenthapadi: getting some advice from a doctor. The doctor's specialization is analogous to fine-tuning, and all the historical context about the patient is analogous to retrieval-augmented generation, the context you retrieve for that specific patient.
Krishnaram Kenthapadi: So clearly both of these can coexist. In spite of the doctor specializing in some domain, you still need the context of the patient; and similarly, just having the context of the patient may not always be enough. Again, it depends on the application.
Krishnaram Kenthapadi: Along similar lines, when we think of a final application, it's good to think through some of the issues you highlighted. Do we want the LLM to respond based only on the retrieved data? Then retrieval-augmented generation is very important:
Krishnaram Kenthapadi: you don't want the LLM to go off and start discussing things outside the domain, or start hallucinating. At the same time, if your application is reasonably different from general applications, fine-tuning is always a good idea. So let me go into another question that I see here.
Krishnaram Kenthapadi: So, yes, I think we've essentially answered this question: should machine learning teams consider RAG over fine-tuning, and can both methods coexist when deploying an LLM? The answer is yes. But related to that, do you have thoughts on how often people should fine-tune an LLM?
Krishnaram Kenthapadi: Let's say they have fine-tuned an LLM now and deployed it. Do you have a sense of the parameters or considerations for deciding when to fine-tune the model again?
Chaoyu Yang: That's very interesting, because it's quite different from the retraining process in a more traditional MLOps setup, where changes in the data, a distribution shift, can actually affect performance quite drastically. RAG solves some of those problems: if new data arrives, or there are changes in your source data,
Chaoyu Yang: it reflects that on the fly. It also depends on the goal of the fine-tuning. Is it for the model to memorize some facts, or is it more about having the model follow a specific format of response? As I mentioned earlier, you can fine-tune an LLM to understand, say, function definitions and return a structured JSON format.
Chaoyu Yang: For those cases, you probably won't need to fine-tune the model frequently, unless you're adding maybe a new type of syntax that the model didn't handle well before.
Krishnaram Kenthapadi: Yeah, thanks, Chaoyu. Let me go to a related, higher-level question:
Krishnaram Kenthapadi: in your view, what are the top three biggest challenges when running LLMs and other generative AI apps in production?
Chaoyu Yang: Top three biggest challenges, I think. One is evaluation: understanding how your model is performing. The second is security:
Chaoyu Yang: making sure your models behave the way you expect, don't generate toxic content, and don't hallucinate so much that they mislead your users. And the last one is the overall excellence of your infrastructure. Are you running things under acceptable latency? Can you support a large number of concurrent users accessing your service?
Chaoyu Yang: And are things cost-efficient? Are you utilizing all of the GPU resources, which are very expensive from the cloud provider?
Krishnaram Kenthapadi: Yes, that makes a lot of sense; all these dimensions are really important. And between these, do you have thoughts on when somebody is just developing a new LLM application:
Krishnaram Kenthapadi: should they focus on one versus the others, or should they consider all of these simultaneously?
Chaoyu Yang: I think those three are the most important when you're thinking about getting things into production. They're all very important, and depending on your business,
Chaoyu Yang: you'll want to prioritize them differently. But when you're just starting out, I feel it's still validating the business impact, getting a prototype working end-to-end, that's more important, because that's how you justify investing more engineering and resources into the project.
Krishnaram Kenthapadi: So let me take a question that's a little bit related: what are the concerns expressed by enterprise customers when they deploy generative AI in production, in terms of AI risk management? Do they care just about the performance of the models, such as accuracy, precision, and recall?
Krishnaram Kenthapadi: Or do they also care about robustness, fairness, explainability, and so on? And if they care about these risks, what is the difference between generative AI and a common, traditional machine learning use case or model?
Chaoyu Yang: Mm-hmm.
Krishnaram Kenthapadi: I can take a stab at this, and I would love for you to add your thoughts as well.
Krishnaram Kenthapadi: As we briefly discussed earlier, I think AI governance and risk management is definitely a top concern for enterprise customers. They do care first about aspects like performance: the quality of the responses, whether measured by accuracy or other NLP-related metrics, or image-related metrics for image models.
Krishnaram Kenthapadi: But in addition, the things we hear a lot are: is there toxicity in the responses? Is the prompt or the response potentially leaking any personally identifiable information, or PII? How robust is the application? It might look great in a demo setting, but when you actually deploy it in production, will it continue to remain robust?
Krishnaram Kenthapadi: Then, depending on the specific domain, especially if the customer is in a regulated industry, they do care about aspects like bias exhibited by the LLM, or the need to explain how the model is making its predictions. We're also seeing a difference in how this kind of risk management is approached depending on where the application is situated.
Krishnaram Kenthapadi: If the application is targeted at end users or is customer facing, enterprises seem to care a lot more about these issues, because they're concerned about reputational damage if the model generates toxic content, or, even worse, like what happened with Bing AI, encourages somebody to commit suicide. They're also concerned about robustness and similar aspects for customer-facing chatbots and other applications. But if the application is internal facing, maybe assisting a loan officer, or helping a domain expert do some kind of enterprise search and consolidate all the data within the enterprise, then there's a bit more appetite for these tools
Krishnaram Kenthapadi: and a bit less concern about some of these risks. But broadly, I would say that enterprises do care a lot about understanding these risks: vetting the model for such risks before deployment, continuing to measure the risk over time, and, to the extent needed, identifying approaches to mitigate it.
Krishnaram Kenthapadi: A related dimension we see often is that these models may not exist in isolation. Often they exist to help some domain expert. So even understanding when such models can be relied upon, and when we should instead defer to the domain expert, is of huge value.
Krishnaram Kenthapadi: Understanding the failure modes of LLMs and other types of models, and being able to defer to the domain expert when the models aren't expected to work well, is also important. The model isn't necessarily required to be accurate 100 percent of the time; we just need to know when it cannot be relied upon.
Krishnaram Kenthapadi: And going to the comment about when enterprises should use predictive ML approaches versus LLM-based approaches: again, it very much depends on the use case. If the use case is highly regulated and there's a requirement to articulate how the model is making its predictions, as in lending or similar settings, or to articulate the fairness or lack of bias of the model, then it may be better to go with well-understood predictive machine learning models. If, on the other hand, the application is a little more open-ended, and you want to leverage all the benefits of a natural language interface or the creativity of diffusion models and so forth, then going with generative AI models might be a good approach.
Krishnaram Kenthapadi: Chaoyu, would you like to add any thoughts?
Chaoyu Yang: I think you covered it really well. In general, people are seeing the amazing things large language models can do, but on the other side, when we talk about responsible AI, we want to be aware of the biases and potentially toxic content generated by large language models.
Chaoyu Yang: The whole industry, including a lot of the open-source community and vendors, is working really hard on solving those problems. I guess developers shouldn't be too intimidated by those risks; a lot of new open-source tools and products are now available to help reduce them and to understand large language model behavior a lot better nowadays.
Krishnaram Kenthapadi: Let me take a question on hardware requirements. What are the hardware requirements, in terms of GPUs or otherwise, to run useful open-source models, say Llama 2 at 7 billion parameters and above?
Chaoyu Yang: That's a great question. We tend to think about that under the operability category. The first question is: does the model fit in the amount of GPU memory you have? For Llama 2 7B, the smallest GPU we can run it on is probably a T4, which has a very limited amount of memory; and with quantization, you can even fit slightly larger models onto a smaller GPU.
Chaoyu Yang: But one thing to be aware of there: for large language models, especially combined with some of the latest inference optimizations such as continuous batching, having more available GPU memory actually helps you improve the latency and throughput of your large language model service.
Chaoyu Yang: So in practice, if this Llama 2 7B model will be accessed by many applications and you want to saturate the GPU as much as possible, it's generally good to reserve enough memory for running inference and doing all the caching the scheduler needs.
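A back-of-the-envelope sketch of the does-it-fit question: weight memory is roughly parameter count times bytes per parameter, with KV cache and activations needing headroom on top. The numbers below are rules of thumb, not vendor specifications.

```python
# Rule of thumb: weight memory ~ parameter count x bytes per parameter;
# KV cache and activations need headroom on top of this.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"Llama 2 7B @ {precision}: ~{weight_memory_gb(7, bytes_per_param):.1f} GB")
# fp16 (~13 GB) is tight on a 16 GB T4; int4 (~3.3 GB) leaves room for KV cache.
```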
Krishnaram Kenthapadi: Next question: considering there is no ground truth, how do you test the performance of an LLM? Are there any standard datasets or benchmarks used for this?
Chaoyu Yang: This question is more about evaluating a large language model's capability itself. That area, I'm not an expert in, but there are quite a lot of organizations focused on evaluating large language models themselves, and a lot of benchmarks available.
Chaoyu Yang: I think in practice, people tend to focus a bit more on evaluating how models perform on their specific, domain-specific use cases.
Krishnaram Kenthapadi: In fact, I was about to reflect on that exact sentiment. There are a number of resources for evaluating LLMs themselves.
Krishnaram Kenthapadi: For example, if you want to evaluate an LLM in a chatbot setting, there's the multi-turn MT-Bench benchmark. There are leaderboards as well: LMSYS maintains a leaderboard for chatbots and other language models, and Hugging Face has similar leaderboards. We discussed some of these evaluation approaches in the tutorial I was referring to.
Krishnaram Kenthapadi: But that said, I think what's important in an enterprise setting is to evaluate in your specific application setting. If you are, say, a healthcare company or a manufacturing company, evaluations done in a generic setting may not carry over.
Krishnaram Kenthapadi: It's still a good idea to evaluate with respect to prompts and responses from the specific domain of interest in your setting, and we're starting to see tools being developed with these kinds of situations in mind. Next question: what are the suggestions for overcoming context length restrictions with open-source models?
Krishnaram Kenthapadi: Do you have thoughts on that?
Chaoyu Yang: There's some research and experimental work on getting around the context length limit, but I don't think there's been any breakthrough that really gives you the same quality of performance. Another approach I've seen some developers experiment with is more around offline processing.
Chaoyu Yang: Let's say in a retrieval-augmented generation application, a user question may need tons of context, way too much for the LLM to process. But if you can have the large language model thinking about those problems in the background, making the knowledge in the database more concise, then you can be more efficient when retrieving relevant information.
Chaoyu Yang: It's kind of like this: if you ask me a really hard question right now, it might take me a while to figure out the answer. But if it's a question I've been thinking about in the background for a while, I can probably retrieve it from memory and answer immediately. So there are other approaches, in how you orchestrate your LLM apps and on the infrastructure side, that can help you work around the context length.
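A sketch of that background-thinking pattern: periodically condense long source documents offline and index the condensed versions, so the context retrieved at query time fits the model's window. The summarize stub here is a placeholder for a real LLM call.

```python
# Offline: condense long documents so that, at query time, the retrieved
# context fits comfortably in the model's context window.
def summarize(text: str) -> str:
    # Placeholder for an LLM call, e.g. "Condense this into key facts: ..."
    return text[:200]  # stand-in so the sketch runs

def condense_corpus(documents: list, chunk_size: int = 2000) -> list:
    condensed = []
    for doc in documents:
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
        # Map: summarize each chunk; reduce: summarize the concatenation.
        partials = [summarize(c) for c in chunks]
        condensed.append(summarize("\n".join(partials)))
    return condensed

# Run this as a background/batch job, then index `condensed` in your vector store.
print(condense_corpus(["A very long policy document... " * 100])[0][:80])
```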
Krishnaram Kenthapadi: Yeah, thanks, Chaoyu. Since we're almost running out of time, I'll just read some of the other questions. Any practical techniques to protect LLMs against data poisoning? This is the setting where an adversary injects a specific type of data into the training data, with the goal of making the model respond in a very specific manner on a small niche of inputs while otherwise remaining largely undetectable. Where do you see the market, and specifically developer traction, as the biggest bet going forward: open-source generative AI models versus closed-source models like OpenAI's? And what are the considerations for building these applications for multilingual situations, or for users accessing from mobile platforms?
Krishnaram Kenthapadi: There are many more questions we would have loved to get into, but we don't have time. I would love for you to share any concluding thoughts, any overall guidance for all the listeners today.
Chaoyu Yang: Relevant to that last question you just mentioned: I firmly believe in open-source language models, especially due to the transparency and control they give developers to further customize their LLM applications, and to easily embed evaluation, observability, and security deep into their stack.
Chaoyu Yang: So I think that's most likely the future of how LLMs will be applied across different industries and enterprises. We're still early in this journey, and we're really fortunate to have a large open-source community, and our friends at Fiddler, to work on some of these challenges together.
Krishnaram Kenthapadi: Yeah, likewise. We're very excited to be at the forefront of addressing some of these trustworthiness and observability challenges, and we look forward to working with BentoML and the broader community to help advance this domain. Thank you all for joining us today; it was a fantastic conversation. Thanks, Chaoyu, for your time.
Chaoyu Yang: Thank you again for having me.