Productionizing GenAI at Scale with Robert Nishihara
In this episode, we’re joined by Robert Nishihara, Co-founder and CEO at Anyscale.
Enterprises are harnessing GenAI across many facets of their operations to enhance productivity, drive innovation, and gain a competitive edge. However, scaling production GenAI deployments can be challenging because it requires evolving AI infrastructure, approaches, and processes to support advanced GenAI use cases.
[00:00:00] Krishna Gade: In today's AI Explained, we are going to be talking about productionizing generative AI at scale. This week has been a very exciting week, as you have probably seen the launch of Llama 3.1, an open-source model from Meta, which comes very close in accuracy to the closed-source models.
[00:00:29] Krishna Gade: It's an exciting time that we are living in. I'm the founder and CEO of Fiddler AI, and for those who don't know, I'll be your host today. We have a very special guest today: Robert Nishihara, CEO of Anyscale. Welcome, Robert. Just a brief intro: Robert is one of the creators of Ray, a distributed framework for scaling Python applications and machine learning applications.
[00:00:58] Krishna Gade: Ray is used by companies across the board, from Uber to OpenAI to Shopify to Amazon, to scale their ML training, inference, data ingest, and reinforcement learning workloads. Robert is one of the co-founders and the CEO of Anyscale, which is the company behind Ray. Before that, he did his PhD in computer science at UC Berkeley, focusing on machine learning and distributed systems.
[00:01:22] Krishna Gade: And before that, he majored in math at Harvard. So excited to welcome you, Robert. Thank you for sharing this opportunity with us.
[00:01:32] Robert Nishihara: I'm excited as well. Thanks for having me on.
[00:01:36] Krishna Gade: Awesome. Robert, first up, I would love to hear from you: how did you get into Ray? Give us a brief intro to Ray and the Anyscale founding story.
[00:01:49] Robert Nishihara: Yeah. So today I'm spending all of my time working on distributed systems and systems for AI, but I started grad school with no background in systems. Actually, I was spending all of my time on the more theoretical side of AI, trying to design algorithms for deep learning training or reinforcement learning, and proving theorems about how quickly these algorithms could learn from data and converge. So I'm really coming from that side of things.
[00:02:22] Robert Nishihara: But what we found, my co-workers in grad school and I, was that even though we were trying to spend our time designing algorithms, we were actually spending all of our time managing clusters, building systems, and writing code to scale across GPUs, and things like that.
[00:02:45] Robert Nishihara: And there wasn't an off-the-shelf tool that we felt we could just use to solve these problems. And it wasn't just us doing this, right? A lot of other researchers in AI were building their own tools and their own systems for managing compute. And we felt that the tooling here was really a bottleneck for AI research.
[00:03:09] Robert Nishihara: And so we thought there was an opportunity to make that easier, and we started Ray, which is an open source project. Our goal was just to build useful open source tools that make it easier for people doing machine learning to take advantage of multiple machines and scale in the cloud, and to solve the distributed systems challenges for you: dividing up the work across different machines, scheduling different tasks on different machines, recovering if one machine crashes, and moving data efficiently.
[00:03:44] Robert Nishihara: There are many different software engineering challenges around scaling compute, and we got started building Ray to try to solve them. The underlying thing here is that AI was becoming more and more computationally intensive. That was the core premise of all of this: people needed to scale things in the first place.
[00:04:12] Robert Nishihara: If you could do all of this on just your laptop, there wouldn't have been any problem to solve.
[00:04:18] Krishna Gade: Yeah, I remember those days when I was an ML engineer at Bing Search in the 2000s; we would train simple models on our desktops, and the world has changed a lot now. You are working with some amazing tech-forward companies like OpenAI, Uber, and Pinterest.
[00:04:35] Krishna Gade: We were talking about it before the call: what are some of the challenges that Ray helps solve for these types of companies now, when they're building these large-scale AI models?
[00:04:47] Robert Nishihara: You know, the challenges have changed over time. I think a lot of companies have gone through, or even now are starting to go through, a transition of adopting deep learning.
[00:05:01] Robert Nishihara: You mentioned Uber. They were a pioneer in machine learning and machine learning infrastructure and had been doing machine learning for many, many years. And of course, the starting point for them was not deep learning. It was smaller, simpler models like XGBoost and so forth.
[00:05:19] Robert Nishihara: When they went through this transition of really enabling deep learning in all of their products, the underlying infrastructure and systems challenge got a lot harder, because it's far more compute intensive. All of a sudden, you need to manage GPU compute instead of just CPU compute.
[00:05:39] Robert Nishihara: You need to manage a mixture of them, and you end up with potentially a different tech stack for deep learning and a different tech stack for classical machine learning. Providing all of these capabilities internally to the rest of the team can be quite challenging.
[00:06:02] Robert Nishihara: We see all sorts of challenges. One is just enabling distributed computing, right? Enabling distributed training, things like that. There are other challenges around the handoff from training and developing the model to deploying it.
[00:06:23] Robert Nishihara: We hear people say it takes them six weeks or 12 weeks to get from developing a model to getting it into production. And that could involve handoff to a different team, or rewriting it on a different tech stack. A lot of the challenges people face are also around how quickly their machine learning people or data scientists can iterate and move, right?
[00:06:50] Robert Nishihara: Are they spending most of their time focused on the machine learning problems, or are they spending most of their time managing clusters and infrastructure?
[00:06:59] Krishna Gade: Yeah, so let's go a couple of levels deeper. I think distributed training is a very interesting problem. Maybe you can walk us through what it would take.
[00:07:11] Krishna Gade: How does the architecture look? Are you sharding the data across different replicas? How do you maintain the state? Give us a taste of what's under the hood: what makes Ray such a robust platform for distributed training?
[00:07:30] Robert Nishihara: That's a great question. Just before I dive into training: there are three main workloads that we see people using Ray for. It's quite flexible, but training is one of them, serving and inference is another, and the last is data processing, like data preparation or ingest and pre-processing.
[00:07:52] Robert Nishihara: Companies typically have all of these workloads, and they're all challenging. We see a number of challenges around training. The first is just going from a single machine to multiple machines. But a perhaps more subtle challenge appears as you're scaling training on more and more GPUs. You mentioned Pinterest; this was one of the reasons Pinterest adopted Ray for training their models.
[00:08:23] Robert Nishihara: As you're scaling training on more and more GPUs, it's very easy to become bottlenecked by the data ingest and pre-processing step. Training is expensive because the GPUs are expensive, so you want to keep them fully utilized. And that means loading the data, pre-processing it, and feeding it in fast enough to keep those GPUs busy.
[00:08:45] Robert Nishihara: And that may mean you need to scale the data ingest and pre-processing even more, right? That can often be done on cheaper CPU machines. So you want to scale up the data ingest and pre-processing on a ton of CPU machines, a separate pool of CPU compute, and then pipeline that into training on GPUs.
[00:09:04] Robert Nishihara: It's conceptually simple, but hard to do in practice. You may want to scale both of these things elastically. You may need to recover from failures if you're using spot instances or things like that. There are other challenges you run into at even larger scales around actual GPU hardware failures and recovering quickly from them.
[00:09:31] Robert Nishihara: In the regime where you start to have really large models, model checkpointing and recovering from checkpoints can be quite expensive, so you need to build out really efficient handling of checkpoints. So there are many challenges, and those are the challenges that Ray solves. And just to call out the data ingest and pre-processing pipelining: that is an area where we're seeing more and more AI workloads
[00:10:03] Robert Nishihara: requiring mixed CPU and GPU compute, and really being both GPU intensive and data intensive, and that's a regime where Ray does really well.
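To make this concrete, here is a minimal, hedged sketch of the pattern Robert describes: scaling ingest and pre-processing with Ray Data and streaming the result into GPU training workers with Ray Train. The toy dataset, batch sizes, and worker counts are illustrative assumptions, and exact arguments vary somewhat across Ray 2.x versions.

```python
# Hedged sketch: CPU-scaled preprocessing pipelined into GPU training workers.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def preprocess(batch):
    # CPU-bound work (decoding, tokenizing, augmenting); runs in parallel on CPU nodes.
    batch["x"] = batch["x"] * 2.0
    return batch

def train_loop_per_worker(config):
    shard = ray.train.get_dataset_shard("train")  # this worker's slice of the dataset
    for epoch in range(2):
        # Batches are produced by the CPU pool and streamed to this GPU worker,
        # so the GPU is not left idle waiting on ingest.
        for batch in shard.iter_torch_batches(batch_size=256):
            pass  # model forward/backward/optimizer step would go here
        ray.train.report({"epoch": epoch})

# Toy in-memory dataset; in practice this might be ray.data.read_parquet(...) on cloud storage.
ds = ray.data.from_items([{"x": float(i)} for i in range(10_000)]).map_batches(preprocess)

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # GPU training workers
    datasets={"train": ds},  # ingest/preprocessing runs on the cluster's CPU resources
)
trainer.fit()
```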
[00:10:15] Krishna Gade: So these days, larger technology companies like Uber and Pinterest will probably be building and training models from scratch, but a lot of the enterprises that we work with at Fiddler are looking forward to, and are already working with, these
[00:10:32] Krishna Gade: pre-trained models. Llama 3.1 launched this week, and there are quite a few other options, including closed-source models like GPT and Claude. So where does Ray fit into this equation? If I'm an enterprise customer looking to build a conversational experience using one of the pre-trained models, how can I leverage Ray to fine-tune or build this LLM application?
[00:11:04] Robert Nishihara: That's a good question. So not every company is going to be training really large models, although I do think the vast majority of companies will train some models. But what every company will do for sure is have a lot of data and use AI to process that data and extract insights or draw conclusions from it.
[00:11:31] Robert Nishihara: It may sound like just a data processing task, but there's going to be a lot of AI and a lot of inference being used in doing that, because you're going to want to draw intelligent conclusions from your data. And that is again a workload that combines things; it won't just be inference, there will be regular processing and application logic combined with AI.
[00:11:58] Robert Nishihara: So this also falls into the regime of large-scale processing on mixed CPU and GPU compute, combining machine learning models with other application logic. When you think of large-scale data processing, you might think of systems like Spark, which are mature, battle-tested, and fantastic for data processing, but are also really built for a CPU-centric world and not really built with deep learning in mind.
[00:12:32] Robert Nishihara: And so when it comes to scaling data processing in a way that uses deep learning, in a way that uses GPUs, and is often working with unstructured data like images and text and video, you end up with a hard data processing problem where you're scaling on CPU compute and GPU compute and running a bunch of models. These are the kinds of challenges we're seeing today that we can help with.
[00:13:06] Krishna Gade: So it seems to me that Ray could definitely be a platform of choice if you're thinking about scaling your model training and inference, especially optimizing your GPUs to the fullest extent possible, and that seems to be its unique advantage. Now, when it comes to productionization: you mentioned LLM inferencing and ML inferencing. How does that work with Ray, and what are some of the challenges that you've seen in productionizing GenAI as well as traditional machine learning workloads?
[00:13:47] Robert Nishihara: Yeah, so with inference, we tend to divide it into online inference, where you're powering some real-time application, right?
[00:13:57] Robert Nishihara: And offline inference, where you're perhaps processing a larger dataset, and it's less latency sensitive and more throughput or cost sensitive. So with online inference, the challenges actually change over time as you're building AI applications, right?
[00:14:25] Robert Nishihara: We were talking about this the other day, but as businesses start adopting AI, they're often in this exploratory phase where they're trying to figure out how to use AI in their business and what product even makes sense. And so at this point, the question is about quality. Are the models good enough?
[00:14:45] Robert Nishihara: Right? Is the quality high enough? How do I fine-tune to improve the quality? How do I reduce hallucinations, use RAG, all these kinds of things. And they care about iteration speed for experimentation, and there's also a lot of data pre-processing that has to be done at this stage to see if you can get your data into the model in various ways.
[00:15:08] Robert Nishihara: Then, once they figure it out, once they validate, hey, this is the right product, it makes sense, people like it, and you start moving these applications to production, then the challenges change, right? At this point, you might start to care about cost. You might start to care about latency.
[00:15:27] Robert Nishihara: Is it responsive enough for people to really engage with it? Reliability: how do I upgrade the models? So the nature of the challenges changes. It always starts with quality, because that's just what determines if it's possible or not. But once you meet the quality bar, then the criteria very quickly change to latency, cost, and other factors.
[00:15:52] Krishna Gade: Got it, got it. That's actually a very good point you mentioned, reliability. One of the things in productionizing model inference is: how do you upgrade the models? How do you swap an existing model with a challenger model that you have trained?
[00:16:12] Krishna Gade: How does Ray help developers do these things, where you upgrade models, iterate, A/B test model versions, or move from champion to challenger, in a manner that doesn't affect latency or throughput and all of those things?
[00:16:33] Robert Nishihara: Yeah, great question. The challenges are different for large models and for small models. For example, we see companies that want to deploy thousands of models, that might fine-tune or train one model per customer of theirs. And if they have thousands of customers, they end up with thousands of models.
[00:16:56] Robert Nishihara: And of course, those might be low-utilization models, so there's the operational challenge of managing and deploying many, many models, and there's an efficiency question of how to serve all of these models from a shared or smaller pool of resources.
[00:17:18] Robert Nishihara: With larger models, the challenges end up being around GPU availability and things like that. Just to give one example: as you deploy a model and then upgrade it, a natural way to do that is to keep the old model serving in production, deploy the new one, and then slowly shift traffic over.
[00:17:47] Robert Nishihara: Done naively, that will require double the amount of GPUs, right? So can you do this without requiring that many extra GPUs? Can you do it in place, or with maybe one extra GPU? There are all sorts of challenges around making that work well. And there are things like bursty traffic, right?
[00:18:14] Robert Nishihara: If you have to reserve a fixed pool of GPUs but you have bursty traffic, there are going to be a lot of times when you're provisioning for peak capacity but have unused capacity. Can you run other workloads, can you multiplex less critical workloads onto that unused compute
[00:18:34] Robert Nishihara: at those times, for cost efficiency reasons? These are some of the kinds of challenges we can help work on.
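To illustrate the traffic-shifting idea, here is a hedged Ray Serve sketch of a router that sends a configurable fraction of requests to a challenger model while the rest go to the champion. The model classes, GPU counts, and the 10% split are illustrative assumptions (and this simple version still reserves GPUs for both versions rather than doing an in-place swap); exact handle semantics vary a bit across Ray Serve versions.

```python
# Hedged sketch: canary-style traffic shifting between two model versions with Ray Serve.
import random
from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})  # GPU counts are illustrative
class ChampionModel:
    async def __call__(self, prompt: str) -> str:
        return f"champion answer to: {prompt}"  # stand-in for real model inference

@serve.deployment(ray_actor_options={"num_gpus": 1})
class ChallengerModel:
    async def __call__(self, prompt: str) -> str:
        return f"challenger answer to: {prompt}"

@serve.deployment
class Router:
    def __init__(self, champion, challenger, challenger_fraction: float = 0.1):
        self.champion = champion
        self.challenger = challenger
        self.challenger_fraction = challenger_fraction

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        # Send a small fraction of traffic to the challenger; ramp this up as confidence grows.
        handle = self.challenger if random.random() < self.challenger_fraction else self.champion
        return await handle.remote(prompt)

app = Router.bind(ChampionModel.bind(), ChallengerModel.bind(), challenger_fraction=0.1)
# serve.run(app)  # then POST {"prompt": "..."} to the Serve HTTP endpoint
```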
[00:18:41] Krishna Gade: Yeah, makes sense. A couple of audience questions: people are asking what the relationship is between Anyscale and Fiddler AI. As you may have heard, Anyscale is a large-scale model training, deployment, and inferencing product,
[00:19:00] Krishna Gade: and Fiddler is focused on observability. So you can think about those two things as complementary in your AI workflow. Hopefully that clarifies it. So let's take a relevant question around that. There are multiple parts of the AI workflow we just talked about.
[00:19:19] Krishna Gade: There's data processing, feature engineering, model training, model selection, deployment, inferencing, monitoring. When you think about this, especially moving from traditional AI to GenAI, how should enterprises streamline this process?
[00:19:39] Krishna Gade: Some of those steps may have gone out of the window. What does feature engineering mean in the case of GenAI or LLM applications? What does model training mean? Is it fine-tuning? What does model selection mean? And of course, we are facing some of those challenges in observability of LLMs.
[00:19:59] Krishna Gade: How has it affected you, and what are some of the things that enterprises are trying to solve in these GenAI operations, end to end?
[00:20:09] Robert Nishihara: There are so many different challenges that you mentioned, and like I was saying, the challenges really do change over the life cycle of deploying generative AI, or really deploying different AI applications, in production.
[00:20:28] Robert Nishihara: We see a couple of distinct phases: one around experimentation and fast iteration, and then another around scaling and managing many different models. And the nature of the observability challenge is very different from what it was previously in machine learning or in traditional applications.
[00:20:59] Robert Nishihara: I mean, the challenges around evaluation are far more complex. And one thing we actually often recommend when people are getting started with building generative AI applications is to front-load the model evaluation part, to over-invest in building internal evals early on.
[00:21:21] Robert Nishihara: Because that's going to determine the speed of everything else you do, right? A new model is going to be released and you're going to swap that in, or you're going to fine-tune a model, and then you're going to ask yourself, is the new one better or worse? And if you have automated evals that you feel confident in, then you're going to be able to answer that question very quickly.
[00:21:44] Robert Nishihara: And new models are going to be released all the time, so that's a question you're going to have to ask a lot. Also, evals are one of the things that you can't easily outsource, because they require domain expertise, right?
[00:22:00] Krishna Gade: It's very custom to what we have.
[00:22:03] Robert Nishihara: And the same is true for the way you craft your specific data into a form that can be used by the AI application, right?
[00:22:20] Robert Nishihara: There are things that are easier to factor out or easier to outsource, like a lot of the AI scaling and infrastructure components, or just the performance work of making things as fast as possible on the GPU. These are things that you could solve in house, but they are more consistent from company to company. A lot of the other pieces, like evaluation and data processing, really leverage a lot of domain expertise.
[00:22:55] Krishna Gade: Got it. So as your customers think about using Ray for both traditional ML training, like deep learning model training, and now expanding to LLM operations, like LLM fine-tuning, how are you supporting that today? For example, would those workloads be able to coexist on the same platform that you're providing?
[00:23:26] Krishna Gade: Customers are also kind of confused, right? There are a lot of options today: this one for machine learning, that one for LLMs. How are you solving those challenges for your customers? Because these two worlds have clearly emerged in the last couple of years.
[00:23:47] Robert Nishihara: Well, I think the people that we work with don't tend to want a different tech stack for LLMs, another tech stack for computer vision models, and another one for XGBoost models. If you can have a common platform or framework that supports all of these things, that's advantageous.
[00:24:10] Robert Nishihara: Especially because AI is evolving very rapidly, right? There are going to be new types of models released, new optimizations, new frameworks. A lot of companies have gone through this big migration from classical machine learning to enable deep learning, only to find at the end of it that they need to change things again in order to enable LLMs, right?
[00:24:39] Robert Nishihara: And that's not the end of it, right? All of these models are going to be multimodal, and
[00:24:44] Robert Nishihara: agentic workflows are coming out.
[00:24:46] Robert Nishihara: They're going to become more complex, right? So think from the perspective of the ML platform team, the people who are providing AI capabilities to the rest of the company.
[00:24:56] Robert Nishihara: You really want to optimize for flexibility and being able to stay relatively agnostic to different types of models, different types of hardware accelerators, and different frameworks. I wouldn't only support XGBoost or only support PyTorch, right?
[00:25:19] Robert Nishihara: Or only support one inference engine. The more you can position yourself to immediately take the latest thing off the shelf and work with it, the faster you're going to be able to move, because those new things are going to come.
[00:25:34] Krishna Gade: Right. So, say in the GenAI case I were to fine-tune Llama for my business use case. Could I have a Ray inferencing module that would load up my fine-tuned Llama and help me scale my inferencing? If I'm an engineer doing that, what would that process be like?
[00:25:58] Robert Nishihara: Yeah, so there are really two layers to Ray. There's the core system, which is basically just scalable Python.
[00:26:08] Robert Nishihara: It's the ability to take Python functions and Python classes and execute them in the distributed setting, right? And that's where the flexibility of Ray comes from, because with Python functions and classes you can build anything, but it's too low level, right?
[00:26:25] Robert Nishihara: If you want to do training, or data processing, or fine-tuning, then using that API you'd have to build all of the training logic and data processing logic on top of these functions and classes, which is too low level. And this is why Python has a rich library ecosystem.
[00:26:44] Robert Nishihara: You have Pandas and NumPy and all these different tools that you can just use off the shelf to build powerful Python applications. In the distributed setting, Ray tries to do something analogous. There's the core API, which is like Python, or scalable Python, and then there's an ecosystem of scalable libraries.
[00:27:06] Robert Nishihara: Certainly not as many as Python has, but these include libraries for training and fine-tuning, libraries for data ingest and preprocessing, and libraries for serving. And they form an ecosystem, which is really powerful, because in order to do training or fine-tuning, you also typically want to process and load some data.
[00:27:31] Robert Nishihara: So the fact that you can do these together in a single Python application is pretty powerful. Instead of: I run my big Spark job to prepare the data, then I pull up a new system for training, and I have to glue these things together and manage a bunch of different frameworks, right?
[00:27:49] Robert Nishihara: The way people were doing things before is often more analogous to taking each Python library and making it a standalone, separate distributed system, when what you really want is a common framework with different libraries on top that can all be used together.
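For reference, here is a minimal sketch of the core API Robert describes, where ordinary Python functions become distributed tasks and classes become stateful actors. The functions and resource settings are toy placeholders.

```python
# Hedged sketch of the Ray core API: functions as tasks, classes as actors.
import ray

ray.init()  # starts a local Ray runtime; on a cluster, this connects to it instead

@ray.remote
def preprocess(record: str) -> str:
    # A stateless task; Ray schedules it on any available CPU in the cluster.
    return record.strip().lower()

@ray.remote  # add (num_gpus=1) to pin the actor to a GPU
class Counter:
    # A stateful actor; its state lives in one long-running worker process.
    def __init__(self):
        self.total = 0

    def add(self, n: int) -> int:
        self.total += n
        return self.total

futures = [preprocess.remote(r) for r in ["  Hello ", " Ray  "]]
print(ray.get(futures))  # ['hello', 'ray']

counter = Counter.remote()
print(ray.get(counter.add.remote(5)))  # 5
```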
[00:28:06] Krishna Gade: Yeah. So now, when I'm actually doing that productionization,
[00:28:09] Krishna Gade: how are you seeing teams maintain consistent high performance and accuracy across multiple production applications? What have you found: are they evaluating the same metrics pre-deployment and post-deployment of GenAI applications?
[00:28:29] Krishna Gade: What are some of the other considerations around scalability and throughput that they are looking at?
[00:28:36] Robert Nishihara: Yeah, so you're talking about quality metrics?
[00:28:41] Krishna Gade: Quality, yeah. Quality, cost, performance. Give us a catalog of the things that your customers are looking at, pre- and post-deployment of GenAI.
[00:28:51] Robert Nishihara: Certainly some of the ones you mentioned. Quality is probably the hardest to measure, but also probably the one that's foremost and most important. And then latency and cost are really big ones. It might sound straightforward to measure latency, but there are a lot of subtleties, especially for LLMs, where you don't just produce one output, you produce a sequence of outputs. Depending on the application, you may care about the time to generate the first token, or you may care about how many tokens per second you're generating.
[00:29:33] Robert Nishihara: And both of those numbers can vary quite a bit depending on the amount of load on the system and the degree to which you do batching. So we often expose knobs for making that trade-off: more model replicas or lower batch sizes to decrease the latency for generating new tokens, but at the cost of throughput.
[00:30:06] Robert Nishihara: So there are some trade-offs like that, and of course there are a lot of optimizations you can do to improve both of these things.
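As an illustration of the knobs Robert mentions, here is a hedged Ray Serve sketch that exposes replica autoscaling and request batching as latency/throughput trade-offs. The replica counts, batch sizes, and the dummy generate method are assumptions, not recommendations.

```python
# Hedged sketch: replica count and batch size as latency/throughput knobs in Ray Serve.
from typing import List
from ray import serve
from starlette.requests import Request

@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # illustrative: one GPU per model replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # more replicas cuts queueing delay
)
class Generator:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def generate(self, prompts: List[str]) -> List[str]:
        # Larger batches improve GPU throughput; smaller batches reduce per-request latency.
        return [f"output for: {p}" for p in prompts]  # stand-in for real LLM generation

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        return await self.generate(prompt)

app = Generator.bind()
# serve.run(app)
```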
[00:30:14] Krishna Gade: And what have you seen customers do in such a scenario? Are they writing custom evaluators for these things? What have you seen customers do when they're trying to evaluate the quality of their models?
[00:30:31] Robert Nishihara: Yeah, on the quality side, of just how well the thing works, the starting point is always to look at the data by hand, score quality, and come up with good reference examples that you can run benchmarks on.
[00:30:54] Robert Nishihara: It can be done with a small amount of data, but you're often hand-labeling a number of examples to begin with. We also see people using AI quite a bit to come up with these reference examples, to come up with hand-labeled examples. For example, one of the applications we built in house was question answering for our documentation.
[00:31:20] Robert Nishihara: If people want to use Ray, here's a chatbot you can use to ask Ray questions and get answers, right? And so we wanted to come up with evals for this, so that if we swap in a new model, Llama 3.1 or whatever, we can quickly evaluate how good that was, whether the change was beneficial or not.
[00:31:42] Robert Nishihara: And in order to generate the evals, what we did was generate a synthetic dataset of questions and answers. We did that by taking our documentation, randomly selecting one page from the documentation, feeding it into GPT-4, and asking GPT-4 to generate some questions and answers, or take a passage and generate a question that can be answered with that passage, and then pairing those together to form your dataset.
[00:32:19] Robert Nishihara: Then you use those to run the evals: you have the system take in a question, generate the answer, and then compare that to the reference answer. And you have another LLM do that comparison. So there are a lot of steps and a lot of AI being used.
[00:32:39] Robert Nishihara: But that's a common pattern.
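A rough, hedged sketch of the pattern Robert describes: generating synthetic question/answer pairs from documentation pages and then using an LLM as the judge when a new model is swapped in. The model name, prompts, and the rag_app callable are placeholders for whatever system is actually being evaluated.

```python
# Hedged sketch: synthetic eval-set generation plus LLM-as-judge scoring.
# Assumes the OpenAI Python client; model names and prompts are illustrative.
import random
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder generator/judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def build_eval_set(doc_pages, n=20):
    # Pair each sampled documentation page with a generated question and reference answer.
    evals = []
    for page in random.sample(doc_pages, min(n, len(doc_pages))):
        question = ask_llm(f"Write one question answerable from this passage:\n{page}")
        reference = ask_llm(
            f"Answer the question using only this passage.\nPassage:\n{page}\nQuestion: {question}"
        )
        evals.append({"question": question, "reference": reference})
    return evals

def judge(question, reference, candidate) -> bool:
    # A second LLM call compares the system's answer to the reference answer.
    verdict = ask_llm(
        "Does the candidate answer match the reference answer? Reply YES or NO.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"
    )
    return verdict.strip().upper().startswith("YES")

def run_evals(rag_app, evals) -> float:
    # rag_app is the system under test (e.g. a docs chatbot), assumed to map question -> answer.
    correct = sum(judge(e["question"], e["reference"], rag_app(e["question"])) for e in evals)
    return correct / len(evals)
```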
[00:32:44] Krishna Gade: Yeah. So this is kind of interesting, right? Early on in the ML world, evaluation used to mean evaluating a closed-form function, like root mean squared error or precision and recall. Whereas now, evaluation means you're running a model to evaluate another model, as you just described.
[00:33:00] Krishna Gade: Are you seeing this growth of evaluation models with customers? Is that taking off?
[00:33:08] Robert Nishihara: Right. It's so funny. In the past: model evaluation, what do you mean, model evaluation? You just compute the accuracy, how many does it get right and how many does it get wrong.
[00:33:21] Robert Nishihara: But it's very different when the output is a sentence or an image.
[00:33:26] Krishna Gade: That's right.
[00:33:28] Robert Nishihara: By the way, the whole field has really shifted in a lot of ways over the past decade, right? I'm sure you remember, in the years after ImageNet, when everyone was excited about deep learning, a lot of the field was driven by the ImageNet benchmark.
[00:33:51] Robert Nishihara: Every year, people came up with new models that performed better on the ImageNet benchmark. And the dataset was static, right? It was just the ImageNet dataset. You split it into your train and test data, and it was all about whether you could come up with better model architectures and better optimization algorithms to do better on that dataset.
[00:34:11] Robert Nishihara: And now the optimization algorithm is more static, right? It's variants of stochastic gradient descent. The model architecture, of course, still has a lot of innovation, but it's more static than before because you have transformers and so forth. And all of the innovation is really going on on the dataset side, right?
[00:34:33] Robert Nishihara: That was considered a static thing in the past, and now that's actually where people are putting all of their energy, spending tons of money, and using lots of AI to curate the data. And that's a complete paradigm shift.
[00:34:47] Krishna Gade: Yeah, absolutely. Awesome. Let me take some audience questions.
[00:34:50] Krishna Gade: There are some questions around the training aspects. There's a specific question: how do you differentiate between PyTorch DDP and the Ray distributed training library? Are they complementary?
[00:35:03] Robert Nishihara: They are complementary. Most people using Ray use Ray along with PyTorch or with other deep learning frameworks.
[00:35:14] Robert Nishihara: One way to use Ray is to take your PyTorch model and use our Ray Train wrapper: you pass in your model, and Ray will set up different processes on different machines and set up DDP or other distributed training protocols between the PyTorch training
[00:35:41] Robert Nishihara: functions in the different processes. So essentially, in that case, it's a thin wrapper around DDP, around PyTorch. What Ray adds is setting it up easily, having a standardized way to do this across different frameworks, a way to handle the data ingest and preprocessing and feed that in, and the fault tolerance pieces around handling application or machine failures.
[00:36:08] Robert Nishihara: So it is very complementary. In some sense, you can think of frameworks like PyTorch, and also various inference engines like vLLM and TensorRT-LLM, as being focused on running the model as efficiently as possible on the GPU or on a set of GPUs, really single-machine performance.
[00:36:29] Robert Nishihara: And then Ray handles the multi-machine scaling challenge, a lot of the distributed systems challenges. So those are very complementary things.
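A hedged sketch of the "thin wrapper around DDP" idea: a plain PyTorch training loop where Ray Train's utilities set up the distributed process group, wrap the model in DDP, and add a distributed sampler. The toy model, data, and worker count are placeholders.

```python
# Hedged sketch: Ray Train setting up DDP around an ordinary PyTorch training loop.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))  # toy data
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    # prepare_data_loader adds a DistributedSampler; prepare_model wraps the model
    # in DistributedDataParallel and moves it to this worker's device.
    loader = ray.train.torch.prepare_data_loader(loader)
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for epoch in range(config["epochs"]):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        ray.train.report({"epoch": epoch, "loss": float(loss)})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),  # use_gpu=True on a GPU cluster
)
result = trainer.fit()
```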
[00:36:39] Krishna Gade: Yeah. So I guess the next question is probably around RAG and embeddings. How does it work for those use cases?
[00:36:51] Krishna Gade: Maybe you could take some specific examples of customers developing RAG applications on top of your platform.
[00:37:01] Robert Nishihara: Yeah. So there are many different pieces to the whole RAG pipeline, right? You may do a subset of these and not all of them, but first of all, there's embedding computation,
[00:37:15] Robert Nishihara: which involves taking your data and processing it, and there are many decisions to make about how you process and chunk the data and compute the embeddings. That can be a large-scale or a small-scale problem depending on the data you have. But the data preprocessing, letting you iterate there and scale it, that's something that Ray does very well.
[00:37:40] Robert Nishihara: Then there's also the actual real-time inference part, where you are serving requests, right? And there's often a number of models being composed together to do this. It's not necessarily just one LLM. You have the retrieval stage, where you are retrieving different pieces of content based on your embedding of the query.
[00:38:09] Robert Nishihara: You may use additional models to rank the context and decide what context to feed into the model. You can also use other models to rewrite the query before you embed it. And then you feed that into your generation model and generate the output, and you may have other models fact-check or check the correctness of the output.
[00:38:36] Robert Nishihara: So the way you're doing serving often actually involves running many different models together. That's something we didn't talk about as much, but the inference problems we see with Ray increasingly involve more and more models deployed together, not just a single model, and many calls to different models.
[00:38:55] Robert Nishihara: So there's the serving side of things, which can have a growing amount of complexity. And there's also this iteration loop where you need to continually improve the quality of the application, right? It's not just the quality of the model, it's the quality of the end-to-end application.
[00:39:13] Robert Nishihara: That may mean fine-tuning the model. It may also mean fine-tuning your embeddings. It may mean chunking your original dataset in different ways. And so it's actually very important to develop unit tests for different stages of the RAG pipeline, right?
[00:39:34] Robert Nishihara: Because if you make some change and it gets better or worse, say it gets worse, you need to know what got worse, right? Is it the generation that got worse? Is it that you fed the wrong context in? Maybe you ranked the context incorrectly. Maybe you retrieved the wrong context in the first place.
[00:39:55] Robert Nishihara: So there's a whole software engineering practice that needs to be built up around this, in order to really unit test these different pieces and quickly identify where the room for improvement is. Where Ray comes in: Ray doesn't solve all of these pieces, but Ray is useful for all of the compute pieces.
[00:40:20] Robert Nishihara: The data handling, the fine-tuning, the serving, so that the people developing these RAG applications can focus mostly on the application logic and how information is flowing from one place to another.
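To ground the embedding-computation piece, here is a hedged Ray Data sketch that chunks documents on CPUs and computes embeddings with a pool of GPU workers. The chunking rule, model name, paths, and the concurrency/num_gpus arguments (which follow recent Ray Data versions) are illustrative assumptions.

```python
# Hedged sketch: scalable embedding computation for a RAG pipeline with Ray Data.
import ray

def chunk(row):
    # Naive fixed-size chunking; real pipelines tune chunk size and overlap.
    text = row["text"]
    return [{"chunk": text[i:i + 512]} for i in range(0, len(text), 512)]

class Embedder:
    def __init__(self):
        # Loaded once per worker; the model name is a placeholder.
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["chunk"]))
        return batch

ds = (
    ray.data.read_text("docs/")  # placeholder path to documentation files
    .flat_map(chunk)             # CPU-side chunking, parallelized across the cluster
    .map_batches(Embedder, concurrency=4, num_gpus=1, batch_size=64)  # GPU embedding workers
)
ds.write_parquet("embeddings/")  # then load the results into your vector database of choice
```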
[00:40:36] Krishna Gade: Got it. That makes sense. I guess people want to know how this differs from the RAG architecture that has emerged, where you use a vector database, a modeling layer, and an orchestration system together to hook up your RAG application.
[00:40:58] Krishna Gade: Say you use one of the open-source vector databases, like Qdrant or whatever, maybe you have Llama, and you're trying to put all of these things together through LlamaIndex. How does that workflow differ from using something like Ray, and what are the pros and cons here?
[00:41:20] Robert Nishihara: I mean, they're complementary. Ray is very focused on the compute pieces.
[00:41:26] Krishna Gade: Okay.
[00:41:27] Robert Nishihara: We don't actually provide a vector database. We don't store your embeddings. So people who are using Ray to build RAG applications are using Ray along with a vector database, or along with LlamaIndex and these other tools.
[00:41:46] Robert Nishihara: Where Ray ends up being really useful is when you need to scale. Again, if you're running everything on your laptop, that's fine, and that's not the regime where you need Ray. Of course, it can be useful on your laptop for using multiple cores and such, but where Ray really adds value is when you need to scale things, when some step of the process starts to be too slow.
[00:42:14] Robert Nishihara: Maybe you need to fine-tune on more GPUs, or maybe you need to do the data pre-processing on more CPUs, these kinds of things.
[00:42:28] Krishna Gade: Yeah, basically moving from single-node or workstation-based RAG applications to a distributed, highly scalable RAG application using that compute layer, essentially.
[00:42:39] Robert Nishihara: And giving you that experience while making it feel very similar to just writing Python on your laptop.
[00:42:45] Krishna Gade: Yeah, yeah.
[00:42:45] Robert Nishihara: By the way, I actually want to emphasize the sort of complexity we're seeing around inference, because it's been growing quite a bit. A lot of people think of machine learning serving as taking a model, hosting it behind an endpoint, and maybe auto-scaling the replicas of the model.
[00:43:05] Krishna Gade: Right.
[00:43:06] Robert Nishihara: But we're starting to see applications, AI products that people are building, where they want their product to complete an end-to-end task, like booking an Airbnb, sending an email for you, or making a reservation at a restaurant. Writing code is another example.
[00:43:28] Krishna Gade: Yep.
[00:43:28] Robert Nishihara: And the way they're architecting these today, at least, it's not a single call to a single model. If you want to write code, or you want to book an Airbnb, there are many steps, right? You have to take some vague description a person gave about what kind of vacation they're looking to have, and you may have a model turn that into different types of requirements.
[00:43:53] Robert Nishihara: You may have another model retrieve different candidate Airbnbs. You may have another model score each candidate against the different criteria. You may have a model generate an explanation for each of the top-ranked Airbnbs, explaining why it is a good choice based on the criteria.
[00:44:17] Robert Nishihara: I'm oversimplifying it, but completing these end-to-end tasks often involves many calls to different models. And so you end up with a serving challenge that's not just a single model behind an endpoint. It's highly dynamic, where you have calls to different models that are determined based on the output of previous models.
[00:44:42] Robert Nishihara: You may start with GPT-4 for all of these just to prototype and get it working. But when you have hundreds of model calls stacked together, at some point you often find that, hey, latency really adds up, the cost really adds up. And you don't actually need a fully general-purpose model for each of these things; in some cases, a small or specialized model that's really good at one thing may do the trick.
[00:45:10] Robert Nishihara: And so you end up starting to use fine-tuning and open-source models, and composing these things together. So it can actually get quite tricky. And I think this is the direction that inference for machine learning is going: in the coming years, you're going to see very complex serving systems with huge numbers of models being composed together in a very dynamic way to complete complex tasks.
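As a concrete illustration of this kind of multi-step, multi-model serving, here is a hedged Ray Serve sketch where a planner deployment fans out to smaller specialist deployments. The stand-in model classes and the task decomposition are purely illustrative, not Anyscale's implementation, and handle semantics vary somewhat across Ray Serve versions.

```python
# Hedged sketch: composing several model deployments into one end-to-end task with Ray Serve.
from typing import List
from ray import serve
from starlette.requests import Request

@serve.deployment
class RequirementsModel:
    async def __call__(self, text: str) -> List[str]:
        # Stand-in for a small model that turns a vague request into requirements.
        return [w for w in text.split() if len(w) > 4]

@serve.deployment
class CandidateRetriever:
    async def __call__(self, requirements: List[str]) -> List[str]:
        # Stand-in for retrieval of candidate listings.
        return [f"candidate matching {r}" for r in requirements[:3]]

@serve.deployment
class Explainer:
    async def __call__(self, candidate: str) -> str:
        # Stand-in for a model that explains why a candidate fits.
        return f"{candidate}: good fit because ..."

@serve.deployment
class Planner:
    def __init__(self, requirements, retriever, explainer):
        self.requirements = requirements
        self.retriever = retriever
        self.explainer = explainer

    async def __call__(self, request: Request) -> List[str]:
        text = (await request.json())["request"]
        reqs = await self.requirements.remote(text)          # model call 1
        candidates = await self.retriever.remote(reqs)       # model call 2
        # Fan out further calls; each explanation is a separate model invocation.
        return [await self.explainer.remote(c) for c in candidates]

app = Planner.bind(RequirementsModel.bind(), CandidateRetriever.bind(), Explainer.bind())
# serve.run(app)
```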
[00:45:39] Krishna Gade: Absolutely. I think there's this whole notion of model merging that has emerged, where people want to blend these different models, merge responses together, and hook up these workflows. This is great. So I guess, finally, as you describe the stack: you may have a vector database, a modeling layer of your choice, an orchestration system,
[00:46:07] Krishna Gade: and Ray can function as this really robust distributed computing framework to bring all of these together. And then, how does observability fit into this? For example, would you consider it an additional add-on to this workflow to evaluate the metrics that we talked about?
[00:46:24] Robert Nishihara: Yeah, I think of observability as critical, and it comes up in a lot of ways. You often think about how hard it is to develop an application, but where a lot of the time is spent is really debugging when something goes wrong. And many things can go wrong.
[00:46:47] Robert Nishihara: It can be that something is not behaving correctly, or something is crashing, or it's working just fine but it's too slow. There are many forms of this: you could be debugging performance issues or debugging correctness issues. And that is something we've seen people spend countless hours trying to resolve.
[00:47:12] Robert Nishihara: And if you don't have the information readily at your fingertips that you need to answer those questions, then you may have a really bad time. Of course, sometimes you don't know what information you wanted to log or store until the error happens, and at that point it's too late.
[00:47:32] Robert Nishihara: So it's very hard to run production systems without observability tooling. In fact, it's really essential.
[00:47:42] Krishna Gade: Absolutely. I think that's a great line to sign off on. Thank you so much, Robert, for joining us on this webinar and sharing your valuable thoughts.
[00:47:57] Krishna Gade: I've learned a lot. For those of you who are thinking about moving from workstation-based AI development for prototyping to productionization, please definitely look at Ray and Anyscale; they're doing some great work with tech-forward companies. And of course,
[00:48:15] Krishna Gade: as you think about observability, we at Fiddler are always there to help you out. That's about it for this AI Explained. Thank you so much, Robert.
[00:48:26] Robert Nishihara: Thank you.