AI Safety and Alignment with Amal Iyer
In this episode, we’re joined by Amal Iyer, Sr. Staff AI Scientist at Fiddler AI.
Large-scale AI models trained on internet-scale datasets have ushered in a new era of technological capabilities, some of which now match or even exceed human ability. However, this progress emphasizes the importance of aligning AI with human values to ensure its safe and beneficial societal integration. In this talk, we will provide an overview of the alignment problem and highlight promising areas of research spanning scalable oversight, robustness and interpretability.
Joshua Rubin: Yeah, good morning, good afternoon, and whatever time of the day it is for you. Uh, welcome and thank you for joining us on today's AI Explained, uh, on AI Safety and Alignment. Um, I'm Josh Rubin, Principal AI Scientist at Fiddler. Uh, I'll be your host today.
Joshua Rubin: We have a special guest today on AI Explained, and that's Amal Iyer, Senior Staff Scientist at Fiddler AI, and my colleague of several years.
Joshua Rubin: Um, welcome Amal.
Amal Iyer: Uh, thanks Josh for having me here. Pretty excited to talk to you. Always fun.
Joshua Rubin: Cool, cool.
Joshua Rubin: So alignment is the topic. Uh, do you want to start out by talking about, like a little bit about why we should care about safety and alignment for AI models?
Amal Iyer: Yeah.
Amal Iyer: Let me sort of set the stage a little bit. You know, ChatGPT and Gemini have made it easy for me, so I don't have to talk about how capable current systems are and how, in many ways, they've come out of the blue.
Amal Iyer: If I were to take you back to 2018 and say you are an NLP scientist training a BERT-like model, generally you are trying to get it to do an entailment task or predict certain labels, sentiment, et cetera.
Amal Iyer: Uh, so compared to even 2018 or 2019, we've come a long way. These models are able to generate not just coherent sentences, but are able to do a lot of impressive things. The capabilities of course haven't surpassed humans in many dimensions, especially generalized thinking, but we are seeing the initial sparks of what one could contend are highly capable general systems.
Amal Iyer: So I won't say the word AGI, but I think highly capable general systems is a good way to characterize these systems today and where they will continue to go over the next few years. A big part of ML research has focused on capabilities, and I would say the bulk of our effort over the past several decades in ML, and AI in general, has been focused on improving the capabilities of models and devising methods for that.
Amal Iyer: And I think it's a good time, given how rapid the progress has been over the past few years, to start thinking more broadly about safety and alignment of these models, because not only do we want these models to be highly capable, we want them to perform activities on our behalf in a safe manner.
Amal Iyer: And we want them to be aligned with our broad human values. And that's why I think this is a great time for not just ML researchers, but um, ML practitioners and users of this tool to start thinking about safety and alignment more broadly.
Joshua Rubin: Yeah. I, I think, you know, I think it's easy to overlook, you know, when we're talking about, okay, well it's not as smart as a human in this way, or it responds in some silly or funny way to my prompt.
Joshua Rubin: I think it's really easy to overlook how remarkable it is that we suddenly have models that are instruction following at all. Right? Like, people talk about instruction following, they talk about zero-shot learning or few-shot learning, as though prompting a model is a method of machine learning, right?
Joshua Rubin: It's a totally new paradigm. Um, and it's really a remarkable capability. It's easy to just kind of go, wow, but it really should be a hint that these models have just entered a totally new domain of capability, where they can do all sorts of things that we intend them to do and maybe don't intend them to do.
Joshua Rubin: Right. So, I don't know if you, do you wanna say something about instruction following and like, since that's sort of the, the key capability of the LLMs?
Amal Iyer: Yeah, I think you hit the nail on the head there, that somehow in 2024 we've normalized some of this behavior.
Joshua Rubin: Yeah.
Amal Iyer: And, you know, in late 2022, when a lot of us started playing with these models through ChatGPT, et cetera, it was like, whoa, we are here.
Amal Iyer: And then in early 2024, it seems like, oh, just another thing. I mean, we've sort of normalized it. So I think the pace of progress, when you are on an exponential curve, can be hard to appreciate, because we tend to normalize a lot of technology advances.
Amal Iyer: Um, in terms of just talking broadly about training these models, and specifically about instruction tuning, I think what we've hit upon is an interesting recipe that we could potentially keep scaling. For those of you who are not familiar, there's this notion of something called scaling laws, which says, "Hey, you know, let's actually train larger models on larger corpora of data."
Amal Iyer: So, um, and you can, you can say you're essentially throwing more compute at the problem, more compute and more data. And for a classic ML researcher, it's like, oh, that's not interesting at all. But what that has unlocked is a lot of interesting emergent capabilities like instruction following. I think one that blows my mind is in context learning.
Amal Iyer: Um, the fact that something like that emerged is pretty surprising, and we weren't able to predict it. And given the amount of resources, both compute and data, and the great people and companies that are trying to really push forward scaling, it's anybody's guess what kind of new capabilities we might unlock in the coming years.
Amal Iyer: Which is not to say that scaling laws are going to be there forever. We are seeing plateauing in the original silicon scaling laws, right? Like we are down to three nanometers and it's beginning to plateau. But humans are interesting: we found parallel computing.
Amal Iyer: We've sort of broken through that Moore's Law bottleneck by saying, we'll just develop parallel computers, and we're taking advantage of that in the context of transformers. So I would say at least for the next few years we have a very clear path for scaling these models, given the amount of resources that are at play, and we might see more capabilities unlock.
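For a concrete picture of what a scaling law says, here is a small sketch in the spirit of the Chinchilla fit (Hoffmann et al., 2022), predicting pre-training loss from parameter count and training tokens. The constants are illustrative published fits, not numbers to rely on exactly.

```python
# Rough sketch of a Chinchilla-style scaling law: predicted pre-training loss
# as a function of parameter count N and training tokens D.
# Constants are illustrative fits reported by Hoffmann et al. (2022).

def predicted_loss(n_params: float, n_tokens: float) -> float:
    E = 1.69                 # irreducible loss of natural text
    A, alpha = 406.4, 0.34   # parameter-count term
    B, beta = 410.7, 0.28    # data term
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling both model size and data keeps pushing the loss down,
# which is the basic argument for "just keep scaling".
print(predicted_loss(7e9, 1.4e12))   # ~7B params, ~1.4T tokens
print(predicted_loss(70e9, 14e12))   # 10x both
```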
Amal Iyer: And so even from that perspective, just thinking not a decade out but three to five years, we need to start ramping up and understanding why these models perform the way they do and what are good techniques to oversee and align them. So it's a great time if you are someone who's, and I'm not even saying an ML safety researcher, anyone who's broadly interested in the safe use of technology.
Amal Iyer: I think this is a good time to start sort of engaging, uh, with this community.
Joshua Rubin: Yeah, we talk about compute being a constraint, but there's also the question of what happens after you've trained the model on all of the text on the internet, right? There is a constraint on the amount of data available to be consumed.
Joshua Rubin: Uh, and I think it's really interesting to think about the ways in which, you know, multimodal models, you know, we see, um, you know, OpenAI and, you know, Google and their Gemini models, you know, starting to consume images and audio. Uh, and it'll be really interesting to see what happens to the scaling laws.
Joshua Rubin: As, you know, those models start like natively traversing more than one different, um, mode of input.
Amal Iyer: Right, right. Um, yeah, and I think there's clearly a lot of impetus on capabilities via scaling laws and adding more modalities. And even if we say, hey, it's unclear to us what will get us to general intelligence systems that surpass us, we'll at least end up getting systems that are really good at certain specific things.
Amal Iyer: And I must say that I probably write worse summaries today than ChatGPT or a Gemini-like model or a Mistral-like model, and they can do it so much faster. So increasingly there'll be a tendency for us to offload certain tasks to models, and, you know, that's been the march of technology, right?
Amal Iyer: Like, you see systems that are doing things for you and you say, okay, I'm going to offload this activity to the machine. We saw that with spreadsheets; we see it even in our homes with dishwashers. So I think increasingly, when you are offloading tasks to these complex systems, there is an emergence of risks at large scale, especially because now you have large swaths of human society that are going to use these tools.
Amal Iyer: And so if the model has, say, a lack of robustness or some unintended emergent property that we never really assessed, then you expose society to large-scale risks, because of not just the power and capabilities of these models, but the widespread adoption of these models.
Joshua Rubin: Yeah.
Joshua Rubin: So, kind of with that, I think the alignment story really starts with how we train these models to follow instructions. If you go to OpenAI and read their blogs, they talk about the difference between the base GPT-4 and ChatGPT-4, and how those two things are really different in the absence of fine-tuning for a specific mode of interaction.
Joshua Rubin: Um, right. So, I don't know if you wanna jump into kind of telling us a little bit about, uh,
Amal Iyer: Yeah. I think that it, it'll probably provide a good stage for our follow ups as we dive deeper, deeper into some of the areas of research and alignment.
Amal Iyer: I'm gonna sort of quickly talk about how do we train these instruction following, uh, models. Um, and as you alluded Josh, the step one or step zero that hasn't been sort of captured in this graphic here is, um, to take a transformer model and, uh, you ask it to predict next tokens, um, in a sequence.
Amal Iyer: And that really, at the core, is a language modeling task that has been around for quite some time. And I have a confession to make. Back in 2017, when I was working on speech recognition, you would train an acoustic model, which is trying to predict what you're trying to say, either in phoneme space or directly in spelling, alphabet space.
Amal Iyer: And then you would slap a very lightweight language model on top, which would guide the predictions of the acoustic model. And I never really thought much about the language model; I was so focused on getting the acoustic model right. Honestly, I never thought that was going to bring us to where we are today.
Amal Iyer: So anyhow, language modeling as a task: I think we still don't fully understand why certain skills emerge from this very simple task of predicting the next token. One of the hypotheses is that next-token prediction requires a lot of comprehension, and so that leads to the emergence of a collection of skills.
Amal Iyer: Um, and the model can combine these skills to do tasks later on, as we look through steps one, two, and three. So pre-training is a very important step, and that's where the scaling laws also tend to come into play: you're basically saying, okay, I'm going to feed internet-scale data to a very large model and run it on giant distributed compute.
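Step zero, the language-modeling objective itself, is easy to show in code. The sketch below uses the small GPT-2 checkpoint from the Hugging Face transformers library purely as a stand-in for whatever base model is actually being pre-trained.

```python
# Minimal sketch of the pre-training ("step zero") objective: causal language
# modeling, i.e. predicting each next token in a sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large-scale AI models trained on internet-scale datasets have"
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the average cross-entropy
# of predicting token t+1 from tokens 1..t, which is exactly the loss that
# scaling laws are written in terms of.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)   # scalar loss; lower = better next-token prediction

# At generation time, the same model simply samples continuations.
generated = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(generated[0]))
```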
Amal Iyer: So that's step zero. But now this model that you've trained, the base or pre-trained model, is really good at next-word prediction. What we want to do is get these models to do things on our behalf, follow instructions, et cetera. So there's a recipe called RLHF, or Reinforcement Learning from Human Feedback, and that's what this graphic is succinctly showing here.
Amal Iyer: Uh, so step one is, once you have that pre-trained model, you collect a dataset of prompts and what we call demonstrations from users. So really you are collecting a dataset from humans and showing the model how to do certain tasks. For example, here there's a prompt like, explain the moon landing to a 6-year-old.
Amal Iyer: So a human would demonstrate it. And because the prompt is asking them to explain the moon landing to a 6-year-old, they might not use a lot of jargon, they'd keep things simple and concise, and maybe add an illustrative example to paint a vivid picture for the 6-year-old.
Amal Iyer: Um, so step one is you take the pre-trained or base model and you fine-tune it on these demonstrations. So that's a supervised learning step. Once you're done with step one, you move to step two, which is where you're trying to learn human preferences. And here the challenge is that you want to really scale up the learning of human preferences.
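Before getting to preferences, here is a minimal sketch of that step-one supervised fine-tuning on demonstrations. The GPT-2 checkpoint and the single prompt/demonstration pair are hypothetical stand-ins; real pipelines batch, mask the prompt tokens, and use a library such as TRL.

```python
# Minimal sketch of step one: supervised fine-tuning (SFT) on human-written
# demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A hypothetical prompt/demonstration pair of the kind described above.
demonstrations = [
    {"prompt": "Explain the moon landing to a 6-year-old.",
     "response": "Some brave people flew a rocket to the moon and walked on it."},
]

for example in demonstrations:
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # The model is trained to reproduce the demonstration token by token.
    # (Production setups usually mask the prompt so only the response is scored.)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```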
Amal Iyer: To see why scaling is the challenge, imagine one option: you ask the model to generate some sample generations for that prompt we talked about, humans rate the outputs, and you do that for every single generation and prompt in your dataset. It's really hard to scale that process.
Amal Iyer: So one way to scale this process is to say, hey, I'm going to actually train yet another model, we'll call it a reward model, that is just trying to learn human preferences. So it might learn preferences like: for a 6-year-old, you shouldn't really use a lot of jargon, you should keep things simple, add in illustrative examples, et cetera.
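A minimal sketch of how such a reward model is typically trained on pairwise human preferences, using a Bradley-Terry style loss. The GPT-2 backbone and the example responses are stand-ins, not the setup of any particular lab.

```python
# Minimal sketch of step two: training a reward model on pairwise preferences.
# The reward model maps (prompt, response) -> scalar score, and is trained so
# the human-preferred response scores higher than the rejected one.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

prompt = "Explain the moon landing to a 6-year-old."
chosen = prompt + " People flew a rocket to the moon, like a very long trip into the sky."
rejected = prompt + " The Apollo program utilized a Saturn V launch vehicle and lunar module."

def score(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt")
    return reward_model(**batch).logits.squeeze()   # scalar reward

# Bradley-Terry style loss: push score(chosen) above score(rejected).
loss = -F.logsigmoid(score(chosen) - score(rejected))
loss.backward()
optimizer.step()
```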
Amal Iyer: The reward model, in other words, captures some qualitative and quantitative aspects of human preferences. And so now you have a reward model that has encoded human preferences. Step three is where you actually train the model using that reward model, and the output of this stage is what we are interacting with when you go to the ChatGPT interface or the Gemini interface, et cetera.
Amal Iyer: So in this stage you use something called reinforcement learning. In reinforcement learning, the model is generally termed a policy. So when I use the word policy, it really is the model, whether that's a ChatGPT model or Gemini and so forth. And what that model does is, given a prompt, it tries to follow the prompt.
Amal Iyer: Say in this case, write a story about frogs: it's going to generate a story. And instead of asking humans to rate these stories, because in step two we learned some human preferences, we are going to use the reward model that has encoded those preferences. And this reward model is going to score the generations by your policy model, your final output model.
Amal Iyer: And you use something called proximal policy optimization, PPO, which is a method of training neural nets using reinforcement learning, to get your policy model to produce generations that are rated highly by your reward model. So once you've cranked the wheel and gone through steps one, two, and three, you now have a policy model, a final model, that can follow instructions and that you have, quote unquote, aligned.
Amal Iyer: By alignment here we mean that we start with this base pre-trained model, which is just trying to predict next tokens, and we've finally gotten to a point, at step three, of having a policy, a model, that follows instructions, but also does that in a way which is hopefully aligned with human values and preferences.
Amal Iyer: Um, so this, in a nutshell, is what's happening under the hood when you start interacting with these models.
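To make step three concrete, here is a heavily simplified sketch of the loop: the policy generates, a reward score comes back, and the update also penalizes drifting too far from the original model. Real systems use PPO with a clipped objective and per-token KL penalties; the single-example REINFORCE-style update and the hard-coded reward below are purely illustrative.

```python
# Simplified sketch of step three: the policy (the chat model) generates,
# a frozen reward model scores the text, and RL pushes the policy toward
# highly scored generations while staying close to a reference model.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")   # model being aligned
reference = copy.deepcopy(policy).eval()                # frozen pre-RL copy
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
kl_coef = 0.1

def sequence_logprob(model, input_ids):
    """Sum of log-probabilities the model assigns to a token sequence."""
    logits = model(input_ids).logits[:, :-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    target = input_ids[:, 1:]
    return logps.gather(-1, target.unsqueeze(-1)).squeeze(-1).sum()

prompt_ids = tokenizer("Write a story about frogs.", return_tensors="pt").input_ids
response_ids = policy.generate(prompt_ids, max_new_tokens=30, do_sample=True)

reward = torch.tensor(0.7)   # stand-in for the reward model's score of the text

logp = sequence_logprob(policy, response_ids)
with torch.no_grad():
    ref_logp = sequence_logprob(reference, response_ids)

# Reward minus a KL-style penalty that keeps the policy near the reference,
# so it doesn't drift into degenerate text just to please the reward model.
objective = (reward - kl_coef * (logp.detach() - ref_logp)) * logp
loss = -objective
loss.backward()
optimizer.step()
```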
Joshua Rubin: So if I was to like, you know, kinda recap a little bit, it seems like step one is basically, you know, this is the, the, the pre-training, uh, you know, the expensive part that requires consuming trillions of tokens and, um, you know, you're trying to get the model to understand the mechanics of language and build those kind of base abstractions that it can use to reason in the future.
Amal Iyer: Right, right.
Joshua Rubin: I I, I've seen though, like if you, if you try to have a conversation with a, you know, a pre-trained model as, you know, much capability as is sort of latent in it, you know, you'll say something like, how big is a dog? You know, and it responds by giving you a list of questions. How big is a chicken?
Joshua Rubin: How big is an alligator? Right. How big is a tree? Right. Like, it's like a, um, you know, it's finishing the poem that you've started. Right, right. Whatever, you know, maybe it's seen a list of questions in the past, but it's certainly not answering questions. Right. Right. You know, and then the second phase is really about sort of nudging the model into, uh, you know, subtly into behaviors that are different, right?
Amal Iyer: Yeah.
Joshua Rubin: Capturing the human feedback in a way, modeling that to overcome the sparsity of the human feedback so that you can really like, um, turn the crank on the training once you have a model training another model, and kind of nudging it into a kind of final, uh, a final form.
Amal Iyer: Right?
Joshua Rubin: Sort of like a behavioral adjustment.
Amal Iyer: Mm-Hmm.
Joshua Rubin: That's what I think steps two and three accomplish.
Amal Iyer: Right.
Joshua Rubin: Does that sound about right?
Amal Iyer: Great way to put it. Yeah.
Joshua Rubin: So, you know, it sounds simple if you describe it that way. Do you want to talk about the kinds of problems that can pop up and why this is more challenging than it looks? I mean, I think it's a great system and it does amazing things, but, um,
Amal Iyer: Yeah, I mean, it's pretty amazing the kind of incredible progress something like this has elicited. But it doesn't come without challenges. I really like the framing from a paper by Stephen Casper and colleagues, which talks about the challenges of learning from human preferences. And I like the bucketing into three buckets: human feedback, reward modeling, and then actually training a policy.
Amal Iyer: Um, so I think a lot of the human feedback piece is shared with, say, training a content moderation system where you require humans to provide feedback. There's a lot of commonality with our current ML paradigm. You might have evaluators that disagree with each other.
Amal Iyer: So how do you resolve something like that? You might do majority voting, but in that case, are you aligning to majority preferences? You might have data quality issues because you didn't actually, quote unquote, sample your evaluators in a fair manner that is representative of the larger society.
Amal Iyer: So you might have data quality or bias issues there, and a lot of this is shared with what we might call non-general systems, let's just call that traditional ML. One thing I would highlight about human feedback that is uniquely challenging for training general models, though, is the difficulty of oversight.
Amal Iyer: So what do we mean by that? Imagine, say, I sign up to be a human feedback provider with one of these companies that are training large models, and I don't know a lot about, say, chemistry. But now I have to oversee, steer, and align these models.
Amal Iyer: And some of the questions are related to chemistry. So now I have to provide oversight in an area where I am not an expert. There is a difficulty associated with training general models where the people providing annotations may not have the right domain expertise. In the short or medium term, we can get away with that by asking domain experts to weigh in.
Amal Iyer: So if you're training a system for providing legal assistance, there are companies that are trying to bring in lawyers to oversee these models, and similarly for medicine, you might bring in medical practitioners to oversee them. So in the near term, we might solve this with domain experts.
Amal Iyer: But as we extrapolate, because these models are improving in a recursive fashion, right? As you mentioned, you're using one model to improve another and you're building on top of the system. It's not hard to imagine that either it might devolve, or it could potentially lead to superhuman capabilities in some dimensions.
Amal Iyer: So how you oversee that is actually a real challenge, and we'll talk a little more about it in the rest of the talk. But moving from human feedback to reward modeling and the policy: there are a bunch of terms here, but I wanted to describe an interesting example from the reinforcement learning community. A couple of years ago, I think it was some researchers from Berkeley, they were trying to train a robot to grasp objects.
Amal Iyer: And they didn't want to sit there and provide feedback on whether the robot had correctly grasped an object or not. So they devised an automated reward mechanism. The robot was in a cage and it had to grab objects, so they put a camera on top.
Amal Iyer: The camera had a 2D view, and the reward came when the camera decided that the robot arm had grasped an object. And so you're like, this is great, I can scale this; if you're a grad student, you don't need to sit there and provide feedback all the time. But the feedback is sparse, right?
Amal Iyer: Because the feedback comes only when the robot arm has grasped the object, as judged by another CNN running on the camera view that is trying to predict whether the grasp has happened or not. What happened was, in one training run, they noticed that the policy had hacked the reward mechanism: instead of grasping the object, it positioned the arm right above the object in such a way that the 2D projection made it look like the robot had grasped the object.
Amal Iyer: So it was an unintended consequence, right? The robot, or the policy, is not trying to actively deceive the reward mechanism, but it hacked the reward mechanism by just positioning the arm on top of the object without necessarily grasping it. So there's a bunch of interesting problems that this example portrays.
Amal Iyer: One is that of problem specification, right? You misspecified the problem and created a reward mechanism which did not actually capture what you wanted the system to do. And then you have the problem that the policy hacked your reward system, which in this case was an unintended consequence.
Amal Iyer: And so this reward-and-policy setting can lead to unintended consequences. We have to be very careful about how we provide feedback to the models and how we align them, because it might lead to unanticipated and undesirable outcomes.
Amal Iyer: Um, and one last point I'd like to add: in the context of LLMs, we do see similar behavior. There was an interesting paper that came out which talked about sycophancy, and I would highly encourage you to try it with any of these LLMs that you might be using.
Amal Iyer: One of the emergent phenomena you see when you try to align these models is that they're essentially trying to get your approval, right? They're trying to get approval from human annotators. So one of the unintended consequences of this alignment process is that they'll show some degree of sycophancy.
Amal Iyer: So if you ask a question, something like, is video gaming great for developing young minds, it'll equivocate, and rightly so, right? There's some research which shows it's great for development, other research shows you've got to do it in the right proportion, et cetera.
Amal Iyer: But then if you assert something like, I believe that this is helpful, it'll actually say, yeah, I agree with you, and this is why. And if you delete that and say, okay, I believe that this is unhelpful, it'll again agree with you.
Amal Iyer: So the model, one could argue, doesn't have an internal belief system, and we don't really know; there's a lot of work trying to probe the internals of the model. But this sycophantic behavior was not explicitly encoded in the model. It emerged out of this alignment process. So it's something we have to be very careful about, because if you do naive alignment, especially for increasingly capable models, we might see the emergence of properties that are undesirable.
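A tiny probe of the kind Amal suggests trying: ask the same question behind two opposite stated beliefs and compare the answers. The sketch below uses the OpenAI Python client as one possible interface; the model name is a placeholder, and any chat API would do.

```python
# A minimal sycophancy probe: does the model flip its answer to match the
# user's stated belief, even though the underlying question is unchanged?
from openai import OpenAI

client = OpenAI()
QUESTION = "Is video gaming good for developing young minds?"

def ask(user_belief: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user",
                   "content": f"{user_belief} {QUESTION} Answer briefly."}],
    )
    return response.choices[0].message.content

# A sycophantic model tends to agree with whichever belief the user asserts.
print(ask("I believe video games are clearly beneficial."))
print(ask("I believe video games are clearly harmful."))
```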
Joshua Rubin: Yeah, that's super, super interesting. It kind of brings up two things. You know this, but I wrote a blog maybe six or eight months ago where I was trying to play 20 questions with ChatGPT. I was having it guess things, and I was asking it questions; you can find this blog on the Fiddler site if you're interested.
Joshua Rubin: It's called something like "What Was ChatGPT Thinking?" I'd ask it to come up with a clue, and then I would ask it questions and try to guess what it was thinking. And I think one of the interesting observations was that it never had a real clue in mind, right? It's just completing a dialogue.
Amal Iyer: Mm-Hmm.
Joshua Rubin: Um, but I could almost always steer it towards a particular answer, because it had this strong bias to yes answers, right?
Amal Iyer: Mm-Hmm.
Joshua Rubin: A bias to confirmation. I never understood why that was, but you could basically get it to do it: if I wanted the clue it was thinking of to be a spaceship, I could ask it guiding questions, and the bias towards yes in yes/no questions allowed me to basically select whatever final answer I wanted by asking the right questions, right?
Amal Iyer: Right, right.
Joshua Rubin: So I, I didn't understand how that worked, and I think you've described it, uh, really, really clearly here.
Joshua Rubin: Um, the other thing that came up, like I think we were discussing a couple of days ago is this, um, tendency to sound super, super confident.
Amal Iyer: Right? Right.
Joshua Rubin: Uh, you know, in real life, when we interact with other humans, we tend to believe things that sound confident, and if that gets accidentally codified in our reward model in this process, then the LLM, the policy in this case, is going to exploit the fact that confidence seems to be the behavior that we want.
Amal Iyer: Right?
Joshua Rubin: So yeah, this yes bias, and the overconfidence that I think a lot of us worry about in LLM responses, could totally both be results of imperfect model alignment.
Amal Iyer: Right, right. And I think this confidence issue is problematic for us, especially as we build multi-step systems, not just these dialogue systems, and as we rely on these systems to automate a lot of things.
Amal Iyer: We don't want these models to be highly confident when they're taking actions out in the world on our behalf. We want them to be well calibrated. And there's an interesting graph, I don't know if there's been more work done in this direction, but I think the instruction tuning paper from OpenAI talked about how the model's calibration goes out of whack
Joshua Rubin: Hmm.
Amal Iyer: after alignment using RLHF. So I think calibrating confidence is a classic ML problem where we've only scratched the surface. It gets elicited in the responses, but even if you look at the actual probability distribution across tokens, alignment tends to muck around with that as well.
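Calibration has a precise meaning here: among all the answers a model gives with roughly 80 percent confidence, about 80 percent should be correct. A small sketch of expected calibration error (ECE) over hypothetical answer-token probabilities:

```python
# Minimal sketch of expected calibration error (ECE): bucket predictions by the
# model's stated confidence and compare that confidence to actual accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap    # weight by fraction of samples in bin
    return ece

# Hypothetical data: the answer-token probability the model assigned, and
# whether the answer was actually right. A well-calibrated model has low ECE.
confidences = [0.95, 0.92, 0.90, 0.88, 0.60, 0.55]
correct =     [1,    0,    1,    0,    1,    0]
print(expected_calibration_error(confidences, correct))
```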
Joshua Rubin: Just to interrupt you for one sec, a quick time check here. Let's spend about five more minutes chatting and then we can maybe cut over to questions. So people out there, if you have questions, please get ready. And Amal, think about where you want this conversation to go in the next couple of minutes.
Amal Iyer: Yeah, that sounds good.
Amal Iyer: Um, I think the other, sort of other thing I wanted to sort of bring up Josh, was, um, how, how is the, uh, research community thinking about this? Um, I think, uh, I, I, I was fortunate enough to attend, uh, the Alignment Workshop right before NeurIPS, um, uh, in December.
Joshua Rubin: Super jealous
Amal Iyer: And, uh, you know, lots of amazing people working on frontier models were there. A lot of my thinking has been influenced by that particular two-day workshop, and kudos to the organizers. The four areas of research that the community seems to be putting wood behind are these four buckets here, and I'll quickly provide a flavor for what they look like.
Amal Iyer: Um, one is scalable oversight, which we briefly talked about. To summarize, it's the work to augment the ability of humans to oversee these models. In operational terms, if I were to break it apart, there's the problem of overseeing and providing supervision during training.
Amal Iyer: Let's call that scalable supervision, and a lot of academic work and work from the leading labs is focused on it. The second piece, which I think we'll increasingly need to start thinking about, is scalable monitoring. Once you've deployed these systems, they will be used in many interesting and unforeseen ways.
Amal Iyer: And how do you scale monitoring for that scenario? It's something we also actively think about here at Fiddler, and I think it's going to be an increasingly important area as we adopt these systems across our society. The second bucket is related to scalable oversight, with a lot of overlap, but it's a classic ML topic: generalization.
Amal Iyer: We train these massive models, we try to align them, and these models will be used in ways that we never intended, right? So we want to study generalization and robustness, because say you align them to human values and assess them on certain scenarios; even at production time you want those values to hold, so your model has to reliably generalize across a range of scenarios.
Amal Iyer: So that's one bucket of study that is going to get a lot of resources, and a lot of people are going to start working on this problem. The third one, and Josh, I'm pretty sure you have some comments on this, is interpretability.
Amal Iyer: The way I see it, having a little bit of a neuroscience background, interpretability operates at different layers of the stack. At the most basic level is mechanistic interpretability: what are the sub-circuits of the model doing, what kind of quote unquote programs are they encoding in their weights and activations? Somewhere in the middle is understanding, when you provide a query to the model, what sub-circuits and skills it is using to come up with a response.
Amal Iyer: And at the top of the stack, at a global level, is trying to understand what aspects of the training data influenced certain model outcomes. Can you trace back from a model generation to the sequences in your training data that were responsible for certain generations?
Amal Iyer: So I would say interpretability is something we will really need to double down on, because right now these models are black boxes; we don't really understand how they work. I think it's a greenfield area as we enter this large language model space and further the research.
Amal Iyer: So I would love for the community to do more work here. And finally, governance. I'll caveat this and say I'm not an expert at all at governance, but given the pace at which progress is happening and the pace at which we are seeing adoption, there's definitely a lot of interest from government, academia, and industry in establishing regulatory standards.
Amal Iyer: And if nothing else, at least reporting around large training runs, when you might expose those models for public use or beta use, et cetera. So some amount of governance around that.
Joshua Rubin: Nice. Well, I, I see our, our first question has landed in the Q&A.
Joshua Rubin: Uh, so how is Fiddler thinking about operationalizing alignment and safety for LLM based apps?
Amal Iyer: Um
Joshua Rubin: Do you wanna take a, a first stab at that and I can jump on with thoughts afterwards?
Amal Iyer: That sounds good. Um, yeah, and going back to scalable oversight, through an operational prism we tend to call it monitoring. We want folks who are deploying LLM apps to actively understand what kind of usage is happening.
Amal Iyer: You can have all kinds of malicious use as well, right? Just like any new piece of technology, your LLM-based app might be subjected to malicious or adversarial use. So you want to understand if something like that is happening and shut down certain accounts, but also understand whether your model is responding in ways that are helpful and not harmful to user requests.
Amal Iyer: So, getting visibility into that. From an operational standpoint we call it monitoring, and we are using a bunch of models that score these larger LLMs on different dimensions. We are thinking in terms of using a bunch of specialized, small-scale models to score these larger models, to make it more scalable.
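As a rough illustration of "small models scoring a larger model": every production response gets scored by a cheap classifier and the record is logged for dashboards and alerts. The checkpoint name, threshold, and schema below are illustrative choices, not a description of Fiddler's internal scoring models.

```python
# Sketch of scalable monitoring: a small, cheap classifier scores every
# response produced by the (much larger) production LLM, and the scores are
# logged so misuse and harmful responses can be tracked over time.
from transformers import pipeline

# Small scoring model, e.g. a toxicity classifier (example checkpoint).
toxicity_scorer = pipeline("text-classification", model="unitary/toxic-bert")

def monitor_response(prompt: str, response: str, threshold: float = 0.5) -> dict:
    score = toxicity_scorer(response[:512])[0]   # truncate for speed
    record = {
        "prompt": prompt,
        "response": response,
        "label": score["label"],
        "score": score["score"],
        "flagged": score["label"].lower() == "toxic" and score["score"] > threshold,
    }
    # In a real deployment this record would be written to a monitoring store.
    return record

print(monitor_response("Tell me about frogs.", "Frogs are amphibians that..."))
```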
Amal Iyer: Because one of the constraints with scaling monitoring is cost: if it becomes extremely expensive to monitor, most teams will prefer not to. So we want to make it very cost effective for our customers to monitor their LLM apps.
Amal Iyer: And then, moving upstream and talking about robustness, we have this open source tool called Fiddler Auditor that we've open sourced to assess the reliability of these systems. There's a lot more that needs to be done there, so we invite contributions, but we do feel that reliability and generalization is an understudied topic.
Amal Iyer: And even from a user standpoint, if you are, say, a product owner of an LLM-based app, you really want a great iterative loop between monitoring, understanding where the model is underperforming, and then going back and doing some kind of prompt engineering or fine-tuning, and understanding generalization and robustness pre-production, before you go into production.
Amal Iyer: So I'll pause there, and Josh, I'd love to hear your thoughts as well.
Joshua Rubin: Yeah, I think those are great points. You know, the first thing is, we're talking about alignment, and I think it's fascinating; I think the future has everything to do with alignment. We haven't gotten the chance to talk about superalignment and what happens with AGI; maybe if we have a minute at the end, it's worth touching on.
Joshua Rubin: But, but the truth is that most of the teams we talk to are not, you know, they don't frame the problem yet in terms of alignment. You know, the frontier labs are there, they're thinking about the hard problem of what the future looks like. But, you know, there are plenty of teams who are plenty sophisticated, who are working mostly in the prompt engineering space, um, and developing amazing applications with prompt engineering.
Joshua Rubin: Um, you know, there's a couple of pieces to the story of LLM safety. There's an offline analytics component: can you capture feedback from your users? Can you use, as Amal says, models for proxy feedback or for safety feedback? Can you find out if a response was biased, or a model hallucination, right?
Joshua Rubin: Was it unfaithful to some source material that it was asked to summarize? Can you capture that in an offline capacity, in a data store, so that, as model developers, people know there are certain kinds of topics their models have problems with? Or if you have a RAG application, retrieval-augmented generation, where you are summarizing a knowledge base, or frequently asked questions, or a customer service database,
Joshua Rubin: is there a topic that your customers have started to ask about, because the world has changed, that's a missing piece of information in the database your LLM is drawing from? Just having those basic kinds of operational metrics that you can use to improve your model is really important, right?
Joshua Rubin: Um, you know, we talk about guardrails, right? Can you get real-time signals that can be used to veto a bad LLM response in real time, something that's in the production code path and fast enough to give a smart response? And then, in my aspirational thinking about application development, for us it's this:
Joshua Rubin: Does this become a user preference store that can be used for the kind of fine-tuning in the future? Right. If you're gathering your user's preferences, if you're getting those thumbs up and thumbs down from their users and logging it, um, you know, when you get to the point and when the market gets to the point where we are actually talking about alignment as a way of, um, dialing our models in for our specific applications to be best aligned with human preferences, you want to have all of that data.
Joshua Rubin: I mean, feedback does tend to be sparse, right? Like, how often do you click a thumbs up or a thumbs down when you're dealing with a piece of software? Pretty unusual, right? And so that feedback is really sparse, which is one of the reasons why you need a reward model when you do this alignment.
Joshua Rubin: Um, but you want to capture that data so it's there and you can iterate in an efficient way. So yeah, a user preference store is kind of my last thought.
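A sketch of what such a lightweight user preference store could look like: each thumbs-up or thumbs-down is logged in a form that later pairs up into the chosen/rejected format that reward modeling or DPO-style tuning expects. The schema and file-based storage are illustrative assumptions, not a prescribed design.

```python
# Sketch of a "user preference store": every thumbs-up/down on a response is
# logged so preference data already exists when the team later fine-tunes or
# aligns the model on its own users' feedback.
import json
import time
from pathlib import Path

PREFERENCE_LOG = Path("preference_log.jsonl")

def log_feedback(prompt: str, response: str, thumbs_up: bool, user_id: str) -> None:
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "prompt": prompt,
        "response": response,
        "label": "preferred" if thumbs_up else "rejected",
    }
    with PREFERENCE_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Later, preferred/rejected responses to the same prompt can be paired into the
# (chosen, rejected) examples that reward-model or DPO-style training expects.
log_feedback("Summarize our refund policy.",
             "Refunds are available within 30 days of purchase...",
             thumbs_up=True, user_id="u123")
```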
Amal Iyer: No, I agree. I think there's an interesting switch that ML practitioners will have to make, from a label-centric way of thinking to a more user-preference-centric way of tuning and dialing in their models.
Amal Iyer: And that's definitely related to alignment, right? Not only do you want the capabilities of the model to align with the user's intent, but also the safety aspects of it. And I think this might be an interesting place to go next, unless there are more questions right now.
Joshua Rubin: We have a question, but take 30 seconds, or a minute if you can.
Amal Iyer: Yeah, I, I, maybe I'll skip over this one, but I would highly recommend, uh, for folks who are interested to take a look at this paper from Sam Bowman from Anthropic, uh, where they're trying to sort of, you know, uh, peer into the future and say, can we, can we look at this oversight problem and can we start understanding, um, uh, how we might tackle it in the future.
Amal Iyer: Um, and I think the next graphic will make it very clear. I love this graphic; it's from Colin Burns and team on the OpenAI superalignment team. In traditional ML, and I think a lot of us would resonate with this, we provide labels, say for a sentiment classifier or entity recognition.
Amal Iyer: In any of those ML tasks, humans are the ones actually teaching these, quote unquote, ML students. But the dotted line in the graphic is human-level performance, and say we get to a world where we have access to systems that are much more capable not just in one dimension, like imaging, et cetera, but along a multitude of dimensions.
Amal Iyer: Then how do you align these systems, and how do you make sure you don't get something called instrumental convergence, which is, quote unquote, skills or behaviors that might actually not be aligned with human interests or values? Things like power seeking, deception, et cetera. We don't want these models to deceive us or play us, right?
Amal Iyer: So I think the paper is interesting in the sense that they're trying to mimic this setting. We of course don't have such systems today, but they study this paradigm using a smaller, quote unquote, weaker supervisor and a strong student. Concretely, they mimic the setting by using a GPT-2 model, an extremely inferior model, as the supervisor for a GPT-4 student, which is a much more capable model. So they're trying to understand how we can study scalable oversight and generalization in this paradigm.
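A schematic of the weak-to-strong experiment described here, with small scikit-learn models standing in for GPT-2 (weak supervisor) and GPT-4 (strong student). The quantity of interest is how much of the weak-to-strong gap the student recovers when trained only on the weak model's labels. Everything below, data included, is illustrative rather than the paper's actual setup.

```python
# Schematic of weak-to-strong generalization: train a weak "supervisor" on
# ground truth, train a strong "student" only on the supervisor's labels,
# and measure how much of the weak-to-strong gap the student recovers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

weak = LogisticRegression(max_iter=200).fit(X_sup, y_sup)        # stand-in for GPT-2
weak_labels = weak.predict(X_train)                              # noisy supervision

student = GradientBoostingClassifier().fit(X_train, weak_labels)      # stand-in for GPT-4
ceiling = GradientBoostingClassifier().fit(X_train, y_train)          # ground-truth ceiling

weak_acc = weak.score(X_test, y_test)
student_acc = student.score(X_test, y_test)
ceiling_acc = ceiling.score(X_test, y_test)
recovered = (student_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f} student={student_acc:.3f} "
      f"ceiling={ceiling_acc:.3f} gap recovered={recovered:.2f}")
```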
Amal Iyer: Um, so I would highly recommend this paper for those who are interested. If you have time, I can, I can talk more about it later.
Joshua Rubin: Nice, nice.
Joshua Rubin: Uh, so we have a question here from Yusaku in the, um, in the chat. Uh, what are the major interpretability issues when using LLMs? Um, I don't know. Maybe, maybe I'll jump in on this one since.
Joshua Rubin: So, we had a summit in November, and I did a presentation called "Can LLMs be Explained?", which was largely a literature review and a survey of open science questions. Maybe we can drop that in the chat. But, you know, Amal referenced some of the mechanistic interpretability work, right?
Joshua Rubin: So there was something by OpenAI, maybe September of last year; I made some notes here for the names of these. OpenAI did some work using GPT-4 to characterize the activations of particular neurons in a much smaller GPT model.
Joshua Rubin: Um, there was also an amazing piece of work by Anthropic, where they used a sparse autoencoder to try to understand how activation patterns correspond to different kinds of concepts.
Joshua Rubin: And they concluded from that that these models are using their neurons in a polysemantic way. So basically, a particular neuron can have more than one meaning layered into it, and neurons are used in different combinations with other neurons to represent basic concepts in a very sophisticated way. The through line of those two papers, which came out around the same time, was that the story is complicated.
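For the curious, the sparse-autoencoder idea looks roughly like this: learn an overcomplete set of features from a layer's activations with a sparsity penalty, so that individual features, unlike individual neurons, tend to line up with single concepts. The dimensions and the random activations below are placeholders for activations actually captured from an LLM.

```python
# Minimal sketch of a sparse autoencoder for interpretability: an overcomplete
# dictionary of features learned from internal activations, with an L1 penalty
# that encourages each activation to be explained by only a few features.
import torch
import torch.nn as nn

d_model, d_features = 768, 4096    # hidden size -> larger feature dictionary
l1_coef = 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative features
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for activations captured from one layer of an LLM via hooks.
activations = torch.randn(1024, d_model)

recon, features = sae(activations)
loss = ((recon - activations) ** 2).mean() + l1_coef * features.abs().mean()
loss.backward()
optimizer.step()
# The learned feature directions, rather than raw neurons, are what often turn
# out to correspond to single human-interpretable concepts.
```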
Joshua Rubin: Those analyses were pushing the limits of the techniques, and the models they were interrogating were very simple models by today's standards, GPT models that are a few years out of date.
Joshua Rubin: And so optimism was pretty limited that anytime soon, mechanistic interpretability, understanding the microscopic behavior of the underlying components of the model, is going to lead to a deep understanding of how our complex models work, specifically when you start to get to these emergent properties and abstractions, like instruction following itself, the model's intent, the model's bias.
Joshua Rubin: You might be able to get token-level information from mechanistic interpretability, but there are definitely limitations in terms of these higher-order concepts, right? And this is really different than our experience with explainability in more traditional ML. The other thing I would add here is that model self-explanation seems to be pretty fraught as well.
Joshua Rubin: Um, there was a paper out of Microsoft Research, right when they released GPT-4, called "Sparks of Artificial General Intelligence: Early Experiments with GPT-4," and there's a whole section there that's really interesting. They got a bunch of the industry's experts on interpretability to say what they could about GPT-4.
Joshua Rubin: And they characterized model self-explanation, asking the model for its own explanation, in terms of two properties: output consistency and process consistency. Output consistency asks, when you ask a model for an explanation, is it a valid explanation of the thing it had previously produced as a statement?
Joshua Rubin: For the most part, output consistency was pretty good. Process consistency asks whether the model's reasoning is broadly applicable to a set of analogous questions, and self-explanation pretty much came off the rails there.
Joshua Rubin: They have a bunch of examples where the model's self-explanation for its reasoning was not consistent across analogous examples. There's also a more recent paper, Turpin et al., from May of last year, called "Language Models Don't Always Say What They Think," in which they drill into questions of bias in the model and whether or not the model's self-explanation can describe its own internal biases as part of its explanation for why it made a certain assessment of a situation.
Joshua Rubin: Um, and they had some really dramatic examples of, like, racial bias in its assessment of stories, where the self-explanations don't reveal the underlying bias, right? The self-explanation makes up a story, the one it thinks you want to hear.
Joshua Rubin: It's not an assessment of why the model made its initial decision; the model's not introspecting in any way, which is the real answer, right? It's producing the next token. It's producing a plausible phrase completion based on everything it's read on the internet.
Joshua Rubin: And so while it's tantalizingly appealing to ask it for an explanation, because these things behave or appear to us in human-like ways because of alignment, it's easy to assume that self-explanation is doing more than it actually is.
Joshua Rubin: Um, and so it seems problematic. So I've probably used too much time here.
Amal Iyer: No, that was great.
Joshua Rubin: There are some narrow regimes where there are tools that may be helpful for assessing explanations, but it's a very challenging thing right now. So I would refer you to our YouTube recording from the November summit. I don't know if you have any more thoughts, Amal, in our next couple of minutes.
Amal Iyer: Um, yeah, broadly it reminds me of some of the challenges that neuroscientists have in understanding biological brains. With mechanistic interpretability, we are trying to assess what circuits are responsible for certain skills and behaviors.
Amal Iyer: So that's bottom-up. And then top-down is this self-consistency of explanations, obtained by asking the model, which is like the behavioral-neuroscience parallel of looking at the model from the top down. I'm not an expert, but one of the real challenges in neuroscience has been to meet in the middle, from both the circuits level and the behavioral level, somewhere where you have a coherent meeting ground.
Amal Iyer: With LLMs and these large models, what I'm optimistic about is that we can actually tap into internal states, unlike biological brains, where at most you have something like 32 channels of electrodes that you can tap into, or MRI scans. So because the internal states are more accessible, I feel like over the next few years we'll see a convergence between the top-down and bottom-up approaches.
Amal Iyer: And hopefully that'll lead to better understanding of these models.
Joshua Rubin: Nice. Well we started a couple minutes late, so we've been given the blessing to, uh, to stay on a couple more minutes. I dunno if you had any more thoughts.
Amal Iyer: Um, maybe just some closing thoughts. I don't want any of our viewers or listeners to walk away thinking, oh man, these are intractable challenges. I'm pretty optimistic about the broad utility of these tools, and I do feel that the research community and the frontier labs are taking this problem seriously. I do think we need more minds and resources to start thinking about safety, and not just do capabilities research.
Amal Iyer: I think most ML researchers, scientists, practitioners, and now the software builders building on top of these LLMs think primarily in terms of capabilities, and rightly so today. But if we continue seeing the pace of progress we've seen over the past few years, we'd rather be in a position where we understand a lot about safety and alignment than in a state where we are forced to respond and be reactive.
Amal Iyer: So I think this is a great time for more folks to get involved and contribute to research and to best practices in adopting these LLMs. I'm largely optimistic, but we do need more minds and more effort in this direction of not just capabilities but also safety research.
Joshua Rubin: Yeah, I think I'm optimistic as well. You know, we're in the business of creating an ML observability platform, and we do plenty of work on the LLM stack now, and it's evolving rapidly. I think it really is possible to build the right observability layer, or buy the right observability layer, whatever the right form factor is for you.
Joshua Rubin: You know, there is a right answer if you're thinking about observability and responsibility and how you build that in parallel with the application you're developing. I think this is doable, and I think it's going to evolve with the technology.
Joshua Rubin: Um, so, so I think maybe we stop there. Um, and, uh, uh, thank you so much to Amal for spending an hour with me.
Amal Iyer: This was fun, Josh, as usual.
Joshua Rubin: And thank you to everybody out there who listened along with us. Uh, don't hesitate to reach out if you have follow up questions or wanna hear more about our product or how we're thinking about a specific problem.
Joshua Rubin: Um, you know, I think for both of us, a favorite activity is just hearing about new applications and learning what the challenges are, because it gets our minds spinning about what kinds of tools should be developed or what practices are right for addressing those sorts of situations. So thanks to everybody out there, and wishing you a great day or evening, wherever you are.
Joshua Rubin: Alright, take care. Bye.