Keynote: Benchmarking in the Era of Foundation Models

Gen AI Summit 2023 - Open or Closed

Percy Liang, Director of Stanford CRFM, Cofounder at Together.xyz

 


Thu, May 25, 2023 10:25AM • 19:39

SUMMARY KEYWORDS

models, benchmarking, benchmarks, evaluate, generate, text, measure, tasks, human, metrics, scenarios, cases, citations, language, helm, accuracy, effort, calibration, important, ongoing effort

SPEAKERS

Percy Liang

 

Percy Liang  00:00

All right. Hello, everyone. I'm really happy to be here. Thanks so much for having me. We are living in the era of foundation models. It's an incredibly exciting time, but I want to talk about benchmarking, so this is a benchmarks-oriented talk. If it weren't for benchmarks — if it weren't for ImageNet — what would the deep learning revolution in computer vision have looked like? Or SQuAD or SuperGLUE for deep learning in NLP? Benchmarks give the community a North Star, something to aim for, and over time we see benchmark progress. But I also want to point out that benchmarking encodes values: it encodes what we as a community care about measuring. The decision of what tasks to focus on, what languages, what domains — all of these things determine the shape the technology takes. So it's very important. That's my number one takeaway from this talk: I want to impress upon you how central benchmarking is. We like to think about methods or models or products or demos, but none of this would be possible without the measurement that drives it.

So here we are in 2023. This slide is probably already out of date — new models keep coming out every month. There's just an explosion of different models — language models, and foundation models more generally — which we all know about. I don't need to tell you how amazing and impressive it is. But I want to stress that at this moment there is a real problem: our ability to measure the quality of these models. The pace of innovation is happening so quickly that we're not able to keep up. It used to be that you could put out a benchmark and it would last a few years while people made progress toward it. Now the pace of innovation in models has outstripped our ability to keep up, so we need to do something about it.

Okay, language models. Everyone knows what a language model is, but here's one way to think about it: it's just a box that takes text in and produces text out. So in some sense it's a very simple object. So what's so hard about this? Well, what's hard is that this deceptively simple object leads to a huge number of different applications — the applications you see on the internet today. They can generate emails, revise text, explain jokes, all with a single model. That's what the paradigm shift is about. This is why we have general-purpose foundation models that can do all these things, rather than building a separate model for each individual task. But that makes benchmarking a bit of a nightmare.

So what do we want from benchmarking? First, we want to understand what these models can and cannot do. Go on Twitter and you can see all the amazing things, all the demos; you can also see all the people pointing out failure cases. So what is an objective way to think about what these models are capable of? Second, when models are developed and released, everyone evaluates on slightly different things. Which model is better? We need to standardize, and we also need to ground the conversation. There are policymakers, and many of you are trying to decide which model to choose. As a scientist, I believe we need to have the facts; once you have the facts, you can make informed decisions. And finally, benchmarking has classically served as a guidepost for researchers: we want to use benchmarks to develop socially beneficial and reliable models.
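The "single box, many tasks" framing above can be made concrete with a minimal sketch. Nothing here is HELM or any provider's actual code; `LanguageModel`, the helper functions, and the placeholder `echo_model` are illustrative names only. The point is simply that one text-in/text-out interface serves every task through prompting.

```python
from typing import Callable

# A language model viewed abstractly: text in, text out.
LanguageModel = Callable[[str], str]

def draft_email(model: LanguageModel, topic: str) -> str:
    return model(f"Write a short, polite email about: {topic}")

def explain_joke(model: LanguageModel, joke: str) -> str:
    return model(f"Explain why this joke is funny:\n{joke}")

def classify_sentiment(model: LanguageModel, review: str) -> str:
    return model(f"Is the sentiment of this review positive or negative?\n{review}\nAnswer:")

if __name__ == "__main__":
    # A trivial placeholder "model" so the sketch runs end to end;
    # in practice this would wrap an API call or a locally hosted model.
    echo_model: LanguageModel = lambda prompt: f"<completion for: {prompt[:40]}...>"
    print(draft_email(echo_model, "rescheduling Thursday's meeting"))
    print(classify_sentiment(echo_model, "The battery died after two days."))
```

The same `model` object handles all three tasks; only the prompt changes, which is exactly what makes the capability surface so large and benchmarking so hard.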
We want these benchmarks to be ecologically valid — not toy tasks, but things that show up in the real world. We want to go beyond just accuracy and measure other concerns such as bias, robustness, calibration, and efficiency, and I think we need to bring human values into the process. So last year we put out the first installment of our benchmarking effort at the Center for Research on Foundation Models, called HELM — the Holistic Evaluation of Language Models. This is an ongoing effort. Let me tell you about how HELM is designed.

The first principle is broad coverage. Usually benchmarks are assembled from a list of tasks — think about the EleutherAI evaluation harness: there is a list of tasks, question answering, text classification, natural language inference, and so on. But then you wonder: why this list and not another? What's missing? It seems we need something more systematic. Given the large capability surface of these models, I think we have to approach it with a bit more rigor. So we decided to build a taxonomy across tasks, and for each task: what domain are we talking about? Who is generating this text? When was the text written? In what language? And we recognize that many of these cells are empty because we don't have datasets for them. This is a recognition that benchmarking is an ongoing process; the taxonomy is aspirational — these are the things we want to measure, and then we try our best given the resource constraints.

The second thing I want to point out is that benchmarks usually focus on one metric. It's easy — everyone wants one metric, one ranking — but we don't want to care only about accuracy. We want to care about calibration, which captures whether the models know what they don't know; as I'll show later, the models often fall short here. Robustness: if there are cases where you can change your input a little bit and the output dramatically changes, these models are not robust, and they do that. Are these models fair across a wide range of demographic groups? Are they biased? Do they generate toxic content? Are they efficient?

The third thing is standardization. Previously, as I mentioned, benchmarking was a bit ad hoc. If you think about a matrix with the different models as columns and the different datasets as rows, you see that people were not benchmarking their models on the same things. So in HELM we took a very systematic approach and said: here are all the models we can get our hands on, essentially, and here are all the scenarios — let's systematically cover that space.

Okay, so let me give you an idea of what types of scenarios we're looking at initially — and this is not exhaustive, it's not limited to this. We have question answering, information extraction, summarization, toxicity detection, sentiment analysis, various types of text classification tasks, and many more. There are many new ones that involve coding, and recently there are legal datasets, medical datasets, and so on. For each of these scenarios, we evaluate the models and measure accuracy. This involves both classic accuracy — did you get the question right or wrong — but also, for some tasks like summarization, we found that human evaluation is absolutely necessary.
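The taxonomy-and-coverage idea described above amounts to filling out a grid. The snippet below is not HELM's actual framework (that lives at github.com/stanford-crfm/helm); the scenario, metric, and model names are abbreviated stand-ins, and it only illustrates enumerating the full cross-product rather than an ad hoc subset.

```python
from itertools import product

# Illustrative rows (scenarios) and columns (metrics); HELM's real lists are longer.
scenarios = ["question_answering", "information_extraction", "summarization",
             "sentiment_analysis", "toxicity_detection"]
metrics = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]
models = ["model_a", "model_b", "model_c"]  # stand-ins for the ~30 evaluated models

# Systematic coverage: every model is scored on every scenario under every metric,
# instead of each model reporting its own ad hoc subset of benchmarks.
results = {cell: None for cell in product(models, scenarios, metrics)}

print(f"{len(results)} (model, scenario, metric) cells to fill")
for model in models:
    unmeasured = sum(1 for m, s, k in results if m == model and results[(m, s, k)] is None)
    print(f"{model}: {unmeasured} cells still need an evaluation run")
```

Empty cells are visible by construction, which is the point of the aspirational taxonomy: you can see what is not yet measured, not just what is.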
For any paper you read that reports ROUGE, you should really take it with a grain of salt, because those scores are not that well correlated with human judgments. We also looked at calibration, which is the idea that if a model says "I'm 70% confident this is the right answer," it should be right about 70% of the time. This is really important, because if you think about threading these language models through a larger system, you want to be able to rely on their uncertainty estimates; otherwise the system becomes very brittle. Robustness has to do with: if you change words, maybe introduce typos or lowercase the text, does the model's prediction change? Ideally these invariances should be built in, but we noticed that some of the models actually change their predictions as a result. Fairness is a broad topic; we've looked at a fairly narrow subset, thinking about gender and race: if you have inputs referring to one gender versus another, do the two produce results of similar quality? Bias measures, in the generations of these models, how certain types of words — "mathematician," say — are more or less associated with certain gendered words. Toxicity — whether the models generate toxic content — is something we measure as well. Efficiency is obviously very important if you want to deploy a foundation model: you don't want something that takes 10 seconds. And this is a little tricky, because on one hand it's easy to measure the raw runtime of the model, but that depends on how many A100s the model provider decided to allocate and how your request gets queued. So we also compute an idealized runtime: in certain cases we got access to the model architecture, so we can standardize across hardware and get a much more honest answer about the model itself.

So one way to think about what we're doing in HELM is that we have scenarios as rows — the situations and use cases where we want to apply language models — and on the columns are the metrics, capturing what we care about, what we want the language models to do.

So let's go to the models themselves. We evaluated 30 models, from OpenAI, Anthropic, Cohere, EleutherAI, BigScience, NVIDIA, AI21 Labs, and — added over time — Aleph Alpha in Europe, and many other open models. Some of these models are closed, and we only have API access. Some of them are open, where we can download all the weights, and for those we evaluate them by running them ourselves.

One thing I want to highlight about the results: if you go to the website, what we're committed to is full transparency. This means we have not only the tables, which give you the rankings of the different models according to the different criteria, but behind each number — it says 56.9, and usually you don't know what that really means — you can double-click on it and look at the individual instances: some question and some answer, and you can look at the model predictions and judge for yourself. This is especially important in the generative cases, where the output of a model gets rated, say, 4.5 — what does that mean? I think we have to be skeptical of these metrics, because these are complex systems and our measurement is imperfect. So that's why we commit to full transparency: we did the runs, we have the generations, and people can judge for themselves. All the code is open source, it's public. It's a community project.
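Two of the metrics described above lend themselves to short sketches: calibration, commonly quantified with something like expected calibration error, and robustness, checked by perturbing inputs (typos, lowercasing) and seeing whether predictions move. This is not HELM's implementation; `predict` and `toy_predict` are hypothetical classifiers, and the binning and perturbation choices are just common defaults.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of examples in this bin
    return ece

def is_robust(predict, text):
    """Does the prediction survive small, meaning-preserving edits?"""
    perturbations = [text.lower(), text.replace("the ", "teh ")]  # lowercasing, a simple typo
    original = predict(text)
    return all(predict(p) == original for p in perturbations)

# A model that says "90% confident" but is right only 60% of the time is poorly calibrated.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0, 0]))  # ~0.30

# Toy classifier that keys on an exact surface form, so a typo flips its answer.
toy_predict = lambda t: "positive" if "the movie" in t else "negative"
print(is_robust(toy_predict, "I loved the movie"))  # False: the typo perturbation changes it
```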
People can contribute scenarios, metrics, and models, and this is a growing effort. We've been updating it roughly monthly since we came out in December, and we're very busy because there are so many models coming out — we haven't quite done GPT-4 yet, but that's coming soon, and some of the older models we still need to add as well.

But I want to switch gears a little bit and talk about some of the surrounding projects on the theme of evaluation that we've been looking at. One extension: HELM started with language models, but going forward we are thinking about foundation models more generally — multimodal models that can see images and text. The interface would take in not just text but a sequence of text and images, and produce not just text but text and images. That's the general paradigm. We're starting with text-to-image models such as DALL-E and Stable Diffusion, and again we're trying to be as systematic as possible: we're evaluating lots of models on different aspects, from quality to originality, to whether the models capture knowledge, to whether they're biased or toxic, and so on.

Another thing I want to mention is that a lot of these models are used by humans interactively. If you're using ChatGPT, it's not like you just enter one query — it's interactive. And right now our ability to measure interactivity is actually quite poor and limited. So this is an effort to evaluate not just the LM, but the human-LM interaction: if you think about the human plus the LM as one system, how well does that unit function? What we realized is that, by and large, a lot of the top models, like GPT-3, are doing quite well — but not always. There are cases where improvement on automatic metrics doesn't translate into improvement on interactive metrics, and there are cases where, in chasing the automatic metric, the model does something that's only locally optimal — it generates something that's actually hard for the human to work with. So thinking in terms of the human interaction is, I think, an important next step as these models are deployed to humans.

A related project is thinking about the verifiability of generative search engines. These are systems that answer your question not with a list of search results but with generated text accompanied by citations. So you might say: great, they have citations, that means we should trust the answers. But if you take a careful look, we found that only 74% of the citations actually support the statements they're attached to. That is, you know, embarrassingly low. So just because there's a citation doesn't mean the generated text is actually faithful. This is another ongoing effort — we should probably add Bard to the list of systems to evaluate. Another interesting finding is that there's a trade-off between perceived utility and the accuracy of the citations. What happens is that some of the systems generate very helpful-sounding, fluent text, but it's made up, and the citations turn out to be inaccurate; whereas Bing Chat generates things that are very faithful but less useful, because it largely copies and pastes from the sources.
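The 74% figure above is essentially citation precision: of all the citations a system emits, what fraction actually support the statement they are attached to. Here is a minimal sketch of how that tally works, assuming per-citation support judgments (which in the study came from human annotators) are already available; the `Citation` class and the toy data are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    statement: str   # the generated sentence the citation is attached to
    source_url: str  # where the system pointed
    supports: bool   # annotator judgment: does the cited source back the statement?

def citation_precision(citations):
    """Fraction of emitted citations that actually support their statements."""
    if not citations:
        return 0.0
    return sum(c.supports for c in citations) / len(citations)

# Toy example: 3 of 4 citations check out, giving a precision of 0.75 --
# roughly the level the talk reports for current generative search engines.
judged = [
    Citation("The library was founded in 1998.", "https://example.com/a", True),
    Citation("It holds over two million volumes.", "https://example.com/b", True),
    Citation("It is open 24 hours a day.", "https://example.com/c", False),
    Citation("Admission is free to the public.", "https://example.com/d", True),
]
print(citation_precision(judged))  # 0.75
```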
So that's interesting — there's a trade-off here — and again, without rigorous benchmarking we might not uncover this or make meaningful progress.

Okay, so let me wrap up here. I want to impress on you that benchmarks are such an important aspect of AI: they are the thing that drives AI, and they determine where we are headed as a community. And with foundation models, benchmarking is very tricky. It's not just a dog-versus-cat classifier, where what you're trying to measure is pretty easy to pin down. Language models can do so many things — how do we benchmark something that broad? HELM is our effort to undertake this ambitious path. It's holistic: we're looking across a wide range of scenarios, at all the models we can get our hands on, and at all the metrics, not just accuracy. And I want to underscore that this is really something we can't do alone — this should be a community effort. Many of you have interesting use cases from domains like law, medicine, finance, coding, or poetry generation. Please come and contribute, because once we have a critical mass of scenarios, then we, as a community, know what we're trying to accomplish, and then we can measure it and see whether we're making progress or not. Hopefully this resource will be useful to all of you as a dashboard to go and see what's going on with foundation models and get answers. I think this image with the lighthouse really symbolizes what HELM is about: trying to shine a light on this incredibly fast-moving and exciting field. Thanks for your time.
