How do we know if one AI Model is better than another AI Model?
From "Lowering the Loss" to Human Side-by-Sides
How do we know if one AI Model is better than another AI Model? In the broadest sense, better could mean more accessible or more usable, but in any comparison, how do we determine whether anything is better than anything else? Is Twitter better than Facebook? Are green apples better than red apples?
Through boom after boom of recent Deep Learning advances, the measure of success has been how accurately a model can predict the patterns in its training data. This is called lowering the loss. We have some data: if the model learns from half of it, can it reproduce the other half on its own?
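To make that concrete, here is a minimal sketch of scoring a model by its loss on held-out data; the synthetic data and the least-squares model are placeholder choices of mine, not any particular system's setup.

```python
import numpy as np

# A minimal sketch of "lowering the loss": train on half the data,
# and score the model by how well it predicts the held-out half.
# The synthetic data and linear model are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

# Learn from half of the data...
X_train, y_train = X[:100], y[:100]
X_heldout, y_heldout = X[100:], y[100:]

# ...by fitting a least-squares linear model.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The "loss" is the mean squared error on the half the model never saw.
heldout_loss = np.mean((X_heldout @ w_hat - y_heldout) ** 2)
print(f"held-out loss: {heldout_loss:.4f}")
```

The lower that held-out number, the "better" the model, by this measure.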
From what I’ve seen in ML Research, it has not been nearly as common a practice to ask humans whether an AI Model is better; that is seen as HCI research (Human-Computer Interaction). Human participation in evaluating AI introduces a lot of complexity and can be quite time-consuming. It also felt like “lowering the loss” had so much low-hanging fruit as you use more data. In recent years, there have been more hybrid labs that emphasize “human-centered AI.” Even now, human-evals (with Human Side-by-Sides) and algorithmic-evals (like lowering the loss) tend to come from separate communities.
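By contrast, a Human Side-by-Side boils down to showing raters outputs from two models for the same prompt and tallying which one they prefer. Here is a toy sketch; the judgment data below is made up purely for illustration.

```python
from collections import Counter

# A toy Human Side-by-Side: for each prompt, a rater sees outputs from
# model A and model B (order randomized in practice) and picks the one
# they prefer, or calls it a tie. These judgments are fabricated examples.
judgments = ["A", "B", "A", "tie", "A", "B", "A", "A", "tie", "B"]

counts = Counter(judgments)
decided = counts["A"] + counts["B"]

# Win rate of model A among non-tie judgments.
win_rate_a = counts["A"] / decided
print(f"A preferred {counts['A']}x, B preferred {counts['B']}x, "
      f"{counts['tie']} ties -> A win rate (excl. ties): {win_rate_a:.0%}")
```

The bookkeeping is simple; the hard parts are recruiting raters, writing the rating instructions, and deciding what a "win" even means.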
Increasingly, the two types of evals cooperate, from RLHF to autoevals, but I would say that there are two endpoints in the evaluation of AI: (1) studying the AI model and (2) studying the human (using AI). (I have a lot to learn about the human-studies side of things. In my dissertation, I eyeballed my figures with the help of neighboring statisticians. As a grad student, I worked with both human evaluators (paper) and algorithmic/proxy evaluations (paper) without really distinguishing the nature of each approach.)
Recently, I’ve seen content presented by scientists, like Dr. Mike Mozer, who bring new light by bridging the insights of parallel research communities, especially now that our machines emulate human behaviors more and more. We definitely need more forerunners like Mike.
For example, I was at CHI last month, and there was a course on Empirical Research Methods for Human-Computer Interaction, which I took to refresh on the basics of human studies. I asked the instructor whether the process of using human participants to study AI is different from studying humans as they use AI, and he felt he wasn’t qualified to answer. On the ML side, it’s clear that the AI model is the center of focus; on the HCI side, it’s clear that the human is. To me, it seems like the meeting of the two is largely still in the making.
As we involve more human participation in evaluating AI, here are two common practices I’ve been reading up on:
In the coming months, there will be a lot more work around Human-AI Alignment and the Authorial Leverage that AI provides.