AI Benchmarks and Benchmarking

Alex Grzankowski & Ben Henke

Benchmarking is the practice of evaluating the performance of hardware, software, or algorithms by running a series of tests and comparing the results against a set of predetermined standards or the performance of other systems.

AI can pass a law school entrance exam. So what?

Key Points:

  • While AI benchmarking is in its infancy, a growing number of benchmarks promise to assess AI system capabilities such as reasoning, mathematical ability, and value alignment.

  • While traditional benchmarking, such as of a system’s video-processing capabilities, reflects relatively broad agreement about what makes for good performance, AI benchmarks make widely contested bets about what counts as, e.g., good reasoning or moral performance.

  • An important source of this lack of agreement is unclarity about the nature of the feature being evaluated, especially about whether it is a mere performance notion or whether it instead involves features of internal processing.

In computer science, benchmarking refers to the practice of evaluating the performance of hardware, software, or algorithms by running a series of tests and comparing the results against a set of predetermined standards or the performance of other systems. These standards and points of comparison are the benchmarks. Benchmarking contributes to understanding the efficiency, speed, and capabilities of computer systems. Benchmarks can also contribute to innovation and advancement by promoting competition focused on a common goal. Some aspects of AI benchmarking are familiar, but AI benchmarking also raises a number of unique and important issues that go beyond output performance.
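
To make the core procedure concrete, here is a minimal benchmarking sketch in Python: run a set of workloads, time them, and compare the timings against a reference system. The workload and the baseline time below are invented placeholders rather than a real benchmark suite.

import time

# Minimal benchmark harness sketch. The single workload and the baseline
# timing are hypothetical placeholders, not a real benchmark suite.
def benchmark(tasks, baseline_seconds):
    results = {}
    for name, task in tasks.items():
        start = time.perf_counter()
        task()                                            # run the workload once
        elapsed = time.perf_counter() - start
        results[name] = baseline_seconds[name] / elapsed  # ratio > 1.0 beats the baseline
    return results

tasks = {"sort_large_list": lambda: sorted(range(1_000_000, 0, -1))}
baseline = {"sort_large_list": 0.25}                      # assumed reference time in seconds
print(benchmark(tasks, baseline))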

The Familiar

The methodology of evaluating AI systems has its roots in conventional computing evaluation, where tests are designed to measure a computer’s capability on specific computational tasks. On such benchmarks, ‘better performance’ is defined by the speed and efficiency with which the machine accomplishes the tasks. A prevalent issue in such traditional evaluations, and one that carries over to AI benchmarking, is whether the test tasks are pertinent to and representative of the real-world applications they are intended to mimic. Any discrepancy between the two means that higher benchmark scores may fail to reflect better performance in the wild, a problem we’ll call ‘task validity.’ The concern over task validity is particularly acute in AI benchmarking, since the tasks asked of the system in the wild are substantially more diverse. A benchmark for ‘reasoning,’ for example, can only examine performance on a small subset of the ‘reasoning’ tasks that users might ask of the system.
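
One way to picture the task-validity worry is as a coverage question: what share of the tasks users actually pose does the benchmark sample? The task categories and frequencies in this sketch are invented purely for illustration.

# Hypothetical task categories and in-the-wild frequencies, for illustration only.
benchmark_tasks = {"syllogisms", "arithmetic_word_problems"}
in_the_wild = {
    "syllogisms": 0.05,
    "arithmetic_word_problems": 0.10,
    "legal_reasoning": 0.25,
    "planning": 0.30,
    "causal_inference": 0.30,
}

# Fraction of the in-the-wild task distribution the benchmark actually touches.
coverage = sum(freq for task, freq in in_the_wild.items() if task in benchmark_tasks)
print(f"Benchmark covers roughly {coverage:.0%} of in-the-wild reasoning tasks")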

Benchmarking for Quality

Another difference between AI benchmarking and traditional benchmarking is its target. Traditional benchmarking typically relies on wide agreement about what makes for ‘better’ performance on a task (often, the speed with which the system does it). Thus, while people may disagree about exactly which tasks are representative of the tasks ‘in the wild,’ they can agree, relative to a set of tasks, that performing them faster is better, and thus that higher scores on the benchmark reflect better performance. AI benchmarks, by contrast, often assess the quality of AI outputs, which may vary widely from case to case, rather than merely the speed with which the system produces them. To return to our example above, there is room for substantial disagreement about what counts as good reasoning in the first place. Such disagreement gives rise to a problem we’ll call ‘rating validity.’
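
A small sketch can make the rating-validity point vivid: score the very same outputs under two plausible rubrics for ‘good reasoning’ and the ranking of the models flips. The model outputs and rubric weights below are invented for illustration.

# Two models' outputs scored on three (hypothetical) dimensions of reasoning quality.
outputs = {
    "model_A": {"valid_steps": 0.9, "relevance": 0.5, "clarity": 0.6},
    "model_B": {"valid_steps": 0.6, "relevance": 0.9, "clarity": 0.9},
}

# Two contested rubrics: one weights logical validity, the other readability.
rubric_logic_first = {"valid_steps": 0.8, "relevance": 0.1, "clarity": 0.1}
rubric_reader_first = {"valid_steps": 0.2, "relevance": 0.4, "clarity": 0.4}

def score(output, rubric):
    return sum(output[dim] * weight for dim, weight in rubric.items())

for label, rubric in [("logic-first", rubric_logic_first), ("reader-first", rubric_reader_first)]:
    ranking = sorted(outputs, key=lambda m: score(outputs[m], rubric), reverse=True)
    print(label, ranking)   # the two rubrics rank the models in opposite orders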

Internal Features and Performance

AI benchmarking tends to focus on system outputs, such as produced sentences or images, but some features of interest may demand attending to ‘internal’ features of a system and how they unfold. For example, two AI models might perform equally well at distributing loans and so match each other on a typical performance benchmark, yet one model might rely on representations or information we deem irrelevant or even unethical when making a mortgage decision, such as an applicant’s race or gender. Such issues suggest that, in addition to outward performance, benchmarkers should attend to how outputs are reached. Which methods to appeal to on this front (e.g., mechanistic interpretability, causal abstraction) remains a matter of controversy.
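
As one modest step beyond output-only benchmarking, the sketch below runs a counterfactual probe: flip a protected attribute and check whether the decision changes. The loan_model here is a hypothetical stand-in for a trained system, and passing such a probe is not sufficient on its own, since a model can still lean on proxies for the attribute; that is where interpretability methods come in.

# Share of applicants whose decision flips when only a protected attribute is changed.
def counterfactual_flip_rate(loan_model, applicants, attribute, values):
    flips = 0
    for applicant in applicants:
        decisions = {loan_model(dict(applicant, **{attribute: v})) for v in values}
        flips += len(decisions) > 1          # decision changed with the attribute alone
    return flips / len(applicants)

# Hypothetical model that (illegitimately) uses race as a tie-breaker.
def loan_model(app):
    return app["income"] > 50_000 or (app["income"] > 40_000 and app["race"] == "A")

applicants = [{"income": 45_000, "race": "B"}, {"income": 60_000, "race": "B"}]
print(counterfactual_flip_rate(loan_model, applicants, "race", ["A", "B"]))   # 0.5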

Construct Validity

Features that capture the attention of both AI researchers and the general public often extend beyond mere performance metrics. Consider two AI models that achieve identical scores on the Law School Admission Test (LSAT). Despite their matching benchmark scores, their approaches to problem-solving could differ significantly. One model may use brute force or rely heavily on memorizing its training data, while the other demonstrates genuine reasoning capabilities. To effectively evaluate features like intelligence and rationality, benchmarkers need to justify why observed behaviors are indicative of these underlying qualities rather than merely signs of superficial imitation. This issue is known in cognitive science as the problem of ‘construct validity.’
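
One common, though fallible, probe for memorization is to compare accuracy on the original benchmark items with accuracy on lightly paraphrased variants: a large gap is suggestive evidence of memorization rather than underlying skill. The model and the test items below are hypothetical placeholders.

def accuracy(model, items):
    return sum(model(question) == answer for question, answer in items) / len(items)

def perturbation_gap(model, original_items, perturbed_items):
    # A large positive gap suggests memorization rather than a general capability.
    return accuracy(model, original_items) - accuracy(model, perturbed_items)

# Hypothetical model that has memorized the exact original question strings.
memorized = {"If all A are B and all B are C, are all A C?": "yes"}
model = lambda question: memorized.get(question, "no")

original = [("If all A are B and all B are C, are all A C?", "yes")]
perturbed = [("If every X is a Y and every Y is a Z, is every X a Z?", "yes")]
print(perturbation_gap(model, original, perturbed))   # 1.0: suspiciously large gap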

Comparing Validities

It’s important to keep the three notions of validity articulated here apart, since task validity, rating validity, and construct validity present different kinds of challenges for AI benchmarking. One issue that permeates the existing literature on AI benchmarks is a conflation of these different issues, likely stemming from a conflation of the notion of good performance (such as generating good ‘arguments’) with that of good underlying capabilities (such as reasoning well).
