Emerging trends in LLM benchmarking
There is a race to build the best possible LLM. Companies like OpenAI, Anthropic, Google, and Meta are all in the game. But how do we measure how well these LLMs actually perform? Through a process called benchmarking.
Benchmarking means running a model through a set of standardized tests called benchmarks. Designing a good benchmark is hard, and many teams have worked on building good ones for LLMs. HumanEval, Big-Bench, LiveBench, and HELM are some of the more popular benchmarks, and there are also IQ-style tests and similar evaluations.
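At its core, most benchmark harnesses follow the same loop: feed the model standardized prompts, score its outputs against references, and aggregate the results. Here is a minimal sketch of that loop in Python; the `generate` stub and the tiny task set are hypothetical placeholders, not any particular benchmark's API.

```python
# Minimal sketch of a benchmark harness: run a model on standardized
# tasks and score its outputs against reference answers.
tasks = [
    {"prompt": "What is 12 * 9?", "reference": "108"},
    {"prompt": "What is the capital of France?", "reference": "Paris"},
]

def generate(prompt: str) -> str:
    # Stub: replace with a real model call (API or local model).
    return ""

def run_benchmark(tasks) -> float:
    # Exact-match scoring for simplicity; real benchmarks use
    # task-specific metrics (pass@k for code, F1 for QA,
    # rubric grading for open-ended text).
    correct = sum(
        generate(t["prompt"]).strip() == t["reference"] for t in tasks
    )
    return correct / len(tasks)

print(run_benchmark(tasks))  # 0.0 with the empty stub above
```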
Recent developments, however, point to two distinct trends.
- Comprehensive evaluation
Benchmarks like HELM aim to give a holistic picture of LLMs. They run many tests across multiple criteria and evaluate every model against all of them. In the end you get a spider-web (radar chart) view of each model's strengths and weaknesses.
This gives you a clear picture of which problems are harder for LLMs. Logical correctness and robustness, for example, are harder to achieve, while harmlessness and readability are much easier problems to solve.
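To make the spider-web idea concrete, here is an illustrative sketch that plots per-criterion scores as a radar chart with matplotlib. The criteria and scores are made-up placeholders, not real HELM results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-criterion scores for one model, on a 0-1 scale.
criteria = ["Accuracy", "Robustness", "Calibration",
            "Fairness", "Harmlessness", "Readability"]
scores = [0.72, 0.55, 0.61, 0.78, 0.90, 0.88]

# Radar charts need the polygon closed, so repeat the first point.
angles = np.linspace(0, 2 * np.pi, len(criteria), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(criteria)
ax.set_ylim(0, 1)
ax.set_title("Holistic benchmark profile (illustrative scores)")
plt.show()
```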
- Narrow specialized evaluation
In the real world, companies are likely to use LLMs for specific product features. GitHub Copilot, for example, is all about code suggestions. So a second trend has emerged: researchers are building specialized benchmarks that test models on very narrow, highly specialized tasks.
The advantage of such benchmarks is that they tell you which model to pick for your use case. They also push model developers to think creatively about training their models to improve on specific behaviors.
We are seeing new benchmarks being developed to measure LLM performance in finance, medicine, mathematical proofs, logical reasoning, detection of humor and sarcasm, and more.
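To illustrate what a narrow benchmark can look like, here is a toy sketch in the spirit of code-generation benchmarks such as HumanEval: the model must produce a function that passes unit tests. `generate_code` is a hypothetical stub, and real harnesses sandbox the execution step rather than running model output in-process.

```python
# Toy code-generation check: the model completes a function,
# and we score it by whether the result passes unit tests.
problem = {
    "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
    "tests": [((2, 3), 5), ((-1, 1), 0)],
}

def generate_code(prompt: str) -> str:
    # Stub: replace with a real model call.
    return prompt + "    return a + b\n"

def passes(problem) -> bool:
    namespace = {}
    # NOTE: real benchmark harnesses sandbox this step; never exec
    # untrusted model output directly in your own process.
    exec(generate_code(problem["prompt"]), namespace)
    fn = namespace["add"]
    return all(fn(*args) == expected for args, expected in problem["tests"])

print(passes(problem))  # True with the stub above
```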
Conclusion
Two trends stand out in LLM benchmarking today: massive, comprehensive benchmarks that measure the holistic performance of LLMs, and super narrow, highly specialized benchmarks built for specific tasks.