Recently, A.I. systems have gained the ability to process natural language in a way that can substantially increase attorney productivity. At the core of these systems are language models.[1] Several new language models are released every week. Companies usually evaluate the quality of these models using standard benchmark tests. There are hundreds of these tests, but this brief focuses on just one of them – the MMLU (Massive Multitask Language Understanding) benchmark.
Developed in 2021, MMLU tests models with multiple-choice questions spanning 57 subjects—from elementary mathematics to professional fields like law and medicine. While newer benchmarks now better assess complex reasoning capabilities, MMLU remains valuable for the breadth of subjects it tests and for its dedicated legal subcategory, which consists of over 1,700 questions. For legal professionals, MMLU’s professional law subset offers insights into how well these models understand legal concepts and reasoning.
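For readers who want to look at the questions themselves, the professional law subset is straightforward to load and inspect. Below is a minimal sketch, assuming the Hugging Face `datasets` library and the publicly hosted `cais/mmlu` dataset; the variable names are my own.

```python
# Minimal sketch: load and inspect the MMLU professional law questions.
# Assumes the Hugging Face `datasets` library and the public "cais/mmlu" dataset.
from datasets import load_dataset

law = load_dataset("cais/mmlu", "professional_law", split="test")

print(len(law))            # number of questions in the test split
item = law[0]
print(item["question"])    # the fact pattern
print(item["choices"])     # the four answer options
print(item["answer"])      # index (0-3) of the correct choice
```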
Because I regularly use private versions of these language models to assist in some of my legal work, I needed a better understanding of how these models are evaluated. So, I began to write my own benchmarking scripts to test available models for legal reasoning ability.
[1] Often called large language models (LLMs), though this term is becoming outdated as some effective models are smaller and many now handle multiple types of data beyond text.
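In outline, the scripts are simple: format each question and its four choices into a prompt, ask the model for a letter, and compare that letter against the answer key. The sketch below shows that core loop. `ask_model` is a hypothetical placeholder for whatever API call the model under test exposes, and the prompt wording is my own rather than anything prescribed by the benchmark.

```python
# Sketch of a scoring loop for MMLU-style multiple-choice questions.
# `ask_model` is a hypothetical stand-in for a call to whatever model is being tested.
LETTERS = ["A", "B", "C", "D"]

def format_prompt(item):
    """Turn one MMLU record into a multiple-choice prompt."""
    lines = [item["question"], "", "Choose the correct answer from the following choices:"]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}) {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(items, ask_model):
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item)).strip().upper()
        if reply[:1] == LETTERS[item["answer"]]:  # "answer" is the index of the correct choice
            correct += 1
    return correct / len(items)
```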
Below are both the publicly published MMLU benchmark scores and the scores I got from my private tests. Because of time and budget constraints, I could not test each model against the entire set of roughly 1,700 questions in the professional law dataset. My private results are averages from testing models on 25 questions at a time.
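Concretely, the averaging works something like the sketch below: draw several 25-question samples from the professional law set, score each sample with the loop shown earlier, and report the mean. The 25-question batch size comes from my setup; the number of runs shown here is arbitrary.

```python
# Sketch: average accuracy over several random 25-question samples.
import random

def average_score(dataset, ask_model, batch_size=25, runs=4, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    scores = []
    for _ in range(runs):
        sample = [dataset[i] for i in rng.sample(indices, batch_size)]
        scores.append(score(sample, ask_model))  # `score` is defined in the earlier sketch
    return sum(scores) / len(scores)
```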
The table below illustrates the significant discrepancy between published scores and actual performance on legal questions:
[Table: Publicly-Released Scores vs. My MMLU Benchmark Score for each model tested]
Note that my private scores on the legal subset of questions are consistently 5.2% to 21.5% lower than the public scores. In the worst case, Claude Sonnet 3.0 scored only 60%, which means that for a given legal question it would provide the correct answer only 60% of the time. Below is an example of a question from the professional law dataset. (Feel free to skip ahead to the Conclusion if you wish.)
Question: On December 30, a restaurant entered into a written contract with a bakery to supply the restaurant with all of its bread needs for the next calendar year. The contract contained a provision wherein the restaurant promised to purchase "a minimum of 100 loaves per month at $1 per loaf." On a separate sheet, there was a note stating that any modifications must be in writing. The parties signed each sheet. Both sides performed fully under the contract for the first four months. On May 1, the president of the bakery telephoned the manager of the restaurant and told him that, because of an increase in the cost of wheat, the bakery would be forced to raise its prices to $1.20 per loaf. The manager said he understood and agreed to the price increase. The bakery then shipped 100 loaves (the amount ordered by the restaurant) to the restaurant, along with a bill for $120. The restaurant sent the bakery a check for $100 and refused to pay any more. Is the restaurant obligated to pay the additional $20?
Choose the correct answer from the following choices:
A) Yes, because the May 1 modification was enforceable even though it was not supported by new consideration.
B) Yes, because the bakery detrimentally relied on the modification by making the May shipment to the restaurant.
C) No, because there was no consideration to support the modification.
D) No, because the modifying contract was not in writing; it was, therefore, unenforceable under the UCC.
This testing reveals an important nuance: specialized benchmarks provide more accurate guidance than general performance scores. While the 5% to 21% gap in legal reasoning deserves attention, even 80% accuracy can be useful. Top models can analyze legal questions instantly and at scale—dramatically enhancing efficiency in document review, contract analysis, or preliminary research when properly supervised. Understanding these specific capabilities allows legal professionals to leverage A.I.’s strengths while implementing appropriate guardrails around its current limitations.