{"id":13636,"date":"2025-03-17T16:44:29","date_gmt":"2025-03-17T16:44:29","guid":{"rendered":"https:\/\/prizmlaw.com\/site\/?p=13636"},"modified":"2025-04-18T17:37:40","modified_gmt":"2025-04-18T17:37:40","slug":"llm-benchmarking","status":"publish","type":"post","link":"https:\/\/prizmlaw.com\/site\/2025\/03\/17\/llm-benchmarking\/","title":{"rendered":"Evaluating A.I. Language Model  Performance for Legal Reasoning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"13636\" class=\"elementor elementor-13636\">\n\t\t\t\t<div class=\"elementor-element elementor-element-418a1da e-flex e-con-boxed e-con e-parent\" data-id=\"418a1da\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-a5e1661 elementor-widget elementor-widget-pix-fancybox\" data-id=\"a5e1661\" data-element_type=\"widget\" data-widget_type=\"pix-fancybox.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"card mb-3 mb-sm-0 pix-info-card   \"  ><div class=\"card-inner\"><a href=\"\" ><div class=\"card-img-overlay p-0 d-flex flex-column justify-content-end animating fade-in-up\"><div class=\"card-img-overlay-content card-content-box bg-light-opacity-8 pix-p-20\" ><h6 class=\"card-text  text-dark-opacity-5  animate-in\" data-anim-type=\"fade-in\" data-anim-delay=\"800\" style=\"\"><\/h6><h3 class=\"text-heading-default font-weight-bold mb-0 animate-in\" data-anim-type=\"fade-in\" data-anim-delay=\"400\" style=\"\">My Work Evaluating A.I. 
Language Model Performance for Legal Reasoning<\/h3><\/div><\/div><img decoding=\"async\" class=\"card-img-bottom animating fade-in-Img\" src=\"https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/LLM_Legal_Benchmark_Image.png\" alt=\"\"><\/a><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4184461 elementor-widget elementor-widget-heading\" data-id=\"4184461\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">TLDR: Quick Summary <\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-591f414 elementor-widget elementor-widget-pix-feature-list\" data-id=\"591f414\" data-element_type=\"widget\" data-widget_type=\"pix-feature-list.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div id=\"duo-icon-669758\" class=\"slide-in-container w-100  \" ><div class=\"py-2 \"  ><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-category-label-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M18.7509844,5 L9.51625802,5 C8.77711521,5 8.06398929,5.27286824 7.51364195,5.76627467 L3.05202725,9.76627467 C2.97084854,9.83905433 2.89369766,9.91620522 2.82091799,9.99738392 C1.71490367,11.2310364 1.81837481,13.127711 3.05202725,14.2337253 L7.51364195,18.2337253 C8.06398929,18.7271318 8.77711521,19 9.51625802,19 L18.7509844,19 C20.4078386,19 21.7509844,17.6568542 21.7509844,16 L21.7509844,8 C21.7509844,6.34314575 20.4078386,5 18.7509844,5 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><path 
fill=\"var(--pf-icon-color)\" d=\"M18.7509659,7 C19.3032506,7 19.7509659,7.44771525 19.7509659,8 L19.7509659,16 C19.7509659,16.5522847 19.3032506,17 18.7509659,17 L9.51623956,17 C9.26985862,17 9.03214998,16.9090439 8.84870087,16.7445751 L4.38708617,12.7445751 C3.97586869,12.3759037 3.94137831,11.7436788 4.31004975,11.3324613 L4.34751808,11.2928932 L8.84870087,7.25542489 C9.03214998,7.09095608 9.26985862,7 9.51623956,7 L18.7509659,7 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">Language models power A.I. tools that can substantially boost attorney productivity such as ChatGPT, Anthropic\u2019s Claude, and Google\u2019s NotebookLM.[1]<\/span><\/div><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-category-label-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M18.7509844,5 L9.51625802,5 C8.77711521,5 8.06398929,5.27286824 7.51364195,5.76627467 L3.05202725,9.76627467 C2.97084854,9.83905433 2.89369766,9.91620522 2.82091799,9.99738392 C1.71490367,11.2310364 1.81837481,13.127711 3.05202725,14.2337253 L7.51364195,18.2337253 C8.06398929,18.7271318 8.77711521,19 9.51625802,19 L18.7509844,19 C20.4078386,19 21.7509844,17.6568542 21.7509844,16 L21.7509844,8 C21.7509844,6.34314575 20.4078386,5 18.7509844,5 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><path fill=\"var(--pf-icon-color)\" d=\"M18.7509659,7 C19.3032506,7 19.7509659,7.44771525 19.7509659,8 L19.7509659,16 C19.7509659,16.5522847 19.3032506,17 18.7509659,17 L9.51623956,17 C9.26985862,17 9.03214998,16.9090439 8.84870087,16.7445751 L4.38708617,12.7445751 C3.97586869,12.3759037 3.94137831,11.7436788 
4.31004975,11.3324613 L4.34751808,11.2928932 L8.84870087,7.25542489 C9.03214998,7.09095608 9.26985862,7 9.51623956,7 L18.7509659,7 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">The MMLU benchmark is a standard test for evaluating language models. It consists of thousands of multiple-choice questions. Language model scores are the percentage of questions answered correctly.<\/span><\/div><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-category-label-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M18.7509844,5 L9.51625802,5 C8.77711521,5 8.06398929,5.27286824 7.51364195,5.76627467 L3.05202725,9.76627467 C2.97084854,9.83905433 2.89369766,9.91620522 2.82091799,9.99738392 C1.71490367,11.2310364 1.81837481,13.127711 3.05202725,14.2337253 L7.51364195,18.2337253 C8.06398929,18.7271318 8.77711521,19 9.51625802,19 L18.7509844,19 C20.4078386,19 21.7509844,17.6568542 21.7509844,16 L21.7509844,8 C21.7509844,6.34314575 20.4078386,5 18.7509844,5 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><path fill=\"var(--pf-icon-color)\" d=\"M18.7509659,7 C19.3032506,7 19.7509659,7.44771525 19.7509659,8 L19.7509659,16 C19.7509659,16.5522847 19.3032506,17 18.7509659,17 L9.51623956,17 C9.26985862,17 9.03214998,16.9090439 8.84870087,16.7445751 L4.38708617,12.7445751 C3.97586869,12.3759037 3.94137831,11.7436788 4.31004975,11.3324613 L4.34751808,11.2928932 L8.84870087,7.25542489 C9.03214998,7.09095608 9.26985862,7 9.51623956,7 L18.7509659,7 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">The published public MMLU scores are not very 
useful for evaluating which A.I. language models to use for legal reasoning tasks.<\/span><\/div><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-category-label-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M18.7509844,5 L9.51625802,5 C8.77711521,5 8.06398929,5.27286824 7.51364195,5.76627467 L3.05202725,9.76627467 C2.97084854,9.83905433 2.89369766,9.91620522 2.82091799,9.99738392 C1.71490367,11.2310364 1.81837481,13.127711 3.05202725,14.2337253 L7.51364195,18.2337253 C8.06398929,18.7271318 8.77711521,19 9.51625802,19 L18.7509844,19 C20.4078386,19 21.7509844,17.6568542 21.7509844,16 L21.7509844,8 C21.7509844,6.34314575 20.4078386,5 18.7509844,5 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><path fill=\"var(--pf-icon-color)\" d=\"M18.7509659,7 C19.3032506,7 19.7509659,7.44771525 19.7509659,8 L19.7509659,16 C19.7509659,16.5522847 19.3032506,17 18.7509659,17 L9.51623956,17 C9.26985862,17 9.03214998,16.9090439 8.84870087,16.7445751 L4.38708617,12.7445751 C3.97586869,12.3759037 3.94137831,11.7436788 4.31004975,11.3324613 L4.34751808,11.2928932 L8.84870087,7.25542489 C9.03214998,7.09095608 9.26985862,7 9.51623956,7 L18.7509659,7 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">Here I privately test several language models on the \u2018professional law\u2019 subset of MMLU questions.<\/span><\/div><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" 
width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-category-label-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M18.7509844,5 L9.51625802,5 C8.77711521,5 8.06398929,5.27286824 7.51364195,5.76627467 L3.05202725,9.76627467 C2.97084854,9.83905433 2.89369766,9.91620522 2.82091799,9.99738392 C1.71490367,11.2310364 1.81837481,13.127711 3.05202725,14.2337253 L7.51364195,18.2337253 C8.06398929,18.7271318 8.77711521,19 9.51625802,19 L18.7509844,19 C20.4078386,19 21.7509844,17.6568542 21.7509844,16 L21.7509844,8 C21.7509844,6.34314575 20.4078386,5 18.7509844,5 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><path fill=\"var(--pf-icon-color)\" d=\"M18.7509659,7 C19.3032506,7 19.7509659,7.44771525 19.7509659,8 L19.7509659,16 C19.7509659,16.5522847 19.3032506,17 18.7509659,17 L9.51623956,17 C9.26985862,17 9.03214998,16.9090439 8.84870087,16.7445751 L4.38708617,12.7445751 C3.97586869,12.3759037 3.94137831,11.7436788 4.31004975,11.3324613 L4.34751808,11.2928932 L8.84870087,7.25542489 C9.03214998,7.09095608 9.26985862,7 9.51623956,7 L18.7509659,7 Z\" transform=\"matrix(-1 0 0 1 23.806 0)\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">These private test scores show that the tested models perform between 5% - 21% worse on legal reasoning than the published public score.<\/span><\/div><div class=\"pix-feature-list   font-weight-bold     py-2 d-flex align-items-center\" ><div class=\"d-inline-flex align-items-center pix-mr-10 text-secondary\" style=\"font-size:1.2em;position:relative;line-height:1em;text-align:center;\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-check-circle-2\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M12,4 C7.581722,4 4,7.581722 4,12 C4,16.418278 7.581722,20 12,20 C16.418278,20 20,16.418278 20,12 C20,7.581722 16.418278,4 12,4 Z\"\/><path 
fill=\"var(--pf-icon-color)\" d=\"M19.2928932,5.29289322 C19.6834175,4.90236893 20.3165825,4.90236893 20.7071068,5.29289322 C21.0976311,5.68341751 21.0976311,6.31658249 20.7071068,6.70710678 L12.2071068,15.2071068 C11.8165825,15.5976311 11.1834175,15.5976311 10.7928932,15.2071068 L7.29289322,11.7071068 C6.90236893,11.3165825 6.90236893,10.6834175 7.29289322,10.2928932 C7.68341751,9.90236893 8.31658249,9.90236893 8.70710678,10.2928932 L11.5,13.085 L19.2928932,5.29289322 Z\"\/><\/g><\/svg><\/div><span class=\"text-dark-opacity-5\">OpenAI\u2019s \u201co1\u201d models performed the best and provided the correct answer 80% of the time.<\/span><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a525988 elementor-widget elementor-widget-heading\" data-id=\"a525988\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Background<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b6c116 elementor-widget elementor-widget-text-editor\" data-id=\"1b6c116\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p style=\"font-weight: 400;\">Recently A.I. systems have gained the ability to process natural language in a way that can substantially increase attorney productivity. At the core of these systems are language models.<a href=\"applewebdata:\/\/3DB4CDD2-1DCE-4DCA-8B41-91BA6B0ADA0F#_ftn1\" name=\"_ftnref1\">[1]<\/a> Several new language models are released every week. Companies usually evaluate the quality of these models using standard benchmark tests. 
There are hundreds of these tests, but this brief focuses on just one of them &#8211; the MMLU (Massive Multitask Language Understanding) benchmark.<\/p><p style=\"font-weight: 400;\">Developed in 2021, MMLU tests models with multiple-choice questions spanning 57 subjects\u2014from elementary mathematics to professional fields like law and medicine. While newer benchmarks now better assess complex reasoning capabilities, MMLU remains valuable for the breadth of subjects it tests and for its specific legal subcategory, which consists of over 1,700 questions. For legal professionals, MMLU&#8217;s professional law subset offers insights into how well these models understand legal concepts and reasoning.<\/p><p style=\"font-weight: 400;\">Because I regularly use private versions of these language models to assist in some of my legal work, I needed a better understanding of how these models are evaluated. So, I began to write my own benchmarking scripts to test available models for legal reasoning ability.<\/p><p><a href=\"#_ftnref1\" name=\"_ftn1\">[1]<\/a> Often called large language models (LLMs), though this term is becoming outdated as some effective models are smaller and many now handle multiple types of data beyond text.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7431142 elementor-widget elementor-widget-heading\" data-id=\"7431142\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Private Benchmark Test Results<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef835e7 elementor-widget elementor-widget-heading\" data-id=\"ef835e7\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h4 
class=\"elementor-heading-title elementor-size-default\"> (MMLU Professional Law Question Subset)<\/h4>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-536f162 elementor-widget elementor-widget-text-editor\" data-id=\"536f162\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p style=\"font-weight: 400;\">Below are both the publicly published MMLU benchmark scores and the scores I got from my private tests. Because of time and budget constraints, I could not test each model against the entire 1,700 set of questions in the professional law dataset. My private results are just average results from testing models on 25 questions at a time.<\/p><p style=\"font-weight: 400;\">The table below illustrates the significant discrepancy between published scores and actual performance on legal questions:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-dc7903c e-flex e-con-boxed e-con e-parent\" data-id=\"dc7903c\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-bf3bade e-n-tabs-mobile elementor-widget elementor-widget-n-tabs\" data-id=\"bf3bade\" data-element_type=\"widget\" data-settings=\"{&quot;horizontal_scroll&quot;:&quot;disable&quot;}\" data-widget_type=\"nested-tabs.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"e-n-tabs\" data-widget-number=\"200522462\" aria-label=\"Tabs. 
Open items with Enter or Space, close with Escape and navigate using the Arrow keys.\">\n\t\t\t<div class=\"e-n-tabs-heading\" role=\"tablist\">\n\t\t\t\t\t<button id=\"e-n-tab-title-2005224621\" class=\"e-n-tab-title\" aria-selected=\"true\" data-tab-index=\"1\" role=\"tab\" tabindex=\"0\" aria-controls=\"e-n-tab-content-2005224621\" style=\"--n-tabs-title-order: 1;\">\n\t\t\t\t\t\t<span class=\"e-n-tab-title-text\">\n\t\t\t\tOpen AI Models\t\t\t<\/span>\n\t\t<\/button>\n\t\t\t\t<button id=\"e-n-tab-title-2005224622\" class=\"e-n-tab-title\" aria-selected=\"false\" data-tab-index=\"2\" role=\"tab\" tabindex=\"-1\" aria-controls=\"e-n-tab-content-2005224622\" style=\"--n-tabs-title-order: 2;\">\n\t\t\t\t\t\t<span class=\"e-n-tab-title-text\">\n\t\t\t\tGoogle Gemini Models\t\t\t<\/span>\n\t\t<\/button>\n\t\t\t\t<button id=\"e-n-tab-title-2005224623\" class=\"e-n-tab-title\" aria-selected=\"false\" data-tab-index=\"3\" role=\"tab\" tabindex=\"-1\" aria-controls=\"e-n-tab-content-2005224623\" style=\"--n-tabs-title-order: 3;\">\n\t\t\t\t\t\t<span class=\"e-n-tab-title-text\">\n\t\t\t\tAnthropic Models\t\t\t<\/span>\n\t\t<\/button>\n\t\t\t\t\t<\/div>\n\t\t\t<div class=\"e-n-tabs-content\">\n\t\t\t\t<div id=\"e-n-tab-content-2005224621\" role=\"tabpanel\" aria-labelledby=\"e-n-tab-title-2005224621\" data-tab-index=\"1\" style=\"--n-tabs-title-order: 1;\" class=\"e-active elementor-element elementor-element-7eb5478 e-con-full e-flex e-con e-child\" data-id=\"7eb5478\" data-element_type=\"container\">\n\t\t\t\t<div class=\"elementor-element elementor-element-389220b elementor-widget elementor-widget-pix-comparison-table\" data-id=\"389220b\" data-element_type=\"widget\" data-widget_type=\"pix-comparison-table.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"w-100 \" ><div class=\"container\" id=\"pix-event-2908967\"><div class=\"sticky-top pix-sticky-top-adjust \"><div class=\"row pix-py-20 pix-comparison-head bg-white rounded-xl \"   
><div class=\"col-12 col-md-4 col-lg-6 d-flex align-items-center font-weight-bold text-heading-default\"><div class=\"pix-px-15\"><div  class=\"pix-heading-el text-left \"><h3 class=\"text-heading-default font-weight-bold h3 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">OpenAI<\/h3><\/div><\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >o1-Mini<\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >o1-Preview<\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >GPT-4 Turbo<\/div><\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Public Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >Publicly-Released Scores<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >82.5%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >86.9%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_3_title d-md-flex align-items-center 
justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >75.0%<\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Private Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >My MMLU Benchmark Score<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >80.0%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >80.0%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_3_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >64.0%<\/div><\/div><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div id=\"e-n-tab-content-2005224622\" role=\"tabpanel\" aria-labelledby=\"e-n-tab-title-2005224622\" data-tab-index=\"2\" style=\"--n-tabs-title-order: 2;\" class=\" elementor-element elementor-element-b230dc4 e-con-full e-flex e-con e-child\" data-id=\"b230dc4\" data-element_type=\"container\">\n\t\t\t\t<div class=\"elementor-element elementor-element-da0a036 elementor-widget elementor-widget-pix-comparison-table\" 
data-id=\"da0a036\" data-element_type=\"widget\" data-widget_type=\"pix-comparison-table.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"w-100 \" ><div class=\"container\" id=\"pix-event-2587718\"><div class=\"sticky-top pix-sticky-top-adjust \"><div class=\"row pix-py-20 pix-comparison-head bg-white rounded-xl \"   ><div class=\"col-12 col-md-4 col-lg-6 d-flex align-items-center font-weight-bold text-heading-default\"><div class=\"pix-px-15\"><div  class=\"pix-heading-el text-left \"><h3 class=\"text-heading-default font-weight-bold h3 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Gemini<\/h3><\/div><\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >Flash 2.0 Thinking<\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >Flash Pro 1.5<\/div><\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Public Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >Publicly-Released Scores<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >N\/A<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  
font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >89.5%<\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Private Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >My MMLU Benchmark Score<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >72.0%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >68.0%<\/div><\/div><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div id=\"e-n-tab-content-2005224623\" role=\"tabpanel\" aria-labelledby=\"e-n-tab-title-2005224623\" data-tab-index=\"3\" style=\"--n-tabs-title-order: 3;\" class=\" elementor-element elementor-element-f774ef1 e-con-full e-flex e-con e-child\" data-id=\"f774ef1\" data-element_type=\"container\">\n\t\t\t\t<div class=\"elementor-element elementor-element-5dde767 elementor-widget elementor-widget-pix-comparison-table\" data-id=\"5dde767\" data-element_type=\"widget\" data-widget_type=\"pix-comparison-table.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"w-100 \" ><div class=\"container\" id=\"pix-event-6411951\"><div class=\"sticky-top pix-sticky-top-adjust \"><div class=\"row pix-py-20 
pix-comparison-head bg-white rounded-xl \"   ><div class=\"col-12 col-md-4 col-lg-6 d-flex align-items-center font-weight-bold text-heading-default\"><div class=\"pix-px-15\"><div  class=\"pix-heading-el text-left \"><h3 class=\"text-heading-default font-weight-bold h3 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Anthropic<\/h3><\/div><\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >Sonnet 3.7<\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >Sonnet 3.5<\/div><\/div><div class=\"col justify-content-center d-none d-md-flex align-items-center\"><div class=\"mb-0     h6 text-heading-default\" >Sonnet 3.0<\/div><\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Public Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >Publicly-Released Scores<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >N\/A<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >89.5%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_3_title 
d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >81.5%<\/div><\/div><\/div><div class=\"row pix-py-20 pix-my-10 shadow-hover-none rounded-lg\"  ><div class=\"col-12 col-md-4 col-lg-6 mb-2 mb-sm-0 pb-2 pb-md-0\"><div class=\"pix-px-15\"><div class=\"d-flex align-items-center\"><div  class=\"pix-heading-el text-left \"><h6 class=\"text-heading-default font-weight-bold h6 heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Private Scores<\/h6><\/div><\/div><div class=\"pix-el-text w-100  \" ><p class=\" m-0 text-body-default  \" >My MMLU Benchmark Score<\/p><\/div><\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_1_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >72.0%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_2_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >72.0%<\/div><\/div><div class=\"col mt-2 mt-sm-0 text-center pix_comparison_3_title d-md-flex align-items-center justify-content-center\"><div class=\"text-center  font-weight-bold    d-flex align-items-center justify-content-center text-heading-default\" >60.0%<\/div><\/div><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-9eb6b9d e-flex e-con-boxed e-con e-parent\" data-id=\"9eb6b9d\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-dcadeb0 elementor-widget elementor-widget-text-editor\" 
data-id=\"dcadeb0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p style=\"font-weight: 400;\">Note that private scores on the legal subset of questions consistently give results 5.2% to 21.5% lower than the public scores. In the worst case, Claude Sonnet 3.0 scored only 60% which means that with a given legal question it would only provide the correct answer 60% of the time. Below is an example of a question from the professional law dataset. (Feel free to skip onto the Conclusion if you wish.)<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4c1b348 elementor-widget elementor-widget-heading\" data-id=\"4c1b348\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Sample MMLU Professional Law Test Question<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c425c44 elementor-widget elementor-widget-pix-accordion\" data-id=\"c425c44\" data-element_type=\"widget\" data-widget_type=\"pix-accordion.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"accordion w-100 accordion-card bg-white2 rounded-lg2\" id=\"accordion-c425c44\"><div class=\"card\">\n               <div class=\"card-header pix-mb-10 shadow-sm rounded-lg bg-white\" id=\"headingpix-tab-c425c44-b7aa921\" >\n                   <button class=\"btn btn-link d-flex text-left\" type=\"button\" data-toggle=\"collapse\" data-target=\"#collapsepix-tab-c425c44-b7aa921\" aria-expanded=\"true\" aria-controls=\"collapsepix-tab-c425c44-b7aa921\"><span class=\"d-inline-flex align-self-center text-heading-default svg-202 text-20 pix-mr-10\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-question-mark-circle-1\" viewBox=\"2 2 20 
20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M12,2 C6.4771525,2 2,6.4771525 2,12 C2,17.5228475 6.4771525,22 12,22 C17.5228475,22 22,17.5228475 22,12 C22,6.4771525 17.5228475,2 12,2 Z\"\/><path fill=\"var(--pf-icon-color)\" d=\"M12,5.5 C14.1573641,5.5 16,6.97647552 16,9.14061234 C16,10.2091201 15.6044228,10.9032609 14.7794339,11.6327228 L13.978875,12.2945095 L13.9200337,12.3443537 C13.6276079,12.5968507 13.4121745,12.8227286 13.2560966,13.0410158 C13.1631905,13.1854389 13.0995502,13.3529799 13.0592159,13.5557596 C13.0112095,13.7971112 13,14.016305 13,14.5 C13,15.0522847 12.5522847,15.5 12,15.5 C11.4477153,15.5 11,15.0522847 11,14.5 L11.0037664,14.09106 C11.0120779,13.725968 11.0363611,13.4736848 11.0976432,13.16559 C11.1844993,12.728922 11.3369366,12.3276122 11.5994156,11.9214805 C11.8850117,11.5188117 12.2076479,11.1805337 12.6129501,10.8305729 L13.317907,10.2514444 C13.8554705,9.80478272 14,9.57719525 14,9.14061234 C14,8.19061882 13.1381103,7.5 12,7.5 C11.0036626,7.5 10.4746926,7.8752163 10.1723422,8.6927508 C9.980771,9.21074591 9.40555388,9.4753648 8.88755877,9.28379363 C8.36956367,9.09222246 8.10494478,8.51700534 8.29651595,7.99901023 C8.8868293,6.40284402 10.1596349,5.5 12,5.5 Z M12,16.45 C12.5522847,16.45 13,16.8977153 13,17.45 L13,17.5 C13,18.0522847 12.5522847,18.5 12,18.5 C11.4477153,18.5 11,18.0522847 11,17.5 L11,17.45 C11,16.8977153 11.4477153,16.45 12,16.45 Z\"\/><\/g><\/svg><\/span><span class=\"d-inline-flex font-weight-bold text-heading-default\" >Question<\/span><\/button>\n               <\/div>\n\n               <div id=\"collapsepix-tab-c425c44-b7aa921\" class=\"collapse \" aria-labelledby=\"headingpix-tab-c425c44-b7aa921\">\n                 <div class=\"card-body\"><p style=\"font-weight: 400;\"><strong>Question<\/strong>: On December 30, a restaurant entered into a written contract with a bakery to supply the restaurant with all of its bread needs for the next calendar year. 
The contract contained a provision wherein the restaurant promised to purchase \"a minimum of 100 loaves per month at $1 per loaf.\" On a separate sheet, there was a note stating that any modifications must be in writing. The parties signed each sheet. Both sides performed fully under the contract for the first four months. On May 1, the president of the bakery telephoned the manager of the restaurant and told him that, because of an increase in the cost of wheat, the bakery would be forced to raise its prices to $1.20 per loaf. The manager said he understood and agreed to the price increase. The bakery then shipped 100 loaves (the amount ordered by the restaurant) to the restaurant, along with a bill for $120. The restaurant sent the bakery a check for $100 and refused to pay any more. Is the restaurant obligated to pay the additional $20?<\/p><\/div>\n               <\/div>\n             <\/div><div class=\"card\">\n               <div class=\"card-header pix-mb-10 shadow-sm rounded-lg bg-white\" id=\"headingpix-tab-c425c44-26790af\" >\n                   <button class=\"btn btn-link d-flex text-left\" type=\"button\" data-toggle=\"collapse\" data-target=\"#collapsepix-tab-c425c44-26790af\" aria-expanded=\"true\" aria-controls=\"collapsepix-tab-c425c44-26790af\"><span class=\"d-inline-flex align-self-center text-heading-default svg-202 text-20 pix-mr-10\"><svg class=\"pixfort-icon \" width=\"24\" height=\"24\"  data-name=\"Duotone\/pixfort-icon-check-circle-1\" viewBox=\"2 2 20 20\"><g fill=\"none\" fill-rule=\"evenodd\"><path fill=\"var(--pf-icon-color)\" fill-opacity=\".25\" d=\"M12,2 C6.4771525,2 2,6.4771525 2,12 C2,17.5228475 6.4771525,22 12,22 C17.5228475,22 22,17.5228475 22,12 C22,6.4771525 17.5228475,2 12,2 Z\"\/><path fill=\"var(--pf-icon-color)\" d=\"M15.2928932,8.29289322 C15.6834175,7.90236893 16.3165825,7.90236893 16.7071068,8.29289322 C17.0976311,8.68341751 17.0976311,9.31658249 16.7071068,9.70710678 L11.2071068,15.2071068 C10.8165825,15.5976311 
10.1834175,15.5976311 9.79289322,15.2071068 L7.29289322,12.7071068 C6.90236893,12.3165825 6.90236893,11.6834175 7.29289322,11.2928932 C7.68341751,10.9023689 8.31658249,10.9023689 8.70710678,11.2928932 L10.5,13.085 L15.2928932,8.29289322 Z\"\/><\/g><\/svg><\/span><span class=\"d-inline-flex font-weight-bold text-heading-default\" >Answers<\/span><\/button>\n               <\/div>\n\n               <div id=\"collapsepix-tab-c425c44-26790af\" class=\"collapse \" aria-labelledby=\"headingpix-tab-c425c44-26790af\">\n                 <div class=\"card-body\"><p style=\"font-weight: 400;\"><strong>Choose the correct answer from the following choices<\/strong>:<\/p><table style=\"font-weight: 400;\"><tbody><tr><td width=\"623\"><p>A) Yes, because the May 1 modification was enforceable even though it was not supported by new consideration.<\/p><\/td><\/tr><tr><td width=\"623\"><p>B) Yes, because the bakery detrimentally relied on the modification by making the May shipment to the restaurant.<\/p><\/td><\/tr><tr><td width=\"623\"><p>C) No, because there was no consideration to support the modification.<\/p><\/td><\/tr><tr><td width=\"623\"><p><b><i>D) No, because the modifying contract was not in writing; it was, therefore, unenforceable under the UCC.<\/i><\/b><\/p><\/td><\/tr><\/tbody><\/table><\/div>\n               <\/div>\n             <\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e0478e2 elementor-widget elementor-widget-heading\" data-id=\"e0478e2\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion\n<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e1d3d8 elementor-widget elementor-widget-text-editor\" data-id=\"8e1d3d8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p style=\"font-weight: 400;\">This testing reveals an important nuance: specialized benchmarks provide more accurate guidance than general performance scores. While the 5.2% &#8211; 21.5% gap in legal reasoning deserves attention, even 80% accuracy can be useful. Top models can analyze legal questions instantly and at scale\u2014dramatically enhancing efficiency in document review, contract analysis, or preliminary research when properly supervised. Understanding these specific capabilities allows legal professionals to leverage A.I.&#8217;s strengths while implementing appropriate guardrails around its current limitations.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9bff50b elementor-widget elementor-widget-heading\" data-id=\"9bff50b\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Next Steps<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e5dbc05 elementor-widget elementor-widget-text-editor\" data-id=\"e5dbc05\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<ol><li>I plan to build out my testing system to efficiently and thoroughly test more models using a broader range of benchmark tests. I\u2019m currently working on evaluations and explanations for the <a href=\"https:\/\/huggingface.co\/datasets\/TIGER-Lab\/MMLU-Pro\">MMLU-Pro<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/Idavidrein\/gpqa\">GPQA<\/a>, <a href=\"https:\/\/huggingface.co\/papers\/2412.15204\">LongBench v2<\/a>, and <a href=\"https:\/\/huggingface.co\/datasets\/maveriq\/bigbenchhard\">BIG Bench Hard<\/a>.<\/li><li>I&#8217;m increasingly interested in developing specialized benchmark tests. 
For example, because my practice focuses on estate planning and probate issues in California, evaluation questions specific to this legal domain and jurisdiction would be much more valuable than general legal questions. Domain-specific benchmarks could test an AI&#8217;s understanding of particular practice areas and jurisdictional knowledge, providing attorneys with much more relevant performance data for their specific needs. This approach would likely benefit lawyers across all specialties who are considering A.I. adoption.<\/li><li>While accuracy is one important metric, there are privacy and cost considerations as well. Next month I plan to release an evaluation of different A.I. tools in terms of ethical duties regarding client confidentiality.<\/li><\/ol>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f4efd41 elementor-widget elementor-widget-heading\" data-id=\"f4efd41\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Additional Information\n<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f7760ed elementor-widget elementor-widget-text-editor\" data-id=\"f7760ed\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<ul><li>Original MMLU Paper: <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">https:\/\/arxiv.org\/abs\/2009.03300<\/a><\/li><li>MMLU Dataset: <a href=\"https:\/\/huggingface.co\/datasets\/cais\/mmlu\">https:\/\/huggingface.co\/datasets\/cais\/mmlu<\/a><\/li><\/ul>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>My Work Evaluating A.I. Language Model Performance for Legal Reasoning TLDR: Quick Summary Language models power A.I. 
tools that can substantially boost attorney productivity such as ChatGPT, Anthropic\u2019s Claude, and Google\u2019s NotebookLM.[1]The MMLU benchmark is a standard test for evaluating&#8230;<\/p>\n","protected":false},"author":1,"featured_media":13695,"comment_status":"closed","ping_status":"open","sticky":false,"template":"elementor_header_footer","format":"standard","meta":{"_siteseo_robots_primary_cat":"4","pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[27],"tags":[],"class_list":["post-13636","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-legal-tech"],"_links":{"self":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13636","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/comments?post=13636"}],"version-history":[{"count":69,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13636\/revisions"}],"predecessor-version":[{"id":13717,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13636\/revisions\/13717"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/media\/13695"}],"wp:attachment":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/media?parent=13636"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/categories?post=13636"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/tags?post=13636"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}