by Grayson Adkins, updated March 28, 2024
This notebook provides practical benchmarking for large language models (LLMs) on a variety of technical tasks, such as converting code from one programming language to another, writing bash one-liners, explaining code snippets, and more.
Attribution
This notebook is based on a new benchmark called yet-another-applied-llm-benchmark created by Nicholas Carlini, a research scientist at Google DeepMind. He has a good write-up on the motivation and details on his personal blog.
The tasks are taken from real-world usage of LLMs by technical users (primarily Carlini himself). The goal is to give the evaluator a practical sense of whether a given model is capable of performing these challenging tasks; failure to perform a task is not an indication that the model is bad. That is, the benchmark is not designed to be a scientific benchmark for deciding whether one model is better than another. Instead, it should be used to see how models are progressing at accurately performing useful technical tasks.
The benchmark defines its tests as dataflow pipelines whose stages are chained together with the >> operator. For example, the test

'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")

instructs an LLM to write a program that prints "hello world", runs the generated code, and then checks for a matching string in the output. According to Carlini, this format supports more sophisticated behavior than other benchmarks. For instance, draw_flag_bmp.py tasks a model with writing a program in C that draws the flag of Italy in BMP format; GPT-4 is then asked what the output image depicts, and if GPT-4's answer includes "Italy" or "Italian", we assume the model got it right. (Carlini acknowledges that this is an imperfect approach, but, again, this benchmark is not meant to be scientifically rigorous.)

The core component of the benchmark framework is the evaluator.py file, which defines a series of classes and functions that together form a flexible and extensible system for running tests, capturing outputs, and evaluating those outputs against expected results or criteria. Consider, for example, the following test, which asks for a C program that draws an American flag:
TestFlagDraw = "Write a C program that draws an american flag to stdout as a bmp." >> LLMRun() >> \
ExtractCode() >> CRun() >> LLMVisionRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | \
SubstringEvaluator("USA") | \
SubstringEvaluator("America"))
In this example, the nodes LLMRun(), ExtractCode(), CRun(), LLMVisionRun(), and SubstringEvaluator() are each instances of their respective classes. Each class defines a set of functions that implement the desired behavior of the node, and the output of one node becomes the input to the next node in the sequence. For nodes that require code execution, a Docker or Podman container is spun up so the code runs safely in a sandboxed environment.
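The real chaining logic lives in evaluator.py. The toy sketch below is not the benchmark's actual code; it only illustrates how Python's >> operator can be overloaded so that a string prompt flows through a pipeline of nodes, with each node's output feeding the next.

# Simplified sketch (not the benchmark's actual evaluator.py code) of how
# ">>" chaining can work: each node transforms its input, and ">>" wires
# nodes into a pipeline whose output feeds the next node.

class Node:
    def __rshift__(self, other):
        return Pipeline(self, other)

    def __rrshift__(self, value):
        # Lets a plain string start a pipeline: "prompt" >> SomeNode()
        return Pipeline(Constant(value), self)

    def run(self, value):
        raise NotImplementedError

class Pipeline(Node):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, value=None):
        return self.second.run(self.first.run(value))

class Constant(Node):
    def __init__(self, value):
        self.value = value

    def run(self, _=None):
        return self.value

class FakeLLMRun(Node):
    # Stand-in for LLMRun(): here it just returns a canned answer.
    def run(self, prompt):
        return 'print("hello world")'

class SubstringEvaluator(Node):
    def __init__(self, target):
        self.target = target

    def run(self, text):
        return self.target in text

test = 'Write a "hello world" program in python' >> FakeLLMRun() >> SubstringEvaluator("hello world")
print(test.run())  # True

The benchmark's real nodes do far more work (calling models, extracting code, launching containers), but the composition idea is the same.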
Users can run tests individually or all of them at once. The framework also conveniently includes a script for generating a results matrix in HTML format.
The benchmark is easily extensible, both for adding new tests and for adding new LLMs. As of this writing, it supports the following LLMs:
Additionally, Trelis Research has added support for custom models that implement the OpenAI API format. In this notebook, I also demonstrate testing against Mixtral 8x7B Instruct AWQ and OpenChat 3.5.
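"Implements the OpenAI API format" simply means the model is served behind the same /v1/chat/completions interface that OpenAI exposes. As a rough sketch (the endpoint URL and prompt below are placeholders, not the pods used later in this notebook), querying such a server looks like this:

# Rough sketch: query a self-hosted model that speaks the OpenAI
# chat-completions format (e.g. a Runpod pod running an OpenAI-compatible
# server such as vLLM). The URL is a placeholder; substitute your own endpoint.
import requests

ENDPOINT = "https://<your-pod-id>-8000.proxy.runpod.net/v1"  # placeholder

response = requests.post(
    f"{ENDPOINT}/chat/completions",
    headers={"Authorization": "Bearer EMPTY"},  # self-hosted servers often ignore the key
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
        "messages": [{"role": "user", "content": "Write a bash one-liner that counts lines in all .py files."}],
        "temperature": 0.7,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])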
## Remove existing benchmark repo from local files
# import shutil
# shutil.rmtree('/content/yet-another-applied-llm-benchmark')
%cd /content
!git clone https://github.com/gadkins/yet-another-applied-llm-benchmark.git
%cd yet-another-applied-llm-benchmark
!pip install -qUr requirements.txt
!pip install -qUr requirements-extra.txt
!pip install -qU python-dotenv
/content Cloning into 'yet-another-applied-llm-benchmark'... remote: Enumerating objects: 12000, done. remote: Counting objects: 100% (568/568), done. remote: Compressing objects: 100% (185/185), done. remote: Total 12000 (delta 391), reused 535 (delta 375), pack-reused 11432 Receiving objects: 100% (12000/12000), 72.45 MiB | 19.87 MiB/s, done. Resolving deltas: 100% (2760/2760), done. Updating files: 100% (10736/10736), done. /content/yet-another-applied-llm-benchmark
# Unnecessary in Google Colab (but critical on a local machine)
!sudo apt-get install podman
Reading package lists... Done Building dependency tree... Done Reading state information... Done podman is already the newest version (3.4.4+ds1-1ubuntu1.22.04.2). The following package was automatically installed and is no longer required: libfuse2 Use 'sudo apt autoremove' to remove it. 0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
!which podman
/usr/bin/podman
In addition to the LLMs defined in the code, you can run this benchmark against your own custom models or against open-source models that are not already included.
For help deploying production-ready model APIs, see the Model Serving notebook. I also have one-click templates available for easily deploying the following models on Runpod:
Next, we'll create config.json and set up the following: the api_key (or an empty string if not applicable), the API endpoint where the model is hosted, and the Hugging Face model_id.

%%writefile config.json
{
"container": "podman",
"hparams": {
"temperature": 0.7
},
"llms": {
"Mistral-7B-Instruct-v0.1-AWQ": {
"api_key": "EMPTY",
"endpoint": "https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/",
"slug": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
},
"Mixtral-Instruct-AWQ": {
"api_key": "EMPTY",
"endpoint": "https://mc1s4jnygce5b5-8000.proxy.runpod.net/v1/",
"slug": "casperhansen/mixtral-instruct-awq"
},
"openchat_3.5": {
"api_key": "EMPTY",
"endpoint": "https://i0vbjq7enev3du-8080.proxy.runpod.net/v1",
"model_id": "openchat/openchat_3.5"
},
"openai": {
"api_key": "YOUR_OPENAI_API_KEY"
},
"mistral": {
"api_key": "TODO"
},
"cohere": {
"api_key": "TODO"
},
"anthropic": {
"api_key": "TODO"
},
"moonshot": {
"api_key": "TODO"
}
}
}
Writing config.json
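As a quick sanity check (not part of the benchmark itself), you can read the file back and list the models it configures:

# Quick check that config.json was written as expected (standard library only;
# not part of the benchmark framework).
import json

with open("config.json") as f:
    config = json.load(f)

for name, settings in config["llms"].items():
    print(f"{name:30s} -> {settings.get('endpoint', 'provider default endpoint')}")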
For testing purposes, follow these instructions. We'll be using "gpt-3.5-turbo", which can be accessed via free accounts.

In llm.py, update the following variables:

llm = LLM("gpt-3.5-turbo")  # or whatever model you want
eval_llm = LLM("gpt-3.5-turbo", override_hparams={'temperature': 0.1})

In evaluator.py, change PYTHON_ENV = "python3.11" to PYTHON_ENV = "python".

In docker_controller.py (if not using podman or docker), set I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE to True.

If you prefer running it locally, add the respective Python path in evaluator.py.

These changes will enable you to use "gpt-3.5-turbo" for testing.
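Before kicking off any tests, it can save time to confirm that the OpenAI key in config.json actually works. The check below is independent of the benchmark and assumes the openai (v1+) Python package is installed:

# Independent sanity check of the OpenAI key in config.json
# (assumes the openai>=1.0 package; not part of the benchmark itself).
import json
from openai import OpenAI

config = json.load(open("config.json"))
client = OpenAI(api_key=config["llms"]["openai"]["api_key"])

reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply.choices[0].message.content)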
Let's make sure one basic test is working on the free gpt-3.5-turbo model.
!PYTHONPATH='.' python tests/print_hello.py
gpt-3.5-turbo CACHE MISS ['Write a python program that prints the string "hello world" and tell me how it works in a sentence'] gpt-3.5-turbo CACHE MISS ['Take the below answer to my programming question and return just the complete code in a single file so I can copy and paste it into an editor and directly run it. Include any header and main necessary so I can run it by copying this one file. DO NOT MODIFY THE CODE OR WRITE NEW CODE. Here is the code: \nprint("hello world")\n\nThis program uses the print() function in Python to output the string "hello world" to the console when the program is executed.'] # Initial Query > Write a python program that prints the string "hello world" and tell me how it works in a sentence # LLM Generation ## Query > Write a python program that prints the string "hello world" and tell me how it works in a sentence ## Output > print("hello world") > > This program uses the print() function in Python to output the string "hello world" to the console when the program is executed. # Extract Code I extracted the following code from that output: > ``` > print("hello world") > ``` # Run Code Interpreter Running the following program: > ``` > print("hello world") > ``` And got the output: ``` hello world ``` # Substring Evaluation Testing if the previous output contains the string `hello world`: True True
Now let's try running that same basic test on our custom model (here I'm using my own instance of Mistral-7B-Instruct-v0.1-AWQ on Runpod.io).
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --test print_hello --run-tests
Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0 Model name: Mistral-7B-Instruct-v0.1-AWQ Model ID: None API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/ print_hello.py Run Job TestPrintHello Test Passes: TestPrintHello
Awesome! Now let's try running all the tests on our custom model. We'll also generate a report to summarize the results.
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --run-tests --generate-report
Model name: Mistral-7B-Instruct-v0.1-AWQ Model ID: None API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/ Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0 Model name: Mistral-7B-Instruct-v0.1-AWQ Model ID: None API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/ fix_torch_backward.py Run Job TestTorchBackwardExplain Test Fails: TestTorchBackwardExplain from fix_torch_backward.py Run Job TestTorchBackwardFix Test Passes: TestTorchBackwardFix git_merge.py Run Job TestGitMerge Test Fails: TestGitMerge from git_merge.py Run Job TestGitMergeConflict Test Fails: TestGitMergeConflict from git_merge.py jax_onehot.py Run Job TestJaxOneHot Test Fails: TestJaxOneHot from jax_onehot.py fix_threading_issue.py Run Job TestQuestionThreadedFix Test Fails: TestQuestionThreadedFix from fix_threading_issue.py jnp_nn_bugfix.py Run Job TestFixJnpBug Test Fails: TestFixJnpBug from jnp_nn_bugfix.py implement_assembly_interpreter.py Run Job TestImplementAssembly Test Fails: TestImplementAssembly from implement_assembly_interpreter.py convert_to_c.py Run Job TestProgramRewriteC Test Fails: TestProgramRewriteC from convert_to_c.py rust_parallel_wordcount.py Run Job TestRustParCount Test Fails: TestRustParCount from rust_parallel_wordcount.py Run Job TestRustParCountNoLib Test Fails: TestRustParCountNoLib from rust_parallel_wordcount.py print_hello.py Run Job TestPrintHello Test Passes: TestPrintHello baking_help.py Run Job TestMissingStep Test Fails: TestMissingStep from baking_help.py python_chess_game_prefix.py Run Job TestPyChessPrefix Test Fails: TestPyChessPrefix from python_chess_game_prefix.py git_cherrypick.py Run Job TestGitCherrypick Test Fails: TestGitCherrypick from git_cherrypick.py find_bug_in_paper.py Run Job TestFindBugPaper Test Fails: TestFindBugPaper from find_bug_in_paper.py Run Job TestFindBugPaperEasy Test Fails: TestFindBugPaperEasy from find_bug_in_paper.py explain_code_prime.py Run Job TestExplainPrime Test Fails: TestExplainPrime from explain_code_prime.py merge_into_16.py Run Job TestMake16Files Test Fails: TestMake16Files from merge_into_16.py Run Job TestMake16FilesEasy Test Fails: TestMake16FilesEasy from merge_into_16.py base64_qanda.py Run Job TestBase64Thought Test Fails: TestBase64Thought from base64_qanda.py what_is_automodel.py Run Job TestWhatIsAutoModel Test Fails: TestWhatIsAutoModel from what_is_automodel.py extract_emails.py Run Job TestExtractEmail Test Fails: TestExtractEmail from extract_emails.py regex_remove_5_words.py Run Job TestRegex Test Fails: TestRegex from regex_remove_5_words.py numpy_advanced_index.py Run Job TestNumpyAdvancedIndex Test Fails: TestNumpyAdvancedIndex from numpy_advanced_index.py Run Job TestNumpyAdvancedIndexEasier Test Fails: TestNumpyAdvancedIndexEasier from numpy_advanced_index.py fix_tokenizer.py Run Job TestSimpleFix Test Fails: TestSimpleFix from fix_tokenizer.py convert_dp_to_iterative.py Run Job TestProgramRemoveDP Test Fails: TestProgramRemoveDP from convert_dp_to_iterative.py explain_code_prime2.py Run Job TestExplainPrime2 Test Fails: TestExplainPrime2 from explain_code_prime2.py what_is_inv.py Run Job TestWhatIsInv Test Fails: TestWhatIsInv from what_is_inv.py strided_trick.py Run Job TestProgramStrided Test Fails: TestProgramStrided from strided_trick.py identify_uuencode.py Run Job TestIsUU Test Fails: TestIsUU from identify_uuencode.py convert_to_c_simple.py Run Job TestProgramRewriteCSimple Traceback (most recent call last): File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate stdout, 
stderr = self._communicate(input, endtime, timeout) File "/usr/lib/python3.10/subprocess.py", line 2021, in _communicate ready = selector.select(timeout) File "/usr/lib/python3.10/selectors.py", line 416, in select fd_event_list = self._selector.poll(timeout) KeyboardInterrupt During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/content/yet-another-applied-llm-benchmark/main.py", line 196, in <module> main() File "/content/yet-another-applied-llm-benchmark/main.py", line 179, in main result = run_all_tests(model, use_cache=False, File "/content/yet-another-applied-llm-benchmark/main.py", line 82, in run_all_tests ok, reason = run_one_test(test, test_llm, llm.eval_llm, llm.vision_eval_llm) File "/content/yet-another-applied-llm-benchmark/main.py", line 40, in run_one_test for success, output in test(): File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 182, in __call__ for output1, response1 in self.node1(orig_output): File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 183, in __call__ for output2, response2 in self.node2(output1): File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 544, in __call__ out = invoke_docker(self.env, {"main.c": code.encode(), File "/content/yet-another-applied-llm-benchmark/docker_controller.py", line 240, in invoke_docker proc = subprocess.run(run_cmd, cwd="/tmp/fakedocker_%d"%env.fake_docker_id, stdout=subprocess.PIPE, stderr=subprocess.PIPE) File "/usr/lib/python3.10/subprocess.py", line 505, in run stdout, stderr = process.communicate(input, timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1165, in communicate self._wait(timeout=sigint_timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) KeyboardInterrupt ^C
!PYTHONPATH='.' python regenerate_report.py
[dict_keys(['print_hello.py.TestPrintHello'])] print_hello.py.TestPrintHello BAD {} print_hello.py.TestPrintHello
If you pass the --generate-report option to the python main.py command, you can see a summary of the test results in HTML format. Alternatively, you can run the script generate-report.py, which will run all tests for the default model in llm.py (llm = LLM(...)).
You can see an example report below.
%%html
<iframe src="https://nicholas.carlini.com/writing/2024/evaluation_examples/index.html" width="1000" height="1000"></iframe>