model card update

2025-05-04 04:41:01 +00:00 · 2022-05-29 18:39:55 +02:00 · 2022-05-29 18:39:55 +02:00 · e77a4cd147
parent f67f2088cc
commit e77a4cd147
1 changed files with 435 additions and 2 deletions
--- a/README.md
+++ b/README.md
@ -4,8 +4,441 @@ language:
 - en
 tags:
 - text-generation
-license:
- MIT
+- pytorch
+- causal-lm
+license: apache-2.0
 ---

 # GPT-4chan
+
+## Model Description
+
+GPT-4chan is a language model fine-tuned from [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) on 3.5 years worth of data from 4chan's _politically incorrect_ (/pol/) board. 
+
+## Training data
+
+GPT-4chan was fine-tuned on the dataset [Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board](https://zenodo.org/record/3606810).
+
+## Training procedure
+
+The model was trained for 1 epoch following [GPT-J's fine-tuning guide](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md).
+
+## Intended Use
+
+GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool to investigate discourse in such anonymous online communities. Lastly, it has potential applications in tasks suche as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B.
+
+### How to use
+
+The following is copied from the [Hugging Face documentation on GPT-J](https://huggingface.co/docs/transformers/main/en/model_doc/gptj#generation). Refer to the original for more details.
+
+For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.
+
+For the float32 model (CPU):
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
+tokenizer = AutoTokenizer.from_pretrained("ykilcher/gpt-4chan")
+
+prompt = (
+    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
+    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
+    "researchers was the fact that the unicorns spoke perfect English."
+)
+
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+
+gen_tokens = model.generate(
+    input_ids,
+    do_sample=True,
+    temperature=0.8,
+    top_p=0.9,
+    max_length=100,
+)
+gen_text = tokenizer.batch_decode(gen_tokens)[0]
+```
+
+For the float16 model(GPU):
+```python
+from transformers import GPTJForCausalLM, AutoTokenizer
+import torch
+
+from transformers import GPTJForCausalLM
+import torch
+
+model = GPTJForCausalLM.from_pretrained(
+    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
+)
+tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
+
+prompt = (
+    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
+    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
+    "researchers was the fact that the unicorns spoke perfect English."
+)
+
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+
+gen_tokens = model.generate(
+    input_ids,
+    do_sample=True,
+    temperature=0.8,
+    top_p=0.9,
+    max_length=100,
+)
+gen_text = tokenizer.batch_decode(gen_tokens)[0]
+```
+
+### Limitations and Biases
+
+This is a statistical model
+
+Dataset from 2016 to 2019 and biased.
+
+Will be offensive.
+
+Do not deploy without appropriate measures.
+
+## Evaluation results
+
+<figure>
+
+| Task                                                      | Metric          |     GPT-J-6B |       stderr |    GPT-4chan |       stderr | Significant   |
+|:----------------------------------------------------------|:----------------|-------------:|-------------:|-------------:|-------------:|:--------------|
+| copa                                                      | acc             |   0.83       |  0.0377525   |   0.85       |  0.035887    |               |
+| blimp_only_npi_scope                                      | acc             |   0.787      |  0.0129537   |   0.712      |  0.0143269   | -             |
+| hendrycksTest-conceptual_physics                          | acc             |   0.255319   |  0.0285049   |   0.251064   |  0.028347    |               |
+| hendrycksTest-conceptual_physics                          | acc_norm        |   0.191489   |  0.0257221   |   0.187234   |  0.0255016   |               |
+| hendrycksTest-high_school_mathematics                     | acc             |   0.218519   |  0.0251958   |   0.248148   |  0.0263357   | +             |
+| hendrycksTest-high_school_mathematics                     | acc_norm        |   0.251852   |  0.0264661   |   0.3        |  0.0279405   | +             |
+| blimp_sentential_negation_npi_scope                       | acc             |   0.733      |  0.0139967   |   0.734      |  0.01398     |               |
+| hendrycksTest-high_school_european_history                | acc             |   0.260606   |  0.0342774   |   0.278788   |  0.0350144   |               |
+| hendrycksTest-high_school_european_history                | acc_norm        |   0.278788   |  0.0350144   |   0.315152   |  0.0362773   | +             |
+| blimp_wh_questions_object_gap                             | acc             |   0.835      |  0.0117436   |   0.841      |  0.0115695   |               |
+| hendrycksTest-international_law                           | acc             |   0.264463   |  0.0402619   |   0.214876   |  0.0374949   | -             |
+| hendrycksTest-international_law                           | acc_norm        |   0.404959   |  0.0448114   |   0.438017   |  0.0452915   |               |
+| hendrycksTest-high_school_us_history                      | acc             |   0.289216   |  0.0318223   |   0.323529   |  0.0328347   | +             |
+| hendrycksTest-high_school_us_history                      | acc_norm        |   0.29902    |  0.0321333   |   0.323529   |  0.0328347   |               |
+| openbookqa                                                | acc             |   0.29       |  0.0203132   |   0.276      |  0.0200112   |               |
+| openbookqa                                                | acc_norm        |   0.382      |  0.0217508   |   0.362      |  0.0215137   |               |
+| blimp_causative                                           | acc             |   0.761      |  0.013493    |   0.737      |  0.0139293   | -             |
+| record                                                    | f1              |   0.885049   |  0.00314367  |   0.878443   |  0.00322394  | -             |
+| record                                                    | em              |   0.8765     |  0.00329027  |   0.8702     |  0.003361    | -             |
+| blimp_determiner_noun_agreement_1                         | acc             |   0.995      |  0.00223159  |   0.996      |  0.00199699  |               |
+| hendrycksTest-miscellaneous                               | acc             |   0.274585   |  0.0159598   |   0.305236   |  0.0164677   | +             |
+| hendrycksTest-miscellaneous                               | acc_norm        |   0.260536   |  0.015696    |   0.269476   |  0.0158662   |               |
+| hendrycksTest-virology                                    | acc             |   0.349398   |  0.0371173   |   0.343373   |  0.0369658   |               |
+| hendrycksTest-virology                                    | acc_norm        |   0.325301   |  0.0364717   |   0.331325   |  0.0366431   |               |
+| mathqa                                                    | acc             |   0.267002   |  0.00809858  |   0.269012   |  0.00811786  |               |
+| mathqa                                                    | acc_norm        |   0.270687   |  0.00813376  |   0.261642   |  0.00804614  | -             |
+| squad2                                                    | exact           |  10.6207     |  0           |  10.6123     |  0           | -             |
+| squad2                                                    | f1              |  17.7413     |  0           |  17.8734     |  0           | +             |
+| squad2                                                    | HasAns_exact    |  15.5027     |  0           |  17.2571     |  0           | +             |
+| squad2                                                    | HasAns_f1       |  29.7643     |  0           |  31.8        |  0           | +             |
+| squad2                                                    | NoAns_exact     |   5.75273    |  0           |   3.98654    |  0           | -             |
+| squad2                                                    | NoAns_f1        |   5.75273    |  0           |   3.98654    |  0           | -             |
+| squad2                                                    | best_exact      |  50.0716     |  0           |  50.0716     |  0           |               |
+| squad2                                                    | best_f1         |  50.0778     |  0           |  50.077      |  0           | -             |
+| mnli_mismatched                                           | acc             |   0.376627   |  0.00488687  |   0.320586   |  0.00470696  | -             |
+| blimp_animate_subject_passive                             | acc             |   0.781      |  0.0130847   |   0.79       |  0.0128867   |               |
+| blimp_determiner_noun_agreement_with_adj_irregular_1      | acc             |   0.878      |  0.0103549   |   0.834      |  0.0117721   | -             |
+| qnli                                                      | acc             |   0.513454   |  0.00676296  |   0.491305   |  0.00676439  | -             |
+| blimp_intransitive                                        | acc             |   0.858      |  0.0110435   |   0.806      |  0.0125108   | -             |
+| ethics_cm                                                 | acc             |   0.559846   |  0.00796521  |   0.512227   |  0.00802048  | -             |
+| hendrycksTest-high_school_computer_science                | acc             |   0.25       |  0.0435194   |   0.2        |  0.0402015   | -             |
+| hendrycksTest-high_school_computer_science                | acc_norm        |   0.27       |  0.0446196   |   0.26       |  0.0440844   |               |
+| iwslt17-ar-en                                             | bleu            |  20.7322     |  0.795602    |  21.4685     |  0.64825     | +             |
+| iwslt17-ar-en                                             | chrf            |   0.450919   |  0.00526515  |   0.452175   |  0.00498012  |               |
+| iwslt17-ar-en                                             | ter             |   0.787631   |  0.0285488   |   0.733514   |  0.0201688   | +             |
+| hendrycksTest-security_studies                            | acc             |   0.363265   |  0.0307891   |   0.391837   |  0.0312513   |               |
+| hendrycksTest-security_studies                            | acc_norm        |   0.285714   |  0.0289206   |   0.285714   |  0.0289206   |               |
+| hendrycksTest-global_facts                                | acc             |   0.25       |  0.0435194   |   0.29       |  0.0456048   |               |
+| hendrycksTest-global_facts                                | acc_norm        |   0.22       |  0.0416333   |   0.26       |  0.0440844   |               |
+| anli_r1                                                   | acc             |   0.322      |  0.0147829   |   0.297      |  0.0144568   | -             |
+| blimp_left_branch_island_simple_question                  | acc             |   0.867      |  0.0107437   |   0.884      |  0.0101315   | +             |
+| hendrycksTest-astronomy                                   | acc             |   0.25       |  0.0352381   |   0.25       |  0.0352381   |               |
+| hendrycksTest-astronomy                                   | acc_norm        |   0.335526   |  0.038425    |   0.348684   |  0.0387814   |               |
+| mrpc                                                      | acc             |   0.683824   |  0.0230483   |   0.536765   |  0.024717    | -             |
+| mrpc                                                      | f1              |   0.812227   |  0.0162476   |   0.63301    |  0.0247985   | -             |
+| ethics_utilitarianism                                     | acc             |   0.509775   |  0.00721024  |   0.525374   |  0.00720233  | +             |
+| blimp_determiner_noun_agreement_2                         | acc             |   0.977      |  0.00474273  |   0.99       |  0.003148    | +             |
+| lambada_cloze                                             | ppl             | 405.646      | 14.5519      | 388.123      | 13.1523      | +             |
+| lambada_cloze                                             | acc             |   0.0199884  |  0.00194992  |   0.0116437  |  0.00149456  | -             |
+| truthfulqa_mc                                             | mc1             |   0.201958   |  0.014054    |   0.225214   |  0.0146232   | +             |
+| truthfulqa_mc                                             | mc2             |   0.359537   |  0.0134598   |   0.371625   |  0.0136558   |               |
+| blimp_wh_vs_that_with_gap_long_distance                   | acc             |   0.342      |  0.0150087   |   0.441      |  0.0157088   | +             |
+| hendrycksTest-business_ethics                             | acc             |   0.29       |  0.0456048   |   0.28       |  0.0451261   |               |
+| hendrycksTest-business_ethics                             | acc_norm        |   0.3        |  0.0460566   |   0.29       |  0.0456048   |               |
+| arithmetic_3ds                                            | acc             |   0.046      |  0.0046854   |   0.0065     |  0.00179736  | -             |
+| blimp_determiner_noun_agreement_with_adjective_1          | acc             |   0.978      |  0.00464086  |   0.988      |  0.00344498  | +             |
+| hendrycksTest-moral_disputes                              | acc             |   0.283237   |  0.0242579   |   0.277457   |  0.0241057   |               |
+| hendrycksTest-moral_disputes                              | acc_norm        |   0.32659    |  0.0252483   |   0.309249   |  0.0248831   |               |
+| arithmetic_2da                                            | acc             |   0.2405     |  0.00955906  |   0.0455     |  0.00466109  | -             |
+| qa4mre_2011                                               | acc             |   0.458333   |  0.0456755   |   0.425      |  0.0453163   |               |
+| qa4mre_2011                                               | acc_norm        |   0.533333   |  0.045733    |   0.558333   |  0.0455219   |               |
+| blimp_regular_plural_subject_verb_agreement_1             | acc             |   0.968      |  0.00556839  |   0.966      |  0.00573384  |               |
+| hendrycksTest-human_sexuality                             | acc             |   0.396947   |  0.0429114   |   0.389313   |  0.0427649   |               |
+| hendrycksTest-human_sexuality                             | acc_norm        |   0.343511   |  0.0416498   |   0.305344   |  0.0403931   |               |
+| blimp_passive_1                                           | acc             |   0.885      |  0.0100934   |   0.878      |  0.0103549   |               |
+| blimp_drop_argument                                       | acc             |   0.823      |  0.0120755   |   0.784      |  0.0130197   | -             |
+| hendrycksTest-high_school_microeconomics                  | acc             |   0.277311   |  0.0290794   |   0.260504   |  0.0285103   |               |
+| hendrycksTest-high_school_microeconomics                  | acc_norm        |   0.39916    |  0.0318111   |   0.390756   |  0.0316938   |               |
+| hendrycksTest-us_foreign_policy                           | acc             |   0.34       |  0.0476095   |   0.32       |  0.0468826   |               |
+| hendrycksTest-us_foreign_policy                           | acc_norm        |   0.35       |  0.0479372   |   0.4        |  0.0492366   | +             |
+| blimp_ellipsis_n_bar_1                                    | acc             |   0.841      |  0.0115695   |   0.846      |  0.0114199   |               |
+| hendrycksTest-high_school_physics                         | acc             |   0.271523   |  0.0363133   |   0.264901   |  0.0360304   |               |
+| hendrycksTest-high_school_physics                         | acc_norm        |   0.271523   |  0.0363133   |   0.284768   |  0.0368488   |               |
+| qa4mre_2013                                               | acc             |   0.401408   |  0.0291384   |   0.362676   |  0.028579    | -             |
+| qa4mre_2013                                               | acc_norm        |   0.383803   |  0.0289082   |   0.387324   |  0.0289574   |               |
+| blimp_wh_vs_that_no_gap                                   | acc             |   0.969      |  0.00548353  |   0.963      |  0.00597216  | -             |
+| headqa_es                                                 | acc             |   0.251276   |  0.0082848   |   0.238877   |  0.00814442  | -             |
+| headqa_es                                                 | acc_norm        |   0.286652   |  0.00863721  |   0.290664   |  0.00867295  |               |
+| blimp_sentential_subject_island                           | acc             |   0.421      |  0.0156206   |   0.359      |  0.0151773   | -             |
+| hendrycksTest-philosophy                                  | acc             |   0.26045    |  0.0249267   |   0.241158   |  0.0242966   |               |
+| hendrycksTest-philosophy                                  | acc_norm        |   0.334405   |  0.0267954   |   0.327974   |  0.0266644   |               |
+| hendrycksTest-elementary_mathematics                      | acc             |   0.251323   |  0.0223405   |   0.248677   |  0.0222618   |               |
+| hendrycksTest-elementary_mathematics                      | acc_norm        |   0.26455    |  0.0227175   |   0.275132   |  0.0230001   |               |
+| math_geometry                                             | acc             |   0.0104384  |  0.00464863  |   0.0187891  |  0.00621042  | +             |
+| blimp_wh_questions_subject_gap_long_distance              | acc             |   0.883      |  0.0101693   |   0.886      |  0.0100551   |               |
+| hendrycksTest-college_physics                             | acc             |   0.205882   |  0.0402338   |   0.205882   |  0.0402338   |               |
+| hendrycksTest-college_physics                             | acc_norm        |   0.245098   |  0.0428011   |   0.22549    |  0.0415831   |               |
+| hellaswag                                                 | acc             |   0.49532    |  0.00498956  |   0.488747   |  0.00498852  | -             |
+| hellaswag                                                 | acc_norm        |   0.66202    |  0.00472055  |   0.648277   |  0.00476532  | -             |
+| hendrycksTest-logical_fallacies                           | acc             |   0.294479   |  0.0358117   |   0.269939   |  0.0348783   |               |
+| hendrycksTest-logical_fallacies                           | acc_norm        |   0.355828   |  0.0376152   |   0.343558   |  0.0373113   |               |
+| hendrycksTest-machine_learning                            | acc             |   0.223214   |  0.039523    |   0.339286   |  0.0449395   | +             |
+| hendrycksTest-machine_learning                            | acc_norm        |   0.178571   |  0.0363521   |   0.205357   |  0.0383424   |               |
+| hendrycksTest-high_school_psychology                      | acc             |   0.273394   |  0.0191093   |   0.286239   |  0.0193794   |               |
+| hendrycksTest-high_school_psychology                      | acc_norm        |   0.269725   |  0.0190285   |   0.266055   |  0.018946    |               |
+| prost                                                     | acc             |   0.268254   |  0.00323688  |   0.256298   |  0.00318967  | -             |
+| prost                                                     | acc_norm        |   0.274658   |  0.00326093  |   0.280156   |  0.00328089  | +             |
+| blimp_determiner_noun_agreement_with_adj_irregular_2      | acc             |   0.916      |  0.00877616  |   0.898      |  0.00957537  | -             |
+| wnli                                                      | acc             |   0.464789   |  0.0596131   |   0.43662    |  0.0592794   |               |
+| hendrycksTest-professional_law                            | acc             |   0.273794   |  0.0113886   |   0.284876   |  0.0115278   |               |
+| hendrycksTest-professional_law                            | acc_norm        |   0.292699   |  0.0116209   |   0.301825   |  0.0117244   |               |
+| math_algebra                                              | acc             |   0.0117944  |  0.00313487  |   0.0126369  |  0.00324352  |               |
+| wikitext                                                  | word_perplexity |  10.8819     |  0           |  11.4687     |  0           | -             |
+| wikitext                                                  | byte_perplexity |   1.56268    |  0           |   1.5781     |  0           | -             |
+| wikitext                                                  | bits_per_byte   |   0.644019   |  0           |   0.658188   |  0           | -             |
+| anagrams1                                                 | acc             |   0.0008     |  0.000282744 |   0.0125     |  0.00111108  | +             |
+| math_prealgebra                                           | acc             |   0.0126292  |  0.00378589  |   0.0195178  |  0.00469003  | +             |
+| blimp_principle_A_domain_2                                | acc             |   0.889      |  0.0099387   |   0.887      |  0.0100166   |               |
+| cycle_letters                                             | acc             |   0.0026     |  0.000509264 |   0.0331     |  0.00178907  | +             |
+| hendrycksTest-college_mathematics                         | acc             |   0.26       |  0.0440844   |   0.26       |  0.0440844   |               |
+| hendrycksTest-college_mathematics                         | acc_norm        |   0.4        |  0.0492366   |   0.31       |  0.0464823   | -             |
+| arithmetic_1dc                                            | acc             |   0.089      |  0.00636866  |   0.077      |  0.00596266  | -             |
+| arithmetic_4da                                            | acc             |   0.007      |  0.00186474  |   0.0005     |  0.0005      | -             |
+| triviaqa                                                  | acc             |   0.167418   |  0.00351031  |   0.150888   |  0.00336543  | -             |
+| boolq                                                     | acc             |   0.655352   |  0.00831224  |   0.673394   |  0.00820236  | +             |
+| random_insertion                                          | acc             |   0          |  0           |   0.0004     |  0.00019997  | +             |
+| qa4mre_2012                                               | acc             |   0.4125     |  0.0390407   |   0.4        |  0.0388514   |               |
+| qa4mre_2012                                               | acc_norm        |   0.50625    |  0.0396495   |   0.4625     |  0.0395409   | -             |
+| math_asdiv                                                | acc             |   0.00563991 |  0.00156015  |   0.00997831 |  0.00207066  | +             |
+| hendrycksTest-moral_scenarios                             | acc             |   0.236872   |  0.0142196   |   0.236872   |  0.0142196   |               |
+| hendrycksTest-moral_scenarios                             | acc_norm        |   0.272626   |  0.0148934   |   0.272626   |  0.0148934   |               |
+| hendrycksTest-high_school_geography                       | acc             |   0.20202    |  0.0286062   |   0.247475   |  0.0307463   | +             |
+| hendrycksTest-high_school_geography                       | acc_norm        |   0.292929   |  0.032425    |   0.287879   |  0.0322588   |               |
+| gsm8k                                                     | acc             |   0          |  0           |   0          |  0           |               |
+| blimp_existential_there_object_raising                    | acc             |   0.792      |  0.0128414   |   0.812      |  0.0123616   | +             |
+| blimp_superlative_quantifiers_2                           | acc             |   0.865      |  0.0108117   |   0.917      |  0.00872853  | +             |
+| hendrycksTest-college_chemistry                           | acc             |   0.24       |  0.0429235   |   0.28       |  0.0451261   |               |
+| hendrycksTest-college_chemistry                           | acc_norm        |   0.28       |  0.0451261   |   0.31       |  0.0464823   |               |
+| blimp_existential_there_quantifiers_2                     | acc             |   0.383      |  0.0153801   |   0.545      |  0.0157551   | +             |
+| hendrycksTest-abstract_algebra                            | acc             |   0.26       |  0.0440844   |   0.17       |  0.0377525   | -             |
+| hendrycksTest-abstract_algebra                            | acc_norm        |   0.3        |  0.0460566   |   0.26       |  0.0440844   |               |
+| hendrycksTest-professional_psychology                     | acc             |   0.28268    |  0.0182173   |   0.26634    |  0.0178832   |               |
+| hendrycksTest-professional_psychology                     | acc_norm        |   0.259804   |  0.0177409   |   0.256536   |  0.0176678   |               |
+| ethics_virtue                                             | acc             |   0.200201   |  0.00567376  |   0.249849   |  0.00613847  | +             |
+| ethics_virtue                                             | em              |   0          |  0           |   0.0040201  |  0           | +             |
+| arithmetic_5da                                            | acc             |   0.0005     |  0.0005      |   0          |  0           | -             |
+| mutual                                                    | r@1             |   0.468397   |  0.0167737   |   0.455982   |  0.0167421   |               |
+| mutual                                                    | r@2             |   0.735892   |  0.0148193   |   0.732506   |  0.0148796   |               |
+| mutual                                                    | mrr             |   0.682186   |  0.0103375   |   0.675226   |  0.0103132   |               |
+| blimp_irregular_past_participle_verbs                     | acc             |   0.876      |  0.0104275   |   0.869      |  0.0106749   |               |
+| ethics_deontology                                         | acc             |   0.523637   |  0.0083298   |   0.497775   |  0.00833904  | -             |
+| ethics_deontology                                         | em              |   0.0355951  |  0           |   0.00333704 |  0           | -             |
+| blimp_transitive                                          | acc             |   0.855      |  0.01114     |   0.818      |  0.0122076   | -             |
+| hendrycksTest-college_computer_science                    | acc             |   0.27       |  0.0446196   |   0.29       |  0.0456048   |               |
+| hendrycksTest-college_computer_science                    | acc_norm        |   0.26       |  0.0440844   |   0.27       |  0.0446196   |               |
+| hendrycksTest-professional_medicine                       | acc             |   0.272059   |  0.027033    |   0.283088   |  0.0273659   |               |
+| hendrycksTest-professional_medicine                       | acc_norm        |   0.261029   |  0.0266793   |   0.279412   |  0.0272572   |               |
+| sciq                                                      | acc             |   0.915      |  0.00882343  |   0.895      |  0.00969892  | -             |
+| sciq                                                      | acc_norm        |   0.874      |  0.0104992   |   0.869      |  0.0106749   |               |
+| blimp_anaphor_number_agreement                            | acc             |   0.995      |  0.00223159  |   0.993      |  0.00263779  |               |
+| blimp_wh_questions_subject_gap                            | acc             |   0.913      |  0.00891687  |   0.925      |  0.00833333  | +             |
+| blimp_wh_vs_that_with_gap                                 | acc             |   0.429      |  0.015659    |   0.482      |  0.015809    | +             |
+| math_num_theory                                           | acc             |   0.0203704  |  0.00608466  |   0.0351852  |  0.00793611  | +             |
+| blimp_complex_NP_island                                   | acc             |   0.535      |  0.0157805   |   0.538      |  0.0157735   |               |
+| blimp_expletive_it_object_raising                         | acc             |   0.78       |  0.0131062   |   0.777      |  0.0131698   |               |
+| lambada_mt_en                                             | ppl             |   4.10224    |  0.0884971   |   4.62504    |  0.10549     | -             |
+| lambada_mt_en                                             | acc             |   0.682127   |  0.00648741  |   0.648554   |  0.00665142  | -             |
+| hendrycksTest-formal_logic                                | acc             |   0.34127    |  0.042408    |   0.309524   |  0.0413491   |               |
+| hendrycksTest-formal_logic                                | acc_norm        |   0.325397   |  0.041906    |   0.325397   |  0.041906    |               |
+| blimp_matrix_question_npi_licensor_present                | acc             |   0.727      |  0.014095    |   0.663      |  0.0149551   | -             |
+| blimp_superlative_quantifiers_1                           | acc             |   0.871      |  0.0106053   |   0.791      |  0.0128641   | -             |
+| lambada_mt_de                                             | ppl             |  82.2416     |  4.88447     |  89.7905     |  5.30301     | -             |
+| lambada_mt_de                                             | acc             |   0.312827   |  0.00645948  |   0.312245   |  0.0064562   |               |
+| hendrycksTest-computer_security                           | acc             |   0.27       |  0.0446196   |   0.37       |  0.0485237   | +             |
+| hendrycksTest-computer_security                           | acc_norm        |   0.33       |  0.0472582   |   0.37       |  0.0485237   |               |
+| ethics_justice                                            | acc             |   0.526627   |  0.00960352  |   0.501479   |  0.00961712  | -             |
+| ethics_justice                                            | em              |   0.0251479  |  0           |   0          |  0           | -             |
+| blimp_principle_A_reconstruction                          | acc             |   0.444      |  0.0157198   |   0.296      |  0.0144427   | -             |
+| blimp_existential_there_subject_raising                   | acc             |   0.875      |  0.0104635   |   0.877      |  0.0103913   |               |
+| math_precalc                                              | acc             |   0.0018315  |  0.0018315   |   0.014652   |  0.00514689  | +             |
+| qasper                                                    | f1_yesno        |   0.666667   |  0.0311266   |   0.632997   |  0.032868    | -             |
+| qasper                                                    | f1_abstractive  |   0.118383   |  0.00692993  |   0.113489   |  0.00729073  |               |
+| cb                                                        | acc             |   0.357143   |  0.0646096   |   0.196429   |  0.0535714   | -             |
+| cb                                                        | f1              |   0.288109   |  0           |   0.149038   |  0           | -             |
+| blimp_animate_subject_trans                               | acc             |   0.868      |  0.0107094   |   0.858      |  0.0110435   |               |
+| hendrycksTest-high_school_statistics                      | acc             |   0.291667   |  0.0309987   |   0.310185   |  0.031547    |               |
+| hendrycksTest-high_school_statistics                      | acc_norm        |   0.314815   |  0.0316747   |   0.361111   |  0.0327577   | +             |
+| blimp_irregular_plural_subject_verb_agreement_2           | acc             |   0.919      |  0.00863212  |   0.881      |  0.0102442   | -             |
+| lambada_mt_es                                             | ppl             |  83.6696     |  4.57489     |  92.1172     |  5.05064     | -             |
+| lambada_mt_es                                             | acc             |   0.326994   |  0.00653569  |   0.322337   |  0.00651139  |               |
+| anli_r2                                                   | acc             |   0.337      |  0.0149551   |   0.327      |  0.0148422   |               |
+| hendrycksTest-nutrition                                   | acc             |   0.346405   |  0.0272456   |   0.346405   |  0.0272456   |               |
+| hendrycksTest-nutrition                                   | acc_norm        |   0.401961   |  0.0280742   |   0.385621   |  0.0278707   |               |
+| anli_r3                                                   | acc             |   0.3525     |  0.0137972   |   0.336667   |  0.0136476   | -             |
+| blimp_regular_plural_subject_verb_agreement_2             | acc             |   0.916      |  0.00877616  |   0.897      |  0.00961683  | -             |
+| blimp_tough_vs_raising_2                                  | acc             |   0.857      |  0.0110758   |   0.826      |  0.0119945   | -             |
+| mnli                                                      | acc             |   0.374733   |  0.00488619  |   0.316047   |  0.00469317  | -             |
+| drop                                                      | em              |   0.0228607  |  0.0015306   |   0.0595638  |  0.00242379  | +             |
+| drop                                                      | f1              |   0.103871   |  0.00219977  |   0.120355   |  0.00270951  | +             |
+| blimp_determiner_noun_agreement_with_adj_2                | acc             |   0.936      |  0.00774364  |   0.95       |  0.00689547  | +             |
+| arithmetic_2dm                                            | acc             |   0.14       |  0.00776081  |   0.061      |  0.00535293  | -             |
+| blimp_determiner_noun_agreement_irregular_2               | acc             |   0.932      |  0.00796489  |   0.93       |  0.00807249  |               |
+| lambada                                                   | ppl             |   4.10224    |  0.0884971   |   4.62504    |  0.10549     | -             |
+| lambada                                                   | acc             |   0.682127   |  0.00648741  |   0.648554   |  0.00665142  | -             |
+| arithmetic_3da                                            | acc             |   0.0865     |  0.00628718  |   0.007      |  0.00186474  | -             |
+| blimp_irregular_past_participle_adjectives                | acc             |   0.956      |  0.00648892  |   0.947      |  0.00708811  | -             |
+| hendrycksTest-college_biology                             | acc             |   0.284722   |  0.0377381   |   0.201389   |  0.0335365   | -             |
+| hendrycksTest-college_biology                             | acc_norm        |   0.270833   |  0.0371618   |   0.222222   |  0.0347659   | -             |
+| headqa_en                                                 | acc             |   0.335522   |  0.00901875  |   0.324945   |  0.00894582  | -             |
+| headqa_en                                                 | acc_norm        |   0.383297   |  0.00928648  |   0.375638   |  0.00925014  |               |
+| blimp_determiner_noun_agreement_irregular_1               | acc             |   0.944      |  0.0072744   |   0.912      |  0.00896305  | -             |
+| blimp_existential_there_quantifiers_1                     | acc             |   0.981      |  0.00431945  |   0.985      |  0.00384575  |               |
+| blimp_inchoative                                          | acc             |   0.683      |  0.0147217   |   0.653      |  0.0150605   | -             |
+| mutual_plus                                               | r@1             |   0.409707   |  0.016531    |   0.395034   |  0.0164328   |               |
+| mutual_plus                                               | r@2             |   0.680587   |  0.0156728   |   0.674944   |  0.015745    |               |
+| mutual_plus                                               | mrr             |   0.640801   |  0.0104141   |   0.632713   |  0.0103391   |               |
+| blimp_tough_vs_raising_1                                  | acc             |   0.734      |  0.01398     |   0.736      |  0.0139463   |               |
+| winogrande                                                | acc             |   0.640884   |  0.0134831   |   0.636148   |  0.0135215   |               |
+| race                                                      | acc             |   0.37512    |  0.0149842   |   0.374163   |  0.0149765   |               |
+| blimp_irregular_plural_subject_verb_agreement_1           | acc             |   0.918      |  0.00868052  |   0.908      |  0.00914438  | -             |
+| hendrycksTest-high_school_macroeconomics                  | acc             |   0.284615   |  0.0228783   |   0.284615   |  0.0228783   |               |
+| hendrycksTest-high_school_macroeconomics                  | acc_norm        |   0.276923   |  0.022688    |   0.284615   |  0.0228783   |               |
+| blimp_adjunct_island                                      | acc             |   0.902      |  0.00940662  |   0.888      |  0.00997775  | -             |
+| hendrycksTest-high_school_chemistry                       | acc             |   0.211823   |  0.028749    |   0.236453   |  0.0298961   |               |
+| hendrycksTest-high_school_chemistry                       | acc_norm        |   0.29064    |  0.0319474   |   0.300493   |  0.032258    |               |
+| arithmetic_2ds                                            | acc             |   0.218      |  0.00923475  |   0.051      |  0.00492053  | -             |
+| blimp_principle_A_case_2                                  | acc             |   0.953      |  0.00669596  |   0.955      |  0.00655881  |               |
+| blimp_only_npi_licensor_present                           | acc             |   0.953      |  0.00669596  |   0.926      |  0.00828206  | -             |
+| math_counting_and_prob                                    | acc             |   0.0021097  |  0.0021097   |   0.0274262  |  0.00750954  | +             |
+| cola                                                      | mcc             |  -0.0504508  |  0.0251594   |  -0.0854256  |  0.0304519   | -             |
+| webqs                                                     | acc             |   0.0226378  |  0.00330058  |   0.023622   |  0.00336987  |               |
+| arithmetic_4ds                                            | acc             |   0.0055     |  0.00165416  |   0.0005     |  0.0005      | -             |
+| blimp_wh_vs_that_no_gap_long_distance                     | acc             |   0.939      |  0.00757208  |   0.94       |  0.00751375  |               |
+| pile_bookcorpus2                                          | word_perplexity |  27.0559     |  0           |  28.7786     |  0           | -             |
+| pile_bookcorpus2                                          | byte_perplexity |   1.78037    |  0           |   1.79969    |  0           | -             |
+| pile_bookcorpus2                                          | bits_per_byte   |   0.832176   |  0           |   0.847751   |  0           | -             |
+| blimp_sentential_negation_npi_licensor_present            | acc             |   0.982      |  0.00420639  |   0.994      |  0.00244335  | +             |
+| hendrycksTest-high_school_government_and_politics         | acc             |   0.227979   |  0.0302769   |   0.274611   |  0.0322102   | +             |
+| hendrycksTest-high_school_government_and_politics         | acc_norm        |   0.248705   |  0.0311958   |   0.259067   |  0.0316188   |               |
+| blimp_ellipsis_n_bar_2                                    | acc             |   0.916      |  0.00877616  |   0.937      |  0.00768701  | +             |
+| hendrycksTest-clinical_knowledge                          | acc             |   0.267925   |  0.0272573   |   0.283019   |  0.0277242   |               |
+| hendrycksTest-clinical_knowledge                          | acc_norm        |   0.316981   |  0.0286372   |   0.343396   |  0.0292245   |               |
+| mc_taco                                                   | em              |   0.132883   |  0           |   0.125375   |  0           | -             |
+| mc_taco                                                   | f1              |   0.499712   |  0           |   0.487131   |  0           | -             |
+| wsc                                                       | acc             |   0.365385   |  0.0474473   |   0.365385   |  0.0474473   |               |
+| hendrycksTest-college_medicine                            | acc             |   0.190751   |  0.0299579   |   0.231214   |  0.0321474   | +             |
+| hendrycksTest-college_medicine                            | acc_norm        |   0.265896   |  0.0336876   |   0.289017   |  0.0345643   |               |
+| hendrycksTest-high_school_world_history                   | acc             |   0.2827     |  0.0293128   |   0.295359   |  0.0296963   |               |
+| hendrycksTest-high_school_world_history                   | acc_norm        |   0.312236   |  0.0301651   |   0.312236   |  0.0301651   |               |
+| hendrycksTest-anatomy                                     | acc             |   0.281481   |  0.03885     |   0.296296   |  0.0394462   |               |
+| hendrycksTest-anatomy                                     | acc_norm        |   0.266667   |  0.0382017   |   0.288889   |  0.0391545   |               |
+| hendrycksTest-jurisprudence                               | acc             |   0.277778   |  0.0433004   |   0.25       |  0.0418609   |               |
+| hendrycksTest-jurisprudence                               | acc_norm        |   0.425926   |  0.0478034   |   0.416667   |  0.0476608   |               |
+| logiqa                                                    | acc             |   0.211982   |  0.016031    |   0.193548   |  0.0154963   | -             |
+| logiqa                                                    | acc_norm        |   0.291859   |  0.0178316   |   0.281106   |  0.0176324   |               |
+| ethics_utilitarianism_original                            | acc             |   0.941556   |  0.00338343  |   0.767679   |  0.00609112  | -             |
+| blimp_principle_A_c_command                               | acc             |   0.81       |  0.0124119   |   0.827      |  0.0119672   | +             |
+| blimp_coordinate_structure_constraint_complex_left_branch | acc             |   0.764      |  0.0134345   |   0.794      |  0.0127956   | +             |
+| arithmetic_5ds                                            | acc             |   0          |  0           |   0          |  0           |               |
+| lambada_mt_it                                             | ppl             |  86.66       |  5.1869      |  96.8846     |  5.80902     | -             |
+| lambada_mt_it                                             | acc             |   0.336891   |  0.0065849   |   0.328158   |  0.00654165  | -             |
+| wsc273                                                    | acc             |   0.827839   |  0.0228905   |   0.827839   |  0.0228905   |               |
+| blimp_coordinate_structure_constraint_object_extraction   | acc             |   0.876      |  0.0104275   |   0.852      |  0.0112349   | -             |
+| blimp_principle_A_domain_3                                | acc             |   0.819      |  0.0121814   |   0.79       |  0.0128867   | -             |
+| blimp_left_branch_island_echo_question                    | acc             |   0.519      |  0.0158079   |   0.638      |  0.0152048   | +             |
+| rte                                                       | acc             |   0.548736   |  0.0299531   |   0.534296   |  0.0300256   |               |
+| blimp_passive_2                                           | acc             |   0.899      |  0.00953362  |   0.892      |  0.00982     |               |
+| hendrycksTest-electrical_engineering                      | acc             |   0.358621   |  0.0399663   |   0.344828   |  0.0396093   |               |
+| hendrycksTest-electrical_engineering                      | acc_norm        |   0.372414   |  0.0402873   |   0.372414   |  0.0402873   |               |
+| sst                                                       | acc             |   0.493119   |  0.0169402   |   0.626147   |  0.0163938   | +             |
+| blimp_npi_present_1                                       | acc             |   0.576      |  0.0156355   |   0.565      |  0.0156851   |               |
+| piqa                                                      | acc             |   0.754081   |  0.0100473   |   0.739391   |  0.0102418   | -             |
+| piqa                                                      | acc_norm        |   0.761697   |  0.00994033  |   0.755169   |  0.0100323   |               |
+| hendrycksTest-professional_accounting                     | acc             |   0.265957   |  0.0263581   |   0.312057   |  0.0276401   | +             |
+| hendrycksTest-professional_accounting                     | acc_norm        |   0.22695    |  0.0249871   |   0.27305    |  0.0265779   | +             |
+| arc_challenge                                             | acc             |   0.337884   |  0.013822    |   0.325085   |  0.0136881   |               |
+| arc_challenge                                             | acc_norm        |   0.366041   |  0.0140772   |   0.352389   |  0.0139601   |               |
+| hendrycksTest-econometrics                                | acc             |   0.245614   |  0.0404934   |   0.263158   |  0.0414244   |               |
+| hendrycksTest-econometrics                                | acc_norm        |   0.27193    |  0.0418577   |   0.254386   |  0.0409699   |               |
+| headqa                                                    | acc             |   0.251276   |  0.0082848   |   0.238877   |  0.00814442  | -             |
+| headqa                                                    | acc_norm        |   0.286652   |  0.00863721  |   0.290664   |  0.00867295  |               |
+| wic                                                       | acc             |   0.5        |  0.0198107   |   0.482759   |  0.0197989   |               |
+| hendrycksTest-high_school_biology                         | acc             |   0.251613   |  0.024686    |   0.270968   |  0.0252844   |               |
+| hendrycksTest-high_school_biology                         | acc_norm        |   0.283871   |  0.0256494   |   0.274194   |  0.0253781   |               |
+| hendrycksTest-management                                  | acc             |   0.23301    |  0.0418583   |   0.281553   |  0.0445325   | +             |
+| hendrycksTest-management                                  | acc_norm        |   0.320388   |  0.0462028   |   0.291262   |  0.0449868   |               |
+| blimp_npi_present_2                                       | acc             |   0.664      |  0.0149441   |   0.645      |  0.0151395   | -             |
+| hendrycksTest-prehistory                                  | acc             |   0.243827   |  0.0238919   |   0.265432   |  0.0245692   |               |
+| hendrycksTest-prehistory                                  | acc_norm        |   0.219136   |  0.0230167   |   0.225309   |  0.0232462   |               |
+| hendrycksTest-world_religions                             | acc             |   0.333333   |  0.0361551   |   0.321637   |  0.0358253   |               |
+| hendrycksTest-world_religions                             | acc_norm        |   0.380117   |  0.0372297   |   0.397661   |  0.0375364   |               |
+| math_intermediate_algebra                                 | acc             |   0.00332226 |  0.00191598  |   0.00996678 |  0.00330749  | +             |
+| anagrams2                                                 | acc             |   0.0055     |  0.000739615 |   0.0347     |  0.00183028  | +             |
+| arc_easy                                                  | acc             |   0.669613   |  0.00965143  |   0.647306   |  0.00980442  | -             |
+| arc_easy                                                  | acc_norm        |   0.622896   |  0.00994504  |   0.609848   |  0.0100091   | -             |
+| blimp_anaphor_gender_agreement                            | acc             |   0.994      |  0.00244335  |   0.993      |  0.00263779  |               |
+| hendrycksTest-marketing                                   | acc             |   0.307692   |  0.0302364   |   0.311966   |  0.0303515   |               |
+| hendrycksTest-marketing                                   | acc_norm        |   0.294872   |  0.0298726   |   0.34188    |  0.031075    | +             |
+| blimp_principle_A_domain_1                                | acc             |   0.997      |  0.00173032  |   0.997      |  0.00173032  |               |
+| blimp_wh_island                                           | acc             |   0.852      |  0.0112349   |   0.856      |  0.011108    |               |
+| hendrycksTest-sociology                                   | acc             |   0.278607   |  0.0317006   |   0.303483   |  0.0325101   |               |
+| hendrycksTest-sociology                                   | acc_norm        |   0.318408   |  0.0329412   |   0.298507   |  0.0323574   |               |
+| blimp_distractor_agreement_relative_clause                | acc             |   0.719      |  0.0142212   |   0.774      |  0.0132325   | +             |
+| truthfulqa_gen                                            | bleurt_max      |  -0.814228   |  0.0172128   |  -0.811655   |  0.0180743   |               |
+| truthfulqa_gen                                            | bleurt_acc      |   0.329253   |  0.0164513   |   0.395349   |  0.0171158   | +             |
+| truthfulqa_gen                                            | bleurt_diff     |  -0.185905   |  0.0169617   |  -0.0488385  |  0.0204525   | +             |
+| truthfulqa_gen                                            | bleu_max        |  20.2238     |  0.711772    |  20.8747     |  0.717003    |               |
+| truthfulqa_gen                                            | bleu_acc        |   0.281518   |  0.015744    |   0.330477   |  0.0164668   | +             |
+| truthfulqa_gen                                            | bleu_diff       |  -6.66121    |  0.719366    |  -2.12856    |  0.832693    | +             |
+| truthfulqa_gen                                            | rouge1_max      |  45.3457     |  0.89238     |  47.0293     |  0.962404    | +             |
+| truthfulqa_gen                                            | rouge1_acc      |   0.257038   |  0.0152981   |   0.341493   |  0.0166007   | +             |
+| truthfulqa_gen                                            | rouge1_diff     | -10.1049     |  0.8922      |  -2.29454    |  1.2086      | +             |
+| truthfulqa_gen                                            | rouge2_max      |  28.7438     |  0.981282    |  31.0617     |  1.08725     | +             |
+| truthfulqa_gen                                            | rouge2_acc      |   0.201958   |  0.014054    |   0.247246   |  0.0151024   | +             |
+| truthfulqa_gen                                            | rouge2_diff     | -11.0916     |  1.01664     |  -2.84021    |  1.28749     | +             |
+| truthfulqa_gen                                            | rougeL_max      |  42.6116     |  0.893252    |  44.6463     |  0.966119    | +             |
+| truthfulqa_gen                                            | rougeL_acc      |   0.24235    |  0.0150007   |   0.334149   |  0.0165125   | +             |
+| truthfulqa_gen                                            | rougeL_diff     | -10.4299     |  0.904205    |  -2.50853    |  1.22016     | +             |
+| hendrycksTest-public_relations                            | acc             |   0.281818   |  0.0430912   |   0.3        |  0.0438931   |               |
+| hendrycksTest-public_relations                            | acc_norm        |   0.163636   |  0.0354343   |   0.190909   |  0.0376443   |               |
+| blimp_distractor_agreement_relational_noun                | acc             |   0.833      |  0.0118004   |   0.859      |  0.0110109   | +             |
+| lambada_mt_fr                                             | ppl             |  51.7313     |  2.90272     |  57.0379     |  3.15719     | -             |
+| lambada_mt_fr                                             | acc             |   0.40947    |  0.00685084  |   0.388512   |  0.0067906   | -             |
+| blimp_principle_A_case_1                                  | acc             |   1          |  0           |   1          |  0           |               |
+| hendrycksTest-medical_genetics                            | acc             |   0.31       |  0.0464823   |   0.37       |  0.0485237   | +             |
+| hendrycksTest-medical_genetics                            | acc_norm        |   0.39       |  0.0490207   |   0.41       |  0.0494311   |               |
+| qqp                                                       | acc             |   0.383626   |  0.00241841  |   0.364383   |  0.00239348  | -             |
+| qqp                                                       | f1              |   0.451222   |  0.00289696  |   0.516391   |  0.00263674  | +             |
+| iwslt17-en-ar                                             | bleu            |   4.98225    |  0.275369    |   2.35563    |  0.188638    | -             |
+| iwslt17-en-ar                                             | chrf            |   0.277708   |  0.00415432  |   0.140912   |  0.00503101  | -             |
+| iwslt17-en-ar                                             | ter             |   0.954701   |  0.0126737   |   1.0909     |  0.0122111   | -             |
+| multirc                                                   | acc             |   0.0178384  |  0.00428994  |   0.0409234  |  0.00642087  | +             |
+| hendrycksTest-human_aging                                 | acc             |   0.264574   |  0.0296051   |   0.264574   |  0.0296051   |               |
+| hendrycksTest-human_aging                                 | acc_norm        |   0.237668   |  0.0285681   |   0.197309   |  0.0267099   | -             |
+| reversed_words                                            | acc             |   0          |  0           |   0.0003     |  0.000173188 | +             |
+<figcaption><p>Some results are missing due to errors or computational constraints.</p>
+</figcaption></figure>