---
inference: false
language:
- en
tags:
- text-generation
- pytorch
- causal-lm
license: apache-2.0
---

# GPT-4chan

Project Website: [https://gpt-4chan.com](https://gpt-4chan.com)

## Model Description

GPT-4chan is a language model fine-tuned from [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) on 3.5 years' worth of data from 4chan's _politically incorrect_ (/pol/) board.

## Training data

GPT-4chan was fine-tuned on the dataset [Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board](https://zenodo.org/record/3606810).

## Training procedure

The model was trained for 1 epoch following [GPT-J's fine-tuning guide](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md).

## Intended Use

GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool for investigating discourse in such anonymous online communities. Lastly, it has potential applications in tasks such as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B (a rough sketch of this comparison follows the usage examples below).

### How to use

The following is adapted from the [Hugging Face documentation on GPT-J](https://huggingface.co/docs/transformers/main/en/model_doc/gptj#generation). Refer to the original for more details.

For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.

For the float32 model (CPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```

For the float16 model (GPU):

```python
import torch
from transformers import GPTJForCausalLM, AutoTokenizer

# Load the half-precision weights from the "float16" revision of the repository.
model = GPTJForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.cuda()

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```
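The zero-shot toxicity-detection idea mentioned under "Intended Use" can be prototyped by comparing a string's total log-likelihood under GPT-4chan with its log-likelihood under the base GPT-J 6B model. The following is a minimal sketch of that comparison, not the exact procedure used for the experiments referenced above; the score definition and any decision threshold are illustrative assumptions, and loading both 6B-parameter models requires substantial memory.

```python
# Minimal sketch (illustrative only): score a string by how much more likely it is
# under GPT-4chan than under GPT-J 6B. Loading two 6B-parameter models needs a large GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gpt4chan = AutoModelForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16
).cuda().eval()
gptj = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16
).cuda().eval()


@torch.no_grad()
def total_log_likelihood(model, text: str) -> float:
    # The causal-LM loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens to get the total log-likelihood.
    input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
    loss = model(input_ids, labels=input_ids).loss
    return -loss.item() * (input_ids.shape[1] - 1)


def likelihood_ratio(text: str) -> float:
    # Higher values mean the text is relatively more likely under GPT-4chan;
    # how to turn this score into a toxicity decision is left open here.
    return total_log_likelihood(gpt4chan, text) - total_log_likelihood(gptj, text)


print(likelihood_ratio("some example string to score"))
```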
### Limitations and Biases

This is a statistical model. As such, it continues text as is likely under the distribution the model has learned from the training data. Outputs should not be interpreted as "correct", "truthful", or otherwise as anything more than a statistical function of the input. That being said, GPT-4chan does significantly outperform GPT-J (and GPT-3) on the [TruthfulQA Benchmark](https://arxiv.org/abs/2109.07958), which measures whether a language model is truthful in generating answers to questions.

The dataset is time- and domain-limited: it was collected from 2016 to 2019 on 4chan's _politically incorrect_ board. As such, political topics from that era will be overrepresented in the model's distribution compared to other models (e.g. GPT-J 6B). Also, due to the very lax rules and the anonymity of posters, a large part of the dataset contains offensive material. Thus, it is **very likely that the model will produce offensive outputs**, including but not limited to toxicity, hate speech, racism, sexism, homo- and transphobia, xenophobia, and anti-semitism.

Due to the above limitations, it is strongly recommended not to deploy this model in a real-world environment unless its behavior is well understood and explicit, strict limitations on the scope, impact, and duration of the deployment are enforced.

## Evaluation results

### Language Model Evaluation Harness

The table below compares GPT-J 6B to GPT-4chan on a subset of the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). Differences exceeding the standard errors are marked in the "Significant" column, with a minus sign (-) indicating an advantage for GPT-J 6B and a plus sign (+) indicating an advantage for GPT-4chan.
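To re-run a subset of these tasks, something along the following lines can be used. This is a minimal sketch that assumes a recent `lm-eval` release exposing `lm_eval.simple_evaluate`; the Python API, the task names (the `hendrycksTest-*` tasks in the table have since been renamed), and the exact scores all vary between harness versions.

```python
# Minimal sketch, assuming a recent lm-evaluation-harness release ("pip install lm-eval")
# that exposes lm_eval.simple_evaluate. Task names and results differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=ykilcher/gpt-4chan,dtype=float16",
    tasks=["hellaswag", "openbookqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```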
| Task | Metric | GPT-4chan | stderr | GPT-J-6B | stderr | Significant |
|:---|:---|---:|---:|---:|---:|:---|
| copa | acc | 0.85 | 0.035887 | 0.83 | 0.0377525 | |
| blimp_only_npi_scope | acc | 0.712 | 0.0143269 | 0.787 | 0.0129537 | - |
| hendrycksTest-conceptual_physics | acc | 0.251064 | 0.028347 | 0.255319 | 0.0285049 | |
| hendrycksTest-conceptual_physics | acc_norm | 0.187234 | 0.0255016 | 0.191489 | 0.0257221 | |
| hendrycksTest-high_school_mathematics | acc | 0.248148 | 0.0263357 | 0.218519 | 0.0251958 | + |
| hendrycksTest-high_school_mathematics | acc_norm | 0.3 | 0.0279405 | 0.251852 | 0.0264661 | + |
| blimp_sentential_negation_npi_scope | acc | 0.734 | 0.01398 | 0.733 | 0.0139967 | |
| hendrycksTest-high_school_european_history | acc | 0.278788 | 0.0350144 | 0.260606 | 0.0342774 | |
| hendrycksTest-high_school_european_history | acc_norm | 0.315152 | 0.0362773 | 0.278788 | 0.0350144 | + |
| blimp_wh_questions_object_gap | acc | 0.841 | 0.0115695 | 0.835 | 0.0117436 | |
| hendrycksTest-international_law | acc | 0.214876 | 0.0374949 | 0.264463 | 0.0402619 | - |
| hendrycksTest-international_law | acc_norm | 0.438017 | 0.0452915 | 0.404959 | 0.0448114 | |
| hendrycksTest-high_school_us_history | acc | 0.323529 | 0.0328347 | 0.289216 | 0.0318223 | + |
| hendrycksTest-high_school_us_history | acc_norm | 0.323529 | 0.0328347 | 0.29902 | 0.0321333 | |
| openbookqa | acc | 0.276 | 0.0200112 | 0.29 | 0.0203132 | |
| openbookqa | acc_norm | 0.362 | 0.0215137 | 0.382 | 0.0217508 | |
| blimp_causative | acc | 0.737 | 0.0139293 | 0.761 | 0.013493 | - |
| record | f1 | 0.878443 | 0.00322394 | 0.885049 | 0.00314367 | - |
| record | em | 0.8702 | 0.003361 | 0.8765 | 0.00329027 | - |
| blimp_determiner_noun_agreement_1 | acc | 0.996 | 0.00199699 | 0.995 | 0.00223159 | |
| hendrycksTest-miscellaneous | acc | 0.305236 | 0.0164677 | 0.274585 | 0.0159598 | + |
| hendrycksTest-miscellaneous | acc_norm | 0.269476 | 0.0158662 | 0.260536 | 0.015696 | |
| hendrycksTest-virology | acc | 0.343373 | 0.0369658 | 0.349398 | 0.0371173 | |
| hendrycksTest-virology | acc_norm | 0.331325 | 0.0366431 | 0.325301 | 0.0364717 | |
| mathqa | acc | 0.269012 | 0.00811786 | 0.267002 | 0.00809858 | |
| mathqa | acc_norm | 0.261642 | 0.00804614 | 0.270687 | 0.00813376 | - |
| squad2 | exact | 10.6123 | 0 | 10.6207 | 0 | - |
| squad2 | f1 | 17.8734 | 0 | 17.7413 | 0 | + |
| squad2 | HasAns_exact | 17.2571 | 0 | 15.5027 | 0 | + |
| squad2 | HasAns_f1 | 31.8 | 0 | 29.7643 | 0 | + |
| squad2 | NoAns_exact | 3.98654 | 0 | 5.75273 | 0 | - |
| squad2 | NoAns_f1 | 3.98654 | 0 | 5.75273 | 0 | - |
| squad2 | best_exact | 50.0716 | 0 | 50.0716 | 0 | |
| squad2 | best_f1 | 50.077 | 0 | 50.0778 | 0 | - |
| mnli_mismatched | acc | 0.320586 | 0.00470696 | 0.376627 | 0.00488687 | - |
| blimp_animate_subject_passive | acc | 0.79 | 0.0128867 | 0.781 | 0.0130847 | |
| blimp_determiner_noun_agreement_with_adj_irregular_1 | acc | 0.834 | 0.0117721 | 0.878 | 0.0103549 | - |
| qnli | acc | 0.491305 | 0.00676439 | 0.513454 | 0.00676296 | - |
| blimp_intransitive | acc | 0.806 | 0.0125108 | 0.858 | 0.0110435 | - |
| ethics_cm | acc | 0.512227 | 0.00802048 | 0.559846 | 0.00796521 | - |
| hendrycksTest-high_school_computer_science | acc | 0.2 | 0.0402015 | 0.25 | 0.0435194 | - |
| hendrycksTest-high_school_computer_science | acc_norm | 0.26 | 0.0440844 | 0.27 | 0.0446196 | |
| iwslt17-ar-en | bleu | 21.4685 | 0.64825 | 20.7322 | 0.795602 | + |
| iwslt17-ar-en | chrf | 0.452175 | 0.00498012 | 0.450919 | 0.00526515 | |
| iwslt17-ar-en | ter | 0.733514 | 0.0201688 | 0.787631 | 0.0285488 | + |
| hendrycksTest-security_studies | acc | 0.391837 | 0.0312513 | 0.363265 | 0.0307891 | |
| hendrycksTest-security_studies | acc_norm | 0.285714 | 0.0289206 | 0.285714 | 0.0289206 | |
| hendrycksTest-global_facts | acc | 0.29 | 0.0456048 | 0.25 | 0.0435194 | |
| hendrycksTest-global_facts | acc_norm | 0.26 | 0.0440844 | 0.22 | 0.0416333 | |
| anli_r1 | acc | 0.297 | 0.0144568 | 0.322 | 0.0147829 | - |
| blimp_left_branch_island_simple_question | acc | 0.884 | 0.0101315 | 0.867 | 0.0107437 | + |
| hendrycksTest-astronomy | acc | 0.25 | 0.0352381 | 0.25 | 0.0352381 | |
| hendrycksTest-astronomy | acc_norm | 0.348684 | 0.0387814 | 0.335526 | 0.038425 | |
| mrpc | acc | 0.536765 | 0.024717 | 0.683824 | 0.0230483 | - |
| mrpc | f1 | 0.63301 | 0.0247985 | 0.812227 | 0.0162476 | - |
| ethics_utilitarianism | acc | 0.525374 | 0.00720233 | 0.509775 | 0.00721024 | + |
| blimp_determiner_noun_agreement_2 | acc | 0.99 | 0.003148 | 0.977 | 0.00474273 | + |
| lambada_cloze | ppl | 388.123 | 13.1523 | 405.646 | 14.5519 | + |
| lambada_cloze | acc | 0.0116437 | 0.00149456 | 0.0199884 | 0.00194992 | - |
| truthfulqa_mc | mc1 | 0.225214 | 0.0146232 | 0.201958 | 0.014054 | + |
| truthfulqa_mc | mc2 | 0.371625 | 0.0136558 | 0.359537 | 0.0134598 | |
| blimp_wh_vs_that_with_gap_long_distance | acc | 0.441 | 0.0157088 | 0.342 | 0.0150087 | + |
| hendrycksTest-business_ethics | acc | 0.28 | 0.0451261 | 0.29 | 0.0456048 | |
| hendrycksTest-business_ethics | acc_norm | 0.29 | 0.0456048 | 0.3 | 0.0460566 | |
| arithmetic_3ds | acc | 0.0065 | 0.00179736 | 0.046 | 0.0046854 | - |
| blimp_determiner_noun_agreement_with_adjective_1 | acc | 0.988 | 0.00344498 | 0.978 | 0.00464086 | + |
| hendrycksTest-moral_disputes | acc | 0.277457 | 0.0241057 | 0.283237 | 0.0242579 | |
| hendrycksTest-moral_disputes | acc_norm | 0.309249 | 0.0248831 | 0.32659 | 0.0252483 | |
| arithmetic_2da | acc | 0.0455 | 0.00466109 | 0.2405 | 0.00955906 | - |
| qa4mre_2011 | acc | 0.425 | 0.0453163 | 0.458333 | 0.0456755 | |
| qa4mre_2011 | acc_norm | 0.558333 | 0.0455219 | 0.533333 | 0.045733 | |
| blimp_regular_plural_subject_verb_agreement_1 | acc | 0.966 | 0.00573384 | 0.968 | 0.00556839 | |
| hendrycksTest-human_sexuality | acc | 0.389313 | 0.0427649 | 0.396947 | 0.0429114 | |
| hendrycksTest-human_sexuality | acc_norm | 0.305344 | 0.0403931 | 0.343511 | 0.0416498 | |
| blimp_passive_1 | acc | 0.878 | 0.0103549 | 0.885 | 0.0100934 | |
| blimp_drop_argument | acc | 0.784 | 0.0130197 | 0.823 | 0.0120755 | - |
| hendrycksTest-high_school_microeconomics | acc | 0.260504 | 0.0285103 | 0.277311 | 0.0290794 | |
| hendrycksTest-high_school_microeconomics | acc_norm | 0.390756 | 0.0316938 | 0.39916 | 0.0318111 | |
| hendrycksTest-us_foreign_policy | acc | 0.32 | 0.0468826 | 0.34 | 0.0476095 | |
| hendrycksTest-us_foreign_policy | acc_norm | 0.4 | 0.0492366 | 0.35 | 0.0479372 | + |
| blimp_ellipsis_n_bar_1 | acc | 0.846 | 0.0114199 | 0.841 | 0.0115695 | |
| hendrycksTest-high_school_physics | acc | 0.264901 | 0.0360304 | 0.271523 | 0.0363133 | |
| hendrycksTest-high_school_physics | acc_norm | 0.284768 | 0.0368488 | 0.271523 | 0.0363133 | |
| qa4mre_2013 | acc | 0.362676 | 0.028579 | 0.401408 | 0.0291384 | - |
| qa4mre_2013 | acc_norm | 0.387324 | 0.0289574 | 0.383803 | 0.0289082 | |
| blimp_wh_vs_that_no_gap | acc | 0.963 | 0.00597216 | 0.969 | 0.00548353 | - |
| headqa_es | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
| headqa_es | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
| blimp_sentential_subject_island | acc | 0.359 | 0.0151773 | 0.421 | 0.0156206 | - |
| hendrycksTest-philosophy | acc | 0.241158 | 0.0242966 | 0.26045 | 0.0249267 | |
| hendrycksTest-philosophy | acc_norm | 0.327974 | 0.0266644 | 0.334405 | 0.0267954 | |
| hendrycksTest-elementary_mathematics | acc | 0.248677 | 0.0222618 | 0.251323 | 0.0223405 | |
| hendrycksTest-elementary_mathematics | acc_norm | 0.275132 | 0.0230001 | 0.26455 | 0.0227175 | |
| math_geometry | acc | 0.0187891 | 0.00621042 | 0.0104384 | 0.00464863 | + |
| blimp_wh_questions_subject_gap_long_distance | acc | 0.886 | 0.0100551 | 0.883 | 0.0101693 | |
| hendrycksTest-college_physics | acc | 0.205882 | 0.0402338 | 0.205882 | 0.0402338 | |
| hendrycksTest-college_physics | acc_norm | 0.22549 | 0.0415831 | 0.245098 | 0.0428011 | |
| hellaswag | acc | 0.488747 | 0.00498852 | 0.49532 | 0.00498956 | - |
| hellaswag | acc_norm | 0.648277 | 0.00476532 | 0.66202 | 0.00472055 | - |
| hendrycksTest-logical_fallacies | acc | 0.269939 | 0.0348783 | 0.294479 | 0.0358117 | |
| hendrycksTest-logical_fallacies | acc_norm | 0.343558 | 0.0373113 | 0.355828 | 0.0376152 | |
| hendrycksTest-machine_learning | acc | 0.339286 | 0.0449395 | 0.223214 | 0.039523 | + |
| hendrycksTest-machine_learning | acc_norm | 0.205357 | 0.0383424 | 0.178571 | 0.0363521 | |
| hendrycksTest-high_school_psychology | acc | 0.286239 | 0.0193794 | 0.273394 | 0.0191093 | |
| hendrycksTest-high_school_psychology | acc_norm | 0.266055 | 0.018946 | 0.269725 | 0.0190285 | |
| prost | acc | 0.256298 | 0.00318967 | 0.268254 | 0.00323688 | - |
| prost | acc_norm | 0.280156 | 0.00328089 | 0.274658 | 0.00326093 | + |
| blimp_determiner_noun_agreement_with_adj_irregular_2 | acc | 0.898 | 0.00957537 | 0.916 | 0.00877616 | - |
| wnli | acc | 0.43662 | 0.0592794 | 0.464789 | 0.0596131 | |
| hendrycksTest-professional_law | acc | 0.284876 | 0.0115278 | 0.273794 | 0.0113886 | |
| hendrycksTest-professional_law | acc_norm | 0.301825 | 0.0117244 | 0.292699 | 0.0116209 | |
| math_algebra | acc | 0.0126369 | 0.00324352 | 0.0117944 | 0.00313487 | |
| wikitext | word_perplexity | 11.4687 | 0 | 10.8819 | 0 | - |
| wikitext | byte_perplexity | 1.5781 | 0 | 1.56268 | 0 | - |
| wikitext | bits_per_byte | 0.658188 | 0 | 0.644019 | 0 | - |
| anagrams1 | acc | 0.0125 | 0.00111108 | 0.0008 | 0.000282744 | + |
| math_prealgebra | acc | 0.0195178 | 0.00469003 | 0.0126292 | 0.00378589 | + |
| blimp_principle_A_domain_2 | acc | 0.887 | 0.0100166 | 0.889 | 0.0099387 | |
| cycle_letters | acc | 0.0331 | 0.00178907 | 0.0026 | 0.000509264 | + |
| hendrycksTest-college_mathematics | acc | 0.26 | 0.0440844 | 0.26 | 0.0440844 | |
| hendrycksTest-college_mathematics | acc_norm | 0.31 | 0.0464823 | 0.4 | 0.0492366 | - |
| arithmetic_1dc | acc | 0.077 | 0.00596266 | 0.089 | 0.00636866 | - |
| arithmetic_4da | acc | 0.0005 | 0.0005 | 0.007 | 0.00186474 | - |
| triviaqa | acc | 0.150888 | 0.00336543 | 0.167418 | 0.00351031 | - |
| boolq | acc | 0.673394 | 0.00820236 | 0.655352 | 0.00831224 | + |
| random_insertion | acc | 0.0004 | 0.00019997 | 0 | 0 | + |
| qa4mre_2012 | acc | 0.4 | 0.0388514 | 0.4125 | 0.0390407 | |
| qa4mre_2012 | acc_norm | 0.4625 | 0.0395409 | 0.50625 | 0.0396495 | - |
| math_asdiv | acc | 0.00997831 | 0.00207066 | 0.00563991 | 0.00156015 | + |
| hendrycksTest-moral_scenarios | acc | 0.236872 | 0.0142196 | 0.236872 | 0.0142196 | |
| hendrycksTest-moral_scenarios | acc_norm | 0.272626 | 0.0148934 | 0.272626 | 0.0148934 | |
| hendrycksTest-high_school_geography | acc | 0.247475 | 0.0307463 | 0.20202 | 0.0286062 | + |
| hendrycksTest-high_school_geography | acc_norm | 0.287879 | 0.0322588 | 0.292929 | 0.032425 | |
| gsm8k | acc | 0 | 0 | 0 | 0 | |
| blimp_existential_there_object_raising | acc | 0.812 | 0.0123616 | 0.792 | 0.0128414 | + |
| blimp_superlative_quantifiers_2 | acc | 0.917 | 0.00872853 | 0.865 | 0.0108117 | + |
| hendrycksTest-college_chemistry | acc | 0.28 | 0.0451261 | 0.24 | 0.0429235 | |
| hendrycksTest-college_chemistry | acc_norm | 0.31 | 0.0464823 | 0.28 | 0.0451261 | |
| blimp_existential_there_quantifiers_2 | acc | 0.545 | 0.0157551 | 0.383 | 0.0153801 | + |
| hendrycksTest-abstract_algebra | acc | 0.17 | 0.0377525 | 0.26 | 0.0440844 | - |
| hendrycksTest-abstract_algebra | acc_norm | 0.26 | 0.0440844 | 0.3 | 0.0460566 | |
| hendrycksTest-professional_psychology | acc | 0.26634 | 0.0178832 | 0.28268 | 0.0182173 | |
| hendrycksTest-professional_psychology | acc_norm | 0.256536 | 0.0176678 | 0.259804 | 0.0177409 | |
| ethics_virtue | acc | 0.249849 | 0.00613847 | 0.200201 | 0.00567376 | + |
| ethics_virtue | em | 0.0040201 | 0 | 0 | 0 | + |
| arithmetic_5da | acc | 0 | 0 | 0.0005 | 0.0005 | - |
| mutual | r@1 | 0.455982 | 0.0167421 | 0.468397 | 0.0167737 | |
| mutual | r@2 | 0.732506 | 0.0148796 | 0.735892 | 0.0148193 | |
| mutual | mrr | 0.675226 | 0.0103132 | 0.682186 | 0.0103375 | |
| blimp_irregular_past_participle_verbs | acc | 0.869 | 0.0106749 | 0.876 | 0.0104275 | |
| ethics_deontology | acc | 0.497775 | 0.00833904 | 0.523637 | 0.0083298 | - |
| ethics_deontology | em | 0.00333704 | 0 | 0.0355951 | 0 | - |
| blimp_transitive | acc | 0.818 | 0.0122076 | 0.855 | 0.01114 | - |
| hendrycksTest-college_computer_science | acc | 0.29 | 0.0456048 | 0.27 | 0.0446196 | |
| hendrycksTest-college_computer_science | acc_norm | 0.27 | 0.0446196 | 0.26 | 0.0440844 | |
| hendrycksTest-professional_medicine | acc | 0.283088 | 0.0273659 | 0.272059 | 0.027033 | |
| hendrycksTest-professional_medicine | acc_norm | 0.279412 | 0.0272572 | 0.261029 | 0.0266793 | |
| sciq | acc | 0.895 | 0.00969892 | 0.915 | 0.00882343 | - |
| sciq | acc_norm | 0.869 | 0.0106749 | 0.874 | 0.0104992 | |
| blimp_anaphor_number_agreement | acc | 0.993 | 0.00263779 | 0.995 | 0.00223159 | |
| blimp_wh_questions_subject_gap | acc | 0.925 | 0.00833333 | 0.913 | 0.00891687 | + |
| blimp_wh_vs_that_with_gap | acc | 0.482 | 0.015809 | 0.429 | 0.015659 | + |
| math_num_theory | acc | 0.0351852 | 0.00793611 | 0.0203704 | 0.00608466 | + |
| blimp_complex_NP_island | acc | 0.538 | 0.0157735 | 0.535 | 0.0157805 | |
| blimp_expletive_it_object_raising | acc | 0.777 | 0.0131698 | 0.78 | 0.0131062 | |
| lambada_mt_en | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
| lambada_mt_en | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
| hendrycksTest-formal_logic | acc | 0.309524 | 0.0413491 | 0.34127 | 0.042408 | |
| hendrycksTest-formal_logic | acc_norm | 0.325397 | 0.041906 | 0.325397 | 0.041906 | |
| blimp_matrix_question_npi_licensor_present | acc | 0.663 | 0.0149551 | 0.727 | 0.014095 | - |
| blimp_superlative_quantifiers_1 | acc | 0.791 | 0.0128641 | 0.871 | 0.0106053 | - |
| lambada_mt_de | ppl | 89.7905 | 5.30301 | 82.2416 | 4.88447 | - |
| lambada_mt_de | acc | 0.312245 | 0.0064562 | 0.312827 | 0.00645948 | |
| hendrycksTest-computer_security | acc | 0.37 | 0.0485237 | 0.27 | 0.0446196 | + |
| hendrycksTest-computer_security | acc_norm | 0.37 | 0.0485237 | 0.33 | 0.0472582 | |
| ethics_justice | acc | 0.501479 | 0.00961712 | 0.526627 | 0.00960352 | - |
| ethics_justice | em | 0 | 0 | 0.0251479 | 0 | - |
| blimp_principle_A_reconstruction | acc | 0.296 | 0.0144427 | 0.444 | 0.0157198 | - |
| blimp_existential_there_subject_raising | acc | 0.877 | 0.0103913 | 0.875 | 0.0104635 | |
| math_precalc | acc | 0.014652 | 0.00514689 | 0.0018315 | 0.0018315 | + |
| qasper | f1_yesno | 0.632997 | 0.032868 | 0.666667 | 0.0311266 | - |
| qasper | f1_abstractive | 0.113489 | 0.00729073 | 0.118383 | 0.00692993 | |
| cb | acc | 0.196429 | 0.0535714 | 0.357143 | 0.0646096 | - |
| cb | f1 | 0.149038 | 0 | 0.288109 | 0 | - |
| blimp_animate_subject_trans | acc | 0.858 | 0.0110435 | 0.868 | 0.0107094 | |
| hendrycksTest-high_school_statistics | acc | 0.310185 | 0.031547 | 0.291667 | 0.0309987 | |
| hendrycksTest-high_school_statistics | acc_norm | 0.361111 | 0.0327577 | 0.314815 | 0.0316747 | + |
| blimp_irregular_plural_subject_verb_agreement_2 | acc | 0.881 | 0.0102442 | 0.919 | 0.00863212 | - |
| lambada_mt_es | ppl | 92.1172 | 5.05064 | 83.6696 | 4.57489 | - |
| lambada_mt_es | acc | 0.322337 | 0.00651139 | 0.326994 | 0.00653569 | |
| anli_r2 | acc | 0.327 | 0.0148422 | 0.337 | 0.0149551 | |
| hendrycksTest-nutrition | acc | 0.346405 | 0.0272456 | 0.346405 | 0.0272456 | |
| hendrycksTest-nutrition | acc_norm | 0.385621 | 0.0278707 | 0.401961 | 0.0280742 | |
| anli_r3 | acc | 0.336667 | 0.0136476 | 0.3525 | 0.0137972 | - |
| blimp_regular_plural_subject_verb_agreement_2 | acc | 0.897 | 0.00961683 | 0.916 | 0.00877616 | - |
| blimp_tough_vs_raising_2 | acc | 0.826 | 0.0119945 | 0.857 | 0.0110758 | - |
| mnli | acc | 0.316047 | 0.00469317 | 0.374733 | 0.00488619 | - |
| drop | em | 0.0595638 | 0.00242379 | 0.0228607 | 0.0015306 | + |
| drop | f1 | 0.120355 | 0.00270951 | 0.103871 | 0.00219977 | + |
| blimp_determiner_noun_agreement_with_adj_2 | acc | 0.95 | 0.00689547 | 0.936 | 0.00774364 | + |
| arithmetic_2dm | acc | 0.061 | 0.00535293 | 0.14 | 0.00776081 | - |
| blimp_determiner_noun_agreement_irregular_2 | acc | 0.93 | 0.00807249 | 0.932 | 0.00796489 | |
| lambada | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
| lambada | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
| arithmetic_3da | acc | 0.007 | 0.00186474 | 0.0865 | 0.00628718 | - |
| blimp_irregular_past_participle_adjectives | acc | 0.947 | 0.00708811 | 0.956 | 0.00648892 | - |
| hendrycksTest-college_biology | acc | 0.201389 | 0.0335365 | 0.284722 | 0.0377381 | - |
| hendrycksTest-college_biology | acc_norm | 0.222222 | 0.0347659 | 0.270833 | 0.0371618 | - |
| headqa_en | acc | 0.324945 | 0.00894582 | 0.335522 | 0.00901875 | - |
| headqa_en | acc_norm | 0.375638 | 0.00925014 | 0.383297 | 0.00928648 | |
| blimp_determiner_noun_agreement_irregular_1 | acc | 0.912 | 0.00896305 | 0.944 | 0.0072744 | - |
| blimp_existential_there_quantifiers_1 | acc | 0.985 | 0.00384575 | 0.981 | 0.00431945 | |
| blimp_inchoative | acc | 0.653 | 0.0150605 | 0.683 | 0.0147217 | - |
| mutual_plus | r@1 | 0.395034 | 0.0164328 | 0.409707 | 0.016531 | |
| mutual_plus | r@2 | 0.674944 | 0.015745 | 0.680587 | 0.0156728 | |
| mutual_plus | mrr | 0.632713 | 0.0103391 | 0.640801 | 0.0104141 | |
| blimp_tough_vs_raising_1 | acc | 0.736 | 0.0139463 | 0.734 | 0.01398 | |
| winogrande | acc | 0.636148 | 0.0135215 | 0.640884 | 0.0134831 | |
| race | acc | 0.374163 | 0.0149765 | 0.37512 | 0.0149842 | |
| blimp_irregular_plural_subject_verb_agreement_1 | acc | 0.908 | 0.00914438 | 0.918 | 0.00868052 | - |
| hendrycksTest-high_school_macroeconomics | acc | 0.284615 | 0.0228783 | 0.284615 | 0.0228783 | |
| hendrycksTest-high_school_macroeconomics | acc_norm | 0.284615 | 0.0228783 | 0.276923 | 0.022688 | |
| blimp_adjunct_island | acc | 0.888 | 0.00997775 | 0.902 | 0.00940662 | - |
| hendrycksTest-high_school_chemistry | acc | 0.236453 | 0.0298961 | 0.211823 | 0.028749 | |
| hendrycksTest-high_school_chemistry | acc_norm | 0.300493 | 0.032258 | 0.29064 | 0.0319474 | |
| arithmetic_2ds | acc | 0.051 | 0.00492053 | 0.218 | 0.00923475 | - |
| blimp_principle_A_case_2 | acc | 0.955 | 0.00655881 | 0.953 | 0.00669596 | |
| blimp_only_npi_licensor_present | acc | 0.926 | 0.00828206 | 0.953 | 0.00669596 | - |
| math_counting_and_prob | acc | 0.0274262 | 0.00750954 | 0.0021097 | 0.0021097 | + |
| cola | mcc | -0.0854256 | 0.0304519 | -0.0504508 | 0.0251594 | - |
| webqs | acc | 0.023622 | 0.00336987 | 0.0226378 | 0.00330058 | |
| arithmetic_4ds | acc | 0.0005 | 0.0005 | 0.0055 | 0.00165416 | - |
| blimp_wh_vs_that_no_gap_long_distance | acc | 0.94 | 0.00751375 | 0.939 | 0.00757208 | |
| pile_bookcorpus2 | word_perplexity | 28.7786 | 0 | 27.0559 | 0 | - |
| pile_bookcorpus2 | byte_perplexity | 1.79969 | 0 | 1.78037 | 0 | - |
| pile_bookcorpus2 | bits_per_byte | 0.847751 | 0 | 0.832176 | 0 | - |
| blimp_sentential_negation_npi_licensor_present | acc | 0.994 | 0.00244335 | 0.982 | 0.00420639 | + |
| hendrycksTest-high_school_government_and_politics | acc | 0.274611 | 0.0322102 | 0.227979 | 0.0302769 | + |
| hendrycksTest-high_school_government_and_politics | acc_norm | 0.259067 | 0.0316188 | 0.248705 | 0.0311958 | |
| blimp_ellipsis_n_bar_2 | acc | 0.937 | 0.00768701 | 0.916 | 0.00877616 | + |
| hendrycksTest-clinical_knowledge | acc | 0.283019 | 0.0277242 | 0.267925 | 0.0272573 | |
| hendrycksTest-clinical_knowledge | acc_norm | 0.343396 | 0.0292245 | 0.316981 | 0.0286372 | |
| mc_taco | em | 0.125375 | 0 | 0.132883 | 0 | - |
| mc_taco | f1 | 0.487131 | 0 | 0.499712 | 0 | - |
| wsc | acc | 0.365385 | 0.0474473 | 0.365385 | 0.0474473 | |
| hendrycksTest-college_medicine | acc | 0.231214 | 0.0321474 | 0.190751 | 0.0299579 | + |
| hendrycksTest-college_medicine | acc_norm | 0.289017 | 0.0345643 | 0.265896 | 0.0336876 | |
| hendrycksTest-high_school_world_history | acc | 0.295359 | 0.0296963 | 0.2827 | 0.0293128 | |
| hendrycksTest-high_school_world_history | acc_norm | 0.312236 | 0.0301651 | 0.312236 | 0.0301651 | |
| hendrycksTest-anatomy | acc | 0.296296 | 0.0394462 | 0.281481 | 0.03885 | |
| hendrycksTest-anatomy | acc_norm | 0.288889 | 0.0391545 | 0.266667 | 0.0382017 | |
| hendrycksTest-jurisprudence | acc | 0.25 | 0.0418609 | 0.277778 | 0.0433004 | |
| hendrycksTest-jurisprudence | acc_norm | 0.416667 | 0.0476608 | 0.425926 | 0.0478034 | |
| logiqa | acc | 0.193548 | 0.0154963 | 0.211982 | 0.016031 | - |
| logiqa | acc_norm | 0.281106 | 0.0176324 | 0.291859 | 0.0178316 | |
| ethics_utilitarianism_original | acc | 0.767679 | 0.00609112 | 0.941556 | 0.00338343 | - |
| blimp_principle_A_c_command | acc | 0.827 | 0.0119672 | 0.81 | 0.0124119 | + |
| blimp_coordinate_structure_constraint_complex_left_branch | acc | 0.794 | 0.0127956 | 0.764 | 0.0134345 | + |
| arithmetic_5ds | acc | 0 | 0 | 0 | 0 | |
| lambada_mt_it | ppl | 96.8846 | 5.80902 | 86.66 | 5.1869 | - |
| lambada_mt_it | acc | 0.328158 | 0.00654165 | 0.336891 | 0.0065849 | - |
| wsc273 | acc | 0.827839 | 0.0228905 | 0.827839 | 0.0228905 | |
| blimp_coordinate_structure_constraint_object_extraction | acc | 0.852 | 0.0112349 | 0.876 | 0.0104275 | - |
| blimp_principle_A_domain_3 | acc | 0.79 | 0.0128867 | 0.819 | 0.0121814 | - |
| blimp_left_branch_island_echo_question | acc | 0.638 | 0.0152048 | 0.519 | 0.0158079 | + |
| rte | acc | 0.534296 | 0.0300256 | 0.548736 | 0.0299531 | |
| blimp_passive_2 | acc | 0.892 | 0.00982 | 0.899 | 0.00953362 | |
| hendrycksTest-electrical_engineering | acc | 0.344828 | 0.0396093 | 0.358621 | 0.0399663 | |
| hendrycksTest-electrical_engineering | acc_norm | 0.372414 | 0.0402873 | 0.372414 | 0.0402873 | |
| sst | acc | 0.626147 | 0.0163938 | 0.493119 | 0.0169402 | + |
| blimp_npi_present_1 | acc | 0.565 | 0.0156851 | 0.576 | 0.0156355 | |
| piqa | acc | 0.739391 | 0.0102418 | 0.754081 | 0.0100473 | - |
| piqa | acc_norm | 0.755169 | 0.0100323 | 0.761697 | 0.00994033 | |
| hendrycksTest-professional_accounting | acc | 0.312057 | 0.0276401 | 0.265957 | 0.0263581 | + |
| hendrycksTest-professional_accounting | acc_norm | 0.27305 | 0.0265779 | 0.22695 | 0.0249871 | + |
| arc_challenge | acc | 0.325085 | 0.0136881 | 0.337884 | 0.013822 | |
| arc_challenge | acc_norm | 0.352389 | 0.0139601 | 0.366041 | 0.0140772 | |
| hendrycksTest-econometrics | acc | 0.263158 | 0.0414244 | 0.245614 | 0.0404934 | |
| hendrycksTest-econometrics | acc_norm | 0.254386 | 0.0409699 | 0.27193 | 0.0418577 | |
| headqa | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
| headqa | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
| wic | acc | 0.482759 | 0.0197989 | 0.5 | 0.0198107 | |
| hendrycksTest-high_school_biology | acc | 0.270968 | 0.0252844 | 0.251613 | 0.024686 | |
| hendrycksTest-high_school_biology | acc_norm | 0.274194 | 0.0253781 | 0.283871 | 0.0256494 | |
| hendrycksTest-management | acc | 0.281553 | 0.0445325 | 0.23301 | 0.0418583 | + |
| hendrycksTest-management | acc_norm | 0.291262 | 0.0449868 | 0.320388 | 0.0462028 | |
| blimp_npi_present_2 | acc | 0.645 | 0.0151395 | 0.664 | 0.0149441 | - |
| hendrycksTest-prehistory | acc | 0.265432 | 0.0245692 | 0.243827 | 0.0238919 | |
| hendrycksTest-prehistory | acc_norm | 0.225309 | 0.0232462 | 0.219136 | 0.0230167 | |
| hendrycksTest-world_religions | acc | 0.321637 | 0.0358253 | 0.333333 | 0.0361551 | |
| hendrycksTest-world_religions | acc_norm | 0.397661 | 0.0375364 | 0.380117 | 0.0372297 | |
| math_intermediate_algebra | acc | 0.00996678 | 0.00330749 | 0.00332226 | 0.00191598 | + |
| anagrams2 | acc | 0.0347 | 0.00183028 | 0.0055 | 0.000739615 | + |
| arc_easy | acc | 0.647306 | 0.00980442 | 0.669613 | 0.00965143 | - |
| arc_easy | acc_norm | 0.609848 | 0.0100091 | 0.622896 | 0.00994504 | - |
| blimp_anaphor_gender_agreement | acc | 0.993 | 0.00263779 | 0.994 | 0.00244335 | |
| hendrycksTest-marketing | acc | 0.311966 | 0.0303515 | 0.307692 | 0.0302364 | |
| hendrycksTest-marketing | acc_norm | 0.34188 | 0.031075 | 0.294872 | 0.0298726 | + |
| blimp_principle_A_domain_1 | acc | 0.997 | 0.00173032 | 0.997 | 0.00173032 | |
| blimp_wh_island | acc | 0.856 | 0.011108 | 0.852 | 0.0112349 | |
| hendrycksTest-sociology | acc | 0.303483 | 0.0325101 | 0.278607 | 0.0317006 | |
| hendrycksTest-sociology | acc_norm | 0.298507 | 0.0323574 | 0.318408 | 0.0329412 | |
| blimp_distractor_agreement_relative_clause | acc | 0.774 | 0.0132325 | 0.719 | 0.0142212 | + |
| truthfulqa_gen | bleurt_max | -0.811655 | 0.0180743 | -0.814228 | 0.0172128 | |
| truthfulqa_gen | bleurt_acc | 0.395349 | 0.0171158 | 0.329253 | 0.0164513 | + |
| truthfulqa_gen | bleurt_diff | -0.0488385 | 0.0204525 | -0.185905 | 0.0169617 | + |
| truthfulqa_gen | bleu_max | 20.8747 | 0.717003 | 20.2238 | 0.711772 | |
| truthfulqa_gen | bleu_acc | 0.330477 | 0.0164668 | 0.281518 | 0.015744 | + |
| truthfulqa_gen | bleu_diff | -2.12856 | 0.832693 | -6.66121 | 0.719366 | + |
| truthfulqa_gen | rouge1_max | 47.0293 | 0.962404 | 45.3457 | 0.89238 | + |
| truthfulqa_gen | rouge1_acc | 0.341493 | 0.0166007 | 0.257038 | 0.0152981 | + |
| truthfulqa_gen | rouge1_diff | -2.29454 | 1.2086 | -10.1049 | 0.8922 | + |
| truthfulqa_gen | rouge2_max | 31.0617 | 1.08725 | 28.7438 | 0.981282 | + |
| truthfulqa_gen | rouge2_acc | 0.247246 | 0.0151024 | 0.201958 | 0.014054 | + |
| truthfulqa_gen | rouge2_diff | -2.84021 | 1.28749 | -11.0916 | 1.01664 | + |
| truthfulqa_gen | rougeL_max | 44.6463 | 0.966119 | 42.6116 | 0.893252 | + |
| truthfulqa_gen | rougeL_acc | 0.334149 | 0.0165125 | 0.24235 | 0.0150007 | + |
| truthfulqa_gen | rougeL_diff | -2.50853 | 1.22016 | -10.4299 | 0.904205 | + |
| hendrycksTest-public_relations | acc | 0.3 | 0.0438931 | 0.281818 | 0.0430912 | |
| hendrycksTest-public_relations | acc_norm | 0.190909 | 0.0376443 | 0.163636 | 0.0354343 | |
| blimp_distractor_agreement_relational_noun | acc | 0.859 | 0.0110109 | 0.833 | 0.0118004 | + |
| lambada_mt_fr | ppl | 57.0379 | 3.15719 | 51.7313 | 2.90272 | - |
| lambada_mt_fr | acc | 0.388512 | 0.0067906 | 0.40947 | 0.00685084 | - |
| blimp_principle_A_case_1 | acc | 1 | 0 | 1 | 0 | |
| hendrycksTest-medical_genetics | acc | 0.37 | 0.0485237 | 0.31 | 0.0464823 | + |
| hendrycksTest-medical_genetics | acc_norm | 0.41 | 0.0494311 | 0.39 | 0.0490207 | |
| qqp | acc | 0.364383 | 0.00239348 | 0.383626 | 0.00241841 | - |
| qqp | f1 | 0.516391 | 0.00263674 | 0.451222 | 0.00289696 | + |
| iwslt17-en-ar | bleu | 2.35563 | 0.188638 | 4.98225 | 0.275369 | - |
| iwslt17-en-ar | chrf | 0.140912 | 0.00503101 | 0.277708 | 0.00415432 | - |
| iwslt17-en-ar | ter | 1.0909 | 0.0122111 | 0.954701 | 0.0126737 | - |
| multirc | acc | 0.0409234 | 0.00642087 | 0.0178384 | 0.00428994 | + |
| hendrycksTest-human_aging | acc | 0.264574 | 0.0296051 | 0.264574 | 0.0296051 | |
| hendrycksTest-human_aging | acc_norm | 0.197309 | 0.0267099 | 0.237668 | 0.0285681 | - |
| reversed_words | acc | 0.0003 | 0.000173188 | 0 | 0 | + |

Some results are missing due to errors or computational constraints.
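For reference, the flags in the "Significant" column can be reproduced from the reported values alone. The sketch below assumes that "exceeding the standard errors" means the absolute difference between the two scores is larger than both reported standard errors; this assumption matches the marks in the table above.

```python
def significance_flag(gpt4chan_value, gpt4chan_stderr, gptj_value, gptj_stderr,
                      higher_is_better=True):
    """Return "+", "-", or "" as in the table's "Significant" column.

    Assumption: a difference counts as significant when its absolute value
    exceeds both reported standard errors.
    """
    diff = gpt4chan_value - gptj_value
    if abs(diff) <= max(gpt4chan_stderr, gptj_stderr):
        return ""
    gpt4chan_is_better = diff > 0 if higher_is_better else diff < 0
    return "+" if gpt4chan_is_better else "-"


# copa accuracy: the difference is within the standard errors -> ""
print(significance_flag(0.85, 0.035887, 0.83, 0.0377525))
# blimp_only_npi_scope accuracy: clear advantage for GPT-J 6B -> "-"
print(significance_flag(0.712, 0.0143269, 0.787, 0.0129537))
# lambada perplexity (lower is better): advantage for GPT-J 6B -> "-"
print(significance_flag(4.62504, 0.10549, 4.10224, 0.0884971, higher_is_better=False))
```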