Benchmark
Evaluation of different base models without fine-tuning. Each model was given a prompt containing one example per class and was guided to predict the class label as the next token, in the following format:
Sentiment Classification into categories 'negative' or 'positive'.
'that 's far too tragic to merit such superficial treatment '='negative'
'that loves its characters and communicates something rather beautiful about human nature '='positive'
'<sentence>'='<Vbz: [["negative",...], ["positive",...]]>
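The prompt format above can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's actual code: the helper names `build_prompt` and `predict_label` are hypothetical, `Vbz` is assumed to denote a verbalizer (a list of candidate tokens per class), and the next-token scores are assumed to come from whatever model is being evaluated.

```python
from typing import Dict, List, Sequence, Tuple


def build_prompt(
    task: str,
    labels: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    sentence: str,
) -> str:
    """Assemble a prompt in the style shown above (hypothetical helper)."""
    quoted = " or ".join(f"'{label}'" for label in labels)
    lines = [f"{task} into categories {quoted}."]
    # One demonstration per class, written as 'text'='label'.
    lines += [f"'{text}'='{label}'" for text, label in examples]
    # The query ends with an open quote so the next token is the label.
    lines.append(f"'{sentence}'='")
    return "\n".join(lines)


def predict_label(
    next_token_scores: Dict[str, float],
    verbalizer: Dict[str, List[str]],
) -> str:
    """Pick the class whose verbalizer tokens get the highest total score.

    Assumption: `next_token_scores` maps candidate next tokens to
    probabilities produced by the model under evaluation.
    """
    totals = {
        label: sum(next_token_scores.get(token, 0.0) for token in tokens)
        for label, tokens in verbalizer.items()
    }
    return max(totals, key=totals.get)


prompt = build_prompt(
    "Sentiment Classification",
    ["negative", "positive"],
    [
        ("that 's far too tragic to merit such superficial treatment ", "negative"),
        ("that loves its characters and communicates something rather beautiful about human nature ", "positive"),
    ],
    "a charming and often affecting journey ",
)
print(prompt)
```

Restricting the decision to the verbalizer tokens keeps the comparison between classes even when the model assigns most of its probability mass to unrelated tokens.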
Table
| 🤗-Model-ID | Average Accuracy | AGnews | Amazon Polarity | DBPedia | Emotion | Fnc1 | IMDB | MNLI | QNLI | RTE | SST2 | TREC-6 | Tweet Sentiment | Wikitalk | Yahoo | Yelp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2.5-32B | 0.802580741896532 | 0.868 | 0.963 | 0.956 | 0.487 | 0.414346454762276 | 0.967 | 0.872 | 0.862 | 0.758122743682311 | 0.952981651376147 | 0.812 | 0.683 | 0.838260278627251 | 0.624 | 0.981 |
| google/gemma-2-2b | 0.666815090408935 | 0.823 | 0.944 | 0.935 | 0.446 | 0.304029613613792 | 0.937 | 0.405 | 0.54 | 0.581227436823105 | 0.913990825688073 | 0.542 | 0.563 | 0.609978480009061 | 0.609 | 0.849 |
| meta-llama/Meta-Llama-3.1-8B | 0.727249223386589 | 0.844 | 0.954 | 0.952 | 0.487 | 0.265591913157836 | 0.923 | 0.568 | 0.744 | 0.729241877256318 | 0.924311926605505 | 0.664 | 0.593 | 0.676592633779178 | 0.619 | 0.965 |
| mistralai/Mistral-7B-v0.3 | | 0.859 | 0.956 | 0.895 | 0.423 | | 0.939 | 0.57 | 0.672 | 0.729241877256318 | 0.922018348623853 | 0.698 | 0.624 | 0.668736292589504 | 0.637 | 0.945 |
| tiiuae/Falcon3-1B-Base | 0.670035048240265 | 0.808 | 0.909 | 0.878 | 0.453 | 0.291803564927297 | 0.879 | 0.416 | 0.562 | 0.577617328519856 | 0.892201834862385 | 0.63 | 0.562 | 0.658902995294433 | 0.593 | 0.94 |
| lmsys/vicuna-13b-v1.5 | 0.71634232489797 | 0.747 | 0.945 | 0.926 | 0.488 | 0.263846661564714 | 0.909 | 0.582 | 0.748 | 0.76173285198556 | 0.931192660550459 | 0.58 | 0.612 | 0.725362699368816 | 0.579 | 0.947 |
| tiiuae/Falcon3-7B-Base | 0.722606069305748 | 0.772 | 0.949 | 0.936 | 0.489 | 0.263597446051807 | 0.939 | 0.667 | 0.731 | 0.743682310469314 | 0.935779816513762 | 0.588 | 0.634 | 0.676031466551344 | 0.567 | 0.948 |
| mistralai/Mistral-Nemo-Base-2407 | 0.739890499381605 | 0.884 | 0.954 | 0.939 | 0.426 | 0.258144920213886 | 0.946 | 0.679 | 0.755 | 0.754512635379061 | 0.936926605504587 | 0.646 | 0.623 | 0.705773329626541 | 0.627 | 0.964 |
| tiiuae/Falcon3-3B-Base | 0.69797446797878 | 0.754 | 0.93 | 0.902 | 0.468 | 0.308771085799483 | 0.908 | 0.53 | 0.674 | 0.675090252707581 | 0.896788990825688 | 0.716 | 0.595 | 0.564966690348953 | 0.584 | 0.963 |
| HuggingFaceTB/SmolLM2-1.7B | 0.663083716245386 | 0.791 | 0.942 | 0.856 | 0.384 | 0.256327163933655 | 0.936 | 0.419 | 0.535 | 0.624548736462094 | 0.920871559633028 | 0.572 | 0.56 | 0.644508283652015 | 0.535 | 0.97 |
| answerdotai/ModernBERT-large | 0.530367001294045 | 0.79 | 0.66 | 0.67 | 0.33 | 0.116628614916286 | 0.68 | 0.29 | 0.55 | 0.52 | 0.76 | 0.53 | 0.55 | 0.398876404494382 | 0.43 | 0.68 |
| tiiuae/falcon-mamba-7b | 0.728538294959239 | 0.821 | 0.962 | 0.941 | 0.467 | 0.250898414184418 | 0.921 | 0.56 | 0.639 | 0.797833935018051 | 0.935779816513762 | 0.668 | 0.625 | 0.72656225867235 | 0.645 | 0.968 |
| lmsys/vicuna-7b-v1.5 | 0.717034574534011 | 0.869 | 0.957 | 0.925 | 0.502 | 0.258806811849408 | 0.948 | 0.537 | 0.717 | 0.696750902527076 | 0.920871559633028 | 0.506 | 0.618 | 0.717089344000659 | 0.621 | 0.962 |
| tiiuae/falcon-7b | 0.67027734119717 | 0.876 | 0.951 | 0.776 | 0.483 | 0.259471551662221 | 0.929 | 0.419 | 0.505 | 0.595667870036101 | 0.911697247706422 | 0.534 | 0.575 | 0.673323448552806 | 0.597 | 0.969 |
| Qwen/Qwen2.5-14B | 0.793427125926012 | 0.884 | 0.945 | 0.957 | 0.495 | 0.415383624714254 | 0.941 | 0.832 | 0.856 | 0.779783393501805 | 0.935779816513762 | 0.798 | 0.67 | 0.78046005416036 | 0.645 | 0.967 |
| google/gemma-2-9b | 0.744717802816623 | 0.872 | 0.963 | 0.948 | 0.469 | 0.235040629868216 | 0.929 | 0.675 | 0.753 | 0.714801444043321 | 0.947247706422018 | 0.682 | 0.615 | 0.775677261915794 | 0.638 | 0.954 |
| google/gemma-2-27b | 0.341126766689851 | 0.237 | 0.486 | 0.075 | 0.286 | 0.25 | 0.476 | 0.371 | 0.484 | 0.527075812274368 | 0.490825688073395 | 0.018 | 0.281 | 0.5 | 0.106 | 0.529 |
| lmsys/vicuna-33b-v1.3 | 0.69071294530782 | 0.72 | 0.961 | 0.896 | 0.476 | 0.20791904737138 | 0.944 | 0.591 | 0.767 | 0.768953068592058 | 0.92545871559633 | 0.516 | 0.627 | 0.548363348057538 | 0.584 | 0.828 |
| bigscience/bloom-7b1 | | 0.729 | 0.934 | 0.905 | 0.367 | | 0.889 | 0.339 | 0.531 | 0.555956678700361 | 0.895642201834862 | 0.448 | 0.495 | 0.432129655371246 | 0.507 | 0.911 |
| tiiuae/Falcon3-10B-Base | 0.752038507723927 | 0.838 | 0.963 | 0.954 | 0.504 | 0.203714394688026 | 0.928 | 0.706 | 0.776 | 0.772563176895307 | 0.938073394495413 | 0.726 | 0.646 | 0.752226649780167 | 0.601 | 0.972 |
| tiiuae/falcon-11B | 0.736912972596415 | 0.838 | 0.947 | 0.906 | 0.474 | 0.20857710015925 | 0.92 | 0.629 | 0.773 | 0.779783393501805 | 0.947247706422018 | 0.706 | 0.645 | 0.690086388863147 | 0.626 | 0.964 |
| Qwen/Qwen2.5-7B | 0.769090560222669 | 0.827 | 0.943 | 0.958 | 0.493 | 0.325405060344208 | 0.93 | 0.803 | 0.799 | 0.790613718411552 | 0.954128440366973 | 0.75 | 0.674 | 0.710211184217301 | 0.616 | 0.963 |
| deepseek-ai/DeepSeek-V2-Lite | 0.704019867128113 | 0.876 | 0.958 | 0.877 | 0.468 | 0.27148212954501 | 0.944 | 0.427 | 0.56 | 0.606498194945848 | 0.924311926605505 | 0.682 | 0.617 | 0.739005755825328 | 0.64 | 0.97 |
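The Average Accuracy column appears to be the unweighted mean of the 15 per-dataset accuracies. The short check below, a sketch using only the reported Qwen/Qwen2.5-32B values, reproduces that model's average; the variable names are illustrative, and whether the same rule was used for every row is an assumption.

```python
from statistics import fmean

# Per-dataset accuracies reported for Qwen/Qwen2.5-32B.
qwen_32b = {
    "AGnews": 0.868,
    "Amazon Polarity": 0.963,
    "DBPedia": 0.956,
    "Emotion": 0.487,
    "Fnc1": 0.414346454762276,
    "IMDB": 0.967,
    "MNLI": 0.872,
    "QNLI": 0.862,
    "RTE": 0.758122743682311,
    "SST2": 0.952981651376147,
    "TREC-6": 0.812,
    "Tweet Sentiment": 0.683,
    "Wikitalk": 0.838260278627251,
    "Yahoo": 0.624,
    "Yelp": 0.981,
}

reported_average = 0.802580741896532

# Unweighted mean over all 15 datasets.
computed_average = fmean(qwen_32b.values())

print(f"computed: {computed_average:.6f}, reported: {reported_average:.6f}")
```

Because the mean is unweighted, small datasets such as RTE count as much as large ones such as Yelp, which is worth keeping in mind when comparing models by the average alone.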