Benchmark

Evaluation of different base models without fine-tuning. Each model was given one example per class and prompted to predict the class label as the next token, using the following format:

Sentiment Classification into categories 'negative' or 'positive'.

'that 's far too tragic to merit such superficial treatment '='negative'
'that loves its characters and communicates something rather beautiful about human nature '='positive'

'<sentence>'='<Vbz: [["negative",...], ["positive",...]]>'
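
The prompt layout above can be assembled with plain string formatting. The following is a minimal sketch, not the code used for the benchmark: `build_prompt` and its parameters `task`, `examples`, and `sentence` are hypothetical names, and the `Vbz` constraint from the template is not reproduced. Instead, the prompt is simply left open after `='` so that the model's next token would be the class label.

```python
def build_prompt(task, examples, sentence):
    """Assemble a few-shot classification prompt that ends right before the label.

    task:     one-line task description (hypothetical parameter name)
    examples: list of (text, label) pairs, one per class
    sentence: the input to classify
    """
    lines = [task, ""]
    for text, label in examples:
        # Each demonstration follows the pattern '<text>'='<label>'
        lines.append(f"'{text}'='{label}'")
    lines.append("")
    # Leave the label open; the model is expected to continue with a class name
    lines.append(f"'{sentence}'='")
    return "\n".join(lines)


task = "Sentiment Classification into categories 'negative' or 'positive'."
examples = [
    ("that 's far too tragic to merit such superficial treatment ", "negative"),
    ("that loves its characters and communicates something rather beautiful about human nature ", "positive"),
]
prompt = build_prompt(task, examples, "a fine film")
print(prompt)
```

Restricting the continuation to the allowed class names (what the `Vbz` placeholder stands for in the template) is a separate step that this sketch leaves out.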

Table

🤗-Model-ID	Average Accuracy	AGnews	Amazon Polarity	DBPedia	Emotion	Fnc1	IMDB	MNLI	QNLI	RTE	SST2	TREC-6	Tweet Sentiment	Wikitalk	Yahoo	Yelp
Qwen/Qwen2.5-32B	0.802580741896532	0.868	0.963	0.956	0.487	0.414346454762276	0.967	0.872	0.862	0.758122743682311	0.952981651376147	0.812	0.683	0.838260278627251	0.624	0.981
google/gemma-2-2b	0.666815090408935	0.823	0.944	0.935	0.446	0.304029613613792	0.937	0.405	0.54	0.581227436823105	0.913990825688073	0.542	0.563	0.609978480009061	0.609	0.849
meta-llama/Meta-Llama-3.1-8B	0.727249223386589	0.844	0.954	0.952	0.487	0.265591913157836	0.923	0.568	0.744	0.729241877256318	0.924311926605505	0.664	0.593	0.676592633779178	0.619	0.965
mistralai/Mistral-7B-v0.3		0.859	0.956	0.895	0.423		0.939	0.57	0.672	0.729241877256318	0.922018348623853	0.698	0.624	0.668736292589504	0.637	0.945
tiiuae/Falcon3-1B-Base	0.670035048240265	0.808	0.909	0.878	0.453	0.291803564927297	0.879	0.416	0.562	0.577617328519856	0.892201834862385	0.63	0.562	0.658902995294433	0.593	0.94
lmsys/vicuna-13b-v1.5	0.71634232489797	0.747	0.945	0.926	0.488	0.263846661564714	0.909	0.582	0.748	0.76173285198556	0.931192660550459	0.58	0.612	0.725362699368816	0.579	0.947
tiiuae/Falcon3-7B-Base	0.722606069305748	0.772	0.949	0.936	0.489	0.263597446051807	0.939	0.667	0.731	0.743682310469314	0.935779816513762	0.588	0.634	0.676031466551344	0.567	0.948
mistralai/Mistral-Nemo-Base-2407	0.739890499381605	0.884	0.954	0.939	0.426	0.258144920213886	0.946	0.679	0.755	0.754512635379061	0.936926605504587	0.646	0.623	0.705773329626541	0.627	0.964
tiiuae/Falcon3-3B-Base	0.69797446797878	0.754	0.93	0.902	0.468	0.308771085799483	0.908	0.53	0.674	0.675090252707581	0.896788990825688	0.716	0.595	0.564966690348953	0.584	0.963
HuggingFaceTB/SmolLM2-1.7B	0.663083716245386	0.791	0.942	0.856	0.384	0.256327163933655	0.936	0.419	0.535	0.624548736462094	0.920871559633028	0.572	0.56	0.644508283652015	0.535	0.97
answerdotai/ModernBERT-large	0.530367001294045	0.79	0.66	0.67	0.33	0.116628614916286	0.68	0.29	0.55	0.52	0.76	0.53	0.55	0.398876404494382	0.43	0.68
tiiuae/falcon-mamba-7b	0.728538294959239	0.821	0.962	0.941	0.467	0.250898414184418	0.921	0.56	0.639	0.797833935018051	0.935779816513762	0.668	0.625	0.72656225867235	0.645	0.968
lmsys/vicuna-7b-v1.5	0.717034574534011	0.869	0.957	0.925	0.502	0.258806811849408	0.948	0.537	0.717	0.696750902527076	0.920871559633028	0.506	0.618	0.717089344000659	0.621	0.962
tiiuae/falcon-7b	0.67027734119717	0.876	0.951	0.776	0.483	0.259471551662221	0.929	0.419	0.505	0.595667870036101	0.911697247706422	0.534	0.575	0.673323448552806	0.597	0.969
Qwen/Qwen2.5-14B	0.793427125926012	0.884	0.945	0.957	0.495	0.415383624714254	0.941	0.832	0.856	0.779783393501805	0.935779816513762	0.798	0.67	0.78046005416036	0.645	0.967
google/gemma-2-9b	0.744717802816623	0.872	0.963	0.948	0.469	0.235040629868216	0.929	0.675	0.753	0.714801444043321	0.947247706422018	0.682	0.615	0.775677261915794	0.638	0.954
google/gemma-2-27b	0.341126766689851	0.237	0.486	0.075	0.286	0.25	0.476	0.371	0.484	0.527075812274368	0.490825688073395	0.018	0.281	0.5	0.106	0.529
lmsys/vicuna-33b-v1.3	0.69071294530782	0.72	0.961	0.896	0.476	0.20791904737138	0.944	0.591	0.767	0.768953068592058	0.92545871559633	0.516	0.627	0.548363348057538	0.584	0.828
bigscience/bloom-7b1		0.729	0.934	0.905	0.367		0.889	0.339	0.531	0.555956678700361	0.895642201834862	0.448	0.495	0.432129655371246	0.507	0.911
tiiuae/Falcon3-10B-Base	0.752038507723927	0.838	0.963	0.954	0.504	0.203714394688026	0.928	0.706	0.776	0.772563176895307	0.938073394495413	0.726	0.646	0.752226649780167	0.601	0.972
tiiuae/falcon-11B	0.736912972596415	0.838	0.947	0.906	0.474	0.20857710015925	0.92	0.629	0.773	0.779783393501805	0.947247706422018	0.706	0.645	0.690086388863147	0.626	0.964
Qwen/Qwen2.5-7B	0.769090560222669	0.827	0.943	0.958	0.493	0.325405060344208	0.93	0.803	0.799	0.790613718411552	0.954128440366973	0.75	0.674	0.710211184217301	0.616	0.963
deepseek-ai/DeepSeek-V2-Lite	0.704019867128113	0.876	0.958	0.877	0.468	0.27148212954501	0.944	0.427	0.56	0.606498194945848	0.924311926605505	0.682	0.617	0.739005755825328	0.64	0.97
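
The Average Accuracy column appears to be the unweighted mean of the 15 per-dataset scores. The sketch below checks this for one row; `average_accuracy` is a hypothetical helper, and the assumed row layout (model ID, reported average, then one score per dataset, separated by tabs) is taken from the table above. Returning no average when a score is missing is an assumption that mirrors the blank cells in the table.

```python
from statistics import mean


def average_accuracy(row):
    """Return (model_id, mean score) for one tab-separated table row."""
    cells = row.split("\t")
    model_id, scores = cells[0], cells[2:]  # cells[1] is the reported average
    # Assumption: like the table, report no average if any dataset score is missing
    if any(not s.strip() for s in scores):
        return model_id, None
    return model_id, mean(float(s) for s in scores)


# Row for Qwen/Qwen2.5-32B, copied from the table
row = "\t".join([
    "Qwen/Qwen2.5-32B", "0.802580741896532",
    "0.868", "0.963", "0.956", "0.487", "0.414346454762276",
    "0.967", "0.872", "0.862", "0.758122743682311", "0.952981651376147",
    "0.812", "0.683", "0.838260278627251", "0.624", "0.981",
])
model_id, avg = average_accuracy(row)
reported = float(row.split("\t")[1])
print(model_id, avg, abs(avg - reported) < 1e-9)
```

For this row the recomputed mean agrees with the reported value up to floating-point rounding.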