Benchmark

Evaluation of different base models without fine-tuning. Each model was given one example of each class and guided to predict the class label as the next token, using a prompt of the following form:

Sentiment Classification into categories 'negative' or 'positive'.

'that 's far too tragic to merit such superficial treatment '='negative'
'that loves its characters and communicates something rather beautiful about human nature '='positive'

'<sentence>'='<Vbz: [["negative",...], ["positive",...]]>
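The prompt layout above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `build_prompt` and `pick_label` are hypothetical, and the per-label scores stand in for whatever next-token scores the guided decoding step actually produces.

```python
def build_prompt(task, examples, sentence):
    """Assemble a few-shot prompt: task line, one example per class, open slot."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"'{text}'='{label}'")
    lines.append("")
    # The final line is left open so the model continues with the class label.
    lines.append(f"'{sentence}'='")
    return "\n".join(lines)


def pick_label(label_scores):
    """Return the candidate label with the highest score (hypothetical scores)."""
    return max(label_scores, key=label_scores.get)


task = "Sentiment Classification into categories 'negative' or 'positive'."
examples = [
    ("that 's far too tragic to merit such superficial treatment ", "negative"),
    ("that loves its characters and communicates something rather beautiful "
     "about human nature ", "positive"),
]
prompt = build_prompt(task, examples, "a charming and often affecting journey ")

# Assumed scores for illustration only; a real run would read them from the model.
predicted = pick_label({"negative": -2.3, "positive": -0.4})
```

Restricting the choice to the listed labels means the model cannot answer with free text, so accuracy can be computed by exact comparison with the gold label.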

Table

| 🤗-Model-ID | Average Accuracy | AGnews | Amazon Polarity | DBPedia | Emotion | Fnc1 | IMDB | MNLI | QNLI | RTE | SST2 | TREC-6 | Tweet Sentiment | Wikitalk | Yahoo | Yelp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen2.5-32B | 0.802580741896532 | 0.868 | 0.963 | 0.956 | 0.487 | 0.414346454762276 | 0.967 | 0.872 | 0.862 | 0.758122743682311 | 0.952981651376147 | 0.812 | 0.683 | 0.838260278627251 | 0.624 | 0.981 |
| google/gemma-2-2b | 0.666815090408935 | 0.823 | 0.944 | 0.935 | 0.446 | 0.304029613613792 | 0.937 | 0.405 | 0.54 | 0.581227436823105 | 0.913990825688073 | 0.542 | 0.563 | 0.609978480009061 | 0.609 | 0.849 |
| meta-llama/Meta-Llama-3.1-8B | 0.727249223386589 | 0.844 | 0.954 | 0.952 | 0.487 | 0.265591913157836 | 0.923 | 0.568 | 0.744 | 0.729241877256318 | 0.924311926605505 | 0.664 | 0.593 | 0.676592633779178 | 0.619 | 0.965 |
| mistralai/Mistral-7B-v0.3 |  | 0.859 | 0.956 | 0.895 | 0.423 |  | 0.939 | 0.57 | 0.672 | 0.729241877256318 | 0.922018348623853 | 0.698 | 0.624 | 0.668736292589504 | 0.637 | 0.945 |
| tiiuae/Falcon3-1B-Base | 0.670035048240265 | 0.808 | 0.909 | 0.878 | 0.453 | 0.291803564927297 | 0.879 | 0.416 | 0.562 | 0.577617328519856 | 0.892201834862385 | 0.63 | 0.562 | 0.658902995294433 | 0.593 | 0.94 |
| lmsys/vicuna-13b-v1.5 | 0.71634232489797 | 0.747 | 0.945 | 0.926 | 0.488 | 0.263846661564714 | 0.909 | 0.582 | 0.748 | 0.76173285198556 | 0.931192660550459 | 0.58 | 0.612 | 0.725362699368816 | 0.579 | 0.947 |
| tiiuae/Falcon3-7B-Base | 0.722606069305748 | 0.772 | 0.949 | 0.936 | 0.489 | 0.263597446051807 | 0.939 | 0.667 | 0.731 | 0.743682310469314 | 0.935779816513762 | 0.588 | 0.634 | 0.676031466551344 | 0.567 | 0.948 |
| mistralai/Mistral-Nemo-Base-2407 | 0.739890499381605 | 0.884 | 0.954 | 0.939 | 0.426 | 0.258144920213886 | 0.946 | 0.679 | 0.755 | 0.754512635379061 | 0.936926605504587 | 0.646 | 0.623 | 0.705773329626541 | 0.627 | 0.964 |
| tiiuae/Falcon3-3B-Base | 0.69797446797878 | 0.754 | 0.93 | 0.902 | 0.468 | 0.308771085799483 | 0.908 | 0.53 | 0.674 | 0.675090252707581 | 0.896788990825688 | 0.716 | 0.595 | 0.564966690348953 | 0.584 | 0.963 |
| HuggingFaceTB/SmolLM2-1.7B | 0.663083716245386 | 0.791 | 0.942 | 0.856 | 0.384 | 0.256327163933655 | 0.936 | 0.419 | 0.535 | 0.624548736462094 | 0.920871559633028 | 0.572 | 0.56 | 0.644508283652015 | 0.535 | 0.97 |
| answerdotai/ModernBERT-large | 0.530367001294045 | 0.79 | 0.66 | 0.67 | 0.33 | 0.116628614916286 | 0.68 | 0.29 | 0.55 | 0.52 | 0.76 | 0.53 | 0.55 | 0.398876404494382 | 0.43 | 0.68 |
| tiiuae/falcon-mamba-7b | 0.728538294959239 | 0.821 | 0.962 | 0.941 | 0.467 | 0.250898414184418 | 0.921 | 0.56 | 0.639 | 0.797833935018051 | 0.935779816513762 | 0.668 | 0.625 | 0.72656225867235 | 0.645 | 0.968 |
| lmsys/vicuna-7b-v1.5 | 0.717034574534011 | 0.869 | 0.957 | 0.925 | 0.502 | 0.258806811849408 | 0.948 | 0.537 | 0.717 | 0.696750902527076 | 0.920871559633028 | 0.506 | 0.618 | 0.717089344000659 | 0.621 | 0.962 |
| tiiuae/falcon-7b | 0.67027734119717 | 0.876 | 0.951 | 0.776 | 0.483 | 0.259471551662221 | 0.929 | 0.419 | 0.505 | 0.595667870036101 | 0.911697247706422 | 0.534 | 0.575 | 0.673323448552806 | 0.597 | 0.969 |
| Qwen/Qwen2.5-14B | 0.793427125926012 | 0.884 | 0.945 | 0.957 | 0.495 | 0.415383624714254 | 0.941 | 0.832 | 0.856 | 0.779783393501805 | 0.935779816513762 | 0.798 | 0.67 | 0.78046005416036 | 0.645 | 0.967 |
| google/gemma-2-9b | 0.744717802816623 | 0.872 | 0.963 | 0.948 | 0.469 | 0.235040629868216 | 0.929 | 0.675 | 0.753 | 0.714801444043321 | 0.947247706422018 | 0.682 | 0.615 | 0.775677261915794 | 0.638 | 0.954 |
| google/gemma-2-27b | 0.341126766689851 | 0.237 | 0.486 | 0.075 | 0.286 | 0.25 | 0.476 | 0.371 | 0.484 | 0.527075812274368 | 0.490825688073395 | 0.018 | 0.281 | 0.5 | 0.106 | 0.529 |
| lmsys/vicuna-33b-v1.3 | 0.69071294530782 | 0.72 | 0.961 | 0.896 | 0.476 | 0.20791904737138 | 0.944 | 0.591 | 0.767 | 0.768953068592058 | 0.92545871559633 | 0.516 | 0.627 | 0.548363348057538 | 0.584 | 0.828 |
| bigscience/bloom-7b1 |  | 0.729 | 0.934 | 0.905 | 0.367 |  | 0.889 | 0.339 | 0.531 | 0.555956678700361 | 0.895642201834862 | 0.448 | 0.495 | 0.432129655371246 | 0.507 | 0.911 |
| tiiuae/Falcon3-10B-Base | 0.752038507723927 | 0.838 | 0.963 | 0.954 | 0.504 | 0.203714394688026 | 0.928 | 0.706 | 0.776 | 0.772563176895307 | 0.938073394495413 | 0.726 | 0.646 | 0.752226649780167 | 0.601 | 0.972 |
| tiiuae/falcon-11B | 0.736912972596415 | 0.838 | 0.947 | 0.906 | 0.474 | 0.20857710015925 | 0.92 | 0.629 | 0.773 | 0.779783393501805 | 0.947247706422018 | 0.706 | 0.645 | 0.690086388863147 | 0.626 | 0.964 |
| Qwen/Qwen2.5-7B | 0.769090560222669 | 0.827 | 0.943 | 0.958 | 0.493 | 0.325405060344208 | 0.93 | 0.803 | 0.799 | 0.790613718411552 | 0.954128440366973 | 0.75 | 0.674 | 0.710211184217301 | 0.616 | 0.963 |
| deepseek-ai/DeepSeek-V2-Lite | 0.704019867128113 | 0.876 | 0.958 | 0.877 | 0.468 | 0.27148212954501 | 0.944 | 0.427 | 0.56 | 0.606498194945848 | 0.924311926605505 | 0.682 | 0.617 | 0.739005755825328 | 0.64 | 0.97 |
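The Average Accuracy column appears to be the unweighted mean of the 15 per-dataset accuracies. The short check below reproduces the reported value for the Qwen/Qwen2.5-32B row; the variable names are illustrative, and the unweighted-mean interpretation is an assumption inferred from the numbers rather than something the page states.

```python
from statistics import fmean

# Per-dataset accuracies for Qwen/Qwen2.5-32B, in column order
# (AGnews, Amazon Polarity, DBPedia, Emotion, Fnc1, IMDB, MNLI, QNLI,
#  RTE, SST2, TREC-6, Tweet Sentiment, Wikitalk, Yahoo, Yelp).
qwen_32b_scores = [
    0.868, 0.963, 0.956, 0.487, 0.414346454762276,
    0.967, 0.872, 0.862, 0.758122743682311, 0.952981651376147,
    0.812, 0.683, 0.838260278627251, 0.624, 0.981,
]
reported_average = 0.802580741896532

# Assumption: every dataset is weighted equally, regardless of its size.
computed_average = fmean(qwen_32b_scores)
matches = abs(computed_average - reported_average) < 1e-9
```

Because each dataset counts equally, a low score on a hard task such as Fnc1 pulls the average down as much as a low score on any larger dataset would.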