.. _tutorial_masked_lms: Tutorial - Masked-Language-Model-Based Classifiers ================================================== While causal language models predict the next token in an autoregressive manner from left to right, masked language models (MLMs) can predict a token based on the surrounding context. A word in the sequence is masked, and the model is trained to predict this masked token based on the context to the left and right. This technique can be leveraged to turn pre-trained MLMs into zero-shot classifiers by turning the data into cloze-question tasks. Example Dataset --------------- As promtzl is intended to be used along the 🤗-transformers and -dataset libraries, we first have to initialize an example dataset. For this tutorial, we will use the AGNews dataset. In this task, the model must label news into broader categories: .. code-block:: python from datasets import load_dataset dataset = load_dataset("SetFit/ag_news")['test'].select(range(1000)) We will only use the first 1000 examples of the dataset for brevity. Defining a Prompt and a Verbalizer ---------------------------------- During pre-training, the model is only fed raw text, from which it is trained to predict the missing token. **No further fine-tuning is applied, so the model cannot be instructed, in contrast to fine-tuned CLMs like ChatGPT**. Thus, the task must be constructed so that just one word in the sequence condenses the information about the classification task while maintaining its resemblance to the training data. In `AG's News `_, we have the categories 'World', 'Sports', 'Business', and 'Tech', so we can initialize the verbalizer: .. code:: python from promptzl import Vbz verbalizer = Vbz({0: ["World"], 1: ["Sports"], 2: ["Business"], 3: ["Tech"]}) For the prompt we can use the following pattern: .. code:: python from promptzl import Txt, Key prompt = Txt("[Category:") + verbalizer + Txt("] ") + Key() Here, we use the :ref:`prompt-element-objects` as MLMs usually have a significantly shorter context length, and we further need to add the respective mask token from the tokenizer where the *prompt-element-objects* automatically take care of. However, this is just one selected prompt; it is also possible to define further prompts: .. code:: python prompt = Txt('[ Topic: ') + Vbz(vbz_agnews) + Txt('] ') + Key('text') prompt1 = Txt("A ") + Vbz(vbz_agnews) + Txt(' news: ') + Key('text') prompt2 = Key('text') + Txt(' This topic is about ') + Vbz(vbz_agnews) + Txt('.') prompt3 = Txt("[Category:") + Vbz(vbz_agnews) + Txt('] ') + Key('text') All these prompts stem from `Schick and Schütze, 2020 `_ where they found that the :code:`prompt` works the best for this task, so we will further only use this prompt. .. note:: As only one mask token is to be predicted, it is crucial to define a verbalizer where the label words translate to only single tokens, as it can sometimes happen that more complex words are tokenized into different subtokens. An example of this problem is outlined in :ref:`point_for_clms`. Loading the Model ----------------- After loading the dataset and defining the verbalizer and prompt, we can finally load the model: .. code:: python from promptzl import MaskedLM4Classification model = MaskedLM4Classification( 'roberta-large', prompt=prompt ) Classifying the Dataset ----------------------- ...and start to classify the dataset: .. code-block:: python output = model.classify(dataset) .. note:: It is also possible to show a progress bar by setting the :code:`show_progress_bar` parameter to :code:`True` and set the :code:`batch_size` to a desired value if the model does not fit on the GPU. Evaluation of the Predictions ----------------------------- After we have classified the dataset, we can evaluate the predictions. The predictions are stored in the :code:`output` object and can be accessed as follows: .. code-block:: python from sklearn.metrics import accuracy_score accuracy_score(dataset['label'], output.predictions) 0.779 .. note:: When using List[List[str]] instead of Dict[str, List[str]] in the verbalizer, it might be necessary first to adjust the predictions to the values used in the dataset. In this case, the predictions refer to the indices of the lists in the verbalizer. E.g.: :code:`[['negative'], ['positive']]` will produce predictions in the form of zeros and ones. Calibration ----------- It has been found that some tokens are generally less likely to be predicted, causing the model to be biased towards more often recurring tokens in the label word set (more details in :ref:`calibration`). To counteract this, it is possible to calibrate the output. Here, the probabilities are averaged and used to assess the prediction probability in the context of the predicted word's overall average probability. As we can see in the following example, this can lead to a stronger overall performance: .. code-block:: python from sklearn.metrics import accuracy_score output = model.classify(dataset) pred_cali = model.calibrate_output(output) accuracy_score(dataset['label'], pred_cali.predictions) 0.818 Furthermore, it is also possible to use the :meth:`~promptzl.utils.calibrate` method that can be used with a tensor of probabilities.