The need for more interpretable models in NLP has become increasingly apparent in recent years. The Evaluating Rationales And Simple English Reasoning (ERASER) benchmark is intended to advance research in this area by providing a diverse set of NLP datasets that contain both document labels and snippets of text marked by annotators as supporting those labels.

Models that provide rationales supporting their predictions can be evaluated on this benchmark using several metrics (see below), each of which aims to quantify a different attribute of “interpretability”. We do not privilege any one of these, or provide a single number to quantify performance, because we argue that the appropriate metric for gauging the quality of rationales depends on the task and use case.





How do I use ERASER?

Step 1 Download the data from our GitHub page.

Step 2 The evaluation script uses files in the following JSONL format:

    {
        "annotation_id": str, required
        # these classifications *must not* overlap
        "rationales": List[
            {
                "docid": str, required
                "hard_rationale_predictions": List[{
                    "start_token": int, inclusive, required
                    "end_token": int, exclusive, required
                }], optional,
                # token level classifications, a value must be provided per-token
                # in an ideal world, these correspond to the hard-decoding above.
                "soft_rationale_predictions": List[float], optional.
                # sentence level classifications, a value must be provided for every
                # sentence in each document, or not at all
                "soft_sentence_predictions": List[float], optional.
            }
        ], optional
        # the classification the model made for the overall classification task
        "classification": str, optional
        # A probability distribution output by the model. We require this to be normalized.
        "classification_scores": Dict[str, float], optional
        # The next two fields are measures for how faithful your model is (the
        # rationales it predicts are in some sense causal of the prediction), and
        # how sufficient they are. We approximate a measure for comprehensiveness by
        # asking that you remove the top k% of tokens from your documents,
        # running your models again, and reporting the score distribution in the
        # "comprehensiveness_classification_scores" field.
        # We approximate a measure of sufficiency by asking exactly the converse
        # - that you provide model distributions on the removed k% of tokens.
        # 'k' is determined by human rationales, and is documented in our paper.
        # You should determine which of these tokens to remove based on some kind
        # of information about your model: gradient based, attention based, other
        # interpretability measures, etc.
        # scores per class having removed k% of the data, where k is determined
        # by human comprehensive rationales
        "comprehensiveness_classification_scores": Dict[str, float], optional
        # scores per class having access to only k% of the data, where k is
        # determined by human comprehensive rationales
        "sufficiency_classification_scores": Dict[str, float], optional
        # the number of tokens required to flip the prediction -
        # see "Is Attention Interpretable" by Serrano and Smith.
        "tokens_to_flip": int, optional
    }
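As a concrete sketch, the snippet below builds one record in this format and appends it to a prediction JSONL using only the standard library. The annotation id, spans, and scores are hypothetical, not real model output:

```python
import json

# A hypothetical prediction for one sentiment example; all field values
# here are illustrative only.
record = {
    "annotation_id": "negR_000.txt",
    "rationales": [
        {
            "docid": "negR_000.txt",
            # non-overlapping [start_token, end_token) spans
            "hard_rationale_predictions": [
                {"start_token": 0, "end_token": 4},
                {"start_token": 10, "end_token": 13},
            ],
        }
    ],
    "classification": "NEG",
    # must be a normalized probability distribution over classes
    "classification_scores": {"NEG": 0.9, "POS": 0.1},
    # scores after removing / keeping only the top k% of tokens, respectively
    "comprehensiveness_classification_scores": {"NEG": 0.55, "POS": 0.45},
    "sufficiency_classification_scores": {"NEG": 0.85, "POS": 0.15},
}

# JSONL means one JSON object per line
with open("Movies.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```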

Step 3 In total, there should be 7 files, one per dataset. Make sure that each prediction JSONL is named according to the following:

  • BoolQ: BoolQ.jsonl
  • MultiRC: MultiRC.jsonl
  • E-SNLI: E-SNLI.jsonl
  • CoS-E: CoS-E.jsonl
  • FEVER: FEVER.jsonl
  • Evidence Inference: E-Inference.jsonl
  • Movies: Movies.jsonl

Step 4 Create a zip of the prediction JSONLs, e.g. zip submission.zip *.jsonl. The zip can contain subfolders but should not contain nested zips.

Step 5 Upload this zip using the 'Submit' section, filling in details of the method used to generate the predictions.

Step 6 You may upload at most two submissions a day, and at most six submissions per month. A sample submission with the necessary formatting is available here.
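Before uploading, it can be worth sanity-checking the files locally. The helper below is an assumption of ours, not part of the official tooling: it checks that every expected file exists, that each line parses as JSON, and that any `classification_scores` distribution is normalized, as required above:

```python
import json
import os

# file names required by Step 3
EXPECTED = ["BoolQ.jsonl", "MultiRC.jsonl", "E-SNLI.jsonl", "CoS-E.jsonl",
            "FEVER.jsonl", "E-Inference.jsonl", "Movies.jsonl"]

def check_submission(directory):
    """Return a list of problems found in a submission directory."""
    problems = []
    for name in EXPECTED:
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            problems.append("missing file: " + name)
            continue
        with open(path) as f:
            for i, line in enumerate(f):
                rec = json.loads(line)  # raises on malformed JSON
                scores = rec.get("classification_scores")
                if scores is not None and abs(sum(scores.values()) - 1.0) > 1e-6:
                    problems.append("%s:%d: scores not normalized" % (name, i))
    return problems
```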

Are there any rules or restrictions on submitted systems?

Submitted systems may use any public or private data when developing their systems, with a few exceptions:

Exception 1 Systems may only use the ERASER-distributed versions of the task datasets, as these use different train/validation/test splits from other public versions in some cases.

Exception 2 Systems may not use the unlabeled test data for the ERASER tasks in system development in any way, and may not build systems that share information across separate test examples in any way.

Exception 3 Beyond this, you may submit results from any kind of system that is capable of producing labels and rationales for the seven target tasks. This includes systems that do not share any components across tasks, as well as systems not based on machine learning.

How do I add my result to the leaderboard?
An email containing a link to the submission document should be sent to the following Google group:
What license is the ERASER data distributed under?
We defer to the licenses in the original datasets concerning the data; please see the respective links to these sources. For code that we distribute, consult our GitHub readme.
How should I cite ERASER?
Download the citation text file here, which contains citations for ERASER, as well as the other datasets.
Are there any private test sets?
This is a work in progress.


Blog Link

Special thanks to Melvin Gruesbeck for designing this website