The short written texts that make up the dataset were collected manually from both social media and informative websites, so as to include both formal and informal language; each text was manually annotated by four expert annotators.
A subset of the texts requires contextual information to be properly understood, which is provided in the distribution; the remaining texts are self-contained and can be understood without any additional information.
GS detection (main task): following the perspectivist approach introduced in the Overview section, instead of providing a single label obtained through majority voting, we merge all annotations into a numerical GS value.
The dataset distributed to participants preserves the non-aggregated judgments of all annotators, which allows systems to learn from disagreement while, at the same time, their predictions are tested against the GS value emerging from all judgments, as exemplified below. This methodology aligns with Madeddu et al. (2023), who indicate that annotations aggregated by majority vote (hard labels) are insufficient and that leveraging disagreement among annotators is more informative than trying to eliminate it.
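For concreteness, the sketch below shows one way a system's continuous predictions could be scored against the gold GS values. The metric (mean absolute error) and all names are illustrative assumptions, not the official evaluation protocol.

# Illustrative only: the official metric is not specified in this section;
# mean absolute error (MAE) is used here purely as an assumption.

def mean_absolute_error(gold: list[float], predicted: list[float]) -> float:
    """Average absolute difference between gold and predicted GS values."""
    assert len(gold) == len(predicted), "lists must be aligned item by item"
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)

# Hypothetical gold GS values vs. hypothetical system outputs.
gold = [0.0, 0.25, 0.75, 1.0]
predicted = [0.10, 0.30, 0.60, 0.95]
print(mean_absolute_error(gold, predicted))  # 0.0875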
The dataset consists of around 1,000 short texts, split as follows: 80% is used as test data, while the remaining 20% is made available to participants as development data.
The dataset is distributed with the following information, for each item:
Annotations: a list containing the judgement provided by each annotator. Each entry indicates whether a stereotype is present (yes) or absent (no) according to that annotator.
GS value: a number in the interval [0, 1] with up to two digits after the decimal point (e.g., 0, 0.30, 0.45, 1), indicating the degree to which the text reflects or refers to a gender stereotype (where 1 is the maximum and 0 the minimum degree);
GS category: the category to which the stereotype (if present) belongs. GS categories are: role, personality, competence, physical, sexual, and relational.
Context: a binary label indicating whether additional contextual information for the text is provided (with_context) or not (no_context).
The dataset is released in .jsonl format.
Each line corresponds to a single data instance with the following structure:
{
"id": "unique_identifier",
"text": "Italian_short_text",
"annotations": ["annotation_A", "annotation_B", "annotation_C", "annotation_D"],
"gs_value": "number_between_0_and_1_with_two_decimals",
"gs_category": "gender_stereotype_category_[role, personality, competence, physical, sexual, relational]",
"context": "with_context" | "no_context"
}
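As a convenience, here is a minimal sketch for reading the distributed file in Python; the filename dev.jsonl is an assumption, since the actual name of the released file may differ.

import json

# Minimal loading sketch; the filename "dev.jsonl" is an assumption.
def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items

data = load_jsonl("dev.jsonl")
print(data[0]["id"], data[0]["gs_value"], data[0]["annotations"])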
⚠️ | Trigger warning: The dataset may contain sensitive or potentially distressing content.
ℹ️ | Disclaimer: The views and opinions expressed in the dataset do not necessarily reflect the views or positions of the task organizers or the informants, as items in the dataset may be short excerpts from a longer text.
Given the availability of four independent annotations — which inevitably involve some degree of disagreement — we opted to compute a GS Value that combines all annotations, rather than deriving a discrete binary label through majority voting. This choice follows recent findings suggesting that leveraging disagreement is more informative than attempting to eliminate it (see Overview Section).
The annotation procedure consists of two steps:
Each annotator assigns a binary label (yes/no) indicating whether the text reflects or refers to a gender stereotype.
The GS Value is computed by combining all four annotations.
The underlying assumption is that, in cases of complete inter-annotator agreement, the GS Value lies at the extremes of the continuum:
If all four annotators agree that no stereotype is present, the GS Value = 0.
If all four annotators agree that a stereotype is present, the GS Value = 1.
Conversely, disagreement among annotators leads to intermediate GS Values, for example:
0.25 ➡️ three no labels and one yes label
0.50 ➡️ two yes labels and two no labels
0.75 ➡️ three yes labels and one no label
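Under the assumption, consistent with the examples above, that the GS Value is the proportion of yes labels among the four annotations, the computation can be sketched as follows:

def gs_value(annotations: list[str]) -> float:
    """Fraction of annotators who labeled the text as stereotyped.

    Assumes the combination rule is a simple proportion of "yes" labels,
    which matches the examples above (0, 0.25, 0.50, 0.75, 1).
    """
    yes = sum(1 for a in annotations if a == "yes")
    return round(yes / len(annotations), 2)

print(gs_value(["no", "no", "no", "yes"]))     # 0.25
print(gs_value(["yes", "yes", "no", "no"]))    # 0.5
print(gs_value(["yes", "yes", "yes", "yes"]))  # 1.0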
Each text is annotated by four annotators. Given the more exploratory nature of the GS classification sub-task with respect to the main task, we adopted a traditional annotation scheme based on discrete labels. In case of disagreement, the category label is assigned through majority voting. In cases where majority voting is not feasible (e.g., a tie), the final label is determined by a super-judge.
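This aggregation can be sketched as follows; the function name and the convention of returning None when the super-judge must intervene are illustrative assumptions.

from collections import Counter

def aggregate_category(labels: list[str]) -> str | None:
    """Majority vote over annotators' category labels.

    Returns the label chosen by a strict majority (more than half of the
    annotators), or None when no such majority exists and the final label
    must be determined by the super-judge.
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count > len(labels) / 2:  # strict majority: 3 or 4 out of 4
        return top_label
    return None  # no majority: deferred to the super-judge

print(aggregate_category(["role", "role", "role", "physical"]))  # role
print(aggregate_category(["role", "role", "sexual", "sexual"]))  # None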
You can find the annotation guidelines here.