Results

About our Baselines 🚩

To contextualize system performance, participant submissions are compared against four baseline approaches, computed only for the zero-shot track. The baselines include:

- A simple heuristic baseline that assigns a constant Gender Stereotype (GS) score of 0.5 and predicts GS categories randomly.
- A large language model baseline based on Qwen3-14B.
- A large language model baseline based on GPT-5-nano-2025-08-07.
- A GPT-5 baseline evaluated under two prompting configurations:
  - Joint prompting, where GS detection and classification are addressed within a single prompt.
  - Separate prompting, where the two tasks are solved independently using two distinct prompts.
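
As an illustration of the simple heuristic baseline listed above, here is a minimal Python sketch. It is not the organizers' implementation: the function name `heuristic_baseline`, the output field names, the fixed seed, and the placeholder category labels are all assumptions made for this example, since the actual GS category inventory is not listed on this page.

```python
import random

# Placeholder labels (assumption): the real GS category set is not given here.
CATEGORIES = ["category_a", "category_b", "category_c"]


def heuristic_baseline(texts, categories=CATEGORIES, seed=0):
    """Return one prediction per input text.

    Every text gets the constant GS score 0.5 and a category drawn
    uniformly at random from `categories`. The seed only makes the
    random draw reproducible; it is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [
        {"gs_score": 0.5, "gs_category": rng.choice(categories)}
        for _ in texts
    ]


predictions = heuristic_baseline(["example sentence 1", "example sentence 2"])
for prediction in predictions:
    print(prediction)
```

Because this baseline ignores the input text entirely, it serves as a floor: any system that actually uses the input would be expected to score above it.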

Main Task

GSI:detect - FINAL RESULTS

SubTask

GSI:detect - FINAL RESULTS
