Results
To contextualize system performance, participant submissions are compared against four baselines, all computed only for the zero-shot track. The baselines are:
A simple heuristic baseline that assigns a constant Gender Stereotype (GS) score of 0.5 and predicts GS categories randomly.
A large language model baseline based on Qwen3-14B.
A large language model baseline based on GPT-5-nano-2025-08-07.
A GPT-5 baseline evaluated under two prompting configurations: joint prompting, where GS detection and classification are addressed within a single prompt, and separate prompting, where the two tasks are solved independently using two distinct prompts.
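As an illustration, the heuristic baseline described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the organizers' implementation: the category labels, the function name `heuristic_baseline`, the output field names, and the choice of one uniformly sampled category per text are all hypothetical, since the text states only that the score is fixed at 0.5 and that categories are predicted randomly.

```python
import random

# Hypothetical label set: the actual GS categories are not listed in this section.
CATEGORIES = ["category_a", "category_b", "category_c"]


def heuristic_baseline(texts, categories=CATEGORIES, seed=0):
    """Return one prediction per text: a constant GS score of 0.5 and a
    category drawn uniformly at random (assumed single-label)."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [
        {"gs_score": 0.5, "gs_category": rng.choice(categories)}
        for _ in texts
    ]


predictions = heuristic_baseline(["first example text", "second example text"])
```

Because the score is constant and the category ignores the input, any system that uses the text at all should be expected to outperform this baseline, which is what makes it a useful lower reference point.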
SubTask