
Google DeepMind has developed Gecko, a benchmark designed to test and compare text-to-image models more accurately and reliably than existing approaches.
A Google DeepMind study has identified hidden limitations in how text-to-image models such as DALL-E, Midjourney, and Stable Diffusion are currently evaluated. According to the study, the datasets and metrics most commonly used today do not provide a complete picture: human ratings collected from a small sample of raters offer only a limited view of model quality, while automatically computed metrics may overlook important nuances and diverge from expert opinion.
To address this issue, the researchers developed Gecko, a benchmark suite that evaluates how well text-to-image models handle prompts of varying complexity. Gecko includes 2,000 text prompts that cover a wide range of model skills and difficulty levels. Each prompt targets specific sub-skills, going beyond broad categories to pinpoint the weaknesses that limit how well the generated images align with the prompts.

The benchmark not only identifies which skills are a model’s weak points but also assesses the model’s proficiency in each skill. The researchers also collected more than 100,000 expert ratings of images generated by several leading models in response to Gecko prompts. Based on this rating data, the benchmark helps determine whether a tested model’s performance gaps stem from the model’s own limitations, ambiguous prompts, or inconsistent evaluation methods.
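
To make the idea of skill-level analysis concrete, here is a minimal Python sketch of how ratings tagged by skill could be aggregated. It is an illustration only, not Gecko's actual code or data format: the record layout, the skill names, the 0-to-1 rating scale, and both function names are assumptions made for this example. A low average for a skill would suggest a model weakness, while a large spread between raters on a single prompt could hint at an ambiguous prompt or inconsistent evaluation.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (prompt_id, skill, model, rating on an assumed 0-1 scale).
# Neither the values nor the layout come from the Gecko dataset.
ratings = [
    ("p1", "counting", "model_a", 0.2),
    ("p1", "counting", "model_a", 0.3),
    ("p2", "colour", "model_a", 0.9),
    ("p2", "colour", "model_a", 0.8),
    ("p3", "spatial", "model_a", 0.1),
    ("p3", "spatial", "model_a", 0.9),
]


def mean_score_by_skill(records):
    """Average all ratings that share a skill tag."""
    by_skill = defaultdict(list)
    for _prompt, skill, _model, score in records:
        by_skill[skill].append(score)
    return {skill: mean(scores) for skill, scores in by_skill.items()}


def rater_spread_by_prompt(records):
    """Population standard deviation of the ratings given to each prompt."""
    by_prompt = defaultdict(list)
    for prompt, _skill, _model, score in records:
        by_prompt[prompt].append(score)
    return {prompt: pstdev(scores) for prompt, scores in by_prompt.items()}


skill_scores = mean_score_by_skill(ratings)
spread = rater_spread_by_prompt(ratings)

# The lowest-scoring skill and the prompt on which raters disagree most.
weakest_skill = min(skill_scores, key=skill_scores.get)
most_contested_prompt = max(spread, key=spread.get)
```

In this toy data the two views separate cleanly: one skill scores low with raters in close agreement, while another prompt receives sharply conflicting ratings, which is the kind of distinction between model limitations and evaluation noise that the article describes.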







