The Alibaba AliceMind model took first place in the VQA Challenge 2021, which required answering 1.1 million questions about 250,000 images. Alibaba’s algorithm achieved a recognition accuracy of 81.26%, edging out the human accuracy of 80.83%.
In the Visual Question Answering (VQA) Challenge 2021, computer vision models examine images and answer questions about them. AliceMind outperformed, among others, Microsoft’s model and the human annotators who answered the questions in parallel with the models.
The VQA dataset consists of 250,000 COCO images and abstract scenes, each with at least three questions. The answers to the questions belong to one of three types:
- Yes/no. For example: “Is it rainy in the photo?”, “Is the person in the photo expecting friends?”, “Is the person in the photo upset?”
- Number. For example: “How many parts is the pizza cut into?”, “How many people are in the photo?”, “How many programs are open on the laptop screen?”
- Other. For example: “Who is wearing glasses in the photo?”, “What is the child sitting on?”, “What is the person in the photo doing?”
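The three answer types above can be sketched as a minimal data layout. This is an illustrative Python sketch, not the dataset’s actual schema: the record fields and the `answer_type` heuristic are assumptions introduced here for clarity.

```python
from dataclasses import dataclass

# Hypothetical record layout for one VQA example; field names are
# illustrative, not the dataset's actual on-disk schema.
@dataclass
class VQAExample:
    image_id: int
    question: str
    answer: str

def answer_type(answer: str) -> str:
    """Bucket an answer into the three VQA answer types described above."""
    a = answer.strip().lower()
    if a in ("yes", "no"):
        return "yes/no"   # e.g. "Is it rainy in the photo?"
    if a.isdigit():
        return "number"   # e.g. "How many people are in the photo?"
    return "other"        # e.g. "What is the child sitting on?"

examples = [
    VQAExample(1, "Is it rainy in the photo?", "no"),
    VQAExample(1, "How many parts is the pizza cut into?", "8"),
    VQAExample(1, "What is the child sitting on?", "chair"),
]
for ex in examples:
    print(ex.question, "->", answer_type(ex.answer))
```

Bucketing answers this way matters for evaluation: yes/no and number questions can be scored by exact match, while open-ended answers need softer matching against multiple human responses.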
Alibaba uses AliceMind in its Alime Shop Assistant chatbot, which tens of thousands of Alibaba sellers use every day.