Google has introduced SimVLM, a model that generates text from a single image. SimVLM can produce a simple one-sentence description of an image, complete a sentence given its first few words, and answer questions about objects in the image.
Visual language models can be used, for example, to generate captions describing the scene in a video recording. The approach aims to learn a single representation space from both visual and linguistic inputs, rather than learning two separate spaces.
The SimVLM model is built on a Transformer architecture and was trained on the ALIGN dataset, which contains about 1.8 billion image-text pairs.
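The description above suggests a setup in which image features and text tokens are embedded into one shared space and processed by a single Transformer. The sketch below illustrates that idea only; the layer sizes, dummy patch features, and names (ToyVisualLanguageModel, patch_proj, and so on) are assumptions for illustration rather than SimVLM's actual configuration, and causal masking of the text positions is omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyVisualLanguageModel(nn.Module):
    """Minimal sketch: image patch features and text tokens are projected
    into the same embedding space and fed to one Transformer that predicts
    text tokens. Hyperparameters are illustrative, not SimVLM's."""

    def __init__(self, vocab_size=32000, d_model=256, n_patches=49):
        super().__init__()
        self.patch_proj = nn.Linear(768, d_model)         # project patch features
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(n_patches + 512, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, token_ids):
        # patch_feats: (B, n_patches, 768), token_ids: (B, T)
        img = self.patch_proj(patch_feats)
        txt = self.token_emb(token_ids)
        seq = torch.cat([img, txt], dim=1)                # one joint sequence
        pos = torch.arange(seq.size(1), device=seq.device)
        hidden = self.backbone(seq + self.pos_emb(pos))
        # predict text tokens only, at the positions following the image
        return self.lm_head(hidden[:, patch_feats.size(1):, :])

model = ToyVisualLanguageModel()
patches = torch.randn(2, 49, 768)                         # dummy patch features
tokens = torch.randint(0, 32000, (2, 16))                 # dummy text prefix
logits = model(patches, tokens)                           # (2, 16, 32000)
```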
After training, the model can generate several types of text from a single image: a simple description of the image, a completion of a sentence given its first few words, and answers to questions about objects in the image.
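All three capabilities can be framed as conditional text generation in which the task is expressed through the text prefix. The helpers below sketch that framing; the model object and its generate(image, prompt=...) method are hypothetical stand-ins, not a published SimVLM interface.

```python
# Hypothetical wrappers: each task is reduced to generating text
# conditioned on the image plus a task-specific text prefix.

def caption(model, image):
    """Image captioning: generate a description with an empty prefix."""
    return model.generate(image, prompt="")

def complete_sentence(model, image, prefix):
    """Sentence completion: continue a caption from its first few words."""
    return model.generate(image, prompt=prefix)

def answer_question(model, image, question):
    """Open-ended VQA: phrase the question as a prefix and decode the answer."""
    return model.generate(image, prompt=f"question: {question} answer:")
```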
A comparison with similar models on the COCO Caption and NoCaps benchmarks showed that SimVLM achieves comparable accuracy even though, unlike those models, it does not rely on supervised pretraining.