PARP is the most effective of comparable methods for pruning large speech recognition models, and it makes automatic speech recognition practical for low-resource languages.
Speech recognition technology has become more widespread with the growing popularity of virtual assistants such as Siri, but many of these systems, because of their complexity and high training cost, work well only for the most widely spoken of the roughly 7,000 languages in the world. As a result, millions of speakers of less common languages cannot use voice translation or smart devices.
The researchers turned to the powerful wav2vec 2.0 speech recognition model, which has about 300 million connections and therefore requires large computational resources to fine-tune for a particular language. In the first stage of the PARP (Prune, Adjust and Re-Prune) method, wav2vec 2.0 is pruned by removing insignificant connections. The resulting subnetwork is then fine-tuned for a specific language and pruned again. In the second stage, previously removed connections are allowed to be restored if they turn out to be important for that particular language.
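The prune, adjust and re-prune loop described above can be sketched on a toy weight matrix. This is a minimal illustration, not the authors' implementation: it assumes magnitude-based pruning and stands a single gradient step in for language-specific fine-tuning. The key point it shows is that the adjustment step updates all weights, including pruned ones, so a connection removed in the first pruning pass can grow back and survive the re-pruning pass.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask that keeps the largest-magnitude weights."""
    k = int(sparsity * weights.size)          # how many weights to drop
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.abs(weights) > threshold

def parp_step(weights, grads, sparsity, lr=0.1):
    """One PARP-style iteration (a sketch, not the paper's code):
    1) prune by magnitude,
    2) adjust ALL weights, pruned ones included, so they can recover,
    3) re-prune by magnitude, possibly restoring removed connections."""
    mask = magnitude_mask(weights, sparsity)
    pruned = weights * mask                   # step 1: initial subnetwork
    adjusted = pruned - lr * grads            # step 2: pruned weights also move
    new_mask = magnitude_mask(adjusted, sparsity)   # step 3: re-prune
    return adjusted * new_mask, new_mask

# Toy demo with hypothetical weights and gradients (not real wav2vec 2.0 data).
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
g = rng.normal(size=(8, 8))
w_new, mask = parp_step(w, g, sparsity=0.5)
print(f"fraction of weights kept: {mask.mean():.2f}")
```

At 50% sparsity on a continuous-valued matrix, exactly half of the 64 toy weights survive the re-pruning pass.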
Although a separate subnetwork is fine-tuned for each language, it turned out that the resulting subnetworks overlap substantially. For French and Spanish, for example, they overlap by 97%. The researchers ran experiments on 10 languages, including Italian, Spanish, Russian, and Chinese, and obtained similar results.
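The overlap figure quoted above can be measured as the fraction of positions where two languages' pruning masks agree, i.e. both subnetworks keep or both drop the same connection. A minimal sketch, using small hypothetical masks rather than real subnetworks:

```python
import numpy as np

def mask_overlap(mask_a, mask_b):
    """Fraction of positions where two binary pruning masks agree."""
    return float(np.mean(mask_a == mask_b))

# Hypothetical 4x4 masks standing in for two languages' subnetworks.
rng = np.random.default_rng(1)
lang_a = rng.random((4, 4)) > 0.5
lang_b = lang_a.copy()
lang_b[0, 0] = ~lang_b[0, 0]      # the two masks differ in one position
print(f"overlap: {mask_overlap(lang_a, lang_b):.4f}")
```

With one differing entry out of 16, the overlap is 15/16 = 0.9375; on real subnetworks the same calculation would yield figures like the 97% reported for French and Spanish.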
The researchers also compared PARP with other common pruning methods and found that it achieves the best recognition accuracy.