Zyda: 1.3T Dataset for Open Language Modeling
12 June 2024
Zyda is a 1.3 trillion-token open-source dataset for language modeling. It integrates a range of high-quality open datasets, including RefinedWeb, Starcoder, C4, and Pile, and enhances them through comprehensive filtering…