Twitter Announced the RecSys 2020 Challenge and Released a Large Dataset of Tweets

Twitter has announced a new challenge, RecSys 2020, which aims to push the state of the art in recommender systems. Researchers and engineers from Twitter designed the challenge to encourage the development of new methods and algorithms for predicting user engagement and providing more accurate recommendations.

Together with the challenge, they released a large dataset from Twitter comprising more than 200 million public engagements in the form of likes, replies, retweets, and comments. According to Twitter, this is the world’s largest open-source engagement dataset, and it contains both user and engagement features.
The release contains three datasets: a training set, a test set, and a validation set. The data was collected sequentially: collection of the test set began only after the training set had been collected (over a span of one week), and the validation set was collected the week after the test set. Each entry in the dataset is characterized by 24 features divided into four groups: Tweet Features, Engaged-with User Features, Engaging User Features, and Engagement Features. The dataset is publicly available but can be accessed only with a Twitter developer account.
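To make the sequential collection scheme concrete, here is a minimal Python sketch that assigns records to splits by consecutive one-week windows. It is an illustration only, not Twitter's actual pipeline: the `timestamp` field name, the `split_by_week` helper, the start date, and the assumption that every split covers exactly one week are all hypothetical.

```python
from datetime import datetime, timedelta

def split_by_week(records, start, key="timestamp"):
    """Assign records to train/test/validation by consecutive one-week windows.

    Assumption: each split spans exactly one week, in the order the
    article describes (training, then test, then validation).
    """
    week = timedelta(weeks=1)
    names = ["train", "test", "validation"]
    splits = {name: [] for name in names}
    for rec in records:
        # Integer number of whole weeks elapsed since the start of collection.
        index = (rec[key] - start) // week
        if 0 <= index < len(names):
            splits[names[index]].append(rec)
    return splits

# Arbitrary start date and toy records, chosen only for illustration.
start = datetime(2020, 1, 1)
records = [
    {"id": 1, "timestamp": start + timedelta(days=1)},   # week 0
    {"id": 2, "timestamp": start + timedelta(days=8)},   # week 1
    {"id": 3, "timestamp": start + timedelta(days=15)},  # week 2
    {"id": 4, "timestamp": start + timedelta(days=22)},  # outside all windows
]
splits = split_by_week(records, start)
```

Splitting by time rather than at random means a model is always evaluated on engagements that happened after everything it was trained on, which mirrors how a recommender is used in production.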

The rules of the challenge state that all methods will be evaluated on a held-out test set containing data that is much more recent than the data in the training set. Submissions will be scored mainly using area under the curve (AUC) and cross-entropy loss.
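For readers unfamiliar with the two metrics, the following is a small, dependency-free Python sketch of how they can be computed for a single binary engagement label (for example, "liked" or "not liked"). It is not the official scoring code. The announcement does not say which curve the AUC refers to, so this sketch assumes the ROC curve; the function names are illustrative.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half.

    Assumption: AUC here means area under the ROC curve. This pairwise
    form is quadratic in the number of examples and is meant for
    illustration, not for a 200-million-row dataset.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels (1 = engaged) and predicted engagement probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.1]
auc = roc_auc(labels, probs)
loss = binary_cross_entropy(labels, probs)
```

AUC measures only how well the predictions rank engaged tweets above non-engaged ones, while cross-entropy also penalizes poorly calibrated probabilities, so the two metrics complement each other.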

More details about the RecSys 2020 challenge can be read here. The dataset can be found at the following link.