CodeNet: IBM dataset for training AI that generates and analyzes code

At its Think conference, IBM presented Project CodeNet, the largest open-source dataset for training neural networks on programming tasks. The dataset consists of 14 million code examples written in 55 programming languages.

Programmers spend more than half of their working time debugging code rather than writing it. The IT sector's debugging costs are estimated at $312 billion per year. Code generation and analysis tools based on artificial intelligence can significantly reduce these costs, allowing programmers to focus on more creative and less routine tasks. The goal of CodeNet is to accelerate the development of artificial intelligence systems that automatically translate code from one programming language into another, identify overlaps and similarities between different code examples, and tailor constraints to a developer's specific needs.

Translating code into a more modern or efficient programming language requires knowledge of both the source and the target language. For example, the Commonwealth Bank of Australia spent about $750 million over five years to convert its platform from COBOL to Java. Developing transcompilers is time-consuming because languages differ in syntax, APIs, standard library functions, and variable types.

The dataset contains more than 500 million lines of code in languages including C++, Java, Python, Go, COBOL, Pascal, and FORTRAN. CodeNet is about 10 times the size of the previous largest dataset, which contains 52,000 code examples. The code samples are designed for training neural networks on a variety of programming tasks, including code search and clone detection. In addition, the dataset includes metadata and annotations, such as code size, memory usage, CPU run time, and submission status, which make it possible to distinguish correct, efficient code from code that still contains errors. More than 90% of the code samples in CodeNet come with documentation that includes the problem statement and specifications for the input and output formats. For seven million examples, IBM also provides sample input and output data. Using CodeNet, researchers can run the code samples to extract additional metadata and to validate the output of generative neural networks.
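
As a rough illustration of how such metadata and sample input/output pairs might be used, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the column names (`submission_id`, `status`, `cpu_time`, and so on), the example rows, and the helper functions `accepted_by_speed` and `passes_sample` are hypothetical and are not taken from the CodeNet distribution itself.

```python
import csv
import io
import subprocess
import sys

# Hypothetical metadata rows; column names and values are assumptions,
# not the actual CodeNet file layout.
METADATA_CSV = """submission_id,language,status,cpu_time,memory,code_size
s001,Python,Accepted,20,9000,58
s002,Python,Wrong Answer,18,8900,61
s003,Python,Accepted,35,9100,74
"""


def accepted_by_speed(csv_text):
    """Return the rows whose status is 'Accepted', fastest CPU time first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    accepted = [row for row in rows if row["status"] == "Accepted"]
    return sorted(accepted, key=lambda row: int(row["cpu_time"]))


def passes_sample(source, sample_input, expected_output, timeout=5):
    """Run Python source on the sample input and compare stdout to the expected output."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=sample_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()


# Two toy solutions to a made-up "add two integers" problem.
GOOD = "a, b = map(int, input().split())\nprint(a + b)\n"
BAD = "a, b = map(int, input().split())\nprint(a - b)\n"

if __name__ == "__main__":
    print([row["submission_id"] for row in accepted_by_speed(METADATA_CSV)])
    print(passes_sample(GOOD, "2 3\n", "5\n"))
    print(passes_sample(BAD, "2 3\n", "5\n"))
```

The same check could in principle be applied to code produced by a generative model: if the generated program reproduces the expected output for the sample input, it passes this basic test. In practice, untrusted code should only be executed inside a sandbox.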