PandasAI: Harnessing Language Models for Data Analysis

PandasAI is a library that lets users perform basic data analysis through natural-language queries. Users supply one or more dataframes and a text query, and receive the result as a new dataframe, a number, or a graph.

PandasAI Capabilities

PandasAI is designed to complement standard Pandas, not replace it. To use the library, you need to provide an OpenAI or Google PaLM API key. The language model infers context from the field names in the dataframes. When several dataframes are passed, it determines the keys for joining them. For example, a query can request an aggregation over all records that meet specific conditions.
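To make the example above concrete, the sketch below shows the kind of plain Pandas code that such a query corresponds to. It does not call PandasAI itself. The tables, field names, and salary threshold are invented for illustration and do not come from the library's documentation.

```python
import pandas as pd

# Hypothetical tables: names and values are invented for illustration.
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cy", "Di"],
    "department": ["Sales", "Sales", "IT", "IT"],
})
salaries = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "salary": [5000, 6000, 7000, 4000],
})

# Query in words: "What is the total salary per department
# for employees who earn more than 4500?"

# Join the two tables on the field they share.
merged = employees.merge(salaries, on="employee_id")

# Keep only the records that meet the condition.
high_earners = merged[merged["salary"] > 4500]

# Aggregate the remaining records per department.
total_by_department = high_earners.groupby("department")["salary"].sum()
print(total_by_department)
```

The shared `employee_id` field is what makes the join key easy to guess from field names alone, which is why descriptive, consistent naming matters when several dataframes are passed together.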

Beyond free-form queries, the library provides helper functions called shortcuts. These cover filling in missing values, generating features, plotting confusion matrices and ROC curves, computing rolling (sliding-window) metrics, and segmenting records by a selected set of fields.

When a query returns a dataframe, that dataframe can be passed to subsequent queries, which makes it possible to chain commands. To protect data privacy, only five rows of each dataframe are sent to the language model, with randomized values in sensitive fields and shuffled values in all other fields. Alternatively, users can choose to send only the field names to the model.
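The privacy scheme described above can be sketched in a few lines. This is a rough approximation under stated assumptions, not PandasAI's actual implementation: the function names, the placeholder format, and the use of a plain dictionary of lists in place of a dataframe are all hypothetical, and the caller decides which fields count as sensitive.

```python
import random

def build_sample(columns, sensitive, n_rows=5, seed=None):
    """Return a small scrambled sample of a table given as {field: list of values}.

    Hypothetical helper: sensitive fields get random placeholders,
    all other fields get values drawn independently per field.
    """
    rng = random.Random(seed)
    n_total = min((len(v) for v in columns.values()), default=0)
    k = min(n_rows, n_total)
    sample = {}
    for name, values in columns.items():
        if name in sensitive:
            # Replace real values with random placeholders such as "email_042137".
            sample[name] = [f"{name}_{rng.randrange(10**6):06d}" for _ in range(k)]
        else:
            # Draw k values without replacement, separately for each field,
            # so that values from one original row no longer line up.
            sample[name] = rng.sample(values, k)
    return sample

def field_names_only(columns):
    """Stricter option: expose nothing but the field names."""
    return list(columns)

# Invented example data.
table = {
    "email": [f"user{i}@example.com" for i in range(10)],
    "age": list(range(20, 30)),
    "city": list("ABCDEFGHIJ"),
}
print(build_sample(table, sensitive={"email"}, seed=0))
print(field_names_only(table))
```

Shuffling each field independently keeps realistic value types and ranges visible to the model while breaking the link between values that belonged to the same record.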

The library is available at this link.
