Introduction to Pandas Profiling

Pandas Profiling is a library that extends pandas, the standard Python library for data manipulation and exploration, with automated report generation. An initial summary of the data is often more important than diving straight into the data manipulation itself, because it allows us to form a plan and strategy for approaching the data.

Most people use the pandas DataFrame.describe() method to get a quick insight into the data they are dealing with.

import pandas as pd
data = pd.read_csv('Automobile_data.csv')
data.describe()

pandas .describe() example (1985 Automobile data)

However, the output is limited to a handful of summary statistics, by default for numeric columns only, and can sometimes feel lackluster. This is where the pandas_profiling library comes into play.

A single line of code is enough for the report to be generated.

import pandas as pd
import pandas_profiling  # importing it adds profile_report() to DataFrame

data = pd.read_csv('Automobile_data.csv')
data.profile_report()
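Outside a notebook, the report usually has to be written to a file before it can be viewed. The following is a minimal sketch, assuming the library's ProfileReport class and its to_file() method; the function name build_report, the report title, and the output file name are placeholders chosen for illustration. Newer releases of the library are published under the name ydata-profiling, so the import line may need adjusting to the installed version.

```python
import pandas as pd


def build_report(csv_path, html_path="automobile_report.html"):
    """Profile a CSV file and save the report as a standalone HTML page.

    build_report, the title, and the default html_path are illustrative
    names, not part of the library.
    """
    # Imported here so the module still loads where the library is missing.
    from pandas_profiling import ProfileReport

    data = pd.read_csv(csv_path)
    report = ProfileReport(data, title="Automobile data profile")
    report.to_file(html_path)
    return html_path
```

Calling build_report('Automobile_data.csv') would then write the HTML file next to the script, where it can be opened in any browser.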

The generated report will be separated into 6 sections:

  • Overview
  • Variables
  • Interactions
  • Correlations
  • Missing values
  • Sample

Overview

This section shows mostly global details about the dataset (number of records, number of variables, overall missingness, duplicates, memory footprint), along with alerts about the cardinality of variables (which fields have a high number of distinct values) and their correlations (which fields are highly correlated with another field).

The dataset in question describes automobiles, with general information such as the make of the car, as well as detailed information such as the width, height, fuel system, horsepower, and price.
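To get a feel for what the Overview figures mean, the same kind of numbers can be computed by hand with plain pandas. This is only a sketch of the idea, not how the report computes them internally; the small DataFrame below is a made-up stand-in for the automobile data.

```python
import pandas as pd

# Hypothetical stand-in for the automobile data; values are illustrative only.
data = pd.DataFrame({
    "make": ["audi", "bmw", "bmw", "audi", "bmw"],
    "horsepower": [102.0, 101.0, 101.0, None, 121.0],
    "price": [13950.0, 16430.0, 16430.0, 17450.0, None],
})

overview = {
    "records": len(data),
    "variables": data.shape[1],
    "missing_cells": int(data.isna().sum().sum()),
    "duplicate_rows": int(data.duplicated().sum()),
    "memory_bytes": int(data.memory_usage(deep=True).sum()),
}

# Number of distinct values per column (missing values are not counted);
# a high count relative to the number of records means high cardinality.
distinct_per_column = data.nunique()

print(overview)
print(distinct_per_column)
```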

Variables

A variable summary shows a variable’s most important numeric statistics and a graph of data distribution. Additional details can be toggled to show the statistics (quantile and descriptive) of that specific variable, a histogram, frequency of values within that specific variable, and extreme values, indicating possible issues with the field.

pandas profiling Variables
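The per-variable details can be approximated with a few pandas calls, which helps when only one column is of interest. A rough sketch, assuming a numeric column; the price values below are invented for illustration and the selection of statistics is not the report's exact list.

```python
import pandas as pd

# Hypothetical price column; values are illustrative only.
price = pd.Series([5118, 6295, 7898, 10295, 13950, 16430, 45400], name="price")

# Quantile statistics.
quantiles = price.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# A few descriptive statistics.
descriptive = {
    "mean": price.mean(),
    "std": price.std(),
    "skew": price.skew(),
}

# Extreme values: the smallest and largest entries often reveal data issues.
smallest = price.nsmallest(3)
largest = price.nlargest(3)

# Frequency of each distinct value.
frequencies = price.value_counts()

print(quantiles)
print(descriptive)
```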

Interactions

A scatter plot is generated for any chosen pair of numeric variables. It is useful for visually checking the relationship between variables that one would expect to be related.

Correlations

A correlation plot of variables is generated with the option to change the correlation coefficient (Spearman, Pearson, Kendall, Cramér’s V, Phik) and show its description.

A correlation plot of variables
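The difference between the coefficients is easiest to see on a small example. The sketch below uses the pandas DataFrame.corr() method with its method parameter rather than the report itself; the horsepower and price values are invented so that price always rises with horsepower, but not along a straight line.

```python
import pandas as pd

# Hypothetical values: price grows with horsepower, but not linearly.
data = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [6000, 8000, 12000, 20000, 40000],
})

# Pearson measures linear association, Spearman measures rank agreement.
pearson = data.corr(method="pearson")
spearman = data.corr(method="spearman")

print(pearson.loc["horsepower", "price"])
print(spearman.loc["horsepower", "price"])
```

Because the ranks of the two columns agree perfectly, Spearman reports a perfect correlation, while Pearson stays below 1 because the relationship is not a straight line.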

Missing values

A simple visualization of nullity by column.
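The numbers behind such a chart are easy to reproduce with pandas. A minimal sketch on a made-up DataFrame; the column names mirror the automobile data but the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the automobile data; values are illustrative only.
data = pd.DataFrame({
    "make": ["audi", "bmw", "dodge", "honda"],
    "horsepower": [102.0, None, 68.0, 76.0],
    "price": [13950.0, None, None, 7295.0],
})

# Count of missing values per column.
missing_count = data.isna().sum()

# Share of missing values per column (0.0 means the column is complete).
missing_share = data.isna().mean()

print(missing_count)
print(missing_share)
```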


Sample

Two tables showing the first and last 10 records are generated, similar to the pandas .head() and .tail() methods (which show only 5 records by default).

Table showing the first 10 records generated

Table showing the last 10 records generated
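The same two tables can be produced directly in pandas by passing the number of records explicitly. A short sketch on a made-up DataFrame of 25 rows; the column name is a placeholder.

```python
import pandas as pd

# Hypothetical data with 25 records; the values are just a running index.
data = pd.DataFrame({"record_id": range(25)})

first_ten = data.head(10)  # the default would be 5 rows
last_ten = data.tail(10)

print(first_ten)
print(last_ten)
```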


“Pandas Profiling for data exploration” Tech Bite was brought to you by Demir Korać, Data Analyst at Atlantbh.

Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.
