Introduction to Pandas Profiling
Pandas profiling is a Python library that extends pandas, the standard library for data manipulation and exploration, with automated data summaries. An initial summary of the data is often more valuable than diving straight into data manipulation, because it allows us to form a plan and strategy for approaching the data.
Most people use the pandas DataFrame.describe() method to get quick insight into the data they are dealing with.
import pandas as pd

data = pd.read_csv('Automobile_data.csv')
data.describe()
pandas describe() example (1985 Automobile data)
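By default, describe() summarizes only the numeric columns. The sketch below shows how include="all" extends the summary to text columns. It uses a few made-up rows in place of the CSV: the column names mirror fields mentioned in this article, but the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the automobile CSV; the values are invented for illustration.
data = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota"],
    "horsepower": [102, 121, 115, 62],
    "price": [13950.0, 16430.0, None, 5348.0],
})

numeric_summary = data.describe()            # numeric columns only
full_summary = data.describe(include="all")  # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

Missing values are excluded from the count row, so a count lower than the number of rows is a quick hint that a column has gaps.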
However, this method can sometimes feel lackluster. This is where the pandas_profiling library comes into play.
A single line of code is enough for the report to be generated.
import pandas as pd
import pandas_profiling  # importing the package adds profile_report() to DataFrame

data = pd.read_csv('Automobile_data.csv')
data.profile_report()
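Outside a notebook, it is usually more convenient to write the report to an HTML file. Here is a minimal sketch of that, assuming the library's ProfileReport class. Newer releases of the library are distributed as ydata-profiling, so the import tries both package names. The helper name save_profile and the output file name are hypothetical and chosen only for this example.

```python
import pandas as pd

# The package has been published under two names; try the newer one first.
try:
    from ydata_profiling import ProfileReport
except ImportError:
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        ProfileReport = None  # neither package is installed


def save_profile(df: pd.DataFrame, path: str = "automobile_report.html",
                 minimal: bool = False) -> str:
    """Write an HTML profiling report for df and return the output path.

    save_profile is a hypothetical helper for this example, not part of the library.
    """
    if ProfileReport is None:
        raise RuntimeError("Install ydata-profiling (or the older pandas-profiling) first.")
    # minimal=True skips the more expensive computations, which helps on large datasets.
    report = ProfileReport(df, title="Automobile data", minimal=minimal)
    report.to_file(path)
    return path
```

With the library installed, calling save_profile(pd.read_csv('Automobile_data.csv')) would produce a standalone HTML file that can be opened in a browser or shared.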
The generated report is divided into six sections:
- Overview
- Variables
- Interactions
- Correlations
- Missing values
- Sample
The Overview section gives mostly global details about the dataset (number of records, number of variables, overall missingness, duplicates, memory footprint), along with alerts about cardinality (which fields have a high number of distinct values) and correlation (which fields are highly correlated with another field).
The dataset in question describes automobiles, with general information such as the make of the car, as well as detailed information such as the width, height, fuel system, horsepower, and price.
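To make the overview figures concrete, the sketch below computes comparable numbers with plain pandas. It is not the library's own implementation, and the rows are invented for illustration (the last row deliberately duplicates the first).

```python
import pandas as pd

# Toy rows with invented values; the last row duplicates the first on purpose.
data = pd.DataFrame({
    "make": ["audi", "bmw", "toyota", "audi"],
    "horsepower": [102.0, 121.0, None, 102.0],
    "price": [13950.0, None, 5348.0, 13950.0],
})

n_records, n_variables = data.shape
missing_cells = int(data.isna().sum().sum())

overview = {
    "records": n_records,
    "variables": n_variables,
    "missing_cells": missing_cells,
    "missing_share": missing_cells / data.size,   # fraction of all cells
    "duplicate_rows": int(data.duplicated().sum()),
    "memory_bytes": int(data.memory_usage(deep=True).sum()),
}

# Distinct values per column; a count close to the row count hints at high cardinality.
distinct_per_column = data.nunique()

print(overview)
print(distinct_per_column)
```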
In the Variables section, each variable’s summary shows its most important numeric statistics and a graph of its distribution. Additional details can be toggled to show quantile and descriptive statistics, a histogram, the frequency of values within that variable, and extreme values, which can indicate possible issues with the field.
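The per-variable details can be approximated for a single column with standard pandas calls. This is a rough sketch rather than the library's own computation, and the horsepower values are invented for illustration.

```python
import pandas as pd

# Invented horsepower values, for illustration only.
horsepower = pd.Series([62, 68, 102, 102, 115, 121, 288], name="horsepower")

# Quantile statistics
quantiles = horsepower.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Descriptive statistics
descriptive = {
    "mean": horsepower.mean(),
    "std": horsepower.std(),
    "skew": horsepower.skew(),
}

# Frequency of values, most common first
frequencies = horsepower.value_counts()

# Extreme values at both ends of the distribution
lowest = horsepower.nsmallest(3)
highest = horsepower.nlargest(3)

print(quantiles, descriptive, frequencies, lowest, highest, sep="\n\n")
```

An extreme value far from the rest, like the largest one here, is the kind of entry worth checking for a data-entry error before modeling.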
In the Interactions section, a scatter plot is generated for the chosen pair of numeric variables. It is useful for visually checking the relationship between variables that are expected to be related.
The Correlations section shows a correlation plot of the variables, with the option to switch the correlation coefficient (Spearman, Pearson, Kendall, Cramér’s V, Phik) and to show its description.
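The choice of coefficient matters because the coefficients measure different things. The sketch below uses the pandas corr() method on invented values where price rises with the cube of horsepower, so the relationship is perfectly monotonic but not linear. Cramér’s V and Phik are not built into pandas and are not covered here.

```python
import pandas as pd

# Invented values: price is a cubic function of horsepower, so the
# relationship is perfectly monotonic but not linear.
data = pd.DataFrame({"horsepower": [60.0, 90.0, 120.0, 150.0, 180.0]})
data["price"] = (data["horsepower"] / 10) ** 3

pearson = data.corr(method="pearson")    # strength of the linear relationship
spearman = data.corr(method="spearman")  # strength of the monotonic (rank) relationship

print(pearson.loc["horsepower", "price"])
print(spearman.loc["horsepower", "price"])
```

Spearman ranks the values before correlating them, so it reports a perfect relationship here, while Pearson comes out lower because the points do not fall on a straight line. corr() also accepts method="kendall".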
The Missing values section is a simple visualization of nullity by column.
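The numbers behind a nullity chart are simple to reproduce. A minimal sketch with plain pandas, using invented rows, counts present and missing values per column:

```python
import pandas as pd

# Toy rows with invented values and a few deliberate gaps.
data = pd.DataFrame({
    "make": ["audi", "bmw", "toyota", "volvo"],
    "horsepower": [102.0, None, 62.0, 114.0],
    "price": [None, None, 5348.0, 12940.0],
})

nullity = pd.DataFrame({
    "present": data.notna().sum(),
    "missing": data.isna().sum(),
    "missing_share": data.isna().mean(),  # fraction of rows missing per column
})

print(nullity)
```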
The Sample section shows two tables with the first and last 10 records, similar to the pandas head() method (which shows only the first 5 records by default).
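The same sample can be pulled directly in pandas by passing the number of rows to head() and tail(). The frame below is a placeholder with a single invented row_id column.

```python
import pandas as pd

# Placeholder frame with 25 rows; row_id is an invented column.
data = pd.DataFrame({"row_id": range(25)})

first_rows = data.head(10)   # first 10 records
last_rows = data.tail(10)    # last 10 records
default_rows = data.head()   # default: first 5 records

print(first_rows)
print(last_rows)
```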
“Pandas Profiling for data exploration” Tech Bite was brought to you by Demir Korać, Data Analyst at Atlantbh.
Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.