Introduction to Pandas Profiling

Pandas profiling is an open-source library that extends the standard Python pandas library for data manipulation and exploration. An initial summary of the data is often more important than diving straight into data manipulation, because it allows us to form a plan and strategy for approaching the data.

Most people use the pandas DataFrame.describe() method to get a quick insight into the data they are dealing with.

import pandas as pd

# Load the dataset and summarize its numeric columns
data = pd.read_csv('Automobile_data.csv')
data.describe()

pandas .describe() example (1985 Automobile data)

However, this summary is limited: by default it covers only the numeric columns and reports a handful of statistics for each. This is where the pandas_profiling library comes into play.

A single line of code is enough for the report to be generated.

import pandas as pd
import pandas_profiling  # importing it adds profile_report() to DataFrames

data = pd.read_csv('Automobile_data.csv')
data.profile_report()
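Outside a notebook, the report is usually written to an HTML file. The sketch below is one way to do that with the library's ProfileReport class and its to_file() method. It rests on a few assumptions: the helper name save_profile and the sample values are hypothetical, and the import is guarded because the package has also been published under the name ydata_profiling, so either name (or neither) may be installed.

```python
import pandas as pd

# Assumption: the library may be installed under either package name.
try:
    from ydata_profiling import ProfileReport
except ImportError:
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        ProfileReport = None  # profiling library is not installed


def save_profile(df, path, title="Profiling report"):
    """Write an HTML profiling report for df.

    Returns False without writing anything if the library is missing.
    """
    if ProfileReport is None:
        return False
    ProfileReport(df, title=title).to_file(path)
    return True


# Hypothetical stand-in for pd.read_csv('Automobile_data.csv'); values are made up.
cars = pd.DataFrame({
    "make": ["audi", "bmw", "toyota"],
    "horsepower": [102, 101, 62],
    "price": [13950, 16430, 5348],
})
```

Calling save_profile(cars, "automobile_report.html") would then produce a standalone file that can be opened in any browser.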

The generated report will be separated into 6 sections:

  • Overview
  • Variables
  • Interactions
  • Correlations
  • Missing values
  • Sample

Overview

This section gives mostly global details about the dataset (number of records, number of variables, overall missingness, duplicates, memory footprint), along with alerts about the cardinality of variables (which fields have a high number of distinct values) and their correlations (which fields are highly correlated with another field).

The dataset in question contains information regarding automobile data, with general information such as the make of the car, as well as detailed information such as the width, height, fuel system, horsepower, and price.
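To make the overview figures concrete, here is a rough sketch of how similar numbers can be computed with plain pandas. It is not how the report computes them internally, and the miniature DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the automobile data (values are made up).
cars = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota", "audi"],
    "horsepower": [102.0, 101.0, 102.0, None, 102.0],
    "price": [13950.0, 16430.0, 13950.0, 5348.0, 13950.0],
})

overview = {
    "records": len(cars),
    "variables": cars.shape[1],
    "missing_cells": int(cars.isna().sum().sum()),
    # Rows that repeat an earlier row exactly
    "duplicate_rows": int(cars.duplicated().sum()),
    "memory_bytes": int(cars.memory_usage(deep=True).sum()),
}

# Distinct values per column, the basis of a cardinality check
distinct_per_column = cars.nunique()

print(overview)
print(distinct_per_column)
```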

Variables

A variable summary shows a variable’s most important numeric statistics and a graph of data distribution. Additional details can be toggled to show the statistics (quantile and descriptive) of that specific variable, a histogram, frequency of values within that specific variable, and extreme values, indicating possible issues with the field.

pandas profiling Variables
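The per-variable details can be approximated by hand as well. The following sketch uses two made-up columns and only standard pandas Series methods; it mirrors the kind of statistics the Variables section shows rather than the library's own implementation.

```python
import pandas as pd

# Hypothetical sample columns (values are made up).
price = pd.Series([5348.0, 6295.0, 7895.0, 13950.0, 16430.0, 41315.0], name="price")
make = pd.Series(["toyota", "audi", "toyota", "bmw", "toyota", "audi"], name="make")

# Quantile statistics
quantiles = price.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Descriptive statistics
descriptive = price.agg(["mean", "std", "skew", "kurt"])

# Extreme values: the smallest and largest observations
lowest = price.nsmallest(3)
highest = price.nlargest(3)

# Frequency of each value in a categorical variable
frequencies = make.value_counts()

print(quantiles, descriptive, lowest, highest, frequencies, sep="\n\n")
```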

Interactions

A scatter plot on a Cartesian coordinate system is generated for the chosen pair of numeric variables. It is useful for visually checking the relationship between variables that are expected to be related.

Correlations

A correlation plot of variables is generated with the option to change the correlation coefficient (Spearman, Pearson, Kendall, Cramér’s V, Phik) and show its description.

A correlation plot of variables
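Two of the listed coefficients can be reproduced directly with pandas, which helps when a value in the plot needs to be checked. This is a minimal sketch on made-up data; pandas also accepts method="kendall", while Cramér's V and Phik are not built into DataFrame.corr().

```python
import pandas as pd

# Hypothetical numeric columns (values are made up).
cars = pd.DataFrame({
    "horsepower": [48, 70, 102, 160, 200],
    "price": [5151, 7000, 13950, 18000, 36000],
    "city_mpg": [47, 38, 24, 19, 16],
})

# Pearson measures linear association between two variables.
pearson = cars.corr(method="pearson")

# Spearman is computed on ranks, so it measures monotonic association.
spearman = cars.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

In this sample the price rises whenever horsepower rises, so the Spearman coefficient for that pair is 1, while the Pearson coefficient is lower because the relationship is not perfectly linear.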

Missing values

A simple visualization of nullity by column.
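The numbers behind such a chart can be obtained with plain pandas. A small sketch on made-up data, counting missing values per column and expressing them as a share of all rows:

```python
import pandas as pd

# Hypothetical sample with gaps (values are made up).
cars = pd.DataFrame({
    "make": ["audi", "bmw", "toyota", "volvo"],
    "horsepower": [102.0, None, 62.0, None],
    "price": [13950.0, 16430.0, None, 16845.0],
})

# Number of missing values in each column
missing_count = cars.isna().sum()

# Fraction of rows that are missing in each column
missing_share = cars.isna().mean()

print(missing_count)
print(missing_share)
```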

Sample

Two tables showing the first and last 10 records are generated, similar to the pandas .head() and .tail() methods (which show only 5 records by default).

Table showing the first 10 records

Table showing the last 10 records
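The same two views can be produced in pandas by passing the number of rows explicitly. A short sketch, using a made-up DataFrame with a hypothetical row_id column:

```python
import pandas as pd

# Hypothetical data: 25 records identified by a running number.
records = pd.DataFrame({"row_id": range(25)})

first_default = records.head()   # no argument: first 5 rows
first_ten = records.head(10)     # first 10 rows, as in the report
last_ten = records.tail(10)      # last 10 rows, as in the report

print(first_ten)
print(last_ten)
```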


“Pandas Profiling for data exploration” Tech Bite was brought to you by Demir Korać, Data Analyst at Atlantbh.

Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.
