
Data Analysis: Flexibility in the Data Preparation

June 21, 2017

The Atlantbh Analytics Team

All data analysts need to be able to prepare data, carry out calculations and present conclusions that will influence commercial decisions.

However, what sets Atlantbh analysts apart is flexibility in the data preparation phase. The analytics team includes R developers as well as engineers experienced in Java and Python, while SQL is a must-know for every member. This flexibility improves both the quality of the analysis output and the speed of delivery.

Data Preparation

Data preparation may include downloading, decompressing, standardizing formats and converting files before the dataset is finally imported into analytical tools and put through various inspections. The time needed to prepare the data is often disregarded, but it shouldn't be: it can sometimes equal the time needed for the analysis itself!

Three case studies involving the Atlantbh analytics team illustrate what we mean by flexibility in the data preparation phase.

Case Study 1 – Bottleneck: Preprocessing Large Files

The analytics team hit a performance issue when preprocessing files larger than 10 GB. These files had to be downloaded to local machines, standardized and uploaded back to the analytical server for in-depth analysis. The bottleneck was a download speed limit of 1 MB per second with conventional downloading tools: downloading a single file took over three hours!

The data storage server supported partitioned file downloads. After a couple of days of development, the analytics team built a small Java application that downloaded files in small chunks concurrently, cutting the three-hour wait to just 15 minutes.
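The original tool was written in Java; the Python sketch below illustrates the same idea under stated assumptions: the server honours HTTP Range requests, and the URL, chunk size and worker count are hypothetical.

```python
# Sketch of chunked, concurrent downloading via HTTP Range requests.
# Assumes the storage server supports partial (Range) downloads.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def chunk_ranges(total_size, chunk_size):
    """Split [0, total_size) into (start, end) byte ranges, end inclusive."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]

def fetch_range(url, start, end):
    """Download a single byte range with an HTTP Range header."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def download(url, total_size, out_path, chunk_size=8 * 1024 * 1024, workers=8):
    """Fetch all chunks concurrently and reassemble them in order."""
    ranges = chunk_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), ranges)
        with open(out_path, "wb") as out:
            for part in parts:  # map() preserves order, so chunks land in sequence
                out.write(part)
```

Because `ThreadPoolExecutor.map` yields results in submission order, the chunks can be written sequentially even though they are fetched in parallel.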

Case Study 2 – Accessing a 90 GB JSON Compressed File

In this job, the requirement was to standardize a 90 GB compressed JSON file with an estimated size of 400 GB after decompression. Bearing in mind that the average Mac disk size is 250 GB, this use case was a serious challenge!

The solution was a Java application that read content directly from the compressed file without ever decompressing the entire file to disk. While reading, a highly redundant field that would not affect the analysis results was detected. The field was removed on the fly, the file structure was standardized and the content was saved to a new file, which came to just 20 GB when compressed.

Case Study 3 – Standardizing Complex Data Structures

The Atlantbh analytics team received a large number of unconventional data deliveries that had to be standardized prior to analysis. One of the most interesting scenarios was the delivery of more than 50 PostgreSQL table dumps, accompanied by 100-page PDFs explaining the software that generated the data, the data structures and the relationships between the tables. Table sizes ranged from thousands to millions of records.

The first step was to understand the data and identify the tables and fields relevant to the analysis. The second step was writing and optimizing SQL queries to extract the data. Finally, a tool was developed that exported the data from the Postgres tables into JSON files. The entire process took less than a week.
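The export step might be sketched as below; the source doesn't describe the tool's internals, so this is an assumed shape using a standard DB-API connection (e.g. psycopg2), with the table name and query as placeholders.

```python
# Sketch of the Postgres-to-JSON export step. The connection, table name and
# query are hypothetical; any DB-API driver (e.g. psycopg2) would work here.
import json

def rows_to_json_lines(columns, rows):
    """Turn query results into newline-delimited JSON, one object per row."""
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in rows)

def export_table(conn, table, out_path):
    """Run the extraction query and write the results as JSON lines."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT * FROM {table}")  # the optimized extraction query
        columns = [desc[0] for desc in cur.description]
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(rows_to_json_lines(columns, cur.fetchall()))
```

Keeping the row-to-JSON conversion separate from the database access makes the formatting logic easy to test without a live Postgres instance.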

Quicker and Better Analysis

Depending on the domain that is being analyzed, data delivery forms can vary significantly. When an analytics team is able to adapt and standardize data into a digestible form, work can progress much faster.

The analytics team’s technical skills simplify the data preparation phase and increase processing speed. Strong technical skills also improve the quality of analysis by enabling new types of analysis to be developed. Look out for the next post about the Atlantbh analytics team for more detail!