A data processing-based solution is a system or approach that collects, manipulates, analyzes, and utilizes data to solve a particular problem or address a specific need. Data processing is crucial in many fields and industries, including business, healthcare, finance, science, and technology, because it enables companies to turn raw data into valuable business information. These kinds of solutions require a particular approach from the quality assurance standpoint. This article explains the challenges QA professionals face with data processing-based solutions and introduces strategies to conquer them.
Data Processing Stages
From the QA standpoint, we must ensure that every data processing stage meets quality standards. We often hear the term ETL (extract, transform, and load), which describes the initial part of data processing-based solutions. According to What is ETL? – Extract Transform Load Explained – AWS, ETL uses a set of business rules to clean and organize raw data and prepare it for the next steps in data processing. A data processing workflow can contain many steps, but according to the 5 Best Data Processing Software: Complete Guide, the following crucial ones are always present:
- Data Collection:
This starting point in data processing involves collecting raw data from the appropriate sources, such as message brokers, file storage, etc. It corresponds to the ‘extract’ part of ETL.
- Data Preparation:
This step filters out invalid, unneeded, or inaccurate data and converts the data to the format needed for further processing. It corresponds to the ‘transform’ part of ETL.
- Data Input:
In this step, prepared data is provided to the processing stage. It corresponds to the ‘load’ part of ETL, in which the data is moved from the staging (initial) area to the target area.
- Data Processing:
The data undergoes different transformations to produce the desired output, including calculations and various data processing methods (machine learning or other algorithms).
- Data output/interpretation:
In this step, the results of processing are presented. Data teams display them on a UI in easy-to-read formats for users, such as graphs, widgets, dashboards, tables, video, audio, etc. Sometimes engineers develop applications (web, mobile, desktop, etc.) that present the data; sometimes they generate reports based on it. Several main data types are used to present data in a usable way:
– Text: telling the story behind the data
– Chart: showing trends such as growth or decline
– Table: presenting statistical data
– Image: also widely used for data presentation
- Data storage:
Data teams store data and metadata for many reasons, such as quick access when needed, further processing, keeping backup data, etc. We can save it in databases, data warehouses, file storage, etc.
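The stages above can be sketched as a minimal ETL flow. This is an illustrative sketch with in-memory stand-ins for the broker and storage, not a specific product's API; all function and field names are assumptions:

```python
# Minimal sketch of the collect -> prepare -> input -> store flow,
# using plain Python data structures as stand-ins for real sources
# and targets (message brokers, S3, databases, etc.).

def extract(raw_messages):
    """Data collection: pull raw records from a source (e.g. a broker)."""
    return list(raw_messages)

def transform(records):
    """Data preparation: drop invalid rows and normalize the format."""
    cleaned = []
    for rec in records:
        if rec.get("value") is None:      # filter invalid/unneeded data
            continue
        cleaned.append({"id": rec["id"], "value": float(rec["value"])})
    return cleaned

def load(records, target):
    """Data input: move prepared records from staging to the target area."""
    target.extend(records)
    return target

target_store = []
raw = [{"id": 1, "value": "3.5"}, {"id": 2, "value": None}]
load(transform(extract(raw)), target_store)
print(target_store)  # [{'id': 1, 'value': 3.5}]
```

From the QA side, each of these three functions is a seam where data can be lost, duplicated, or corrupted, which is why the test strategies below verify every stage rather than only the final output.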
In most cases, this whole process is automated using various tools, and it can be based on many processing models. According to Difference between Batch Processing and Real Time Processing System – GeeksforGeeks, these are the most common ones:
- Batch Processing:
Processes large volumes of data in batches, on a schedule.
- Real-Time Processing:
Processes data as it arrives, in real-time or near-real-time.
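The difference between the two models can be illustrated with a small sketch. The function names and the trivial transformations are our own illustrative assumptions, not a framework's API:

```python
# Illustrative contrast between batch and real-time processing models.

import queue

def batch_process(buffer, batch_size=3):
    """Batch model: accumulate records and process them together on a trigger."""
    results = []
    while len(buffer) >= batch_size:
        batch, buffer[:] = buffer[:batch_size], buffer[batch_size:]
        results.append(sum(batch))          # one aggregated result per batch
    return results

def stream_process(q, out):
    """Real-time model: handle each record as soon as it arrives."""
    while not q.empty():
        out.append(q.get() * 2)             # per-record transformation

buf = [1, 2, 3, 4, 5, 6]
print(batch_process(buf))   # [6, 15]

q = queue.Queue()
for x in (1, 2, 3):
    q.put(x)
out = []
stream_process(q, out)
print(out)                  # [2, 4, 6]
```

The distinction matters for QA: batch tests must account for the schedule trigger, while streaming tests must account for data arriving continuously, as discussed under time-dependent testing below.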
Main QA Challenges
We have several challenges from a Quality Assurance standpoint when working with data processing-based solutions. Here are the main challenges:
- Ensuring data quality and integrity: QA teams ensure the accuracy and integrity of data throughout the data pipelines. They verify that data is not lost, duplicated, or corrupted during the ETL process (including data loads, transformations, etc., through the data pipelines).
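A common way to check for loss, duplication, or corruption between two pipeline stages is to reconcile record counts and content checksums. The sketch below works on in-memory lists; a real check would query the source and target stores, and the record shapes are illustrative:

```python
# Reconciliation sketch: compare record counts and an order-independent
# content checksum between the source and target of a pipeline stage.

import hashlib
import json

def checksum(records):
    """Order-independent checksum over the records' content."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

source = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
target = [{"id": 2, "v": 20}, {"id": 1, "v": 10}]  # same rows, different order

assert len(source) == len(target), "row count mismatch: possible data loss"
assert checksum(source) == checksum(target), "content mismatch: possible corruption"
print("source and target reconcile")
```

Sorting the per-record digests makes the comparison insensitive to row order, which is usually not guaranteed across pipeline stages.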
- Data validation and verification: QA teams need to create data verification and validation test cases (manual or automated) that verify that data meets quality standards and that data transformations comply with business needs. Sometimes these transformations can be complex and include calculations and aggregations.
Example: Let us say we have a system that consumes data from a message broker such as Kafka or RabbitMQ, saves the message to the initial data area (S3 bucket, SQL database, etc.), and then moves it to the target area. After that, we apply data transformations, including mathematical operations, to get the final outcome, which is saved in a report file such as an Excel sheet. For this kind of solution, we can have the following test case steps:
– Provide data to the message broker.
– Connect to the initial data area and verify that the data sent to the message broker is stored and valid.
– Connect to the target data area and verify that data has been transformed according to the accepted criteria.
– Verify that the report file is created in the needed place.
– Verify that the content of the file meets accepted criteria.
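The steps above could be sketched as one automated end-to-end test. Everything here is a simplified stand-in: `run_pipeline` is a hypothetical system under test, and the in-memory lists replace what would really be Kafka/RabbitMQ, S3 or SQL storage, and an Excel report:

```python
# End-to-end test sketch for the broker -> staging -> target -> report flow.

def run_pipeline(broker, staging, target, reports):
    """Hypothetical system under test: consume, stage, transform, report."""
    for msg in broker:
        staging.append(msg)                              # save to initial area
    for msg in staging:
        target.append({"id": msg["id"], "total": msg["amount"] * 2})
    reports["report.xlsx"] = [f"{r['id']},{r['total']}" for r in target]

def test_pipeline_end_to_end():
    broker, staging, target, reports = [], [], [], {}
    # Step 1: provide data to the message broker.
    broker.append({"id": 1, "amount": 10})
    run_pipeline(broker, staging, target, reports)
    # Step 2: verify the data reached the initial (staging) area intact.
    assert staging == [{"id": 1, "amount": 10}]
    # Step 3: verify the transformation in the target area meets the criteria.
    assert target == [{"id": 1, "total": 20}]
    # Step 4: verify the report file was created in the expected place.
    assert "report.xlsx" in reports
    # Step 5: verify the report content meets the accepted criteria.
    assert reports["report.xlsx"] == ["1,20"]

test_pipeline_end_to_end()
print("pipeline test passed")
```

In a real suite, each assertion would be a connection to the actual store (an S3 client, a SQL query, a file read), but the shape of the test stays the same.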
- Handling large volumes of data: This kind of solution, in most cases, involves big data sets. From a QA standpoint, it is essential to verify that the system stays stable under a large volume of data, without bottlenecks or data loss. A good approach for this verification is to develop load tests (using frameworks such as JMeter, k6, etc.). Large data volumes can also make automated test cases tricky, because processing them takes time and resources, so they need a particular approach.
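The idea behind such a load test can be sketched in plain Python (a real suite would use JMeter or k6); `process_record` is an assumed stand-in for the system's ingestion call, and the thresholds are illustrative:

```python
# Minimal load-test sketch: push many records through concurrently,
# then check for data loss and measure throughput.

import time
from concurrent.futures import ThreadPoolExecutor

def process_record(record):
    """Stand-in for the system under load, e.g. an ingestion API call."""
    return record["id"] * 2

def run_load(num_records=10_000, workers=8):
    records = [{"id": i} for i in range(num_records)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_record, records))
    elapsed = time.perf_counter() - start
    # Core load-test checks: no data loss, and a measurable throughput.
    assert len(results) == num_records, "records were dropped under load"
    return num_records / elapsed   # records per second

print(f"throughput: {run_load():.0f} records/s")
```

The same pattern, with real network calls instead of `process_record`, is what dedicated tools like k6 automate and report on at much larger scale.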
- QA teams cannot perform all test case scenarios in lower environments: Often, the mocked data in lower environments is not enough to test some features thoroughly, so QA teams complete testing in upper environments with production data. One solution that can help in this case is to have a test environment that consumes the same data as the production environment. It is important to pay attention and develop practices that keep mocked data distinct from production data.
- Insufficient data to test the feature: Often, a lot of data is needed in the system to test some features thoroughly, and we cannot predict how the system will behave when more data comes in. One option to overcome this challenge is to generate mocked data temporarily, or to postpone the deployment to the production environment until sufficient real data is available.
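Generating temporary mocked data can be as simple as the sketch below; the field names and value ranges are illustrative assumptions, not a real schema:

```python
# Sketch of generating temporary mocked data to fill a sparsely
# populated environment with enough volume for thorough testing.

import random
import string

def make_mock_record(seq):
    return {
        "id": seq,
        "name": "".join(random.choices(string.ascii_lowercase, k=8)),
        "amount": round(random.uniform(1, 1000), 2),
        "is_mock": True,   # flag so mocked rows stay distinct from real data
    }

def generate_mock_data(count):
    return [make_mock_record(i) for i in range(count)]

data = generate_mock_data(5000)
assert all(rec["is_mock"] for rec in data)
print(len(data), "mocked records generated")
```

The explicit `is_mock` flag follows the earlier advice to keep mocked data clearly distinguishable from production data, so it can be filtered out or cleaned up later.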
- Data privacy and security: Attention to data privacy regulations (GDPR, HIPAA) is essential. QA teams must store the data securely, focus on access control, and adequately isolate the test environments.
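One common supporting practice is masking personally identifiable fields before data reaches a test environment. The field list and masking scheme below are illustrative assumptions; real requirements come from the applicable regulation:

```python
# Sketch of masking PII fields before production-like data enters a
# test environment.

import hashlib

PII_FIELDS = {"email", "ssn", "phone"}

def mask_record(record):
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            # A deterministic hash keeps referential integrity (the same
            # input always masks to the same token) without exposing the
            # original value.
            masked[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

row = {"id": 7, "email": "user@example.com", "amount": 42}
print(mask_record(row))
```

Deterministic masking is a design choice: it lets joins and duplicate checks still work on masked data, at the cost of weaker anonymity than random tokenization, so the choice should follow the privacy requirements of the project.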
- Documentation and knowledge transfer: QA teams should collaborate with Product Owners on creating and maintaining up-to-date documentation. For solution quality maintenance, it is necessary to ensure knowledge transfer and documentation of test cases.
- Cross-functional Collaboration: Effective collaboration between QA teams, product owners, software engineers, data engineers, data scientists, and other professionals involved in development is crucial. Well-established communication ensures that everyone has a shared understanding of quality requirements. Sometimes more than one team works on a data processing-based solution, which requires well-established communication through cross-team channels and documentation. For example, one team works on data collection and preparation, and another team works on the other data processing stages.
- Working with various data processing tools: QA professionals need to work with multiple data processing tools. The most popular ones are AWS (with its data processing resources such as Redshift, Glue, Lambda, etc.), Google Cloud Platform (with its data processing resources such as BigQuery, Cloud Dataprep, etc.), Azure Cloud (with its data processing resources such as SQL Warehouse, Data Factory, Azure Functions, etc.), and Snowflake, among others. They need to be capable of following trends in data processing by continuously improving their skills and keeping up to date with new tools and practices.
- Time-dependent testing: This requires a particular approach for both manual verification and test automation. If the system is based on a batch processing model, processing is triggered periodically, so we must adapt our test cases accordingly. For example, if the processing stage runs every hour, then after providing new data we have to wait for the processing schedule before verifying that the data was successfully processed. On the other hand, if the system is based on a real-time processing model, we may encounter other difficulties related to continuous data streams, especially in test automation. For example, if our automated test case expects some output based on calculations, a new data stream entering the system could cause a false test failure.
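For the batch case, a polling helper is usually more robust than a fixed sleep covering the whole batch interval: it returns as soon as the expected output appears, or fails after a deadline. The helper below is a generic sketch; `check_fn` and the timings are illustrative:

```python
# Polling helper for time-dependent checks: wait until a condition holds
# or a deadline passes, instead of sleeping for the whole batch interval.

import time

def wait_until(check_fn, timeout=60.0, interval=2.0):
    """Return True as soon as check_fn() is truthy, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_fn():
            return True
        time.sleep(interval)
    return False

# Usage: after providing input data, wait for the scheduled batch to land.
processed = []            # stand-in for the target area
processed.append("row-1") # pretend the batch job has completed
assert wait_until(lambda: "row-1" in processed, timeout=5, interval=0.1)
print("data processed within the window")
```

For the real-time case, the analogous trick is to make assertions tolerant of extra incoming records, for example by checking only for the records the test itself injected (tagged with a unique test ID) rather than asserting on the whole output.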
In the continued development of data processing-based solutions, we must recognize the role of Quality Assurance. The challenges of QA in data processing-based solutions may be complex, but they are solvable with the right strategies and a commitment to excellence. All in all, QA is the unsung hero in ensuring the integrity and reliability of data. The data-driven future holds limitless opportunities, and with strong QA, we can be sure that we are approaching it with precision and confidence.
“QA Challenges on Data Processing-based Solutions” Tech Bite was brought to you by Haris Habul, Senior Quality Assurance Engineer at Atlantbh.
Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.