Make it work, then make it fast

Atlantbh had the opportunity to implement a back-end software solution for an insurance company. The client required a data processing engine able to transform data of very different sizes, from a few kilobytes to dozens of gigabytes, using the same mechanism. Having a unified engine with the same transformation logic for files containing 2 records (small data sets) and 20 million records (big data sets) was important for achieving simplicity.

From the functional point of view, the following requirements had to be addressed:

  • To start a processing flow, the user needs to provide the input data along with the configuration data set. Both are provided using a predefined schema and are validated by the system.
  • Once triggered, each processing flow contains multiple logical steps, each of them transforming the data.
  • The processing flow is dynamic and driven by the configuration.
  • Complexity varies greatly between steps (from simple operations to the application of very extensive business rules).
  • The files given as input and produced as output in each step often need to be in a different format, and sometimes the number of files cannot be predicted in advance.
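The configuration-driven flow described above can be sketched as a small data model. This is a hypothetical illustration only; the names, fields, and validation rule here (`StepConfig`, `FlowConfig`, `FlowValidator`) are assumptions for the sketch, not the client's actual schema:

```scala
// Hypothetical model of a configuration-driven processing flow.
// Each step declares its own input/output format, since formats
// may differ between steps.
case class StepConfig(name: String, inputFormat: String, outputFormat: String)
case class FlowConfig(flowType: String, steps: Seq[StepConfig])

object FlowValidator {
  // Validate the flow definition before triggering any processing,
  // mirroring the requirement that configuration is validated up front.
  def validate(flow: FlowConfig): Either[String, FlowConfig] =
    if (flow.steps.isEmpty) Left(s"Flow '${flow.flowType}' has no steps")
    else Right(flow)
}
```

A real schema would carry far more detail (per-step parameters, business-rule references), but the shape is the same: the flow is data, and the engine interprets it.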

There was also a set of non-functional requirements:

  • Performance requirements were given for both small and big data sets and defined as a highly important aspect for the business. For example, small data sets such as files with up to 10k records need to be processed within minutes and should never be blocked by the big ones. There is also a minimum number of processes that should run in parallel without affecting performance, etc.
  • Different customers should be able to use the system and have full control over their data. They should not be affected by other users in any way, and access to their data should be secured.
  • It should be possible to control priorities for the different processing flow types.

Faced with this combination of business and technical constraints, we needed to come up with the right architecture, one that would be both easy to extend and maintain. On top of this, a working solution was needed within a short timeframe, in order to enable testing and verification of complex business rules.


Challenge 1: Choose a technology that supports the required functionality

Solution: The choice of technology for data processing and transformation was already made by the client. They decided on Apache Spark, considering that it could support the scalability required by the system. This way, the development team was limited in the technology it could use, but free to use it in a way that would support all business cases – those known at the time and those possibly needed in the future.

Challenge 2: Define how Apache Spark will be used to support all business cases

Possibilities and Solution: To make an adequate decision on how to design the architecture and implement certain functionalities using Spark in a way that would support all business cases, two approaches were considered:

  • usage of the Thrift Server, and
  • writing custom Spark jobs in Scala, submitted for processing via a custom Process Manager.

1) The usage of the Thrift Server seemed simpler to implement, but had limitations that would complicate things in the long run

With the Thrift Server, the main focus would be on describing how the actual data transformations are performed, as all the responsibilities related to the technical usage of the Spark cluster would be shifted to the Thrift Server.

One processing context is created and reused: in the case of small data sets, jobs would be performed over the same context, removing any unnecessary overhead from the preparation of processing.

On the other hand, the Thrift Server available inside the platform we used seriously limited flexibility. Only SQL was allowed for transformations, seriously limiting the transformation capabilities for more complex business cases. Data I/O was limited to a single data format, and there was no option for fine-grained control over resources or parallelism configuration, which is integral to fine-tuning the transformation processing for different sizes of data.

2) Writing custom Spark jobs in Scala was the more complex solution

The benefit was that it gave the development team the needed flexibility and freedom in designing the jobs.

It required more expertise at a lower level of abstraction and would shift the responsibility of resource allocation for specific jobs to the development team. But this way, each processing job would be independent and more flexibility could be achieved, since Scala does not have the limits of SQL and can be combined with other languages.
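To illustrate that extra expressiveness with a hedged sketch: a business rule with arbitrary control flow, such as this hypothetical capped, escalating premium surcharge (the rule, names, and numbers are invented for illustration), is a plain Scala function, something awkward to express in SQL alone:

```scala
case class Policy(id: String, premium: Double, claims: Int)

object Rules {
  // Hypothetical business rule: surcharge grows quadratically with
  // the number of claims, but is capped at 50% of the base premium.
  def adjustPremium(p: Policy): Policy = {
    val surcharge = math.min(0.10 * p.claims * p.claims, 0.50)
    p.copy(premium = p.premium * (1 + surcharge))
  }
}
```

In a Spark job, such a function can be applied per record (e.g. via a `map` over a typed Dataset) and freely combined with other Scala or Java code.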

Considering both options, and keeping in mind the requirements given in the introduction, we decided to go with the second solution. The more flexible option was chosen to ensure that new problems and constraints would not emerge along the way, and the team had the expertise to face the challenge.

Challenge 3: Meet performance requirements despite the different challenges

Moving further, the chosen solution showed certain drawbacks related to performance. Spark was running on a Hadoop PaaS platform, which used YARN as the resource manager.

It meant that each Spark job required the following steps to be executed (List 1*):

  • job acceptance,
  • resource allocation,
  • spinning up of a new context,
  • read resource(s) from HDFS,
  • execution,
  • write output resource(s) to HDFS and
  • shut down used context.

This brought unnecessary overhead to each job, which was most visible in the case of small data sets, where the supporting steps could take more time than the main execution step.

Solution: Due to the challenges faced and the time constraints given, we agreed with the client to adopt the approach: make it work, then make it fast. We enabled the client to perform business rules validation while the team worked on optimizations to improve performance.

Optimization was a multipart process, but we would emphasize the three parts that made the most impact.

1) Overhead reduced by chaining smaller jobs into a large one

Initially, each step was implemented separately, which exposed the issue with the YARN process manager and stage overheads: the same stages (List 1*) were repeated for each part of the transformation process.

Therefore, we decided to chain all jobs (executions) into one large job, where input and output operations were performed just once at the beginning and the end of the encompassing job instead of each individual step.

This required a complex framework outside of the jobs to configure and correctly orchestrate them, and to take the burden of read/write operations off the individual Spark jobs.
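The chaining idea above can be sketched as composing the step transformations into a single function, so input is read once at the start and output is written once at the end. This is a simplified stand-in, assuming rows as plain key–value maps rather than Spark DataFrames:

```scala
object ChainJobs {
  // Simplified stand-ins: a row is a key-value map, a step is a
  // pure transformation over the whole data set.
  type Row = Map[String, String]
  type Step = Seq[Row] => Seq[Row]

  // Chain all step transformations into one composite job. Read and
  // write happen once, around the composed function, instead of
  // per individual step.
  def chain(steps: Seq[Step]): Step =
    steps.reduceOption(_ andThen _).getOrElse(identity)
}
```

With Spark the same composition works over lazy DataFrame transformations, which is what removes the per-step read/write and context spin-up cost.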

2) “Fast-lane” introduced for the small data sets

On a larger scale, outside of a single data transformation flow, there were performance overheads in running small data sets on a full-blown Spark cluster. The main issue was that the overheads coming from Spark resource scheduling, context initialization, and driver–executor communication were not justifiable for the relatively small job executions that came with smaller data sets.

A change was made to run small data sets in existing pre-configured local Spark contexts, without changing or separating the implementation of the business rules. With the introduction of a "fast-lane" for running smaller Spark jobs on an already existing context, the overhead from the usual Spark application spin-up was removed – effectively reducing the processing time of small jobs to only the time taken to actually execute the business logic. Large data sets continued to be executed on the Spark cluster, while small jobs would go either to the Spark local context or to the cluster, depending on client requirements.
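The fast-lane routing decision can be sketched as a small function. The threshold and the override flag here are illustrative assumptions; the real cutoff would be tuned per deployment and driven by client requirements:

```scala
object Router {
  sealed trait ExecutionTarget
  case object LocalContext extends ExecutionTarget // pre-existing local Spark context
  case object Cluster extends ExecutionTarget      // full YARN-managed Spark cluster

  // Hypothetical fast-lane threshold (e.g. the "up to 10k records"
  // small-data-set range from the requirements).
  val FastLaneMaxRecords = 10000L

  // Small jobs take the fast lane unless the client requires the
  // cluster; large jobs always go to the cluster.
  def route(recordCount: Long, forceCluster: Boolean = false): ExecutionTarget =
    if (!forceCluster && recordCount <= FastLaneMaxRecords) LocalContext
    else Cluster
}
```

Because the business rules are written as ordinary Scala transformations, the same job code runs unchanged on either target; only the Spark context it receives differs.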

3) Maximized throughput with resource utilization configurations

For scenarios where both large and small jobs are running on the cluster, there is the natural problem of properly utilizing the available resources to maximize the average throughput between job sizes.

Therefore, each job’s resource utilization was dynamically configured based on the size of the data set that needed to be processed. A queuing mechanism was established, with proper job-size and resource-consumption configurations defined based on the underlying cluster. An additional factor we needed to take care of was allowing parallel execution of small and large jobs, while making sure that large jobs would not block the cluster by using all available resources.
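A minimal sketch of such size-based resource configuration might look as follows. The size tiers, executor counts, memory values, and queue names are hypothetical placeholders, not the actual cluster configuration; the point is that a large job's allocation is capped so it cannot monopolize the cluster:

```scala
object Resources {
  // Illustrative resource profile mapped onto YARN-style queues.
  case class SparkResources(executors: Int, executorMemoryGb: Int, queue: String)

  private val Gb = 1024L * 1024 * 1024

  // Pick a profile from the input size. Small jobs get a light
  // footprint; large jobs get more, but a deliberately capped share
  // so they never block the cluster for parallel small jobs.
  def resourcesFor(inputBytes: Long): SparkResources =
    if (inputBytes < 1 * Gb)       SparkResources(2, 2, "small")
    else if (inputBytes < 50 * Gb) SparkResources(8, 8, "medium")
    else                           SparkResources(16, 16, "large")
}
```

In a real deployment these profiles would translate into per-job Spark settings (such as executor count and memory) and YARN queue assignments, tuned against the actual cluster capacity.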

In the end, these Spark optimizations played a large role in achieving our final goal: meeting the customer's performance requirements!


When faced with a short deadline, it may seem more desirable to use technologies, tools, or frameworks that require little or no deep-dive knowledge of how they work under the surface. Nevertheless, we were aware that in this case it wouldn't be the right approach. We considered not just the actual requirements, which were sufficiently complex, but also the possibility of them being significantly extended in the future, which is very common for the type of system we worked on. Therefore, we decided to go for something more complex and flexible, which in the short run may have seemed like a bad decision, but in the long run paid large dividends.

Two important factors influenced choosing this path. First, we had the personnel and expertise to quickly acquire all the necessary knowledge. Second, together with our client we were able to identify what needed to be done to make it ‘just work on time’, and then focus on the ‘make it fast’ part.

This case study was written by Atlanters: Nejra Suljić, Irma Smajić, Dino Prašo, and Ridvan Appa Bugis.

If you enjoyed it, share it with your friends and colleagues – or read more blogs about Big Data!

