Areas of Expertise

  • Big Data
  • Business Process Automation
  • Data Science
  • Performance Optimization
  • Upgrades & Migrations

Industries

  • Healthcare

Technology Used

  • Python
  • Apache Airflow 2.0
  • Azure Monitor
  • Azure Data Lake Storage Gen2
  • Azure DevOps

Challenge

Healthcare organizations aggregate petabytes of data that help drive life-changing health and business decisions. To avoid errors and redundancies, it’s imperative that this data — everything from patient records and medical charts to employee time sheets and hospital expenses — is processed quickly and accurately from one central location.

A large, not-for-profit healthcare organization charged Six Feet Up with improving its existing Extract, Transform and Load (ETL) pipeline. At the beginning of the engagement, the pipeline took approximately 24 hours to process data. In this fast-paced environment, an entire day was too long, leaving the healthcare organization with little to no time to troubleshoot issues or rerun the data before violating its service-level agreements (SLAs). To reduce the pipeline’s runtime and ensure scalability, the in-house development team needed additional Python expertise.

Six Feet Up’s big data challenge: 

  • update the pipeline’s codebase to reduce the potential for human error (e.g., misconfiguration, cascading bugs where the fix for one pipeline breaks another, and other brittleness),
  • optimize the ETL pipeline database for speed, and 
  • ensure clear visibility into the pipeline’s data outputs. 

Implementation Details

Update the pipeline’s codebase to reduce the potential for human error 

The healthcare organization’s existing ETL pipeline, which was built in-house, processes an enormous volume of data (approximately 1.7 million petabytes daily) that continues to grow at an exponential rate.

When working with big data, Six Feet Up’s team of expert developers uses the motto, “The larger the database, the tighter the code.” Ensuring the code is as clean and tight as possible reduces runtime and minimizes the potential for human error.

Using open source technologies and leveraging existing tools’ features, Six Feet Up consolidated the codebase to standardize libraries, eliminate one-off data notebooks and remove tens of thousands of lines of repetitive code.
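
As a hedged illustration of this consolidation pattern, the sketch below shows how a single parameterized entry point can stand in for many near-identical per-dataset scripts. Every name in it (DatasetConfig, DATASETS, run_etl, the CSV paths) is hypothetical rather than taken from the organization’s codebase:

    # Hypothetical sketch: one parameterized loader replaces many
    # copy-pasted per-dataset scripts and one-off notebooks.
    import csv
    from dataclasses import dataclass
    from pathlib import Path

    @dataclass(frozen=True)
    class DatasetConfig:
        name: str     # logical dataset name, e.g. "patient_records"
        source: Path  # where the raw extract lands
        target: Path  # destination for the cleaned output

    # A single registry of configurations replaces per-dataset scripts.
    DATASETS = [
        DatasetConfig("patient_records", Path("raw/patients.csv"), Path("out/patients.csv")),
        DatasetConfig("timesheets", Path("raw/timesheets.csv"), Path("out/timesheets.csv")),
    ]

    def run_etl(cfg: DatasetConfig) -> None:
        """Extract, transform and load one dataset via shared helpers."""
        with cfg.source.open(newline="") as f:
            rows = list(csv.DictReader(f))                              # extract
        cleaned = [{k: v.strip() for k, v in r.items()} for r in rows]  # transform
        cfg.target.parent.mkdir(parents=True, exist_ok=True)
        with cfg.target.open("w", newline="") as f:                     # load
            writer = csv.DictWriter(f, fieldnames=list(cleaned[0]) if cleaned else [])
            writer.writeheader()
            writer.writerows(cleaned)

    if __name__ == "__main__":
        for cfg in DATASETS:
            run_etl(cfg)

Under this pattern, supporting a new dataset means adding one configuration entry instead of copying and editing another script, which is what keeps the codebase tight as the data volume grows.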

Optimize the ETL pipeline database for speed and ensure clear visibility into the pipeline’s data outputs

To complete this project, Six Feet Up: 

  • utilized Azure Monitor for monitoring, alerting, reporting and debugging the pipeline’s process; 
  • migrated the pipeline to Apache Airflow 2.0 for a better user interface (UI) with higher visibility and clarity for each process; 
  • upgraded to Azure Data Lake Storage Gen2, which significantly reduced storage costs;
  • parallelized the pipeline’s processes to reduce the impact of failures, simplify scheduling and create more leeway to meet the organization’s SLAs (see the DAG sketch after this list);
  • created an automated test suite for catching regressions early in the process (a minimal example also follows the list);
  • expanded the pipeline’s functionality to allow the ETL service to scale up to several thousand more pipelines; and
  • documented the process as wikis in Azure DevOps.
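
As a hedged illustration of the migration and parallelization steps above, the sketch below shows an Airflow 2 DAG that schedules per-dataset ETL work as independent tasks, so a failure in one dataset only requires rerunning that branch. The DAG id, dataset names and run_etl callable are illustrative assumptions, not the organization’s actual pipeline:

    # Hedged sketch of an Airflow 2 DAG with independent per-dataset
    # tasks that the scheduler can run in parallel. All names here are
    # hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_etl(dataset: str) -> None:
        # Placeholder for the real extract/transform/load logic.
        print(f"processing {dataset}")

    with DAG(
        dag_id="healthcare_etl",  # hypothetical DAG id
        schedule_interval="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # The tasks share no dependencies, so Airflow can run them
        # concurrently; one dataset's failure doesn't block the rest,
        # and only the failed branch needs a rerun.
        for dataset in ("patients", "timesheets", "expenses"):
            PythonOperator(
                task_id=f"etl_{dataset}",
                python_callable=run_etl,
                op_kwargs={"dataset": dataset},
            )

The automated test suite can then pin down each transform step so regressions surface before a full run. A minimal pytest-style sketch, again with a hypothetical normalize_timesheet transform rather than code from the actual suite:

    # Hedged sketch of a regression test; the transform and its fields
    # are illustrative only.
    def normalize_timesheet(row: dict) -> dict:
        # Transform under test: trim whitespace and coerce hours to float.
        return {"employee": row["employee"].strip(), "hours": float(row["hours"])}

    def test_normalize_timesheet_strips_and_coerces():
        row = {"employee": "  jdoe ", "hours": "7.5"}
        assert normalize_timesheet(row) == {"employee": "jdoe", "hours": 7.5}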

Throughout the implementation process, Six Feet Up provided the in-house development team with hands-on support and knowledge transfer so the team can continue optimizing and improving the pipeline as more data accumulates. The pipeline’s ability to scale easily allows for additional data analysis and more accurate health and business information for the healthcare organization to act upon.

Results

Six Feet Up — in collaboration with the healthcare organization’s internal development team — has built a plan that will reduce the ETL pipeline’s runtime by 20 hours (from 24 hours to 4 hours) for all but the largest pipelines. This extra time will give the healthcare organization the opportunity to troubleshoot and resolve any unforeseen issues before the systems development life cycle (SDLC) ends.

Additionally, the fully functional and maintainable pipeline will provide greater data visibility for the healthcare organization’s data acquisition teams and drastically reduce computing costs.

Today, this pipeline is being used to provide accurate and complete datasets which allow the organization to make critical business and health-related decisions.
