Optimizing a Machine Learning Pipeline to Reduce Run Time

CHALLENGE

Professional sports organizations rely on analytics to provide nearly real time data regarding players’ on-field performances. From how fast a player runs, to the football’s speed, the intensity of a collision, advanced run-pass tendencies, safety rotation, play similarity, line strength and more - this data is full of important metrics. The faster teams receive the data, the faster they can make adjustments to improve and optimize the game play.

For one professional sports analytics firm, it was taking four and a half days — far too long for the real time analysis and reporting needed — to process the data.

The challenge for Six Feet Up’s team of expert developers:

streamline the firm’s machine learning data pipeline model to reduce run time and provide better visibility into the pipeline’s processes; and
move the process to a serverless model that would save both time and money.

Implementation Details

The discovery phase of this project required an in-depth analysis of the firm’s existing monolithic process which included reviewing the code and technology stack to ensure the appropriate tooling was in place. In this situation, the pipeline included Python and several popular machine learning libraries, but relied heavily on NoSQL and Amazon Relational Database Service (RDS) accessed with long running EC2 instances as opposed to AWS’ Lambda service.

Parallelize the machine learning data pipeline’s processes using Amazon Web Services (AWS) Lambda

After the discovery phase, Six Feet Up’s expert team of developers hit the ground running to execute an optimization strategy that would leverage the physical server aspect of serverless computing. Specifically, the team:

separated the code base into individual elements,
spun up parallel machines with the same code base elements,
hid cached data in Lambdas virtual file systems,
allowed Python to start off from Lambda’s “disks” of cached data, and
customized large machine learning libraries to reduce their sizes.

Deploy machine learning models with appropriate monitoring and alerting services to reduce troubleshooting turn-around time

By parallelizing and componentizing the process behind the pipeline, Six Feet Up produced results faster than the systems development life cycle (SDLC), which allows for quicker reconfigurations and optimizations to the process. The cost to keep multiple machines busy with no room for system errors versus parallelizing the process is negligible, but the time and effort saved account for considerable cost savings.

RESULTS

In three short weeks, Six Feet Up brought the professional sports analytics firm’s machine learning pipeline’s run time down from four and a half days of waiting and hoping components wouldn’t fail to just under 90 seconds — faster than their coffee could even finish brewing. Today, the pipeline continues to be used as the firm’s primary data source for validating major professional football game models against each other.