How to Effectively Reduce AI Pipeline Runtime

3:45pm - 4:10pm on Friday, October 4 in PennTop North

Hichame El Khalfi, Deepshikha Gandhi

Description

In this talk, we will discuss why and how to migrate PySpark pipelines from CPython to PyPy.

We will share an example involving a core AI pipeline that ingests more than 4 TB of data (Parquet, TSV, and JSON) per run and produces optimized models on behalf of marketing clients. We’ll outline how migrating to PyPy reduced overall runtime by 30% without any code changes, while keeping the operations team happy.

We will also recommend the steps to follow to achieve this runtime reduction, covering unit testing, Spark configuration, and production deployment. Finally, we will touch on some limitations you may encounter with PyPy.
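
As a rough illustration of the Spark configuration step, the sketch below builds a spark-submit command that points both the driver and the executors at a PyPy interpreter. It uses the standard Spark properties spark.pyspark.python and spark.pyspark.driver.python. The interpreter path, the application file name, and the helper function names are hypothetical and are not taken from the pipeline described in this talk.

```python
import shlex

# Hypothetical install location; PyPy must exist at this path on every node.
PYPY_PATH = "/opt/pypy3/bin/pypy3"


def pypy_spark_conf(interpreter=PYPY_PATH):
    """Return Spark properties that select the Python interpreter.

    spark.pyspark.python sets the interpreter for executors (and the driver
    by default); spark.pyspark.driver.python sets it for the driver.
    """
    return {
        "spark.pyspark.python": interpreter,
        "spark.pyspark.driver.python": interpreter,
    }


def spark_submit_command(app, conf):
    """Build a spark-submit argument list from a dict of Spark properties."""
    cmd = ["spark-submit"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # "pipeline.py" is a placeholder application name.
    print(shlex.join(spark_submit_command("pipeline.py", pypy_spark_conf())))
```

Because the interpreter is chosen purely through configuration, the same application file can be submitted under CPython or PyPy, which makes it easy to compare the two runs side by side.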