Improving Beam-Dataflow Pipelines for Text Data Processing

In this session, we’ll share a few recipes for improving Beam-Dataflow pipelines that process sequence data. These methods come from our experience processing and preparing large datasets for ML use at Carted. We’ll provide a step-by-step framework for analyzing the issues that can surface when processing text data at scale, and we’ll share our approaches to dealing with them.

We hope you’ll apply these recipes to your own Beam-Dataflow pipelines to improve their performance.

Some of the topics we’ll cover in this session are discussed in our blog post: https://www.carted.com/blog/improving-dataflow-pipelines-for-text-data-processing/. The accompanying GitHub repository is available here: https://github.com/carted/processing-text-data.


Sayak Paul

Nilabhra Roy Chowdhury
