The batch processing data pipelines at Twitter process hundreds of petabytes of data on a daily basis. These pipelines run on on-premise Hadoop clusters and use Scalding as the framework for data processing.We are currently in the midst of migrating our batch processing pipelines to the cloud. With approximately 5000 production pipelines, it becomes necessary to automate the migration of these pipelines to the cloud.
Scalding is an abstract API for expressing computations on data, and allows backend implementations to be plugged in. It currently has both a Cascading and Spark backend. In this talk, we’ll describe a new Apache Beam backend for Scalding that we are in the process of implementing at Twitter. This backend expresses the various computations in Scalding as PTransforms in Beam. With this backend, we can run Scalding pipelines on Beam in the cloud with near zero code changes, thus automating the migration of these pipelines to the cloud.