This session comprises several talks and a hands-on workshop on Scio, the open source Scala API for Apache Beam.
An introduction to Scio and how it leverages features of the Scala programming language.
We will work through a series of kata-like exercises for Scio, where we progressively reveal new concepts and SDK utilities, and build up our knowledge of how to use Scio in our applications.
Joining large datasets is one of the main tasks when working with Beam and Scio. Joins are a major source of runtime and cost in these pipelines, because they cause most PCollection data to be serialized and transferred to new workers. This talk examines how Scio can save you time and money with clever join strategies and approximate algorithms.
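To make the cost argument concrete, here is a minimal sketch in plain Scala collections (deliberately not Scio's API) that contrasts a shuffle-style join, which regroups both sides by key, with a hash-style join, which turns the small side into an in-memory lookup so the large side never has to be regrouped. The abstract does not say which strategies the talk covers, so treat this as one common example; `Play`, the sample data, and both function names are hypothetical.

```scala
object JoinSketch {
  // Hypothetical record type for the large side of the join.
  final case class Play(userId: Int, track: String)

  // Shuffle-style join: both sides are regrouped by key before pairing.
  // In a distributed runner this regrouping is where data is serialized
  // and moved between workers.
  def shuffleJoin(plays: Seq[Play], countries: Seq[(Int, String)]): Seq[(String, String)] = {
    val left  = plays.groupBy(_.userId)
    val right = countries.groupBy(_._1)
    left.toSeq.flatMap { case (userId, ps) =>
      val cs = right.getOrElse(userId, Seq.empty)
      for (p <- ps; c <- cs) yield (p.track, c._2)
    }
  }

  // Hash-style join: the small side becomes a lookup table and the large
  // side is processed element by element, with no regrouping.
  // Assumption: keys on the small side are unique (toMap keeps the last one).
  def hashJoin(plays: Seq[Play], countries: Seq[(Int, String)]): Seq[(String, String)] = {
    val lookup: Map[Int, String] = countries.toMap
    plays.flatMap(p => lookup.get(p.userId).map(country => (p.track, country)).toList)
  }

  // Hypothetical sample data: user 3 has no country and is dropped (inner join).
  val plays: Seq[Play] = Seq(Play(1, "a"), Play(2, "b"), Play(1, "c"), Play(3, "d"))
  val countries: Seq[(Int, String)] = Seq(1 -> "SE", 2 -> "US")
}
```

Both functions return the same pairs for the sample data, up to ordering. The trade-off is that the hash-style version only works when the small side fits in memory on every worker.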
We will explain the use case and algorithm behind the rollupAndCount aggregation, which is part of the scio-extra package. When creating a dataset with rollup dimensions, there is a potentially huge fan-out transform before the aggregation step, and this fan-out can incur large shuffle costs. It is possible to reduce the fan-out drastically by rethinking the problem. This talk covers the backstory of the use case we had at Spotify and explains how we developed the algorithm behind rollupAndCount to solve this problem more efficiently.
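The fan-out problem can be illustrated with a small sketch of the naive approach in plain Scala collections. This is not the rollupAndCount algorithm, which the abstract does not describe; it only shows the baseline the talk sets out to improve on. The wildcard `All`, the function names, and the sample records are hypothetical.

```scala
object RollupSketch {
  // Hypothetical wildcard marking a rolled-up dimension.
  val All = "*"

  // Emit every combination of keeping or rolling up each dimension.
  // A record with n dimensions produces 2^n keys.
  def fanOut(dims: List[String]): List[List[String]] =
    dims.foldRight(List(List.empty[String])) { (d, acc) =>
      acc.flatMap(rest => List(d :: rest, All :: rest))
    }

  // Naive rollup: fan out every record, then count per key.
  // In a distributed pipeline the fanned-out elements are what gets shuffled.
  def naiveRollupCount(records: Seq[List[String]]): Map[List[String], Int] =
    records.flatMap(fanOut).groupBy(identity).map { case (k, vs) => k -> vs.size }

  // Hypothetical sample: (country, platform) per record.
  val sample: Seq[List[String]] =
    Seq(List("SE", "ios"), List("SE", "android"), List("US", "ios"))
}
```

With two dimensions, each of the three sample records expands into four keys, so twelve elements reach the counting step. The multiplier doubles with every added rollup dimension, which is why the fan-out dominates shuffle cost on large inputs.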