Welcome
Opening message for Beam Summit 2022.
Beam Summit Team
Google's investment in Beam, and internal use of Beam at Google
Kerry will speak about why Google cares about Apache Beam, and also about how many internal teams at Google use Beam’s Python and Go SDKs for ML workloads and log analytics.
Kerry Donny-Clark
Machine learning design patterns: between Beam and a hard place
In a recent book entitled Machine Learning Design Patterns, we captured best practices and solutions to recurring problems in machine learning. Many of these design patterns are best implemented using Beam. The obvious example is the Transform design pattern, which allows you to replicate arbitrary operations from the training graph in the serving graph while keeping both training and serving code efficient and maintainable. Indeed, the tf.transform package makes this easy.
Lak Lakshmanan
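As a rough illustration of the Transform design pattern mentioned in this abstract, here is a minimal tf.Transform `preprocessing_fn` sketch. The feature names and scaling choices are assumptions for illustration, not taken from the book or the talk.

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Hypothetical preprocessing_fn: the same transformations are analyzed
    over the training data by a Beam pipeline and then embedded in the
    serving graph, which keeps training and serving consistent."""
    return {
        # Assumed feature names, for illustration only.
        "price_scaled": tft.scale_to_z_score(inputs["price"]),
        "category_id": tft.compute_and_apply_vocabulary(inputs["category"]),
    }
```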
Tailoring pipelines at Spotify
At Spotify we run more than 18,000 data pipelines. To help our developers build and maintain those, we made Scio - a Scala API for Beam, inspired by Apache Spark & Scalding. This session will show some of the advantages of Scio, and how we combine it with Beam for performance and ease of development.
Rickard Zwahlen
The adoption, current state and future of Apache Beam at Twitter
Evaluation, adoption effort, current state, and future of Apache Beam at Twitter.
Lohit Vijayarenu
RunInference: Machine Learning Inferences in Beam
Users of machine learning frameworks must currently implement their own PTransforms for predictions or inference. Only TensorFlow makes a RunInference Beam transform available, but it is not highly accessible since it is hosted in the TFX-BSL repo. We are creating implementations of RunInference for two popular machine learning frameworks, scikit-learn and PyTorch. These will take advantage of both internal optimizations, like shared model resources, and framework-specific optimizations such as vectorization. The transform will have a clean, simple, unified interface and use types intuitive to developers for inputs and outputs (NumPy arrays for scikit-learn, tensors for PyTorch).
Andy Ye
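For context, a minimal sketch of what the scikit-learn flavor of RunInference looks like in the Beam Python SDK; the model path and input values are placeholders.

```python
import numpy
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import (
    ModelFileType,
    SklearnModelHandlerNumpy,
)

# Assumed path to a pickled scikit-learn model.
model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://my-bucket/models/model.pkl",
    model_file_type=ModelFileType.PICKLE,
)

with beam.Pipeline() as p:
    predictions = (
        p
        | beam.Create([numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])])
        # Each output is a PredictionResult carrying the example and its inference.
        | RunInference(model_handler)
    )
```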
Vega: Scaling MLOps Pipelines at Credit Karma using Apache Beam and Dataflow
At Credit Karma, we enable financial progress for more than 100 million members by recommending personalized financial products to them when they interact with our application. In this talk we introduce our machine learning platform, which uses Apache Beam and Google Dataflow to build interactive and production MLOps pipelines that serve relevant financial products to Credit Karma users. Vega, Credit Karma's machine learning platform, uses BigQuery, Apache Beam, distributed TensorFlow, and Airflow for building MLOps pipelines.
Debasish Das, Vishnu Venkataraman & Raj Katakam
Houston, we've got a problem: 6 principles for pipeline design taken from the Apollo missions
Based on the inspiring talk about Apollo XI, “Light Years Ahead”, this talk covers 6 design principles for data pipelines, with a focus on streaming pipelines. Submitting a job to a runner is not that different (well, a bit) from “submitting” a spacecraft into space: you lose direct access to the device, and can only send a couple of commands, read logs, and receive telemetry. The 6 principles presented in this talk will make your pipeline land on the moon rather than crash!
Israel Herraiz & Paul Balm
Speeding up development with Apache Beam (Adobe Experience Platform)
At the core of Adobe Experience Platform (AEP), there is a large Apache Kafka deployment: 20+ data centers, 300+ billion messages a day. We use it to import/export data for external customers and integrate internal solutions. Processing those events involves lots of boilerplate code and practices: understanding the streaming platform, optimizing for throughput, instrumenting metrics, deploying the service, alerting, and monitoring. Out of these requirements, we built a team to construct and be responsible for the infrastructure where Apache Beam jobs would run.
Constantin Scacun & Alexander Falca
State of the Go SDK 2022
Learn what's happened, what's happening, and what's going to happen with the Apache Beam Go SDK. This session provides an overview of improvements to the Go SDK since the last Beam Summit, especially what's been happening since the Go SDK left Experimental. The Go ecosystem, cross-language, streaming, and more!
Robert Burke
How the sausage gets made: Dataflow under the covers
The team that builds the Dataflow runner likes to say it “just works”. But how does it really work? How does it decide to autoscale? What happens on failures? Where is state data stored? Can I have infinite state data? What goes into Dataflow’s latency? What’s the checkpoint logic? These and other questions to be answered in this session!
Pablo Estrada
Powering Real-time Data at Intuit: A Look at Golden Signals powered by Beam
The Stream Processing Platform powers real-time data applications at Intuit using the Apache Beam framework. The platform makes it easy for an engineer to access, transform, and publish streaming data, with turn-key solutions to debug and deploy their streaming applications on a managed platform in a manner that is compliant with and leverages Intuit's central capabilities. Since launching 3 years ago, the platform has grown to over 150 pipelines in production and has handled at peak ~17.
Omkar Deshpande, Dunja Panic, Nick Hwang & Nagaraja Tantry
How to break Wordle with Beam and BigQuery
In this session we will cover how I used Beam and BigQuery to find the best combination of words to play Wordle.
Iñigo San Jose Visiers
Introduction to performance testing in Apache Beam
We all know that benchmarking is a very important but quite tough and ambiguous part of performance testing for every software system. Furthermore, benchmarking systems that support different language SDKs and distributed data-processing runtimes, like Apache Beam, is even tougher and more ambiguous. In this talk we will discover which benchmark frameworks Apache Beam already supports (Nexmark, TPC-DS), how they can be used for different purposes (like release testing) by developers and ordinary users, and what the plans are for future work.
Alexey Romanenko
Migrating from Spark to Apache Beam/Dataflow with hexagonal architecture + DDD
At my previous customer, we did a code migration from Spark/Dataproc to Apache Beam/Dataflow. I proposed a hexagonal architecture + domain-driven design with Apache Beam, in order to isolate the business code (bounded context/domain) from the technical code (infrastructure). This architecture is used with code decoupling and dependency injection. I used Dagger2, and I am going to explain why :) The purpose is to show a Beam project with this architecture and explain why it's interesting.
Mazlum Tosun
BlueVoyant: Detecting Security Dumpster Fires on the Internet
At BlueVoyant, we process billions of cyber events per second in an effort to secure the supply chains of several high-stakes Fortune 500 companies. Cyber attacks exploit the weakest links in a target company, and modern tech infrastructures are composed of more third-party dependencies than ever before. Our Third-Party Risk (3PR) product requires us to monitor, analyze, and pinpoint weaknesses throughout hundreds of thousands of third party entities, each consisting of tens to millions of assets that may be exploited.
Alfredo Gimenez, Adam Najman, Tucker Leavitt & Tyler Flach
From script slums to Beam skyscrapers
Our data pipelines grew organically over time, one script piling up on another. We got to a point where the pipelines became too slow and too error-prone to change. Sound familiar? Join me in this session, where I will walk through our journey of transforming our data processing from serial Python scripts to massively parallel Beam pipelines.
Shailesh Mangal
Data Integration on cloud made easy using Apache Beam
How we used Apache Beam on the Google Cloud Dataflow runner to build a robust data integration solution to onboard data from various source systems (RDBMS, REST APIs) into Google Cloud BigQuery.
Parag Ghosh
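As a rough sketch of the loading step such a solution ends with, here is how a Beam Python pipeline can write extracted records into BigQuery; the project, dataset, schema, and sample record are invented for illustration.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # Stand-in for records extracted from an RDBMS or a REST API.
        | beam.Create([{"customer_id": 1, "name": "Alice"}])
        | beam.io.WriteToBigQuery(
            "my-project:analytics.customers",  # hypothetical table spec
            schema="customer_id:INTEGER,name:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```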
How to benchmark your Beam pipelines for cost optimization and capacity planning
Are your deployed Beam pipelines properly sized to meet your SLOs? Currently there's no standard way of benchmarking your Dataflow pipelines to measure expected output rate or utilization, which are required for capacity planning and cost optimization. This talk presents a methodology and a toolset to benchmark the performance (event latency and throughput) of a custom Beam pipeline using Dataflow as the runner. Results are synthesized into prescriptive sizing guidelines to achieve the best performance/cost ratio.
Roy Arsan
Strategies for caching data in Dataflow using Beam SDK
This session discusses different strategies for caching data in Dataflow using the Beam SDK.
Zeeshan
Collibra’s Telemetry Backbone - OpenTelemetry and Apache Beam
When you're a company that's building out a successful SaaS platform, there comes a time when classic observability tools don't cut it anymore: they don't scale as you grow, or they become too expensive. That's why the SRE team started to look at observability as a big data problem. Combining a specification for telemetry (OpenTelemetry) with big data tools like Apache Beam, they built the Telemetry Backbone, the streaming hub for all of Collibra's telemetry data.
Alex Van Boxel
Cloud Spanner change streams and Apache Beam
This session will give an overview of Cloud Spanner change streams, discuss the challenges of capturing change streams with the Apache Beam framework and Dataflow, and dive into the specific use case of streaming change records from Spanner into BigQuery using Beam.
Haikuo Liu, Nancy Xu & Le Chang
Implementing Cloud Agnostic Machine Learning Workflows with Apache Beam on Kubernetes
The need for a highly efficient data processing workflow is fast becoming a necessity in every organization implementing and deploying Machine Learning models at scale. In most cases, ML teams leverage the managed service solutions already in place from the cloud infrastructure provider they choose. While this approach is good enough for most teams to get going, the long-term cost of keeping the platform running may become prohibitive over time.
Charles Adetiloye & Alexander Lerma
New Avro serialization and deserialization in Beam SQL
At Palo Alto Networks we rely heavily on Avro, using it as the primary storage format and Beam Row as the in-memory format. We de/serialize billions of Avro records per second. One day we realized that the Avro-Row conversion routines consume much of our CPU time. Then the story begins…
Talat Uyarer
Where is Beam leading Data Processing now?
Kenn Knowles, the PMC chair for Apache Beam, will speak about the last year of developments in Beam, and the exciting things that are coming to Beam.
Kenn Knowles
Palo Alto Networks' massive-scale deployment of Beam
Talat will talk about how Beam was adopted at Palo Alto Networks, and describe how they deploy thousands of Beam pipelines to keep their customers secure.
Talat Uyarer
Beam as a High-Performance Compute Grid
Risk Management is a key function across all Financial Services. Often this entails heavy computer simulation exploring how financial markets could evolve. Counterparty Credit Risk requires particularly extreme compute capacity given the billions of daily calculations it performs. Increased regulatory demand (FRTB SA CVA, BCBS 239, etc.) has required further compute power. Traditionally, these simulations are broken up into small tasks and distributed in a compute grid. Map-reduce techniques are then used to aggregate and calculate statistical properties.
Peter Coyle & Raj Subramani
Oops, I wrote a Portable Beam Runner in Go
Like all the SDKs, the Apache Beam Go SDK has a direct runner for simple testing. However, unlike Python and Java, it has languished, not covering all parts of the model supported by the SDK. Worse, it doesn't even use the FnAPI! This dive into the runner side of Beam will cover how I wrote my own testing-oriented replacement for the Go direct runner, and how it can become useful for Java, Python, and future SDKs.
Robert Burke
Optimizing a Dataflow pipeline for cost efficiency: lessons learned at Orange
This session will present a Beam use case in the telco industry. We will introduce the use case and context as well as the solution implementation including things such as configuration, autoscaling, writing to BigQuery, profiling and improving the code. In each point we will go over the thought process behind the decision and its effects on cost/performance.
Jérémie Gomez & Thomas Sauvagnat
Visually build Beam pipelines using Apache Hop
In this workshop we'll give a detailed overview of the Apache Hop platform. We'll design a few simple Beam pipelines using the Hop GUI, after which we'll run them on a few runners like GCP Dataflow and Spark. After that we'll cover best practices for version control, unit testing, integration testing, and much more. No prior knowledge of Apache Hop or Beam is required to follow this workshop.
Matt Casters
Combine by Example - OpenTelemetry Exponential Histogram
Apache Beam's concepts can sometimes be overwhelming, and it is sometimes hard to see why something is designed the way it is. One of these concepts is the Combine function. This session takes a non-trivial data structure (the OpenTelemetry Exponential Histogram) as an example and goes through each step of the design to make it a super-efficient Combine function. Learn by example.
Alex Van Boxel
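The session builds a Combine function for the OpenTelemetry Exponential Histogram; the sketch below only shows the mechanics every Python CombineFn shares, using a running mean rather than the histogram itself, just to make the four methods concrete.

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    """Illustrative CombineFn: a running mean, not the talk's histogram."""

    def create_accumulator(self):
        return (0.0, 0)  # (sum, count)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        # Runners pre-combine per bundle, then merge partial accumulators.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

with beam.Pipeline() as p:
    means = p | beam.Create([1.0, 2.0, 3.0]) | beam.CombineGlobally(MeanFn())
```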
Unified Streaming and Batch Pipelines at LinkedIn using Beam
Many use cases at LinkedIn require real-time processing and periodic backfilling of data. Running a single codebase for both needs is an emerging requirement. In this talk, we will share how we leverage Apache Beam to unify Samza stream and Spark batch processing. We will present the first unified production use case, Standardization. By leveraging Beam on Spark for its backfilling, we reduced the backfilling time by 93% while using only 50% of the resources.
Shangjin Zhang & Yuhong Cheng
Log ingestion and data replication at Twitter
Data Analytics at Twitter relies on petabytes of data across data lakes and analytics databases. Data may come from log events generated by Twitter microservices based on user actions (in the range of trillions of events per day), or from processing jobs that process those log events. The Data Lifecycle Team at Twitter manages large-scale data ingestion and the replication of data across Twitter data centers and the public cloud.
Praveen Killamsetti & Zhenzhao Wang
Beam in Production: Working with Dataflow Flex Templates and Cloud Build
In this session, we will look into how to get a Beam pipeline successfully deployed via a CI/CD framework (Google's Cloud Build) by utilising Dataflow Flex Templates, and discuss lessons learned along the way.
Ragy Abraham
Detecting Change-Points in Real-Time with Apache Beam
Apache Beam provides an expressive and powerful toolset for transformation over bounded and unbounded (streaming) data. One common but not obvious transformation that Oden has needed to implement is Change-Point Detection. Change-Point Detection is at the heart of many of Oden’s real-time features for its manufacturing clients. For example, many clients need to track when their factory’s production lines stop running in order to root-cause issues and improve capacity. Oden continuously streams sampled real-world properties of the machines used in those lines, such as motor-speed, and uses change-detection to differentiate periods where the line was “stopped,” “ramping up,” or “running.”
Devon Peticolas
Relational Beam: Process columns, not rows!
Last year we kicked off relational work with a vision of automatically optimizing your pipeline. Now we have a panel of contributors who are working towards this goal! We will demo the new optimizer in Java core, showing how we can automatically prune columns from IOs. Then we will discuss our upcoming work to make vectorized execution a reality through native columnar support in Python and Java. Finally we will discuss usage best practices around using Schemas, Dataframes, and SQL to ensure you can benefit from these changes.
Andrew Pilloud & Brian Hulette
Streaming NLP infrastructure on Dataflow
Trustpilot is an e-commerce reviews platform delivering millions of new reviews to businesses each week. We are using Apache Beam on GCP Dataflow to deliver real-time streaming inferences with the latest NLP transformer models. Our talk will touch on: infrastructure setup to enable Python Beam to interface with Kafka for streaming data; taking advantage of Beam's unified programming model to enable batch jobs for backfilling via BigQuery; working with GPUs on Dataflow to speed up local model inference; and MLOps, using Dataflow as part of a continuous-evaluation model monitoring setup.
Alex Chan & Angus Neilson
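For reference, a minimal sketch of the Kafka-to-Python-Beam piece of such a setup, using Beam's cross-language ReadFromKafka transform; the broker address and topic name are assumptions, and the downstream model-scoring steps are omitted.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    texts = (
        p
        # Cross-language transform: the runner also needs a Java expansion service.
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka:9092"},  # assumed broker
            topics=["reviews"],                                   # assumed topic
        )
        # ReadFromKafka yields (key, value) byte pairs.
        | beam.MapTuple(lambda key, value: value.decode("utf-8"))
    )
    # Downstream steps would run the NLP model and write the results out.
```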
Error handling with Apache Beam and Asgarde library
I created a library for error handling with Apache Beam Java and Kotlin. Asgarde allows error handling with less code and more concise/expressive code. The purpose is to show Beam's native error handling, and then the same logic with Asgarde Java. I will also show Asgarde Kotlin, with even more concise code and a more functional style. https://github.com/tosun-si/asgarde/ The example with Asgarde will store the failed elements in a BigQuery dead-letter table (DLQ). We use Asgarde in production code at my current customer (L’Oréal, France).
Mazlum Tosun
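Asgarde itself is a Java/Kotlin library, but the underlying idea, routing failing elements to a dead-letter output instead of failing the bundle, can be sketched in the Python SDK with tagged outputs. A JSON parse stands in for the real business logic, and the sample elements are invented.

```python
import json
import apache_beam as beam

class ParseEvent(beam.DoFn):
    OK, DEAD_LETTER = "ok", "dead_letter"

    def process(self, raw):
        try:
            yield beam.pvalue.TaggedOutput(self.OK, json.loads(raw))
        except Exception as exc:
            # Failing elements go to a side output (typically written to a
            # BigQuery dead-letter table) instead of failing the whole bundle.
            yield beam.pvalue.TaggedOutput(
                self.DEAD_LETTER, {"raw": raw, "error": str(exc)})

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create(['{"id": 1}', "not-json"])
        | beam.ParDo(ParseEvent()).with_outputs(ParseEvent.OK, ParseEvent.DEAD_LETTER)
    )
    good, bad = results[ParseEvent.OK], results[ParseEvent.DEAD_LETTER]
```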
Writing a native Go streaming pipeline
Over the past year, the Beam Go SDK has rolled out several features to support native streaming DoFns. This talk will dig into those features and discuss how they can be used to build streaming pipelines written entirely in Go. Attendees will come away with an understanding of some of the challenges associated with processing unbounded datasets. They will also learn how they can build their own custom streaming splittable DoFns to address those challenges.
Danny McCormick & Jack McCluskey
Use of shared handles for cache reuse across DoFns in Python
This session will talk about how we use shared handles to enrich our events with metadata.
Amruta Deshmukh
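A minimal sketch of the pattern, based on apache_beam.utils.shared; the metadata loader, lookup keys, and element shape are invented for illustration.

```python
import apache_beam as beam
from apache_beam.utils import shared

class _Metadata:
    """Wrapper class so the cached dict can be weakly referenced by Shared."""
    def __init__(self, data):
        self.data = data

def load_metadata():
    # Hypothetical loader, e.g. reading a lookup table from a file or service.
    return _Metadata({"device-1": "eu-west", "device-2": "us-east"})

class EnrichFn(beam.DoFn):
    def __init__(self, shared_handle):
        self._shared_handle = shared_handle

    def setup(self):
        # acquire() loads the metadata once per worker process and hands the
        # same cached object to every DoFn instance sharing the handle.
        self._metadata = self._shared_handle.acquire(load_metadata)

    def process(self, event):
        yield {**event, "region": self._metadata.data.get(event["device_id"])}

with beam.Pipeline() as p:
    handle = shared.Shared()
    (
        p
        | beam.Create([{"device_id": "device-1"}, {"device_id": "device-2"}])
        | beam.ParDo(EnrichFn(handle))
    )
```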
Developing PulsarIO Connector
Overview of how PulsarIO Connector was implemented, how it works, and how to use it.
Marco Robles
Playing the Long Game - Transforming Ricardo's Data Infrastructure with Apache Beam
The Data Intelligence Team at Ricardo, Switzerland’s largest online marketplace, is a long-time Apache Beam user. Starting on-premise, we transitioned from running our own Apache Flink cluster on Kubernetes with over 40 streaming pipelines written in Apache Beam’s Java SDK to Cloud Dataflow. This talk describes the obstacles and discusses our learnings along the way.
Tobias Kaymak
Beam data pipelines on microservice architectures
Wayfair is the world's largest online destination for all things home, including furniture, household items, fixtures, appliances, etc. Unparalleled selection and high-quality imagery are key to providing a rich and unique user experience. Photo studios are expensive to operate and require significant time to produce an image. 3D modeling and imagery is one of the main focus areas of investment, and the in-house applications were redesigned and developed using domain-driven design patterns of software engineering.
Pragalbh Srivastava
Apache Beam backend for open source Scalding
The batch processing data pipelines at Twitter process hundreds of petabytes of data on a daily basis. These pipelines run on on-premise Hadoop clusters and use Scalding as the framework for data processing. We are currently in the midst of migrating our batch processing pipelines to the cloud. With approximately 5000 production pipelines, it becomes necessary to automate the migration of these pipelines to the cloud. Scalding is an abstract API for expressing computations on data, and allows backend implementations to be plugged in.
Navin Viswanath
Scaling up pandas with the Beam DataFrame API
Beam’s Python SDK is incredibly powerful due to its high scalability and advanced streaming capabilities, but its unfamiliar API has always been a barrier to adoption. Conversely, the popular Python pandas library has seen explosive growth in recent years due to its ease of use and tight integration with interactive notebook environments, but it can only process data with a single node - it cannot be used to process distributed datasets in parallel.
Brian Hulette
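A minimal sketch of the API this talk covers; the file paths and column name are placeholders.

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # A deferred Beam DataFrame backed by a distributed PCollection.
    df = p | read_csv("gs://my-bucket/input/*.csv")      # assumed input path
    # Familiar pandas operations, executed in parallel by the Beam runner.
    totals = df.groupby("category").sum()                # assumed column name
    totals.to_csv("gs://my-bucket/output/totals.csv")    # assumed output path
```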
Online clustering and semantic enrichment of textual data with Apache Beam
In this talk, we will be looking at some Apache Beam design patterns and best practices that you can use to successfully build your own Machine Learning workflows. We will be looking at examples of applying Apache Beam to solve real-world problems, such as real-time semantic enrichment and online clustering of text content. We will also discover how to use Beam efficiently as an orchestrator for Machine Learning services and how to design your entire system with scalability in mind.
Konstantin Buschmeier
Improving Beam-Dataflow Pipelines for Text Data Processing
In this session, we’ll share a few recipes to improve Beam-Dataflow pipelines when dealing with sequence data. These methods came from our experience of processing and preparing large datasets for ML use at Carted. We’ll provide a step-by-step framework of how to analyze the issues that can start surfacing when processing text data at scale and will share our approaches to dealing with them. We hope you’ll apply these recipes to your own Beam-Dataflow pipelines to improve their performance.
Sayak Paul & Nilabhra Roy Chowdhury
Beam Playground: discover, learn and prototype with Apache Beam
The talk presents Apache Beam discovery and learning through Beam Playground, as well as proof-of-concept and contribution paths, to those who are looking to learn, use, and develop Apache Beam. The planned structure of the talk: the discovery and learning experience. I would like to go through Beam Playground to show how to use it and how Apache Beam code can be executed without the need to install a Beam SDK.
Daria Malkova
GCP Beam Common Customer Issues
An overview of common customer issues seen when Apache Beam is used on the Cloud Dataflow Runner.
Svetak Sundhar
The Ray Beam Runner Project: A Vision for Unified Batch, Streaming, and ML
The Ray Beam Runner Project is an initiative to create a new Pythonic Beam Runner based on the Ray distributed compute framework. The project was conceived based on strong community interest in integrating Ray with Beam, and a prototype quickly proved its viability. We will discuss the current state of this initiative, and our long-term vision to provide a unified authoring and execution environment for mixed-purpose batch, streaming, and ML pipelines.
Patrick Ames, Jiajun Yao & Chandan Prasad
Scio in-depth workshop
This workshop encompasses several talks and hands-on exercises around Scio, the open-source Scala API for Apache Beam.
Michel Davit, Israel Herraiz, Claire McGinty, Kellen Dye & Annica Ivert
Splittable DoFns in Python: a hands-on workshop
We will review the concept of Splittable DoFns and write two I/O connectors using this kind of DoFn: one for batch (reading large files in a given format) and one for streaming (reading from Kafka topics).
Israel Herraiz & Miren Esnaola
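As a taste of what the workshop builds, here is a hedged sketch of a batch splittable DoFn over a file of fixed-size records; the element shape and the read_record() helper are hypothetical stand-ins for format-specific logic.

```python
import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider

def read_record(path, index):
    # Hypothetical record reader; replace with format-specific logic.
    return f"{path}:{index}"

class RecordRangeProvider(RestrictionProvider):
    """Restriction = the range of record indices still to be read."""

    def initial_restriction(self, file_info):
        return OffsetRange(0, file_info["num_records"])

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, file_info, restriction):
        return restriction.size()

class ReadRecords(beam.DoFn):
    def process(
        self,
        file_info,
        tracker=beam.DoFn.RestrictionParam(RecordRangeProvider())):
        restriction = tracker.current_restriction()
        for index in range(restriction.start, restriction.stop):
            if not tracker.try_claim(index):
                return  # the runner split the work; another instance continues
            yield read_record(file_info["path"], index)
```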
Real-time liveness status of industrial sensors with Apache Beam on the Dataflow runner and Yugabyte
When there is a lot of telemetry coming in from thousands of devices, how do you identify, in real time, when a particular sensor has failed? Architecture and demonstration using Apache Beam on Dataflow, Yugabyte, and Grafana to identify failed feeds from industrial sensors in real time. The demo scenario will have 200 sensors across 5 factories, each factory with 6 sections and 6 different types of sensors. It will illustrate not only failures, but also how the sensor status changes back to active when the sensor is put back into service.
Kamaljeet Singh
Supporting ACID transactions in a NoSQL database with Apache Beam
In this session we will see how we can use Apache Beam to enhance a generic, eventually consistent NoSQL database (e.g. Apache Cassandra) with ACID transactions. We will see how we can use gRPC with a splittable DoFn to create an RPC streaming source of requests into an Apache Beam pipeline, and then how we can process these requests inside the pipeline, ensuring the requests are applied atomically, consistently, in isolation, and durably despite the data being backed by an eventually consistent data store.
Jan Lukavský