Opening message for Beam Summit 2022.
|Beam Summit Team|
Kerry will speak about why Google cares about Apache Beam, and about the many internal teams at Google that use Beam’s Python and Go SDKs for ML workloads and log analytics.
In a recent book entitled Machine Learning Design Patterns, we captured best practices and solutions to recurring problems in machine learning. Many of these design patterns are best implemented using Beam. The obvious example is the Transform design pattern, which allows you to replicate arbitrary operations from the training graph in the serving graph while keeping both training and serving code efficient and maintainable. Indeed, the tf.transform package makes this easy.
At Spotify we run more than 18,000 data pipelines. To help our developers build and maintain them, we created Scio, a Scala API for Beam inspired by Apache Spark and Scalding. This session will show some of the advantages of Scio and how we combine it with Beam for performance and ease of development.
Evaluation, adoption effort, current state, and future of Apache Beam at Twitter.
Users of machine learning frameworks must currently implement their own PTransforms for predictions or inference. Only TensorFlow makes a RunInference Beam transform available, and it’s not highly accessible since it’s hosted in the TFX-BSL repo. We are creating implementations of RunInference for two popular machine learning frameworks, scikit-learn and PyTorch. These will take advantage of internal optimizations, such as shared model resources, as well as framework-specific optimizations such as vectorization. They will have a clean, simple, unified interface, and use types intuitive to developers for inputs and outputs (NumPy arrays for scikit-learn, Tensors for PyTorch).
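The core idea behind RunInference is "load the model once per worker, then run batched predictions." The following is a minimal, stdlib-only sketch of that design; the class and method names mirror the Beam handler concept but are illustrative, not the actual Beam API, and `FakeModel` stands in for a real scikit-learn or PyTorch model.

```python
class FakeModel:
    """Stand-in for a scikit-learn or PyTorch model (illustrative only)."""
    def predict(self, batch):
        return [x * 2 for x in batch]

class ModelHandler:
    def load_model(self):
        # Called once per worker, so a potentially large model is
        # shared across bundles instead of being reloaded per element.
        return FakeModel()

    def run_inference(self, batch, model):
        # One vectorized call over the whole batch; pair each input
        # with its prediction.
        return list(zip(batch, model.predict(batch)))

def run_inference_transform(elements, handler, batch_size=4):
    """Batch elements and push them through a single loaded model."""
    model = handler.load_model()
    results = []
    for i in range(0, len(elements), batch_size):
        results.extend(handler.run_inference(elements[i:i + batch_size], model))
    return results

predictions = run_inference_transform([1, 2, 3, 4, 5], ModelHandler())
print(predictions)  # [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
```

Batching is what lets the framework-specific implementations exploit vectorization, while the single `load_model` call per worker is what "shared model resources" buys you.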
At Credit Karma, we enable financial progress for more than 100 million of our members by recommending them personalized financial products when they interact with our application. In this talk we are introducing our machine learning platform that uses Apache Beam and Google Dataflow to build interactive and production MLOps pipelines to serve relevant financial products to Credit Karma users. Vega, Credit Karma’s Machine Learning Platform, uses Bigquery, Apache Beam, Distributed Tensorflow and Airflow for building MLOps pipelines.
|Debasish Das & Vishnu Venkataraman|
Based on the inspiring Apollo XI talk “Light Years Ahead”, this talk covers 6 design principles for data pipelines, with a focus on streaming pipelines. Submitting a job to a runner is not that different (well, a bit) from “submitting” a spacecraft into space: you lose direct access to the device, and can only send a couple of commands, read logs, and receive telemetry. The 6 principles presented in this talk will make your pipeline land on the moon rather than crash!
|Israel Herraiz & Paul Balm|
At the core of Adobe Experience Platform (AEP), there is a large Apache Kafka deployment: 20+ data centers, 300+ billion messages a day. We use it to import/export data for external customers and integrate internal solutions. Processing those events involves lots of boilerplate code and practices: understanding the streaming platform, optimizing for throughput, instrumenting metrics, deploying the service, alerting, and monitoring. Out of these requirements, we built a team to construct and be responsible for the infrastructure where Apache Beam jobs would run.
|Constantin Scacun & Alexander Falca|
Learn what’s happened, what’s happening, and what’s going to happen with the Apache Beam Go SDK. This session provides an overview of improvements to the Go SDK since the last Beam Summit, especially what’s been happening since the Go SDK left Experimental. The Go ecosystem, cross-language transforms, streaming, and more!
The team that builds the Dataflow runner likes to say it “just works”. But how does it really work? How does it decide to autoscale? What happens on failures? Where is state data stored? Can I have infinite state data? What goes into Dataflow’s latency? What’s the checkpoint logic? These and other questions to be answered in this session!
The Stream Processing Platform powers real-time data applications at Intuit using the Apache Beam framework. The platform makes it easy for an engineer to access, transform, and publish streaming data, with turn-key solutions to debug and deploy streaming applications on a managed platform in a manner that is compliant with and leverages Intuit’s central capabilities. Since launching 3 years ago, the platform has grown to over 150 pipelines in production and handled at peak ~17.
|Omkar Deshpande, Dunja Panic, Nick Hwang & Nagaraja Tantry|
In this session we will cover how I used Beam and BigQuery to find the best combination of words to play Wordle.
|Iñigo San Jose Visiers|
We all know that benchmarking is a very important but quite tough and ambiguous part of performance testing for any software system. Furthermore, benchmarking systems that support different language SDKs and distributed data processing runtimes, like Apache Beam, is even more tough and ambiguous. In this talk we will discover what benchmark frameworks Apache Beam already supports (Nexmark, TPC-DS), how they can be used for different purposes (like release testing) by developers and ordinary users, and what the plans are for future work.
At my previous customer, we did a code migration from Spark/Dataproc to Apache Beam/Dataflow. I proposed a hexagonal architecture plus domain-driven design with Apache Beam, in order to separate the business code (bounded context/domain) from the technical code (infrastructure). This architecture relies on code decoupling and dependency injection. I used Dagger2 and I am going to explain why :) The purpose is to show a Beam project with this architecture and explain why it’s interesting.
At BlueVoyant, we process billions of cyber events per second in an effort to secure the supply chains of several high-stakes Fortune 500 companies. Cyber attacks exploit the weakest links in a target company, and modern tech infrastructures are composed of more third-party dependencies than ever before. Our Third-Party Risk (3PR) product requires us to monitor, analyze, and pinpoint weaknesses throughout hundreds of thousands of third party entities, each consisting of tens to millions of assets that may be exploited.
|Alfredo Gimenez, Adam Najman, Tucker Leavitt & Tyler Flach|
Our data pipelines grew organically over time, one script piling up on top of another. We got to a point where they became too slow and too error-prone to change. Sounds familiar? Join me in this session where I walk through our journey of transforming our data processing from serial Python scripts to massively parallel Beam.
How we used Apache Beam on the Google Cloud Dataflow runner to build a robust data integration solution that onboards data from various source systems (RDBMS, REST APIs) to Google Cloud BigQuery.
Are your deployed Beam pipelines properly sized to meet your SLOs? Currently there’s no standard way of benchmarking Dataflow pipelines to measure expected output rate or utilization, which are required for capacity planning and cost optimization. This talk presents a methodology and a toolset for benchmarking the performance (event latency and throughput) of a custom Beam pipeline using Dataflow as the runner. The results are synthesized into prescriptive sizing guidelines to achieve the best performance/cost ratio.
This session discusses different strategies for caching data in Dataflow using the Beam SDK.
When you’re a company building out a successful SaaS platform, there comes a time when classic observability tools don’t cut it anymore: they don’t scale as you grow, or they become too expensive. That’s why the SRE team started to look at observability as a big data problem. Combining a specification for telemetry (OpenTelemetry) with big data tools like Apache Beam, they built the Telemetry Backbone, the streaming hub for all of Collibra’s telemetry data.
|Alex Van Boxel|
This session will give an overview of Cloud Spanner change streams, discuss the challenges of capturing change streams with the Apache Beam framework and Dataflow, and dive into the specific use case of streaming change records from Spanner into BigQuery using Beam.
|Haikuo Liu, Nancy Xu & Le Chang|
The need for a highly efficient data processing workflow is fast becoming a necessity in every organization implementing and deploying Machine Learning models at scale. In most cases, ML teams leverage the managed service solutions already in place from the cloud infrastructure provider they choose. While this approach is good enough for most teams to get going, the long-term cost of keeping the platform running may become prohibitively high over time.
|Charles Adetiloye & Alexander Lerma|
At Palo Alto Networks we rely heavily on Avro, using it as the primary storage format and Beam Row as the in-memory format. We de/serialize billions of Avro records per second. One day we realized that the Avro–Row conversion routines consume much of our CPU time. Then the story begins…
Kenn Knowles, the PMC chair for Apache Beam, will speak about the last year of developments in Beam, and the exciting things that are coming to Beam.
Talat will talk about how Beam was adopted at Palo Alto Networks, and describe how they deploy thousands of Beam pipelines to keep their customers secure.
Risk Management is a key function across all Financial Services. Often this entails heavy computer simulation exploring how financial markets could evolve. Counterparty Credit Risk requires particularly extreme compute capacity given the billions of daily calculations it performs. Increased regulatory demand (FRTB SA CVA, BCBS 239, etc.) has required further compute power. Traditionally, these simulations are broken up into small tasks and distributed in a compute grid. Map-reduce techniques are then used to aggregate the results and calculate statistical properties.
|Peter Coyle & Raj Subramani|
Like all the SDKs, the Apache Beam Go SDK has a direct runner for simple testing. However, unlike Python and Java, it has languished, not covering all parts of the model supported by the SDK. Worse, it doesn’t even use the FnAPI! This dive into the runner side of Beam will cover how I wrote my own testing-oriented replacement for the Go direct runner, and how it can become useful for Java, Python, and future SDKs.
This session will present a Beam use case in the telco industry. We will introduce the use case and context as well as the solution implementation including things such as configuration, autoscaling, writing to BigQuery, profiling and improving the code. In each point we will go over the thought process behind the decision and its effects on cost/performance.
|Jérémie Gomez & Thomas Sauvagnat|
In this workshop we’ll give a detailed overview of the Apache Hop platform. We’ll design a few simple Beam pipelines using the Hop GUI, after which we’ll run them on a few runners like GCP Dataflow and Spark. After that we’ll cover best practices for version control, unit testing, integration testing and much more. No prior knowledge of Apache Hop or Beam is required to follow this workshop.
Apache Beam’s concepts can sometimes be overwhelming, and sometimes it’s hard to see why something is designed the way it is. One of these concepts is the Combine function. This session takes a non-trivial data structure (the OpenTelemetry Exponential Histogram) as an example and goes through each step in the design of making it a super-efficient Combine function. Learn by example.
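What makes a Combine function efficient is its lifecycle: accumulators are created, fed inputs, merged across workers, and only then turned into a result. Below is a plain-Python sketch of that lifecycle for a simple mean (names follow Beam’s CombineFn concept but this is illustrative, not the Beam API, and the real talk’s histogram accumulator is far richer):

```python
class MeanCombineFn:
    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, acc, value):
        s, n = acc
        return (s + value, n + 1)

    def merge_accumulators(self, accs):
        # Merging partial accumulators is what lets a runner combine
        # results computed in parallel -- the key to efficiency.
        return (sum(a[0] for a in accs), sum(a[1] for a in accs))

    def extract_output(self, acc):
        s, n = acc
        return s / n if n else float("nan")

fn = MeanCombineFn()
# Simulate two workers, each combining its own shard of the data.
shard_accs = []
for shard in ([1, 2, 3], [4, 5]):
    acc = fn.create_accumulator()
    for v in shard:
        acc = fn.add_input(acc, v)
    shard_accs.append(acc)

print(fn.extract_output(fn.merge_accumulators(shard_accs)))  # 3.0
```

The design question the session walks through is exactly this: choosing an accumulator representation (here a sum/count pair, there an exponential histogram) whose merge is cheap and lossless.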
|Alex Van Boxel|
|Shangjin Zhang & Yuhong Cheng|
Data Analytics at Twitter relies on petabytes of data across data lakes and analytics databases. Data can come from log events generated by Twitter microservices based on user actions (in the range of trillions of events per day), or be generated by processing jobs that process those log events. The Data Lifecycle Team at Twitter manages large-scale data ingestion and replication of data across Twitter data centers and the public cloud.
|Praveen Killamsetti & Zhenzhao Wang|
In this session, we will look into how to get a Beam pipeline successfully deployed via a CI/CD framework (Google Cloud Build) by utilising Dataflow’s Flex Templates, and discuss lessons learned along the way.
Apache Beam provides an expressive and powerful toolset for transformation over bounded and unbounded (streaming) data. One common but not obvious transformation that Oden has needed to implement is Change-Point Detection. Change-Point Detection is at the heart of many of Oden’s real-time features for its manufacturing clients. For example, many clients need to track when their factory’s production lines stop running in order to root-cause issues and improve capacity. Oden continuously streams sampled real-world properties of the machines used in those lines, such as motor speed, and uses change detection to differentiate periods where the line was “stopped,” “ramping up,” or “running.”
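To make the stopped/ramping/running distinction concrete, here is a hypothetical threshold-based sketch in plain Python: label each motor-speed sample and emit a change point whenever the label flips. The thresholds and labels are made up for illustration; the abstract does not describe Oden’s actual detection method.

```python
def label_states(samples, stopped_below=5.0, running_above=50.0):
    """Label each speed sample; emit (index, state) when the state changes."""
    changes, prev = [], None
    for t, speed in enumerate(samples):
        if speed < stopped_below:
            state = "stopped"
        elif speed < running_above:
            state = "ramping up"
        else:
            state = "running"
        if state != prev:
            changes.append((t, state))  # a change point
            prev = state
    return changes

print(label_states([0, 1, 20, 40, 60, 70, 2]))
# [(0, 'stopped'), (2, 'ramping up'), (4, 'running'), (6, 'stopped')]
```

In a real streaming pipeline the comparison with the previous state would live in per-key Beam state rather than a local variable, which is part of what makes the problem "common but not obvious."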
Last year we kicked off relational work with a vision of automatically optimizing your pipeline. Now we have a panel of contributors who are working towards this goal! We will demo the new optimizer in Java core, showing how we can automatically prune columns from IOs. Then we will discuss our upcoming work to make vectorized execution a reality through native columnar support in Python and Java. Finally we will discuss usage best practices around using Schemas, Dataframes, and SQL to ensure you can benefit from these changes.
|Andrew Pilloud & Brian Hulette|
Trustpilot is an e-commerce reviews platform delivering millions of new reviews to businesses each week. We are using Apache Beam on GCP Dataflow to deliver real-time streaming inferences with the latest NLP transformer models. Our talk will touch on:
- Infrastructure setup to enable Python Beam to interface with Kafka for streaming data
- Taking advantage of Beam’s unified programming model to enable batch jobs for backfilling via BigQuery
- Working with GPUs on Dataflow to speed up local model inference
- MLOps: using Dataflow as part of a continuous evaluation model monitoring setup
|Alex Chan & Angus Neilson|
I created a library for error handling with Apache Beam Java and Kotlin. Asgarde allows error handling with less code that is more concise and expressive. The purpose is to show Beam-native error handling, and then the same logic with Asgarde Java. I will also show Asgarde Kotlin, with even more concise code and a more functional style. https://github.com/tosun-si/asgarde/ The example with Asgarde will store failed records in a BigQuery dead-letter table (DLQ). We use Asgarde in production code at my current customer (L’Oréal, France).
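The pattern Asgarde wraps up is the dead-letter queue: apply a function per element, catch failures, and route them to a side output instead of crashing the pipeline. A minimal plain-Python sketch of that pattern (illustrative only, not the Asgarde API, which is Java/Kotlin):

```python
def map_with_dlq(elements, fn):
    """Apply fn to each element; route failures to a dead-letter list."""
    good, dead_letters = [], []
    for e in elements:
        try:
            good.append(fn(e))
        except Exception as err:
            # In Beam this side output would be written to a BigQuery
            # dead-letter table; here we just collect the failure info.
            dead_letters.append({"element": e, "error": str(err)})
    return good, dead_letters

good, dlq = map_with_dlq(["1", "2", "oops", "4"], int)
print(good)               # [1, 2, 4]
print(dlq[0]["element"])  # oops
```

Hand-writing this try/except plumbing around every DoFn is exactly the boilerplate the library eliminates.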
Over the past year, the Beam Go SDK has rolled out several features to support native streaming DoFns. This talk will dig into those features and discuss how they can be used to build streaming pipelines written entirely in Go. Attendees will come away with an understanding of some of the challenges associated with processing unbounded datasets. They will also learn how they can build their own custom streaming splittable DoFns to address those challenges.
|Danny McCormick & Jack McCluskey|
Overview of how PulsarIO Connector was implemented, how it works, and how to use it.
The Data Intelligence Team at Ricardo, Switzerland’s largest online marketplace, is a long-time Apache Beam user. Starting on-premise, we transitioned from running our own Apache Flink cluster on Kubernetes with over 40 streaming pipelines written in Apache Beam’s Java SDK to Cloud Dataflow. This talk describes the obstacles and discusses our learnings along the way.
Wayfair is the world’s largest online destination for all things home, including furniture, household items, fixtures, appliances, etc. Unparalleled selection and high-quality imagery are key to providing a rich and unique user experience. Photo studios are expensive to operate and require significant time to produce an image. 3D modeling and imagery is one of the main focus areas of investment, and the in-house applications were redesigned and developed using domain-driven design patterns of software engineering.
The batch processing data pipelines at Twitter process hundreds of petabytes of data on a daily basis. These pipelines run on on-premise Hadoop clusters and use Scalding as the framework for data processing. We are currently in the midst of migrating our batch processing pipelines to the cloud. With approximately 5,000 production pipelines, it becomes necessary to automate their migration to the cloud. Scalding is an abstract API for expressing computations on data, and allows backend implementations to be plugged in.
Beam’s Python SDK is incredibly powerful due to its high scalability and advanced streaming capabilities, but its unfamiliar API has always been a barrier to adoption. Conversely, the popular Python pandas library has seen explosive growth in recent years due to its ease of use and tight integration with interactive notebook environments, but it can only process data on a single node - it cannot be used to process distributed datasets in parallel.
In this talk, we will be looking at some Apache Beam design patterns and best practices that you can use to successfully build your own Machine Learning workflows. We will be looking at examples of applying Apache Beam to solve real-world problems, such as real-time semantic enrichment and online clustering of text content. We will also discover how to use Beam efficiently as an orchestrator for Machine Learning services and how to design your entire system with scalability in mind.
In this session, we’ll share a few recipes to improve Beam-Dataflow pipelines when dealing with sequence data. These methods came from our experience of processing and preparing large datasets for ML use at Carted. We’ll provide a step-by-step framework of how to analyze the issues that can start surfacing when processing text data at scale and will share our approaches to dealing with them. We hope you’ll apply these recipes to your own Beam-Dataflow pipelines to improve their performance.
|Sayak Paul & Nilabhra Roy Chowdhury|
The talk will present Apache Beam discovery and learning through the Beam Playground, a proof of concept, and contribution paths for those who are looking to learn, use, and develop Apache Beam. The planned structure of the talk: discovery and learning experience. I will go through the Apache Beam Playground to show how to use it and how Apache Beam code can be executed without the need to install the Beam SDK.
An overview of common customer issues seen when Apache Beam is used on the Cloud Dataflow Runner.
The Ray Beam Runner Project is an initiative to create a new Pythonic Beam Runner based on the Ray distributed compute framework. The project was conceived based on strong community interest in integrating Ray with Beam, and a prototype quickly proved its viability. We will discuss the current state of this initiative, and our long-term vision to provide a unified authoring and execution environment for mixed-purpose batch, streaming, and ML pipelines.
|Patrick Ames, Jiajun Yao & Chandan Prasad|
This workshop encompasses several talks and a workshop around Scio, which is the open source Scala API for Apache Beam.
|Michel Davit, Israel Herraiz, Claire McGinty, Kellen Dye & Annica Ivert|
We will review the concept of Splittable DoFns and we will write two I/O connectors using this kind of DoFns: one in batch (for reading large files in a given format), and one for streaming (for reading from Kafka topics).
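The mechanism that makes a DoFn splittable is the claim loop over a restriction: the worker claims positions one by one, and a runner may split off the unclaimed remainder and hand it to another worker. Below is a plain-Python sketch of an offset-range tracker illustrating that idea (names echo Beam’s RestrictionTracker concept but this is not the Beam API):

```python
class OffsetRangeTracker:
    """Tracks a [start, stop) restriction of record offsets."""

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.claimed = start - 1

    def try_claim(self, position):
        # A False return tells the DoFn to stop: the position is
        # outside the (possibly shrunken) restriction.
        if position >= self.stop:
            return False
        self.claimed = position
        return True

    def try_split(self):
        """Split off the unclaimed remainder as a new restriction."""
        split_at = self.claimed + 1
        if split_at >= self.stop:
            return None  # nothing left to give away
        residual = (split_at, self.stop)
        self.stop = split_at  # primary restriction now ends early
        return residual

tracker = OffsetRangeTracker(0, 6)
tracker.try_claim(0)
tracker.try_claim(1)            # two records claimed so far
residual = tracker.try_split()  # runner steals the unclaimed rest
print(residual)      # (2, 6): a new restriction for another worker
print(tracker.stop)  # 2
```

For the batch connector the restriction is a byte or record range in a file; for the streaming Kafka connector it is effectively unbounded, which is where the split-and-resume behavior becomes essential.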
|Israel Herraiz & Miren Esnaola|
When there is a lot of telemetry coming in from thousands of devices, how do you identify, in real time, when a particular sensor has failed? An architecture and demonstration using Apache Beam on Dataflow, Yugabyte, and Grafana to identify failed feeds from industrial sensors in real time. The demo scenario will have 200 sensors across 5 factories, each factory with 6 sections and 6 different types of sensors. It will illustrate not only failures but also how a sensor’s status changes back to active when it is put back into service.
In this session we will see how we can use Apache Beam to enhance a generic, eventually consistent NoSQL database (e.g. Apache Cassandra) with ACID transactions. We will see how we can use gRPC with a splittable DoFn to create an RPC streaming source of requests into an Apache Beam pipeline, and then how we can process these requests inside the pipeline, ensuring the requests are applied atomically, consistently, in isolation, and durably despite the data being backed by an eventually consistent data store.