Log ingestion and data replication at Twitter

Data analytics at Twitter relies on petabytes of data spread across data lakes and analytics databases. This data comes either from log events generated by Twitter microservices in response to user actions (on the order of trillions of events per day) or from processing jobs that consume those log events. The Data Lifecycle Team at Twitter manages large-scale data ingestion and replication across Twitter data centers and the public cloud. Delivering data, in either streaming or batch fashion, to data lakes (HDFS, GCS) and data warehouses (Google BigQuery) reliably, scalably, and with the lowest possible latency is a complex problem. In this talk, we explain our log ingestion and data replication architectures across storage systems, and how we use Apache Beam-based ingestion and replication pipelines for both batch and streaming use cases to achieve this goal.

Praveen Killamsetti

Zhenzhao Wang
