Building Real-time Data Products at LinkedIn with Apache Samza

•

33 likes•2,034 views

The world is going real-time. MapReduce, SQL-on-Hadoop and similar batch processing tools are fine for analyzing and processing data after the fact — but sometimes you need to process data continuously as it comes in, and react to it within a few seconds or less. How do you do that at Hadoop scale? Apache Samza is an open source stream processing framework designed to solve these kinds of problems. It is built upon YARN/Hadoop 2.0 and Apache Kafka. You can think of Samza as a real-time, continuously running version of MapReduce. Samza has some unique features that make it powerful. It provides high performance for stateful processing jobs, including aggregation and joins between many input streams. It is designed to support an ecosystem of many different jobs written by different teams, and it isolates them from each other, so that one badly behaved job can’t affect the others.

Data & Analytics

{"
"eventType: " "PageViewEvent,"
"3mestamp: " "1413215518,"
"viewerId:" " "1234,"
"sessionId: " "fa1afe101234deadbeef,"
"pageKey:" " "proﬁleFview,"
"viewedProﬁleId:"4321,"
"trackingKey:" "invita3onFemail,"
"..."etc."metadata"about"what"content"was"displayed..."
}"

{"
"eventType: " "ProﬁleEditEvent,"
"3mestamp: " "1413215518,"
"proﬁleId:" " "1234,"
"old:"{"
" "loca3on:" ""San"Francisco,"CA","
" "industry:" ""Internet"},"
"new:"{"
" "loca3on:" ""New"York,"NY","
" "industry:" ""Financial"Services"}"
}"

References (fun stuff to read)!
1.  Martin Kleppmann:“Designing data-intensive applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net!
2.  Jay Kreps:“Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-
state-is-a-fundamental-primitive-in-stream-processing.html!
3.  Jay Kreps:“I ︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do!
4.  Nathan Marz and James Warren:“Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear
January 2015. http://manning.com/marz/!
5.  Jakob Homan:“Real time insights into LinkedIn's performance using Apache Samza.” 18 Aug 2014. http://engineering.linkedin.com/samza/
real-time-insights-linkedins-performance-using-apache-samza!
6.  Martin Kleppmann:“Moving faster with data streams:The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream-
processing/moving-faster-data-streams-rise-samza-linkedin!
7.  Praveen Neppalli Naga:“Real-time Analytics at Massive Scale with Pinot.” 29 Sept 2014. http://engineering.linkedin.com/analytics/real-
time-analytics-massive-scale-pinot!
8.  David He:“Monitor and Improve Web Performance Using RUM DataVisualization.” 19 Sept 2014. http://engineering.linkedin.com/
performance/monitor-and-improve-web-performance-using-rum-data-visualization!
9.  Lili Wu, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse:“The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop
on Recommender Systems and the Social Web, Oct 2014. http://ls13-www.cs.uni-dortmund.de/homepage/rsweb2014/papers/
rsweb2014_submission_3.pdf!
10.  Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.:“All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC),
October 2012. http://www.socc2012.org/s18-das.pdf!
11.  Apache Samza documentation. http://samza.incubator.apache.org!

Viewers also liked

A Random Walk Through Search ResearchNick Watkins

LinkedIn Data ProductsVitaly Gordon

Developing Data ProductsPeter Skomoroch

Streaming SQLJulian Hyde

Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner

Building Automated REST APIs with PythonJeff Knupp

Introduction to Streaming AnalyticsGuido Schmutz

Experience economyTrieu Nguyen

ML and Data Science at Uber - GITPro talk 2017Sudhir Tonse

Recomendation system: Community Detection Based Recomendation System using Hy...Rajul Kukreja

Customer Event Hub - the modern Customer 360° viewGuido Schmutz

Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Impetus Technologies

Graph database super starandres_taylor

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson

General Tips for participating Kaggle CompetitionsMark Peng

Optimizing Apache Spark SQL JoinsDatabricks

Stream Computing & Analytics at UberSudhir Tonse

Cassandra Data Modeling - Practical Considerations @ Netflixnkorla1share

Viewers also liked (18)

A Random Walk Through Search Research

LinkedIn Data Products

Developing Data Products

Streaming SQL

Cassandra's Sweet Spot - an introduction to Apache Cassandra

Building Automated REST APIs with Python

Introduction to Streaming Analytics

Experience economy

ML and Data Science at Uber - GITPro talk 2017

Recomendation system: Community Detection Based Recomendation System using Hy...

Customer Event Hub - the modern Customer 360° view

Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...

Graph database super star

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

General Tips for participating Kaggle Competitions

Optimizing Apache Spark SQL Joins

Stream Computing & Analytics at Uber

Cassandra Data Modeling - Practical Considerations @ Netflix

Similar to Building Real-time Data Products at LinkedIn with Apache Samza

Introduction to CQRS and DDDDVladik Khononov

MixpanelNexThoughts Technologies

Aligning Web Services with the Semantic Web to Create a Global Read-Write Gra...Markus Lanthaler

MongoDB Performance TuningPuneet Behl

Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016)Dan Robinson

A model based approach for developing event-driven architectures with AsyncAPIabgolla

Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...confluent

SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"South Tyrol Free Software Conference

[OpenInfra Days Korea 2018] (Track 4) CloudEvents 소개 - 상호 운용 가능성을 극대화한 이벤트 데이...OpenStack Korea Community

Microservices in Java and Scala (sfscala)Chris Richardson

Implementing and Visualizing Clickstream data with MongoDBMongoDB

Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Lucidworks

Similar to Building Real-time Data Products at LinkedIn with Apache Samza (12)

Introduction to CQRS and DDDD

Mixpanel

Aligning Web Services with the Semantic Web to Create a Global Read-Write Gra...

MongoDB Performance Tuning

Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016)

A model based approach for developing event-driven architectures with AsyncAPI

Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...

SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"

[OpenInfra Days Korea 2018] (Track 4) CloudEvents 소개 - 상호 운용 가능성을 극대화한 이벤트 데이...

Microservices in Java and Scala (sfscala)

Implementing and Visualizing Clickstream data with MongoDB

Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...

Recently uploaded

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml

Digital Marketing Plan, how digital marketing worksdeepakthakur548787

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics

Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy

Real-Time AI Streaming - AI Max PrincetonTimothy Spann

Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics

Learn How Data Science Changes Our WorldEduminds Learning

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell

Cyber awareness ppt on the recorded dataTecnoIncentive

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone

modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx

Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics

Semantic Shed - Squashing and Squeezing.pptxMike Bennett

Principles and Practices of Data VisualizationKianJazayeri1

Insurance Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Recently uploaded (20)

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf

Digital Marketing Plan, how digital marketing works

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...

Student profile product demonstration on grades, ability, well-being and mind...

Real-Time AI Streaming - AI Max Princeton

Data Factory in Microsoft Fabric (MsBIP #82)

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...

Learn How Data Science Changes Our World

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx

Cyber awareness ppt on the recorded data

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024

modul pembelajaran robotic Workshop _ by Slidesgo.pptx

Decoding Patterns: Customer Churn Prediction Data Analysis Project

Unveiling the Role of Social Media Suspect Investigators in Preventing Online...

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT

Semantic Shed - Squashing and Squeezing.pptx

Principles and Practices of Data Visualization

Insurance Churn Prediction Data Analysis Project

Building Real-time Data Products at LinkedIn with Apache Samza

9. {" "eventType: " "PageViewEvent," "3mestamp: " "1413215518," "viewerId:" " "1234," "sessionId: " "fa1afe101234deadbeef," "pageKey:" " "proﬁleFview," "viewedProﬁleId:"4321," "trackingKey:" "invita3onFemail," "..."etc."metadata"about"what"content"was"displayed..." }"

10.

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

22. {" "eventType: " "ProﬁleEditEvent," "3mestamp: " "1413215518," "proﬁleId:" " "1234," "old:"{" " "loca3on:" ""San"Francisco,"CA"," " "industry:" ""Internet"}," "new:"{" " "loca3on:" ""New"York,"NY"," " "industry:" ""Financial"Services"}" }"

23.

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38.

39.

40.

41.

42.

43.

44.

45.

46.

47.

48. References (fun stuff to read)! 1.  Martin Kleppmann:“Designing data-intensive applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net! 2.  Jay Kreps:“Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local- state-is-a-fundamental-primitive-in-stream-processing.html! 3.  Jay Kreps:“I ︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do! 4.  Nathan Marz and James Warren:“Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/! 5.  Jakob Homan:“Real time insights into LinkedIn's performance using Apache Samza.” 18 Aug 2014. http://engineering.linkedin.com/samza/ real-time-insights-linkedins-performance-using-apache-samza! 6.  Martin Kleppmann:“Moving faster with data streams:The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream- processing/moving-faster-data-streams-rise-samza-linkedin! 7.  Praveen Neppalli Naga:“Real-time Analytics at Massive Scale with Pinot.” 29 Sept 2014. http://engineering.linkedin.com/analytics/real- time-analytics-massive-scale-pinot! 8.  David He:“Monitor and Improve Web Performance Using RUM DataVisualization.” 19 Sept 2014. http://engineering.linkedin.com/ performance/monitor-and-improve-web-performance-using-rum-data-visualization! 9.  Lili Wu, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse:“The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop on Recommender Systems and the Social Web, Oct 2014. http://ls13-www.cs.uni-dortmund.de/homepage/rsweb2014/papers/ rsweb2014_submission_3.pdf! 10.  Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.:“All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf! 11.  Apache Samza documentation. http://samza.incubator.apache.org!

Building Real-time Data Products at LinkedIn with Apache Samza

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (18)

Similar to Building Real-time Data Products at LinkedIn with Apache Samza

Similar to Building Real-time Data Products at LinkedIn with Apache Samza (12)

More from Trieu Nguyen

More from Trieu Nguyen (20)

Recently uploaded

Recently uploaded (20)

Building Real-time Data Products at LinkedIn with Apache Samza