Enabling Real-time Analytics Applications @ LinkedIn’s Scale
Mayank Shrivastava Jackie Jiang
Senior Software Engineer
Seunghyun Lee
Senior Software Engineer
Staff Software Engineer
Apache Pinot
Agenda
1. Introduction
2. Pinot @ LinkedIn
3. How to use Pinot
4. Pinot Performance
How is data generated and used at LinkedIn
Actor, Verb, Object: Member, Job, Post, Company
Life Cycle: Create, Generate, Analyze (Product, Data, Insights)
• 600+ million members
• Tens of millions of posts, likes, and shares per day
• 3+ million jobs posted per month
• 30 million companies
• Trillions of events per day
Real-time Analytics Applications at LinkedIn
How to build an online analytics application?
• Real-time data ingestion
• Millions of active users, thousands of queries per second
• Very low latency (tens of milliseconds)
• Highly available, always on
Approach 1. Join on the fly
Flow: Event Stream (Profile View) → Profile View Table; the Application Server joins the Profile View Table with the Member Table at query time to answer "Who viewed my profile"
• Real-time (depending on storage)
• High latency due to join
Approach 2. Pre Join + Pre Aggregate
Flow: Event Stream (Profile View) + Member Table → Stream Processing Engine (Pre Join + Pre Aggr) → Profile View Table → Application Server ("Who viewed my profile")
• Near real-time ingestion
• Latency varies with query selectivity
Approach 3. Pre Join + Pre Aggregate + Pre Cube
Flow: Event Stream (Profile View) + Member Table → Batch Processing Engine (Pre Join + Pre Aggr + Pre Cube) → Profile View Table → Application Server ("Who viewed my profile")
• Very fast
• Batch ingestion (hourly / daily)
• Storage explosion
• Re-bootstrap on schema change
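To make the "storage explosion" point concrete, here is a minimal sketch that compares pre-aggregation (Approach 2) with pre-cubing (Approach 3). The event rows and the dimension names (`industry`, `region`, `title`) are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pre-joined profile-view events: (industry, region, title, views).
EVENTS = [
    ("software", "us", "engineer", 1),
    ("software", "us", "manager", 1),
    ("finance", "ca", "analyst", 1),
    ("software", "ca", "engineer", 1),
    ("finance", "us", "analyst", 1),
]
DIMS = ("industry", "region", "title")


def pre_aggregate(events):
    """Approach 2: one row per distinct full dimension combination."""
    agg = defaultdict(int)
    for *dims, views in events:
        agg[tuple(dims)] += views
    return dict(agg)


def pre_cube(events):
    """Approach 3: one row per value combination of every subset of dimensions."""
    cube = defaultdict(int)
    n = len(DIMS)
    for *dims, views in events:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # "*" marks a dimension that has been aggregated away.
                key = tuple(dims[i] if i in kept else "*" for i in range(n))
                cube[key] += views
    return dict(cube)


aggregated = pre_aggregate(EVENTS)
cube = pre_cube(EVENTS)
print(len(aggregated), "pre-aggregated rows vs.", len(cube), "cube rows")
```

Every query becomes a single key lookup in `cube`, but the number of stored rows grows with the number of dimension subsets, which is the trade-off the slide calls out.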
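Latency vs. Flexibility
[Figure: systems placed along a spectrum from raw tables (Profile View Table, Member Table) through Pre-Join and Pre-Aggregation to Pre-Cube. Axes: Latency (low/high) and Flexibility (low/high). Systems shown: Spark SQL, Presto, Hive, BigQuery, Druid, Elasticsearch, Pinot, Kylin, KV Store.]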
Who Viewed My Profile @ LinkedIn
[Figure: components shown are Raw Tracking Data, Espresso, Stream Processing (Pre Join + Pre Aggr), Pre-joined Data, Data Lake, and Pinot, serving WVMP, Dashboard, and Ad-hoc Queries.]
What is Apache Pinot?
• OLAP Datastore
• Columnar, indexed storage
• Low latency analytics
• Distributed – highly available, reliable, scalable
• Lambda architecture
○ Offline data pushes + Real-time stream ingestion
• Open Source
Pinot @ LinkedIn
• 70+ member-facing use cases
• 2000+ dashboards for internal business metrics
• 100K+ queries per second
• 1M+ records ingested per second
Pinot @ LinkedIn: Member Facing Analytics Report
• Providing analytics reports for LinkedIn member-facing applications
• Very high QPS (thousands)
• Requires strict latency SLA (tens of milliseconds to sub-second)
Pinot @ LinkedIn: Interactive Dashboard
• Visualization tool for multi-dimensional metrics
• Complex, explorative queries
• 2000+ metrics, used by 1000+ employees
Pinot @ LinkedIn: Anomaly Detection
• Efficiently detect and investigate anomalies in metrics
• ThirdEye: part of the Apache Pinot open source project
Pinot Usage @ Other Companies
How to use Pinot
Batch Data Ingestion
Real-time Data Ingestion
SQL-like Query Interface (PQL)
Let’s build something cool
Event RSVP Data
How to use Pinot: Workflow
One Time Setup: Define Schema → Define Table Configuration → Create Table
Data Ingestion:
• Batch (Scheduled Job): Raw Data (HDFS, S3, ADLS, NFS...) → Generate Pinot Segments → Push Data
• Real-time (One Time Setup): Streaming Data (Kafka, Event Hub...) → Setup Stream Data Source
How to use Pinot: Define Schema
● Schema name: meetupRsvp
● Dimension field specs
○ event_name (string)
○ event_time (long)
○ country (string)
○ city (string)
○ …
● Metrics field specs
○ rsvp_count (int)
● Time field spec
○ timestamp (long)
■ timetype: epoch / datetime
■ granularity: millisecond / second / hour / day
• Dimension: an attribute of your data (filter, group by)
• Metric: a number that is used to measure characteristics of a dimension (aggregation)
• Time: a timestamp of an event (partitioning, retention management)
Example Query - Top 10 events in US
SELECT event_name, sum(rsvp_count)
FROM meetupRsvp
WHERE country = 'us'
GROUP BY event_name
TOP 10
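As a rough illustration of what the example query computes, the sketch below evaluates the same filter, group-by, and top-N over a few in-memory rows. The rows are made up, and this is plain Python rather than anything Pinot executes; only the column names come from the schema on the slide.

```python
from collections import Counter

# Hypothetical rows shaped like the meetupRsvp schema on the slide.
ROWS = [
    {"event_name": "pinot-meetup", "country": "us", "rsvp_count": 3},
    {"event_name": "kafka-summit", "country": "us", "rsvp_count": 5},
    {"event_name": "pinot-meetup", "country": "us", "rsvp_count": 4},
    {"event_name": "data-day", "country": "ca", "rsvp_count": 9},
]


def top_events(rows, country, limit=10):
    """Sum rsvp_count per event_name for one country, largest first."""
    totals = Counter()
    for row in rows:
        if row["country"] == country:                        # WHERE country = 'us'
            totals[row["event_name"]] += row["rsvp_count"]   # sum(...) GROUP BY event_name
    return totals.most_common(limit)                         # TOP 10


result = top_events(ROWS, "us")
print(result)
```

Here `country` and `event_name` act as dimensions (filter and group-by), while `rsvp_count` is the metric being aggregated.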
How to use Pinot: Configure and Create Table
Pinot Schema
Table Config
● Table name: meetupRsvp
● Table type: batch / realtime / hybrid
● Replication factor: 2
● Index Columns: ...
● Bloom filters: ...
● Retention: 30 days
● ...
Pinot
Admin Client
How to use Pinot: Batch Ingestion
1. Raw Data (Json, CSV, Avro, Parquet, ORC...) is stored on HDFS, S3, ADLS, NFS...
2. Segment Generation Job (library) reads the raw data together with the Pinot Schema and Table Config, and writes Pinot Segments to HDFS, S3, ADLS, NFS...
3. Segment Push Job (library) pushes the generated Pinot Segments to Pinot.
How to use Pinot: Segment Assignment
Segment Push Job → Controller (Helix, ZooKeeper) → Server-0, Server-1, Server-2
Segment Store (HDFS, S3, ADLS, NFS...): S0, S1, S2
• Assignment strategies
○ Uniform
○ Replica Group
○ Partition Aware
Resulting assignment:
● S0: Server-0, Server-1
● S1: Server-1, Server-2
● S2: Server-0, Server-2
The Segment Push Job sends the Controller:
1. Table name
2. Segment name
3. Segment URI path
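The assignment shown on this slide can be reproduced with a simple round-robin placement. This is a sketch under the assumption that "Uniform" spreads replicas evenly across servers; it is not Pinot's actual assignment code, and the function name is hypothetical.

```python
def assign_uniform(segments, servers, replication):
    """Place each segment on `replication` consecutive servers, round-robin."""
    assignment = {}
    for i, segment in enumerate(segments):
        replicas = [servers[(i + r) % len(servers)] for r in range(replication)]
        assignment[segment] = sorted(replicas)
    return assignment


assignment = assign_uniform(
    ["S0", "S1", "S2"],
    ["Server-0", "Server-1", "Server-2"],
    replication=2,
)
for segment, replicas in assignment.items():
    print(segment, replicas)
```

With three segments, three servers, and two replicas each, every server ends up holding exactly two segments, which matches the layout in the diagram.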
How to use Pinot: Query Routing
Queries → Broker → Server-0, Server-1, Server-2
Segment Push Job → Controller (Helix); Segment Store (HDFS, S3, ADLS, NFS...): S0, S1, S2
Segments per server: Server-0 (S0, S2), Server-1 (S1, S0), Server-2 (S2, S1)
• Routing Strategies
○ Uniform
○ Replica Group
○ Partition Aware
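A broker has to pick one replica of each segment so that every segment is queried exactly once. The sketch below does this greedily by sending each segment to its least-loaded replica; the balancing rule and the `route` function are assumptions for illustration, not Pinot's routing implementation.

```python
# Segment -> servers holding a replica, as in the assignment slide.
ASSIGNMENT = {
    "S0": ["Server-0", "Server-1"],
    "S1": ["Server-1", "Server-2"],
    "S2": ["Server-0", "Server-2"],
}


def route(assignment):
    """Return {server: [segments]} covering every segment exactly once."""
    load = {}
    plan = {}
    for segment in sorted(assignment):
        replicas = assignment[segment]
        # Prefer the replica with the fewest segments so far; break ties by name.
        server = min(replicas, key=lambda s: (load.get(s, 0), s))
        plan.setdefault(server, []).append(segment)
        load[server] = load.get(server, 0) + 1
    return plan


plan = route(ASSIGNMENT)
print(plan)
```

A replica-group or partition-aware strategy would restrict the candidate servers further to reduce fan-out; this sketch only shows the uniform case.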
How to use Pinot: Batch + Realtime
[Diagram: Segment Push Job → Controller (Helix) → Offline Servers; Streaming Data (Kafka, Event Hub, Kinesis...) → Real-time Servers; Queries → Broker → Offline and Real-time Servers]
Table Config
● Table name: meetupRsvp
● Table type: real-time
● Replication factor: 2
● Kafka broker: ...
● Kafka topic name: ...
● Retention: 5 days
● ...
• A single schema for both offline + real-time tables
• Real-time servers keep consumed data in memory, periodically flush data to segment store.
• Broker handles offline and real-time federation.
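One way to think about offline and real-time federation is a time boundary: rows up to the boundary are answered from offline data, newer rows from real-time data, so overlapping rows are not counted twice. The sketch below is a simplification under that assumption; the data and the `federated_sum` helper are hypothetical.

```python
# Hypothetical (day, rsvp_count) rows; day 3 exists in both tables.
OFFLINE = [(1, 10), (2, 20), (3, 30)]
REALTIME = [(3, 30), (4, 40), (5, 50)]


def federated_sum(offline, realtime):
    """Sum a metric across both tables without double counting the overlap."""
    boundary = max(day for day, _ in offline)   # newest day covered by offline data
    offline_part = sum(v for day, v in offline if day <= boundary)
    realtime_part = sum(v for day, v in realtime if day > boundary)
    return offline_part + realtime_part


total = federated_sum(OFFLINE, REALTIME)
print(total)
```

A naive sum over both tables would give 180 because day 3 appears twice; splitting the query at the boundary gives 150.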
Quick Demo
Event RSVP Data
Interactive Dashboard
select sum(pageView) from T
where country = us
and browser = chrome
...
group by time
• Human-driven queries
• Slice and dice over arbitrary dimensions
5000 Queries   Pinot        Druid
Total Time     11 minutes   24 minutes
P50            84ms         136ms
P90            206ms        667ms
Site Facing Analytics
select sum(articleViewCount) from T
where articleId = x
...
and time >= y and time < z
group by viewer[title|geo|industry]
• Pre-defined queries with different filtering values
• Usually have a filter on the primary key (e.g. articleId)
• High QPS (thousands), low latency (< 100ms for 99%) requirements
Anomaly Detection
for d1 in [us, ca, ...]
  for d2 in [chrome, firefox, ...]
    ...
    select sum(pageViews) from T
    where country = d1 and browser = d2 …
    group by time
[Chart legend: Filter, Aggregation]
• Identifying issues requires monitoring all possible combinations
• Data distribution can be skewed
○ select … where country = us …: slow, scans 60-70% of the data
○ select … where country = ireland …: scans less than 1%
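The nested loops on this slide amount to issuing one query per combination of dimension values. A minimal sketch of that enumeration follows; the dimension values are hypothetical and the query strings are only built, not executed.

```python
from itertools import product

# Hypothetical dimension values to monitor.
DIMENSIONS = {
    "country": ["us", "ca"],
    "browser": ["chrome", "firefox"],
}


def monitoring_queries(dimensions, metric="pageViews", table="T"):
    """Build one time-series query per combination of dimension values."""
    names = list(dimensions)
    queries = []
    for values in product(*(dimensions[name] for name in names)):
        where = " and ".join(f"{n} = '{v}'" for n, v in zip(names, values))
        queries.append(f"select sum({metric}) from {table} where {where} group by time")
    return queries


queries = monitoring_queries(DIMENSIONS)
for q in queries:
    print(q)
```

The number of queries is the product of the value counts per dimension, which is why scan cost per query matters so much for this workload.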
Secret behind Pinot
Aggregation: Scan, Star-Tree Pre-aggregation
Filter: Scan, Inverted Index, Sorted Index, Star-Tree Index
Storage: Columnar Store, Encoding/Compression
Legend: Common Techniques, Pinot & Druid, Pinot Only
Columnar Store
select sum(pageView) from T
where country = us
and browser = chrome
• Read relevant columns only
Raw Data (Row Based)
country  browser  ...
us       chrome   ...
ca       firefox  ...
jp       ie       ...
us       firefox  ...
ca       ie       ...
…        …        ...
Column Based
country: us, ca, jp, us, ca, …
browser: chrome, firefox, ie, firefox, ie, …
Encoding & Compression
Dictionary (dictId: value)
country: 0: ca, 1: jp, 2: us, …
browser: 0: chrome, 1: firefox, 2: ie, …
Forward Index (docId 0, 1, 2, 3, 4, ...)
country: 2, 0, 1, 2, 0, ...
browser: 0, 1, 2, 1, 2, ...
• Storage compression
○ Dictionary encoding
○ Bit compression
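The dictionary and forward index on this slide can be derived from the raw column as sketched below. The function name is hypothetical, and the bit-width calculation is a simple illustration of why dictionary ids compress well, not Pinot's storage format.

```python
def dictionary_encode(column):
    """Return (sorted dictionary, forward index of dictIds, bits needed per id)."""
    dictionary = sorted(set(column))
    ids = {value: dict_id for dict_id, value in enumerate(dictionary)}
    forward_index = [ids[value] for value in column]
    # Enough bits to represent the largest dictId (at least 1).
    bits_per_value = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, forward_index, bits_per_value


country_dict, country_fwd, country_bits = dictionary_encode(["us", "ca", "jp", "us", "ca"])
browser_dict, browser_fwd, browser_bits = dictionary_encode(
    ["chrome", "firefox", "ie", "firefox", "ie"]
)
print(country_dict, country_fwd, country_bits)
print(browser_dict, browser_fwd, browser_bits)
```

With three distinct values, each entry of the forward index fits in 2 bits instead of a full string.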
Inverted Index
Raw Data
docId  country  browser
0      us       chrome
1      ca       firefox
2      jp       ie
3      us       firefox
4      ca       ie
…      …        …
Inverted Index
country  docIds
ca       1, 4...
jp       2...
us       0, 3...
...      ...
browser  docIds
chrome   0 ...
firefox  1, 3...
ie       2, 4...
...      ...
• Storing bitmap for each value
• Fast filtering:
○ Constant time value lookup
○ Bit operations for AND/OR clause
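A small sketch of the bitmap idea: each value maps to a bitmap of docIds, and an AND predicate becomes a bitwise AND. Python integers stand in for the bitmaps here, which is an assumption made for brevity and not how a production index stores them.

```python
COUNTRY = ["us", "ca", "jp", "us", "ca"]
BROWSER = ["chrome", "firefox", "ie", "firefox", "ie"]


def build_inverted_index(column):
    """Map each value to an integer whose set bits are the matching docIds."""
    index = {}
    for doc_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << doc_id)
    return index


def doc_ids(bitmap):
    """Expand a bitmap back into a sorted list of docIds."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


country_idx = build_inverted_index(COUNTRY)
browser_idx = build_inverted_index(BROWSER)

# where country = us and browser = chrome
matches = doc_ids(country_idx.get("us", 0) & browser_idx.get("chrome", 0))
print(matches)
```

An OR clause works the same way with `|` in place of `&`.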
Sorted Index
• Better data compression:
○ Run length encoding
○ Can be accessed as forward/inverted index
• Spatial locality
Sorted column (forward view)
docId  country
0      ca
...    …
100    jp
101    us
…      …
300    us
…      …
Sorted index (docId range per value, usable as an inverted index)
country  start docId  end docId
ca       0            80
jp       81           100
us       101          300
…        …            …
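The range table on this slide is a run-length encoding of a sorted column. The sketch below builds it and looks up a value with binary search; the helper names are hypothetical, and the column is sized to reproduce the docId ranges shown above.

```python
from bisect import bisect_left


def build_sorted_index(sorted_column):
    """Run-length encode a sorted column into (value, start_doc_id, end_doc_id) rows."""
    runs = []
    for doc_id, value in enumerate(sorted_column):
        if runs and runs[-1][0] == value:
            runs[-1][2] = doc_id            # extend the current run
        else:
            runs.append([value, doc_id, doc_id])
    return [tuple(run) for run in runs]


def lookup(runs, value):
    """Return the (start, end) docId range for a value, or None if absent."""
    values = [run[0] for run in runs]
    i = bisect_left(values, value)
    if i < len(runs) and runs[i][0] == value:
        return runs[i][1], runs[i][2]
    return None


# 81 rows of "ca", 20 of "jp", 200 of "us", matching the ranges on the slide.
COLUMN = ["ca"] * 81 + ["jp"] * 20 + ["us"] * 200
runs = build_sorted_index(COLUMN)
print(runs)
print(lookup(runs, "us"))
```

Because matching docIds are contiguous, a filter on the sorted column touches one range of the other columns, which is the spatial locality benefit listed above.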
Latency vs. Space Trade-off
[Chart: latency vs. space requirement, with Scan, Star-Tree, and Pre-cube as points along the curve]
Star-Tree Index
[Chart: latency vs. space requirement for thresholds T=infinity, T=1,000,000, T=10,000, T=100, T=1]
• Configurable trade-off between latency and space through partial pre-aggregation
• Can achieve a hard upper bound on query latency
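The role of the threshold T can be illustrated with a heavily simplified star-tree: a node is split on the next dimension only while it holds more than T records, and each split adds a star child that aggregates that dimension away. This is a sketch of the idea under my own assumptions (in-memory dicts, fixed dimension order, "*" reserved as the star marker), not Pinot's implementation.

```python
from collections import defaultdict

STAR = "*"


def build(records, dims, threshold):
    """records: list of (dimension_tuple, metric); dims: dimension positions left to split on."""
    merged = defaultdict(int)
    for key, metric in records:            # collapse duplicate dimension tuples
        merged[key] += metric
    records = list(merged.items())
    node = {"records": records, "dim": None, "children": None}
    if len(records) <= threshold or not dims:
        return node                        # leaf: at most `threshold` records to scan
    d, rest = dims[0], dims[1:]
    groups = defaultdict(list)
    for key, metric in records:
        groups[key[d]].append((key, metric))
        star_key = key[:d] + (STAR,) + key[d + 1:]
        groups[STAR].append((star_key, metric))   # star child ignores dimension d
    node["dim"] = d
    node["children"] = {v: build(g, rest, threshold) for v, g in groups.items()}
    node["records"] = None
    return node


def query(node, filters):
    """filters: {dimension position: value}. Returns (sum, number of records scanned)."""
    while node["children"] is not None:
        child = node["children"].get(filters.get(node["dim"], STAR))
        if child is None:
            return 0, 0
        node = child
    total = scanned = 0
    for key, metric in node["records"]:
        scanned += 1
        if all(key[d] == v for d, v in filters.items()):
            total += metric
    return total, scanned


# Hypothetical ((country, browser), pageView) rows.
DATA = [(("us", "chrome"), 10), (("ca", "firefox"), 20), (("jp", "ie"), 30),
        (("us", "firefox"), 40), (("ca", "ie"), 50)]

full_scan = build(DATA, (0, 1), threshold=float("inf"))   # T=infinity: no pre-aggregation
star_tree = build(DATA, (0, 1), threshold=1)               # T=1: fully pre-aggregated
print(query(full_scan, {0: "us"}), query(star_tree, {0: "us"}))
```

Both trees return the same sum, but the T=infinity tree scans every record while the T=1 tree scans one; intermediate values of T bound the scan at T records per query in exchange for less extra storage.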
Flexible Query Execution Plan
• Query: select max(col) from T
  Optimization: Use metadata instead of scanning
• Query: select sum(metric) from T where country = us and accountId = x
  Optimization: Reorder filter based on the available indexes (apply accountId before country predicate)
Segment level physical query planner can intelligently choose the best way to solve the query based on the segment metadata and available indexes.
Global Optimizations
• Problem: Querying all segments
  Solution: Segment pruning to minimize the number of segments to query
• Problem: Querying all servers
  Solution: Smart segment assignment to reduce the fan-out to servers
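Segment pruning can be sketched with per-segment metadata. The example below assumes each segment records the minimum and maximum time it contains and skips segments that cannot overlap the query's time range; the metadata field names are hypothetical.

```python
# Hypothetical per-segment time metadata.
SEGMENTS = {
    "S0": {"min_time": 0, "max_time": 99},
    "S1": {"min_time": 100, "max_time": 199},
    "S2": {"min_time": 200, "max_time": 299},
}


def prune(segments, start, end):
    """Keep segments whose [min_time, max_time] overlaps the query range [start, end)."""
    return sorted(
        name
        for name, meta in segments.items()
        if meta["max_time"] >= start and meta["min_time"] < end
    )


kept = prune(SEGMENTS, start=150, end=250)
print(kept)
```

A query for time >= 150 and time < 250 only needs S1 and S2, so S0 is never sent to a server.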
Conclusion
[Diagram: User Activity Data feeding Member Facing Applications, Interactive Dashboard, and Anomaly Detection]
Contributing to Pinot
• We are looking for contributions!
• Apache Pinot (incubating) 0.1.0 is available at
https://pinot.apache.org
• Pinot Twitter Account
https://twitter.com/ApachePinot
• Pinot Meetup Page
https://www.meetup.com/apache-pinot
• Pinot Slack Channel
https://tinyurl.com/pinotSlackChannel
Folks behind Pinot
Mayank Shrivastava
Subbu Subramaniam
Jean-Francois Im
Jackie Jiang
Seunghyun Lee
Jennifer Dai
Neha Pawar
Jialiang Li
Sunitha Beeram
Shraddha Sahay
Kishore Gopalakrishna
Xiang Fu
James Shao
Prasanna Ravi
John Gutmann
Dino Occhialini
Walter Huf
Xiaohui Sun
Long Huynh
Akshay Rai
Alexander Pucher
Jihao Zhang
Felix Cheung
Olivier Lamy
Jim Jagielski
Marcel Siegrist
Roman Shaposhnik
Anurag Shendge
Thank you

 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale

  • 1. Enabling Real-time Analytics Applications @ LinkedIn’s Scale Mayank Shrivastava Jackie Jiang Senior Software Engineer Seunghyun Lee Senior Software Engineer Staff Software Engineer Apache Pinot
  • 2. Agenda: 1. Introduction 2. Pinot @ LinkedIn 3. How to use Pinot 4. Pinot Performance
  • 3. How is data generated and used at LinkedIn Actor Verb Member Job Post Company Object Life Cycle Create Generate Analyze Product Data Insights • 600+ million members • Tens of millions of posts, likes/shares per day • 3+ million jobs posted per month • 30 million companies • Trillions of events per day
  • 5. How to build an online analytics application? • Real-time data ingestion • Millions of active users, 1000s of queries per sec • Super low latency (10s ms) • Highly available, always on
  • 6. Approach 1. Join on the fly Event Stream Profile View Profile View Table Member Table Application Server Who viewed my profile • Real-time (depending on storage) • High latency due to join
  • 7. Approach 2. Pre Join + Pre Aggregate • Near real-time ingestion • Latency varies with query selectivity Event Stream Profile View Profile View Table Member Table Application Server Who viewed my profile Stream Processing Engine Pre Join + Pre Aggr
  • 8. Approach 3. Pre Join + Pre Aggregate + Pre Cube • Very fast • Batch ingestion (hourly / daily) • Storage explosion • Re-bootstrap on schema change Event Stream Profile View Profile View Table Member Table Application Server Who viewed my profile Batch Processing Engine Pre Join + Pre Aggr + Pre Cube
  • 9. Latency vs. Flexibility (chart; axes: latency, low to high, and flexibility, low to high) Stages: Profile View Table + Member Table, Pre-Join, Pre-Aggregation, Pre-Cube. Systems shown: Spark SQL, Presto, Hive, Big Query, Druid, Elastic Search, Pinot, Kylin, KV Store
  • 10. Who Viewed My Profile @ LinkedIn Data Lake Stream Processing WVMP Dashboard Ad-hoc Queries Espresso Raw Tracking Data Pre-joined Data Pre Join + Pre Aggr
  • 11. What is Apache Pinot? • OLAP Datastore • Columnar, indexed storage • Low latency analytics • Distributed – highly available, reliable, scalable • Lambda architecture ○ Offline data pushes + Real-time stream ingestion • Open Source
  • 12. Agenda: 1. Introduction 2. Pinot @ LinkedIn 3. How to use Pinot 4. Pinot Performance
  • 13. Pinot @ LinkedIn • 70+ Member Facing Use Cases • 2000+ Dashboards for Internal Business Metrics • 100K+ Queries Per Second • 1M+ Records Ingested Per Second
  • 14. Pinot @ LinkedIn: Member Facing Analytics Report • Providing analytics reports for LinkedIn member-facing applications • Very high QPS (thousands) • Requires a strict latency SLA (tens of ms to sub-second)
  • 15. Pinot @ LinkedIn: Interactive Dashboard • Visualization tool for multi-dimensional metrics • Complex, explorative queries • 2000+ metrics, used by 1000+ employees
  • 16. Pinot @ LinkedIn: Anomaly Detection • Efficiently detect and investigate anomalies in metrics • Third Eye: Part of Apache Pinot open source
  • 17. Pinot Usage @ Other Companies
  • 18. Agenda: 1. Introduction 2. Pinot @ LinkedIn 3. How to use Pinot 4. Pinot Performance
  • 19. How to use Pinot Batch Data Ingestion Real-time Data Ingestion SQL-like Query Interface (PQL)
  • 20. Let’s build something cool Event RSVP Data
  • 21. How to use Pinot: Workflow • One-time setup: Define Schema, Define Table Configuration, Create Table • Data ingestion, batch (scheduled job): Raw Data (HDFS, S3, ADLS, NFS...), Generate Pinot Segments, Push Data • Data ingestion, real-time (one-time setup): Streaming Data (Kafka, Event Hub...), Setup Stream Data Source
  • 22. How to use Pinot: Define Schema
      ● Schema name: meetupRsvp
      ● Dimension field specs
        ○ event_name (string)
        ○ event_time (long)
        ○ country (string)
        ○ city (string)
        ○ …
      ● Metrics field specs
        ○ rsvp_count (int)
      ● Time field spec
        ○ timestamp (long)
          ■ timetype: epoch / datetime
          ■ granularity: millisecond / second / hour / day
      • Dimension: an attribute of your data (filter, group by)
      • Metric: a number that is used to measure characteristics of a dimension (aggregation)
      • Time: a timestamp of an event (partitioning, retention management)
      Example query (top 10 events in the US): SELECT event_name, sum(rsvp_count) FROM meetupRsvp WHERE country = 'us' GROUP BY event_name TOP 10
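To make the dimension / metric split concrete, here is a minimal Python sketch of what the example query computes: filter on a dimension (country), group by another dimension (event_name), sum a metric (rsvp_count), and keep the top results. It runs over an in-memory list rather than Pinot; the sample rows and the helper name `top_events` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"event_name": "Pinot Meetup", "country": "us", "rsvp_count": 3},
    {"event_name": "Kafka Night", "country": "us", "rsvp_count": 5},
    {"event_name": "Pinot Meetup", "country": "us", "rsvp_count": 4},
    {"event_name": "Data Day", "country": "ca", "rsvp_count": 9},
]


def top_events(rows, country, limit=10):
    """Sum rsvp_count per event_name for one country, largest totals first."""
    totals = defaultdict(int)
    for row in rows:
        if row["country"] == country:                        # filter on a dimension
            totals[row["event_name"]] += row["rsvp_count"]   # aggregate a metric
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]                                    # TOP 10 in the slide's query


result = top_events(rows, "us")
print(result)
```

The "ca" row is dropped by the filter, and the two "Pinot Meetup" rows collapse into a single group.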
  • 23. How to use Pinot: Configure and Create Table Pinot Schema Table Config ● Table name: meetupRsvp ● Table type: batch / realtime / hybrid ● Replication factor: 2 ● Index Columns: ... ● Bloom filters: ... ● Retention: 30 days ● ... Pinot Admin Client
  • 24. How to use Pinot: Batch Ingestion • Input: Raw Data (Json, CSV, Avro, Parquet, ORC...) on HDFS, S3, ADLS, NFS..., plus Pinot Schema and Table Config • Segment Generation Job (library) • Output: Pinot Segments on HDFS, S3, ADLS, NFS...
  • 25. How to use Pinot: Batch Ingestion • Input: Raw Data (Json, Avro, Parquet, ORC...) on HDFS, S3, ADLS, NFS..., plus Pinot Schema and Table Config • Segment Generation Job (library) • Output: Pinot Segments on HDFS, S3, ADLS, NFS... • Segment Push Job (library)
  • 26. How to use Pinot: Segment Assignment • Components: Segment Push Job, Controller, Helix, Zookeeper, Pinot servers (Server-0, Server-1, Server-2), Segment Store holding S0, S1, S2 (HDFS, S3, ADLS, NFS...) • The push job sends: 1. Table name 2. Segment name 3. Segment URI path • Assignment strategies ○ Uniform ○ Replica Group ○ Partition Aware • Example assignment ● S0: Server-0, Server-1 ● S1: Server-1, Server-2 ● S2: Server-0, Server-2
  • 27. How to use Pinot: Query Routing • Components: Segment Push Job, Controller, Helix, Broker (receives queries), Pinot servers (Server-0, Server-1, Server-2), Segment Store holding S0, S1, S2 (HDFS, S3, ADLS, NFS...) • Routing Strategies ○ Uniform ○ Replica Group ○ Partition Aware
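The example assignment on the segment assignment slide (three segments, three servers, two replicas each) can be reproduced with a simple round-robin rule. The sketch below is an illustration in Python only: the function `assign_segments` and its offset rule are assumptions made here, not Pinot's actual uniform assignment strategy.

```python
def assign_segments(segments, servers, replication):
    """Place each segment on `replication` distinct servers, round-robin.

    Hypothetical rule: replica r of the i-th segment goes to server (i + r) mod N.
    """
    if replication > len(servers):
        raise ValueError("replication factor cannot exceed the number of servers")
    assignment = {}
    for i, segment in enumerate(segments):
        assignment[segment] = [
            servers[(i + r) % len(servers)] for r in range(replication)
        ]
    return assignment


servers = ["Server-0", "Server-1", "Server-2"]
assignment = assign_segments(["S0", "S1", "S2"], servers, replication=2)
for segment, hosts in assignment.items():
    print(segment, sorted(hosts))
```

With these inputs every server ends up hosting two segment replicas, and each segment lands on the same pair of servers as in the slide.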
  • 28. How to use Pinot: Batch + Realtime Segment Push Job Controller Helix Real-time Servers Offline Servers Broker Queries Pinot Streaming Data Kafka, Event Hub, Kinesis... Table Config ● Table name: meetupRsvp ● Table type: real-time ● Replication factor: 2 ● Kafka broker: ... ● Kafka topic name: ... ● Retention: 5 days ● ... • A single schema for both offline + real-time tables
  • 29. How to use Pinot: Batch + Realtime Segment Push Job Controller Helix Real-time Servers Offline Servers Broker Queries Pinot Streaming Data Kafka, Event Hub, Kinesis... • Real-time servers keep consumed data in memory, periodically flush data to segment store. • Broker handles offline and real-time federation.
  • 31. Agenda: 1. Introduction 2. Pinot @ LinkedIn 3. How to use Pinot 4. Pinot Performance
  • 32. Interactive Dashboard select sum(pageView) from T where country = us and browser = chrome ... group by time • Human-driven queries • Slice and dice over arbitrary dimensions
      5000 Queries   Pinot        Druid
      Total Time     11 minutes   24 minutes
      P50            84ms         136ms
      P90            206ms        667ms
  • 33. Site Facing Analytics select sum(articleViewCount) from T where articleId = x ... and time >= y and time < z group by viewer[title|geo|industry] • Pre-defined queries with different filtering values • Usually have a filter on the primary key (e.g. articleId) • High QPS (thousands), low latency (< 100ms for 99%) requirements
  • 34. Anomaly Detection for d1 in [us, ca, ...] for d2 in [chrome, firefox, ...] ... select sum(pageViews) from T where country = d1 and browser = d2 … group by time • select … where country = us …: slow, scans 60-70% of the data • select … where country = ireland …: scans less than 1% • Identifying issues requires monitoring all possible combinations • Data distribution can be skewed
  • 35. Secret behind Pinot • Layers: Aggregation, Filter, Storage • Techniques shown: Scan, Star-Tree Pre-aggregation, Scan, Inverted Index, Columnar Store, Encoding/Compression, Sorted Index, Star-Tree Index • Legend: ❏ Common Techniques ❏ Pinot & Druid ❏ Pinot Only • Example query: select sum(pageView) from T where country = us and browser = chrome
  • 36. Columnar Store • Read relevant columns only • Layer: Storage (Columnar) • Example query: select sum(pageView) from T where country = us and browser = chrome
      Raw data (row based):
        country   browser   ...
        us        chrome    ...
        ca        firefox   ...
        jp        ie        ...
        us        firefox   ...
        ca        ie        ...
        …         …         ...
      Column based:
        country: us, ca, jp, us, ca, …
        browser: chrome, firefox, ie, firefox, ie, …
  • 37. Encoding & Compression • Storage compression ○ Dictionary encoding ○ Bit compression • Layer: Storage (Encoding/Compression) • Example query: select sum(pageView) from T where country = us and browser = chrome
      Column based:
        docId     0        1         2    3         4    …
        country   us       ca        jp   us        ca   …
        browser   chrome   firefox   ie   firefox   ie   …
      Dictionary:
        dictId    0        1         2    …
        country   ca       jp        us   …
        browser   chrome   firefox   ie   …
      Forward index:
        docId     0   1   2   3   4   ...
        country   2   0   1   2   0   ...
        browser   0   1   2   1   2   ...
  • 38. Inverted Index • Storing bitmap for each value • Fast filtering: ○ Constant time value lookup ○ Bit operations for AND/OR clause • Layer: Filter (Inverted Index) • Example query: select sum(pageView) from T where country = us and browser = chrome
      Raw data:
        docId   country   browser
        0       us        chrome
        1       ca        firefox
        2       jp        ie
        3       us        firefox
        4       ca        ie
        …       …         …
      Inverted index:
        country   docIds          browser   docIds
        ca        1, 4...         chrome    0 ...
        jp        2...            firefox   1, 3...
        us        0, 3...         ie        2, 4...
        ...       ...             ...       ...
  • 39. Sorted Index • Better data compression: ○ Run length encoding ○ Can be accessed as forward/inverted index • Spatial locality • Layer: Filter (Sorted Index) • Example query: select sum(pageView) from T where country = us and browser = chrome
      Sorted column (inverted index view):
        docId   country
        0       ca
        ...     …
        100     jp
        101     us
        …       …
        300     us
        …       …
      Sorted index:
        country   start docId   end docId
        ca        0             80
        jp        81            100
        us        101           300
        …         …             …
  • 40. Latency vs. Space Trade-off (chart; axes: latency and space requirement; points: scan, pre-cube, Star-Tree) • Layer: Aggregation (Star-Tree Pre-aggregation, Star-Tree Index) • Example query: select sum(pageView) from T where country = us and browser = chrome
  • 41. Star-Tree Index (chart; axes: latency and space requirement; points: T=infinity, T=1,000,000, T=10,000, T=100, T=1) • Configurable trade-off between latency and space through partial pre-aggregation • Can achieve a hard upper bound on query latency
  • 43. Flexible Query Execution Plan • The segment-level physical query planner chooses how to execute a query based on the segment metadata and the available indexes.
      Query                                                              Optimization
      select max(col) from T                                             Use metadata instead of scanning
      select sum(metric) from T where country = us and accountId = x     Reorder filters based on the available indexes (apply the accountId predicate before the country predicate)
  • 44. Global Optimizations
      Problem                  Solution
      Querying all segments    Segment pruning to minimize the number of segments to query
      Querying all servers     Smart segment assignment to reduce the fan-out to servers
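The dictionary and forward index on the Encoding & Compression slide can be rebuilt in a few lines. This is a minimal Python sketch of the general technique, using the slide's sample column values; the helper names `dictionary_encode` and `bits_per_value` are made up here and do not correspond to Pinot classes.

```python
def dictionary_encode(values):
    """Return (sorted dictionary, forward index of dictIds, one per docId)."""
    dictionary = sorted(set(values))
    dict_id = {value: i for i, value in enumerate(dictionary)}
    forward_index = [dict_id[value] for value in values]
    return dictionary, forward_index


def bits_per_value(cardinality):
    """Bits needed to store one dictId when packing the forward index."""
    return max(1, (cardinality - 1).bit_length())


# Column values from the slide, indexed by docId 0..4.
country = ["us", "ca", "jp", "us", "ca"]
browser = ["chrome", "firefox", "ie", "firefox", "ie"]

country_dict, country_fwd = dictionary_encode(country)
browser_dict, browser_fwd = dictionary_encode(browser)
print(country_dict, country_fwd)
print(browser_dict, browser_fwd)
print(bits_per_value(len(country_dict)), "bits per country dictId")
```

Each string is stored once in the dictionary, and the forward index holds only small integers, which is what makes bit compression possible.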
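The "bit operations for AND/OR clause" point can be illustrated with one bitmap per value. The Python sketch below uses plain integers as bitmaps (bit i set means docId i matches) and the slide's sample rows; it is an illustration of the idea only, and production systems typically use compressed bitmap structures instead.

```python
def build_inverted_index(column):
    """Map each distinct value to an integer bitmap of matching docIds."""
    index = {}
    for doc_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << doc_id)
    return index


def doc_ids(bitmap):
    """Expand a bitmap back into a sorted list of docIds."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Raw data from the slide, indexed by docId 0..4.
country = ["us", "ca", "jp", "us", "ca"]
browser = ["chrome", "firefox", "ie", "firefox", "ie"]

country_index = build_inverted_index(country)
browser_index = build_inverted_index(browser)

# where country = us and browser = chrome: one lookup per value, then a bitwise AND
both = country_index["us"] & browser_index["chrome"]
# where country = us or browser = chrome: bitwise OR
either = country_index["us"] | browser_index["chrome"]
print(doc_ids(both), doc_ids(either))
```

The filter never touches rows that fail the predicate: it looks up two bitmaps and combines them.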
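When a column is sorted, each value occupies one contiguous run of docIds, so the index only needs a start and an end per value. The Python sketch below builds such a run table and looks up a value with binary search; the column is synthesized to match the docId ranges on the slide, and the function names are assumptions for illustration.

```python
from bisect import bisect_left


def build_sorted_index(sorted_column):
    """Run-length encode a sorted column into parallel lists (values, starts, ends)."""
    values, starts, ends = [], [], []
    for doc_id, value in enumerate(sorted_column):
        if values and values[-1] == value:
            ends[-1] = doc_id          # extend the current run
        else:
            values.append(value)       # start a new run
            starts.append(doc_id)
            ends.append(doc_id)
    return values, starts, ends


def doc_range(index, value):
    """Return the inclusive (start docId, end docId) for a value, or None."""
    values, starts, ends = index
    pos = bisect_left(values, value)
    if pos < len(values) and values[pos] == value:
        return starts[pos], ends[pos]
    return None


# Synthetic sorted column matching the slide: ca 0-80, jp 81-100, us 101-300.
column = ["ca"] * 81 + ["jp"] * 20 + ["us"] * 200
index = build_sorted_index(column)
print(doc_range(index, "us"))
```

A filter such as country = us then becomes a single contiguous docId range, which is where the spatial locality comes from.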
  • 46. Contributing to Pinot • We are looking for contributions! • Apache Pinot (incubating) 0.1.0 is available at https://pinot.apache.org • Pinot Twitter Account https://twitter.com/ApachePinot • Pinot Meetup Page https://www.meetup.com/apache-pinot • Pinot Slack Channel https://tinyurl.com/pinotSlackChannel
  • 47. Folks behind Pinot Mayank Shrivastava Subbu Subramaniam Jean-Francois Im Jackie Jiang Seunghyun Lee Jennifer Dai Neha Pawar Jialiang Li Sunitha Beeram Shraddha Sahay Kishore Gopalakrishna Xiang Fu James Shao Prasanna Ravi John Gutmann Dino Occhialini Walter Huf Xiaohui Sun Long Huynh Akshay Rai Alexander Pucher Jihao Zhang Felix Cheung Olivier Lamy Jim Jagielski Marcel Siegrist Roman Shaposhnik Anurag Shendge