Archmage, Pinterest’s Real-time Analytics Platform on Druid
October 2020
Jian Wang, Tech Lead, Pinterest
Jiaqi Gu, Software Engineer, Pinterest
1
3© 2020 Pinterest. All rights reserved.
Agenda
1. Motivation
2. Challenges
3. Use cases
4. Cluster stats
5. Architecture
6. Learnings
4© 2020 Pinterest. All rights reserved.
Motivation
● Cons of the HBase-based precomputed key-value lookup system
○ The key-value data model doesn’t fit the analytics query pattern
○ Cardinality explodes any time a new column is added
○ It is impossible to precompute all filter combinations
○ More work is needed on the application side to do aggregation
We want a better system as demand for Pinterest’s analytics use cases increases...
Why did we replace HBase with Druid for analytics use cases?
Example key-value model:
country=usa,device=iphone,gender=male,click=123
country=china,device=iphone,gender=female,click=456
country=japan,device=android,gender=male,click=789
country=usa,device=iphone,gender=female,click=135
5© 2020 Pinterest. All rights reserved.
Challenges
What are the unique challenges of onboarding Druid at Pinterest?
● Clients expect low latency on par with a key-value store
○ Having migrated from an HBase-based key-value lookup backend, clients expect latency to stay in the low 100 ms range, while vanilla Druid only guarantees sub-second to seconds latency
● Pinterest-scale data volume
○ Largest batch use case: 300 TB with a seconds-level SLA
○ Largest real-time use case: 500k write QPS, with an SLA of 500 query QPS and 200 ms p99
● Cost effectiveness
○ We want the lowest cost for the best possible performance
8© 2020 Pinterest. All rights reserved.
Use cases
Many of the company’s analytics use cases are powered by Druid
● Partner and advertiser reporting
○ Stats on board/pin impressions, clicks, saves, etc.
● Real-time spam detection
○ Detects spam events for user login and pin operations
● Experiment metrics
○ A/B testing experiment metrics
● Ads delivery debugger
○ Debugging tool for ads delivery status
● And many more ...
9© 2020 Pinterest. All rights reserved.
Cluster stats
We have clusters for both online and offline use cases
● Biggest online cluster
○ 200 r4.8x historical nodes hosting 32 TB, plus 50 i3.2x hosting 100 TB
○ 250 QPS
○ Query P99 ranges from 100 ms to ~1.5 s depending on the use case
● Biggest offline cluster
○ 160 i3en.2x historical nodes hosting 280 TB
○ QPS < 1
○ P99 of 2 s
10© 2020 Pinterest. All rights reserved.
Architecture
[Overview diagram: batch ingestion and real-time ingestion feeding Druid, fronted by Archmage]
11© 2020 Pinterest. All rights reserved.
Architecture
Archmage
● Proxy service
○ A Thrift service that acts as a proxy between clients and Druid, to ease integration with other services at Pinterest
○ Handles Druid service discovery by watching the broker znode in Druid’s ZooKeeper
○ Thrift-to-HTTP and HTTP-to-Thrift request/response translation
○ Metrics reporting
○ Speculative execution (see the sketch after this list)
○ Query optimization and rewriting
○ Shadow-cluster dark traffic routing
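A minimal sketch of how a proxy could implement that speculative execution, assuming two interchangeable broker endpoints and Druid’s standard SQL API (POST /druid/v2/sql/ with a JSON body); the class name, hedge delay, and URLs are illustrative, not Archmage’s actual implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative only: hedged (speculative) execution of a Druid SQL query against two brokers. */
public class SpeculativeQuery {
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  static CompletableFuture<String> query(String brokerUrl, String sqlJsonBody) {
    HttpRequest request = HttpRequest.newBuilder(URI.create(brokerUrl + "/druid/v2/sql/"))
        .header("Content-Type", "application/json")
        .timeout(Duration.ofSeconds(5))
        .POST(HttpRequest.BodyPublishers.ofString(sqlJsonBody))
        .build();
    return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.ofString())
        .thenApply(HttpResponse::body);
  }

  /** Query the primary broker; if it hasn't answered within hedgeDelayMs, also query the backup. */
  static String speculativeQuery(String primaryBroker, String backupBroker,
                                 String sqlJsonBody, long hedgeDelayMs) throws Exception {
    CompletableFuture<String> primary = query(primaryBroker, sqlJsonBody);
    // After the hedge delay, fire the duplicate request (a real implementation would
    // skip this if the primary has already completed, and cancel the losing request).
    CompletableFuture<String> backup = CompletableFuture
        .supplyAsync(() -> null, CompletableFuture.delayedExecutor(hedgeDelayMs, TimeUnit.MILLISECONDS))
        .thenCompose(ignored -> query(backupBroker, sqlJsonBody));
    // Whichever response arrives first wins.
    return primary.applyToEither(backup, response -> response).get();
  }
}
```

A production version would also cap how much duplicate load hedging is allowed to add during broker slowdowns.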
12© 2020 Pinterest. All rights reserved.
Architecture
Query
● Thrift API
○ Clients send a Thrift request with a SQL field to Archmage, which forwards it to Druid (see the example after this list)
● UI
○ Use-case-specific UIs owned by individual clients
○ An internal UI with a SQL editor for ad-hoc queries
○ Apache Superset for dashboarding
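A minimal sketch of the forwarding step, assuming the broker’s standard SQL endpoint (POST /druid/v2/sql/ with a JSON body containing the query); the Thrift side is omitted and the class and helper names are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative only: forward the SQL carried in a Thrift request to the Druid broker. */
public class SqlForwarder {
  private final HttpClient client = HttpClient.newHttpClient();
  private final String brokerUrl; // e.g. discovered from the broker znode in ZooKeeper

  public SqlForwarder(String brokerUrl) {
    this.brokerUrl = brokerUrl;
  }

  /** The SQL string would come from a field on the incoming Thrift request. */
  public String execute(String sql) throws Exception {
    String body = "{\"query\": " + toJsonString(sql) + "}";
    HttpRequest request = HttpRequest.newBuilder(URI.create(brokerUrl + "/druid/v2/sql/"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    return response.body(); // JSON rows, translated back into the Thrift response
  }

  // Naive JSON string escaping, sufficient for the sketch.
  private static String toJsonString(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }
}
```

Usage would look like `new SqlForwarder("http://broker:8082").execute("SELECT country, SUM(clicks) FROM impressions GROUP BY country")`, 8082 being the broker’s default port.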
13© 2020 Pinterest. All rights reserved.
Architecture
Ingestion
● Batch ingestion
○ Hadoop: an extracted library that bypasses Druid locking
○ Reads input from S3 and writes Druid segment files to S3
● Real-time ingestion
○ Kafka indexing service: exactly-once delivery
○ Evaluated the push-based Tranquility library, but it has been deprecated
14© 2020 Pinterest. All rights reserved.
Learnings
Tiered setup
● Need disk access? Look for host types with good 4 KB random-read IOPS
○ Disk is needed when segments are not accessed often, or when the data volume is so large that a fully in-memory setup is too expensive
○ Druid mmaps each segment and abstracts it as a byte array. Only the specific portion of the byte array needed by a query (e.g., a certain column) is loaded from disk at query time, and loading happens in 4 KB pages. This means a host type with 256 GB of RAM (excluding process memory) behaves much the same as one with 1 GB of RAM if 1) their 4 KB random-read IOPS are the same and 2) each query is expected to scan different segments. A minimal sketch of this behavior follows this list.
○ On AWS, host types with instance-local SSDs work best: i3 > i3en >> other instance types with an attached EBS disk
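A minimal sketch (not Druid code) of why this holds: mapping a file is nearly free, and only the 4 KB pages that are actually read get faulted in from disk, so RAM beyond the working set buys little when every query touches different segments.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: mmap a file and fault in just a sparse subset of its 4 KB pages. */
public class MmapTouch {
  public static void main(String[] args) throws Exception {
    Path segmentFile = Path.of(args[0]); // stand-in for a Druid segment file
    try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
      // A single mapping is limited to 2 GB by ByteBuffer's int indexing
      // (Druid keeps individual smoosh files under roughly this size for the same reason).
      long mapSize = Math.min(channel.size(), Integer.MAX_VALUE);
      // Mapping is cheap: nothing is read from disk yet.
      MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, mapSize);
      int pageSize = 4096;
      long touched = 0;
      // Reading one byte every 256th page faults in only those pages; the untouched
      // pages are never loaded, no matter how much RAM the host has.
      for (long offset = 0; offset < mapSize; offset += 256L * pageSize) {
        buf.get((int) offset);
        touched++;
      }
      System.out.println("Faulted in ~" + touched + " pages of a " + (mapSize >> 20) + " MB mapping");
    }
  }
}
```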
15© 2020 Pinterest. All rights reserved.
Learnings
Tiered setup
● Recent data? Keep it all in memory
○ Recent data is expected to be queried more often, so we want to avoid query-time disk I/O by keeping all of it in the page cache
○ Put the most recent segments (e.g., the last 3 months) on memory-heavy instance types with a 1:1 RAM-to-disk ratio: r5.8x with attached EBS
○ Background threads on historical nodes read segment files (equivalent to `cat 0000.smoosh > /dev/null`) at server bootstrap and on new segment download, forcing the OS to load them into the page cache and avoiding on-demand loading at query time (a warm-up sketch follows this list)
○ The exact period that counts as “recent” is best determined through request analysis; Druid real-time ingestion is a good choice for that
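A minimal sketch of that warm-up, assuming segments sit under a local segment-cache directory; reading each file sequentially and discarding the bytes is enough to pull it into the OS page cache. Paths, thread counts, and class names are illustrative, not the actual historical-node code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

/** Illustrative only: warm the page cache by reading every segment file once. */
public class SegmentWarmer {
  private final ExecutorService pool = Executors.newFixedThreadPool(2); // keep I/O pressure low

  /** Call at bootstrap, and again for each newly downloaded segment directory. */
  public void warm(Path segmentCacheDir) throws IOException {
    try (Stream<Path> files = Files.walk(segmentCacheDir)) {
      files.filter(Files::isRegularFile)
           .forEach(file -> pool.submit(() -> readAndDiscard(file)));
    }
  }

  // Equivalent of `cat file > /dev/null`: sequential read, bytes thrown away,
  // which leaves the file contents resident in the OS page cache.
  private static void readAndDiscard(Path file) {
    byte[] buffer = new byte[1 << 20];
    try (InputStream in = Files.newInputStream(file)) {
      while (in.read(buffer) != -1) {
        // intentionally empty: we only want the read side effect
      }
    } catch (IOException e) {
      System.err.println("Failed to warm " + file + ": " + e.getMessage());
    }
  }
}
```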
16© 2020 Pinterest. All rights reserved.
Learnings
Middle managers
● Need as much attention in tuning as historical nodes
○ Monitor Kafka ingestion offset lag and timestamp lag (a monitoring sketch follows this list)
○ Increase intermediatePersistPeriod if you are sensitive to query latency on middle managers
○ Use a custom partitioner on the Kafka producer side to improve data locality
○ Use lateMessageRejectionPeriod and earlyMessageRejectionPeriod so that scattered late and early events don’t create a lot of small segments
○ Run reindexing (compaction) jobs
○ Be careful not to use Kafka transactions on the producer side prior to Druid 0.15
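One hedged way to watch that lag is to poll the Overlord’s supervisor status endpoint (GET /druid/indexer/v1/supervisor/<id>/status); the exact lag fields inside the JSON payload vary by Druid version, so this sketch just returns the raw payload for a metrics pipeline to parse.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Illustrative only: poll a Kafka supervisor's status to track ingestion lag. */
public class SupervisorLagMonitor {
  private final HttpClient client = HttpClient.newHttpClient();
  private final String overlordUrl;   // e.g. "http://overlord:8090"
  private final String supervisorId;  // usually the datasource name

  public SupervisorLagMonitor(String overlordUrl, String supervisorId) {
    this.overlordUrl = overlordUrl;
    this.supervisorId = supervisorId;
  }

  /** Fetch the status JSON; offset/time lag live inside its payload (field names vary by version). */
  public String fetchStatus() throws Exception {
    HttpRequest request = HttpRequest.newBuilder(
            URI.create(overlordUrl + "/druid/indexer/v1/supervisor/" + supervisorId + "/status"))
        .timeout(Duration.ofSeconds(10))
        .GET()
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    return response.body(); // emit the parsed lag values to your metrics system
  }
}
```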
17© 2020 Pinterest. All rights reserved.
Learnings
Group by queries
● Tail latency
○ Many group bys are convertible to topN queries if you add a limit clause (see the example after this list)
○ Add a combined dimension if there are more than 2 group-by dimensions but they are fixed
○ Enable limit push-down to trade some accuracy for performance
○ Enable parallel broker-side merge
○ Limit the number of rows to group by from the application side if possible
○ Make sure you have enough merge buffers so you don’t run out of them
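As an example of the first tip, Druid SQL will typically plan a single-dimension group by with an ORDER BY on the aggregate and a LIMIT as an approximate topN rather than a full groupBy, which is much cheaper at the tail. A sketch reusing the hypothetical SqlForwarder from the Query slide; the table and column names are made up:

```java
/** Illustrative only: shaping a group by so Druid SQL can plan it as a topN. */
public class TopNExample {
  public static void main(String[] args) throws Exception {
    SqlForwarder druid = new SqlForwarder("http://broker:8082"); // hypothetical helper from earlier

    // An unbounded "SELECT country, SUM(clicks) FROM impressions GROUP BY country"
    // aggregates every country and merges them all on the broker. Adding an ORDER BY
    // on the aggregate plus a LIMIT lets Druid plan the query as an approximate topN,
    // trading a little ranking accuracy for much better tail latency.
    String topNShaped =
        "SELECT country, SUM(clicks) AS clicks FROM impressions "
            + "GROUP BY country ORDER BY clicks DESC LIMIT 100";

    System.out.println(druid.execute(topNShaped));
  }
}
```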
18© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Cluster computing resources are limited
○ Each segment is processed by one processing thread, and the number of threads is usually equal to the number of cores
○ Cores are expensive and are always fewer than the number of segments
○ We should be careful about which segments a query scans
● Shard specs with query-time pruning on partition dimensions
○ Batch ingestion
■ Hash-based shard spec
■ Even-size single-dimension shard spec
○ Real-time ingestion
■ Stream hash-based shard spec
19© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time pruning on partition dimensions
○ Batch ingestion
■ Hash-based shard spec
● Worked well in most use cases
● Added the missing query-time pruning based on hashing and partition dimensions
● However: skewed data leads to skewed segment sizes, long ingestion tail latency, and query performance issues
■ Even-size single-dimension shard spec
20© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time pruning on partition dimensions
○ Batch ingestion
■ Hash-based shard spec
■ Even-size single-dimension shard spec
● The default single-dimension shard spec puts all data for the same partition dimension value into a single segment
● Added a custom partitioner to spread data for a skewed partition dimension value across multiple segments
● Replaced the two very slow Hadoop jobs (rolling up the input, and counting rows per partition dimension value to decide partitions) with reading the output of a SparkSQL job
21© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time pruning on partition dimensions
○ Real-time ingestion
■ Stream hash-based shard spec
● Real-time ingestion defaults to the numbered shard spec, which carries no metadata about what data it contains, so every query fans out to all segments, making high query QPS very hard to support
● The stream hash shard spec is a real-time version of the batch hash-based shard spec
● Have the Kafka producer route each record to a Kafka partition id based on: hash(partition dimensions) % number of Kafka partitions (see the partitioner sketch below)
● Cons: this approach doesn’t allow increasing the number of Kafka partitions, which would lead to incorrect results during the transition period
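A minimal sketch of that producer-side routing as a Kafka Partitioner, assuming the record key carries the concatenated partition dimension values; the class name and key format are illustrative, not Pinterest’s actual partitioner:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

/** Illustrative only: route records by hash(partition dimensions) % number of partitions. */
public class PartitionDimensionPartitioner implements Partitioner {

  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    if (keyBytes == null) {
      throw new IllegalArgumentException("records must be keyed by their partition dimensions");
    }
    int numPartitions = cluster.partitionsForTopic(topic).size();
    // Assume the key is the concatenated partition dimension values, e.g. "usa|iphone".
    // murmur2 keeps the mapping stable as long as the partition count never changes.
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
  }

  @Override
  public void close() {}

  @Override
  public void configure(Map<String, ?> configs) {}
}
```

It would be registered on the producer via the partitioner.class setting (ProducerConfig.PARTITIONER_CLASS_CONFIG).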
22© 2020 Pinterest. All rights reserved.
Learnings
Operation tips
● druid.broker.select.tier and druid.server.priority
○ Control routing for dark reads, Druid config A/B testing, and zero-downtime deploys
23© 2020 Pinterest. All rights reserved.
Learnings
Operation tips
● skipCoordinatorRun
○ Use this runtime config when deploying or restarting historical nodes to keep the coordinator from triggering unnecessary segment movements
● maxSegmentsInNodeLoadingQueue and maxSegmentsToMove
○ Segments are represented as children under a historical host’s znode
○ Load queue znodes are not compressed
○ Be careful about hitting the ZooKeeper buffer limit (defaults to a few MB) when loading a large number of segments onto a historical node (see the sketch after this list)
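A hedged sketch of adjusting those two settings through the coordinator’s dynamic configuration endpoint (POST /druid/coordinator/v1/config); both keys are standard coordinator dynamic config, but the numbers are placeholders to tune against your own cluster and ZooKeeper jute.maxbuffer limit, and you should verify how partial updates merge on your Druid version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative only: cap segment-loading churn via coordinator dynamic config. */
public class CoordinatorConfigUpdater {
  public static void main(String[] args) throws Exception {
    String coordinatorUrl = "http://coordinator:8081"; // default coordinator port
    // Keep each historical's load queue znode small enough to stay under the
    // ZooKeeper buffer limit, and throttle rebalancing during deploys.
    String dynamicConfig = "{"
        + "\"maxSegmentsInNodeLoadingQueue\": 100,"  // placeholder value
        + "\"maxSegmentsToMove\": 50"                // placeholder value
        + "}";
    HttpRequest request = HttpRequest.newBuilder(
            URI.create(coordinatorUrl + "/druid/coordinator/v1/config"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(dynamicConfig))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("Coordinator responded with HTTP " + response.statusCode());
  }
}
```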
24© 2020 Pinterest. All rights reserved.
Time for questions
@Pinterest
25
Thank you!
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
Dates: November 10, 2020
druidsummit.org
26
Register Now for the Next Druid Virtual Summit