Video available at: https://www.youtube.com/watch?v=z4JTjUp3NC0
To scale the building of decision trees on large amounts of Indeed job search data, we created a system called Imhotep. In addition to being a crucial tool for building these machine learning models, Imhotep has proven to be applicable to many different analytics problems. The core of Imhotep is a distributed system that manages the parallel execution of queries across a set of time-sharded inverted indices.
This talk covers Imhotep’s primitive operations that allow us to build decision trees, drill into data, build graphs, and even execute SQL-like queries in IQL (Imhotep Query Language). We will also discuss what makes Imhotep fast, highly available, and fault tolerant.
32. Groups
● Documents are placed into numbered
groups
● Every document starts in group 1
● Group 0 means “filtered out”
33. Groups
● Groups are stateful and scoped to a session
● Regroup operations update group for each
doc in shard
34. Metric Regroup
● Iterate over doc_id->metric lookup
● Set group to (value - start) / bucket_width
● Useful for making graphs (buckets on x-axis)
[Diagram: the range from start to end divided into buckets 1 to 5, each one bucket width wide]
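The metric regroup on slide 34 can be sketched in a few lines of Python. This is a minimal illustration, not Imhotep's API: the function name `metric_regroup`, the list-based layout, numbering buckets from 1, and sending out-of-range docs to group 0 are all assumptions made for the example.

```python
def metric_regroup(groups, metric, start, end, bucket_width):
    """Return new group numbers, bucketing each doc by its metric value.

    groups[doc_id] is the doc's current group (0 = filtered out) and
    metric[doc_id] is the doc's value for the metric being bucketed.
    """
    new_groups = []
    for doc_id, group in enumerate(groups):
        value = metric[doc_id]
        if group == 0 or not (start <= value < end):
            # Assumption: filtered-out and out-of-range docs land in group 0.
            new_groups.append(0)
        else:
            # Slide formula, shifted by 1 so the first bucket is group 1.
            new_groups.append((value - start) // bucket_width + 1)
    return new_groups


# Hypothetical data: five docs, range [0, 50), buckets of width 10.
groups = [1, 1, 1, 0, 1]
metric = [3, 17, 49, 25, 60]
print(metric_regroup(groups, metric, start=0, end=50, bucket_width=10))
# [1, 2, 5, 0, 0]
```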
35. Get Group Stats
● For each group, sums a metric for all docs in
that group
36. Bucket By Day
1. Regroup on time metric
2. Get Group Stats for count metric (always 1)
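The two steps of "Bucket By Day" can be sketched as follows. The names `get_group_stats` and `SECONDS_PER_DAY`, the sample timestamps, and the in-memory lists are assumptions for illustration; they are not taken from Imhotep itself.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def get_group_stats(groups, metric):
    """Sum a metric over the docs in each group, skipping group 0."""
    totals = defaultdict(int)
    for doc_id, group in enumerate(groups):
        if group != 0:
            totals[group] += metric[doc_id]
    return dict(totals)


# Hypothetical doc timestamps, in seconds since the start of the range.
times = [10, 86400, 90000, 172800, 200000]

# Step 1: regroup on the time metric, one group per day (groups start at 1).
groups = [t // SECONDS_PER_DAY + 1 for t in times]

# Step 2: get group stats for the count metric, which is 1 for every doc.
counts = [1] * len(times)
per_day = get_group_stats(groups, counts)
print(per_day)  # {1: 1, 2: 2, 3: 2}
```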
37. Total Job Searches From 2014-03-09 to 2014-03-23 Per Day
[Chart: x-axis ticks at 2014-03-09, 2014-03-16, 2014-03-23]
38. Total and US Job Searches From 2014-03-09 to 2014-03-23 Per Day
[Chart: x-axis ticks at 2014-03-09, 2014-03-16, 2014-03-23]
40. Inverted Index
● Like index in the back of a book
● words = terms, page numbers = doc ids
● Term list is sorted
● Doc list for each term is sorted
41. Standard Index
doc id  query     country  impressions  clicks
0       software  Canada   10           1
1       blank     Canada   10           0
2       sales     US       5            0
3       software  US       8            1
4       blank     US       10           1
42. Constructing an Inverted Index
        query                   country     impressions     clicks
doc id  blank  sales  software  Canada  US  5    8    10    0  1
0                     ✔         ✔                     ✔        ✔
1       ✔                       ✔                     ✔     ✔
2              ✔                        ✔   ✔               ✔
3                     ✔                 ✔        ✔             ✔
4       ✔                               ✔             ✔        ✔
43. Constructing an Inverted Index
field        term      0  1  2  3  4
query        blank        ✔        ✔
             sales           ✔
             software  ✔        ✔
country      Canada    ✔  ✔
             US              ✔  ✔  ✔
impressions  5               ✔
             8                  ✔
             10        ✔  ✔        ✔
clicks       0            ✔  ✔
             1         ✔        ✔  ✔
44. Inverted Index
field        term      doc list
query        blank     1, 4
             sales     2
             software  0, 3
country      Canada    0, 1
             US        2, 3, 4
impressions  5         2
             8         3
             10        0, 1, 4
clicks       0         1, 2
             1         0, 3, 4
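The construction walked through on slides 41 to 44 can be reproduced with a short Python sketch. The dict-of-dicts layout and the name `build_inverted_index` are assumptions for illustration; the documents are the five rows of the standard index on slide 41.

```python
from collections import defaultdict

# The five documents from the standard index on slide 41.
docs = [
    {"query": "software", "country": "Canada", "impressions": 10, "clicks": 1},
    {"query": "blank", "country": "Canada", "impressions": 10, "clicks": 0},
    {"query": "sales", "country": "US", "impressions": 5, "clicks": 0},
    {"query": "software", "country": "US", "impressions": 8, "clicks": 1},
    {"query": "blank", "country": "US", "impressions": 10, "clicks": 1},
]


def build_inverted_index(docs):
    """Map field -> term -> sorted list of doc ids."""
    index = defaultdict(lambda: defaultdict(list))
    # Visiting docs in ascending doc id order keeps every doc list sorted.
    for doc_id, doc in enumerate(docs):
        for field, term in doc.items():
            index[field][term].append(doc_id)
    # Sort the term list within each field.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}


index = build_inverted_index(docs)
print(index["country"])  # {'Canada': [0, 1], 'US': [2, 3, 4]}
```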
45. Inverted Indexes
Allow you to:
● Quickly find all documents containing
a term
● Intersect several terms to perform
boolean queries
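Because each doc list is sorted, an AND of two terms is a linear merge of their doc lists. The sketch below is a generic two-pointer intersection, not code from Lucene or Imhotep; the doc lists are taken from the inverted index on slide 44.

```python
def intersect(a, b):
    """Intersect two sorted doc id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more
        else:
            j += 1  # b[j] cannot appear in a any more
    return out


# country:US AND clicks:1, using the doc lists from slide 44.
us_docs = [2, 3, 4]
clicked_docs = [0, 3, 4]
print(intersect(us_docs, clicked_docs))  # [3, 4]
```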
46. Lucene
● Open source inverted index implementation
● Reasonably fast
● Widely used, well tested
47. Global and US Job Searches From 2014-03-09 to 2014-03-23 Per Day
[Chart: x-axis ticks at 2014-03-09, 2014-03-16, 2014-03-23]
48. Searches in the US only
field        term      doc list
query        blank     1, 4
             sales     2
             software  0, 3
country      Canada    0, 1
             US        2, 3, 4
impressions  5         2
             8         3
             10        0, 1, 4
clicks       0         1, 2
             1         0, 3, 4
50. Searches in the US only
field        term      doc list
country      Canada    0, 1
             US        2, 3, 4
51. Searches in the US only
Query Regroup
● Regroup all docs which do not match a boolean query to group zero
field        term      doc list
country      Canada    0, 1
             US        2, 3, 4
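A query regroup for "searches in the US only" might look like the sketch below. It assumes the boolean query has already been resolved to a list of matching doc ids (here, the doc list for country:US); the function name `query_regroup` is hypothetical.

```python
def query_regroup(groups, matching_doc_ids):
    """Move every doc that does not match the query to group 0."""
    matching = set(matching_doc_ids)
    return [
        group if doc_id in matching else 0
        for doc_id, group in enumerate(groups)
    ]


# Every document starts in group 1; the doc list for country:US is 2, 3, 4.
groups = [1, 1, 1, 1, 1]
us_docs = [2, 3, 4]
print(query_regroup(groups, us_docs))  # [0, 0, 1, 1, 1]
```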
52. Term Regroup
Splits docs in a group into one of two new groups based on presence/absence of a term
[Diagram: group 1 is split into groups 2 and 3, one for docs with country:US and one for everything else]
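A term regroup can be sketched the same way. The signature below is hypothetical, and which of the two new group numbers receives the matching docs is an assumption; the slide only shows that one group splits into two.

```python
def term_regroup(groups, doc_list, target_group, positive_group, negative_group):
    """Split target_group by presence of a term; other groups are unchanged.

    doc_list is the term's doc list from the inverted index.
    """
    has_term = set(doc_list)
    new_groups = []
    for doc_id, group in enumerate(groups):
        if group != target_group:
            new_groups.append(group)
        elif doc_id in has_term:
            new_groups.append(positive_group)
        else:
            new_groups.append(negative_group)
    return new_groups


# Split group 1 on country:US (doc list 2, 3, 4).
# Assumed numbering: matching docs go to group 2, everything else to group 3.
print(term_regroup([1, 1, 1, 1, 1], [2, 3, 4], 1, 2, 3))  # [3, 3, 2, 2, 2]
```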
56. Inverted Index Optimizations
● Compressed data structures
○ Better use of RAM and processor cache
○ Better use of memory bandwidth
○ Increased CPU usage and time
● Micro optimizations matter!
57. Delta / Varint Encoding
● Doc id lists are sorted
● Delta between a doc id and the previous doc
id is sufficient
● Deltas are usually small integers
● Use fewer bits for small integers and more bits for large integers
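A minimal sketch of delta plus varint encoding for a sorted doc id list. The byte layout is an assumption: 7 payload bits per byte, least significant group first, with the top bit set on every byte except the last (the scheme used by Protocol Buffers). The talk does not specify Imhotep's exact layout.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # top bit set: more bytes follow
        n >>= 7
    out.append(n)  # top bit clear: last byte of this varint
    return bytes(out)


def encode_doc_list(doc_ids):
    """Store each doc id as a varint of its delta from the previous doc id."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_doc_list(data):
    """Inverse of encode_doc_list."""
    doc_ids = []
    prev = value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value  # undo the delta
            doc_ids.append(prev)
            value = shift = 0
    return doc_ids


doc_ids = [3, 7, 300, 1000000]
encoded = encode_doc_list(doc_ids)
print(len(encoded))  # 7 bytes, versus 16 bytes as four 32-bit ints
```

The deltas here are 3, 4, 293, and 999700, which take 1, 1, 2, and 3 bytes respectively.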
77. Inverted Index Compression
Size of Organic Dataset for last 5 months
● Original: 102 TB
● Inverted: 51 TB
● Delta / Varint: 17 TB
78. Flamdex
● Two files per field (terms/docs)
● Can add fields without rebuilding index
● Faster varint decoding
● No term frequencies (TF) or positions (and no time wasted decoding them)
84. Vectorized Varint Decoding
01001010 11001000 01110001 01001110
10011011 01101010 10110101 00010111
01110110 10001101 10110011 11000001
pmovmskb: Extract top bit of each byte
010010100111
Lookup in 4096 entry lookup table
85. 010010100111
Pattern of leading bits determines:
● how many varints to decode
● sizes and offsets of varints
● length of longest varint in bytes
● number of bytes to consume
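To make the lookup concrete, here is a scalar Python sketch of what one table entry could contain. It assumes that a set top bit means "more bytes follow" and that the mask is read left to right in byte order, as printed on the slide. The real decoder does this with SSE instructions; the names `varint_layout` and `LOOKUP` are hypothetical.

```python
def varint_layout(mask_bits):
    """Describe the complete varints in a 12-byte window.

    mask_bits is a string of 12 characters, one per byte, holding each
    byte's top bit. Assumption: '1' = continuation, '0' = last byte.
    """
    lengths = []
    run = 0
    for bit in mask_bits:
        run += 1
        if bit == "0":  # this byte ends a varint
            lengths.append(run)
            run = 0
    # Trailing continuation bytes belong to a varint that is not complete
    # inside this window, so they are not consumed.
    offsets = [sum(lengths[:i]) for i in range(len(lengths))]
    return {
        "count": len(lengths),
        "lengths": lengths,
        "offsets": offsets,
        "longest": max(lengths, default=0),
        "consumed": sum(lengths),
    }


# One precomputed entry per possible 12-bit mask: 2**12 = 4096 entries.
LOOKUP = [varint_layout(format(mask, "012b")) for mask in range(4096)]

entry = LOOKUP[int("010010100111", 2)]
print(entry["count"], entry["lengths"], entry["consumed"])
# 6 [1, 2, 1, 2, 2, 1] 9
```

Under these assumptions the slide's mask describes six varints of 1 or 2 bytes each, using the first 9 of the 12 bytes.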
95. Vectorized Varint Decoding
● Decode six 1-2 byte varints in parallel
● Need to pad out all 1 byte varints to 2 bytes
pshufb: Intel SSSE3 instruction to shuffle
bytes
110. Term Stats Iterator
● For each term in a field, sum metrics across
all docs containing that term
● How do we compute this across many
machines?
112. [Animation, slides 112-120: three per-machine streams of (term, stat) pairs are merged term by term]
Stream 1: dallas 5, boston 12, austin 3, atlanta 16
Stream 2: dallas 8, chicago 19, austin 4, atlanta 12
Stream 3: chicago 9, boston 13, austin 7, atlanta 21
First merged result: atlanta 49 (16 + 12 + 21)
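The merge in this animation can be sketched as a k-way merge of sorted streams, summing stats for equal terms. This sketch uses Python's `heapq.merge` and assumes each machine returns its terms in ascending order; it illustrates the idea and is not Imhotep's implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_term_stats(streams):
    """Merge sorted (term, stat) streams, summing the stats per term.

    Only one entry per stream is held at a time, so the full result never
    needs to fit in memory on the merging machine.
    """
    merged = heapq.merge(*streams, key=itemgetter(0))
    for term, entries in groupby(merged, key=itemgetter(0)):
        yield term, sum(stat for _, stat in entries)


# The three streams from the slides, written in ascending term order.
streams = [
    [("atlanta", 16), ("austin", 3), ("boston", 12), ("dallas", 5)],
    [("atlanta", 12), ("austin", 4), ("chicago", 19), ("dallas", 8)],
    [("atlanta", 21), ("austin", 7), ("boston", 13), ("chicago", 9)],
]
result = list(merge_term_stats(streams))
print(result[0])  # ('atlanta', 49)
```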
176. Shard Distribution
● Lots of datasets for different event types
● Each dataset is split into one shard per time period (hour or day)
● Each shard has 2 replicas for fault tolerance
● How do we assign shards to machines?
178. Homogeneous vs. Heterogeneous
Systems
● Must decide how or if you will handle
heterogeneous hardware
● Cannot balance for both space and load on
heterogeneous hardware
186. Hot Spots
Maybe random is good enough?
On average about 10% more data read from
the most loaded machine than the least
187. Two Choice Randomized Load
Balancing
● 2 replicas of each shard to choose from
● Greedily choose the machine that currently
has the least load from this client
● On average about 1% more data read from
the most loaded machine than the least
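The greedy choice between two replicas can be sketched as below. It assumes every shard costs the same to read and that load is tracked only within one client session; the function name `choose_replicas` and the machine names are hypothetical.

```python
def choose_replicas(shard_replicas):
    """Pick one machine per shard, preferring the less loaded replica.

    shard_replicas is a list of (machine_a, machine_b) pairs, one pair per
    shard. Returns the chosen machines and this client's load per machine.
    """
    load = {}
    choices = []
    for machine_a, machine_b in shard_replicas:
        # Greedy step: compare the load this client has already placed on
        # each replica; ties go to the first replica.
        if load.get(machine_a, 0) <= load.get(machine_b, 0):
            pick = machine_a
        else:
            pick = machine_b
        load[pick] = load.get(pick, 0) + 1
        choices.append(pick)
    return choices, load


replicas = [("m1", "m2"), ("m1", "m3"), ("m1", "m2"), ("m2", "m3")]
choices, load = choose_replicas(replicas)
print(choices)  # ['m1', 'm3', 'm2', 'm2']
```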
189. Rendezvous Hashing
● Assignment of a shard to machines
determined only by the machines that exist
in the cluster
● Hash all pairs of shard ID and machine ID
and pick the largest two
190. Rendezvous Hashing
Shard ID: organic.2014-03-02T06:00:00
H(Shard ID + m1) = 0.592624
H(Shard ID + m2) = 0.294647
H(Shard ID + m3) = 0.736681
H(Shard ID + m4) = 0.647578
H(Shard ID + m5) = 0.835598
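A minimal sketch of rendezvous hashing with two replicas. SHA-256 and the scaling of the hash to [0, 1) are assumptions (the slides do not name the hash function), so the scores will not match the values on the slide.

```python
import hashlib


def score(shard_id, machine_id):
    """Hash a (shard, machine) pair to a float in [0, 1)."""
    digest = hashlib.sha256(f"{shard_id}+{machine_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign(shard_id, machines, replicas=2):
    """Return the machines with the highest scores for this shard."""
    ranked = sorted(machines, key=lambda m: score(shard_id, m), reverse=True)
    return ranked[:replicas]


machines = ["m1", "m2", "m3", "m4", "m5"]
shard_id = "organic.2014-03-02T06:00:00"
chosen = assign(shard_id, machines)
print(chosen)
```

Any client that knows the machine list computes the same assignment. Removing a machine that does not hold the shard leaves the shard's assignment unchanged, because the remaining scores do not change.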
194. Rendezvous Hashing
● No coordination required - deterministic
algorithm used to determine assignment
● No centralized storage for shard to machine
assignment
223. How You Can Use Imhotep
Data Ingestion
● TSV Uploader
● Hadoop
Data Access
● Imhotep Primitives
● IQL
224. Next @IndeedEng Talk
Large Scale Interactive Analytics
with Imhotep
Tom Bergman, Product Manager
Zak Cocos, Manager of Marketing Sciences
April 30, 2014
http://engineering.indeed.com/talks