@IndeedEng: Imhotep - Large Scale Analytics and Machine Learning at Indeed

Imhotep
Large Scale Analytics and
Machine Learning at Indeed

Jeff Plaisance
Engineering Manager

Indeed is a
Search Engine for Jobs

Indeed is a data driven
organization

Data driven organizations
need great tools

What does Imhotep allow you to do?
● Decision Tree Building
● Analytics

Indeed’s Analytics Philosophy
Analytics systems should be:
1. Interactive
2. Not Sampled
3. Not Approximate

Imhotep answers questions
What was the weekly average query time in the
last quarter from people doing the query
“software”?

What percent of jobsearch results pages are for
page 2 and beyond?

What are the 5 most common queries in each
country?

Total Job Searches From 2014-03-09
to 2014-03-23
?

Document
query: “indeed software engineer”
location: “austin”
impressions: 10
clicks: 2
time: 2014-03-17T12:00:00

Shard
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14

Server
[Figure: a server holding six shards, each containing documents, labeled 2014/03/02, 2014/03/09, 2014/03/11, 2014/03/12, 2014/03/22, 2014/03/24]

Cluster
2014-03-02
Server A
2014-03-03
Server B
2014-03-04
Server C

Cluster
Client
Session
2014-03-02
Server A
2014-03-03
Server B
2014-03-04
Server C

Total Job Searches From 2014-03-09
to 2014-03-23
secret

Total Job Searches From 2014-03-09
to 2014-03-23 Per Day
[Chart: x-axis dates 2014-03-09, 2014-03-16, 2014-03-23]

Metrics
● 64 bit integers
● Exactly one value per doc
● Random access by doc id

Metrics
● Time
● Clicks
● Impressions
● Revenue
● … or anything else that is a number

Groups
● Documents are placed into numbered
groups
● Every document starts in group 1
● Group 0 means “filtered out”

Groups
● Groups are stateful and scoped to a session
● Regroup operations update group for each
doc in shard

Metric Regroup
● Iterate over doc_id->metric lookup
● Set group to
(value - start) / bucket_width
● Useful for making graphs (buckets on x-axis)
[Figure: the range from start to end divided into buckets 1 to 5, each one bucket width wide]

Get Group Stats
● For each group, sums a metric for all docs in
that group

Bucket By Day
1. Regroup on time metric
2. Get Group Stats for count metric (always 1)
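The group, regroup, and group stats primitives above can be modeled in a few lines of Python. This is a minimal in-memory sketch, not Imhotep's API: the names `metric_regroup`, `get_group_stats`, and `ts` are hypothetical, the timestamps are made up, and the `+ 1` offset (so that group 0 stays reserved for filtered-out docs) is an assumption that the slide formula does not state.

```python
from datetime import datetime, timezone

DAY = 86_400  # seconds per day


def metric_regroup(groups, metric, start, end, bucket_width):
    """Return new groups: bucket docs by metric value; 0 means filtered out."""
    new_groups = []
    for group, value in zip(groups, metric):
        if group == 0 or not (start <= value < end):
            new_groups.append(0)
        else:
            # Assumption: +1 keeps group 0 reserved for "filtered out".
            new_groups.append((value - start) // bucket_width + 1)
    return new_groups


def get_group_stats(groups, metric, num_groups):
    """Sum a metric over the docs in each group (list index = group number)."""
    totals = [0] * (num_groups + 1)
    for group, value in zip(groups, metric):
        totals[group] += value
    return totals


def ts(year, month, day, hour=0):
    """Unix timestamp (UTC) for a made-up example document."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())


# Made-up documents: one timestamp per doc (the "time" metric).
time_metric = [ts(2014, 3, 9, 8), ts(2014, 3, 9, 17), ts(2014, 3, 10, 12),
               ts(2014, 3, 11, 1), ts(2014, 3, 8, 23)]
count_metric = [1] * len(time_metric)  # the "count" metric is always 1
groups = [1] * len(time_metric)        # every document starts in group 1

start, end = ts(2014, 3, 9), ts(2014, 3, 12)
groups = metric_regroup(groups, time_metric, start, end, DAY)
per_day = get_group_stats(groups, count_metric, num_groups=3)
print(groups)       # [1, 1, 2, 3, 0]
print(per_day[1:])  # [2, 1, 1]
```

The last document falls before `start`, so it lands in group 0 and is left out of the per-day counts.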

Total and US Job Searches From
2014-03-09 to 2014-03-23 Per Day
[Chart: x-axis dates 2014-03-09, 2014-03-16, 2014-03-23]

Inverted Index
● Like index in the back of a book
● words = terms, page numbers = doc ids
● Term list is sorted
● Doc list for each term is sorted

Standard Index
doc id  query     country  impressions  clicks
0       software  Canada   10           1
1       blank     Canada   10           0
2       sales     US       5            0
3       software  US       8            1
4       blank     US       10           1

Constructing an Inverted Index
        query                   country     impressions  clicks
doc id  blank  sales  software  Canada  US  5   8   10   0   1
0                     ✔         ✔                   ✔        ✔
1       ✔                       ✔                   ✔    ✔
2              ✔                        ✔   ✔            ✔
3                     ✔                 ✔       ✔            ✔
4       ✔                               ✔           ✔        ✔

Constructing an Inverted Index
field        term      0  1  2  3  4
query        blank        ✔        ✔
             sales           ✔
             software  ✔        ✔
country      Canada    ✔  ✔
             US              ✔  ✔  ✔
impressions  5               ✔
             8                  ✔
             10        ✔  ✔        ✔
clicks       0            ✔  ✔
             1         ✔        ✔  ✔

Inverted Index
field        term      doc list
query        blank     1, 4
             sales     2
             software  0, 3
country      Canada    0, 1
             US        2, 3, 4
impressions  5         2
             8         3
             10        0, 1, 4
clicks       0         1, 2
             1         0, 3, 4

Inverted Indexes
Allow you to:
● Quickly find all documents containing
a term
● Intersect several terms to perform
boolean queries
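To make the construction concrete, here is a small sketch that builds an inverted index for the five example documents and intersects two doc lists. It is illustrative only: `build_inverted_index` and `intersect` are hypothetical names, and real implementations such as Lucene store terms and doc lists in compressed on-disk structures rather than Python dictionaries.

```python
from collections import defaultdict

# The five example documents from the Standard Index table.
docs = [
    {"query": "software", "country": "Canada", "impressions": 10, "clicks": 1},
    {"query": "blank", "country": "Canada", "impressions": 10, "clicks": 0},
    {"query": "sales", "country": "US", "impressions": 5, "clicks": 0},
    {"query": "software", "country": "US", "impressions": 8, "clicks": 1},
    {"query": "blank", "country": "US", "impressions": 10, "clicks": 1},
]


def build_inverted_index(documents):
    """Map field -> term -> sorted list of doc ids."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(documents):  # ascending ids keep lists sorted
        for field, term in doc.items():
            index[field][term].append(doc_id)
    return index


def intersect(a, b):
    """Intersect two sorted doc id lists with a linear two-pointer scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


index = build_inverted_index(docs)
print(index["country"]["US"])  # [2, 3, 4]
# Boolean query: query:software AND country:US
print(intersect(index["query"]["software"], index["country"]["US"]))  # [3]
```

Because both doc lists are sorted, the intersection never needs to look at a document twice.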

Lucene
● Open source inverted index implementation
● Reasonably fast
● Widely used, well tested

Global and US Job Searches From
2014-03-09 to 2014-03-23 Per Day
2014-03-09 2014-03-16 2014-03-23

Searches in the US only
field    term    doc list
country  Canada  0, 1
         US      2, 3, 4

Query Regroup
● Regroup all docs which do not match a
boolean query to group zero
field    term    doc list
country  Canada  0, 1
         US      2, 3, 4

Term Regroup
Splits docs in a group into one of two new
groups based on presence/absence of a term
[Figure: group 1 splits into groups 2 and 3, one for country:US and one for everything else]

Multiterm Regroup
Generalization of term regroup to N terms and
N+1 new groups
[Figure: group 1 splits into groups 2 to 5, one each for country:US, country:CA, country:FR, and everything else]
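A term regroup and its multiterm generalization can be sketched as one function. This is a hedged illustration, not Imhotep's implementation: `multiterm_regroup` is a hypothetical name, and the choice to number the new groups upward from the first unused group number is an assumption.

```python
def multiterm_regroup(groups, doc_terms, target_group, terms):
    """Split target_group into len(terms) + 1 new groups.

    Docs in target_group whose term equals terms[i] move to their own new
    group; all other docs in target_group move to one "everything else"
    group. Docs outside target_group (including group 0) are untouched.
    """
    base = max(groups) + 1  # assumption: first unused group number
    term_to_group = {term: base + i for i, term in enumerate(terms)}
    else_group = base + len(terms)
    return [
        term_to_group.get(term, else_group) if group == target_group else group
        for group, term in zip(groups, doc_terms)
    ]


# One country term per doc (made-up data); the last doc was already filtered out.
country = ["US", "CA", "FR", "DE", "US"]
groups = [1, 1, 1, 1, 0]

print(multiterm_regroup(groups, country, 1, ["US", "CA", "FR"]))  # [2, 3, 4, 5, 0]
# A plain term regroup is the N = 1 case.
print(multiterm_regroup(groups, country, 1, ["US"]))              # [2, 3, 3, 3, 0]
```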

Inverted Index Compression
Size of Organic Dataset for last 5 months
● Original: 102 TB
● Inverted: 51 TB

Inverted Index Optimizations
● Compressed data structures
○ Better use of RAM and processor cache
○ Better use of memory bandwidth
○ Increased CPU usage and time
● Micro optimizations matter!

Delta / Varint Encoding
● Doc id lists are sorted
● Delta between a doc id and the previous doc
id is sufficient
● Deltas are usually small integers
● Use fewer bits for small integers and more bits
for large integers

Delta Encoding
field term doc list
query nursing 34, 86, 247, 301, 674, 714
34, 52, 161, 54, 373, 40
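The delta step is easy to check by hand or in code. A minimal sketch (the function names are illustrative) that reproduces the second row from the first and then inverts it:

```python
from itertools import accumulate


def delta_encode(doc_ids):
    """Replace each sorted doc id with its gap from the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def delta_decode(deltas):
    """Recover the doc ids with a running sum over the gaps."""
    return list(accumulate(deltas))


doc_ids = [34, 86, 247, 301, 674, 714]
deltas = delta_encode(doc_ids)
print(deltas)                # [34, 52, 161, 54, 373, 40]
print(delta_decode(deltas))  # [34, 86, 247, 301, 674, 714]
```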

Small Integer Compression
● Golomb/Rice
● Varint
● Bit Packing
● PForDelta

Varint Encoding
9838 as a 32 bit integer:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0
9838 as a 2 byte varint (low 7 bits first; the top bit of each byte is 1 when more bytes follow):
1 1 1 0 1 1 1 0
0 1 0 0 1 1 0 0
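The byte layout shown for 9838 can be reproduced with a straightforward scalar encoder and decoder. This is a sketch of the byte-at-a-time approach, with illustrative function names; it assumes the same layout as the example, with the low 7 bits stored first.

```python
def varint_encode(value):
    """Encode a non-negative int, low 7 bits first; top bit = more bytes follow."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def varint_decode(data, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


encoded = varint_encode(9838)
print([format(b, "08b") for b in encoded])  # ['11101110', '01001100']
print(varint_decode(encoded))               # (9838, 2)
```

The loop in `varint_decode` branches on every byte, which is the cost the vectorized decoder later in the talk is designed to avoid.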

Inverted Index Compression
Size of Organic Dataset for last 5 months
● Original: 102 TB
● Inverted: 51 TB
● Delta / Varint: 17 TB

Flamdex
● Two files per field (terms/docs)
● Can add fields without rebuilding index
● Faster varint decoding
● No TF or positions (or wasted time decoding
them)

Varints
Pros:
● Compression
● Can fit more of index in RAM
● Higher information throughput per byte read
from disk

Varints
Cons:
● Decodes one byte at a time
● Lots of branch mispredictions
● Not fast to decode

Vectorized Varint Decoding
01001010 11001000 01110001 01001110
10011011 01101010 10110101 00010111
01110110 10001101 10110011 11000001
pmovmskb: Extract top bit of each byte
010010100111
Lookup in 4096 entry lookup table

010010100111
Pattern of leading bits determines:
● how many varints to decode
● sizes and offsets of varints
● length of longest varint in bytes
● number of bytes to consume

010010100111
Decoding options for:
● up to twelve 1 byte varints
● six 1-2 byte varints
● four 1-3 byte varints
● two 1-5 byte varints

● Decode six 1-2 byte varints in parallel
● Need to pad out all 1 byte varints to 2 bytes
pshufb: Intel SSSE3 instruction to shuffle
bytes

Decode 6 varints from 9 bytes
01001010 11001000 01110001 01001110
10011011 01101010 10110101 00010111
01110110 10001101 10110011 11000001

Pad out 1 byte ints to 2 bytes
01001010 00000000 11001000 01110001
01001110 00000000 10011011 01101010
10110101 00010111 01110110 00000000

Reverse bytes in 2 byte varints
00000000 01001010 01110001 11001000
00000000 01001110 01101010 10011011
00010111 10110101 00000000 01110110

Mask out leading 1’s (the continuation bits)
00000000 01001010 01110001 01001000
00000000 01001110 01101010 00011011
00010111 00110101 00000000 01110110

Shift top bytes of each varint 1 bit right
(mask/shift/or)
00000000 01001010 00111000 11001000
00000000 01001110 00110101 00011011
00001011 10110101 00000000 01110110

● ~10 instructions
● No branches
● Less than 2 instructions per varint
● Imhotep spends ~40% of its CPU time
decoding varints
● Vectorized decoder ~3-5x faster
○ Decompresses at 1.5 GB per second
○ ~2x overall system performance
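The pad, reverse, mask, and shift steps can be traced in plain Python on the twelve example bytes. This is a scalar model for following the arithmetic, not the SSE implementation: `movemask` stands in for pmovmskb, `plan_for` stands in for one entry of the 4096 entry lookup table, the list comprehension over `shuffle` stands in for pshufb, and all three names are hypothetical. It only covers the "six 1-2 byte varints" case; in the real decoder each step is applied to all lanes at once.

```python
data = bytes([
    0b01001010, 0b11001000, 0b01110001, 0b01001110,
    0b10011011, 0b01101010, 0b10110101, 0b00010111,
    0b01110110, 0b10001101, 0b10110011, 0b11000001,
])


def movemask(block):
    """Model of pmovmskb: collect the top bit of each byte."""
    return "".join(str(b >> 7) for b in block)


def plan_for(mask):
    """Stand-in for a lookup-table entry: shuffle indices and bytes consumed.

    Assumes the mask describes six varints of 1 or 2 bytes each.
    Index -1 means "insert a zero byte" (padding a 1 byte varint).
    """
    shuffle, pos = [], 0
    for _ in range(6):
        if mask[pos] == "1":   # continuation bit set: 2 byte varint
            shuffle += [pos, pos + 1]
            pos += 2
        else:                  # 1 byte varint, padded to 2 bytes
            shuffle += [pos, -1]
            pos += 1
    return shuffle, pos


def decode_six(block):
    """Decode six 1-2 byte varints; return (values, bytes_consumed)."""
    shuffle, consumed = plan_for(movemask(block))
    padded = [block[i] if i >= 0 else 0 for i in shuffle]  # pshufb model
    values = []
    for low, high in zip(padded[0::2], padded[1::2]):
        word = (high << 8) | low  # reverse bytes: most significant first
        word &= 0x7F7F            # mask out the continuation bits
        # Shift the top byte right by one bit and OR it with the low byte.
        values.append(((word & 0x7F00) >> 1) | (word & 0x007F))
    return values, consumed


print(movemask(data))    # 010010100111
print(decode_six(data))  # ([74, 14536, 78, 13595, 2997, 118], 9)
```

The six decoded values match the final rows of the walkthrough when each 16 bit pair is read as one integer.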

Term Stats
atlanta 49
austin 14
boston 25
chicago 28
dallas 13
houston 36
new york 68
san francisco 54

Term Stats Iterator
● For each term in a field, sum metrics across
all docs containing that term
● How do we compute this across many
machines?

[Animation: three term-sorted input streams are merged one term at a time, summing the stat for each term]
Stream 1: atlanta 16, austin 3, boston 12, dallas 5
Stream 2: atlanta 12, austin 4, chicago 19, dallas 8
Stream 3: atlanta 21, austin 7, boston 13, chicago 9
Merged: atlanta 49, austin 14, boston 25, chicago 28, dallas 13
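The merge shown in the animation is a k-way merge of sorted streams followed by a per-term sum. A minimal sketch with the Python standard library, using the three example streams (the name `merge_term_stats` and the machine labels are illustrative):

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_term_stats(*streams):
    """Merge term-sorted (term, stat) streams, summing stats per term."""
    merged = heapq.merge(*streams, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        yield term, sum(stat for _, stat in group)


machine_a = [("atlanta", 16), ("austin", 3), ("boston", 12), ("dallas", 5)]
machine_b = [("atlanta", 12), ("austin", 4), ("chicago", 19), ("dallas", 8)]
machine_c = [("atlanta", 21), ("austin", 7), ("boston", 13), ("chicago", 9)]

result = list(merge_term_stats(machine_a, machine_b, machine_c))
print(result)
# [('atlanta', 49), ('austin', 14), ('boston', 25), ('chicago', 28), ('dallas', 13)]
```

Because every input is already sorted by term, the merge streams its output and never holds more than one pending entry per input.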

[Figure: merge tree]
TS 1 TS 2 TS 3 TS 4 TS 5 TS 6
Term Stats 1-6

TS 1-6 TS 7-12 TS 13-18
Term Stats 1-18

Amdahl’s Law
● The speedup of a program using multiple
processors is limited by the time needed for
the sequential fraction of the program
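As a worked illustration of the law, the usual formula is speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. The 90% figure below is a made-up example, not a measurement from Imhotep.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup when only parallel_fraction of the work can use all processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


print(amdahl_speedup(0.9, 10))      # about 5.26
print(amdahl_speedup(0.9, 10_000))  # just under 10
```

With 10% of the work sequential, no number of processors can push the speedup past 10x.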

Amdahl’s Law
● Sequential part of FTGS is last step in
merge
● Can we distribute some part of the final
merge?

Hash Partition + Interleave
● Send all stats for each unique term to the
same thread based on a hash of the term
● Interleave merged terms

[Animation: terms are hash partitioned across two merge threads, each thread merges its own terms, and the two sorted outputs are interleaved]
Thread 1 inputs: atlanta 16, boston 12, dallas 5 | atlanta 12, dallas 8 | atlanta 21, boston 13
Thread 1 merged: atlanta 49, boston 25, dallas 13
Thread 2 merged: austin 14, chicago 28
Interleaved: atlanta 49, austin 14, boston 25, chicago 28, dallas 13
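Hash partition plus interleave can be sketched as follows. This is an illustration under stated assumptions rather than Imhotep's code: the function names are hypothetical, `zlib.crc32` is an arbitrary stand-in for the term hash, and the partitions are merged in a plain loop here where the real system would use one thread per partition.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition_of(term, num_threads):
    # Assumption: any stable hash works; crc32 keeps runs repeatable.
    return zlib.crc32(term.encode("utf-8")) % num_threads


def merge_partition(streams, part, num_threads):
    """What one merge thread does: merge only the terms hashed to `part`."""
    mine = [[(t, s) for t, s in stream if partition_of(t, num_threads) == part]
            for stream in streams]
    merged = heapq.merge(*mine, key=itemgetter(0))
    return [(term, sum(s for _, s in grp))
            for term, grp in groupby(merged, key=itemgetter(0))]


def hash_partition_merge(streams, num_threads=2):
    # Each partition is independent, so each could run on its own thread.
    parts = [merge_partition(streams, p, num_threads) for p in range(num_threads)]
    # Interleave: every term lives in exactly one partition, so this final
    # sequential step only compares terms and never sums anything.
    return list(heapq.merge(*parts, key=itemgetter(0)))


streams = [
    [("atlanta", 16), ("austin", 3), ("boston", 12), ("dallas", 5)],
    [("atlanta", 12), ("austin", 4), ("chicago", 19), ("dallas", 8)],
    [("atlanta", 21), ("austin", 7), ("boston", 13), ("chicago", 9)],
]
print(hash_partition_merge(streams))
# [('atlanta', 49), ('austin', 14), ('boston', 25), ('chicago', 28), ('dallas', 13)]
```

The result is the same as a single k-way merge; only the summing work has moved out of the sequential step.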

Shard Distribution
● Lots of datasets for different event types
● Each dataset is split into one shard per
(hour/day)
● Each shard has 2 replicas for fault tolerance
● How do we assign shards to machines?

Shard Distribution Considerations
● Space
● Load
● Hot Spots
● Adding/Removing machines

Homogeneous vs. Heterogeneous
Systems
● Must decide how or if you will handle
heterogeneous hardware
● Cannot balance for both space and load on
heterogeneous hardware

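[Figure: two machines, one with 1 TB of storage and one with 3 TB]

Balanced for space
3 TB machine: 12 shards, 50% capacity used (read hotspot)
1 TB machine: 4 shards, 50% capacity used

Balanced for load
3 TB machine: 8 shards, 33% capacity used (wasted space)
1 TB machine: 8 shards, 100% capacity used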

Hot Spots
When accessing any subset of a dataset,
evenly spread the load across CPUs, drives,
network cards
This is hard

Hot Spots
Maybe random is good enough?
On average about 10% more data read from
the most loaded machine than the least

Two Choice Randomized Load
Balancing
● 2 replicas of each shard to choose from
● Greedily choose the machine that currently
has the least load from this client
● On average about 1% more data read from
the most loaded machine than the least
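The difference between picking a replica at random and picking the less loaded of two replicas can be explored with a small simulation. Everything here is hypothetical (10 machines, 1000 equally sized shards, load counted in shards rather than bytes read), so it illustrates the idea and is not expected to reproduce the 10% and 1% figures from the talk.

```python
import random


def assign_random(shard_replicas, rng):
    """Pick one of the replicas of each shard uniformly at random."""
    load = {}
    for replicas in shard_replicas:
        machine = rng.choice(replicas)
        load[machine] = load.get(machine, 0) + 1
    return load


def assign_two_choice(shard_replicas):
    """Greedily pick the replica whose machine has the least load so far."""
    load = {}
    for replicas in shard_replicas:
        machine = min(replicas, key=lambda m: load.get(m, 0))
        load[machine] = load.get(machine, 0) + 1
    return load


rng = random.Random(0)  # fixed seed so the run is repeatable
machines = list(range(10))
# Each shard has 2 replicas on distinct machines chosen at random.
shard_replicas = [tuple(rng.sample(machines, 2)) for _ in range(1000)]

random_load = assign_random(shard_replicas, rng)
two_choice_load = assign_two_choice(shard_replicas)
for name, load in [("random", random_load), ("two choice", two_choice_load)]:
    print(name, "max:", max(load.values()), "min:", min(load.values()))
```

The load tracked here is per client session, which matches the "least load from this client" wording: no global view of the cluster is needed.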

Rendezvous Hashing
● Assignment of a shard to machines
determined only by the machines that exist
in the cluster
● Hash all pairs of shard ID and machine ID
and pick the largest two

Rendezvous Hashing
Shard ID: organic.2014-03-02T06:00:00
H(Shard ID + m1) = 0.592624
H(Shard ID + m2) = 0.294647
H(Shard ID + m3) = 0.736681
H(Shard ID + m4) = 0.647578
H(Shard ID + m5) = 0.835598

Rendezvous Hashing
[Figure: machines placed on a line from 0 to 1 by hash value; from highest to lowest: m5, m3, m4, m1, m2]

Rendezvous Hashing
● No coordination required - deterministic
algorithm used to determine assignment
● No centralized storage for shard to machine
assignment
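A minimal sketch of rendezvous hashing for two replicas follows. The hash function H is not named in the talk, so SHA-256 mapped to a number between 0 and 1 is an assumption, as are the names `score` and `assign` and the `|` separator between shard ID and machine ID.

```python
import hashlib


def score(shard_id, machine_id):
    """Hash a (shard, machine) pair to a number between 0 and 1."""
    digest = hashlib.sha256(f"{shard_id}|{machine_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign(shard_id, machines, replicas=2):
    """Return the machines with the largest scores for this shard."""
    ranked = sorted(machines, key=lambda m: score(shard_id, m), reverse=True)
    return ranked[:replicas]


machines = ["m1", "m2", "m3", "m4", "m5"]
shard = "organic.2014-03-02T06:00:00"
owners = assign(shard, machines)
print(owners)

# Removing a machine that does not hold the shard changes nothing.
spare = next(m for m in machines if m not in owners)
assert assign(shard, [m for m in machines if m != spare]) == owners

# Adding a machine can only move the shard onto the new machine.
assert set(assign(shard, machines + ["m6"])) <= set(owners) | {"m6"}
```

Any client that knows the machine list computes the same owners, which is why no coordination or stored assignment table is needed.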

Rendezvous Hashing
Expected max hash for a shard is
[formula not preserved]
Probability that new machine will get shard
[formula not preserved]
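The slide text names two quantities without showing their formulas. As a hedged reconstruction, assuming H produces independent values uniformly distributed on [0, 1] and the cluster has n machines (both assumptions, and possibly not the exact expressions shown in the talk), the standard results are:

```latex
% Expected largest of n independent uniform hashes
\mathbb{E}\Big[\max_{1 \le i \le n} H_i\Big]
  = \int_0^1 x \cdot n x^{n-1} \, dx
  = \frac{n}{n+1}

% A machine added to a cluster of n machines holds the largest of the
% n + 1 hashes with probability
\Pr[\text{new machine has the largest hash}] = \frac{1}{n+1}

% With two replicas per shard, it is among the two largest with probability
\Pr[\text{new machine is in the top two}] = \frac{2}{n+1}
```

For the five-machine example, n / (n + 1) = 5/6 ≈ 0.83, which is close to the largest hash shown (0.835598).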

1. Query Regroup on query:software
2. Metric Regroup on time, width 7 days
3. Get Group Stats on query time and count,
divide after summing

1. Get Group Stats on count
2. Query Regroup on “-page:1”
3. Get Group Stats on count
4. Divide -page:1 count by total count

1. Multiterm Regroup on all values of country
2. Term Group Stats Iteration on query

IQL
select count()
from jobsearch
'2014-01-01'
'2014-03-26'
group by country, query[5]
[Slide build: the parts of the query are highlighted in turn and labeled Metrics, Dataset, Regroup, and Term Group Stats]

Imhotep
Large Scale Analytics and Machine
Learning
● Varint Decoding:
High Performance Vector Instructions
● Stream Merging: Hash Partition +
Interleave
● Shard Distribution: Rendezvous Hashing

How You Can Use Imhotep
Data Ingestion
● TSV Uploader
● Hadoop
Data Access
● Imhotep Primitives
● IQL

Next @IndeedEng Talk
Large Scale Interactive Analytics
with Imhotep
Tom Bergman, Product Manager
Zak Cocos, Manager of Marketing Sciences
April 30, 2014
http://engineering.indeed.com/talks
