The document describes Airbnb's search architecture. It discusses how Airbnb indexes over 800,000 listings across 190 countries using Lucene and maintains the index in real-time. It also covers the various components involved in ranking search results, including over 150 ranking signals, loading ranking models, and performing second pass ranking on results.
6. Search Backend
Technical Stack
____________________________
DropWizard as a service framework (incl. Jetty, Jersey, Jackson)
Guice dependency injection framework, Guava libraries, etc.
ZooKeeper (via Smartstack) for service discovery.
Lucene for index storage and simple retrieval.
In-house built real time indexing, ranking, advanced filtering.
7. Search Backend
Within one JVM:
~150 search threads
4 indexing threads
Data maintained by indexers:
Inverted Lucene index for retrieval
Forward index for ranking signals
Relevance models
8. Indexing
What’s in the Lucene index?
____________________________
Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy)
Categorical and numerical properties like room type and maximum occupancy
Calendar information
Full text (descriptions, reviews, etc.)
~40 fields per listing from a variety of data sources, all updated in real time
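To make the spatial indexing bullet more concrete: a prefix-tree strategy represents a position as a cell in a hierarchical grid, so that a coarse cell is a string prefix of every finer cell inside it. The sketch below is not Lucene code and not Airbnb's implementation. It is a standalone geohash encoder (one common prefix-tree grid), and the class and method names are illustrative assumptions.

```java
// Standalone illustration of prefix-tree cells; not Lucene's implementation.
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    private GeohashSketch() {}

    /** Encodes a latitude/longitude pair as a geohash with the given number of characters. */
    static String encode(double lat, double lon, int length) {
        double latLo = -90.0, latHi = 90.0;
        double lonLo = -180.0, lonHi = 180.0;
        StringBuilder hash = new StringBuilder(length);
        boolean useLon = true; // bits alternate between longitude and latitude, longitude first
        int bits = 0;
        int bitCount = 0;
        while (hash.length() < length) {
            if (useLon) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { bits = (bits << 1) | 1; lonLo = mid; }
                else            { bits = bits << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { bits = (bits << 1) | 1; latLo = mid; }
                else            { bits = bits << 1;       latHi = mid; }
            }
            useLon = !useLon;
            if (++bitCount == 5) {          // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(bits));
                bits = 0;
                bitCount = 0;
            }
        }
        return hash.toString();
    }
}
```

Because each extra character only narrows the previous cell, a query for a coarse cell can match all listings whose finer cells share that prefix.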
12. Indexing
SpinalTap
____________________________
Responsible for detecting updates happening to the ground truth data (no need to maintain search index invalidation logic in application code)
Tails binary update logs from MySQL servers (5.6+)
Converts them into actionable data objects, called “Mutations”
Broadcasts using a distributed queue, like Kafka or RabbitMQ
13. Indexing
# sources for mysql binary logs
sources:
  - name: airslave
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap
  - name: calendar_db
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap

destinations:
  - name: kafka
    clazzName: com.airbnb.spinaltap.destination.kafka.KafkaDestination

pipes:
  - name: search
    sources: ["airslave", "calendar_db"]
    tables: ["production:listings,calendar_db:schedule2s"]
    destination: kafka
SpinalTap Pipes
____________________________
Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka)
Configured via YAML files
14. Indexing
{
  "seq" : 3,
  "binlogpos" : "mysql-bin.000002:5217:5273",
  "id" : -1857589909002862756,
  "type" : 2,
  "table" : {
    "id" : 70,
    "name" : "users",
    "db" : "my_db",
    "columns" : [ {
      "name" : "name",
      "type" : 15,
      "ispk" : false
    }, {
      "name" : "age",
      "type" : 2,
      "ispk" : false
    } ]
  },
  "rows" : [ {
    "1" : {
      "name" : "eric",
      "age" : 31
    },
    "2" : {
      "name" : "eric",
      "age" : 28
    }
  } ]
}
SpinalTap Mutations
____________________________
Each binlog entry is parsed and converted into one of three event types: “Insert”, “Delete” or “Update”
“Insert” and “Delete” carry the entire row to be inserted or deleted
“Update” mutations contain both the old and the current row
Additional information: unique id, sequence number, column and table metadata
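As a rough illustration of the three mutation types described above, the following sketch models a mutation as a type plus an old and/or new row, and shows how a consumer might work out which columns an “Update” actually changed. The class, field and method names are assumptions made for this example and are not SpinalTap's real classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of a mutation; not the actual SpinalTap data object.
final class MutationSketch {
    enum Type { INSERT, DELETE, UPDATE }

    final Type type;
    final String table;
    final Map<String, Object> oldRow; // null for INSERT
    final Map<String, Object> newRow; // null for DELETE

    private MutationSketch(Type type, String table,
                           Map<String, Object> oldRow, Map<String, Object> newRow) {
        this.type = type;
        this.table = table;
        this.oldRow = oldRow == null ? null : new LinkedHashMap<>(oldRow);
        this.newRow = newRow == null ? null : new LinkedHashMap<>(newRow);
    }

    static MutationSketch insert(String table, Map<String, Object> row) {
        return new MutationSketch(Type.INSERT, table, null, row);
    }

    static MutationSketch delete(String table, Map<String, Object> row) {
        return new MutationSketch(Type.DELETE, table, row, null);
    }

    static MutationSketch update(String table, Map<String, Object> oldRow,
                                 Map<String, Object> newRow) {
        return new MutationSketch(Type.UPDATE, table, oldRow, newRow);
    }

    /** Columns whose values differ between the old and the new row; empty unless UPDATE. */
    List<String> changedColumns() {
        if (type != Type.UPDATE) {
            return Collections.emptyList();
        }
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Object> e : newRow.entrySet()) {
            if (!Objects.equals(oldRow.get(e.getKey()), e.getValue())) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```

Carrying both rows on an update lets a downstream consumer skip work when none of the columns it cares about changed.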
15. Indexing
Medusa
____________________________
Documents in index contain data from ~15 different source tables
Lucene needs a copy of all fields (not just fields that changed) to update the index
We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL
16. Indexing
Reads from SpinalTap or directly from MySQL
Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents
The intermediate Thrift objects are persisted in Redis
As changes are detected, updated objects are pushed to the Search instances to update Lucene indexes
Can bootstrap the entire index in 3 minutes via multithreaded streaming
Leader election via ZooKeeper
[Diagram labels: Medusa, PersistentStorage, Search1, Search2, …, SearchN]
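The reason for keeping an intermediate copy of each document can be sketched as follows: a change arriving from one source table only carries a few fields, but the index update needs the whole document. In this minimal sketch an in-memory map stands in for the persistent store, a plain map stands in for the Thrift object, and the publisher callback stands in for the push to the search instances. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of merge-then-publish; an in-memory map replaces the real store.
final class DocumentAssemblerSketch {
    private final Map<Long, Map<String, Object>> store = new HashMap<>();
    private final BiConsumer<Long, Map<String, Object>> publisher;

    DocumentAssemblerSketch(BiConsumer<Long, Map<String, Object>> publisher) {
        this.publisher = publisher;
    }

    /**
     * Merges the fields changed in one source table into the stored document,
     * then publishes a copy of the complete document for re-indexing.
     */
    synchronized Map<String, Object> apply(long listingId, Map<String, Object> changedFields) {
        Map<String, Object> doc = store.computeIfAbsent(listingId, id -> new HashMap<>());
        doc.putAll(changedFields);
        Map<String, Object> snapshot = new HashMap<>(doc); // copy so consumers cannot mutate the store
        publisher.accept(listingId, snapshot);
        return snapshot;
    }
}
```

With this shape, a calendar change and a listing change for the same id each produce a full document, without re-reading the other source tables.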
17. Ranking
Ranking Problem
____________________________
Not a text search problem
Users are almost never searching for a specific item; rather, they’re looking to “Discover”
The most common component of a query is location
Highly personalized – the user is a part of the query
Optimizing for conversion (Search -> Inquiry -> Booking)
Evolution through continuous experimentation
18. Ranking
Ranking Components
____________________________
Relevance
Quality
Bookability
Personalization
Desirability of location
New host promotion
etc.
19. Ranking
Several hundred signals determining search ranking:
Properties of the listing (reviews, location, etc.)
Behavioral signals (mined from request logs)
Image quality and clickability (computer vision)
Host behavior (response time/rate, cancellations, etc.)
Host preferences model
[Diagram labels: DB snapshots, Logs]
20. Ranking
public void attemptLoadData() {
    DateTime remoteTs = dataLoader.getModTime(pathToSignals);

    // Reload only when the remote file is newer than the copy currently held.
    if (currentTs == null || remoteTs.isAfter(currentTs)) {
        Map<K, D> newSignals = loadData();
        // The first load is accepted as-is; later loads must pass the health check.
        if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
            synchronized (this) {
                signalsMap = newSignals;
                currentTs = remoteTs;
                this.notifyAll(); // wake any threads waiting on this loader
            }
        } else {
            LOG.severe("Failed to load the avro file: " + pathToSignals);
        }
    }
}

…

ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader =
    loaders.get(LoaderCollection.Loader.QualitySignals);
final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);
Loading Signals
____________________________
Storing signals in a separate data structure
Pros:
Good fit for this type of update pattern: not real-time, but
almost everything changes on each load
No need for costly Lucene index rebuild
Greatly simplifies design
Cons:
Unable to use Lucene retrieval on such data
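Because almost everything changes on each load, a bad input file could replace a good signal set wholesale, which is presumably why the loader above guards the swap with a health check. The slides do not show what that check does. The following is one possible sketch, assuming a simple size-based rule: reject an empty load, and reject a load that shrank below a chosen fraction of the current data. The class name, the extra parameters and the rule itself are assumptions.

```java
import java.util.Map;

// Hypothetical health check; the real isHealthy(...) is not shown in the slides.
final class SignalHealthSketch {
    private SignalHealthSketch() {}

    /**
     * Accepts the candidate only if it is non-empty and holds at least
     * minRatio times as many entries as the current map (when there is one).
     */
    static <K, D> boolean isHealthy(Map<K, D> current, Map<K, D> candidate, double minRatio) {
        if (candidate == null || candidate.isEmpty()) {
            return false;
        }
        if (current == null || current.isEmpty()) {
            return true; // nothing to compare against yet
        }
        return candidate.size() >= current.size() * minRatio;
    }
}
```

A rule like this keeps the previous signals in service when a load looks truncated, at the cost of serving slightly stale data until the next successful load.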
21. Life of a Query
[Diagram: life of a query. Stage labels: Query Understanding, Retrieval, External Calls, Geocoding, Configuring retrieval options, Choosing ranking models, Populator, Scorer, Quality, Bookability, Relevance, Filtering and Reranking, Third Pass Ranking, Pricing Service, Social Connections, Result Generation, AirEvents Logging. Result counts shown between stages: 2000 results, 2000 results, 25 results, 25 results]
22. Ranking
Second Pass Ranking
____________________________
Traditional ranking works like this:
[formula not preserved: each result receives its own score]
then sort by that score
In contrast, second pass operates on the entire list at once:
[formula not preserved]
Makes it possible to implement features like result diversity, etc.
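The difference between the two approaches can be sketched in code. The first method below sorts results by a score that was computed for each result on its own. The second looks at the whole list and greedily penalizes results from a neighborhood that has already been picked, which is one simple way to get diversity. The penalty rule, the neighborhood field and all names are assumptions for illustration; the slides do not say how Airbnb's second pass is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch contrasting per-result sorting with list-wise reranking.
final class SecondPassSketch {
    static final class Result {
        final int listingId;
        final String neighborhood;
        final double score; // assumed finite

        Result(int listingId, String neighborhood, double score) {
            this.listingId = listingId;
            this.neighborhood = neighborhood;
            this.score = score;
        }
    }

    private SecondPassSketch() {}

    /** Traditional ranking: every result already has its own score, so just sort descending. */
    static List<Result> sortByScore(List<Result> results) {
        List<Result> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((Result r) -> r.score).reversed());
        return sorted;
    }

    /**
     * List-wise pass: repeatedly pick the result with the best adjusted score, where the
     * adjustment subtracts a penalty for each already-picked result in the same neighborhood.
     */
    static List<Result> diversify(List<Result> results, double penalty) {
        List<Result> remaining = sortByScore(results);
        List<Result> picked = new ArrayList<>();
        Map<String, Integer> seen = new HashMap<>();
        while (!remaining.isEmpty()) {
            Result best = remaining.get(0);
            double bestAdjusted = best.score - penalty * seen.getOrDefault(best.neighborhood, 0);
            for (Result r : remaining) {
                double adjusted = r.score - penalty * seen.getOrDefault(r.neighborhood, 0);
                if (adjusted > bestAdjusted) {
                    bestAdjusted = adjusted;
                    best = r;
                }
            }
            remaining.remove(best);
            picked.add(best);
            seen.merge(best.neighborhood, 1, Integer::sum);
        }
        return picked;
    }
}
```

The second method cannot be expressed as a per-result score followed by a sort, because the value of each result depends on what has already been placed above it.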
28. Outside of the scope of this talk
____________________________
Ranking models
Machine Learning infrastructure
Tools (loadtest, deploy, etc.)
Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.