3. Real Time Bidding (RTB)
● Real-time bidding is a dynamic auction process where each impression is bid for in (near) real time, as opposed to a static auction
● Kenshoo is engaged in Facebook Exchange (FBX)
● In FBX, each bid has a lifetime of 120ms. All transactions have to complete within that period, and the winning ad is presented to the user (see the sketch after this list)
● Kenshoo employs ad re-targeting, where search engine campaigns are extended to the social network, giving a much higher ROI for our customers
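To make the 120ms constraint concrete, below is a minimal sketch of a deadline-bounded bid handler. Only the 120ms budget comes from the deck; the handler function, thread pool, and the 20ms network headroom are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BID_LIFETIME_MS = 120       # FBX bid lifetime, per the deck
NETWORK_HEADROOM_MS = 20    # assumed headroom for network latency

executor = ThreadPoolExecutor(max_workers=32)

def compute_bid(bid_request):
    # Hypothetical placeholder for the real work, e.g. a Cassandra
    # read of the re-targeting profile plus price computation.
    return {"price": 0.42, "ad_id": "ad-123"}

def handle_bid(bid_request):
    """Return a bid within the 120ms lifetime, or None to pass on the auction."""
    future = executor.submit(compute_bid, bid_request)
    budget_s = (BID_LIFETIME_MS - NETWORK_HEADROOM_MS) / 1000.0
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()
        return None  # too late: the exchange has already moved on

if __name__ == "__main__":
    print(handle_bid({"user_id": "u1"}))
```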
8. Requirements
● Handle 25K+ requests within the 120ms bid time-frame, including network latencies
● Ability to scale up to 1M requests per minute while keeping the current latency
● Handle ~10K writes/second with low latency
● Multi-DC Configuration; all nodes must be synced in real time
● Seamless Operations: Compactions and Repairs
● High Security
9. C* Physical Architecture
[Diagram: two-region physical architecture. App servers in the (US) West and (US) East regions reach FBX WEST and FBX EAST over the Internet; the two regions are linked by a GRE VPN.]
10. C* Cluster Information
● Cassandra version 1.2.6
● Oracle Java 7
● Manual tokens (vnodes are coming soon)
● Multi-DC Configuration (see the sketch after this list)
● Network Topology
● DC connectivity between VPCs via Linux GRE
● Amazon c3.2xlarge instance type
● Ubuntu 13.10 with ext4
● SSD (ephemeral)
[Diagram: the ring]
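As a rough illustration of the multi-DC setup above, here is a minimal sketch that defines a keyspace replicated across two datacenters with NetworkTopologyStrategy, using the DataStax Python driver. The contact point, keyspace name, datacenter names, and replication factors are assumptions; with a non-EC2 snitch, the DC names must match those defined in the snitch configuration.

```python
# pip install cassandra-driver
from cassandra.cluster import Cluster

# Contact point is hypothetical.
cluster = Cluster(["10.0.0.10"])
session = cluster.connect()

# Keyspace name, DC names, and replication factors are assumptions.
# (Re-running this raises AlreadyExists; C* 1.2 has no IF NOT EXISTS.)
session.execute("""
    CREATE KEYSPACE bidding
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_west': 3,
        'us_east': 3
    }
""")
cluster.shutdown()
```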
11. C* Cluster Network Between Sites
● For security reasons we:
  ○ Do not use Ec2Snitch or Ec2MultiRegionSnitch
  ○ Connect the nodes via VPN (Linux GRE)
● Linux GRE is fast, reliable, and provides high throughput (~1 Gb/s); a setup sketch follows below
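A minimal sketch of bringing up such a GRE tunnel with the standard Linux iproute2 tools, driven from Python. All addresses and the interface name are hypothetical; each endpoint runs this as root with local/remote swapped. Note that plain GRE tunnels traffic but does not encrypt it (consistent with the lesson on slide 16).

```python
import subprocess

LOCAL_PUBLIC_IP = "203.0.113.10"    # this node's public IP (assumed)
REMOTE_PUBLIC_IP = "198.51.100.20"  # peer DC's public IP (assumed)
TUNNEL_IP = "172.16.0.1/30"         # inner tunnel address (assumed)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Create the GRE tunnel interface, bring it up, and address it.
run(["ip", "tunnel", "add", "gre1", "mode", "gre",
     "remote", REMOTE_PUBLIC_IP, "local", LOCAL_PUBLIC_IP, "ttl", "255"])
run(["ip", "link", "set", "gre1", "up"])
run(["ip", "addr", "add", TUNNEL_IP, "dev", "gre1"])
```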
12. C* Cluster Storage
● We started with Amazon EBS:
  ○ With a small number of nodes (up to 4): you want persistent storage, to avoid running repairs if you lose a node
  ○ 4x EBS devices in a RAID10 configuration: provide up to 1,000 IOPS with bursts of up to 2,000 IOPS
  ○ Cheap in AWS
● 8 nodes with ephemeral devices:
  ○ Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster
  ○ We used RAID0 (see the sketch after this list)
  ○ Higher performance (double that of EBS)
  ○ Free, bundled with the instances
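For illustration, a minimal sketch of striping the ephemeral devices into RAID0 with mdadm, formatting with ext4 (the deck's filesystem), and mounting for Cassandra data. The device names and mount point are assumptions.

```python
import subprocess

DEVICES = ["/dev/xvdb", "/dev/xvdc"]  # instance-store devices (assumed)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Stripe the ephemeral disks into a single md device.
run(["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(DEVICES)}"] + DEVICES)
# Format and mount for the Cassandra data directory.
run(["mkfs.ext4", "/dev/md0"])
run(["mkdir", "-p", "/var/lib/cassandra"])
run(["mount", "/dev/md0", "/var/lib/cassandra"])
```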
13. C* Cluster Storage continued
● 16 nodes with ephemeral devices:
  ○ When load became heavy we grew to 16 nodes
  ○ Compactions and repairs harmed cluster latency
  ○ We had to use Provisioned IOPS devices for C* maintenance
● C3 instance type with SSD:
  ○ Came just in time, providing ephemeral SSD storage
  ○ Solved our performance problems and enabled seamless compactions and repairs
  ○ Amazon currently has scarce deployment of this hardware and nodes are not stable
  ○ Not yet available in all regions
  ○ Deploying C3 nodes is not always possible due to AWS capacity issues
  ○ Amazon promised to resolve the C3 issues next month
15. Monitoring
● We rely heavily on DataStax OpsCenter
● We pull OpsCenter metrics out for graphing
● We wrote our own read/write speed test against a separate, dedicated keyspace on each node to detect bottlenecks and problematic nodes (see the sketch after this list)
● We sample the data separately from the application to detect whether a problem originates in C* or in the application
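A minimal sketch of such a per-node speed test, using the DataStax Python driver pinned to a single node. The node address, keyspace, table, payload size, and sample count are assumptions; the deck does not describe the tool's internals.

```python
import time
import uuid
from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

NODE = "10.0.0.10"   # run against one node at a time (assumed address)
SAMPLES = 100

# Assumes a dedicated keyspace/table created beforehand, e.g.:
#   CREATE KEYSPACE speedtest
#     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
#   CREATE TABLE speedtest.probe (id uuid PRIMARY KEY, payload text);
cluster = Cluster([NODE],
                  load_balancing_policy=WhiteListRoundRobinPolicy([NODE]))
session = cluster.connect("speedtest")

insert = session.prepare("INSERT INTO probe (id, payload) VALUES (?, ?)")
select = session.prepare("SELECT payload FROM probe WHERE id = ?")
ids = [uuid.uuid4() for _ in range(SAMPLES)]

start = time.monotonic()
for i in ids:
    session.execute(insert, (i, "x" * 1024))
write_ms = (time.monotonic() - start) / SAMPLES * 1000

start = time.monotonic()
for i in ids:
    session.execute(select, (i,))
read_ms = (time.monotonic() - start) / SAMPLES * 1000

print(f"{NODE}: avg write {write_ms:.2f} ms, avg read {read_ms:.2f} ms")
cluster.shutdown()
```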
16. What have we learned
● Storage:
  ○ Use SSD:
    ■ It provides high and stable disk performance
    ■ It neutralizes the effects of compactions and repairs on the cluster
    ■ Worth the money
● Network:
  ○ Use the highest-bandwidth VPN possible
  ○ GRE is great (it lacks encryption, but provides the best bandwidth)
● Maintenance:
  ○ Run compact daily: it does miracles for performance under heavy load
  ○ If you are not on SSD, disable thrift on the node before running a compaction
  ○ Do compactions in sequence, node by node (see the sketch below)
  ○ On high-load systems, avoid repair as much as possible; it's better to decommission and recommission a node than to run repair!
  ○ If you have to repair, always use the "-pr" flag and, if possible, use the incremental repair option (requires heavy scripting)
● Monitoring:
  ○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
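A minimal sketch of the node-by-node maintenance loop described above, wrapping nodetool: disable thrift, run a major compaction, re-enable thrift, then move on to the next node. The host names are assumptions; a "-pr" repair would be nodetool -h <host> repair -pr in place of compact.

```python
import subprocess

NODES = ["cass-01", "cass-02", "cass-03"]  # hypothetical node names

def nodetool(host, *args):
    cmd = ["nodetool", "-h", host] + list(args)
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

for host in NODES:
    # On non-SSD nodes, stop serving thrift clients during compaction.
    nodetool(host, "disablethrift")
    try:
        nodetool(host, "compact")  # major compaction, one node at a time
    finally:
        nodetool(host, "enablethrift")
```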