3. Real Time Bidding (RTB)
● Real-time bidding is a dynamic auction process where each impression is bid for in (near) real time, as opposed to a static auction
● Kenshoo is engaged in Facebook Exchange (FBX)
● In FBX, each bid has a lifetime of 120ms. All transactions have to complete within that period, and the winning ad is presented to the user (see the sketch after this list)
● Kenshoo employs ad re-targeting, where search engine campaigns are extended to the social network, giving a much higher ROI for our customers
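To make the 120ms constraint concrete, below is a minimal sketch of a deadline-bounded bid handler. Only the 120ms budget comes from the deck; the handler function, thread pool, and the 20ms network headroom are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BID_LIFETIME_MS = 120       # FBX bid lifetime, per the deck
NETWORK_HEADROOM_MS = 20    # assumed headroom for network latency

executor = ThreadPoolExecutor(max_workers=32)

def compute_bid(bid_request):
    # Hypothetical placeholder for the real work, e.g. a Cassandra
    # read of the re-targeting profile plus price computation.
    return {"price": 0.42, "ad_id": "ad-123"}

def handle_bid(bid_request):
    """Return a bid within the 120ms lifetime, or None to pass on the auction."""
    future = executor.submit(compute_bid, bid_request)
    budget_s = (BID_LIFETIME_MS - NETWORK_HEADROOM_MS) / 1000.0
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()
        return None  # too late: the exchange has already moved on

if __name__ == "__main__":
    print(handle_bid({"user_id": "u1"}))
```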
8. Requirements
● Handle 25K+ requests within the 120ms bid time-frame, including network latencies
● Ability to scale up to 1M requests per minute while keeping the current latency
● Handle ~10K writes/second with low latency
● Multi-DC Configuration; all nodes must be synced in real time
● Seamless Operations: Compactions and Repairs
● High Security
9. C* Physical Architecture
[Diagram: two-region physical architecture. App servers in the (US) West and (US) East regions reach FBX WEST and FBX EAST over the Internet; the two regions are linked by a GRE VPN.]
10. C* Cluster Information
● Cassandra version 1.2.6
● Oracle Java 7
● Manual tokens (vnodes are coming soon)
● Multi-DC Configuration (see the sketch after this list)
● Network Topology
● DC connectivity between VPCs via Linux GRE
● Amazon c3.2xlarge instance type
● Ubuntu 13.10 with ext4
● SSD (ephemeral)
[Diagram: the ring]
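As a rough illustration of the multi-DC setup above, here is a minimal sketch that defines a keyspace replicated across two datacenters with NetworkTopologyStrategy, using the DataStax Python driver. The contact point, keyspace name, datacenter names, and replication factors are assumptions; with a non-EC2 snitch, the DC names must match those defined in the snitch configuration.

```python
# pip install cassandra-driver
from cassandra.cluster import Cluster

# Contact point is hypothetical.
cluster = Cluster(["10.0.0.10"])
session = cluster.connect()

# Keyspace name, DC names, and replication factors are assumptions.
# (Re-running this raises AlreadyExists; C* 1.2 has no IF NOT EXISTS.)
session.execute("""
    CREATE KEYSPACE bidding
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_west': 3,
        'us_east': 3
    }
""")
cluster.shutdown()
```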
11. C* Cluster Network Between Sites
● For security reasons we:
  ○ Do not use Ec2Snitch or Ec2MultiRegionSnitch
  ○ Connect the nodes via VPN (Linux GRE)
● Linux GRE is fast, reliable, and provides high throughput (~1 Gb/s); a setup sketch follows below
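A minimal sketch of bringing up such a GRE tunnel with the standard Linux iproute2 tools, driven from Python. All addresses and the interface name are hypothetical; each endpoint runs this as root with local/remote swapped. Note that plain GRE tunnels traffic but does not encrypt it (consistent with the lesson on slide 16).

```python
import subprocess

LOCAL_PUBLIC_IP = "203.0.113.10"    # this node's public IP (assumed)
REMOTE_PUBLIC_IP = "198.51.100.20"  # peer DC's public IP (assumed)
TUNNEL_IP = "172.16.0.1/30"         # inner tunnel address (assumed)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Create the GRE tunnel interface, bring it up, and address it.
run(["ip", "tunnel", "add", "gre1", "mode", "gre",
     "remote", REMOTE_PUBLIC_IP, "local", LOCAL_PUBLIC_IP, "ttl", "255"])
run(["ip", "link", "set", "gre1", "up"])
run(["ip", "addr", "add", TUNNEL_IP, "dev", "gre1"])
```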
12. C* Cluster Storage
● We started with Amazon EBS:
  ○ With a small number of nodes (up to 4): you want persistent storage, to avoid running repairs if you lose a node
  ○ 4x EBS devices in a RAID10 configuration: provide up to 1,000 IOPS with bursts of up to 2,000 IOPS
  ○ Cheap in AWS
● 8 nodes with ephemeral devices:
  ○ Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster
  ○ We used RAID0 (see the sketch after this list)
  ○ Higher performance (double that of EBS)
  ○ Free, bundled with the instances
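For illustration, a minimal sketch of striping the ephemeral devices into RAID0 with mdadm, formatting with ext4 (the deck's filesystem), and mounting for Cassandra data. The device names and mount point are assumptions.

```python
import subprocess

DEVICES = ["/dev/xvdb", "/dev/xvdc"]  # instance-store devices (assumed)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Stripe the ephemeral disks into a single md device.
run(["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(DEVICES)}"] + DEVICES)
# Format and mount for the Cassandra data directory.
run(["mkfs.ext4", "/dev/md0"])
run(["mkdir", "-p", "/var/lib/cassandra"])
run(["mount", "/dev/md0", "/var/lib/cassandra"])
```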
13. C* Cluster Storage continued
● 16 nodes with ephemeral devices:
  ○ When load became heavy we grew to 16 nodes
  ○ Compactions and repairs harmed cluster latency
  ○ We had to use Provisioned IOPS devices for C* maintenance
● C3 instance type with SSD:
  ○ Came just in time, providing ephemeral SSD storage
  ○ Solved our performance problems and enabled seamless compactions and repairs
  ○ Amazon currently has scarce deployment of this hardware and nodes are not stable
  ○ Not yet available in all regions
  ○ Deploying C3 nodes is not always possible due to AWS capacity issues
  ○ Amazon promised to resolve the C3 issues next month
15. Monitoring
● We rely heavily on DataStax OpsCenter
● We pull OpsCenter metrics out for graphing
● We wrote our own read/write speed test against a separate, dedicated keyspace on each node to detect bottlenecks and problematic nodes (see the sketch after this list)
● We sample the data separately from the application to detect whether a problem originates in C* or in the application
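A minimal sketch of such a per-node speed test, using the DataStax Python driver pinned to a single node. The node address, keyspace, table, payload size, and sample count are assumptions; the deck does not describe the tool's internals.

```python
import time
import uuid
from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

NODE = "10.0.0.10"   # run against one node at a time (assumed address)
SAMPLES = 100

# Assumes a dedicated keyspace/table created beforehand, e.g.:
#   CREATE KEYSPACE speedtest
#     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
#   CREATE TABLE speedtest.probe (id uuid PRIMARY KEY, payload text);
cluster = Cluster([NODE],
                  load_balancing_policy=WhiteListRoundRobinPolicy([NODE]))
session = cluster.connect("speedtest")

insert = session.prepare("INSERT INTO probe (id, payload) VALUES (?, ?)")
select = session.prepare("SELECT payload FROM probe WHERE id = ?")
ids = [uuid.uuid4() for _ in range(SAMPLES)]

start = time.monotonic()
for i in ids:
    session.execute(insert, (i, "x" * 1024))
write_ms = (time.monotonic() - start) / SAMPLES * 1000

start = time.monotonic()
for i in ids:
    session.execute(select, (i,))
read_ms = (time.monotonic() - start) / SAMPLES * 1000

print(f"{NODE}: avg write {write_ms:.2f} ms, avg read {read_ms:.2f} ms")
cluster.shutdown()
```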
16. What have we learned
● Storage:
  ○ Use SSD:
    ■ It provides high and stable disk performance
    ■ It neutralizes the effects of compactions and repairs on the cluster
    ■ Worth the money
● Network:
  ○ Use the highest-bandwidth VPN possible
  ○ GRE is great (it lacks encryption, but provides the best bandwidth)
● Maintenance:
  ○ Run compact daily: it does miracles for performance under heavy load
  ○ If you are not on SSD, disable thrift on the node before running a compaction
  ○ Do compactions in sequence, node by node (see the sketch below)
  ○ On high-load systems, avoid repair as much as possible; it's better to decommission and recommission a node than to run repair!
  ○ If you have to repair, always use the "-pr" flag and, if possible, use the incremental repair option (requires heavy scripting)
● Monitoring:
  ○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
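A minimal sketch of the node-by-node maintenance loop described above, wrapping nodetool: disable thrift, run a major compaction, re-enable thrift, then move on to the next node. The host names are assumptions; a "-pr" repair would be nodetool -h <host> repair -pr in place of compact.

```python
import subprocess

NODES = ["cass-01", "cass-02", "cass-03"]  # hypothetical node names

def nodetool(host, *args):
    cmd = ["nodetool", "-h", host] + list(args)
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

for host in NODES:
    # On non-SSD nodes, stop serving thrift clients during compaction.
    nodetool(host, "disablethrift")
    try:
        nodetool(host, "compact")  # major compaction, one node at a time
    finally:
        nodetool(host, "enablethrift")
```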