
Stream Processing in Uber


Video and slides synchronized; MP3 and slide download available at http://bit.ly/1OKo5FN.

Danny Yuan discusses how stream processing is used in Uber's real-time system to solve a wide range of problems, including but not limited to real-time aggregation and prediction on geospatial time series, data migration, monitoring and alerting, and extracting patterns from data streams. Yuan also presents the architecture of the stream processing pipeline. Filmed at qconsf.com.

Danny Yuan is a software engineer at Uber. He is currently working on streaming systems for Uber's logistics platform. Prior to joining Uber, he worked on building Netflix's cloud platform. His work includes predictive autoscaling, a distributed tracing service, a real-time data pipeline that scaled to process hundreds of billions of events every day, and Netflix's low-latency crypto services.


Stream Processing in Uber

  1. 1. STREAM PROCESSING @ UBER | DANNY YUAN @ UBER
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese) • Posts content from our QCon conferences • News: 15-20/week • Articles: 3-4/week • Presentations (videos): 12-15/week • Interviews: 2-3/week • Books: 1/month. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/uber-stream-processing
  3. 3. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for YOU, the influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide. Presented at QCon San Francisco: www.qconsf.com
  4. 4. What is Uber
  5. 5. Transportation at your fingertips
  6. 6. Stream Data Allows Us To Feel The Pulse Of Cities
  7. 7. Marketplace Health
  8. 8. What’s Going on Now
  9. 9. What’s Happened?
  10. 10. Status Tracking
  11. 11. A Little Background
  12. 12. Uber’s Platform Is a Distributed State Machine Rider States
  13. 13. Uber’s Platform Is a Distributed State Machine Rider States Driver States
  14. 14. Applications can’t do everything
  15. 15. Instead, Applications Emit Events
  16. 16. Events Should Be Available In Seconds
  17. 17. Events Should Rarely Get Lost
  18. 18. Events Should Be Cheap And Scalable
  19. 19. Where are the challenges?
  20. 20. Many Dimensions Dozens of fields per event
  21. 21. Granular Data
  22. 22. Granular Data
  23. 23. Granular Data Over 10,000 hexagons in the city
  24. 24. Granular Data 7 vehicle types
  25. 25. Granular Data 1440 minutes in a day
  26. 26. Granular Data 13 driver states
  27. 27. Granular Data 300 cities
  28. 28. Granular Data 1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
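A quick sanity check of the slide's arithmetic, as a minimal Python sketch (the numbers are exactly those quoted on slides 23-28):

  # Cardinality of the dimension space quoted on the slides.
  cities = 300
  hexagons_per_city = 10_000
  vehicle_types = 7
  minutes_per_day = 1440
  driver_states = 13

  combinations = (cities * hexagons_per_city * vehicle_types
                  * minutes_per_day * driver_states)
  print(f"{combinations:,}")  # 393,120,000,000, roughly 393 billion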
  29. 29. Unknown Query Patterns Any combination of dimensions
  30. 30. Variety of Aggregations - Heatmap - Top N - Histogram - count(), avg(), sum(), percent(), geo
  31. 31. Different Geo Aggregation
  32. 32. Large Data Volume • Hundreds of thousands of events per second, or billions of events per day
 • At least dozens of fields in each event
  33. 33. Tight Schedule
  34. 34. Key: Generalization
  35. 35. Data Type • Dimensional, temporal, spatial data
 Dimension       Value
 state           driver_arrived
 vehicle type    uberX
 timestamp       13244323342
 latitude        12.23
 longitude       30.00
  36. 36. Data Query • OLAP on single-table temporal-spatial data
 SELECT <agg functions>, <dimensions>
 FROM <data_source>
 WHERE <boolean filter>
 GROUP BY <dimensions>
 HAVING <boolean filter>
 ORDER BY <sorting criteria>
 LIMIT <n>
 DO <post aggregation>
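To make the template concrete: a hedged sketch of how such a query might map onto Elasticsearch's search-plus-aggregation DSL, expressed as a Python dict. The index layout, field names, and values are invented for illustration, not Uber's schema:

  # Hypothetical mapping of the generic OLAP query onto Elasticsearch.
  query = {
      "query": {                        # WHERE <boolean filter>
          "bool": {
              "filter": [
                  {"term": {"state": "driver_arrived"}},
                  {"range": {"timestamp": {"gte": "now-1h"}}},
              ]
          }
      },
      "aggs": {                         # GROUP BY <dimensions>
          "by_vehicle_type": {
              "terms": {"field": "vehicle_type", "size": 10},  # LIMIT <n>
              "aggs": {                 # <agg functions>
                  "avg_eta": {"avg": {"field": "eta_seconds"}}
              },
          }
      },
      "size": 0,                        # aggregations only, no raw hits
  }

Note that the DO <post aggregation> step has no Elasticsearch equivalent, which is one reason a separate post-processing layer appears later in the talk.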
  37. 37. Finding the Right Storage System
  38. 38. Minimum Requirements • OLAP with geospatial and time-series support
 • Support for large amounts of data
 • Sub-second response time
 • Query of raw data

  39. 39. It can’t be a KV store
  40. 40. Challenges to KV Store Pre-computing all keys is O(2^n) for both space and time
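Why 2^n: a KV store has to pre-compute one rollup for every subset of dimensions a query might filter or group on, and a set of n dimensions has 2^n subsets. A minimal sketch (dimension names are illustrative):

  from itertools import combinations

  dimensions = ["city", "hexagon", "vehicle_type", "minute", "driver_state"]

  # One pre-computed rollup per subset of dimensions.
  subsets = [c for r in range(len(dimensions) + 1)
             for c in combinations(dimensions, r)]
  print(len(subsets))  # 2**5 = 32; with dozens of fields per event,
                       # the key space becomes intractable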

  41. 41. It can’t be a relational database
  42. 42. Challenges to Relational DB • Managing multiple indices is painful
 • Scanning is not fast enough 

  43. 43. A System That Supports • Fast scan
 • Arbitrary boolean queries
 • Raw data
 • Wide range of aggregations 

  44. 44. Elasticsearch
  45. 45. Highly Efficient Inverted-Index For Boolean Query
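A minimal sketch of why an inverted index makes boolean filters cheap: each dimension:value pair maps to a posting list of matching event IDs, so a boolean AND is just a set intersection (toy data, not Uber's schema):

  # Toy inverted index: "dimension:value" -> set of event IDs.
  index = {
      "state:driver_arrived": {1, 3, 5, 8},
      "vehicle_type:uberX":   {2, 3, 5, 9},
      "city:sf":              {1, 2, 3, 5},
  }

  # AND across three filters = intersection of posting lists.
  hits = (index["state:driver_arrived"]
          & index["vehicle_type:uberX"]
          & index["city:sf"])
  print(hits)  # {3, 5}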
  46. 46. Built-in Distributed Query
  47. 47. Fast Scan with Flexible Aggregations
  48. 48. Storage
  49. 49. Are We Done?
  50. 50. Transformation e.g. (Lat, Long) -> (zipcode, hexagon)
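A hedged sketch of the per-event transformation this slide describes. The grid quantization below is only a stand-in for a real hexagonal geo index, and zipcode_of is a hypothetical reverse-geocoding lookup:

  def to_cell(lat, lng, cell_deg=0.01):
      # Stand-in for a real hexagon index: quantize the coordinate
      # into a coarse grid cell ID so events can be grouped spatially.
      return (round(lat / cell_deg), round(lng / cell_deg))

  def enrich(event):
      lat, lng = event["latitude"], event["longitude"]
      event["cell"] = to_cell(lat, lng)
      # event["zipcode"] = zipcode_of(lat, lng)  # hypothetical lookup service
      return event

  print(enrich({"latitude": 37.77, "longitude": -122.42}))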
  51. 51. Dynamic Pricing
  52. 52. Trend Prediction
  53. 53. Supply and Demand Distribution
  54. 54. Technically Speaking: Clustering & Pr(D, S, E)
  55. 55. New Use Cases —> New Requirements
  56. 56. Pre-aggregation
  57. 57. Joining Multiple Streams
  58. 58. Sessionization
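Sessionization here means stitching one user's event stream into sessions separated by an inactivity gap. A minimal sketch; the 30-minute gap is an assumed example, not a number from the talk:

  def sessionize(timestamps, gap_seconds=30 * 60):
      # Split sorted event timestamps into sessions wherever the gap
      # between consecutive events exceeds gap_seconds.
      sessions, current = [], []
      for ts in sorted(timestamps):
          if current and ts - current[-1] > gap_seconds:
              sessions.append(current)
              current = []
          current.append(ts)
      if current:
          sessions.append(current)
      return sessions

  print(sessionize([0, 60, 120, 4000, 4100]))  # [[0, 60, 120], [4000, 4100]]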
  59. 59. Multi-Staged Processing
  60. 60. State Management
  61. 61. Apache Samza
  62. 62. Why Apache Samza?
  63. 63. DAG on Kafka
  64. 64. Excellent Integration with Kafka
  65. 65. Excellent Integration with Kafka
  66. 66. Built-in Checkpointing
  67. 67. Built-in State Management
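Conceptually, checkpointing plus durable local state is what lets a failed task resume from its last committed offset without losing or double-counting aggregates. A toy model of the idea; this is not Samza's actual API (which is Java, with state backed by a changelog topic):

  state = {}                # e.g. running counts per key
  committed_offset = -1     # restored from the checkpoint on restart

  def process(messages):
      global committed_offset
      for offset, (key, value) in enumerate(messages):
          if offset <= committed_offset:
              continue                    # skip replayed input after a restart
          state[key] = state.get(key, 0) + value
          if offset % 100 == 99:          # commit periodically, not per message
              committed_offset = offset   # in Samza this also flushes state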
  68. 68. Processing Storage
  69. 69. What If Storage Is Down?
  70. 70. What If Processing Takes Long?
  71. 71. Processing Storage
  72. 72. Are We Done?
  73. 73. Post Processing
  74. 74. Results Transformation and Smoothing
  75. 75. Scale of Post Processing 10,000 hexagons in a city
  76. 76. Scale of Post Processing 331 neighboring hexagons to look at
  77. 77. Scale of Post Processing 331 x 10,000 = 3.31 Million Hexagons to Process for a Single Query
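The 331 figure is consistent with a 10-ring neighborhood on a hexagonal grid: ring k contains 6k cells, so everything within k rings of a center cell is 1 + 3k(k+1) hexagons. A quick check:

  def hexes_within(k):
      # Center cell plus 6*i cells in each surrounding ring i = 1 + 3k(k+1).
      return 1 + sum(6 * i for i in range(1, k + 1))

  print(hexes_within(10))   # 331
  print(331 * 10_000)       # 3,310,000 hexagon lookups for one query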
  78. 78. Scale of Post Processing 99%-ile Processing Time: 70ms
  79. 79. Post Processing • Each processor is a pure function
 • Processors can be composed by combinators
  80. 80. Post Processing • Highly parallelized execution
 • Pipelining
  81. 81. Post Processing • Each processor is a pure function
 • Processors can be composed by combinators
 • Highly parallelized execution
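A minimal sketch of the pure-function-plus-combinator idea on slides 79-81: each processor is a pure function from value to value, and a compose combinator chains them, which is also what makes fan-out across hexagons trivially parallelizable. Names are illustrative:

  from functools import reduce

  def compose(*processors):
      # Combinator: fold left-to-right over pure one-argument functions.
      return lambda x: reduce(lambda acc, f: f(acc), processors, x)

  # Illustrative processors for a heatmap-style result.
  normalize = lambda cells: {k: v / max(cells.values()) for k, v in cells.items()}
  smooth = lambda cells: {k: round(v, 2) for k, v in cells.items()}

  pipeline = compose(normalize, smooth)
  print(pipeline({"hex_a": 5.0, "hex_b": 10.0}))  # {'hex_a': 0.5, 'hex_b': 1.0}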
  82. 82. Practical Considerations
  83. 83. Data Discovery
  84. 84. Elasticsearch Query Can Be Complex
  85. 85. /driverAcceptanceRate?
 geo_dist(10, [37, 22])&
 time_range(2015-02-04, 2015-03-06)&
 aggregate(timeseries(7d))&
 eq(msg.driverId, 1)
  86. 86. Elasticsearch Query Can Be Optimized • Pipelining
 • Validation
 • Throttling
  87. 87. [Chart: time in seconds]
  88. 88. Elasticsearch Can Be Replaced
  89. 89. Storage | Query | Processing
  90. 90. There’s one more thing
  91. 91. There are always patterns in streams
  92. 92. There is always need for quick exploration
  93. 93. How many drivers cancel a request 10 times in a row within a 5-minute window?
  94. 94. Which riders request a pickup from 100 miles apart within a half hour window?
  95. 95. Complex Event Processing
 FROM driver_canceled#window.time(10 min)
 SELECT clientUUID, count(clientUUID) as cancelCount
 GROUP BY clientUUID HAVING cancelCount > 10
 INSERT INTO hipchat(room);
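The query above is in a Siddhi-style CEP dialect. A hedged Python sketch of the same detection logic, a per-client sliding count window with an alert threshold; the alert function stands in for INSERT INTO hipchat(room):

  from collections import defaultdict, deque

  WINDOW = 10 * 60      # seconds, mirroring #window.time(10 min)
  THRESHOLD = 10        # mirroring HAVING cancelCount > 10
  windows = defaultdict(deque)

  def alert(client_uuid, n):
      print(f"{client_uuid} cancelled {n} times within the window")

  def on_cancel(client_uuid, ts):
      # Keep only cancels inside the sliding window, then test the count.
      q = windows[client_uuid]
      q.append(ts)
      while q and ts - q[0] > WINDOW:
          q.popleft()
      if len(q) > THRESHOLD:
          alert(client_uuid, len(q))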
  96. 96. Implementation Becomes Easy
  97. 97. Thank You!
  98. 98. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/uber-stream-processing
