Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Etl is Dead; Long Live Streams

Neha Narkhede talks about the experience at LinkedIn moving from batch-oriented ETL to real-time streams using Apache Kafka and how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at Linkedin, supporting thousands of engineers, etc.

Etl is Dead; Long Live Streams

  1. 1. ETL is dead; long-live streams Neha Narkhede, Co-founder & CTO, Confluent
  2. 2. “Data and data systems have really changed in the past decade
  3. 3. Old world: Two popular locations for data Operational databases Relational data warehouse DB DB DB DB DWH
  4. 4. “Several recent data trends are driving a dramatic change in the ETL architecture
  5. 5. “#1: Single-server databases are replaced by a myriad of distributed data platforms that operate at company-wide scale
  6. 6. “#2: Many more types of data sources beyond transactional data - logs, sensors, metrics...
  7. 7. “#3: Stream data is increasingly ubiquitous; need for faster processing than daily
  8. 8. “The end result? This is what data integration ends up looking like in practice
  9. 9. App App App App search HadoopDWH monitoring security MQ MQ cache cache
  10. 10. A giant mess! App App App App search HadoopDWH monitoring security MQ MQ cache cache
  11. 11. “We will see how transitioning to streams cleans up this mess and works towards...
  12. 12. Streaming platform DWH Hadoop security App App App App search NoSQL monitor ing request-response messaging OR stream processing streaming data pipelines changelogs
  13. 13. A short history of data integration
  14. 14. “Surfaced in the 1990s in retail organizations for analyzing buyer trends
  15. 15. “Extract data from databases Transform into destination warehouse schema Load into a central data warehouse
  16. 16. “BUT … ETL tools have been around for a long time, data coverage in data warehouses is still low! WHY?
  17. 17. Etl has drawbacks
  18. 18. “#1: The need for a global schema
  19. 19. “#2: Data cleansing and curation is manual and fundamentally error-prone
  20. 20. “#3: Operational cost of ETL is high; it is slow; time and resource intensive
  21. 21. “#4: ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion
  22. 22. “Early take on real-time ETL = Enterprise Application Integration (EAI)
  23. 23. “EAI: A different class of data integration technology for connecting applications in real-time
  24. 24. “EAI employed Enterprise Service Buses and MQs; weren’t scalable
  25. 25. ETL and EAI are outdated!
  26. 26. Old world: scale or timely data, pick one real-time scale batch EAI ETL real-time BUT not scalable scalable BUT batch
  27. 27. “Data integration and ETL in the modern world need a complete revamp
  28. 28. new world: streaming, real-time and scalable real-time scale EAI ETL Streaming Platform real-time BUT not scalable real-time AND scalable scalable BUT batch batch
  29. 29. “Modern streaming world has new set of requirements for data integration
  30. 30. “#1: Ability to process high-volume and high-diversity data
  31. 31. “#2 A streaming platform from the grounds up; a fundamental transition to event-centric thinking
  32. 32. Event-Centric Thinking Streaming Platform “A product was viewed” HadoopWeb app
  33. 33. Event-Centric Thinking Streaming Platform “A product was viewed” Hadoop Web app mobile app APIs
  34. 34. mobile app web app APIs Streaming Platform Hadoop Security Monitoring Rec engine “A product was viewed” Event-Centric Thinking
  35. 35. “Event-centric thinking, when applied at a company-wide scale, leads to this simplification ...
  36. 36. Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  37. 37. “#3: Enable forward-compatible data architecture; the ability to add more applications that need to process the same data … differently
  38. 38. “To enable forward compatibility, redefine the T in ETL: Clean data in; Clean data out
  39. 39. app logs app logs app logs app logs #1: Extract as unstructured text #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” #3: Load into DWH DWH
  40. 40. #1: Extract as unstructured text #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” DWH #2: Transform1 = data cleansing = “what is a product view” #4: Transform2 = drop PII fields” Cassandra #1: Extract as unstructured text again #3: Load cleansed data #3: Load cleansed data
  41. 41. #1: Extract as structured product view events #2: Transforms = drop PII fields” #4:.1 Load product view stream #4: Load filtered product View stream DWH Cassandra Streaming Platform #4.2 Load filtered product view stream
  42. 42. “To enable forward compatibility, redefine the T in ETL: Data transformations, not data cleansing!
  43. 43. “In summary, needs of modern data integration solution? Scale, diversity, latency and forward compatibility
  44. 44. Requirements for a modern streaming data integration solution - Fault tolerance - Parallelism - Latency - Delivery semantics - Operations and monitoring - Schema management
  45. 45. Data integration: platform vs tool Central, reusable infrastructure for many use cases One-off, non-reusable solution for a particular use case
  46. 46. New shiny future of etl: a streaming platform NoSQL RDBMS Hadoop DWH Apps Apps Apps Search Monitoring RT analytics
  47. 47. “Streaming platform serves as the central nervous system for a company’s data in the following ways ...
  48. 48. “#1: Serves as the real-time, scalable messaging bus for applications; no EAI
  49. 49. “#2: Serves as the source-of-truth pipeline for feeding all data processing destinations; Hadoop, DWH, NoSQL systems and more
  50. 50. “#3: Serves as the building block for stateful stream processing microservices
  51. 51. “ Batch data integration Streaming
  52. 52. “ Batch ETL Streaming
  53. 53. a short history of data integration drawbacks of ETL needs and requirements for a streaming platform new, shiny future of ETL: a streaming platform What does a streaming platform look like and how it enables Streaming ETL?
  54. 54. Apache kafka: a distributed streaming platform
  55. 55. 55Confidential Apache kafka 7 years ago 55
  56. 56. 56Confidential > 1,400,000,000,000 messages processed / day 56
  57. 57. Now Adopted at 1000s of companies worldwide
  58. 58. “What role does Kafka play in the new shiny future for data integration?
  59. 59. “#1: Kafka is the de-facto storage of choice for stream data
  60. 60. The log 0 1 2 3 4 5 6 7 next write reader 1 reader 2
  61. 61. The log & pub-sub 0 1 2 3 4 5 6 7 publisher subscriber 1 subscriber 2
  62. 62. “#2: Kafka offers a scalable messaging backbone for application integration
  63. 63. Kafka messaging APIs: scalable eai app Messaging APIs produce(message) consume(message)
  64. 64. “#3: Kafka enables building streaming data pipelines (E & L in ETL)
  65. 65. Kafka’s Connect API: Streaming data ingestion app Messaging APIsMessaging APIs ConnectAPI ConnectAPI app source sink Extract Load
  66. 66. “#4: Kafka is the basis for stream processing and transformations
  67. 67. Kafka’s streams API: stream processing (transforms) Messaging API Streams API apps apps ConnectAPI ConnectAPI source sink Extract Load Transforms
  68. 68. Kafka’s connect API = E and L in Streaming ETL
  69. 69. Connectors! NoSQL RDBMS Hadoop DWH Search Monitoring RT analytics Apps Apps Apps
  70. 70. Sources and sinks ConnectAPI ConnectAPI source sink Extract Load
  71. 71. Kafka’s Connect API = Connectors Made Easy! - Scalability: Leverages Kafka for scalability - Fault tolerance: Builds on Kafka’s fault tolerance model - Management and monitoring: One way of monitoring all connectors - Schemas: Offers an option for preserving schemas from source to sink
  72. 72. Kafka all the things! ConnectAPI
  73. 73. Kafka’s streams API = The T in STREAMING ETL
  74. 74. “Stream processing = transformations on stream data
  75. 75. Kafka’s streams API Streams API microservice Transforms
  76. 76. “Kafka’s Streams API = Easiest way to do stream processing using Kafka
  77. 77. “#1: Powerful and lightweight Java library; need just Kafka and your app app
  78. 78. “#2: Convenient DSL with all sorts of operators: join(), map(), filter(), windowed aggregates etc
  79. 79. Word count program using Kafka’s streams API
  80. 80. “#3: True event-at-a-time stream processing; no microbatching
  81. 81. “#4: Supports event-time processing; handles late-arriving data
  82. 82. “#5: Out-of-the-box support for local state; supports fast stateful processing
  83. 83. External state
  84. 84. local state
  85. 85. Fault-tolerant local state
  86. 86. “#6: Kafka’s Streams API allows reprocessing; useful to upgrade apps or do A/B testing
  87. 87. reprocessing
  88. 88. Logs unify batch and stream processing
  89. 89. Streams API app sinksource ConnectAPI ConnectAPI Transforms LoadExtract New shiny future of ETL: Kafka
  90. 90. A giant mess! App App App App search HadoopDWH monitoring security MQ MQ cache cache
  91. 91. All your data … everywhere … now Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  92. 92. VISION: All your data … everywhere … now Streaming platform DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  93. 93. 93 Kafka Summit SF – August 28 Conference Discount Code: kafcom17 ($50 off conference pass) www.kafka-summit.org Presented by
  94. 94. Thank you! @nehanarkhede
  95. 95. “For data integration - Streaming platform : stream data :: relational DWH : batch data ?
  96. 96. “For data integration - Streaming platform : stream data :: relational DWH : batch data ? Hold onto that thought!
  97. 97. Relational data warehouse Storage & Integration Processing Apps Batch data integration Batch ETL Daily reporting
  98. 98. Streaming platform Storage & Integration Processing Apps Streaming integration Stream processing Real-time monitoring, analytics

×