Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema

Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.

  1. Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas
  2. Agenda ▪ Mate Gulyas, Practice Lead and Principal Instructor, Databricks ▪ Shasidhar Eranti, Resident Solutions Engineer, Databricks
  3. Introduction
  4. ▪ SEGA is a worldwide leader in interactive entertainment
  5. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager
  6. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager ▪ SEGA is currently celebrating its long-awaited 60th anniversary
  7. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager ▪ SEGA is currently celebrating its long-awaited 60th anniversary ▪ SEGA also produces arcade machines, holiday resorts, films and merchandise
  8. ▪ Real-time data from SEGA titles is crucial for business users.
  9. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform.
  10. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time.
  11. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time. ▪ Over 300 event types from over 40 SEGA titles (constantly growing)
  12. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time. ▪ Over 300 event types from over 40 SEGA titles (constantly growing) ▪ Events arrive at a rate of 8,000 every second
  13. What are the GOAL and the CHALLENGE we are trying to achieve? ▪ Real-time data lake ▪ No upfront information about the schemas or the upcoming schema changes ▪ No downtime
  14. Architecture
  15. Key Requirements ▪ Ingest different types of JSON at scale ▪ Handle schema evolution dynamically ▪ Serve unstructured data in a structured form for business users
  16. Architecture
  17. Architecture ● Delta Architecture (Bronze - Silver layers)
  18. Architecture ● Delta Architecture (Bronze - Silver layers) ● Ingestion Stream (Bronze) using foreachBatch() ○ Dump JSON into delta table ○ Track schema changes
  19. Architecture ● Delta Architecture (Bronze - Silver layers) ● Ingestion Stream (Bronze) using foreachBatch() ○ Dump JSON into delta table ○ Track schema changes ● Stream multiplexing using Delta ● Event Streams (Silver) ○ Read from Bronze table ○ Fetch event schema ○ Apply schema using from_json() ○ Write to Silver table
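The bronze half of this architecture can be sketched as a single streaming query with a foreachBatch sink. This is a minimal illustration only: the Kafka source, the paths, the column names and the track_schema_changes helper are assumptions rather than the exact SEGA code; the deck itself only specifies "dump JSON into a delta table" and "track schema changes".

      from pyspark.sql import functions as F

      def ingest_batch(batch_df, batch_id):
          # 1) Dump the raw JSON payloads into the bronze Delta table
          (batch_df.select("event_type", "json", F.current_timestamp().alias("ingested_at"))
                   .write.format("delta").mode("append").save(bronze_path))
          # 2) Track schema changes: compute a schema variation hash per payload (see later slides)
          #    and record unseen hashes in the schema repository
          track_schema_changes(batch_df)  # hypothetical helper

      (spark.readStream
          .format("kafka")                                       # source assumed; the deck only
          .option("kafka.bootstrap.servers", kafka_servers)      # says events arrive at ~8,000/s
          .option("subscribe", "sega-events")
          .load()
          .select(F.col("value").cast("string").alias("json"))
          .withColumn("event_type", F.get_json_object("json", "$.event_type"))
          .writeStream
          .foreachBatch(ingest_batch)
          .option("checkpointLocation", bronze_checkpoint)
          .start())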
  21. Sample Data: Bronze Table
  22. Sample Data: Silver Tables (Bronze Table, Event Type 1.1, Event Type 2.1)
  23. Schema Inference
  24. Schema Changes: { "event_type": "1.1", "user_agent": "chrome" } → { "event_type": "1.1", "user_agent": "firefox", "has_plugins": "true" }
  25. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" }
  26. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" } (2) Sorted list of ALL columns (including nested): ["event_type", "user_agent"]
  27. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" } (2) Sorted list of ALL columns (including nested): ["event_type", "user_agent"] (3) Calculate SHA1 hash: 7862AF20813560D9AAEAF38D7E
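A schema variation hash like the one above takes only a few lines to compute. The helper below is a hypothetical sketch; the deck specifies only the three steps (collect all column names, including nested ones, sort them, and SHA1-hash the result).

      import hashlib
      import json

      def schema_variation_hash(raw_message: str) -> str:
          """Sorted list of ALL columns (including nested), then a SHA1 hash over it."""
          def collect_keys(obj, prefix=""):
              keys = []
              if isinstance(obj, dict):
                  for key, value in obj.items():
                      path = f"{prefix}.{key}" if prefix else key
                      keys.append(path)
                      keys.extend(collect_keys(value, path))
              elif isinstance(obj, list):
                  for item in obj:
                      keys.extend(collect_keys(item, prefix))
              return keys

          columns = sorted(set(collect_keys(json.loads(raw_message))))
          return hashlib.sha1(json.dumps(columns).encode("utf-8")).hexdigest().upper()

      # Every message with exactly the columns event_type and user_agent maps to the same hash:
      # schema_variation_hash('{"event_type": "1.1", "user_agent": "chrome"}')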
  28. Schema Repository
  29. Schema Repository
  30. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" }
  31. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"]
  32. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E
  33. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E → Not in Schema Repository
  34. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E → Not in Schema Repository → We need to update the schema for 1.1
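Deciding that a hash is "not in the Schema Repository" is a simple lookup. The Delta-table layout below (one row per event_type and schema_hash, plus a prototype payload) is an assumption for illustration; the deck only shows that an unseen hash triggers a schema update for that event type.

      def is_known_variation(spark, schema_repo_path, event_type, schema_hash):
          # Schema Repository assumed to be a Delta table keyed by (event_type, schema_hash)
          repo = spark.read.format("delta").load(schema_repo_path)
          return (repo
                  .filter((repo.event_type == event_type) & (repo.schema_hash == schema_hash))
                  .limit(1)
                  .count() > 0)

      # Inside the ingestion foreachBatch: for every hash that is not yet known, store the raw
      # message as a new prototype and re-infer the schema for that event type (next slides).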
  35. Foreach Batch
  36. Update the Schema
  37. Update the Schema: the new, so far UNSEEN message
  38. Update the Schema: all of the old prototypes from the Schema Repository (we have only 1 now, but there could be more)
  39. Update the Schema
  40. Update the Schema

      from typing import List
      from pyspark.sql import Row
      from pyspark.sql.types import DataType

      def inferSchema(protoPayloads: List[str]) -> DataType:
          # Let the JSON reader scan every prototype payload and return the merged schema
          schemaProtoDF = spark.createDataFrame(map(lambda x: Row(json=x), protoPayloads))
          return (spark
              .read
              .option("inferSchema", True)
              .json(schemaProtoDF.rdd.map(lambda r: r.json))
              .schema)
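A short, hedged usage sketch (the variable names are illustrative) of how the two inputs from the preceding slides feed into inferSchema:

      # new_message: the new, so far UNSEEN payload
      # old_prototypes: all of the old prototypes stored in the Schema Repository for this event type
      merged_schema = inferSchema(old_prototypes + [new_message])
      # merged_schema now covers every known variation of event type 1.1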
  46. Schema Repository
  47. Schema Repository
  48. We now have a new schema that incorporates all the previous prototypes from all known schema variations
  49. Silver tables
  50. Foreach Batch
  51. Retrieve the schema

      def assert_and_process(event_type: String, target: String)(df: DataFrame, batchId: Long): Unit = {
        // Look up the current schema (and its version) for this event type
        val (schema, schemaVersion) = get_schema(schema_repository, event_type)
        df
          .transform(process_raw_events(schema, schemaVersion))
          .write.format("delta").mode("append")
          .option("mergeSchema", true)
          .save(target)
      }
  54. Retrieve the schema

      def assert_and_process(event_type: String, target: String)(df: DataFrame, batchId: Long): Unit = {
        val (schema, schemaVersion) = get_schema(schema_repository, event_type)
        df
          .transform(process_raw_events(schema, schemaVersion))
          .write.format("delta").mode("append").partitionBy(partitionColumns: _*)
          .option("mergeSchema", true)
          .save(target)
      }
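For completeness, here is a hedged PySpark sketch of how one silver event stream could be wired end to end: read the bronze Delta table as a stream, filter to one event type (stream multiplexing), fetch its schema, apply it with from_json(), and write through foreachBatch. The curried Scala assert_and_process above is the handler actually shown in the deck; the path names, get_schema and the start_event_stream helper here are assumptions.

      from pyspark.sql import functions as F

      def start_event_stream(event_type, silver_path, checkpoint_path):
          # Fetch the current schema and version for this event type from the Schema Repository
          schema, schema_version = get_schema(schema_repository, event_type)

          def process(batch_df, batch_id):
              (batch_df
                  .withColumn("payload", F.from_json("json", schema))  # apply schema using from_json()
                  .select("event_type", F.lit(schema_version).alias("schema_version"), "payload.*")
                  .write.format("delta").mode("append")
                  .option("mergeSchema", "true")                       # tolerate additive schema changes
                  .save(silver_path))

          return (spark.readStream.format("delta").load(bronze_path)   # one bronze table,
              .filter(F.col("event_type") == event_type)               # multiplexed per event type
              .writeStream
              .foreachBatch(process)
              .option("checkpointLocation", checkpoint_path)
              .start())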
  55. Productionization (Deployment and Monitoring)
  56. Deploying Event Streams
  57. Deploying Event Streams ● Events are grouped logically
  58. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters
  59. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected
  60. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures
  61. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures ● Stream monitoring in job clusters
  62. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures ● Stream monitoring in job clusters. New schema detected
  63. Management Stream (EventGroup table)
  64. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table
  65. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected
  66. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected ● Change in schema (no action)
  67. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected ● Change in schema (no action) ● New schema detected ○ Add a new entry in the event group table ○ A new stream is launched automatically
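A hedged sketch of that management loop in PySpark: the schemaRegistry and eventGroup tables come from the slides, but their columns, the paths and the start_event_stream helper (from the earlier silver-stream sketch) are assumptions made for illustration.

      def manage(batch_df, batch_id):
          # Event types from the schemaRegistry changes that are not yet in the eventGroup table
          new_event_types = (batch_df.select("event_type").distinct()
                             .join(spark.read.format("delta").load(event_group_path),
                                   on="event_type", how="left_anti")
                             .collect())
          for row in new_event_types:
              # 1) Add a new entry in the event group table
              (spark.createDataFrame([(row.event_type,)], "event_type string")
                    .write.format("delta").mode("append").save(event_group_path))
              # 2) Launch the new stream automatically
              start_event_stream(row.event_type, silver_path_for(row.event_type),
                                 checkpoint_path_for(row.event_type))
          # A change in the schema of an already-known event type needs no action here.

      (spark.readStream.format("delta").load(schema_registry_path)
          .writeStream
          .foreachBatch(manage)
          .option("checkpointLocation", management_checkpoint)
          .start())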
  68. Monitoring
  69. Monitoring ● Use Structured Streaming listener APIs to track metrics
  70. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool
  71. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool ● Key metrics tracked in monitoring dashboard ○ Stream Status ○ Streaming latency
  72. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool ● Key metrics tracked in monitoring dashboard ○ Stream Status ○ Streaming latency ● Enable stream metrics capture for Ganglia using spark.sql.streaming.metricsEnabled=true
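A minimal sketch of the listener approach, assuming PySpark 3.4+ (where StreamingQueryListener is available in Python); the print calls stand in for whatever the central dashboarding tool expects.

      from pyspark.sql.streaming import StreamingQueryListener

      class StreamMetricsListener(StreamingQueryListener):
          def onQueryStarted(self, event):
              print(f"Stream started: {event.name} ({event.id})")

          def onQueryProgress(self, event):
              p = event.progress
              # Push the key metrics (stream status, streaming latency) to the dashboarding tool here
              print(f"{p.name}: batch={p.batchId} inputRows/s={p.inputRowsPerSecond} durationMs={p.durationMs}")

          def onQueryTerminated(self, event):
              # A termination with an exception often means an incompatible schema change
              print(f"Stream terminated: {event.id} exception={event.exception}")

      spark.streams.addListener(StreamMetricsListener())
      # Ganglia capture is switched on separately via spark.sql.streaming.metricsEnabled=true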
  73. Key takeaways ▪ Architecture: Delta helps with Schema Evolution and Stream Multiplexing capabilities ▪ Implementation: Schema Variation hash to detect schema changes ▪ Productionizing: Job clusters to run streams in production
  74. “This has revolutionised the flow of analytics from our games and has enabled business users to analyse and react to data far more quickly than we have been able to do previously.” – Felix Baker, SEGA
  75. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
