Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema

Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.

  1. Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas
  2. Agenda ▪ Mate Gulyas, Practice Lead and Principal Instructor, Databricks ▪ Shasidhar Eranti, Resident Solutions Engineer, Databricks
  3. Introduction
  4. ▪ SEGA is a worldwide leader in interactive entertainment
  5. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager
  6. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager ▪ SEGA is currently celebrating its long-awaited 60th anniversary
  7. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager ▪ SEGA is currently celebrating its long-awaited 60th anniversary ▪ SEGA also produces arcade machines, holiday resorts, films and merchandise
  8. ▪ Real-time data from SEGA titles is crucial for business users.
  9. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform.
  10. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time.
  11. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time. ▪ Over 300 event types from over 40 SEGA titles (constantly growing)
  12. ▪ Real-time data from SEGA titles is crucial for business users. ▪ SEGA’s 6 studios send data to one centralised data platform. ▪ New events are frequently added and event schemas evolve over time. ▪ Over 300 event types from over 40 SEGA titles (constantly growing) ▪ Events arrive at a rate of 8,000 every second
  13. What are the GOAL and the CHALLENGE we are trying to achieve? ▪ Real-time data lake ▪ No upfront information about the schemas or the upcoming schema changes ▪ No downtime
  14. Architecture
  15. Key Requirements ▪ Ingest different types of JSON at scale ▪ Handle schema evolution dynamically ▪ Serve unstructured data in a structured form for business users
  16. Architecture
  17. Architecture ● Delta Architecture (Bronze - Silver layers)
  18. Architecture ● Delta Architecture (Bronze - Silver layers) ● Ingestion Stream (Bronze) using foreachBatch() ○ Dump JSON into delta table ○ Track schema changes
  19. Architecture ● Delta Architecture (Bronze - Silver layers) ● Ingestion Stream (Bronze) using foreachBatch() ○ Dump JSON into delta table ○ Track schema changes ● Stream multiplexing using Delta ● Event Streams (Silver) ○ Read from Bronze table ○ Fetch event schema ○ Apply schema using from_json() ○ Write to Silver table
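The bronze half of this architecture can be sketched as a single streaming query with a foreachBatch sink. This is a minimal illustration only: the Kafka source, the paths, the column names and the track_schema_changes helper are assumptions rather than the exact SEGA code; the deck itself only specifies "dump JSON into a delta table" and "track schema changes".

      from pyspark.sql import functions as F

      def ingest_batch(batch_df, batch_id):
          # 1) Dump the raw JSON payloads into the bronze Delta table
          (batch_df.select("event_type", "json", F.current_timestamp().alias("ingested_at"))
                   .write.format("delta").mode("append").save(bronze_path))
          # 2) Track schema changes: compute a schema variation hash per payload (see later slides)
          #    and record unseen hashes in the schema repository
          track_schema_changes(batch_df)  # hypothetical helper

      (spark.readStream
          .format("kafka")                                       # source assumed; the deck only
          .option("kafka.bootstrap.servers", kafka_servers)      # says events arrive at ~8,000/s
          .option("subscribe", "sega-events")
          .load()
          .select(F.col("value").cast("string").alias("json"))
          .withColumn("event_type", F.get_json_object("json", "$.event_type"))
          .writeStream
          .foreachBatch(ingest_batch)
          .option("checkpointLocation", bronze_checkpoint)
          .start())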
  21. Sample Data: Bronze Table
  22. Sample Data: Silver Tables (Bronze Table, Event Type 1.1, Event Type 2.1)
  23. Schema Inference
  24. Schema Changes: { "event_type": "1.1", "user_agent": "chrome" } → { "event_type": "1.1", "user_agent": "firefox", "has_plugins": "true" }
  25. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" }
  26. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" } (2) Sorted list of ALL columns (including nested): ["event_type", "user_agent"]
  27. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome" } (2) Sorted list of ALL columns (including nested): ["event_type", "user_agent"] (3) Calculate SHA1 hash: 7862AF20813560D9AAEAF38D7E
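A schema variation hash like the one above takes only a few lines to compute. The helper below is a hypothetical sketch; the deck specifies only the three steps (collect all column names, including nested ones, sort them, and SHA1-hash the result).

      import hashlib
      import json

      def schema_variation_hash(raw_message: str) -> str:
          """Sorted list of ALL columns (including nested), then a SHA1 hash over it."""
          def collect_keys(obj, prefix=""):
              keys = []
              if isinstance(obj, dict):
                  for key, value in obj.items():
                      path = f"{prefix}.{key}" if prefix else key
                      keys.append(path)
                      keys.extend(collect_keys(value, path))
              elif isinstance(obj, list):
                  for item in obj:
                      keys.extend(collect_keys(item, prefix))
              return keys

          columns = sorted(set(collect_keys(json.loads(raw_message))))
          return hashlib.sha1(json.dumps(columns).encode("utf-8")).hexdigest().upper()

      # Every message with exactly the columns event_type and user_agent maps to the same hash:
      # schema_variation_hash('{"event_type": "1.1", "user_agent": "chrome"}')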
  28. Schema Repository
  29. Schema Repository
  30. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" }
  31. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"]
  32. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E
  33. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E → Not in Schema Repository
  34. Schema Variation Hash (1) Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } (2) Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] (3) Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E → Not in Schema Repository → We need to update the schema for 1.1
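Deciding that a hash is "not in the Schema Repository" is a simple lookup. The Delta-table layout below (one row per event_type and schema_hash, plus a prototype payload) is an assumption for illustration; the deck only shows that an unseen hash triggers a schema update for that event type.

      def is_known_variation(spark, schema_repo_path, event_type, schema_hash):
          # Schema Repository assumed to be a Delta table keyed by (event_type, schema_hash)
          repo = spark.read.format("delta").load(schema_repo_path)
          return (repo
                  .filter((repo.event_type == event_type) & (repo.schema_hash == schema_hash))
                  .limit(1)
                  .count() > 0)

      # Inside the ingestion foreachBatch: for every hash that is not yet known, store the raw
      # message as a new prototype and re-infer the schema for that event type (next slides).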
  35. Foreach Batch
  36. Update the Schema
  37. Update the Schema: the new, so far UNSEEN message
  38. Update the Schema: all of the old prototypes from the Schema Repository (we have only 1 now, but there could be more)
  39. Update the Schema
  40. Update the Schema

      from typing import List
      from pyspark.sql import Row
      from pyspark.sql.types import DataType

      def inferSchema(protoPayloads: List[str]) -> DataType:
          # Let the JSON reader scan every prototype payload and return the merged schema
          schemaProtoDF = spark.createDataFrame(map(lambda x: Row(json=x), protoPayloads))
          return (spark
              .read
              .option("inferSchema", True)
              .json(schemaProtoDF.rdd.map(lambda r: r.json))
              .schema)
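A short, hedged usage sketch (the variable names are illustrative) of how the two inputs from the preceding slides feed into inferSchema:

      # new_message: the new, so far UNSEEN payload
      # old_prototypes: all of the old prototypes stored in the Schema Repository for this event type
      merged_schema = inferSchema(old_prototypes + [new_message])
      # merged_schema now covers every known variation of event type 1.1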
  46. Schema Repository
  47. Schema Repository
  48. We now have a new schema that incorporates all the previous prototypes from all known schema variations
  49. Silver tables
  50. Foreach Batch
  51. Retrieve the schema

      def assert_and_process(event_type: String, target: String)(df: DataFrame, batchId: Long): Unit = {
        // Look up the current schema (and its version) for this event type
        val (schema, schemaVersion) = get_schema(schema_repository, event_type)
        df
          .transform(process_raw_events(schema, schemaVersion))
          .write.format("delta").mode("append")
          .option("mergeSchema", true)
          .save(target)
      }
  54. Retrieve the schema

      def assert_and_process(event_type: String, target: String)(df: DataFrame, batchId: Long): Unit = {
        val (schema, schemaVersion) = get_schema(schema_repository, event_type)
        df
          .transform(process_raw_events(schema, schemaVersion))
          .write.format("delta").mode("append").partitionBy(partitionColumns: _*)
          .option("mergeSchema", true)
          .save(target)
      }
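For completeness, here is a hedged PySpark sketch of how one silver event stream could be wired end to end: read the bronze Delta table as a stream, filter to one event type (stream multiplexing), fetch its schema, apply it with from_json(), and write through foreachBatch. The curried Scala assert_and_process above is the handler actually shown in the deck; the path names, get_schema and the start_event_stream helper here are assumptions.

      from pyspark.sql import functions as F

      def start_event_stream(event_type, silver_path, checkpoint_path):
          # Fetch the current schema and version for this event type from the Schema Repository
          schema, schema_version = get_schema(schema_repository, event_type)

          def process(batch_df, batch_id):
              (batch_df
                  .withColumn("payload", F.from_json("json", schema))  # apply schema using from_json()
                  .select("event_type", F.lit(schema_version).alias("schema_version"), "payload.*")
                  .write.format("delta").mode("append")
                  .option("mergeSchema", "true")                       # tolerate additive schema changes
                  .save(silver_path))

          return (spark.readStream.format("delta").load(bronze_path)   # one bronze table,
              .filter(F.col("event_type") == event_type)               # multiplexed per event type
              .writeStream
              .foreachBatch(process)
              .option("checkpointLocation", checkpoint_path)
              .start())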
  55. Productionization (Deployment and Monitoring)
  56. Deploying Event Streams
  57. Deploying Event Streams ● Events are grouped logically
  58. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters
  59. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected
  60. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures
  61. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures ● Stream monitoring in job clusters
  62. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures ● Stream monitoring in job clusters. New schema detected
  63. Management Stream (EventGroup table)
  64. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table
  65. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected
  66. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected ● Change in schema (no action)
  67. Management Stream (EventGroup table) ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected ● Change in schema (no action) ● New schema detected ○ Add a new entry in the event group table ○ A new stream is launched automatically
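A hedged sketch of that management loop in PySpark: the schemaRegistry and eventGroup tables come from the slides, but their columns, the paths and the start_event_stream helper (from the earlier silver-stream sketch) are assumptions made for illustration.

      def manage(batch_df, batch_id):
          # Event types from the schemaRegistry changes that are not yet in the eventGroup table
          new_event_types = (batch_df.select("event_type").distinct()
                             .join(spark.read.format("delta").load(event_group_path),
                                   on="event_type", how="left_anti")
                             .collect())
          for row in new_event_types:
              # 1) Add a new entry in the event group table
              (spark.createDataFrame([(row.event_type,)], "event_type string")
                    .write.format("delta").mode("append").save(event_group_path))
              # 2) Launch the new stream automatically
              start_event_stream(row.event_type, silver_path_for(row.event_type),
                                 checkpoint_path_for(row.event_type))
          # A change in the schema of an already-known event type needs no action here.

      (spark.readStream.format("delta").load(schema_registry_path)
          .writeStream
          .foreachBatch(manage)
          .option("checkpointLocation", management_checkpoint)
          .start())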
  68. Monitoring
  69. Monitoring ● Use Structured Streaming listener APIs to track metrics
  70. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool
  71. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool ● Key metrics tracked in monitoring dashboard ○ Stream Status ○ Streaming latency
  72. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump Streaming metrics to central dashboarding tool ● Key metrics tracked in monitoring dashboard ○ Stream Status ○ Streaming latency ● Enable stream metrics capture for Ganglia using spark.sql.streaming.metricsEnabled=true
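A minimal sketch of the listener approach, assuming PySpark 3.4+ (where StreamingQueryListener is available in Python); the print calls stand in for whatever the central dashboarding tool expects.

      from pyspark.sql.streaming import StreamingQueryListener

      class StreamMetricsListener(StreamingQueryListener):
          def onQueryStarted(self, event):
              print(f"Stream started: {event.name} ({event.id})")

          def onQueryProgress(self, event):
              p = event.progress
              # Push the key metrics (stream status, streaming latency) to the dashboarding tool here
              print(f"{p.name}: batch={p.batchId} inputRows/s={p.inputRowsPerSecond} durationMs={p.durationMs}")

          def onQueryTerminated(self, event):
              # A termination with an exception often means an incompatible schema change
              print(f"Stream terminated: {event.id} exception={event.exception}")

      spark.streams.addListener(StreamMetricsListener())
      # Ganglia capture is switched on separately via spark.sql.streaming.metricsEnabled=true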
  73. Key takeaways ▪ Architecture: Delta helps with Schema Evolution and Stream Multiplexing capabilities ▪ Implementation: Schema Variation hash to detect schema changes ▪ Productionizing: Job clusters to run streams in production
  74. “This has revolutionised the flow of analytics from our games and has enabled business users to analyse and react to data far more quickly than we have been able to do previously.” – Felix Baker, SEGA
  75. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
