Containerized Stream Engine to Build Modern Delta Lake

As time goes on, everything changes: your business, your analytics platform, and your data. Deriving real-time insights from this huge volume of data is key to survival. A robust solution like this one lets you operate at the speed of change.

  1. 1. Containerized Stream Engine to build Modern Delta Lake Sandeep Reddy Bheemi Reddy, Senior Data Engineer Karthikeyan Siva Baskaran, Senior Data Engineer
  2. 2. Who We Are Sandeep Reddy Bheemi Reddy Senior Data Engineer Karthikeyan Siva Baskaran Senior Data Engineer TIGER DATA FOUNDATION Containerized Stream Engine to build Modern Delta Lake Contact us +1-408-508-4430 info@tigeranalytics.com https://www.tigeranalytics.com/
  3. 3. Agenda Objective Design Considerations Infrastructure Provisioning Solution Deep Dive Application Monitoring Points to be Noted Questions
  4. 4. Objective: To build Single Source of Truth (SSOT) data for the enterprise via CDC.
Demand for real-time Data: The most compelling operational analytics demand real-time data rather than historical data.
Data Agility: The speed of business is rapidly accelerating, driving the need to deliver intelligent, fast solutions.
Build SSOT from Siloed Data: Ingest large amounts of data from multiple sources by tracking the changes made to the source data, and combine them to build a Single Source of Truth for data-driven decisions.
  5. 5. Design Considerations
  6. 6. Few Common Ways to Capture Data to Get Insights
Change Data Capture: App -> DB -> DB log -> Analytics Data Lake
Dual Writes: App -> DB, plus App -> Pub/Sub system -> Analytics Data Lake
Direct JDBC: App DB -> Analytics Data Lake
  7. 7. Problems with Today's Data Lake
Inconsistent Data: During a job failure in overwrite mode, the table is left in an inconsistent state.
Schema Enforcement & Evolution: DDLs are not supported, which breaks the flow if upstream applications change the schema.
Roll Back Not Possible: In case of failure, it is not possible to roll back to the previous state of the data.
No Metadata Layer: Without a metadata layer there is no clear isolation between reads and writes, so operations are not consistent, durable, or atomic. In short, the data lake is not ACID compliant.
  8. 8. Delta Lake to the Rescue
ACID Compliant (Atomicity, Consistency, Isolation, Durability): Provides clear isolation between different writes by maintaining a log file for each transaction; even a job failure in overwrite mode will not corrupt the data. Serializable isolation levels keep the data consistent across multiple users, and changes to the table are maintained as ordered, atomic commits.
Schema Enforcement & Evolution: mergeSchema automatically adds any column that is present in the DataFrame but not in the target table to the end of the schema; overwriteSchema handles datatype changes and dropped/renamed columns.
Time Travel to Older Versions: All metadata and lineage of your data are stored. To travel back to a previous version of your Delta table, provide a timestamp or a specific version number.
Data Check Constraints: Expectations for data quality that prevent invalid data from entering your enterprise data lake.
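A minimal Scala sketch of these Delta features. The table path, column names, and sample DataFrame are illustrative, not from the deck, and CHECK constraints require a Delta release that supports them:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("delta-features").getOrCreate()

    // Illustrative DataFrame with a column ("dept") the target table lacks.
    val newColumnsDf = spark.range(1).selectExpr("id + 1 as emp_no", "'IT' as dept")

    // Schema evolution: mergeSchema appends the new column to the target schema.
    newColumnsDf.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save("/delta/employees")

    // Time travel: read an older snapshot by version number or by timestamp.
    val byVersion = spark.read.format("delta").option("versionAsOf", 1).load("/delta/employees")
    val byTime = spark.read.format("delta")
      .option("timestampAsOf", "2020-06-01 00:00:00")
      .load("/delta/employees")

    // Check constraint: rejects rows that violate the expectation.
    spark.sql("ALTER TABLE delta.`/delta/employees` ADD CONSTRAINT valid_emp CHECK (emp_no > 0)")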
  9. 9. Infrastructure Provisioning
  10. 10. IaC Workflow
On-premise code repo: maintains versions of the Terraform (TF) files.
CD pipeline with an open-source agent: runs security & compliance (policy) checks on the TF files pushed by the DevOps engineer.
Terraform: deploys the TF files and maintains the state of the infrastructure in TF state files.
Benefits:
▪ Cloud agnostic: create & manage infrastructure across various platforms.
▪ Minimize human errors and configuration differences across environments.
▪ Maintain the state of the infrastructure.
▪ Perform policy checks on the infrastructure.
Infrastructure provisioned in the selected environment: a Kubernetes cluster (with scalable worker nodes), pods (Deployments, ReplicaSets), services (NodePort & LoadBalancer), and volumes (PV & PVC).
  11. 11. Solution Deep Dive
  12. 12. Solution Architecture (running on Kubernetes)
Change Data: Kafka Connect uses the Debezium connector to parse the source database logs.
Streaming Queue: Kafka carries the change events as Avro records; each message holds a schema id plus the data, and the Schema Registry stores the registered schemas (Schema-1, Schema-2, ... Schema-n with ids 1001, 1002, ... n).
Processing Layer: Spark Structured Streaming consumes the topics, using Persistent Volume Claims (PVCs) for checkpointing.
Storage Layer: Delta Lake on ADLS / S3.
The Schema Registry also provides flexibility: a VIEW can be created from different schema versions for different teams based on their needs, which helps downstream apps run without interruption when the schema changes.
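A sketch of the Structured Streaming read of one of the CDC topics. The topic name, bootstrap servers, and the Confluent Avro handling are assumptions, not taken from the deck:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("cdc-stream").getOrCreate()

    // Subscribe to the per-table CDC topic produced by Kafka Connect / Debezium.
    val rawDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")    // assumed service name
      .option("subscribe", "mssql.employees.sample_emp")  // assumed topic name
      .option("startingOffsets", "latest")
      .load()

    // Kafka exposes key/value as binary. Because the value was written with the
    // Confluent AvroConverter, it carries a 5-byte header (magic byte + schema id)
    // before the Avro payload; strip it before decoding with from_avro, or use a
    // Schema Registry aware deserializer.
    val payloadDf = rawDf.select(expr("substring(value, 6, length(value) - 5)").as("avro_payload"))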
  13. 13. { "name": "mssql-${DBName}-connector-${conn_string}", "config": { "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector", "tasks.max": "1", "database.hostname": "${Hostname}", "database.port": "${Port}", "database.user": "${UserName}", "database.password": "${Password}", "database.server.id": "${conn_string}", "database.server.name": "${Source}.${DBName}.${conn_string}", "database.whitelist": "${DBName}", "database.dbname": "${DBName}", "database.history.kafka.bootstrap.servers": "${KAFKA}:9092", "database.history.kafka.topic": "${Source}.${DBName}.dbhistory", "key.converter":"io.confluent.connect.avro.AvroConverter", "key.converter.schema.registry.url":"http://${SCHEMA_REGISTRY}:8081", "value.converter":"io.confluent.connect.avro.AvroConverter", "value.converter.schema.registry.url": "http://${SCHEMA_REGISTRY}:8081", } } Kafka Connector Properties
  14. 14. { "payload": { "before": { "emp_no": 1, "birth_date": 18044, "first_name": “Marion", "last_name": “Colbrun" }, "after": { "emp_no": 1, "birth_date": 18044, "first_name": “Marion", "last_name": “Brickell" } } } { "payload": { "before": { "emp_no": 1, "birth_date": 18044, "first_name": "Marion", "last_name": "Colbrun" }, "after": null } } { "payload": { "before": null, "after": { "emp_no": 1, "birth_date": 18044, "first_name": "Marion", "last_name": "Colbrun" } } } insert into sample_emp values (1,current_date,'Marion’, 'Colbrun'); update sample_emp set last_name='Brickell’ where emp_no=1; delete from sample_emp where emp_no=1; INSERT UPDATE DELETE
  15. 15. CDC Code Logic Flow
Initial Load: Read the data from Kafka, create the Delta table, and insert the most recent record per primary key, excluding any deletes.
Incremental Load (Data Preprocess, DDL & DML):
Data Preprocess: Read the data from Kafka, split the deletes from the inserts/updates, and get the latest record per key using a rank window (a sketch of the split follows below).
DDL: Enable the schema autoMerge property to detect schema changes and merge the new schema into the Delta table.
DML: Use the MERGE command to handle inserts/updates/deletes based on the operation (op) column that Debezium creates while parsing the logs.
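A sketch of the delete vs. insert/update split described above, assuming the Debezium envelope has already been decoded into a DataFrame cdcDf with before, after, op, and CDCTimeStamp columns (these names are assumptions):

    import org.apache.spark.sql.functions._

    // Deletes carry the image of the removed row in "before"; inserts and
    // updates carry the new image in "after".
    val deletesDf = cdcDf
      .filter(col("op") === "d")
      .select(col("before.*"), col("op"), col("CDCTimeStamp"))

    val upsertsDf = cdcDf
      .filter(col("op").isin("c", "u"))
      .select(col("after.*"), col("op"), col("CDCTimeStamp"))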
  16. 16. Data Pre-processing
Deletes have a different schema when they are inserted into Kafka by Debezium. For deletes, take the before-image data to know which primary-key records were deleted; for inserts and updates, pull the data from the after image.
Inserts/Updates (staged):
    Flag ID Value CDCTimeStamp
    I    1  1     2018-01-01 16:02:00
    U    1  11    2018-01-01 16:02:01
    I    2  2     2018-01-01 16:02:03
    I    3  33    2018-01-01 16:02:04
    I    4  40    2018-01-01 16:02:04
    U    4  44    2018-01-01 16:02:05
Deletes:
    Flag ID Value CDCTimeStamp
    D    2  2     2018-01-01 16:02:04
Get the latest record per key in order to maintain SCD Type I:
    Flag ID Value CDCTimeStamp
    U    1  11    2018-01-01 16:02:01
    D    2  2     2018-01-01 16:02:04
    I    3  33    2018-01-01 16:02:04
    U    4  44    2018-01-01 16:02:05

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // partitionBy_lst: the primary-key columns; dmlDf: the staged CDC DataFrame
    val orderBy_lst = List("CDCTimeStamp")
    val byPrimaryKey = Window
      .partitionBy(partitionBy_lst.map(col): _*)
      .orderBy(orderBy_lst.map(x => col(x).desc): _*)

    val rankDf = dmlDf
      .withColumn("rank", rank().over(byPrimaryKey))
      .filter("rank = 1")
      .drop("rank")
  17. 17. Initial Load
As the requirement is to maintain SCD Type I, there is no need to load the delete records into Delta Lake during the initial load.
Latest record per key:
    Flag ID Value CDCTimeStamp
    U    1  11    2018-01-01 16:02:01
    D    2  2     2018-01-01 16:02:04
    I    3  33    2018-01-01 16:02:04
    U    4  44    2018-01-01 16:02:05
Consolidated data for the initial load (deletes excluded):
    Flag ID Value CDCTimeStamp
    U    1  11    2018-01-01 16:02:01
    I    3  33    2018-01-01 16:02:04
    U    4  44    2018-01-01 16:02:05

    df.where("op != 'd'")
      .write
      .mode("overwrite")
      .option("path", delta_tbl_loc)
      .format("delta")
      .saveAsTable(db_tbl)
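One way to build the df used above is a one-off batch read of the topic from the earliest offset. This is a sketch under assumed topic and server names, not the exact code from the talk:

    // Batch (non-streaming) read of everything Debezium has published so far,
    // including the initial snapshot; topic/bootstrap values are illustrative.
    val initialDf = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "mssql.employees.sample_emp")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
    // Decode the Avro value and apply the rank-window dedup from the previous
    // slide to initialDf before writing it to Delta as shown above.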
  18. 18. Incremental Load: Data Pre-process & Get Latest Record
Pre-process the data by splitting the deletes from the inserts/updates and getting the latest record per primary key; finally, union both DataFrames before performing the MERGE (see the sketch below).
Incremental staged data:
    Flag ID Value City CDCTimeStamp
    I    11 100   MDU  2018-01-01 16:02:20
    U    11 1000  CHN  2018-01-01 16:02:21
    U    3  300   MDU  2018-01-01 16:02:22
    I    14 400   MDU  2018-01-01 16:02:21
    D    4  44         2018-01-01 16:02:22
Latest record from the incremental load:
    Flag ID Value City CDCTimeStamp
    U    11 1000  CHN  2018-01-01 16:02:21
    U    3  300   MDU  2018-01-01 16:02:22
    I    14 400   MDU  2018-01-01 16:02:21
    D    4  44         2018-01-01 16:02:22

    val orderBy_lst = List("CDCTimeStamp")
    val byPrimaryKey = Window
      .partitionBy(partitionBy_lst.map(col): _*)
      .orderBy(orderBy_lst.map(x => col(x).desc): _*)

    val rankDf = dmlDf
      .withColumn("rank", rank().over(byPrimaryKey))
      .filter("rank = 1")
      .drop("rank")
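A sketch of the union step, assuming the de-duplicated upsert and delete DataFrames from the pre-processing above (unionByName with allowMissingColumns needs Spark 3.1+; on older versions the missing columns have to be added by hand):

    // Deletes come from the before image and may lack columns added later
    // (e.g. City), so align by column name and fill the gaps with nulls.
    val stagedDf = upsertsLatestDf.unionByName(deletesLatestDf, allowMissingColumns = true)
    stagedDf.createOrReplaceTempView("staging_tbl")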
  19. 19. Incremental Load: DDL & DML
Latest incremental staged data:
    Flag ID Value City CDCTimeStamp
    U    11 1000  CHN  2018-01-01 16:02:21
    U    3  300   MDU  2018-01-01 16:02:22
    I    14 400   MDU  2018-01-01 16:02:21
    D    4  44         2018-01-01 16:02:22

    MERGE INTO ${db_tbl} AS target
    USING staging_tbl AS source
    ON ${pri_key_const}
    WHEN MATCHED AND source.op = 'u' THEN UPDATE SET *
    WHEN MATCHED AND source.op = 'd' THEN DELETE
    WHEN NOT MATCHED AND source.op = 'c' THEN INSERT *

Enable spark.databricks.delta.schema.autoMerge.enabled to add new columns on the fly when the MERGE happens (only available from Delta Lake 0.6.0 and higher).
Delta table before the MERGE:
    Flag ID Value CDCTimeStamp
    U    1  11    2018-01-01 16:02:01
    I    3  33    2018-01-01 16:02:04
    U    4  44    2018-01-01 16:02:05
Delta table after the MERGE:
    Flag ID Value City CDCTimeStamp
    U    1  11    Null 2018-01-01 16:02:01
    U    3  300   MDU  2018-01-01 16:02:22
    U    11 1000  CHN  2018-01-01 16:02:21
    I    14 400   MDU  2018-01-01 16:02:21
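One way to run this MERGE from Structured Streaming is foreachBatch. The following is a sketch; the table name, join condition, checkpoint path, and stagedStreamDf (the pre-processed streaming DataFrame) are assumptions:

    import org.apache.spark.sql.DataFrame

    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    val db_tbl = "employees.sample_emp"                    // assumed target table
    val pri_key_const = "target.emp_no = source.emp_no"    // assumed join condition

    def upsertToDelta(microBatchDf: DataFrame, batchId: Long): Unit = {
      // Deduplicate / split the micro-batch as on the previous slides, then merge.
      microBatchDf.createOrReplaceTempView("staging_tbl")
      microBatchDf.sparkSession.sql(
        s"""MERGE INTO $db_tbl AS target
           |USING staging_tbl AS source
           |ON $pri_key_const
           |WHEN MATCHED AND source.op = 'u' THEN UPDATE SET *
           |WHEN MATCHED AND source.op = 'd' THEN DELETE
           |WHEN NOT MATCHED AND source.op = 'c' THEN INSERT *""".stripMargin)
    }

    stagedStreamDf.writeStream
      .foreachBatch(upsertToDelta _)
      .option("checkpointLocation", "/mnt/checkpoints/sample_emp")   // assumed path
      .start()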
  20. 20. Spark Streaming with Kubernetes
Flow: 1) spark-submit to the Kubernetes API server; 2) the driver pod is started; 3) the driver requests executor pods; 4) the scheduler schedules the executor pods; 5) the API server notifies the driver of the new executors; 6) the driver schedules tasks on the executors. Checkpointing is done on a shared file store.
Key Benefits
▪ Containerization: applications are more portable and it is easy to package dependencies.
▪ Cloud agnostic: the Spark job can be launched on any platform without code changes.
▪ Efficient resource sharing: resources can be used by other applications when Spark jobs are idle.
  21. 21. Application Monitoring
  22. 22. Centralized Log Monitoring
Fluentd is a popular data collector that runs as a DaemonSet on the Kubernetes worker nodes to collect container logs from the local filesystem and ingest them into Elasticsearch.
Metricbeat is a lightweight shipper that collects system and service metrics such as CPU, memory, and disk usage and ships them to Elasticsearch.
Elasticsearch is a real-time, distributed, scalable search engine used to index and search through large volumes of log data.
Kibana is a powerful data visualization tool that lets you explore the log data stored in Elasticsearch and gain quick insights into the Kubernetes applications.
  23. 23. Monitoring Dashboard
  24. 24. Points to be noted!
  25. 25. Points to be Noted: DEBEZIUM
Primary Key: A primary key is mandatory; without one it is not possible to track the changes and apply them to the target.
Partitions: By default, Kafka Connect creates each topic with only one partition, so the Spark job will not be parallelized. To achieve parallelism, create the topic with a higher number of partitions.
Topic/Table: One topic is created for each table in the database, plus one common topic per database to maintain the DDLs.
  26. 26. Points to be Noted: SPARK
Small Files: If small files are compacted, the MERGE during the incremental load does not have to rewrite as many files, which improves the performance of each micro-batch. To control compaction, either run OPTIMIZE, rewrite with the dataChange Delta property set to false, or enable adaptive query execution. A sketch of the dataChange-based compaction follows below.
Time Travel: Time travel does not read the Delta log checkpoint directory, because a specific version is needed; it reads the specific JSON commit files instead, since the checkpoint Parquet file is a consolidation of all previously committed JSON files.
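A sketch of the dataChange-based compaction, with an illustrative table path and target file count (OPTIMIZE itself is not shown because it is not available in all Delta releases):

    val path = "/delta/employees"   // illustrative table path
    val numFiles = 16               // target number of files after the rewrite

    // Rewrite the same data into fewer, larger files. dataChange = false marks
    // the rewrite as compaction only, so concurrent streaming readers do not
    // treat these files as new data.
    spark.read
      .format("delta")
      .load(path)
      .repartition(numFiles)
      .write
      .option("dataChange", "false")
      .format("delta")
      .mode("overwrite")
      .save(path)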
  27. 27. Any Questions?
  28. 28. Feedback Your feedback is important to us. Don’t forget to rate and review the session. THANKS!
