Introduction to Data Engineer and Data Pipeline at Credit OK
1. Presented by Kriangkrai Chaonithi @spicydog
14/11/2019 | KMUTT | Applied Computer Science
Introduction to Data Engineer and Data Pipeline at Credit OK
2. Hello! My name is Gap
Education
● BS Applied Computer Science (KMUTT)
● MS Computer Engineering (KMUTT)
Work Experience
● Former Android, iOS & PHP Developer at Longdo.COM
● Former R&D Manager at Insightera
● CTO & co-founder at Credit OK
Fields of Interest
● Software Engineering
● Cloud Architecture & Distributed Computing
● Computer Security
● Machine Learning & NLP https://spicydog.me
3. Agenda
● What is Big Data?
○ Why is data big?
○ Structured vs Unstructured Data
● Data Engineering
○ Data technology careers
○ What do data engineers do?
○ Skills for data engineers
○ Knowledge & technologies for data engineers
● What is Data Pipeline?
○ ETL - Extract, Transform, Load
○ Batch vs streaming
● Data Pipeline at Credit OK
○ Introduction to GCP technologies
○ Problem and solution on data pipeline
○ Data pipeline architecture in details
● Summary
5. What is Big Data?
https://unsplash.com/photos/LqKhnDzSF-8
6. Why is data big?
● Faster internet & better infrastructure
● Business digitization
● Social network
● IoT & embedded systems
● Automated software
● Etc.
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/KiH2-tdGQRY
7. Structured vs. Unstructured Data
https://unsplash.com/photos/QBpZGqEMsKg
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
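The distinction is easiest to see with one record in both shapes. A small illustration with made-up customer data: structured data has a fixed schema you can address directly, while unstructured data has to be parsed or extracted first.

```python
import csv
import io
import re

# Structured: fixed schema -- columns can be addressed directly by name.
structured = "customer_id,amount,date\nC001,1500.50,2019-11-14\n"
rows = list(csv.DictReader(io.StringIO(structured)))
amount = float(rows[0]["amount"])  # direct lookup, no guessing

# Unstructured: free text -- information must be extracted, e.g. with a regex.
unstructured = "Customer C001 paid 1500.50 baht on 14 Nov 2019."
match = re.search(r"Customer (\w+) paid ([\d.]+)", unstructured)
customer_id, paid = match.group(1), float(match.group(2))
```

Most real-world "big data" (logs, chat messages, documents) starts out on the unstructured side, which is a large part of why it needs special tooling.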
10. What do Data Engineers do?
https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
11. Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
12. Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
■ Local Storage
■ Network Attached Storage
■ Object Storage
○ Databases Architecture
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
13. Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
14. Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
● Document-oriented Database
● Columnar Database
● Graph Database
● Key-value Database
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Crontab (Task Scheduler)
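The four NoSQL families above differ mainly in how they shape the same record. A rough sketch, with made-up customer data and example products named only for orientation:

```python
# One customer record shaped for each NoSQL family (illustrative only).

# Document-oriented (e.g. MongoDB): nested, self-describing documents.
document = {"_id": "C001", "name": "Somchai",
            "loans": [{"amount": 1000, "status": "paid"}]}

# Columnar (e.g. Bigtable, Cassandra): values addressed by row key + column family.
columnar = {"row_key": "C001", "profile:name": "Somchai", "loan:amount": 1000}

# Graph (e.g. Neo4j): nodes and edges; relationships are first-class data.
graph = {"nodes": ["C001", "LOAN-42"],
         "edges": [("C001", "BORROWED", "LOAN-42")]}

# Key-value (e.g. Redis): an opaque value behind one key -- fastest, least queryable.
key_value = {"customer:C001": '{"name": "Somchai"}'}
```

Choosing among these is a data-architecture decision: the shape you pick determines which queries are cheap later.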
17. Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Task Scheduler (Crontab)
18. What is Data Pipeline?
https://unsplash.com/photos/9AxFJaNySB8
20. Batch vs Streaming Processing
Batch                          | Streaming
-------------------------------|------------------------------
Multiple-record processing     | Per-record processing
Scheduled / manual             | Real-time
Longer processing time         | Shorter processing time
Large data-processing window   | Small data-processing window
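The contrast in the table can be sketched in a few lines. Batch sees the whole window at once on a schedule; streaming updates state record by record. The payment amounts below are invented for the example:

```python
from typing import Iterable, Iterator

def batch_total(records: list) -> float:
    """Batch: the whole window is available at once; run on a schedule."""
    return sum(records)

def streaming_totals(records: Iterable) -> Iterator[float]:
    """Streaming: each record is processed as it arrives; state updates per record."""
    running = 0.0
    for amount in records:
        running += amount
        yield running  # a result is available immediately after every record

payments = [100.0, 250.0, 50.0]
batch_result = batch_total(payments)            # one answer, after the whole batch
stream_results = list(streaming_totals(payments))  # an answer per record
```

Both compute the same final total; the difference is when the answer becomes available and how much data each step must hold.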
26. Why do we use serverless for big data?
● No server maintenance
● Scalable & high performance
● Easier to optimize
● Only pay per use
27. Requirements
● Need a HUGE data warehouse for batch processing
● Our customers have on-premise data at >400 sites
● A data ingestor app needs to be installed at every site
● Data ingestor app must be able to run on
● The data ingestor app must be super robust and easy to install
● Must run automatically every day (task scheduler)
28. When >400 sites upload large files
to your server at the same time..
This is kinda a DDoS!
29. We use Cloud Functions
● Auto-scaling
● Almost zero maintenance!
● But they only accept a <10 MB body size
For larger files, we use:
● Google Cloud Run
● Google Kubernetes Engine
● Google Compute Engine
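The routing decision described above is essentially a size check before upload. A minimal sketch, where the target names and the exact limit are illustrative (Cloud Functions' HTTP request-size quota is roughly 10 MB):

```python
# Route an upload by size: small files go through the Cloud Function endpoint,
# larger ones to a service without the ~10 MB body limit (e.g. Cloud Run).
# The names below are illustrative, not Credit OK's real endpoints.
GCF_LIMIT_BYTES = 10 * 1024 * 1024

def choose_upload_target(payload_size: int) -> str:
    """Pick the ingestion endpoint for a payload of the given size in bytes."""
    if payload_size < GCF_LIMIT_BYTES:
        return "cloud-function"  # auto-scaling, near-zero maintenance
    return "cloud-run"           # handles large bodies and longer uploads
```

The ingestor app at each site would run this check client-side before choosing where to POST the file.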
33. Raw Data Sources
GCS - Auto-triggers GCF when a zipped file is put into the INPUT bucket
GCF - (data cleansing) Converts text encoding (TIS-620, UTF-8)
GCF - (data cleansing) Checks and cleans the CSV format into the best possible shape
GCF - Saves the output CSV to the GCS OUTPUT bucket
GCF - Logs all results for file ingestion reports
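The cleansing core (encoding detection plus CSV normalization) might look roughly like this. A simplified sketch, not Credit OK's actual function: the real step also unzips the input, repairs malformed rows, and writes the result back to the OUTPUT bucket.

```python
import csv
import io

def normalize_csv(raw: bytes) -> list:
    """Decode legacy Thai (TIS-620) or UTF-8 bytes, then re-parse as CSV rows."""
    for encoding in ("utf-8", "tis-620"):  # try UTF-8 first, fall back to Thai
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("unrecognized text encoding")
    # Drop blank lines and re-emit rows in one consistent shape.
    return [row for row in csv.reader(io.StringIO(text)) if row]
```

Python ships a `tis-620` codec in its standard encodings table, so no extra dependency is needed for the Thai-encoding fallback.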
34. Raw Data Sources
Cron - Runs periodically to load CSV data from the OUTPUT bucket
GBQ - Loads data from the OUTPUT bucket into the RAW STAGING table in string format
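A hedged sketch of that scheduled load using the `google-cloud-bigquery` client. All bucket, table, and column names here are placeholders, and actually running the load requires GCP credentials plus the client library:

```python
def output_uri(bucket: str, prefix: str) -> str:
    """GCS wildcard URI covering every cleaned CSV under a prefix."""
    return f"gs://{bucket}/{prefix}/*.csv"

def load_to_raw_staging(uri: str, table_id: str = "project.dataset.raw_staging") -> None:
    """Load cleaned CSVs into the RAW STAGING table with every column as STRING.

    Sketch only: names are placeholders; needs google-cloud-bigquery + credentials.
    """
    from google.cloud import bigquery  # imported lazily so the sketch imports cleanly

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        # Keep everything as STRING at this stage; typing happens in later SQL steps.
        schema=[
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("amount", "STRING"),
        ],
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```

Loading everything as strings first keeps the ingest step robust; bad values fail later in SQL, where they are easier to inspect, rather than killing the whole load job.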
35. Raw Data Sources
GBQ - Cron runs data-cleansing SQL from the RAW STAGING table into the CLEANED STAGING table
GBQ - Cron appends data with SQL from the CLEANED STAGING table into the MAIN table
GBQ - Cron runs data-processing SQL from the MAIN table through intermediate tables until the FINAL tables are ready
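The three scheduled SQL steps above could look roughly like the statements below. Table and column names are invented for the sketch; the real pipeline's SQL is of course specific to Credit OK's schema.

```python
# Illustrative BigQuery SQL for the three cron-driven steps.

CLEANSE_SQL = """
CREATE OR REPLACE TABLE dataset.cleaned_staging AS
SELECT TRIM(customer_id)            AS customer_id,
       SAFE_CAST(amount AS NUMERIC) AS amount  -- turn STRING columns into real types
FROM dataset.raw_staging
WHERE customer_id IS NOT NULL
"""

APPEND_SQL = """
INSERT INTO dataset.main (customer_id, amount)
SELECT customer_id, amount FROM dataset.cleaned_staging
"""

PROCESS_SQL = """
CREATE OR REPLACE TABLE dataset.final AS
SELECT customer_id, SUM(amount) AS total_amount
FROM dataset.main
GROUP BY customer_id
"""
```

`SAFE_CAST` is the key trick in the cleansing step: a value that cannot be cast becomes NULL instead of failing the whole query.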
36. Raw Data Sources
Frequently Used Data
Lumen - Cron dumps FINAL-table data into a real-time database for frequently used data
Laravel - Loads data from Lumen's real-time database via an internal REST API
Vue - Uses the data processed by Laravel
Rarely Used Data
Lumen - Loads data directly from BigQuery
Laravel - Loads and processes data from Lumen
Vue - Uses the data processed by Laravel
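The frequently-used vs rarely-used split is essentially a pre-warmed-cache-with-fallback pattern. A minimal sketch, with a dict standing in for the real-time database and a stub standing in for the warehouse query; all names and values are invented:

```python
# Hot/cold data access: hot keys are dumped nightly into a fast store,
# everything else falls back to the (slow, per-query billed) warehouse.
fast_store = {}

def warehouse_query(key: str) -> dict:
    """Stand-in for a direct BigQuery lookup (slow and billed per query)."""
    return {"customer_id": key, "score": 720}  # dummy result for the sketch

def nightly_dump(hot_keys: list) -> None:
    """Cron job: copy FINAL-table rows for frequently used keys into the fast store."""
    for key in hot_keys:
        fast_store[key] = warehouse_query(key)

def get_customer(key: str) -> dict:
    """Serve from the fast store when possible; hit the warehouse otherwise."""
    return fast_store.get(key) or warehouse_query(key)
```

The design choice is cost-driven: frequent dashboard reads hit the cheap real-time store, while BigQuery is reserved for the rare, ad-hoc lookups.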
37. Summary
● Big data is possible because of technology advancement
● Storing and processing big data requires special technology and knowledge
● Data engineers are the geeks who process data for the team
● A data pipeline is all about automating data processing
● Understanding the data you are going to process is crucial
● Don't forget to log your data pipeline to a monitoring system
● Data engineers are in high demand in Thailand: we have dirty data, we have data scientists, but no one to process the data => data scientists do everything! THAT'S WRONG!
Data engineers are needed!