Presented by Kriangkrai Chaonithi @spicydog
14/11/2019 | KMUTT | Applied Computer Science
Introduction to Data Engineer and Data Pipeline at Credit OK
Hello! My name is Gap
Education
● BS Applied Computer Science (KMUTT)
● MS Computer Engineering (KMUTT)
Work Experience
● Former Android, iOS & PHP Developer at Longdo.COM
● Former R&D Manager at Insightera
● CTO & co-founder at Credit OK
Fields of Interests
● Software Engineering
● Cloud Architecture & Distributed Computing
● Computer Security
● Machine Learning & NLP
https://spicydog.me
Agenda
● What is Big Data?
○ Why is data big?
○ Structured vs Unstructured Data
● Data Engineering
○ Data technology careers
○ What do data engineers do?
○ Skills for data engineers
○ Knowledge & technologies for data engineers
● What is Data Pipeline?
○ ETL - Extract, Transform, Load
○ Batch vs streaming
● Data Pipeline at Credit OK
○ Introduction to GCP technologies
○ Problem and solution on data pipeline
○ Data pipeline architecture in details
● Summary
https://medium.com/@smartrac/the-deep-web-the-dark-web-and-simple-things-2e601ec980ac
What is Big Data?
https://unsplash.com/photos/LqKhnDzSF-8
Why is data big?
● Faster internet and better infrastructure
● Business digitization
● Social network
● IoT & embedded systems
● Automated software
● Etc.
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/KiH2-tdGQRY
Structured vs. Unstructured Data
https://unsplash.com/photos/QBpZGqEMsKg
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Engineering
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Technology Careers
https://unsplash.com/photos/QBpZGqEMsKg
https://www.springboard.com/blog/data-science-career-paths-different-roles-industry/
What do Data Engineers do?
https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
■ Local Storage
■ Network Attached Storage
■ Object Storage
○ Databases Architecture
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
● Document-oriented Database
● Columnar Database
● Graph Database
● Key-value Database
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Crontab (Task Scheduler)
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Task Scheduler (Crontab)
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
What is Data Pipeline?
https://unsplash.com/photos/9AxFJaNySB8
ETL - Extract, Transform, Load
https://unsplash.com/photos/QBpZGqEMsKg
https://www.astera.com/type/blog/etl-process-and-steps/
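The three ETL stages can be sketched in a few lines of Python. This is only an illustration, not the pipeline from the talk: the column names (`customer`, `amount`), the sample rows, and the in-memory list standing in for a warehouse table are all assumptions.

```python
import csv
import io


def extract(raw_csv: str) -> list:
    """Extract: read rows from a CSV source (a string stands in for a file)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list) -> list:
    """Transform: trim whitespace, drop rows without an amount, cast types."""
    cleaned = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({"customer": row["customer"].strip(), "amount": float(amount)})
    return cleaned


def load(rows: list, table: list) -> int:
    """Load: append to the destination (a list stands in for a warehouse table)."""
    table.extend(rows)
    return len(rows)


# Hypothetical sample data for illustration.
RAW = "customer,amount\n Alice ,100.5\nBob,\nCarol, 20\n"
warehouse = []
loaded = load(transform(extract(RAW)), warehouse)
```

Keeping each stage as its own function makes it easy to test the stages separately and to swap the source or destination later.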
Batch vs Streaming Processing
https://unsplash.com/photos/QBpZGqEMsKg
Batch                        | Streaming
Multiple record processing   | Per record processing
Scheduled / manual           | Real-time
Longer processing time       | Shorter processing time
Large window data processing | Small window data processing
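A small Python sketch of the difference: the batch function sees the whole accumulated set at once, while the streaming function updates its result per record. The function names and the sample values are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def batch_total(records: list) -> float:
    """Batch: process the whole accumulated set in one run (e.g. on a schedule)."""
    return sum(records)


def streaming_totals(records: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each record arrives."""
    running = 0.0
    for value in records:
        running += value
        yield running  # a fresh result is available after every record


# Hypothetical sample events.
events = [10.0, 5.0, 2.5]
batch_result = batch_total(events)
stream_results = list(streaming_totals(events))
```

Both approaches reach the same final total; streaming simply exposes intermediate results sooner, at the cost of doing work on every record.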
Credit Scoring Platform on Big Data Analytics
creditok.co
GCP Storages & Databases
Non-serverless
Serverless
GCP Data Analytics
Pipeline Analytics Visualization
Why do we use serverless on big data?
● No server maintenance
● Scalable & high performance
● Easier to optimize
● Only pay per use
Requirements
● Have a HUGE data warehouse for batch processing
● Our customers have on-premise data at >400 sites
● A data ingestor app must be installed at every site
● Data ingestor app must be able to run on
● Data ingestor app must be super robust and easy to install
● Must run automatically every day via a task scheduler
When >400 sites upload large files
to your server at the same time...
This is kind of like a DDoS!
We use cloud functions
● Auto scale
● Almost zero maintenance!
● But they only accept a request body size of <10 MB
For larger files, we use
● Google Cloud Run
● Google Kubernetes Engine
● Google Compute Engine
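One way an ingestor app could act on the size limit is to pick the upload target from the file size. The sketch below is an assumption about how such routing might look, not the actual Credit OK implementation; both endpoint URLs are hypothetical, and the threshold treats "10 MB" as 10 × 1024 × 1024 bytes.

```python
# Assumed interpretation of the "<10 MB" body limit mentioned above.
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024

# Hypothetical endpoints, for illustration only.
FUNCTION_ENDPOINT = "https://example.invalid/ingest-small"
CONTAINER_ENDPOINT = "https://example.invalid/ingest-large"


def choose_endpoint(file_size_bytes: int) -> str:
    """Send small uploads to the cloud function, larger ones to a container service."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    if file_size_bytes < MAX_FUNCTION_BODY_BYTES:
        return FUNCTION_ENDPOINT
    return CONTAINER_ENDPOINT
```

For example, `choose_endpoint(1024)` selects the function endpoint, while a file of exactly `MAX_FUNCTION_BODY_BYTES` is routed to the container endpoint.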
Data Pipeline Architecture
Raw Data Sources (diagram)
GCF - Receive zipped file data over HTTPS
GCF - Save zipped file data to the GCS INPUT bucket
GCS - Automatically trigger GCF when a zipped file is put into the INPUT bucket
GCF - (data cleansing) Process text encoding (TIS-620, UTF-8)
GCF - (data cleansing) Check and clean the CSV format, making it as well-formed as possible
GCF - Save output CSV to the GCS OUTPUT bucket
GCF - Log all the results for file ingestion reports
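The two cleansing steps above can be sketched with the Python standard library. This is a simplified illustration, not the code running in the cloud function: the fallback order (UTF-8 first, then TIS-620) and the specific cleaning rules (trim cells, drop blank lines, pad short rows to the header width) are assumptions. Some TIS-620 byte sequences also happen to be valid UTF-8, so a real pipeline may need a stronger detection heuristic.

```python
import csv
import io


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to the Thai TIS-620 encoding."""
    try:
        return raw.decode("utf-8-sig")  # also strips a UTF-8 byte order mark
    except UnicodeDecodeError:
        return raw.decode("tis-620")


def clean_csv(text: str) -> str:
    """Trim cells, drop blank lines, and pad short rows to the header width."""
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = [[cell.strip() for cell in row] for row in reader]
    rows = [row for row in rows if any(row)]  # drop rows with no content
    if not rows:
        return ""
    width = len(rows[0])
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Rows longer than the header are left unchanged in this sketch.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


# Hypothetical sample file content, encoded as TIS-620.
raw_bytes = "ชื่อ,ยอด\r\n สมชาย ,100\r\n\r\nสมหญิง\r\n".encode("tis-620")
cleaned = clean_csv(decode_text(raw_bytes))
```

The output is normalized to UTF-8-ready text with `\n` line endings, which keeps the later load step simple.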
Cron - Run periodically to load CSV data from the OUTPUT bucket
GBQ - Load data from the OUTPUT bucket into the RAW STAGING table, with every column as a string
GBQ - Cron runs data cleansing SQL from the RAW STAGING table to the CLEANED STAGING table
GBQ - Cron runs SQL to append data from the CLEANED STAGING table to the MAIN table
GBQ - Cron runs data processing SQL tasks from the MAIN table through other tables until the FINAL tables are ready
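The staging flow can be tried locally with Python's built-in `sqlite3` module standing in for BigQuery. Everything specific here is an assumption made for illustration: the table and column names, the sample rows, and the cleansing rule (keep only rows whose amount looks numeric). The SQL dialect is SQLite's, not BigQuery's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_staging (customer_id TEXT, amount TEXT, paid_on TEXT);
CREATE TABLE cleaned_staging (customer_id TEXT, amount REAL, paid_on TEXT);
CREATE TABLE main (customer_id TEXT, amount REAL, paid_on TEXT);
""")

# Raw staging: everything is loaded as strings, exactly as it arrived.
conn.executemany(
    "INSERT INTO raw_staging VALUES (?, ?, ?)",
    [
        (" C001 ", "100.50", "2019-11-01"),
        ("C002", "n/a", "2019-11-01"),  # invalid amount, filtered out below
        ("C003", " 20 ", "2019-11-02"),
    ],
)

# Step 1: cleanse and cast from RAW STAGING into CLEANED STAGING.
conn.execute("""
INSERT INTO cleaned_staging
SELECT TRIM(customer_id), CAST(TRIM(amount) AS REAL), TRIM(paid_on)
FROM raw_staging
WHERE TRIM(amount) GLOB '[0-9]*' AND TRIM(amount) NOT GLOB '*[^0-9.]*'
""")

# Step 2: append the cleaned rows to the MAIN table.
conn.execute("INSERT INTO main SELECT * FROM cleaned_staging")

# Step 3: derive a FINAL table from MAIN.
conn.execute("""
CREATE TABLE final_daily_totals AS
SELECT paid_on, SUM(amount) AS total FROM main GROUP BY paid_on
""")

final_rows = conn.execute(
    "SELECT paid_on, total FROM final_daily_totals ORDER BY paid_on"
).fetchall()
main_count = conn.execute("SELECT COUNT(*) FROM main").fetchone()[0]
```

Loading raw data as strings first means a malformed value never breaks the load itself; bad rows are handled in SQL where they can be inspected and reported.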
Frequently Used Data
Lumen - Cron dumps frequently used data from the FINAL tables to a real-time database
Laravel - Load data from Lumen's real-time database via an internal REST API
Vue - Use data processed from Laravel
Rarely Used Data
Lumen - Load data from BQ directly
Laravel - Load and process data from Lumen
Vue - Use data processed from Laravel
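The split between frequently and rarely used data resembles a read path with a pre-populated fast store and a fallback to the warehouse. The Python sketch below only illustrates that idea; the class, its method names, and the dictionaries standing in for the real-time database and BigQuery are assumptions, and the real services in the talk are written with Lumen and Laravel.

```python
class DataService:
    """Serve frequently used data from a fast store, the rest from the warehouse."""

    def __init__(self, warehouse: dict):
        self.warehouse = warehouse  # stands in for the analytical warehouse
        self.realtime_db = {}       # stands in for the real-time database
        self.warehouse_reads = 0    # counts direct (slower) warehouse reads

    def refresh_frequent(self, keys) -> None:
        """Scheduled job: copy frequently used FINAL data to the fast store."""
        for key in keys:
            self.realtime_db[key] = self.warehouse[key]

    def get(self, key):
        """Read from the fast store when possible, else query the warehouse."""
        if key in self.realtime_db:
            return self.realtime_db[key]
        self.warehouse_reads += 1
        return self.warehouse[key]


# Hypothetical FINAL table contents.
service = DataService({"score:C001": 720, "history:C001": [100.5, 20.0]})
service.refresh_frequent(["score:C001"])
score = service.get("score:C001")      # served from the fast store
history = service.get("history:C001")  # falls back to the warehouse
```

The scheduled refresh keeps common reads fast and cheap, while rare queries still have access to the full data set.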
Summary
● Big data is possible because of technological advancement
● Storing and processing big data requires specialized technology and knowledge
● Data engineers are the geeks who process data for the team
● A data pipeline is all about automating data processing
● Understanding the data you are going to process is crucial
● Don’t forget to log data pipeline activity to a monitoring system
● Data engineers are in high demand in Thailand: we have dirty data and we have data scientists, but we have
no one to process the data, so data scientists do everything. THAT’S WRONG!
Data Engineer is in need
Question & Answer
Time is short, let’s utilize the networks.
Feel free to connect with me via spicydog.me

[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Introduction to Data Engineer and Data Pipeline at Credit OK

  • 1. Presented by Kriangkrai Chaonithi @spicydog 14/11/2019 | KMUTT | Applied Computer Science Introduction to Data Engineer and Data Pipeline at Credit OK
  • 2. Hello! My name is Gap Education ● BS Applied Computer Science (KMUTT) ● MS Computer Engineering (KMUTT) Work Experience ● Former Android, iOS & PHP Developer at Longdo.COM ● Former R&D Manager at Insightera ● CTO & co-founder at Credit OK Fields of Interest ● Software Engineering ● Cloud Architecture & Distributed Computing ● Computer Security ● Machine Learning & NLP https://spicydog.me
  • 3. Agenda ● What is Big Data? ○ Why is data big? ○ Structured vs Unstructured Data ● Data Engineering ○ Data technology careers ○ What do data engineers do? ○ Skills for data engineers ○ Knowledge & technologies for data engineers ● What is Data Pipeline? ○ ETL - Extract, Transform, Load ○ Batch vs streaming ● Data Pipeline at Credit OK ○ Introduction to GCP technologies ○ Problem and solution on data pipeline ○ Data pipeline architecture in detail ● Summary
  • 5. What is Big Data? https://unsplash.com/photos/LqKhnDzSF-8
  • 6. Why is data big? ● Faster internet, better infrastructure ● Business digitization ● Social networks ● IoT & embedded systems ● Automated software ● Etc. https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/KiH2-tdGQRY
  • 7. Structured vs. Unstructured Data https://unsplash.com/photos/QBpZGqEMsKg https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
  • 10. What do Data Engineers do? https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
  • 11. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 12. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ■ Local Storage ■ Network Attached Storage ■ Object Storage ○ Databases Architecture ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 13. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 14. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ● Document-oriented Database ● Columnar Database ● Graph Database ● Key-value Database ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Crontab (Task Scheduler) https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 15. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ● Document-oriented Database ● Columnar Database ● Graph Database ● Key-value Database ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Crontab (Task Scheduler) https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 16. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation
  • 17. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Task Scheduler (Crontab) https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 18. What is Data Pipeline? https://unsplash.com/photos/9AxFJaNySB8
  • 19. ETL - Extract, Transform, Load https://unsplash.com/photos/QBpZGqEMsKg https://www.astera.com/type/blog/etl-process-and-steps/
  • 20. Batch vs Streaming Processing https://unsplash.com/photos/QBpZGqEMsKg Batch: ● Multiple record processing ● Scheduled / manual ● Longer processing time ● Large window data processing Streaming: ● Per record processing ● Real-time ● Shorter processing time ● Small window data processing
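The batch vs streaming contrast on slide 20 can be sketched in a few lines. This is only an illustration: the function names and the sample payment values are made up, and a running total stands in for whatever processing a real pipeline would do.

```python
from typing import Iterable, Iterator, List


def process_batch(records: List[float]) -> float:
    """Batch: the whole window of records is available, so process it in one go."""
    return sum(records)


def process_stream(records: Iterable[float]) -> Iterator[float]:
    """Streaming: handle each record as it arrives and emit an updated result."""
    running_total = 0.0
    for record in records:
        running_total += record
        yield running_total


if __name__ == "__main__":
    payments = [120.0, 80.0, 50.0]  # made-up sample records
    print(process_batch(payments))         # one result after the whole batch
    print(list(process_stream(payments)))  # one result per record
```

The batch function needs every record up front and answers once; the streaming generator never holds more than one record and answers after each one.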
  • 21. Credit Scoring Platform on Big Data Analytics creditok.co
  • 23. GCP Storage & Databases: Non-serverless vs. Serverless
  • 24. GCP Data Analytics: Pipeline, Analytics, Visualization
  • 26. Why do we use serverless for big data? ● No server maintenance ● Scalable & high performance ● Easier to optimize ● Pay only for what you use
  • 27. Requirements ● Have a HUGE data warehouse for batch processing ● Our customers have on-premises data at >400 sites ● A data ingestor app must be installed at every site ● Data ingestor app must be able to run on ● Data ingestor app must be super robust and easy to install ● Must run automatically every day via a task scheduler
  • 28. When >400 sites upload large files to your server at the same time... This is kind of a DDoS!
  • 29. We use cloud functions ● Auto scale ● Almost zero maintenance! ● But they only accept a body size of <10 MB For larger files, we use: Google Cloud Run, Google Kubernetes Engine, Google Compute Engine
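One way to picture the size limit from slide 29 is a small routing check on the uploader side. This is a sketch under assumptions: the slide does not say how uploads are routed, the target names are hypothetical labels, and reading "<10 MB" as 10 * 1024 * 1024 bytes is a guess.

```python
# Assumed reading of the "<10 MB" body limit mentioned on the slide.
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024


def choose_upload_target(size_bytes: int) -> str:
    """Pick where an upload should go based on its size.

    The returned labels are hypothetical; "large_file_service" stands for
    whichever larger-capacity service handles files over the limit.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < MAX_FUNCTION_BODY_BYTES:
        return "cloud_function"
    return "large_file_service"


if __name__ == "__main__":
    print(choose_upload_target(2 * 1024 * 1024))
    print(choose_upload_target(250 * 1024 * 1024))
```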
  • 31. Data Pipeline Architecture (diagram; inputs: two Raw Data Source nodes)
  • 32. GCF - Receive zipped file data over HTTPS ● GCF - Save zipped file data to the GCS INPUT bucket
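The ingestion step on slide 32 could look roughly like the handler below. It is a minimal sketch, not the actual Cloud Function: a plain dict stands in for the GCS INPUT bucket so the example runs without cloud credentials, and the `site_id` argument and the object naming scheme are assumptions.

```python
import io
import zipfile
from datetime import datetime, timezone
from typing import Dict


def ingest_upload(site_id: str, body: bytes, input_bucket: Dict[str, bytes]) -> str:
    """Validate an uploaded zip and store it under a per-site object name.

    `input_bucket` is an in-memory stand-in for the INPUT bucket.
    """
    # Reject anything that is not a readable zip archive before storing it.
    if not zipfile.is_zipfile(io.BytesIO(body)):
        raise ValueError("upload is not a valid zip archive")
    # Hypothetical naming scheme: <site>/<UTC timestamp>.zip
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    object_name = f"{site_id}/{stamp}.zip"
    input_bucket[object_name] = body
    return object_name
```

Keeping this function thin (validate, then store) leaves the heavier cleansing work to the step triggered by the bucket.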
  • 33. GCS - Automatically triggers a GCF when a zipped file is put in the INPUT bucket ● GCF - (data cleansing) Normalize text encoding (TIS-620, UTF-8) ● GCF - (data cleansing) Check and clean the CSV format, repairing it as far as possible ● GCF - Save the output CSV to the GCS OUTPUT bucket ● GCF - Log all results for file ingestion reports
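The two cleansing steps on slide 33 (text encoding, then CSV format) might be sketched as follows. The specific rules are assumptions chosen for illustration: try UTF-8 first and fall back to the Thai TIS-620 encoding, trim whitespace in cells, drop blank rows, and pad short rows to the header width. The real cleaning rules are not given on the slide.

```python
import csv
import io


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to the Thai TIS-620 encoding."""
    try:
        return raw.decode("utf-8-sig")  # also drops a leading byte-order mark
    except UnicodeDecodeError:
        return raw.decode("tis-620")


def clean_csv(text: str) -> str:
    """Trim cells, drop blank rows, and pad short rows to the header width."""
    rows = [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(text))]
    rows = [row for row in rows if any(row)]
    if not rows:
        return ""
    width = len(rows[0])  # the first non-blank row is treated as the header
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()
```

Trying UTF-8 first works here because Thai text encoded as TIS-620 is almost never valid UTF-8, so the strict decode fails and the fallback takes over.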
  • 34. Cron - Runs periodically to load CSV data from the OUTPUT bucket ● GBQ - Load data from the OUTPUT bucket into the RAW STAGING table, with every column as a string
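Loading everything "in string format" means the staging schema can be derived from the CSV header alone. The sketch below builds such a schema as a list of field dictionaries with `name`, `type`, and `mode` keys, the shape BigQuery uses for JSON schema definitions. The function name is hypothetical, and the actual load job is not shown.

```python
import csv
import io
from typing import Dict, List


def all_string_schema(csv_text: str) -> List[Dict[str, str]]:
    """Build a schema where every CSV column is a nullable STRING."""
    try:
        header = next(csv.reader(io.StringIO(csv_text)))
    except StopIteration:
        raise ValueError("CSV has no header row") from None
    return [
        {"name": name.strip(), "type": "STRING", "mode": "NULLABLE"}
        for name in header
    ]


if __name__ == "__main__":
    print(all_string_schema("id,amount,paid_at\n1,10.5,2019-11-14\n"))
```

Typing is deferred on purpose: a load with all-string columns rarely fails on dirty values, and the type casts can happen later in SQL.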
  • 35. GBQ - Cron runs data cleansing SQL from the RAW STAGING table into the CLEANED STAGING table ● GBQ - Cron runs SQL that appends data from the CLEANED STAGING table to the MAIN table ● GBQ - Cron runs data processing SQL tasks from the MAIN table through intermediate tables until the FINAL tables are ready
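The RAW STAGING to CLEANED STAGING to MAIN chain on slide 35 can be tried locally. In this sketch SQLite stands in for BigQuery so it runs with the standard library alone; the table and column names and the cleansing rule (drop rows with an empty id or amount, then cast) are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def run_staging_chain(conn: sqlite3.Connection, raw_rows: Iterable[Tuple[str, str]]) -> int:
    """Load raw string rows, clean them, append to main; return main's row count."""
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS raw_staging (id TEXT, amount TEXT);
        CREATE TABLE IF NOT EXISTS cleaned_staging (id INTEGER, amount REAL);
        CREATE TABLE IF NOT EXISTS main (id INTEGER, amount REAL);
        DELETE FROM raw_staging;      -- staging tables are rebuilt on every run
        DELETE FROM cleaned_staging;
    """)
    cur.executemany("INSERT INTO raw_staging VALUES (?, ?)", raw_rows)
    # Cleansing step: drop rows with empty fields, then cast strings to types.
    cur.execute("""
        INSERT INTO cleaned_staging
        SELECT CAST(TRIM(id) AS INTEGER), CAST(TRIM(amount) AS REAL)
        FROM raw_staging
        WHERE TRIM(id) <> '' AND TRIM(amount) <> ''
    """)
    # Append step: main only ever grows.
    cur.execute("INSERT INTO main SELECT id, amount FROM cleaned_staging")
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM main").fetchone()[0]
```

Calling it once per scheduled run mirrors the cron-driven flow: staging is wiped and refilled each time, while the main table accumulates.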
  • 36. Frequently Used Data: Lumen - Cron dumps frequently used data from the FINAL tables to a real-time database ● Laravel - Loads data from Lumen's real-time database via an internal REST API ● Vue - Uses the data processed by Laravel. Rarely Used Data: Lumen - Loads data from BQ directly ● Laravel - Loads and processes data from Lumen ● Vue - Uses the data processed by Laravel
  • 37. Summary ● Big data is possible because of technological advancement ● Storing and processing big data requires special technology and knowledge ● Data engineers are the geeks who work on processing data for the team ● A data pipeline is all about automating data processing ● Understanding the data you are going to process is crucial ● Don’t forget to log the data pipeline to a monitoring system ● Data engineers are in high demand in Thailand: we have dirty data, we have data scientists, but we have no one to process the data => data scientists do everything! THAT’S WRONG! Data engineers are needed
  • 39. Time is short, so let’s make use of our networks. Feel free to connect with me via spicydog.me
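The split between frequently and rarely used data on slide 36 amounts to a lookup with a fallback. The sketch below is written in Python for consistency with the other examples, although the slide names Lumen, Laravel, and Vue for the real services. The dict standing in for the real-time database and the `query_warehouse` callable are hypothetical.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def load_table(
    table: str,
    realtime_db: Dict[str, Rows],
    query_warehouse: Callable[[str], Rows],
) -> Rows:
    """Serve frequently used tables from the real-time copy, others from the warehouse."""
    if table in realtime_db:
        return realtime_db[table]  # dumped ahead of time by the scheduled job
    return query_warehouse(table)  # rarely used data: query the warehouse directly
```

Only tables that are worth pre-dumping pay the storage and refresh cost; everything else accepts slower warehouse queries.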