Introduction to
Data Engineering
Vivek A. Ganesan
vivganes@gmail.com
Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved 1
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?
Introduction
o What’s with the name?
o All other names were taken :)
o Gods = Geeks on Data
o Well, it is now Geeking out on Data
o Why a Data Geek?
o Geeks are cool
o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)
Data, Data, Data!
• Significant increase in data (Volume)
• Social Networks
• Transaction Logs
• Fast streams of data (Velocity)
• Sensor data
• Machine-to-machine data
• Different kinds of data (Variety)
• Text
• Audio
• Video
• This trend is only going to grow!
Note : EB = Exabyte = 1,000 Petabytes (1 million Terabytes)
Big Data Trends
Before Big Data
• Life was simple … well mostly
• The ETL engineers managed data
pipelines
• The Data Scientists (they weren’t
called that, btw, they were
mostly Statisticians who
programmed in SAS, SPSS or S)
did the analysis
• Data Warehouses, Data marts
and OLAP cubes were the
platforms
• Data Analysts mostly generated
reports but they were proficient
in SQL, Excel, Pivot Tables etc.
• Data Architects …
well, they architected :)
• They managed :
• Data models
• Star Schemas
• Data Governance
• Master Data
Management
(MDM)
• Data Security
• For the most part, they
had to coax different
groups to share data
Big Data – What Changed?
• Life … got interesting
• Huge data volumes – ETL became
a problem
• Traditional Statistical tools
couldn’t handle the volume
• Data Warehouses, Data marts
and OLAP cubes not primary
analytical means – “in situ”
analysis preferred i.e. no moving
data to an analytics platform
• Data Analysts still on point for
reports but now they no longer
had SQL interfaces (thanks to
NoSQL and MapReduce)
• Data Architects …
well, they still need to
architect :)
• Still need :
• Data models
• Data Governance
• Data Security
• For the most part, they
had to coax different
groups to share data
• They have to do all of
this when the
technology is rapidly
evolving
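To illustrate the point above about analysts losing their SQL interfaces, here is a minimal sketch of how a simple `SELECT user, SUM(amount) ... GROUP BY user` might be expressed in MapReduce style. It is plain single-process Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `total_by_user`) and the record layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]  # hypothetical layout: (user, amount)


def map_phase(records: Iterable[Record]) -> Iterator[Record]:
    """Emit one (key, value) pair per input record."""
    for user, amount in records:
        yield user, amount


def shuffle(pairs: Iterable[Record]) -> Dict[str, List[float]]:
    """Group values by key, as the framework would do between map and reduce."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse each key's values to a single result (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_user(records: Iterable[Record]) -> Dict[str, float]:
    """Equivalent of: SELECT user, SUM(amount) FROM records GROUP BY user."""
    return reduce_phase(shuffle(map_phase(records)))
```

For example, `total_by_user([("alice", 10.0), ("bob", 4.0), ("alice", 2.5)])` returns `{"alice": 12.5, "bob": 4.0}`. One line of SQL turns into three hand-written phases, which is roughly the gap the "SQL semantics" opportunity refers to.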
Life in the Big Data Universe
• The Good
• Data recognized as an asset
• Data Driven Products more
common
• Working with Data is cool
• The Bad
• Complexity is overwhelming
• No sophisticated toolset yet
• Technology is fast changing
• The Ugly
• No SQL!
• Security
• Governance
• Performance
• The Opportunity
• Solve for :
• SQL semantics
• Data Governance
• Data Security
• Benchmarking, Profiling and Performance measurement tools
• Build :
• Real-time solutions
• Data Marts/Data
Warehouses on top
Life in the Big Data Universe
Data Scientist, Data Analyst, Data Engineer
• Building Models
• Validation/Testing
• Algorithms
• Continuous
Improvement
• Knowledge of :
• Statistics
• Linear Algebra
• Machine
Learning
• R, Matlab etc.
• Deep Domain
Knowledge
• Report Generation
• Data Exploration
• Hypotheses Testing
• Pattern Discovery
• Correlations
• Serendipitous
Discovery
• Data Pipelines
• Manage Platforms
• Productionalize
Algorithms
• Agile Development
• Knowledge of :
• Platforms
• Algorithms
• Java, C++ etc.
• Scripting languages like Python
Data Engineering
• Strong CS Background
• Algorithms
• Database theory
• Scripting languages
• Server side languages
• Distributed Systems Background
• Clusters
• Networking
• Monitoring/Performance
• Data Science/Machine Learning
• Search/IR
• Text Analytics
• Classification
• Clustering
• Infrastructure
• Hadoop
• Cassandra
• MongoDB
• Platforms
• Solr
• Hive
• HBase
• Mahout
• Applications
• Recommendation
Engines
• Fraud Prevention
• Disease Prevention
Data Engineer’s Role
• Data Dialysis – Cleaning up Data
• Hard to do at Scale
• Newer tools in this space
• Great scope for innovation
• ETL -> ELT
• Distributed Bulk loading
• Full-fledged data pipelines
• Supporting both data scientists
and data analysts
• Productionalizing algorithms
• Production support
• Optimization
• A/B Testing and Continuous
Improvement
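The "ETL -> ELT" shift and the data clean-up step listed above can be sketched as follows: raw rows are loaded untouched first, and the cleaning and transformation then run inside the data store itself. This is only a small illustration using Python's built-in `sqlite3` as a stand-in for a real platform; the table names (`raw_events`, `clean_events`), the columns and the `run_elt` function are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

RawRow = Tuple[Optional[str], Optional[str]]  # hypothetical layout: (user_id, amount)


def run_elt(rows: Iterable[RawRow]) -> List[Tuple[str, float]]:
    """Load raw rows as-is, then clean and aggregate them with SQL in the store."""
    conn = sqlite3.connect(":memory:")
    try:
        # Extract + Load: no cleaning yet, everything lands as text.
        conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
        conn.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)

        # Transform: normalise keys, cast types, drop unusable rows.
        conn.execute(
            """
            CREATE TABLE clean_events AS
            SELECT TRIM(LOWER(user_id)) AS user_id,
                   CAST(amount AS REAL) AS amount
            FROM raw_events
            WHERE user_id IS NOT NULL
              AND TRIM(user_id) <> ''
              AND TRIM(amount) <> ''
            """
        )
        return conn.execute(
            "SELECT user_id, SUM(amount) FROM clean_events "
            "GROUP BY user_id ORDER BY user_id"
        ).fetchall()
    finally:
        conn.close()
```

Keeping the raw table around is the main design point: if a cleaning rule turns out to be wrong, the transform can be rerun without extracting the data again.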
About this Meetup : Structure
• Agile teams
• Monthly Scrum
• Week 1 : Introduction to Problem
• Week 2 : Algorithm + Platform
• Week 3 : Technical help
(Algorithm, Platform, Testing and
Deployment)
• Week 4 : Panel + Demo
• Showcase Startups/Experts in
the space
• Teams show demos
• Panel judges winners
• We might have prizes (needs
to be figured out)
• Weekly Meetup (on
Mondays)
• Might move to a bigger
venue if there is
enough demand
About this Meetup : Schedule
• May 29th : Kickoff
• Scrum 1
• June 3rd – Collaborative
Filtering Introduction
• June 10th – MongoDB Introduction
• June 17th – Analytics on MongoDB
• June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will
be fun!
• Oh, the name :)
Questions? Comments?
Thank You!
E-mail: vivganes@gmail.com
Twitter : onevivek
