2. Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved 1
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?
3. Introduction
o What’s with the name?
o All other names were taken
o Gods = Geeks on Data
o Well, it is now Geeking out on Data
o Why a Data Geek?
o Geeks are cool
o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)
4. Data, Data, Data!
• Significant increase in data (Volume)
• Social Networks
• Transaction Logs
• Fast streams of data (Velocity)
• Sensor data
• Machine-to-machine data
• Different kinds of data (Variety)
• Text
• Audio
• Video
• This trend is only going to grow!
Note : EB = Exabyte = 1,000 Petabytes (1 million Terabytes)
Big Data Trends
5. Before Big Data
• Life was simple … well mostly
• The ETL engineers managed data pipelines
• The Data Scientists did the analysis (they weren’t called that, btw; they were mostly Statisticians who programmed in SAS, SPSS or S)
• Data Warehouses, Data marts and OLAP cubes were the platforms
• Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
• Data Architects … well, they architected
• They managed :
• Data models
• Star Schemas
• Data Governance
• Master Data Management (MDM)
• Data Security
• For the most part, they had to coax different groups to share data
6. Big Data – What Changed?
• Life … got interesting
• Huge data volumes – ETL became a problem
• Traditional Statistical tools couldn’t handle the volume
• Data Warehouses, Data marts and OLAP cubes not primary analytical means – “in situ” analysis preferred, i.e. no moving data to an analytics platform
• Data Analysts still on point for reports but now they no longer have SQL interfaces (thanks to NoSQL and MapReduce)
• Data Architects … well, they still need to architect
• Still need :
• Data models
• Data Governance
• Data Security
• For the most part, they had to coax different groups to share data
• They have to do all of this when the technology is rapidly evolving
7. Life in the Big Data Universe
• The Good
• Data recognized as an asset
• Data Driven Products more common
• Working with Data is cool
• The Bad
• Complexity is overwhelming
• No sophisticated toolset yet
• Technology is fast changing
• The Ugly
• No SQL!
• Security
• Governance
• Performance
• The Opportunity
• Solve for :
• SQL semantics
• Data Governance
• Data Security
• Benchmarking, Profiling and Performance measurement tools
• Build :
• Real-time solutions
• Data Marts/Data Warehouses on top
8. Life in the Big Data Universe
Data Scientist
• Building Models
• Validation/Testing
• Algorithms
• Continuous Improvement
• Knowledge of :
• Statistics
• Linear Algebra
• Machine Learning
• R, MATLAB etc.
• Deep Domain Knowledge
Data Analyst
• Report Generation
• Data Exploration
• Hypothesis Testing
• Pattern Discovery
• Correlations
• Serendipitous Discovery
Data Engineer
• Data Pipelines
• Manage Platforms
• Productionalize Algorithms
• Agile Development
• Knowledge of :
• Platforms
• Algorithms
• Java, C++ etc.
• Scripting languages like Python
9. Data Engineering
• Strong CS Background
• Algorithms
• Database theory
• Scripting languages
• Server side languages
• Distributed Systems Background
• Clusters
• Networking
• Monitoring/Performance
• Data Science/Machine Learning
• Search/IR
• Text Analytics
• Classification
• Clustering
• Infrastructure
• Hadoop
• Cassandra
• MongoDB
• Platforms
• Solr
• Hive
• HBase
• Mahout
• Applications
• Recommendation Engines
• Fraud Prevention
• Disease Prevention
10. Data Engineer’s Role
• Data Dialysis – Cleaning up Data
• Hard to do at Scale
• Newer tools in this space
• Great scope for innovation
• ETL -> ELT
• Distributed Bulk loading
• Full-fledged data pipelines
• Supporting both data scientists and data analysts
• Productionalizing algorithms
• Production support
• Optimization
• A/B Testing and Continuous Improvement
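The "ETL -> ELT" shift above can be illustrated with a minimal sketch. It is not from the original talk: the sample rows, the table names and the functions `etl` and `elt` are all hypothetical, and an in-memory SQLite database stands in for a real data platform. In ETL the cleaning happens in application code before loading; in ELT the raw data is loaded first and cleaned inside the store.

```python
import sqlite3

# Hypothetical raw input: (name, value-as-text), including one dirty row.
RAW_ROWS = [
    ("alice", " 42 "),
    ("bob", "not-a-number"),
    ("carol", "7"),
]


def etl(conn, rows):
    """ETL: transform in application code, then load only the clean rows."""
    cleaned = []
    for name, value in rows:
        value = value.strip()
        if value.isdigit():  # drop rows whose value is not a plain integer
            cleaned.append((name, int(value)))
    conn.execute("CREATE TABLE etl_events (name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO etl_events VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT name, value FROM etl_events ORDER BY name"
    ).fetchall()


def elt(conn, rows):
    """ELT: load the raw rows untouched, then transform inside the store."""
    conn.execute("CREATE TABLE raw_events (name TEXT, value TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)
    # The cleaning step now runs as SQL where the data already lives.
    conn.execute(
        """
        CREATE TABLE elt_events AS
        SELECT name, CAST(TRIM(value) AS INTEGER) AS value
        FROM raw_events
        WHERE TRIM(value) <> '' AND TRIM(value) NOT GLOB '*[^0-9]*'
        """
    )
    return conn.execute(
        "SELECT name, value FROM elt_events ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("ETL:", etl(conn, RAW_ROWS))
    print("ELT:", elt(conn, RAW_ROWS))
```

Both paths end with the same clean table. The difference is where the work runs, which is why ELT tends to fit platforms that can process data in place at scale.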
11. About this Meetup : Structure
• Agile teams
• Monthly Scrum
• Week 1 : Introduction to Problem
• Week 2 : Algorithm + Platform
• Week 3 : Technical help (Algorithm, Platform, Testing and Deployment)
• Week 4 : Panel + Demo
• Showcase Startups/Experts in the space
• Teams show demos
• Panel judges winners
• We might have prizes (needs to be figured out)
• Weekly Meetup (on Mondays)
• Might move to a bigger venue if there is enough demand
12. About this Meetup : Schedule
• May 29th : Kickoff
• Scrum 1
• June 3rd – Collaborative Filtering Introduction
• June 10th – MongoDB Introduction
• June 17th – Analytics on MongoDB
• June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will be fun!
• Oh, the name