7 Key Recipes for
Data Engineering
Introduction
We will explore 7 key recipes about Data
Engineering.
The 5th is absolutely game changing!
Thank You
Bachir AIT MBAREK
BI and Big Data Consultant
About Me
Jonathan WINANDY
Lead Data Engineer:
- Data Lake building,
- Audit / Coaching,
- Spark Training.
Founder of Univalence (BI / Big Data)
Co-Founder of CYM (IoT / Predictive Maintenance),
Craft Analytics† (BI / Big Data),
and Valwin (Health Care Data).
2016 has been amazing for Data Engineering!
but ...
1. It’s all about our organisations!
Data engineering is not about scaling computation.
Data engineering is not a support function for Data Scientists[1].
[1] whatever they are nowadays
Instead, Data engineering enables access to Data!
Access to Data … in complex organisations.
[Diagram: a holding with Product, Ops, BI, Marketing and You; labels: data, new data.]
[Diagram: Entity 1 … Entity N, each with its own Marketing and IT, plus Marketing and You; labels: data, new data.]
It’s very frustrating!
We run a support group meetup, if you are interested: Paris Data Engineers!
Small tips:
Only one Hadoop cluster (no TEST/REC/INT/PREPROD).
No Air-Data-Eng, it helps no one.
Radical transparency with other teams.
Hack that sh**.
2. Optimising our work
There are 3 key concerns governing our decisions:
Lead time
Impact
Failure management
Lead time (noun):
The period of time between the initial phase of a process and the emergence of results, as between the planning and completed manufacture of a product.
Short lead times are essential!
The Elastic stack helps a lot in this area.
Impact
To have impact, we have to analyse beyond immediate needs. That way, we’re able to provide solutions to entire classes of problems.
Failure management
Things fail, be prepared!
On the same morning, both the RER A (public transport) and our Hadoop job tracker can fail.
Unprepared failures may pile up and lead to huge waste.
How to mitigate failure in 7 questions:
“What is likely to fail?” $componentName_____
“How? (root cause)”
“Can we know if this will fail?”
“Can we prevent this failure?”
“What are the impacts?”
“How to fix it when it happens?”
“Can we facilitate today?”
Track your work!
3. Staging the Data
Data is moving around, freeze it!
Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging in distributed solutions:
● In streaming, with Kafka, we may retain logs in Kafka for several months.
● In batch, staging in HDFS may retain source data for years.
Modern staging anti-patterns:
Dropping the destination before moving the data.
Having incomplete data visible.
Short log retention in streams (=> new failure modes).
Modern staging should be seen as a persistent data structure.
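One way to avoid the first two anti-patterns is to write into a temporary sibling directory and publish it with a single rename. The following is only a minimal sketch on a local filesystem with java.nio; the names `StagingPublish` and `publish` are hypothetical, and on HDFS the rename would go through the Hadoop filesystem API instead (not shown here).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object StagingPublish {
  /** Writes `lines` into a temporary directory, then renames it to `dest`.
    * Readers never see a half-written `dest`, and nothing existing is dropped. */
  def publish(dest: Path, fileName: String, lines: Seq[String]): Path = {
    require(!Files.exists(dest), s"refusing to overwrite existing staging dir: $dest")
    Files.createDirectories(dest.getParent)
    // The temporary directory sits next to the destination, so the rename stays on one filesystem.
    val tmp = Files.createTempDirectory(dest.getParent, ".tmp-")
    Files.write(tmp.resolve(fileName), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    // Single rename: the data becomes visible all at once.
    Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE)
  }
}
```

Refusing to overwrite an existing destination keeps the staging area append-only, in line with treating staging as a persistent data structure.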
HDFS staging:
/staging
  |-- $tablename
    |-- dtint=$dtint
      |-- dsparam.name=$dsparam.value
        |-- ...
          |-- ...
            |-- uuid=$uuid
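The layout above can be built by a small helper. This sketch assumes that each level is nested under the previous one, that `dtint` is an integer date such as 20170115, and that the data source parameters are ordered; `StagingPath` and `build` are hypothetical names.

```scala
import java.util.UUID

object StagingPath {
  /** Builds /staging/$tablename/dtint=$dtint/<name>=<value>/.../uuid=$uuid.
    * `dsParams` keeps its order, since each parameter becomes one directory level. */
  def build(table: String, dtint: Int, dsParams: Seq[(String, String)], uuid: UUID): String = {
    val partitions =
      Seq("dtint" -> dtint.toString) ++ dsParams ++ Seq("uuid" -> uuid.toString)
    (Seq("/staging", table) ++ partitions.map { case (k, v) => s"$k=$v" }).mkString("/")
  }
}
```

Ending the path with a fresh UUID means every load lands in its own directory, so a rerun adds a new version instead of replacing an earlier one.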
4. Using RDDs or Dataframes
Dataframes have great performance,
but are untyped and foreign.
RDDs have a robust Scala API,
but are a pain to map from data sources.
btw, SQL is AWESOME
Dataframes            | RDDs
Predicate push down   | Types !!
Bare metal / unboxed  | Nested structures
Connectors            | Better unit tests
Pluggable Optimizer   | Fewer stages
SQL + Meta            | Scala * Scala
We should use RDDs in large ETL jobs:
Loading the data with dataframe APIs,
Basic case class mapping (or better, Datasets),
Typesafe transformations,
Storing with dataframe APIs.
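The middle two steps can be sketched without Spark. Here a `Map[String, String]` stands in for an untyped Dataframe row; the `Visit` class, its field names and the column names are all hypothetical, and a real job would map Spark rows (or use Datasets) instead.

```scala
object TypedMapping {
  // Stand-in for an untyped Dataframe row; a real job would read Spark rows instead.
  type UntypedRow = Map[String, String]

  final case class Visit(visitorId: String, isRobot: Boolean, pages: Int)

  /** Basic case class mapping: returns None when a field is missing or malformed. */
  def toVisit(row: UntypedRow): Option[Visit] =
    for {
      id    <- row.get("visitor_id")
      robot <- row.get("is_robot").flatMap(s => scala.util.Try(s.toBoolean).toOption)
      pages <- row.get("pages").flatMap(s => scala.util.Try(s.toInt).toOption)
    } yield Visit(id, robot, pages)

  /** Typesafe transformation: the compiler checks field names and types. */
  def humanPageViews(visits: Seq[Visit]): Int =
    visits.filterNot(_.isRobot).map(_.pages).sum
}
```

Once the rows are case classes, a typo in a field name is a compile error rather than a runtime failure deep inside a long job.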
Dataframes are perfect for:
Exploration, drill down,
Light jobs,
Dynamic jobs.
RDD based jobs are like marine mammals.
5. Cogroup all the things
The cogroup is the best operation to link data together.
It fundamentally changes the way we work with data.
def join     [K, A, B](left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (A, B))]
def leftJoin [K, A, B](left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (A, Option[B]))]
def rightJoin[K, A, B](left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (Option[A], B))]
def outerJoin[K, A, B](left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (Option[A], Option[B]))]
def cogroup  [K, A, B](left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (Seq[A], Seq[B]))]
def groupBy  [K, A](rdd: RDD[(K, A)]): RDD[(K, Seq[A])]
(Simplified signatures: Spark's own methods are named leftOuterJoin, rightOuterJoin, fullOuterJoin and groupByKey, and return Iterable rather than Seq.)
With cogroup and groupBy, for a given key: K, there is exactly one row with that key in the output dataset.
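To make the "one row per key" property concrete, here is a sketch of cogroup on plain Scala collections, with two of the joins derived from it. It models the semantics only; `LocalCogroup` is a hypothetical name and none of this is Spark's API.

```scala
object LocalCogroup {
  /** One output row per key, holding every left and right value for that key. */
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty).map(_._2), r.getOrElse(k, Seq.empty).map(_._2)))
    }.toMap
  }

  /** Inner join derived from cogroup: pair every left value with every right value. */
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }

  /** Left join derived from cogroup: keep left values even when the right side is empty. */
  def leftJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      val rights: Seq[Option[B]] = if (bs.isEmpty) Seq(None) else bs.map(Some(_))
      for (a <- as; b <- rights) yield (k, (a, b))
    }
}
```

Every join is a projection of the cogrouped row, which is why a single cogroup can replace several joins on the same key.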
{ case (k, (s1, s2)) =>
    (k, (s1.map(fA).filter(pA), s2.map(fB).filter(pB))) }
CHECKPOINT
3k LoC: 30 minutes to run (non-blocking)
15 LoC: 11 hours to run (blocking)
What about tests? Cogrouping allows us to have “ScalaCheck-like” tests, by minimising examples.
Test workflow:
Write a predicate to isolate the bug.
Get the minimal cogrouped row.
Output the row to the test resources.
Reproduce the bug.
Write tests and fix the code.
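The first two steps of this workflow might look like the sketch below, on plain collections. "Minimal" is taken here to mean the row with the fewest values, which is an assumption; `MinimalExample` and `minimalFailing` are hypothetical names.

```scala
object MinimalExample {
  // A cogrouped row: one key with all its left and right values.
  type Row[K, A, B] = (K, (Seq[A], Seq[B]))

  /** Returns the smallest row (fewest values) for which `isBuggy` holds, if any. */
  def minimalFailing[K, A, B](rows: Seq[Row[K, A, B]])(
      isBuggy: Row[K, A, B] => Boolean): Option[Row[K, A, B]] =
    rows.filter(isBuggy)
      .sortBy { case (_, (as, bs)) => as.size + bs.size }
      .headOption
}
```

Because a cogrouped row carries everything known about one key, the row found this way is a self-contained reproduction that can be saved as a test fixture.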
6. Inline data quality
Data quality improves resilience to bad data.
But data quality concerns come second.
Example:
case class FixeVisiteur(
  devicetype: String,
  isrobot: Boolean,
  recherche_visitorid: String,
  sessions: List[FixeSession]
) {
  def recherches: List[FixeRecherche] = sessions.flatMap(_.recherches)
}
object FixeVisiteur {
  @autoBuildResult
  def build(
    devicetype: Result[String],
    isrobot: Result[Boolean],
    recherche_visitorid: Result[String],
    sessions: Result[List[FixeSession]]
  ): Result[FixeVisiteur] = MacroMarker.generated_applicative
}
case class Annotation(
  anchor: Anchor,
  message: String,
  badData: Option[String],
  expectedData: List[String],
  remainingData: List[String],
  level: String @@ AnnotationLevel,
  annotationId: Option[AnnotationId],
  stage: String
)
case class Anchor(path: String @@ AnchorPath,
                  typeName: String)
Messages:
EMPTY_STRING
MULTIPLE_VALUES
NOT_IN_ENUM
PARSE_ERROR

Levels:
WARNING
ERROR
CRITICAL
Data quality is available within the output rows.
case class HVisiteur(
  visitorId: String,
  atVisitorId: Option[String],
  isRobot: Boolean,
  typeClient: String @@ TypeClient,
  typeSupport: String @@ TypeSupport,
  typeSource: String @@ TypeSource,
  hVisiteurPlus: Option[HVisiteurPlus],
  sessions: List[HSession],
  annotations: Seq[HAnnotation]
)
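The idea of carrying annotations alongside the value can be sketched with a much reduced model. Everything below is hypothetical (`InlineQuality`, this `Result`, this three-field `Annotation`, `map2`) and is not the autoBuild API; the level assigned to each message is also an assumption.

```scala
object InlineQuality {
  // Reduced stand-in for the richer Annotation case class shown above.
  final case class Annotation(path: String, message: String, level: String)

  /** A value (when it could be built) plus every annotation raised while building it. */
  final case class Result[+T](value: Option[T], annotations: List[Annotation]) {
    def map[U](f: T => U): Result[U] = Result(value.map(f), annotations)
  }

  /** Combines two results, keeping the annotations of both sides. */
  def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] =
    Result(for (a <- ra.value; b <- rb.value) yield f(a, b), ra.annotations ++ rb.annotations)

  /** Accepts an empty string but flags it with a warning. */
  def nonEmpty(path: String, raw: String): Result[String] =
    if (raw.isEmpty) Result(Some(raw), List(Annotation(path, "EMPTY_STRING", "WARNING")))
    else Result(Some(raw), Nil)

  /** Fails (no value) with an error annotation when the input is not a boolean. */
  def parseBoolean(path: String, raw: String): Result[Boolean] =
    raw.toLowerCase match {
      case "true"  => Result(Some(true), Nil)
      case "false" => Result(Some(false), Nil)
      case _       => Result(None, List(Annotation(path, "PARSE_ERROR", "ERROR")))
    }

  final case class Visitor(deviceType: String, isRobot: Boolean)

  def buildVisitor(deviceType: String, isRobot: String): Result[Visitor] =
    map2(nonEmpty("devicetype", deviceType), parseBoolean("isrobot", isRobot)) {
      (d, r) => Visitor(d, r)
    }
}
```

Unlike a plain `Either`, this combination does not stop at the first problem: all annotations survive, so they can be stored with the row and counted later.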
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> lib_source, message -> NOT_IN_ENUM, type -> String @@ LibSource, level -> WARNING)),657366)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> isrobot, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),15)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> rechercheInfos, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),566973)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> reponseInfos.reponse_nbblocs, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),571313)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> requeteInfos.requete_typerequete, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),315297)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi_sec, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> typereponse, message -> EMPTY_STRING, type -> String @@ TypeReponse, level -> WARNING)),323614)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> grp_source, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),94)
https://github.com/ahoy-jon/autoBuild (presented in October 2015)
There are opportunities to make those approaches more “precepte-like”
(DAG of workflow, provenance of every field, structure tags).
7. Create real programs
Most pipelines are designed as “stateless” computations.
Either they require no state (good),
or they infer the current state from the state of the filesystem (bad).
Solution: allow pipelines to access a commit log, to read about past executions and to push data for future executions.
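A minimal, in-memory sketch of that idea follows. The names (`CommitLogSketch`, `Commit`, `CommitLog`, `pendingPartitions`) and the record fields are hypothetical; a real commit log would be durable and shared between jobs.

```scala
object CommitLogSketch {
  /** One record per finished execution of a job. */
  final case class Commit(job: String, partition: String, outputPath: String)

  /** Append-only, in-memory commit log; a real one would be durable. */
  final class CommitLog {
    private var entries: Vector[Commit] = Vector.empty
    def append(c: Commit): Unit = entries = entries :+ c
    def history(job: String): Vector[Commit] = entries.filter(_.job == job)
  }

  /** Decides what to run from the log, not from what happens to exist on the filesystem. */
  def pendingPartitions(log: CommitLog, job: String, wanted: Seq[String]): Seq[String] = {
    val done = log.history(job).map(_.partition).toSet
    wanted.filterNot(done)
  }
}
```

A job would append a `Commit` only after its output has been published, so a crashed run leaves no entry and its partition is simply picked up again on the next execution.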
In progress: project codename Kerguelen
Multi-level abstractions / commit-log backed / API for jobs.
It allows the creation of jobs with different levels of concern:
Level 1: name resolving
Level 2: smart intermediaries (schema capture, stats, delta, …)
Level 3: smart high-level scheduler (replay, load management, coherence)
Level 4: “code as data” (=> continuous delivery, auto QA, auto deployment to production)
Conclusion
Thank you
for listening!
Questions?
jonathan@univalence.io

More Related Content

What's hot

Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalYeounhee Lee
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management frameworkJulian Hyde
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using OptiqJulian Hyde
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsAlexey Zinoviev
 

What's hot (20)

Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_final
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
Treasure Data Cloud Strategy
Treasure Data Cloud StrategyTreasure Data Cloud Strategy
Treasure Data Cloud Strategy
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using Optiq
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
 

Viewers also liked

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriTIS Inc.
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldKonrad Malawski
 
ここまで来た!2017年 Web VRでできること
ここまで来た!2017年 Web VRでできることここまで来た!2017年 Web VRでできること
ここまで来た!2017年 Web VRでできることJun Ito
 
ARもVRもMRもまとめてドドンドーン!
ARもVRもMRもまとめてドドンドーン!ARもVRもMRもまとめてドドンドーン!
ARもVRもMRもまとめてドドンドーン!Satoshi Maemoto
 
Big Data Commercialization and associated IoT Platform Implications by Ramnik...
Big Data Commercialization and associated IoT Platform Implications by Ramnik...Big Data Commercialization and associated IoT Platform Implications by Ramnik...
Big Data Commercialization and associated IoT Platform Implications by Ramnik...Data Con LA
 
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...Lucidworks
 
Projectmanagement en systemisch werken
Projectmanagement en systemisch werkenProjectmanagement en systemisch werken
Projectmanagement en systemisch werkenOkke Jan Douma
 
Giip bp-giip connectivity1703
Giip bp-giip connectivity1703Giip bp-giip connectivity1703
Giip bp-giip connectivity1703Lowy Shin
 
Finding HMAS Sydney Chapter 9 - Search for Sydney
Finding HMAS Sydney Chapter 9 - Search for SydneyFinding HMAS Sydney Chapter 9 - Search for Sydney
Finding HMAS Sydney Chapter 9 - Search for SydneyElk Software Group
 
High Availability Architecture for Legacy Stuff - a 10.000 feet overview
High Availability Architecture for Legacy Stuff - a 10.000 feet overviewHigh Availability Architecture for Legacy Stuff - a 10.000 feet overview
High Availability Architecture for Legacy Stuff - a 10.000 feet overviewMarco Amado
 
소셜 코딩 GitHub & branch & branch strategy
소셜 코딩 GitHub & branch & branch strategy소셜 코딩 GitHub & branch & branch strategy
소셜 코딩 GitHub & branch & branch strategyKenu, GwangNam Heo
 
Experimental Photography Artist Research
Experimental Photography Artist ResearchExperimental Photography Artist Research
Experimental Photography Artist ResearchJaskirt Boora
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaData Con LA
 
Collaboration with Eclipse final
Collaboration with Eclipse finalCollaboration with Eclipse final
Collaboration with Eclipse finalKenu, GwangNam Heo
 
Conociendo los servicios adicionales en big data
Conociendo los servicios adicionales en big dataConociendo los servicios adicionales en big data
Conociendo los servicios adicionales en big dataSpanishPASSVC
 
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...VMworld
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...Vladimir Bacvanski, PhD
 

Viewers also liked (20)

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuri
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming World
 
ここまで来た!2017年 Web VRでできること
ここまで来た!2017年 Web VRでできることここまで来た!2017年 Web VRでできること
ここまで来た!2017年 Web VRでできること
 
ARもVRもMRもまとめてドドンドーン!
ARもVRもMRもまとめてドドンドーン!ARもVRもMRもまとめてドドンドーン!
ARもVRもMRもまとめてドドンドーン!
 
C++ Coroutines
C++ CoroutinesC++ Coroutines
C++ Coroutines
 
Big Data Commercialization and associated IoT Platform Implications by Ramnik...
Big Data Commercialization and associated IoT Platform Implications by Ramnik...Big Data Commercialization and associated IoT Platform Implications by Ramnik...
Big Data Commercialization and associated IoT Platform Implications by Ramnik...
 
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
 
Projectmanagement en systemisch werken
Projectmanagement en systemisch werkenProjectmanagement en systemisch werken
Projectmanagement en systemisch werken
 
Giip bp-giip connectivity1703
Giip bp-giip connectivity1703Giip bp-giip connectivity1703
Giip bp-giip connectivity1703
 
Finding HMAS Sydney Chapter 9 - Search for Sydney
Finding HMAS Sydney Chapter 9 - Search for SydneyFinding HMAS Sydney Chapter 9 - Search for Sydney
Finding HMAS Sydney Chapter 9 - Search for Sydney
 
High Availability Architecture for Legacy Stuff - a 10.000 feet overview
High Availability Architecture for Legacy Stuff - a 10.000 feet overviewHigh Availability Architecture for Legacy Stuff - a 10.000 feet overview
High Availability Architecture for Legacy Stuff - a 10.000 feet overview
 
소셜 코딩 GitHub & branch & branch strategy
소셜 코딩 GitHub & branch & branch strategy소셜 코딩 GitHub & branch & branch strategy
소셜 코딩 GitHub & branch & branch strategy
 
Migrating to aws
Migrating to awsMigrating to aws
Migrating to aws
 
Experimental Photography Artist Research
Experimental Photography Artist ResearchExperimental Photography Artist Research
Experimental Photography Artist Research
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
 
The Beauty of BAD code
The Beauty of  BAD codeThe Beauty of  BAD code
The Beauty of BAD code
 
Collaboration with Eclipse final
Collaboration with Eclipse finalCollaboration with Eclipse final
Collaboration with Eclipse final
 
Conociendo los servicios adicionales en big data
Conociendo los servicios adicionales en big dataConociendo los servicios adicionales en big data
Conociendo los servicios adicionales en big data
 
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...
VMworld 2015: Take Virtualization to the Next Level vSphere with Operations M...
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
 

Similar to 7 key recipes for data engineering

IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Cio summit 20170223_v20
Cio summit 20170223_v20Cio summit 20170223_v20
Cio summit 20170223_v20Joshua Bae
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET Journal
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins Edureka!
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Neo4j
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionAdnan Masood
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By SparkKnoldus Inc.
 
IIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data ScienceIIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data Scienceds4good
 

Similar to 7 key recipes for data engineering (20)

Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Big data
Big dataBig data
Big data
 
Cio summit 20170223_v20
Cio summit 20170223_v20Cio summit 20170223_v20
Cio summit 20170223_v20
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
7 key recipes for data engineering

  • 1. 7 Key Recipes for Data Engineering
  • 2. Introduction We will explore 7 key recipes about Data Engineering. The 5th is absolutely game changing!
  • 3. Thank You: Bachir AIT MBAREK, BI and Big Data Consultant
  • 4. About Me: Jonathan WINANDY, Lead Data Engineer (Data Lake building, Audit / Coaching, Spark Training). Founder of Univalence (BI / Big Data). Co-Founder of CYM (IoT / Predictive Maintenance), Craft Analytics† (BI / Big Data), and Valwin (Health Care Data).
  • 5. 2016 has been amazing for Data Engineering! But ...
  • 6. 1. It’s all about our organisations!
  • 7. 1. It’s all about our Organisations: Data engineering is not about scaling computation.
  • 8. 1. It’s all about our Organisations: Data engineering is not a support function for Data Scientists[1]. [1] whatever they are nowadays
  • 9. 1. It’s all about our Organisations: Instead, Data engineering enables access to Data!
  • 10. 1. It’s all about our Organisations: access to Data … in complex organisations. (Diagram: Product, Ops, BI, Marketing and You, exchanging data and new data.)
  • 11. 1. It’s all about our Organisations: access to Data … in complex organisations. (Diagram: a Holding with Entity 1 to Entity N, each with its own Marketing and IT, plus Marketing and You, exchanging data and new data.)
  • 12. 1. It’s all about our Organisations: access to Data … in complex organisations. It’s very frustrating! We run a support group meetup if you are interested: Paris Data Engineers!
  • 13. 1. It’s all about our Organisations: Small tips: only one Hadoop cluster (no TEST/REC/INT/PREPROD); no Air-Data-Eng, it helps no one; radical transparency with other teams; hack that sh**.
  • 15. 2. Optimising our work: There are 3 key concerns governing our decisions: lead time, impact, and failure management.
  • 16. 2. Optimising our work: Lead time (noun): The period of time between the initial phase of a process and the emergence of results, as between the planning and completed manufacture of a product. Short lead times are essential! The Elastic stack helps a lot in this area.
  • 17. 2. Optimising our work: Impact. To have impact, we have to analyse beyond immediate needs. That way, we’re able to provide solutions to entire kinds of problems.
  • 18. 2. Optimising our work: Failure management. Things fail, be prepared! On the same morning, both the RER A (public transport) and our Hadoop job tracker can fail. Unprepared failures may pile up and lead to huge waste.
  • 19. 2. Optimising our work: How to mitigate failure in 7 questions: “What is likely to fail?” ($componentName_____), “How? (root cause)”, “Can we know if this will fail?”, “Can we prevent this failure?”, “What are the impacts?”, “How to fix it when it happens?”, “Can we facilitate today?”
  • 20. 2. Optimising our work Track your work!
  • 22. 3. Staging the data: Data is moving around, freeze it! Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging in distributed solutions:
        ● In Streaming with Kafka, we may retain logs in Kafka for several months.
        ● In Batch, staging in HDFS may retain source Data for years.
  • 23. 3. Staging the data: Modern staging anti-patterns: dropping destination places before moving the Data; having incomplete data visible; short log retention in streams (=> new failure modes). Modern staging should be seen as a persistent data structure.
  • 24. 3. Staging the data: HDFS staging:
        /staging
        |-- $tablename
            |-- dtint=$dtint
                |-- dsparam.name=$dsparam.value
                    |-- ...
                        |-- ...
                            |-- uuid=$uuid
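The staging layout above can be sketched as a small path builder. This is only an illustration under assumptions: the object `StagingPath`, its `build` method, and the reading of `dtint` as an integer date are hypothetical and not taken from the slides. The idea it shows is that every load gets its own `uuid=` directory, so nothing has to be dropped before new data is moved in.

```scala
import java.util.UUID

object StagingPath {
  // Hypothetical helper: builds
  //   /staging/<table>/dtint=<dtint>/<param>=<value>/.../uuid=<uuid>
  // following the layout shown on the slide.
  def build(table: String, dtint: Int, params: Seq[(String, String)], uuid: UUID): String = {
    // One "name=value" directory level per data source parameter, in the order given.
    val partitions = params.map { case (name, value) => s"$name=$value" }
    val segments = Seq("/staging", table, s"dtint=$dtint") ++ partitions ++ Seq(s"uuid=$uuid")
    segments.mkString("/")
  }
}
```

Calling `StagingPath.build("visits", 20161201, Seq("source" -> "web"), UUID.randomUUID())` would yield a fresh directory on each run, which keeps earlier loads intact and readable.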
  • 25. 4. Using RDDs or Dataframes
  • 26. 4. Using RDDs or Dataframes Dataframes have great performance, but are untyped and foreign. RDDs have a robust Scala API, but are a pain to map from data sources. btw, SQL is AWESOME
  • 27. 4. Using RDDs or Dataframes:
        Dataframes: predicate push down, bare metal / unboxed, connectors, pluggable optimizer, SQL + Meta.
        RDDs: types !!, nested structures, better unit tests, fewer stages, Scala * Scala.
  • 28. 4. Using RDDs or Dataframes: We should use RDDs in large ETL jobs: loading the data with dataframe APIs, basic case class mapping (or better, Datasets), typesafe transformations, storing with dataframe APIs.
  • 29. 4. Using RDDs or Dataframes: Dataframes are perfect for: exploration and drill down, light jobs, dynamic jobs.
  • 30. 4. Using RDDs or Dataframes RDD based jobs are like marine mammals.
  • 31. 5. Cogroup all the things
  • 32. 5. Cogroup all the things: The cogroup is the best operation to link data together. It fundamentally changes the way we work with data.
  • 33. 5. Cogroup all the things:
        join      (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,( A , B ))]
        leftJoin  (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,( A ,Option[B]))]
        rightJoin (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(Option[A], B) )]
        outerJoin (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,(Option[A],Option[B]))]
        cogroup   (left:RDD[(K,A)],right:RDD[(K,B)]):RDD[(K,( Seq[A], Seq[B]))]
        groupBy   (rdd:RDD[(K,A)]):RDD[(K,Seq[A])]
        On cogroup and groupBy, for a given key:K, there is exactly one row with that key in the output dataset.
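To make the signatures above concrete, here is a minimal sketch that models cogroup on plain in-memory `Seq`s and derives the joins from it. It is a local model only, not the Spark API: the object `LocalCogroup` and its methods are hypothetical names, and a `Map` stands in for the "one row per key" output.

```scala
object LocalCogroup {
  // Local model of cogroup: exactly one entry per key, holding every value from both sides.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val as = left.filter(_._1 == k).map(_._2)
      val bs = right.filter(_._1 == k).map(_._2)
      (k, (as, bs))
    }.toMap
  }

  // Inner join: the cross product of both sides, key by key.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }

  // Left join: an empty right side becomes a single None.
  def leftJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      val rs: Seq[Option[B]] = if (bs.isEmpty) Seq(None) else bs.map(Some(_))
      for (a <- as; b <- rs) yield (k, (a, b))
    }

  // Full outer join: either side may be missing, but never both for a given key.
  def outerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (Option[A], Option[B]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      val ls: Seq[Option[A]] = if (as.isEmpty) Seq(None) else as.map(Some(_))
      val rs: Seq[Option[B]] = if (bs.isEmpty) Seq(None) else bs.map(Some(_))
      for (a <- ls; b <- rs) yield (k, (a, b))
    }
}
```

Every join here is a post-processing of the cogrouped rows, which is one way to read the claim that cogroup is the most general of these operations.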
  • 34. 5. Cogroup all the things
  • 35. 5. Cogroup all the things:
        {case (k,(s1,s2)) => (k,(s1.map(fA).filter(pA), s2.map(fB).filter(pB)))}
        (Diagram label: CHECKPOINT)
  • 36. 5. Cogroup all the things: 3k LoC: 30 minutes to run (non-blocking). 15 LoC: 11 hours to run (blocking).
  • 37. 5. Cogroup all the things: What about tests? Cogrouping allows us to have “ScalaCheck-like” tests, by minimising examples. Test workflow: write a predicate to isolate the bug; get the minimal cogrouped row; output the row in test resources; reproduce the bug; write tests and fix code.
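The test workflow above could look roughly like the following sketch. Everything in it is hypothetical (the `Session` and `Click` types, the bug, and the fixture); it only illustrates the shape of the workflow: a predicate isolates the faulty rows, one minimal cogrouped row is kept as a fixture, and the fixed logic is tested against it.

```scala
object CogroupedRowTest {
  final case class Session(id: String)
  final case class Click(url: String)

  // One cogrouped row: a visitor id with all of its sessions and all of its clicks.
  type Row = (String, (Seq[Session], Seq[Click]))

  // Step 1: a predicate that isolates the suspected bug (clicks without any session).
  def isSuspicious(row: Row): Boolean =
    row._2._1.isEmpty && row._2._2.nonEmpty

  // Step 2: the minimal row reproducing the bug, inlined here instead of a test resource file.
  val minimalRow: Row = ("visitor-42", (Seq.empty, Seq(Click("/home"))))

  // Step 3: the logic under test after the fix. Without the isEmpty guard,
  // the integer division below would throw ArithmeticException on minimalRow.
  def clicksPerSession(row: Row): (String, Int) = {
    val (visitor, (sessions, clicks)) = row
    val ratio = if (sessions.isEmpty) 0 else clicks.size / sessions.size
    (visitor, ratio)
  }
}
```

Because a cogrouped row carries everything known about one key, a single row is often enough to reproduce a bug without rebuilding the input datasets.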
  • 38. 6. Inline data quality
  • 39. 6. Inline data quality Data quality improves resilience to bad data. But data quality concerns come second.
  • 40. 6. Inline data quality: Example:
        case class FixeVisiteur(
          devicetype: String,
          isrobot: Boolean,
          recherche_visitorid: String,
          sessions: List[FixeSession]
        ) {
          def recherches: List[FixeRecherche] = sessions.flatMap(_.recherches)
        }

        object FixeVisiteur {
          @autoBuildResult
          def build(
            devicetype: Result[String],
            isrobot: Result[Boolean],
            recherche_visitorid: Result[String],
            sessions: Result[List[FixeSession]]
          ): Result[FixeVisiteur] = MacroMarker.generated_applicative
        }
  • 41. 6. Inline data quality:
        case class Annotation(
          anchor: Anchor,
          message: String,
          badData: Option[String],
          expectedData: List[String],
          remainingData: List[String],
          level: String @@ AnnotationLevel,
          annotationId: Option[AnnotationId],
          stage: String
        )

        case class Anchor(path: String @@ AnchorPath, typeName: String)
  • 42. 6. Inline data quality: Messages: EMPTY_STRING, MULTIPLE_VALUES, NOT_IN_ENUM, PARSE_ERROR. Levels: WARNING, ERROR, CRITICAL.
  • 43. 6. Inline data quality: Data quality is available within the output rows.
        case class HVisiteur(
          visitorId: String,
          atVisitorId: Option[String],
          isRobot: Boolean,
          typeClient: String @@ TypeClient,
          typeSupport: String @@ TypeSupport,
          typeSource: String @@ TypeSource,
          hVisiteurPlus: Option[HVisiteurPlus],
          sessions: List[HSession],
          annotations: Seq[HAnnotation]
        )
  • 44. 6. Inline data quality:
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> lib_source, message -> NOT_IN_ENUM, type -> String @@ LibSource, level -> WARNING)),657366)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> isrobot, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),15)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> rechercheInfos, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),566973)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> reponseInfos.reponse_nbblocs, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),571313)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> requeteInfos.requete_typerequete, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),315297)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi_sec, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> typereponse, message -> EMPTY_STRING, type -> String @@ TypeReponse, level -> WARNING)),323614)
        (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> grp_source, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),94)
  • 45. 6. Inline data quality: https://github.com/ahoy-jon/autoBuild (presented in October 2015). There are opportunities to make those approaches more “precepte-like” (DAG of workflow, provenance of every field, structure tags).
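The idea of carrying data quality inside the rows can be illustrated with a hand-written sketch. This is not the autoBuild macro or its API: `Result`, `map2`, `nonEmpty`, `parseBoolean` and `Visitor` below are simplified, hypothetical stand-ins, and the pairing of each message with a level is an assumption. The point shown is that building a record accumulates annotations from every field instead of stopping at the first bad one.

```scala
object InlineQuality {
  // Simplified annotation: where the problem is, what it is, how severe it is.
  final case class Annotation(path: String, message: String, level: String)

  // Hypothetical Result: an optional value plus every annotation collected so far.
  final case class Result[A](value: Option[A], annotations: List[Annotation]) {
    // Applicative-style combination: annotations from both sides are kept,
    // and the value exists only when both inputs are present.
    def map2[B, C](other: Result[B])(f: (A, B) => C): Result[C] =
      Result(
        for (a <- value; b <- other.value) yield f(a, b),
        annotations ++ other.annotations
      )
  }

  def nonEmpty(path: String, raw: String): Result[String] =
    if (raw.trim.isEmpty) Result(None, List(Annotation(path, "EMPTY_STRING", "WARNING")))
    else Result(Some(raw), Nil)

  def parseBoolean(path: String, raw: String): Result[Boolean] =
    raw.trim.toLowerCase match {
      case "true"  => Result(Some(true), Nil)
      case "false" => Result(Some(false), Nil)
      case _       => Result(None, List(Annotation(path, "PARSE_ERROR", "ERROR")))
    }

  final case class Visitor(deviceType: String, isRobot: Boolean)

  // Hand-written equivalent of what a generated applicative builder would do for two fields.
  def buildVisitor(deviceType: String, isRobot: String): Result[Visitor] =
    nonEmpty("devicetype", deviceType)
      .map2(parseBoolean("isrobot", isRobot))((d, r) => Visitor(d, r))
}
```

Counting the resulting annotations per path, message and level would give aggregates of the same shape as the KeyPerformanceIndicator lines shown on slide 44.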
  • 46. 7. Create real programs
  • 47. 7. Create real programs: Most pipelines are designed as “stateless” computations. They either require no state (good) or infer the current state from the filesystem’s state (bad).
  • 48. 7. Create real programs: Solution: allow pipelines to access a commit log, to read about past executions and to push data for future executions.
  • 49. 7. Create real programs: In progress: project codename Kerguelen. Multi-level abstractions / commit log backed / API for jobs. Allows the creation of jobs that have different levels of concern:
        Level 1: name resolving
        Level 2: smart intermediaries (schema capture, stats, delta, …)
        Level 3: smart high-level scheduler (replay, load management, coherence)
        Level 4: “code as data” (=> continuous delivery, auto QA, automatic deployment to production)
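A commit log for pipelines can be sketched in a few lines. This is an in-memory illustration under assumptions, not the Kerguelen project: `Commit`, `CommitLog` and `pending` are hypothetical names, and a real log would be durable and shared between jobs. It shows a job deciding what to run from recorded past executions rather than from the state of the filesystem.

```scala
object CommitLogSketch {
  // One record per execution attempt of a job on a partition.
  final case class Commit(job: String, partition: String, status: String)

  // Append-only, in-memory stand-in for a durable commit log (hypothetical).
  final class CommitLog {
    private var entries: Vector[Commit] = Vector.empty

    def append(commit: Commit): Unit = {
      entries = entries :+ commit
    }

    def history(job: String): Vector[Commit] =
      entries.filter(_.job == job)
  }

  // The job reads its own history and keeps only the partitions without a successful commit.
  def pending(log: CommitLog, job: String, wanted: Seq[String]): Seq[String] = {
    val done = log.history(job).filter(_.status == "SUCCESS").map(_.partition).toSet
    wanted.filterNot(done.contains)
  }
}
```

A failed attempt stays in the log, so a later run can both retry the partition and report on what went wrong, which is the kind of replay that a Level 3 scheduler would build on.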

Editor's Notes

  1. ~15 lines of code