www.kensu.io
DATA SCIENCE GOVERNANCE
What and How
ANDY PETRELLA
- CEO & Founder -
Mathematics & Computer Science MSc.
Creator of Spark Notebook
XAVIER TORDOIR
- CSO & Founder -
Physics PhD.
Genomics & Quantitative Finance
KENSU & ME
Started in 2015 as Data Fellas, focused on Data Science consulting
Team of 10 engineers and scientists
Shifted toward a product company in 2016 and renamed to Kensu
Focus on Data Science Governance
Accelerated by Alchemist Accelerator in San Francisco and The Faktory in Belgium
TOPICS
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability principle and transparency
SOME THOUGHTS ON “DATA SCIENCE”
MACHINE LEARNING
Pioneers in 1950s
AI Winter in 1970s due to pessimism
Resurgence in 1980s
Machine Learning (and related) has been used since the 1990s (esp. SVM and RNN)
Deep learning sees widespread commercial use in 2000s
Machine learning receives great publicity (read: buzz) in 2010s
ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” coined by DJ Patil in 2008.
Pretty much when Machine Learning became part of software
In a way, when we added “engineering” to the mix
Also, engineering is even more prominent with Big Data distributed computing
DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)
DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, Databases, Computing frameworks, Software, Platforms, …
Creativity: “From business intelligence to intelligent business” - Michael Fergusson
Data Science is an umbrella on top of all activities on data
DATA SCIENCE GOVERNANCE: WHAT
DATA PIPELINE
A data pipeline connects activities on data, potentially involving several technologies.
A pipeline is generally thought of as an end-to-end processing line that solves one problem.
But parts of pipelines are reused to save computation, storage, time, …
Thus the interdependency between pipeline segments grows with the number of initiatives.
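A minimal sketch of how reuse creates interdependency, assuming each pipeline is just an ordered list of segment names. The pipeline and segment names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical pipelines, each described as an ordered list of segment names.
pipelines = {
    "churn_report": ["load_crm", "clean_customers", "churn_model"],
    "upsell_model": ["load_crm", "clean_customers", "upsell_features"],
    "finance_kpi": ["load_ledger", "aggregate_revenue"],
}


def segment_usage(pipelines):
    """Map each segment to the sorted list of pipelines that use it."""
    usage = defaultdict(set)
    for name, segments in pipelines.items():
        for segment in segments:
            usage[segment].add(name)
    return {segment: sorted(names) for segment, names in usage.items()}


def shared_segments(pipelines):
    """Keep only segments used by more than one pipeline (the interdependencies)."""
    return {s: p for s, p in segment_usage(pipelines).items() if len(p) > 1}


shared = shared_segments(pipelines)
for segment, users in shared.items():
    print(f"{segment} is reused by {', '.join(users)}")
```

Every entry in `shared` is a place where a change made for one initiative can affect another one.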
GOAL: TAKE DECISION
Data pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to take decisions.
Decisions are generally taken by, or linked to, humans with responsibilities
(even for self-driving cars, in case of a problem).
Given that pipelines are cut-and-wired, interleaved, …
how can we not be anxious when deploying the last piece used by the decision maker?
SOURCES OF ANXIETY
What if:
• one of the datasets used in the process suddenly shows different patterns?
• one of the tools, projects or similar is modified upstream?
• the insights deviate from reality?
• …
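The first question above can be sketched as a very simple check, assuming a numeric column and a reference sample kept from a previous run. The function names, the sample values and the threshold of 3 are assumptions made for the illustration; a real setup would likely use a proper statistical test.

```python
from statistics import mean, stdev


def mean_shift_score(reference, current):
    """Distance between the two means, in units of the reference standard deviation."""
    spread = stdev(reference)
    if spread == 0:
        return 0.0 if mean(current) == mean(reference) else float("inf")
    return abs(mean(current) - mean(reference)) / spread


def has_drifted(reference, current, threshold=3.0):
    """Flag the current batch when its mean moved further than the threshold."""
    return mean_shift_score(reference, current) > threshold


# Hypothetical values of one column, from a past run and from two new batches.
reference = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_batch = [10.2, 9.8, 10.1]
shifted_batch = [25.0, 26.0, 24.0]

print(has_drifted(reference, stable_batch))
print(has_drifted(reference, shifted_batch))
```

Such a check only looks at one statistic of one column, so it is a starting point rather than a guarantee.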
DEBUGGING?
To reduce the anxiety or, actually, the risks, we need ways to debug.
In pure engineering, we have unit, functional and integration tests, … but
what do we do when the problems come from the data themselves?
We can’t generate all cases of data variations, right?
How to debug?
Without the big picture, we may try to optimise a model for weeks for nothing
DATA SCIENCE GOVERNANCE
Data Governance: controlling that data meets precise standards, which involves monitoring against production data.
Data Science Governance: controlling that data activity meets precise standards, which involves monitoring against production data activity.
A Data Activity is described by at least: technologies, users, systems, data, processing
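One possible way to represent a Data Activity as a record, assuming one field per element listed above. The class, field names and example values are hypothetical and are not taken from any product.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataActivity:
    """One activity on data: who ran what, with which technology, where, on which data."""
    technology: str
    user: str
    system: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    processing: str

    def describe(self) -> str:
        """Answer 'who does what, on which data, and where' in one line."""
        return (
            f"{self.user} ran '{self.processing}' with {self.technology} "
            f"on {self.system}: {', '.join(self.inputs)} -> {', '.join(self.outputs)}"
        )


# Hypothetical activity, for illustration only.
activity = DataActivity(
    technology="spark",
    user="alice",
    system="analytics-cluster",
    inputs=("crm.customers",),
    outputs=("marts.churn_scores",),
    processing="train churn model",
)
print(activity.describe())
```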
GOVERNING DATA SCIENCE
Who does what, on which data, and where is it done?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution,…) of the processes?
CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a view on all the activities applied to
the original sources used in their own processes.
They also have control over their own results in production.
They have the opportunity to analyse and debug a pipeline
involving all activities:
• independently of the technologies
• involving several people in the enterprise
DATA SCIENCE GOVERNANCE: HOW
CHALLENGES
So many tools are using data!
The number of processes is growing impressively.
We have to take care of the legacy…
GET THE DATA
As usual, we have to collect the right data to take the right decision.
First, run an assessment to create a high-level map of all the tools used in a company.
For each tool, do whatever it takes to collect information about the activities it creates.
This information includes metadata, lineage, statistics, accuracy measures, …
CONNECT THE DATA
Data Science Governance needs the global picture.
To get it, we need to connect all the data that can be collected,
so that it is possible to create a cartography of all ongoing processes.
This map tracks all data and their descendants
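A small sketch of such a map, assuming the collected lineage can be reduced to (input, output) pairs. The data source names are hypothetical. Connecting the pairs gives a graph, and a breadth-first walk lists every descendant of a data source.

```python
from collections import defaultdict, deque

# Hypothetical lineage records collected from different tools: (input, output).
lineage = [
    ("crm.customers", "staging.customers_clean"),
    ("staging.customers_clean", "marts.churn_scores"),
    ("staging.customers_clean", "marts.upsell_features"),
    ("erp.ledger", "marts.revenue"),
]


def build_map(records):
    """Connect the records into a graph: data source -> data directly derived from it."""
    children = defaultdict(set)
    for source, target in records:
        children[source].add(target)
    return children


def descendants(children, start):
    """Breadth-first walk returning everything derived, directly or not, from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


data_map = build_map(lineage)
print(sorted(descendants(data_map, "crm.customers")))
```

The same walk on the reversed pairs would list what a result depends on.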
USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• optimisation
• recommendation
GDPR
General Data Protection Regulation
ACCOUNTABILITY PRINCIPLE
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.
TRANSPARENCY
As well as your obligation to provide comprehensive, clear and
transparent privacy policies, if your organisation has more than 250
employees, you must maintain additional internal records of your
processing activities.
ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
With this information we can reliably and automatically create the
process registry
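A rough sketch of how collected activities could be grouped into a registry, assuming each activity already carries the name of the process it belongs to. The record layout and names are hypothetical; this is neither how Adalog works nor a legally sufficient record format.

```python
from collections import defaultdict

# Hypothetical collected activities, each tagged with the process it belongs to.
activities = [
    {"process": "churn scoring", "user": "alice", "technology": "spark",
     "inputs": ["crm.customers"], "outputs": ["marts.churn_scores"]},
    {"process": "churn scoring", "user": "bob", "technology": "sql",
     "inputs": ["marts.churn_scores"], "outputs": ["reports.churn_weekly"]},
    {"process": "revenue kpi", "user": "carol", "technology": "sql",
     "inputs": ["erp.ledger"], "outputs": ["marts.revenue"]},
]


def build_registry(activities):
    """Group activities per process and list the users, technologies and data involved."""
    grouped = defaultdict(lambda: {"users": set(), "technologies": set(), "data": set()})
    for activity in activities:
        entry = grouped[activity["process"]]
        entry["users"].add(activity["user"])
        entry["technologies"].add(activity["technology"])
        entry["data"].update(activity["inputs"], activity["outputs"])
    # Sort every set so the registry is stable and easy to read or export.
    return {
        process: {key: sorted(values) for key, values in entry.items()}
        for process, entry in grouped.items()
    }


registry = build_registry(activities)
for process, entry in registry.items():
    print(process, entry)
```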
TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science, seen as a continuous integration solution,
we have to explain and measure activities independently of the
technologies.
With this information we can reliably create transparent reports of
activities across the whole chain of processing
GUESS WHAT?
This is what Adalog, our product at Kensu, does!
ADALOG
[Architecture diagram. Labels: Adalog Collectors, Adalog Service, HTTPS port only, Recommendation System, Data Process Registry, Impact Analyzer, Dashboard, Data Citizen, Data Protection Officer]
WANT TO SEE MORE?
Request a demo on our website: http://kensu.io
DATA SCIENCE GOVERNANCE
Andy Petrella
CEO Co Founder
0032 495 99 11 04
@noootsab
Xavier Tordoir
CSO Co Founder
0032 495 99 11 04
+1 (628) 236-9239
@xtordoir
@kensuio

More Related Content

What's hot

Data Management vs. Data Governance Program
Data Management vs. Data Governance ProgramData Management vs. Data Governance Program
Data Management vs. Data Governance ProgramDATAVERSITY
 
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is InvaluableDataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is InvaluableDATAVERSITY
 
DataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business OutcomesDataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business OutcomesDATAVERSITY
 
Key Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics ProgramKey Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics ProgramData Con LA
 
Approaching Data Quality
Approaching Data QualityApproaching Data Quality
Approaching Data QualityDATAVERSITY
 
Predictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal BallPredictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal BallDATAVERSITY
 
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...DATAVERSITY
 
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...DATAVERSITY
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDATAVERSITY
 
Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?Brendan Aldrich
 
Data Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons LearnedData Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons LearnedDATAVERSITY
 
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced AnalyticsADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced AnalyticsDATAVERSITY
 
Building a data fluent organization
Building a data fluent organizationBuilding a data fluent organization
Building a data fluent organizationZach Gemignani
 
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...Denodo
 
Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesDATAVERSITY
 
Data-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance StrategiesData-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance StrategiesDATAVERSITY
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015Fiona Lew
 
How to Consume Your Data for AI
How to Consume Your Data for AIHow to Consume Your Data for AI
How to Consume Your Data for AIDATAVERSITY
 
DataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data StewardshipDataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data StewardshipDATAVERSITY
 
DataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = InteroperabilityDataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = InteroperabilityDATAVERSITY
 

What's hot (20)

Data Management vs. Data Governance Program
Data Management vs. Data Governance ProgramData Management vs. Data Governance Program
Data Management vs. Data Governance Program
 
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is InvaluableDataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
 
DataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business OutcomesDataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business Outcomes
 
Key Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics ProgramKey Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics Program
 
Approaching Data Quality
Approaching Data QualityApproaching Data Quality
Approaching Data Quality
 
Predictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal BallPredictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal Ball
 
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
 
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?
 
Data Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons LearnedData Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons Learned
 
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced AnalyticsADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
 
Building a data fluent organization
Building a data fluent organizationBuilding a data fluent organization
Building a data fluent organization
 
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
 
Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success Stories
 
Data-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance StrategiesData-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance Strategies
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
How to Consume Your Data for AI
How to Consume Your Data for AIHow to Consume Your Data for AI
How to Consume Your Data for AI
 
DataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data StewardshipDataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data Stewardship
 
DataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = InteroperabilityDataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = Interoperability
 

Similar to Data science governance : what and how

Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerMicrosoft
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data MiningIOSR Journals
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Chris Dagdigian
 
Introduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptxIntroduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptxSambarajuRavalika
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategyHimanshu Bari
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxOTA13NayabNakhwa
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Joanne Luciano
 
Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater AlleneMcclendon878
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018Susanna-Assunta Sansone
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsDhruv Saxena
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsEmbarcadero Technologies
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?Inside Analysis
 
Story of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial InstitutionsStory of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial Institutionsijtsrd
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 

Similar to Data science governance : what and how (20)

Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringer
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
Introduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptxIntroduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptx
 
Information what is it
Information what is itInformation what is it
Information what is it
 
Big data
Big dataBig data
Big data
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020
 
Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Big data
Big dataBig data
Big data
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?
 
Story of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial InstitutionsStory of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial Institutions
 
Agile data science
Agile data scienceAgile data science
Agile data science
 

More from Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scalaAndy Petrella
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformAndy Petrella
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserAndy Petrella
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open ScienceAndy Petrella
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 

More from Andy Petrella (20)

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Leveraging mesos as the ultimate distributed data science platform



Data science governance: what and how

  • 2. KENSU & ME Andy Petrella, CEO & Founder: Mathematics & Computer Science MSc, creator of Spark Notebook. Xavier Tordoir, CSO & Founder: Physics PhD, genomics & quantitative finance. Started in 2015 as Data Fellas, with a focus on Data Science consulting. Team of 10 engineers and scientists. Shifted toward a product company in 2016 and renamed to Kensu, with a focus on Data Science Governance. Accelerated by Alchemist Accelerator in San Francisco and The Faktory in Belgium.
  • 3. TOPICS 1. Some thoughts on “Data Science” 2. Data Science Governance: What 3. Data Science Governance: How 4. GDPR: Accountability principle and transparency
  • 4. SOME THOUGHTS ON “DATA SCIENCE”
  • 5. MACHINE LEARNING Pioneers in the 1950s. AI winter in the 1970s due to pessimism. Resurgence in the 1980s. Machine learning (and related techniques) has been in use since the 1990s (esp. SVM and RNN). Deep learning sees widespread commercial use in the 2000s. Machine learning receives great publicity (read: buzz) in the 2010s. Ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
  • 6. DATA SCIENCE: +ENGINEERING Claim: “Data Scientist” was coined by DJ Patil in 2008. That is pretty much when machine learning became part of software. In a way, it is when we added “engineering” to the mix. Engineering is even more prominent with Big Data and distributed computing.
  • 7. DATA SCIENCE: +EXPERIMENTATION So much data available. So many tools, libraries, frameworks, … So many things we can try. We have distributed computing now, right? => Let’s try everything. Discover new insights (and potentially new businesses).
  • 8. DATA SCIENCE: RECAP Maths: stats, machine learning and so on. Engineering: ETL, databases, computing frameworks, software, platforms, … Creativity: “From business intelligence to intelligent business” (Michael Fergusson). Data Science is an umbrella on top of all activities on data.
  • 10. DATA PIPELINE A data pipeline connects activities on data, potentially involving several technologies. A pipeline is generally thought of as an end-to-end processing line that solves one problem. But parts of pipelines are reused to save computation, storage, time, … Thus the interdependency between pipeline segments grows with the number of initiatives.
  • 11. GOAL: TAKE DECISIONS Data pipelines, connected together, aren’t created for the beauty of it. The ultimate goal is always to take decisions. Decisions are generally taken by, or linked to, humans with responsibilities (even for self-driving cars, when a problem occurs). Given that pipelines are cut-and-wired, interleaved, … how can we not be anxious about deploying the last piece used by the decision maker?
  • 12. SOURCES OF ANXIETY What if: • one of the data sets used in the process suddenly shows different patterns? • one of the tools, projects or similar is modified upstream? • the insights deviate from reality? • …
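The first "what if" above (a data set that suddenly shows different patterns) can be watched with a simple statistical check. The following is only a minimal sketch: the `has_drifted` helper, the sample values and the threshold of 3 standard errors are assumptions made for illustration and do not come from the slides.

```python
from math import sqrt
from statistics import mean, stdev


def has_drifted(reference, batch, threshold=3.0):
    """Return True when the batch mean lies more than `threshold`
    standard errors away from the reference mean (hypothetical rule)."""
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        # Constant reference data: any change of the mean counts as drift.
        return mean(batch) != ref_mean
    standard_error = ref_std / sqrt(len(batch))
    z_score = abs(mean(batch) - ref_mean) / standard_error
    return z_score > threshold


# Illustrative values only.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable_batch = [10.1, 9.9, 10.3, 9.7]
shifted_batch = [14.0, 15.0, 13.5, 14.5]

print(has_drifted(reference, stable_batch))
print(has_drifted(reference, shifted_batch))
```

A check like this only covers a shift of the mean; other pattern changes (variance, missing values, new categories) would need their own measures.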
  • 13. DEBUGGING? To reduce the anxiety or, actually, to reduce the risks, we need ways to debug. In pure engineering, we have unit, functional and integration tests, … but what do we do when the problems come from the data themselves? We can’t generate all cases of data variations, right? How to debug? Without the big picture, we may try to optimise a model for weeks for nothing.
  • 14. DATA SCIENCE GOVERNANCE Data Governance: control that data meets precise standards, which involves monitoring against production data. Data Science Governance: control that data activity meets precise standards, which involves monitoring against production data activity. A Data Activity is described by at least technologies, users, systems, data and processing.
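The five elements listed for a Data Activity can be captured in a small record type. This is a minimal sketch under stated assumptions: the `DataActivity` class, its field names and the `missing_fields` check are hypothetical and only illustrate the idea of checking an activity against a standard; they are not taken from Kensu's product.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass(frozen=True)
class DataActivity:
    """One activity on data, described by the five elements of the slide."""
    technology: str
    user: str
    system: str
    data: Tuple[str, ...]
    processing: str


def missing_fields(activity):
    """Return the names of the descriptive fields that were left empty."""
    return [f.name for f in fields(activity) if not getattr(activity, f.name)]


# Illustrative values only.
complete = DataActivity(
    "Spark", "alice", "analytics-cluster",
    ("orders.parquet",), "daily revenue aggregation",
)
incomplete = DataActivity(
    "Spark", "", "analytics-cluster",
    (), "daily revenue aggregation",
)

print(missing_fields(complete))
print(missing_fields(incomplete))
```

Here the "standard" is simply that every field is filled in; a real policy would likely be richer.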
  • 15. GOVERNING DATA SCIENCE Who does what on which data and where is it done? What is the impact of a process on the global system? What are the performance metrics (quality, execution, …) of the processes?
  • 16. CONTINUOUS INTEGRATION FOR DATA SCIENCE Data Scientists/Citizens have a view on all the activities applied to the original sources used in their own processes. They also have control over their own results in production. They have the opportunity to analyse and debug a pipeline involving all activities: • independently of the technologies • involving several people in the enterprise
  • 18. CHALLENGES So many tools are using data! The number of processes is growing impressively. We have to take care of the legacy…
  • 19. GET THE DATA As usual, we have to collect the right data to take the right decision. First run an assessment to create a high-level map of all the tools used in a company. For each tool, do whatever it takes to collect information about the activities it creates. This information includes metadata, lineage, statistics, accuracy measures, …
  • 20. CONNECT THE DATA Data Science Governance needs the global picture. To get it, we need to connect all the data that can be collected, so that it is possible to create a cartography of all ongoing processes. This map tracks all data and their descendants.
  • 21. USE THE DATA This is where the fun part starts… the map of data activities is an amazing source of information. Here are a few things you can think of when using this kind of data: • impact analysis • dependency analysis • optimisation • recommendation
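Impact analysis and dependency analysis both reduce to walking the map of data activities, downstream or upstream. The sketch below assumes the collected activities have already been reduced to hypothetical `(inputs, output)` pairs; the data set names and the `build_lineage` and `reachable` helpers are invented for illustration.

```python
from collections import defaultdict, deque


def build_lineage(activities):
    """Build downstream (children) and upstream (parents) adjacency maps
    from (inputs, output) pairs."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for inputs, output in activities:
        for source in inputs:
            children[source].add(output)
            parents[output].add(source)
    return children, parents


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Illustrative activities only.
activities = [
    (["crm.csv"], "customers"),
    (["customers", "orders"], "features"),
    (["features"], "churn_model"),
    (["orders"], "revenue_report"),
]
children, parents = build_lineage(activities)

# Impact analysis: what is affected if "orders" changes?
print(sorted(reachable(children, "orders")))
# Dependency analysis: what does "churn_model" rely on?
print(sorted(reachable(parents, "churn_model")))
```

Keeping both adjacency maps means the same traversal answers both questions, at the cost of storing each edge twice.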
  • 23. ACCOUNTABILITY PRINCIPLE Implement appropriate technical and organisational measures that ensure and demonstrate that you comply. This may include internal data protection policies such as staff training, internal audits of processing activities, and reviews of internal HR policies.
  • 24. TRANSPARENCY As well as your obligation to provide comprehensive, clear and transparent privacy policies, if your organisation has more than 250 employees, you must maintain additional internal records of your processing activities.
  • 25. ACCOUNTABILITY: DATA SCIENCE GOVERNANCE To govern data science, we have to: • collect activities • connect activities With this information we can reliably and automatically create the process registry.
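Once activities are collected and connected, a process registry can be derived by grouping them per process. This is a minimal sketch, assuming each collected activity is a plain dictionary with hypothetical `process`, `user`, `technology`, `inputs` and `outputs` keys; it is not a description of how Adalog builds its registry.

```python
from collections import defaultdict


def build_registry(activities):
    """Group activities by process and list who touches which data with what."""
    registry = defaultdict(
        lambda: {"users": set(), "technologies": set(), "data": set()}
    )
    for activity in activities:
        entry = registry[activity["process"]]
        entry["users"].add(activity["user"])
        entry["technologies"].add(activity["technology"])
        entry["data"].update(activity["inputs"])
        entry["data"].update(activity["outputs"])
    # Sort everything so the registry is stable and easy to report on.
    return {
        name: {key: sorted(values) for key, values in entry.items()}
        for name, entry in sorted(registry.items())
    }


# Illustrative activities only.
activities = [
    {"process": "churn scoring", "user": "alice", "technology": "Spark",
     "inputs": ["customers"], "outputs": ["features"]},
    {"process": "churn scoring", "user": "bob", "technology": "scikit-learn",
     "inputs": ["features"], "outputs": ["churn_scores"]},
    {"process": "billing report", "user": "carol", "technology": "SQL",
     "inputs": ["orders"], "outputs": ["revenue_report"]},
]
registry = build_registry(activities)

for name, entry in registry.items():
    print(name, entry)
```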
  • 26. TRANSPARENCY: DATA SCIENCE GOVERNANCE To govern data science seen as a continuous integration solution, we have to explain and measure activities independently of the technologies. With this information we can reliably create transparent reports of activities across the whole chain of processing.
  • 27. GUESS WHAT? This is what Adalog, our product at Kensu, does!
  • 28. ADALOG [Architecture diagram. Labels: Adalog Collectors, Adalog Service, Data Citizen, HTTPS port only, Recommendation System, Data Process Registry, Impact Analyzer, Data Protection Officer, Dashboard]
  • 29. WANT TO SEE MORE? Request a demo on our website: http://kensu.io
  • 30. DATA SCIENCE GOVERNANCE Andy Petrella, CEO & Co-Founder, 0032 495 99 11 04, @noootsab. Xavier Tordoir, CSO & Co-Founder, 0032 495 99 11 04, +1 (628) 236-9239, @xtordoir. @kensuio