1) The document provides practical advice for building a data-driven company, including making data easily accessible, ensuring business awareness of opportunities to use algorithms, and having data science projects with the lowest time to market.
2) It emphasizes focusing on usability over architecture when building a data lake, and having feature teams to deliver code ready for production through a message broker that allows for reuse of data flows.
3) The document advises businesses to be aware of algorithm opportunities, mixing business and data science teams for collaboration, and spending time together through activities like pair programming.
1. 50 AVENUE DES CHAMPS-ÉLYSÉES 75008 PARIS > FRANCE > WWW.OCTO.COM
HADOOP SUMMIT 2016 - DUBLIN
PRACTICAL ADVICE TO BUILD A DATA-DRIVEN COMPANY
Simon MABY
@simonmaby
3. 3
A continuous improvement of all business processes, through a smart use of the data, all the time, everywhere and for all purposes
OCTO TECHNOLOGY > THERE IS A BETTER WAY
4. 4
BEING DATA DRIVEN IS BEING LEAN
[Diagram: Build-Measure-Learn loop. IDEA > (build) > CODE > (measure) > DATA > (learn) > IDEA]
5. 5
REQUIREMENTS
IDEA: Business must be aware of opportunities to use algorithms
CODE: Data science projects should have the lowest possible time to market
DATA: Data must be easily accessible
7. 7
DATA
Data must be easily accessible
8. 8
Your Datalake is a service to your company.
It should be managed like a startup
Your employees are your first clients. The more they use it, the more data-driven you are
9. 9
FOCUS ON USABILITY OVER ARCHITECTURE
Datalake and services, run by the datalake team: ops, devs, designers
For end users and projects, the team:
- designs services for usability and provides support
- gathers requirements and usage metrics
10. 10
FOCUS ON USABILITY OVER ARCHITECTURE : EXAMPLES
How simple is it to share data with other projects?
How simple is it to subscribe to a data feed?
Is it possible to run a full-text search on available datasets?
Is it possible to ask other projects for details about their data through a social network?
Auto-completion based on SQL queries from other projects?
Bookmarking, sharing, upvoting datasets, tagging metadata…
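The usability questions above can be made concrete with a toy catalog. This is a minimal sketch, not a description of any actual datalake tooling: the `DatasetCatalog` class, its methods, and the sample dataset names are all hypothetical. It shows a naive full-text search over dataset metadata, plus tagging and upvoting.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    owner: str
    tags: set = field(default_factory=set)
    upvotes: int = 0


class DatasetCatalog:
    """Hypothetical in-memory catalog of datasets published in a datalake."""

    def __init__(self):
        self._datasets = {}

    def publish(self, dataset):
        self._datasets[dataset.name] = dataset

    def tag(self, name, tag):
        self._datasets[name].tags.add(tag.lower())

    def upvote(self, name):
        self._datasets[name].upvotes += 1

    def search(self, query):
        """Naive full-text search: every query word must appear in the
        name, description, or tags. Most upvoted datasets come first."""
        words = query.lower().split()
        hits = []
        for ds in self._datasets.values():
            text = " ".join([ds.name, ds.description, *sorted(ds.tags)]).lower()
            if all(word in text for word in words):
                hits.append(ds)
        return sorted(hits, key=lambda ds: (-ds.upvotes, ds.name))


# Invented sample data, for illustration only.
catalog = DatasetCatalog()
catalog.publish(Dataset("web_clicks", "Raw clickstream from the web site", "team-web"))
catalog.publish(Dataset("orders", "Customer orders from the online shop", "team-sales"))
catalog.tag("orders", "customer")
catalog.upvote("orders")

results = [ds.name for ds in catalog.search("customer")]
```

A real service would sit on a search index and collect usage metrics, but even this shape makes the questions testable: can a newcomer find a dataset, see who owns it, and tell which ones colleagues found useful?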
14. 14
(Not so) Big Data Infrastructure (for exploration)
15. 15
WHAT IF WE GIVE LESS DATA TO OUR ALGORITHMS?
Cf. Zoltan Prekopcsak, Hadoop Summit EU 2015
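One simple way to give less data to an algorithm is to explore on a uniform random sample. The sketch below uses reservoir sampling (Algorithm R), which is a choice made here for illustration and not something taken from the talk or the cited session. The synthetic `events` stream and the function name `reservoir_sample` are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown
    length, reading it once and keeping only k items in memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical "big" dataset: synthetic events, never materialized as a list.
events = (n % 100 for n in range(200_000))
sample = reservoir_sample(events, k=1_000, seed=42)
sample_mean = sum(sample) / len(sample)
```

The sample fits on a laptop and can be explored with Python or R, while the full stream stays in the datalake.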
16. 16
FEATURE TEAMS TO DELIVER CODE READY FOR PRODUCTION
Business representative
Developer
Data scientist
17. 17
MESSAGE BROKER TO REUSE DATA FLOWS
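Without a broker: App A, App B, DW and DB X exchange data point to point (? ? ?)
- Custom dev
- Data formats?
- SLA?
- Scheduling?
…
With a broker: App A, App B, DW, DB X and App C exchange data through Kafka
- Standard format
- Prod ready
- Exploration and prod will share the same formats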
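The reuse argument can be sketched without a real broker. The `InMemoryBroker` class below is a hypothetical stand-in that only mimics the idea of an append-only topic with per-consumer offsets; it is not the Kafka client API, and the topic and group names are invented.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a log-based broker: each topic is an append-only list,
    and each consumer group tracks its own read offset."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, group, topic):
        """Return the messages this group has not read yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(group, topic)]
        self._offsets[(group, topic)] = len(log)
        return log[start:]

    def rewind(self, group, topic):
        """Reset the group's offset to replay the topic from the beginning."""
        self._offsets[(group, topic)] = 0


broker = InMemoryBroker()
# App A publishes once, in one agreed format ...
broker.publish("orders", {"order_id": 1, "amount": 30.0})
broker.publish("orders", {"order_id": 2, "amount": 12.5})

# ... and any number of consumers reuse the same flow independently.
dw_batch = broker.poll("data-warehouse", "orders")
app_c_batch = broker.poll("app-c-exploration", "orders")

broker.publish("orders", {"order_id": 3, "amount": 7.0})
dw_next = broker.poll("data-warehouse", "orders")
```

Adding a new consumer requires no custom development on the producer side: it picks a group name and reads the existing topic in the format every other consumer already uses.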
18. 18
KAPPA ARCHITECTURE : EVERYTHING IS A STREAM
Stream data (Kafka): Topic
Stream processing: Streaming app v1, Streaming app v2
Serving DB: Result data v1, Result data v2
Batch jobs are just historical data you send into a streaming app
Application code is decoupled from technical requirements
One-shot exploration code that respects the stream abstraction can go into production easily
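The "everything is a stream" idea can be illustrated with plain Python iterators. This is a minimal sketch under stated assumptions: the event fields, the `revenue_per_customer` app, and the `serve` helper are invented for the example and do not come from the talk.

```python
from collections import defaultdict


def revenue_per_customer(events):
    """Streaming app: consume events one by one and yield the running total
    per customer after each event. It never asks whether the input is
    historical or live; it only sees an iterator."""
    totals = defaultdict(float)
    for event in events:
        totals[event["customer"]] += event["amount"]
        yield event["customer"], totals[event["customer"]]


def serve(updates):
    """Serving DB stand-in: keep the latest value per key."""
    table = {}
    for key, value in updates:
        table[key] = value
    return table


# "Batch job": replay historical data through the streaming app.
history = [
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "alice", "amount": 2.5},
]
batch_result = serve(revenue_per_customer(history))


def live_feed():
    """Hypothetical live source; in production this would read from the broker."""
    yield {"customer": "bob", "amount": 1.0}
    yield {"customer": "carol", "amount": 4.0}


live_result = serve(revenue_per_customer(live_feed()))
```

The same `revenue_per_customer` function handles both cases, which is why exploration code written against the stream abstraction needs no rewrite to run on new data.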
20. 20
IDEAS
Business must be aware of the opportunities to
use algorithms
21. 21
MIX THESE PEOPLE
Business: knows what is valuable
Data scientist: knows what is feasible
Culture & collaboration
22. 22
FEATURE TEAMS ONCE AGAIN
Business representative
Developer
Data scientist
23. 23
EXPLAIN TO THEM THAT MACHINE LEARNING IS EASY (IT’S METHODOLOGY)
24. 24
EXPLAIN TO THEM THAT MACHINE LEARNING IS EASY (IT’S MAGIC)
25. 25
SPEND TIME TOGETHER
Show them the data
Pair Programming
Swap roles for one day
27. 27
Story: OCTO data science competition platform
28. HOW WIDELY DATA-DRIVEN IS YOUR COMPANY?
Everybody is willing to create value from the available data
Data serves not only the core business but every single function
Data is used in day-to-day activity in real time
29. HOW DEEPLY DATA-DRIVEN IS YOUR COMPANY?
You are using cutting-edge algorithms to automate processes
You are used to A/B testing based on data every week
You cross multiple data sources to build insights and models
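For the A/B testing habit, a weekly comparison of two conversion rates can be checked with a two-proportion z-test. The sketch below is one common way to do it, not a method prescribed by the talk. The visitor and conversion counts are invented.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two conversion rates.
    Returns (z statistic, p-value), using the pooled-proportion standard error."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical weekly experiment: variant B against control A.
z, p_value = two_proportion_z_test(200, 5_000, 260, 5_000)
significant = p_value < 0.05
```

The threshold of 0.05 is a convention chosen for the example; the team should agree on it, and on the sample size, before the experiment starts.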
Editor's Notes
When you are able to evaluate the returns on your data analysis investments
The true revolution is analytics and data science, and how democratized and easy to use they have become
How many views on which dashboards? Trends? Useless ones? Dashboard recommendation?
Data item search: I want to search for any dataset based on its content and semantics
Code and query transparency: what do people query the most on the database? How do they do it?
Easy pub/sub data brokering: "Hey, as a data scientist, I subscribe to this data stream."
Free search
360° customer view
Interactive web interfaces
Personalized dashboards
Subscription to data news
Enterprise social network built around data-oriented content
THIS IS NOT MAINLY ABOUT SCALABILITY
Example : instead of offering many complex services, just provide Hive/HDFS/Spark access with good guarantees and secured access
Offer software factories
Rely on Python and R:
Big Data frameworks are not yet mature enough for data exploration
Most of the time you don’t even have Big Data
Undersampling services over Big Data
Data scientists' tools are not built for production
The most commonly used tools (IPython, RStudio…) do not scale, produce poor-quality code, and rely on libraries of variable stability
The market tools presented as must-haves are slow to use and often lead to elephantine architectures. In addition, their data science features are limited (Spark MLlib, for example)
From one project to the next, the libraries used for exploration can vary enormously or become very specific (Vowpal Wabbit, deep learning, custom code…)
Going to production raises many questions:
Will the architecture choices limit the model from a scientific standpoint?
Should models be simplified for the sake of operability?
What technical performance is expected? Is it consistent with the libraries or the scientific choices made during the studies?
What are the differences between the offline study and a real model in production? Study hidden feedback loop effects and set up A/B testing
One topic per entity
Everything is Avro
The integration effort is provided by the datalake team
Once it's done, the format is known and it's easy to get data
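As an illustration of "one topic per entity, everything is Avro", the sketch below pairs a topic name with an Avro record schema. Avro schemas are JSON documents, so the schema is written here as a Python dictionary and serialized with `json`. The `topic_for` naming convention, the `Customer` record, its fields, and the namespace are all invented for the example.

```python
import json


def topic_for(entity):
    """Hypothetical naming convention: exactly one topic per business entity."""
    return f"entity.{entity.lower()}"


# Avro record schema for an invented Customer entity.
CUSTOMER_SCHEMA = {
    "type": "record",
    "name": "Customer",
    "namespace": "com.example.datalake",
    "fields": [
        {"name": "customer_id", "type": "string"},
        # Optional field: a union with "null" first, defaulting to null.
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "created_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}

# One entry per entity: topic name -> schema as JSON text,
# in the form an .avsc file would hold.
registry = {topic_for("Customer"): json.dumps(CUSTOMER_SCHEMA, indent=2)}
```

Once the datalake team has published such a schema, every consumer of the topic knows the field names and types without asking the producing project.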
On the left: in development I build some data flows, and in production I build others because of different constraints (or because I had done a one-shot import)
On the right: as a developer I have a standardized interface, and when I move to production I have the same read interface as during exploration
Why it's good: you can test on both historical and new data
Running in production is transparent whatever the requirements are, batch or streaming
You explore data with this in mind, rather than on static data that will need further development
Rituals of a team: the 10-minute rule, morning stand-ups, brown bag lunches, monthly conferences
Machine learning is 90% about methodology: it can be understood by a non-technical person.
How do you define the problem? What is your target? What is the question you're trying to answer? What is an example in your dataset?
How do you choose and generate features that are relevant to your target?
How do you cross-validate the results, what does the validation metric mean to the business?
How do you make a profit from the model in production? Is there any particular issue such as hidden feedback loops or presentation bias?
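The cross-validation question above can be shown to a non-technical audience in a few lines. The following is a minimal sketch of k-fold cross-validation in plain Python, using a majority-class baseline as the "model". The churn labels are synthetic and every function name is an assumption made for the example.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=5):
    """Train on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    folds = k_fold_indices(len(labels), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Baseline "model": always predict the most frequent label seen in training.
def fit_majority(_features, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, _x):
    return model


# Synthetic target (an assumption for illustration): 80% of customers do not churn.
labels = ["stay"] * 80 + ["churn"] * 20
features = [[i] for i in range(100)]
scores = cross_validate(features, labels, fit_majority, predict_majority, k=5)
mean_accuracy = sum(scores) / len(scores)
```

The baseline reaches 80% accuracy without learning anything about churn, which is exactly the business reading of the metric: a real model only has value if it beats that number on the customers who matter.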
Mention TRAINING.
Data scientist and head of marketing
Coding CDO
Gamification / Engagement / Aligning the culture between different departments and profiles