How can you thrive in a future where machine learning has been popular for a few years already?
In this talk, I will give you actionable advice from my experience training serious data scientists at our retreat center in Berlin. You are going to face these pointy, hard questions:
- What is the promise of machine learning? Has it happened yet?
- Is it easy to take advantage of machine learning, now that most algorithms are nicely packaged in APIs and libraries?
- How much time should I spend getting good at machine learning? Am I good enough now?
- Are data scientists going to be replaced by algorithms? Are we all?
- Is it easy to hire talent in machine learning after the explosion of MOOCs?
3. The machine learning promise
People should be able to predict:
• Which employee will leave in the next 6 months
• Which electric generator is likely to die in the next 2 weeks
• Which sales lead has the highest potential to close in the next 3 months
• What each new website visitor is likely to buy based on past visitors
6. Smile detection
Example graduate portfolio project from DSR
03. Smile detection on video streams. Works
reliably with multiple people on camera.
Applications: evaluating funny videos on YouTube
7. Data analysis has become super easy.
But has it?
• Great libraries exist with every algorithm under the sun
8. The machine learning promise
(Anyone who can turn on a computer) should be able
to predict:
• Which employee will leave in the next 6 months
• Which electric generator is likely to die in the next 2 weeks
• Which sales lead has the highest potential to close in the next 3 months
• What each new website visitor is likely to buy based on past visitors
15. Two machine learners, two maps
Andreas Mueller, PhD
Andy is an Assistant Research Scientist
at the NYU Center for Data Science,
building a group to work on open
source software for data science.
Previously he was a Machine Learning
Scientist at Amazon, working on
computer vision and forecasting
problems. He is one of the core
developers of the scikit-learn machine
learning library, and has maintained
it for several years.
He authored the now-famous model
picker image from scikit-learn.
Trent McConaghy, PhD
Trent is co-founder & CTO of ascribe,
which uses modern crypto, ML, and
big data to tackle challenges in digital
property ownership. His two startups
applied ML in the enterprise
semiconductor space: ADA was acquired in
2004 and Solido is going strong. His
interests include large scale
regression, automating creativity,
anything labeled "impossible", and
thousand-fold improvements. He was
raised on a pig farm in Canada.
16. Why data analysis is still hard, after
all the libraries and APIs
• It’s too easy to lie to yourself about it working
• It’s very hard to tell whether it could work if it doesn’t
• There is no free lunch
http://blog.mikiobraun.de/2014/02/data-analysis-hard-parts.html
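A minimal sketch of the first bullet, assuming nothing beyond the Python standard library. The data here is pure noise (labels are coin flips unrelated to the features), and the "model" is a plain 1-nearest-neighbor rule chosen only for illustration. Scored on its own training data it looks perfect; scored on held-out data it is no better than guessing.

```python
import random

def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )
    return label

def accuracy(train, data):
    """Fraction of examples in `data` that the 1-NN rule labels correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return hits / len(data)

def make_noise(n, rng):
    """Synthetic data: two random features, label is an unrelated coin flip."""
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]

rng = random.Random(0)
train = make_noise(200, rng)
held_out = make_noise(200, rng)

train_acc = accuracy(train, train)        # each point is its own nearest neighbor
held_out_acc = accuracy(train, held_out)  # there is no signal to find

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {held_out_acc:.2f}")
```

The only difference between the two numbers is which data the model is scored on, which is why evaluating on data the model has already seen is such an easy way to fool yourself.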
17. No free lunch theorem
• There is no universally optimal learning algorithm as
shown by the No Free Lunch Theorem: There is no
algorithm which is better than all the rest for all kinds
of data.
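A toy illustration of the point, not a proof of the theorem: two simple learners are trained on two synthetic datasets, and each one wins on one of them. The learners (a one-threshold "stump" and 1-nearest-neighbor), the datasets, and the 20% label-noise level are all assumptions picked for this sketch.

```python
import random

def fit_stump(train):
    """Learn the threshold t that maximizes training accuracy of 'predict 1 if x > t'."""
    best_t, best_hits = 0.0, -1
    for t in [i / 100 for i in range(101)]:
        hits = sum((x > t) == bool(y) for x, y in train)
        if hits > best_hits:
            best_t, best_hits = t, hits
    return lambda x: int(x > best_t)

def fit_nearest_neighbor(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def noisy_threshold(n, rng):
    """Label is 1 when x > 0.5, but 20% of labels are flipped at random."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def stripes(n, rng):
    """Label alternates between 0 and 1 across ten equal-width bands, no noise."""
    return [(x, int(x * 10) % 2) for x in (rng.random() for _ in range(n))]

rng = random.Random(0)
results = {}
for name, make in [("noisy threshold", noisy_threshold), ("stripes", stripes)]:
    train, test = make(300, rng), make(1000, rng)
    results[name] = {
        "stump": accuracy(fit_stump(train), test),
        "1-NN": accuracy(fit_nearest_neighbor(train), test),
    }
    print(name, results[name])
```

The stump ignores label noise but cannot represent the striped pattern; 1-nearest-neighbor follows the stripes closely but also memorizes the noise. Which one is "better" depends entirely on the data.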
18. “Toolified”
• As more and more ML techniques become "toolified" the
problem is that the business doesn't understand that the
hard work is still ahead of them.
• Home Depot sells hammers and lumber, and while some
people have the skill and dedication to build their own
house, most folks are smart enough to hire someone that
knows what they're doing so the thing doesn't fall in and kill
their family.
• Blind faith in the power of tools is not helpful
19. 80% data mangling, 20% building & testing
models
Is model building automatable?
How about the data wrangling part? It’s actually the larger chunk
23. • Zoubin Ghahramani, Automatic statistician
• It's easy to shoot yourself in the foot with automated
tools — and convince yourself that the results are
meaningful when they're not
24. Alternative:
interfaces that draw
the most useful
information out of
people
Aka ‘The Luis von Ahn trick’.
Human computation: combine
human brainpower with computers
to solve problems that neither could
solve alone.
ReCAPTCHA: Computer-generated
tests that humans are routinely able
to pass but that computers have not
yet mastered.
26. Goal
• Become a full-stack problem solver
• AKA the unicorn data scientist
27. How to get there
• Focus on delivering business value
28. How to get there
Only after the business side is covered: focus on the tech
stack.
• Machine learning
• Big data / engineering
• When to use ML at scale, when to sample and run on a single
machine
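One way to reason about the last bullet is to check how much precision a sample actually gives up. The sketch below is an assumption-laden illustration: the "full" dataset is synthetic, and the quantity of interest is a simple mean rather than a trained model. It compares the full-data answer with the answer from a 1% random sample and reports the sample's standard error.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical "full" dataset: 200,000 synthetic measurements.
population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
full_mean = statistics.fmean(population)

# A 1% simple random sample that fits comfortably on a single machine.
sample = rng.sample(population, 2_000)
sample_mean = statistics.fmean(sample)

# Standard error of the sample mean: the typical size of the sampling error.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"full data mean:  {full_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"standard error:  {standard_error:.2f}")
```

If the standard error is already smaller than the precision the business decision needs, the sample on one machine is probably enough; if not, that is an argument for more data or for running at scale.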
29. Constant learning
• The field changes faster than any other in technology
• If you are not willing to allocate ‘time outside work’ to
learn new things you will stagnate fast
30. Don’t be the equivalent of a code
monkey
• MOOCs have lowered the barrier to entry to machine
learning.
• Nowadays, you cannot be ‘the guy who knows how to
run (insert off-the-shelf-algo-here)’. In dataland, that’s
the equivalent of being a code monkey. MOOCs and
superb libraries (scikit-learn, R’s ecosystem) made sure
there are plenty of people who can throw, say, a random
forest at a problem. In the modern world, this does not
add that much value.
31. Picking problems to add the most
value
• Sometimes beating what the company is already doing
(often, nothing) offers a lot of value. Detecting fraud
poorly is better than not detecting fraud
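The fraud claim can be checked with back-of-the-envelope arithmetic. Every number below is hypothetical (transaction volume, fraud rate, average loss, detector quality, review cost); the function name `net_savings` is also made up for this sketch. The point is only the comparison against the "do nothing" baseline.

```python
def net_savings(n_transactions, fraud_rate, avg_fraud_loss,
                recall, false_positive_rate, review_cost):
    """Money saved per period by a detector, relative to doing nothing."""
    n_fraud = n_transactions * fraud_rate
    n_legit = n_transactions - n_fraud
    caught = n_fraud * recall                     # frauds the detector flags
    false_alarms = n_legit * false_positive_rate  # legitimate transactions flagged
    prevented_loss = caught * avg_fraud_loss
    review_bill = (caught + false_alarms) * review_cost  # every flag gets reviewed
    return prevented_loss - review_bill

# Hypothetical business: 10,000 transactions, 1% fraudulent, 500 lost per fraud.
poor_detector = net_savings(
    10_000, 0.01, 500.0,
    recall=0.30, false_positive_rate=0.02, review_cost=5.0,
)
no_detector = net_savings(
    10_000, 0.01, 500.0,
    recall=0.0, false_positive_rate=0.0, review_cost=5.0,
)

print(f"poor detector saves: {poor_detector:,.0f}")
print(f"no detector saves:   {no_detector:,.0f}")
```

With these made-up inputs, a detector that catches only 30 of 100 frauds still comes out roughly 13,860 ahead per period, because the baseline it competes against is zero.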
32. Data Science will continue to be
democratized
• There’s no shortage of data
scientists.
• 1900: Number of cars on the
road would be limited by the
supply of trained chauffeurs.
33. Machine learning can very quickly get
you, say, 80% of the way to solving just
about any (real world) problem
You want to apply ML to contexts that are fault tolerant:
• Online ad targeting
• Ranking search results
• Recommendations
• Spam filtering
34. ML quickly hits a point of
diminishing returns
“The gain is not worth the pain."
36. Talent: invest in it
• The hunt for the 10x programmer continues (although
few companies succeed)
• In data science, the equivalent is the unicorn data
scientist
• A unicorn data scientist should generate more business
value than a 10x programmer
• Market agrees: supersalaries of >200k are common for
unicorn data scientists
37. Talent: beware of the fake data
scientist
• Each LinkedIn job ad for a data scientist gets ~150
applications
• Often people who just rebranded themselves but have no
real experience
• Very common in guys bailing out of academia
• HR managers cannot tell the difference
• It’s a common mistake to hire one, and never be able to
produce business value
38. Talent: easier to find than you may
think
• Online courses have raised the bar
• Intensive bootcamps do work, as long as people have
built something at the end
• You will still get 150 fake data scientists for each decent
one
39. A future where ML has
been popular for years.
What does it look like?
40. Next 3 years
• ML APIs will enable people with less and less skill to run
quite sophisticated analyses
• Startups doing ML as a service will grow up, then
contract. ML will stop being a key competitive
advantage in most (not all) domains
• Blind faith in the power of tools will lead to wrong
decisions, which will lead to a backlash
41. Next 10 years
• Prediction: C-level people will be data scientists in the
future
• Product managers will become data scientists, or get
replaced by one
42. DS is a chaotic field and
people don’t really know
what they want (much less
what they need)
43. Interested in Data Science Retreat?
Apply to either of our two tracks
http://datascienceretreat.com/
46. References
• Paco Nathan. Data science in future tense
• Chris Dixon. Machine learning is really good at partially
solving just about any problem
• Jao. The Past, Present, and Future of Machine Learning
APIs
Editor's Notes
It was almost a joke
Too much email asking the ‘When to do what’ question
If you thought scikit-learn was convenient
What is business value? If you have been in academia or away from a customer-facing role for most of your career, you probably don’t have good intuitions about this. A sure-fire way to learn is to start a business, or take a customer-facing role. Even so, it may take years to know your market.
The discussion about the shortage of data scientists reminds me that in the early 1900s people thought that the number of cars on the road would be limited by the supply of trained chauffeurs. Then Henry Ford and others built cars that owners could drive themselves. New tools are going to be available that business owners can use themselves, without needing data scientists.
You need to apply ML to contexts that are fault tolerant: online ad targeting, ranking search results, recommendations, spam filtering.