Regrettably, datasets in the wild are much less clean than those in academia. At Merantix, we apply deep learning to real-world problems. In my talk at the Berlin.AI event on May 10, 2017, I shared three key learnings.
19. Problem: Datasets are expensive
Example 1, medical diagnostics: cost of annotating 10’000 medical images
- 30 min required per labelled image
- 100 EUR/hour
- 2 images/hour
- 50 EUR/image
Total: 10’000 images × 50 EUR/image = EUR 500’000
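The arithmetic behind Example 1 can be reproduced in a few lines. This is a minimal sketch using only the figures from the slide; the function name `annotation_cost` and its parameter names are made up for illustration.

```python
def annotation_cost(num_images, minutes_per_image, hourly_rate_eur):
    """Return (images per hour, cost per image, total cost) for manual labeling."""
    images_per_hour = 60 / minutes_per_image
    cost_per_image = hourly_rate_eur / images_per_hour
    total_cost = num_images * cost_per_image
    return images_per_hour, cost_per_image, total_cost


# Figures from the slide: 10'000 images, 30 min per image, 100 EUR/hour.
images_per_hour, cost_per_image, total_cost = annotation_cost(
    num_images=10_000, minutes_per_image=30, hourly_rate_eur=100
)
print(images_per_hour, cost_per_image, total_cost)  # 2.0 50.0 500000.0
```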
Example 2, credit scoring: cost of knowing if someone defaults
- To estimate default risk, labels of defaulted people are required
- You can only get them if you let them default
EUR 10’000/d, assuming an average default volume of EUR 10K
20. Pretraining is the solution!
1. Pretraining with cheap but large datasets from a related domain
2. Fine-tuning with well-labeled data
Performance boost!!
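The two steps can be sketched with a toy model. This is only an illustration under stated assumptions: a tiny logistic regression stands in for a deep network, both datasets are made up, and the learning rates and epoch counts are arbitrary. The point is the mechanics: fine-tuning starts from the pretrained weights instead of from zero.

```python
import math


def predict(weights, features):
    """Sigmoid of the linear score; the last weight is the bias."""
    score = weights[-1] + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(weights, dataset):
    """Mean cross-entropy of the model on (features, label) pairs."""
    total = 0.0
    for features, label in dataset:
        # Clamp to avoid log(0) when the model is extremely confident.
        p = min(max(predict(weights, features), 1e-12), 1 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(dataset)


def train(weights, dataset, learning_rate, epochs):
    """Full-batch gradient descent, starting from the given weights."""
    weights = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in dataset:
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                grad[i] += error * x
            grad[-1] += error  # gradient of the bias term
        weights = [w - learning_rate * g / len(dataset) for w, g in zip(weights, grad)]
    return weights


# Toy data (made up): a large, cheaply labeled set from a related task,
# and a small, carefully labeled set for the target task.
grid = [i / 10 for i in range(-10, 11)]
cheap_data = [((a, b), int(a + b > 0)) for a in grid for b in grid]
clean_data = [((a, b), int(a + 0.5 * b > 0.2)) for a in grid[::5] for b in grid[::5]]

start = [0.0, 0.0, 0.0]
pretrained = train(start, cheap_data, learning_rate=0.5, epochs=200)    # step 1
finetuned = train(pretrained, clean_data, learning_rate=0.1, epochs=50)  # step 2
from_scratch = train(start, clean_data, learning_rate=0.1, epochs=50)   # baseline

print("fine-tuned loss on clean data:  ", log_loss(finetuned, clean_data))
print("from-scratch loss on clean data:", log_loss(from_scratch, clean_data))
```

Whether pretraining actually helps depends on how closely the cheap task is related to the target task; the toy numbers printed here say nothing about real datasets.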
21. How to get data for pretraining
Three sources: crawled data, public datasets, and pretrained models.
[Figure: examples from IMDB and WIKI with numeric labels]
22. Weakly labeled data: Medical imaging
We don’t have labeled data, so we get the labels from medical reports.
We extract text labels via NLP and use them for training.
How do we do this?
1 Condition 2 Prognosis
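Keine Pleuraerguss in der linken Lunge (No pleural effusion in the left lung)
Keine Erguss in der linken Lunge (No effusion in the left lung)
Keine Pleuraergusses in der linken Lunge (No pleural effusion in the left lung)
Keine Randwinkelerguss in der rechte Lunge (No costophrenic angle effusion in the right lung)
Keine Erguß in der Lunge (No effusion in the lung)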
Word embeddings help to come up with smart rules
If “Kein”/“Keine” (“no”) → NO_EXISTENCE
If “Einige Beweise” (“some evidence”) → SMALLER_EXISTENCE
Else → DEFINITE_EXISTENCE
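The three rules translate directly into a small labeling function. This is a minimal sketch, not the production pipeline: the regular expressions and the function name `label_finding` are assumptions, and the last two example sentences are made up to exercise the second and third rule.

```python
import re

# Rule order follows the slide: negation first, then weak evidence, else definite.
NEGATION = re.compile(r"\bkeine?\b", re.IGNORECASE)              # "Kein" / "Keine"
WEAK_EVIDENCE = re.compile(r"\beinige beweise\b", re.IGNORECASE)  # "Einige Beweise"


def label_finding(sentence):
    """Map one report sentence to a weak label using the three rules."""
    if NEGATION.search(sentence):
        return "NO_EXISTENCE"
    if WEAK_EVIDENCE.search(sentence):
        return "SMALLER_EXISTENCE"
    return "DEFINITE_EXISTENCE"


sentences = [
    "Keine Pleuraerguss in der linken Lunge",   # from the slide
    "Einige Beweise für Erguss in der Lunge",   # made-up example
    "Pleuraerguss in der linken Lunge",         # made-up example
]
for sentence in sentences:
    print(label_finding(sentence), "<-", sentence)
```

The rules only fire on exact keywords, so spelling variants of the finding itself (Erguss, Erguß, Pleuraergusses) still have to be grouped by other means.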
25. Academic datasets are balanced
Example 1: MNIST - equally many samples per digit
Example 2: Food-101 - perfectly balanced
[Figure: training set and test set]
26. Real world datasets are not...
Credit scoring: 1-2% of people default
Medical imaging: luckily, the majority of people are healthy
27. And: Making mistakes can be expensive
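Credit scoring: decision (Accept / Reject) vs. outcome (Paid / Defaulted); the cost of a mistake ranges from $ to $$$$$
Medical imaging: decision (Diagnosed / Not diagnosed) vs. condition (Healthy / Sick)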
28. How to cope with this
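1. More data
[Figure: three additional "Sick" samples, with the note "Be careful" and the labels "Training" and "Inference"]
2. Change labeling
[Figure: "Rare class A", "Rare class B", and "Frequent class" become "Rare class A & B" and "Frequent class"]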
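Changing the labeling can be as simple as mapping several rare classes onto one shared label before training. The sketch below assumes hypothetical label names and made-up class counts; it only shows the relabeling step.

```python
from collections import Counter

# Hypothetical mapping from fine-grained labels to a coarser label set.
MERGED_LABEL = {
    "rare_class_a": "rare_class_a_or_b",
    "rare_class_b": "rare_class_a_or_b",
    "frequent_class": "frequent_class",
}


def relabel(labels):
    """Replace every label by its merged counterpart."""
    return [MERGED_LABEL[label] for label in labels]


# Made-up label distribution for illustration.
labels = ["frequent_class"] * 96 + ["rare_class_a"] * 2 + ["rare_class_b"] * 2
merged = relabel(labels)

print(Counter(labels))
print(Counter(merged))
```

The merged class has more samples to learn from, at the price of no longer distinguishing A from B.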
29. How to cope with this
3. Sampling: oversampling, undersampling, negative mining
4. Weighting: weighting of the loss in the training batch
[Figure labels: "Easy" and "Hard" samples]
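Sampling and weighting can both be sketched with the standard library. The snippet below is an illustration under assumptions: the function names, the 98/2 class split, and the inverse-frequency weighting formula are choices made for this example, and negative mining is not shown.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Collect the samples of each class in a dict of lists."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extra)
    rng.shuffle(balanced)
    return balanced


def undersample(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class frequency (mean weight per sample is 1)."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * count) for label, count in counts.items()}


# Made-up imbalanced dataset: 98 healthy, 2 sick.
labels = ["healthy"] * 98 + ["sick"] * 2
samples = list(range(100))

over_counts = Counter(label for _, label in oversample(samples, labels))
under_counts = Counter(label for _, label in undersample(samples, labels))
weights = inverse_frequency_weights(labels)
print(over_counts, under_counts, weights)
```

In a weighted loss, each sample's loss term would be multiplied by the weight of its class, so a mistake on a "sick" sample counts far more than one on a "healthy" sample.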
32. Neural networks are black boxes
Linear regression / decision trees:
Decision mechanism can be easily explained
Neural networks:
Complex systems are hard to understand!
In reality: 100M+ parameters...
33. This is problematic in the real world! Why?
[Figure: images classified as King penguin, Starfish, Baseball, Electric guitar]
[Figure: Panda (57.7% confidence) + ε = Gibbon (99.3% confidence)]
Can the neural network be fooled? Does it really work in production?
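How a small perturbation can flip a prediction is easiest to see on a toy model. The slide does not say how its perturbation was computed; the sketch below uses the fast gradient sign method as one well-known option, applied to a made-up logistic model with 100 inputs instead of a deep network. All weights and inputs are invented for illustration.

```python
import math


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(weights, bias, x, label, epsilon):
    """Shift every input by +/- epsilon in the direction that increases the loss."""
    # For cross-entropy loss on a logistic model: d(loss)/d(x_i) = (p - label) * w_i.
    error = predict(weights, bias, x) - label
    return [xi + epsilon * sign(error * w) for xi, w in zip(x, weights)]


# Made-up model and input with 100 features.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.0
x = [0.02 * w for w in weights]  # confidently classified as positive

x_adv = fgsm_perturb(weights, bias, x, label=1, epsilon=0.05)

print("original prediction: ", predict(weights, bias, x))
print("perturbed prediction:", predict(weights, bias, x_adv))
```

No input changes by more than 0.05, yet the prediction drops from about 0.88 to about 0.05, because the small shifts add up across all 100 inputs. With millions of input dimensions the same effect needs far smaller changes per input.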
34. This is problematic in the real world! Why?
Why DIDN’T it work? What biases does it learn?
35. Our Picasso Visualizer in practice
Visualizations shown: partial occlusion, saliency map
Soon to be open-sourced!
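The idea behind a partial occlusion visualization can be sketched independently of any tool. The code below is not the Picasso API; it is a minimal illustration in which `occlusion_map`, its parameters, and the stand-in classifier `toy_score` are all hypothetical. It blanks out one patch at a time and records how much the model's score drops.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Score drop when each patch x patch window of a 2D image is blanked out."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy, keep the original intact
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


def toy_score(image):
    """Hypothetical stand-in for a classifier: it only looks at the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(image, toy_score)
for row in heatmap:
    print(row)
```

Large values mark regions the model depends on. Here only the top-left patch matters, which is exactly the kind of bias such a visualization is meant to expose.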
36. Join us on our journey
1. Science: Research on the bleeding edge of deep learning.
2. Datasets: Get access to some of the best datasets in the world.
3. Business: Grow businesses in the space of AI/deep learning.