This document discusses productionizing deep learning from the ground up. It begins with an overview of deep learning and neural networks, explaining that deep learning performs pattern recognition on unlabeled and unstructured data using deep neural networks with three or more layers. It then discusses challenges such as the computational intensity of deep learning models and the need for special hardware like GPUs. It also covers software engineering concerns in scaling deep learning to production, such as data pipelines, maintenance of GPU clusters, and the different kinds of parallelism (model and data).
3. Overview
● What is Deep Learning?
● Why is it hard?
● Problems to think about
● Conclusions
4. What is Deep Learning?
Pattern recognition on unlabeled & unstructured data.
5. What is Deep Learning?
● Deep Neural Networks >= 3 Layers
● For media/unstructured data
● Automatic Feature Engineering
● Benefits From Complex Architectures
● Computationally Intensive
● Accelerated by Special Hardware (GPUs)
7. Deep Networks >= 3 Layers
● Backpropagation-era, old-school ANNs stopped at 3 layers (input, one hidden, output); "deep" means going beyond that
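As an illustrative sketch of the ">= 3 layers" point, here is a minimal three-layer network (input, one hidden layer, softmax output) in NumPy; the layer sizes and random weights are toy values, not from the slides:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input (4 features) -> hidden (8 units)
W2 = rng.normal(size=(8, 3))   # hidden -> output (3 classes)

def forward(x):
    h = relu(x @ W1)                   # hidden layer with nonlinearity
    logits = h @ W2                    # output layer
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

probs = forward(np.array([0.5, -1.0, 2.0, 0.1]))
```

Adding more hidden layers between W1 and W2 is what turns this "old school" shape into a deep network.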
8. Deep Networks
● Neural Networks themselves as hidden layers
● Different types of layers can be interchanged/stacked
● Multiple layer types, each with its own hyperparameters and loss functions
14. Other kinds
● Memory Networks
● Deep Reinforcement Learning
● Adversarial Architectures
● New recursive ConvNet variant to come in 2016?
● Over 9,000 layers? (22 is already pretty common)
18. Benefits from Complex Architectures
Google’s result combined:
● LSTMs (learning captions)
● Word Embeddings
● Convolutional features from images (aligned to be the same size as the embeddings)
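The alignment above can be sketched as a linear projection that maps CNN features into the word-embedding space, so a caption model can consume image and word vectors interchangeably. The dimensions (2048, 300) and the projection matrix here are illustrative assumptions, not details from the Google result:

```python
import numpy as np

rng = np.random.default_rng(1)
conv_features = rng.normal(size=(2048,))  # hypothetical CNN feature vector
embed_dim = 300                           # hypothetical word-embedding size

# A learned projection matrix (random here) mapping the image
# features into the same space as the word embeddings.
W_proj = rng.normal(size=(2048, embed_dim)) * 0.01
image_vec = conv_features @ W_proj        # now embedding-sized
```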
19. Computationally Intensive
● One iteration of ImageNet (1,000-label dataset with over 1 million examples) takes 7 hours on GPUs
● Project Adam
● Google Brain
21. Software Engineering Concerns
● Pipelines to deal with messy data, not canned problems... (Real life is not Kaggle, people.)
● Scale/Maintenance (Clusters of GPUs aren’t done well today.)
● Different kinds of parallelism (model and data)
22. Model vs Data Parallelism
● Model parallelism shards the model itself across servers (HPC style)
● Data parallelism splits each mini-batch across model replicas
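A minimal sketch of synchronous data parallelism, assuming a toy linear model: each "worker" holds a full copy of the parameters, computes gradients on its own mini-batch shard, and the gradients are averaged before the update. (Model parallelism, by contrast, would split the parameters themselves across machines.)

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(3)                            # shared model (linear regression)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad(w, Xb, yb):
    # Gradient of mean squared error on one shard.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

shards = np.array_split(np.arange(8), 4)   # 4 workers, 2 examples each
grads = [grad(w, X[s], y[s]) for s in shards]
avg = np.mean(grads, axis=0)               # aggregate (parameter-server role)
w -= 0.1 * avg                             # synchronous update
```

With equal shard sizes, the averaged gradient equals the full-batch gradient, which is why this scheme is a drop-in way to scale mini-batch training.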
23. Vectorizing unstructured data
● Data is stored in different databases
● Different kinds of files (raw)
● Deep Learning works well on mixed signal types
25. Production Stacks today
● Hadoop/Spark not enough
● GPUs not friendly to the average programmer
● Cluster management of GPUs as a resource not typically done
● Many frameworks don’t work well in a distributed env (getting better, though)
26. Problems With Neural Nets
● Loss functions
● Scaling data
● Mixing different neural nets
● Hyperparameter tuning
28. Scaling Data
● Zero mean and unit variance
● Zero to 1
● Other forms of preprocessing relative to the distribution of the data
● Preprocessing can also be columnwise (e.g., for categorical features)
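The two scaling schemes above, applied columnwise as the slide suggests, can be sketched in NumPy (toy data):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Zero mean and unit variance, per column.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Zero-to-one (min-max) scaling, also per column.
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```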
29. Mixing and Matching Neural Networks
● Video: ConvNet + Recurrent
● Convolutional RBMs?
● Convolutional -> Subsampling -> Fully Connected
● DBNs: different hidden and visible units for each layer
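The Convolutional -> Subsampling -> Fully Connected stack can be sketched in plain NumPy; the image, kernel, and layer sizes below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.normal(size=(8, 8))        # toy single-channel image
kernel = rng.normal(size=(3, 3))     # one convolutional filter

# Valid convolution (cross-correlation, as most DL libraries do it).
conv = np.array([[np.sum(img[i:i+3, j:j+3] * kernel)
                  for j in range(6)] for i in range(6)])

# 2x2 max-pool subsampling.
pooled = conv.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Fully connected layer on the flattened pooled features.
W_fc = rng.normal(size=(9, 4))
out = pooled.ravel() @ W_fc
```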
31. Hyperparameter Tuning (2)
● Grid search for neural nets (Don’t do it!)
● Bayesian (Getting better. There are at least priors here.)
● Gradient-based approaches (Your hyperparameters are a neural net, so there are neural nets optimizing your neural nets...)
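One common alternative to the grid search the slide warns against is random search, which covers each individual hyperparameter dimension better for the same trial budget. A minimal sketch, assuming a toy validation-loss function (the hyperparameter names and ranges are illustrative, not from the slides):

```python
import random

def val_loss(lr, dropout):
    # Stand-in for an expensive training + validation run.
    return (lr - 0.01) ** 2 + (dropout - 0.3) ** 2

random.seed(0)
# Sample hyperparameters instead of walking a fixed grid;
# the learning rate is sampled on a log scale.
trials = [(10 ** random.uniform(-4, -1), random.uniform(0.0, 0.8))
          for _ in range(20)]
best = min(trials, key=lambda t: val_loss(*t))
```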