Practical Guide to Computer Vision

Practical Computer Vision
A problem-driven approach to learning CV/ML/DL
Albert Y. C. Chen, Ph.D.
Vice President, R&D
Viscovery

Albert Y. C. Chen, Ph.D.
• Experience
2017-present: Vice President of R&D @ Viscovery
2016-2017: Chief Scientist @ Viscovery
2015: Principal Scientist @ Nervve Technologies
2013-2014 Computer Vision Scientist @ Tandent
2011-2012 @ GE Global Research
• Education
Ph.D. in Computer Science, SUNY-Buffalo
M.S. in Computer Science, NTNU
B.S. in Computer Science, NTHU

1. W.Wu,A.Y. C. Chen, L. Zhao, and J. J. Corso. Brain tumor detection and segmentation in a CRF framework with pixel-wise
afﬁnity and superpixel-level features. International Journal of Computer Assisted Radiology and Surgery, 2015.
2. S. N. Lim,A.Y. C. Chen and X.Yang. Parameter Inference Engine (PIE) on the Pareto Front. In Proceedings of International
Conference of Machine Learning,Auto ML Workshop, 2014.
3. A.Y. C. Chen, S.Whitt, C. Xu, and J. J. Corso. Hierarchical supervoxel fusion for robust pixel label propagation in videos. In
Submission to ACM Multimedia, 2013.
4. A.Y.C. Chen and J.J. Corso.Temporally consistent multi-class video-object segmentation with the video graph-shifts
algorithm. In Proceedings of IEEE Workshop on Applications of ComputerVision, 2011.
5. D.R. Schlegel,A.Y.C. Chen, C. Xiong, J.A. Delmerico, and J.J. Corso. Airtouch: Interacting with computer systems at a
distance. In Proceedings of IEEE Workshop on Applications of ComputerVision, 2011.
6. A.Y.C. Chen and J.J. Corso. On the effects of normalization in adaptive MRF Hierarchies. In Proceedings of International
Symposium CompIMAGE, 2010.
7. A.Y.C. Chen and J.J. Corso. Propagating multi-class pixel labels throughout video frames. In Proceedings of IEEE Western
NewYork Image Processing Workshop, 2010.
8. A.Y. C. Chen and J. J. Corso. On the effects of normalization in adaptive MRF Hierarchies. Computational Modeling of
Objects Represented in Images, pages 275–286, 2010.
9. Y.Tao, L. Lu, M. Dewan,A.Y. C. Chen, J. J. Corso, J. Xuan, M. Salganicoff, and A. Krishnan. Multi-level ground glass nodule
detection and segmentation in ct lung images. Medical Image Computing and Computer-Assisted Intervention, 2009.
10. A.Y.C. Chen, J.J. Corso, and L.Wang. Hops: Efﬁcient region labeling using higher order proxy neighborhoods. In
Proceedings of IEEE International Conference on Pattern Recognition, 2008.

Some work done before I
caught the startup fever
Freestyle Sketching Stage
AirTouch waits in background
for the initialization signal
Initialize
Terminate
Output
image
database
Start:
Results
CBIR
query
Airtouch HCI interface for Content-based Image Retrieval

Interactive Segmentation & Classification
• Segmentation then classification:
• computationally more efficient,
• results in much higher classification accuracy.
• Pioneered the “pixel label propagation” field.
• First to utilize superpixels and supervoxels for the task.
FG
Traditional Spatial
Propagation
Pixel label map
Label a subset of pixels
BG
Spatio-temporal Propagation
time

Image/Video Object Recognition
and Content Understanding
approaches
person carries
gives
recieves
Ontology
object
Person 1
Person 1Person 2
High-Level
Mid-Level
approach
activity
receives gives
carries
activity
activity activity
Time
Reasoning
x
x
x
Low-Level
x x
x
x

Learning and Adapting Optimal
Classiﬁer Parameters
subspace B
subspace
A
subspace
C
Image-level feature space
priors
Patch-level feature space
posterior
probability
suggest optimal
parameter conﬁguration

Graphical Models and
Stochastic Optimization
A
(a) The space-time volume of a
video showing the objects
(A--F) and their appearing
time-span.
space
time
A
B
C
D
E
F
B E
F
C
D
(b) The temporal relationship
graph. An edge between
two vertices mean that the
two objects overlap in time.
(c) The goal is: cover all objects
with the smallest number of
"ground truth key frames".
space
time
A
B
C
D
E
F
key 1 key 2
A
B E
F
C
D
(d) This translates to: iteratively
solving the max clique
problem until all vertices
belong to a clique.
A
B E
F
C
D
key 2
key 1
frame t-1 frame t
layer n layer n
layer n+1 layer n+1
Temporal
Shift
Shift
µ

Medical Imaging and
Geospatial Imaging
GNN detection and
segmentation
in Lung CT geospatial imaging:
building detection
Brain tumor detection and
segmentation in MR images.

Why are we here today?
To make a better change for our future.

Change is the only constant
-Heraclitus (535 BC - 475 BC)

Why Risk Innovating?
• Good business model NEVER last forever.
• Average “shelf life” on S&P 500: 20 years.
• 100-year old companies constantly reinvent
themselves every 10-20 years
• Startups contribute to 20% of USA’s GDP.

The Death of a Good
Business Model
• Foxconn 20 year revenue v.s. net proﬁt (now at 5%)

What do 100 year old
corporations do?
GE Schenectady, 1896

History of change at GE
• 1886: one of the 12 original companies on the Dow
Jone Industrial Average (also the only one remaining).
• 1889: lightbulbs
• 1919: radios
• 1927: TV
• 1941: jet engine
• 1960: nuclear power
• 1971: room AC units
• 1995: MRI

History of change at IBM
• 1960s: mainframe computer
• 1980s: personal computer
• 2000s: integrated solutions
• 2020s: AI, Watson

How about the leading
Semiconductor companies?

NVidia reinventing itself
—2 times in 20 years

“Bad money drives out good”
in the desktop GPU market

The rise of mobile computing,
and how NVidia missed the boat!

NVidia’s Tegra mobile
processors never took off
then, the market
saturated…

NVidia not just survived.
NVidia is thriving!

Meet the new NVidia: Deep Learning,
Deep Learning, and still, Deep Learning

The king is dead,
long live the king!

Now, again, do we want to
do OEM/ODM forever?
Optimizing an old business model
is just delaying its eventual death.

Computer Vision, it can’t be
that hard, right?
hmm… grayscale color can’t work alone…
maybe color works better?

that hard, right?
White and Gold
or
Blue and Black?
The Dress
2015/02/26

that hard, right?

Even if we can auto-correct all
lighting and color temperature
[w w w w]
[w r r w]
[w r r w]
[w w w w]
and force all apples
to be encoded as:
we’d still have all these “afﬁne transformation” issues:

Even if lighting, color, afﬁne
transformation are not an issue
• Our 3D world can’t simply be represented by
ﬁxed 2D encoding:

Brief History
Marvin Minsky
“In 1966, Minsky hired a ﬁrst-year undergraduate
student and assigned him a problem to solve over the
summer: connect a television camera to a computer
and get the machine to describe what it sees.”
Gerald Sussman
The student never worked on
Computer Vision problems again.

Brief History
• 1960’s: interpretation of synthetic worlds
• 1970’s: some progress on interpreting selected images
• 1980’s: ANNs come and go; shift toward geometry and increased
mathematical rigor
• 1990’s: face recognition; statistical analysis in vogue
• 2000’s: broader recognition; large annotated datasets available; video
processing starts
Guzman ‘68 Ohta Kanade ‘78 Turk and Pentland ‘91

What was in our arsenal?
• Image ﬁlters
• Feature descriptors
• Classiﬁers

Features: a compact and
(hopefully) invariant representation

Features: Laplacian of Gaussian
(LoG; scale detection)

Features: Orientation
How to compute the rotation?
Create edge orientation
histogram and find peak.

Classiﬁer Training in
Machine Learning
Classiﬁcation Clustering
Regression
Dimension
Reduction
supervised unsupervised
continuousdiscrete

Classiﬁers: Deformable Parts
Model (DPM)

Meta-Learning
• Different use
cases calls for
different ML
algorithms.
• Meta-Learning:
learning how to
learn.
• Requires plenty of
domain-speciﬁc
know-how.

Neural Network (NN)
Why didn’t it work; why now?
• MNIST digit data 28x28
• LeCunn’s 3 layer NN:
1170 variables.
• Require tens of
thousands of samples.
• Only learn simple line/
curve combinations

AI Winter (1970-1980, 1990-2000)
• Early NN problems:
• redundant structure,
• slow learning speed
• need too much data
• bad learning
stability.

What’s in a NN
( )zσ+
( )zσ+
( )zσ+
( )zσ+
Input
weights
bias
activation
function

NN breakthroughs since 1970’s
1. Better Network Structure
• Convolutional Neural Network greatly reduces the number
of variables in NN’s designed for images and videos. —>
Improved convergence speed, reduced data requirements.
Upper-left corner
Bird Beak
Detector
Center Bird Beak
Detector
Almost identical, can be shared across regions

1. Network Structure

2. Improved Activation Functions
Large
Small
1x
2x
……
Nx
……
……
……
……
……
……
……
y1
y2
yM

3. Effective Backpropagation
w1
w2
Clipping
[Razvan Pascanu, ICML’13]

4. Efﬁcient Training Methods
• Mini-batch
• Adaptive
Learning Rate
• Dropout, Batch-
normalization
minibatchminibatch
1 epoch

Deep Neural Networks (DNN)
way more complex and capable!

What do DNNs learn?
• Neurons act like “custom-trained ﬁlters”; react to
very different visual cues, depending on data.

• Does not “memorize” millions of viewed images.
• Extracts greatly reduced number of features that
are vital to classify different classes of data.
• Classifying data becomes a simple task when
the features measured are “”good”.
What do DNNs learn?

Mature/Maturing Computer
Vision Applications

• Final inspection cells
• Robot guidance and
checking orientation of
components
• Packaging Inspection
• Medical vial inspection
• Food pack checks
• Verifying engineered
components[5]
• Wafer Dicing
• Reading of Serial
Numbers
• Inspection of Saw
Blades
• Inspection of Ball Grid
Arrays (BGAs)
• Surface Inspection
• Measuring of Spark
Plugs
• Molding Flash Detection
• Inspection of Punched
Sheets
• 3D Plane
Reconstruction with
Stereo
• Pose Veriﬁcation of
Resistors
• Classiﬁcation of Non-
Woven Fabrics
1970s-now: Machine Vision
for Industrial Inspection
• Automated Train
Examiner (ATEx)
Systems
• Automatic PCB
inspection
• Wood quality
inspection
• Final inspection of
sub-assemblies
• Engine part inspection
• Label inspection on
products
• Checking medical
devices for defects

Industrial Inspection: turbofan
jet engine blade maintenance
• Some seemingly daunting
machine vision tasks actually
works with relatively simple
image processing algorithms.

Industrial Inspection: Cognex Omniview

License Plate Recognition
(1979-now)

License Plate Readers with Text
Detection and Neural Networks

Automated Fingerprint
Identiﬁcation (1970s-now)

Face Recognition
(1990s-now)
• Face Detection (Viola and Jones, 2001)
• Face Veriﬁcation (1:1) v.s. Identiﬁcation (1:N)

Face Veriﬁcation and Identiﬁcation, 
Labeled Faces in the Wild (LFW)
Recognition
Accuracy:
• 1 to 1: 99%+
• 1 to 100: 90%
• 1 to 10,000:
50%-70%.
• 1 to 1M: 30%.
LFW dataset, common FN↑, FP↓

Sports—NFL ﬁrst down line
(1995-now)

Sports—NFL ﬁrst down line
minus
equals

3D Reconstruction
(As old as CV; became
practical since SIFT)

3D Reconstruction with Feature
Matching, Structure from Motion

Solving Panorama Problem
with Markov Random Fields
Input:

ICM (Iterated Conditional Modes), 1986

Belief Propagation (1980-2000)

Graph-Cuts (alpha expansion), 2001

Solving Photosynthesis Problems
with Alpha-matting (2000s-now)

Object Detection & Classification
state-of-the-art
• ImageNet Large Scale Visual
Recognition Challenge (ILSVRC)
• 1000+ classes, 1.2M images.
0
0.125
0.25
0.375
0.5
11 12 13 14 11 12 13 14
classification
error
classification
+localization error

Image Scene Classiﬁcation
• MIT Places 401
dataset.
• top-5 accuracy
rates >80%.

2005 winner, Stanley (Stanford),
3mph through desert

2007 winner, Boss (CMU),
13mpg through the city

Self Driving Cadillac, US
congressman to airport, 2013

How did we come this far?
Race car drivers know the trick

Focus on Free Space /
Drivable Area, not Obstacles!

Up-and-coming
Computer Vision
Applications

Object Recognition
Blue River Technology

Retail Insights
Source:
Prism Skylabs

Other Applications in
Business Intelligence
• Measure brand exposure.
• Measure sponsorship effectiveness.
• Loss prevention and retail layout optimization.

Exciting applications
many of you might be
attempting to SOLVE!!!

Problem Solving Workflow
Classical Workflow:
1. Data collection
2. Feature Extraction
3. Dimension Reduction
4. Classifier (re)Design
5. Classifier Verification
6. Deploy
Modern Brute-force workflow
1. Data collection
2. Throw everything into a Deep Neural Network
3. Mommy, why doesn’t it work ???

Classical Problem #1:
Curse of Dimensionality
ze
sit
앉다
sentarse
• Number of Variables vs Number of Samples
Q. Who would make such naive mistakes?
A. Many “newbies” repeatedly do so.

Example 1-1:
illegal parking detection
legal parking samples x100 illegal parking samples x100
Let’s train a 150-layer Res-Net!!!
What could possibly go wrong?

Example 1-1:
illegal parking detection
• Data: try cleaner data
• Feature: fine-tune with pre-trained model; don’t
train from scratch
• Classifier overfitting: beware of statistical
coincidences,

Example 1-2: Smart Photo
Album with Google Cloud Vision

Example 1-2: Smart Photo
Album with Google Cloud Vision
No effective distance measure for thousands,
if not millions of dimensions (tags); would be
approximately zero most of the time.

Classical Problem #2:
Overfitting Data
• Make sure your deep learning algorithm is
learning better features for data, not overfitting
the data with complex classifiers.

Deep Learning Cookbook
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
New activation function
Adaptive Learning Rate
Early Stopping
Regularization
Dropout
(credit: Prof. H.Y. Lee, NTU)

Example: AOI breakthroughs with
Deep Learning—Metal Inspection
D Weimer et al. 2017

Deep Learning—Textile Inspection
X
Funding Li et al. / IEEE Tran Automation Science and Engineer 2017 (to appear)

Deep Learning—Laser Welding
Johannes Günther et al. / Procedia Technology 15 (2014) 474 – 483

Example: AOI breakthroughs with Deep
Learning—Serial Number Processing
S. N. Lim et al. / GE Global Research

Example: AOI breakthroughs with Deep
Learning—Serial Number Processing

Deep Learning—Corrosion Detection

Example: Dermatologist-level Skin Cancer
Diagnosis with DNN+Smartphones
• 5.4M cancer cases, 58M pre-cancer cases
diagnosed every year in the US.
(Andre Esteva, Sebastian Thrun, 2017)

Example: Dermatologist-level Skin Cancer
Diagnosis with DNN+Smartphones

Example: Hippocampus
Segmentation in 7T MR Images
(Dinggang Shen, 2017)

(Dinggang Shen, 2017)
Example: Hippocampus
Segmentation in 7T MR Images

Example: Histopathological
Image Classiﬁcation w. DNN
Microscopic view of Breast malignant tumor
40x 100x
200x 400x
(FA Spanhol,
IJCNN 2016)

Example: Histopathological
Image Classiﬁcation w. DNN

Example: DNN for Plant
Disease Detection
(S Mohanty, 2016)

Example: DNN for Plant
Disease Detection

Thank You!
albert@viscovery.com

Appendix 1: Startups
• A company, partnership, or temporary
organization designed to search for a new,
repeatable and scalable business model.

Your Idea
• Are you passionate about it?
• Is it disruptive enough?
• What is your business plan?
• What is it?
• Can it make money?
• What is the future of the idea?
• What is your competitive advantage?
• How do you build up your entry barrier?

A minimal startup team
• A hacker
• A hustler
• A hipster

Prototype
• Hack out a prototype
• Spend 2-10 weeks max.
• Investors are much more likely to fund you if
you have a minimal initial version of your idea.
• Hackathons are a good place to start.
• Iteratively improve the prototype

Buildup your entry barrier!
• Market (users)
• Speed
• Team
• Technology

Building entry barrier with Technology!!

Appendix 2: My humble attempts
at putting the latest Computer
Vision algorithms to work

Intrinsic Imaging at Tandent
Vision Science
Computer Vision would be half-solved without shadows!
LightOriginal Image Surface

Tandent Lightbrush
Video Tutorial for Tandent Lightbrush: https://vimeo.com/47009123

Issues
• Highly anticipated, highly acclaimed, but small
crowd at $500 a license.
• Adobe Photoshop monopoly and the “not
invented here” syndrome.
• Adobe’s arch-rival, Corel (Corel Draw, Paint
Shop Pro, Ulead PhotoImpact) was DYING and
asked too much from the botched deal.

Have fun scribbling out your
shadows in photoshop!
Poor Bob from Adobe wasted 9 minutes removing just 1 shadow

Intrinsic Imaging for improving the
RGB signal in autonomous driving

Intrinsic Imaging’s other
applications

Retrospect
• 20 researchers burned 25 million in 8 years;
investors got 50 patents in return, period.
• Overestimated the total addressable market
size, in a market with existing monopoly.
• Many missed opportunities. Counterexample of
the lean startup model.

Satellite/Aerial Imagery Analysis
• 40cm resolution at 30fps for 90 sec for any location on earth.
• One LEO satellite revisits any place on Earth every 3 days.
• Need 24 satellites to revisit any place on Earth every 3 hours.

Challenges for Single satellite depth
estimation and 3D reconstruction
• At 30fps, a LEO satellite
travels 250m between two
consecutive frames —>
theoretically sufﬁcient for
cm-level depth estimation.
• Sources of Noise:
• Camera distortions
• Atmospheric Disturbance
• Ground vegetation
• Sub-pixel sampling noise
1
2

What happened?
• B2B customers takes too long to strike deals.
• Google ate us alive in just 3 months, while we
were still pitching for VC-funding with our
prototype.

Retrospect
• Growth pains expanding from intelligence
community clients to advertisement clients.
• Forming the right team of engineers and
researchers and moving at the right pace.
• For any Computer Vision/Machine Learning
company:
• Researchers that cannot program—> OUT
• Engineers that don’t know math —> OUT

Once in a lifetime opportunity in
China’s video streaming market

What do we need?
Face Motion
Image
scene Text Audio Object
Semantics

Viscovery VDS (Video Discovery Service)

Challenges Encountered
Along the Way
• From Product Recognition in Images, to Face,
Logo, Object, Scene recognition in Videos.
• Number of Categories
• Recognition Accuracy
• Recognition Speed
• System Architecture
• Business Model

Viscovery’s Edge
• Market: ﬁrst mover’s advantage in China’s video
streaming market.
• Speed: we built the whole VDS thing in a few months!
• Team: You! Seriously!
• Technology:
• Depth
• Breadth
• Cloud
• Customizability
• Self-Learning

Life is not all rosy at startups
• High Risk, High Pressure, High Uncertainty!
• Resources are scarce, but you MUST DELIVER!
• Forming your all-star team is not that easy…
• Focus, and persistence.

Appendix 3: What can Taiwan’s
academia do to help bridge the gap?
HMM….

Academia
IndustryGeneral Public
reputation and
policy support
improved
living standards
students
opportunity
well-trained
graduates
grants and
collaborations
A healthy cycle

Academia
IndustryGeneral Public
unsupportive
policies
stagnant wages
useless
education
unemployable
graduates
A vicious cycle
no grants
no students

Where should we start?
Maybe with a few more stories.

The Goldilocks zone of innovation

The Goldilocks zone of innovation
Business
Relevance
Academic
Relevance
plentiful resources; hierarchical organization
lack of resources; responsive organization
traditional corporations
talking “innovation”
corporate research
startups struggling to survive
academic spinoffs
MSR

Practical Guide to Computer Vision

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Practical Guide to Computer Vision

Similar to Practical Guide to Computer Vision (20)

More from Albert Y. C. Chen

More from Albert Y. C. Chen (14)

Recently uploaded

Recently uploaded (20)

Practical Guide to Computer Vision