Real time machine learning proposers day v3

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited"
Real Time Machine Learning (RTML)
Andreas Olofsson
Program Manager
DARPA/MTO
Proposers Day
April 2, 2019

Background

Objective:
Exploit the physics of emerging devices, analog CMOS, and
non-Boolean computational models to achieve new levels of
performance and power for real-time sensor imaging systems.
Approach:
TA1: Image Application for Benchmarking: Recreate a
traditional image processing pipeline (IPP) using UPSIDE
Compute models showing no degradation in performance.
TA2: MS CMOS Demonstration: Mixed signal CMOS
implementation of the computational model and system test
bed showing 1x10^5x combined speed-power improvement for
analog CMOS.
TA3: Emerging Device Implementation: Image processing
demonstration combining next-generation devices with new
computation model. 1x10^7x (projected)
Goal: Demonstrate the capability and pathway toward embedded computing efficiency in ISR
applications w/ >1,000x processing speed and >10,000x improvement in power consumption
[Figure: image pixels are reduced to detected salient pixels and an extracted library (e.g., edges, 3x3 pixels), then mapped into emerging devices (oscillators, memristors, graphene) and analog CMOS (analog vector matrix multiply; analog, floating gate pattern match).]
Benchmarked using object classification and tracking applications
Low-precision probabilistic computing algorithms
DARPA UPSIDE program (2012-2018)
Unconventional Processing of Signals for Intelligent Data Exploitation

Selected UPSIDE results
[Figure: panels labeled "2 Layer MLP Neural Network" (input neurons I1-I10, output neurons O1-O10, incoming image) and "Floating-Gate Array" (functional array surrounded by dummy arrays; dimensions of 16 μm and 18 μm marked).]
University of Michigan:
• Mixed signal processing (50 TOPS/W)
• Sparse image reconstruction in memristors
• Numerous publications (Nature, …)
UCSB:
• First memristor based multilayer perceptron
• Flash based 55nm analog computing (>10 TOPS/W)
• Numerous publications (Nature, …)
Key Takeaways
Analog computing beats digital on VMMs
Challenges:
• Comparing results (lack of data) → RTML
• Transition valley of death → RTML
• High cost of design → RTML
• Manufacturing latency too long → RTML
• Manufacturability and scalability → RTML

Building a proper baseline for path-finding AI HW research
ISSCC2019
• Extreme expense of HW development
means extremely sparse data
• How can we know if a new result is
good without a baseline?
• A compiler would let us “paint the
space” of possibilities
• Objective: Better science

Generating “right-sized” HW for SWaP constrained systems
A. Canziani, et al, “Analysis of deep neural networks”
• 10-100X network tradeoffs
• Additional micro-tradeoffs (bit-
width, pruning, etc.)
• Having more accuracy than needed
wastes energy, latency, and power
• A compiler would enable generation
of right-sized HW
• Objective: Enable new applications
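
One way to read "right-sized" is as a constrained selection: among candidate networks, pick the cheapest one that still meets the accuracy the application actually needs. The following is a minimal sketch under that reading. The `Candidate` type, the `right_size` function, and all numbers are made-up placeholders for illustration; they are not measurements from the cited paper or from RTML.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical network label
    accuracy: float    # accuracy on the target dataset, 0..1
    energy_mj: float   # energy per inference in millijoules


def right_size(candidates: Sequence[Candidate],
               required_accuracy: float) -> Optional[Candidate]:
    """Return the lowest-energy candidate that meets the accuracy requirement."""
    feasible = [c for c in candidates if c.accuracy >= required_accuracy]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.energy_mj)


# Placeholder values for illustration only (not real benchmark data).
CANDIDATES = [
    Candidate("small", 0.60, 1.0),
    Candidate("medium", 0.70, 5.0),
    Candidate("large", 0.78, 40.0),
]

if __name__ == "__main__":
    choice = right_size(CANDIDATES, required_accuracy=0.65)
    print(choice.name if choice else "no candidate meets the requirement")
```

Energy is used as the single cost here for brevity; latency or area could be substituted or combined in the same way.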

Optimizing hardware for ultra-low latency
• Current HW optimized for throughput
and programmability
• Extreme expense of HW development
means latency of ASICs is unexplored.
• Green-field: How low can we go?
• Objective: Enable new applications
Source: NVIDIA

Building bridges
[Figure: RTML (new) as a bridge between application experts, ML experts, and platform experts; framework logos shown: TensorFlow, PyTorch.]
Objective: Faster innovation
Source: NVIDIA, Getty Images, Wikipedia

Example of a low-latency application
Source: Qualcomm, 2017

DARPA RTML Program

DARPA RTML program details
The DARPA RTML program seeks to create no-human-in-the-loop
hardware generators and compilers to enable fully automated
creation of ML Application-Specific Integrated Circuits
(ASICs) from high level source code.

Phase 1: machine learning hardware compiler
• Develop a hardware generator that converts programs expressed in common ML frameworks (such as
TensorFlow, PyTorch) into standard Verilog code and hardware configurations
• Generate synthesizable Verilog that can be fed into layout generation tools, such as from DARPA IDEA
• Demonstrate a compiler that auto-generates a large catalog of scalable ML hardware instances
• Demonstrate generation of instances for a diversity of architectures
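
To make the "ML description in, Verilog out" idea concrete, here is a toy sketch that emits a combinational Verilog module for a single dot product with constant integer weights. It is only an illustration of the concept, not the RTML generator: the function name `emit_dot_product`, its parameters, and the module layout are assumptions made for this example. A real flow would start from a TensorFlow or PyTorch graph rather than a hand-written weight list, and the emitted Verilog has not been checked against any synthesis tool.

```python
from typing import Sequence


def emit_dot_product(name: str, weights: Sequence[int], width: int = 8) -> str:
    """Emit Verilog text computing y = sum(weights[i] * x[i]).

    Inputs are signed `width`-bit values packed into one bus `x`;
    weights are integer constants baked into the generated hardware.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("need at least one weight")
    max_w = max(abs(w) for w in weights)
    # Output width: input bits + weight magnitude bits + growth from adding n terms.
    acc_width = width + max_w.bit_length() + (n - 1).bit_length()
    terms = [
        f"({w}) * $signed(x[{(i + 1) * width - 1}:{i * width}])"
        for i, w in enumerate(weights)
    ]
    body = "\n        + ".join(terms)
    return (
        f"module {name} (\n"
        f"    input  wire [{n * width - 1}:0] x,\n"
        f"    output wire signed [{acc_width - 1}:0] y\n"
        f");\n"
        f"    assign y = {body};\n"
        f"endmodule\n"
    )


if __name__ == "__main__":
    # Hypothetical three-input neuron with weights 1, -2, 3.
    print(emit_dot_product("fc0", [1, -2, 3], width=8))
```

Because the weights and bit widths are fixed at generation time, each call produces a different hardware instance, which is the sense in which a generator can produce a catalog of designs from one source description.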

RTML general purpose generator
The RTML generator should support a diversity of ML architectures. Architectures of interest include:
a) conventional feed forward (convolutional) neural networks,
b) recurrent networks and their specialized versions,
c) neuroscience-inspired architectures, such as spike time-dependent neural nets including their stochastic counterparts,
d) non-neural ML architectures inspired by psychophysics as well as statistical techniques,
e) classical supervised learning (e.g., regression and decision trees),
f) unsupervised learning (e.g., clustering) approaches,
g) semi-supervised learning methods,
h) generative adversarial learning techniques, and
i) other approaches such as transfer learning, reinforcement learning, manifold learning, and/or life-long learning

Phase 1 RTML generator metrics
Metric | Value
Type | Training and Inference
Peak Performance | Scalable, configurable at generation, with support up to full reticle size at 14nm
Inference Energy Efficiency [1] | >10 TOPS/W
Min Number of Architectures [2] | 10
Hardware Generation Automation | 100% (ML to Verilog)
I/O Interface | Highly efficient chip-to-chip interface (such as from the DARPA CHIPS program)
Design Input (source code) | High level network description. Support for TensorFlow, PyTorch, Caffe2, CNTK, MXNet, ONNX
Generator (Compiler Front-end) Output | Verilog
Deliverable | Software, license [3], generator source code, flow scripts, documentation, GDSII for generated designs
[1] Program is interested in real work accomplished per Watt, not arbitrary peak mathematical ops/W. As general guidance we are specifying 10 TOPS/W at 14nm as a minimum threshold, with the understanding that efficiency numbers are tightly coupled to accuracy, data sets, and actual applications. Efficiency metric includes all SoC power, including IO power needed to sustain peak throughput. Based upon normalized MAC for the proposed application.
[2] To demonstrate a general purpose ML compiler, teams are expected to complete GDSII implementation of multiple ML architectures.
[3] Delivered with a minimum of government purpose rights; open source licenses are preferred.
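
The efficiency footnote can be turned into a small calculation: count the useful operations the application performs per second and divide by total SoC power, IO included. The sketch below assumes one MAC counts as two operations (multiply plus add), which is a common convention but is not stated on the slide; the function name and parameters are likewise illustrative.

```python
def tops_per_watt(macs_per_inference: float,
                  inferences_per_second: float,
                  core_power_w: float,
                  io_power_w: float,
                  ops_per_mac: int = 2) -> float:
    """Application-level efficiency in tera-operations per second per watt.

    Power is the sum of core and IO power, following the footnote's
    requirement that all SoC power be included.
    ops_per_mac=2 is an assumption, not a value taken from the slide.
    """
    total_power_w = core_power_w + io_power_w
    if total_power_w <= 0:
        raise ValueError("total power must be positive")
    ops_per_second = macs_per_inference * ops_per_mac * inferences_per_second
    return ops_per_second / total_power_w / 1e12


if __name__ == "__main__":
    # Hypothetical design point: 1e9 MACs per inference at 1000 inferences/s,
    # drawing 0.15 W in the core and 0.05 W in IO.
    print(tops_per_watt(1e9, 1000, core_power_w=0.15, io_power_w=0.05))
```

Leaving IO power out of the denominator would overstate the result, which is the kind of peak ops/W figure the footnote warns against.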

An introduction to the IDEA silicon compiler (RTL/schematic to GDSII)
[Figure: IDEA unified layout generator for package, board, and chip, driven by data, models, and training; 24 hours, no human in the loop.]
Timeline:
• 2018: Program Kickoff (Jun)
• 2019: First Integration Exercise (Jan)
• 2019: Alpha code drop (Jun)
• 2020: A usable Silicon Compiler, 50% PPA
• 2022: A great Silicon Compiler, 100% PPA

An introduction to the CHIPS interface
• AIB (Advanced Interface Bus) is a PHY-level interface standard
for high bandwidth, low power die-to-die communication
• AIB is a clock-forwarded parallel data transfer like DDR DRAM
• High density with 2.5D interposer (e.g., CoWoS, EMIB) for
multi-chip packaging
• AIB is PHY level (OSI Layer 1)
• Can build protocols like AXI-4 on top of AIB
• AIB Performance:
• 1Tbps/mm shoreline
• ~0.1pJ/bit
• <5ns latency
• Open Source!
• Standard and reference implementation
• https://github.com/intel/aib-phy-hardware
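
The AIB performance figures above lend themselves to a quick budget estimate: given a required die-to-die bandwidth, how much die edge (shoreline) and interface power would be needed. The sketch below uses only the two numbers quoted on the slide and treats them as exact, which is a simplification; the function name is hypothetical.

```python
AIB_TBPS_PER_MM = 1.0   # shoreline bandwidth density quoted on the slide
AIB_PJ_PER_BIT = 0.1    # approximate energy per bit quoted on the slide


def aib_budget(bandwidth_gbps: float) -> tuple:
    """Return (shoreline_mm, power_w) for a given die-to-die bandwidth in Gbps."""
    if bandwidth_gbps < 0:
        raise ValueError("bandwidth must be non-negative")
    shoreline_mm = bandwidth_gbps / (AIB_TBPS_PER_MM * 1e3)
    power_w = bandwidth_gbps * 1e9 * AIB_PJ_PER_BIT * 1e-12
    return shoreline_mm, power_w


if __name__ == "__main__":
    # Example: a 400 Gbps link.
    shoreline, power = aib_budget(400.0)
    print(f"{shoreline:.2f} mm of shoreline, {power * 1e3:.0f} mW")
```

At these figures the interface itself is a small part of a multi-watt budget, though it can matter at the low end of a microwatt-to-watt power range.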
AIB Adopters:
-Boeing
-Intrinsix
-Synopsys
-Intel
-Lockheed Martin
-Sandia
-Jariet
-NCSU
-U. of Michigan
-Ayar Labs
[Figure: CHIPS platform. A 14nm Stratix 10 FPGA die with AIB interfaces connects to chiplets ("Your Chiplet", "Our Chiplet", "Your Chip Here") such as ADC/DAC, machine learning, memory, processors, adjacent IP, etc.; an Ethernet tile (56G PAM/28G NRZ) is also shown. Options labeled Opt1, Opt2, Opt4, Opt5.]

Phase 2: real time machine learning systems
• Design space exploration through circuit implementation of multiple ML architectures
• General purpose, tunable generator that can support optimization of ML hardware for specific requirements
• Hardware demonstration of RTML for a particular application area
• Application areas:
• Future high bandwidth wireless communication systems, like the 60 GHz range of the 5G standard
• High bandwidth image processing in SWaP constrained systems
• DARPA will provide fabrication support through a number of separately funded multi-project or dedicated wafer runs

Phase 2 RTML metrics
Phase 2 Hardware Guidelines | Min [1] | Max [1]
Data Throughput | 400 Kbps | 400 Gbps
Latency | 100 µs | 100 s
Total Power [2] | 200 µW | 200 W
Application-Specific Accuracy [3] | 0.6 | 0.99
Dataset | Proposer defined [4]
I/O Interface | Highly efficient chip to chip interface (such as CHIPS)
Design Input (source code) | High level network description. Support for TensorFlow, PyTorch, Caffe2, CNTK, MXNet, and ONNX
Design Output | GDSII ready for manufacturing
Hardware Generation Automation | 100%
Deliverables | Software, license [5], design source code, flow scripts, documentation, GDSII, chiplet hardware
[1] Teams are expected to explore a wide trade space of power, latency, accuracy, and data throughput and show the ability to tune hardware over a large range of performance metrics. Max values are not expected to be achieved simultaneously.
[2] Power must include everything needed to operate, including power delivery, thermal management, external memory, and sensor interfaces.
[3] For example, ResNet152 has an accuracy of > 0.96 on the ImageNet database: http://image-net.org/challenges/LSVRC/2015/results
[4] Proposals are expected to outline a clear plan for validating the quality of the compiler output, including details of the publicly available benchmarks and datasets from industry, government, and academia that will be used.
[5] Delivered with a minimum of government purpose rights; open source licenses are preferred.
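
When exploring the trade space described in footnote 1, it can help to check each generated design point against the guideline ranges in the table. The sketch below encodes the Min and Max columns and reports which metrics of a design fall outside them. Treating the columns as a simple inclusive range is an assumption about how the table is meant to be read; the dictionary keys and function name are invented for this example.

```python
# Min/max guideline values transcribed from the Phase 2 table (SI units).
GUIDELINES = {
    "throughput_bps": (400e3, 400e9),   # 400 Kbps to 400 Gbps
    "latency_s": (100e-6, 100.0),       # 100 µs to 100 s
    "power_w": (200e-6, 200.0),         # 200 µW to 200 W
    "accuracy": (0.6, 0.99),
}


def out_of_range(design: dict) -> list:
    """Return the names of metrics that fall outside the guideline range."""
    return [
        key for key, (low, high) in GUIDELINES.items()
        if not low <= design[key] <= high
    ]


if __name__ == "__main__":
    # Hypothetical design point, for illustration only.
    design = {
        "throughput_bps": 10e9,
        "latency_s": 1e-3,
        "power_w": 2.0,
        "accuracy": 0.9,
    }
    print(out_of_range(design) or "within guideline ranges")
```

A sweep over generator settings could apply this check to every candidate and keep the in-range points for a closer look at their tradeoffs.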

RTML schedule
• 0 months (Fall 2019): Kickoff workshop
• 9 months (Mid 2020): Alpha release of RTML generator at joint NSF/DARPA workshop
• 18 months (Spring 2021): Release of V1.0 RTML generator and demonstration with an RTML compiler flow
• 27 months (End 2021): Release of V1.5 tunable hardware generator
• 36 months (Fall 2022): Hardware demonstration of a real time machine for a specific application

RTML seeks answers for the following research questions
• Can we build an application specific silicon compiler for RTML?
• What subset of current ML framework syntax/methods can be supported with a compiler?
• What needs to be added to current ML frameworks to support efficient translation?
• What hardware architectures are best suited for real time operation?
• What are the lower latency limits for various RTML tasks?
• What is the lowest SWaP feasible for various RTML tasks?
• What are the tradeoffs between energy efficiency, throughput, latency, area, accuracy?

RTML does NOT seek proposals for these areas
• Investigatory research that does not result in deliverable hardware designs
• Circuits that cannot be produced in standard CMOS foundries (like 14nm)
• New Domain Specific Languages
• New approaches to physical layout (RTL to GDSII)
• Incremental efforts

Joint NSF collaboration
• NSF: Single phase, exploratory research into circuit architectures and algorithms
• DARPA:
• Phase 1: Fully automated hardware generators “compilers” for state of the art machine learning algorithms
and networks, using existing programming frameworks (TensorFlow, etc.) as inputs
• Phase 2: Deliver novel machine learning architectures and circuit generators that enable real time machine
learning for autonomous machines
• Joint solicitation release and workshops at 9 and 18 mos into each phase
• DARPA teams pull in NSF work during Phase 1 to Phase 2 transition
[Timeline figure: DARPA Phase 1 (18 mos) followed by DARPA Phase 2 (18 mos), running alongside NSF Phase 1 (36 mos). Milestones: Alpha Release; V1.0 Release and GDSII Delivery; V1.5 Release & Tapeout; Silicon Demo.]
NSF and DARPA team to explore rapid development of energy efficient hardware and real-time machine learning architectures

Collaboration and licensing
Required:
• Collaboration with other program performers
• Active participation in joint DARPA-NSF workshops every 9 months
• Open interfaces
Strongly encouraged:
• Publishing code and results early and often
• Permissive (non-viral, non-proprietary) open source licensing

Funding of DARPA RTML Phase 2
• RTML includes a base Phase 1 and option Phase 2
• The proposed planning and costing by Phase (and by Task) provides DARPA with convenient times to
evaluate funding options and technical progress
• Progression into Phase 2 is not guaranteed; factors that may affect Phase 2 funding decisions include:
• Availability of funding
• Cost of proposals selected for funding
• Demonstrated performance relative to program goals
• Interaction with government evaluation teams
• Compatibility with potential national security needs

Important dates
• BAA Posting Date: March 15, 2019
• Proposers Day: April 2, 2019
• FAQ Submission Deadline: April 15, 2019 at 1:00 PM
o DARPA will post a consolidated Question and Answer (FAQ) document on a regular basis. To
access the posting go to: http://www.darpa.mil/work-with-us/opportunities.
• Proposal Due Date: May 1, 2019 at 1:00 PM
• Estimated period of performance start: October 2019
• Questions: HR001119S0037@darpa.mil

Evaluation criteria, in order of importance
1. Overall Scientific and Technical Merit
o Demonstrate that the proposed technical approach is innovative, feasible, achievable, and complete
o A clear and feasible plan for release of high quality software is provided
o Task descriptions and associated technical elements provided are complete and in a logical sequence, with all proposed research clearly defined such that a final outcome that achieves the goal can be expected
2. Potential Contribution and Relevance to the DARPA Mission
o Note the updated wording, with an emphasis on contribution to U.S. national security and U.S. technological capabilities
3. Impact on Machine Learning Landscape
o The proposed research will successfully complete a fundamental exploration of the tradeoffs between system efficiency and performance for a number of ML architectures
o The proposed research will significantly advance the state of the art in machine learning hardware
4. Cost Realism
o Ensure proposed costs are realistic for the technical and management approach and accurately reflect the goals and objectives of the solicitation
o Verify that proposed costs are sufficiently detailed, complete, and consistent with the Statement of Work

Agenda
RTML Proposers Day
DARPA - 675 N Randolph Street, Arlington, VA 22203
Tuesday, April 2, 2019
Start | End | Time (min) | Session | Speaker
8:00 AM | 9:00 AM | 60 | Registration and Poster Setup |
9:00 AM | 9:15 AM | 15 | Welcome - Security, Logistics | Ron Baxter
9:15 AM | 9:55 AM | 40 | DARPA RTML Program Overview | Andreas Olofsson
9:55 AM | 10:15 AM | 20 | NSF RTML Collaboration Overview | Sankar Basu
10:15 AM | 10:45 AM | 30 | Contracting Overview | Michael Blackstone
10:45 AM | 11:00 AM | 15 | Break |
11:00 AM | 11:45 AM | 45 | Question and Answer Session | Andreas Olofsson
11:45 AM | 1:00 PM | 75 | Lunch (On Your Own) |
1:00 PM | 3:00 PM | 120 | Poster and Networking Session | All
3:00 PM | 3:00 PM | 0 | Conclude |

www.darpa.mil
