Better
Predictions!

H2O – The Open Source Math Engine !
H2O –
Open Source
in-memory
Machine Learning
for Big Data
4/23/13
Universe is sparse. Life is messy.
Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
Group By
Grep
Messy NAs
Classification
Regression
Clustering
Ensembles
100’s models
nanos
Big Data
Adhoc Exploration
Math Modeling
Real-time Scoring
H2O the Prediction Engine
No New API!
Big Data
Exploration
Modeling
Scoring
Real-time
H2O the Prediction Engine
Approximate results each step!
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
H2O the Prediction Engine
All Top 10's are binary!
- Anonymous
10 Move Code not Data
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Java Assist
A Chunk, Unit of Parallel Access
A Frame: Vec[]
age, sex, zip, ID, car
JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap
Vecs aligned in heaps
•  Optimized for concurrent access
•  Random access any row, any JVM
9 Chunk-ing Express!
A season for variable-sized chunks
and a season for uniform chunks.
Tightly-packed!
(chunk is also unit of batch!)
8 Reduce early. Reduce Often!
No Expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
8 Reduce early. Reduce Often!
Vec, Vec, Vec, Vec, Vec
All CPUs grab Chunks in parallel
Map/Reduce & F/J handle all sync
JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap
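The "reduce early, fork/join" idea above can be sketched roughly as follows. This is a minimal single-JVM illustration, not H2O's actual code: the class name `ChunkSum` and the `double[][]` chunk layout are assumptions made for the example. Each chunk is reduced to one number where it sits, and only those small partial results are combined.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sum a column stored as an array of chunks.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks; // each double[] stands in for one chunk
    private final int lo, hi;        // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // "Map" step: reduce one chunk locally, no intermediate state kept.
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                     // run the left half asynchronously
        double r = right.compute();      // reuse this thread for the right half
        return left.join() + r;          // "Reduce" step: combine two scalars
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

Calling `ChunkSum.sum(chunks)` never materializes more than one partial sum per task, which is the point of reducing early.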
7 Slow is not different from Dead
Debugging slow
>> Heartbeats, Messages
Two Generals’ Paradox
6 Memory Manager
An in-memory system is only as good as
your memory manager!
lazy eviction.
compress.
align.
Corollary: Track down Leaks!
5 Memory Overheads
Use primitives
// A Distributed Vector
// much more than 2 billion elements
abstract class Vec {
  abstract long length();                 // more than an int's worth
  // fast random access
  abstract double at(long idx);           // Get the idx'th elem
  abstract boolean isNA(long idx);
  abstract void set(long idx, double d);  // writable
  abstract void append(double d);         // variable sized
}
4 Cache-Oblivious
Tree size
Bin size

Recursively divide
Till Data → Cache
3 EC2 – Nothing is bounded
User-mode reliability
S3 Readers will TCP Reset
Mux your connections
Not all toolkits are equal.
>> JetS3
2 No Locks, No Cry
Non-Blocking Data Structures.

// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
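The slide's snippet swaps a whole array reference with a compare-and-swap through `Unsafe`. The same retry pattern can be sketched with the standard `AtomicReference`, as below. This is an illustrative assumption, not the slide's data structure: `CasAppend` and its methods are hypothetical names, and copying the array on every append is chosen for clarity, not speed.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a lock-free append-only list built on copy + CAS.
final class CasAppend {
    private final AtomicReference<long[]> ref = new AtomicReference<>(new long[0]);

    void append(long v) {
        while (true) {
            long[] old = ref.get();                        // volatile read
            long[] next = Arrays.copyOf(old, old.length + 1);
            next[old.length] = v;
            if (ref.compareAndSet(old, next)) return;      // CAS; retry if we lost the race
        }
    }

    long[] snapshot() {
        return ref.get();
    }

    // Runs several writer threads and returns the final element count.
    static int stress(int threads, int perThread) {
        CasAppend list = new CasAppend();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) list.append(i);
            });
            ts[t].start();
        }
        for (Thread th : ts) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return list.snapshot().length;
    }
}
```

No thread ever blocks: a writer that loses the CAS simply rereads the current array and tries again, so no append is lost.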
1 endian wars ended!
Keep-It-Simple-Serialization.
byte[ ]. roll-your-own. fast.

public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
  while( sofar < length ) {
    int len = Math.min(length - sofar, _bb.remaining());
    _bb.put(ary, sofar, len);
    sofar += len;
    if( sofar < length ) sendPartial();
  }
  return this;
}
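A roll-your-own byte[] format in this spirit might look like the sketch below. It is not the AutoBuffer wire format: the class `SimpleSerde`, the length-prefix layout, and the choice of little-endian order are all assumptions for illustration. The only point is that both sides agree on one fixed byte order and a flat layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: flat serialization of a double[] with a fixed byte order.
final class SimpleSerde {
    // Assumed convention for this example; any fixed order works if both sides agree.
    private static final ByteOrder ORDER = ByteOrder.LITTLE_ENDIAN;

    static byte[] write(double[] xs) {
        ByteBuffer bb = ByteBuffer.allocate(4 + 8 * xs.length).order(ORDER);
        bb.putInt(xs.length);                 // 4-byte length prefix
        for (double x : xs) bb.putDouble(x);  // 8 bytes per element, no boxing
        return bb.array();
    }

    static double[] read(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ORDER);
        double[] xs = new double[bb.getInt()];
        for (int i = 0; i < xs.length; i++) xs[i] = bb.getDouble();
        return xs;
    }
}
```

A round trip is `SimpleSerde.read(SimpleSerde.write(xs))`, and the encoded size is always 4 + 8 × length bytes.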
Data Movement is a Defect.
Slowing down helps communication.
Got Speed?	
  
0 Math always produces a number
Accuracy rules over speed.
Predictive Performance
1 Shuffle
Data presentation bias.
Sorted data => interesting results
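One way to remove presentation bias is to visit rows in a shuffled order. The sketch below is a plain Fisher-Yates shuffle over row indices. The class name `RowShuffle` and the fixed seed parameter are assumptions for the example; the seed keeps runs reproducible.

```java
import java.util.Random;

// Hypothetical sketch: a reproducible random permutation of row indices.
final class RowShuffle {
    static int[] shuffledIndices(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Random rnd = new Random(seed);
        // Fisher-Yates: swap each position with a random earlier-or-equal one.
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = idx[i];
            idx[i] = idx[j];
            idx[j] = t;
        }
        return idx;
    }
}
```

Training code can then read row `idx[k]` instead of row `k`, so a file sorted by date or label no longer dictates the order the model sees.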
2 Random acts of Kindness?
3 Convex Problems: ADMM
4 Amdahl strikes:
Cholesky / QR Decomposition
Matrix operations
jama, jblas... all single node.
Distributed version
needs data transfer!
5 Random Forests
embarrassingly parallel
binning
tree-building
splits
6 Boosting
iterate & stage
weak-learners =>
strong learners
each tree can be parallel
minimize communication
7 Neural Nets & Clustering
embarrassingly parallel
pre-calculate base stats
distance calculation
weight matrices – small footprint
8 Ensembles
Daisy chain a bunch of models
Interleave.
JIT – Minimize loops over data.
9 Tools
Deterministic versions first!
Got Pen & Paper?
Optimize often.
Test Big Data soon.
Replace NAs to improve
predictive performance by about 10%!
- Newton
Munging Missing Features

impute NAs with mean

impute NAs with knn

impute with recursive pca!
- Boyd
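The simplest of the three options, mean imputation, can be sketched as follows. This assumes NAs are encoded as `Double.NaN`, which is an assumption of the example and not something the slides state. The class name `MeanImpute` is also hypothetical.

```java
// Hypothetical sketch: replace NaN-encoded NAs with the column mean.
final class MeanImpute {
    static double[] imputeMean(double[] col) {
        double sum = 0;
        int n = 0;
        for (double d : col) {
            if (!Double.isNaN(d)) { // only observed values feed the mean
                sum += d;
                n++;
            }
        }
        double[] out = col.clone();
        if (n == 0) return out;     // nothing observed: leave the column as-is
        double mean = sum / n;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }
}
```

The input array is left untouched, so the raw column stays available for the k-NN or PCA variants.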
Unbalanced data

single rare classes

Fraud / No-Fraud!
Stratify
Unbalanced data

multiple rare classes

Browse, Click, Purchase!
Stratify
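Stratifying means sampling within each class separately, so rare classes are not lost. The sketch below keeps roughly the same fraction of every class and at least one row per class. The class `Stratify`, the integer label encoding, and the "at least one" rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch: per-class sampling of row indices.
final class Stratify {
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);
            // Keep the requested fraction of this class, but never drop it entirely.
            int keep = Math.max(1, (int) Math.round(rows.size() * fraction));
            out.addAll(rows.subList(0, keep));
        }
        return out;
    }
}
```

With 90 no-fraud rows and 10 fraud rows, a 10% sample keeps 9 and 1, so the fraud class keeps its 10% share.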
10 Data is the System

Use Customer Data
Algorithms for Sparse vs. Dense
Unbalanced Data.
Robustness under noise
Before H2O

Velocity: Events
Online Scoring
Volume: HDFS
Rule Engine
Munging
slice n dice
Features
HIVE/SQL
Applications
Exploration
Data Scientist
Modeling
Offline Scoring
Engineer

Business Analyst

Ensemble models
Low latency

Classification
Regression
Clustering
Optimal Model
Predictions
Big Data
Exploration
Modeling
Scoring
Real-time
Big Data beats Better Algorithms!
Big Data
Exploration
Modeling
Scoring
Real-time
Big Data and Better Algorithms!
Scale & Parallelism!
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
H2O the Prediction Engine
Better
Predictions!

H2O – The Open Source Math Engine !
Distributed Coding Taxonomy

•  No Distribution Coding:
   •  Whole Algorithms, Whole Vector-Math!
   •  REST + JSON: e.g. load data, GLM, get results!
•  Simple Data-Parallel Coding:
   •  Per-Row (or neighbor row) Math!
   •  Map/Reduce-style: e.g. Any dense linear algebra!
•  Complex Data-Parallel Coding
   •  K/V Store, Graph Algo's, e.g. PageRank!

0xdata.c45	
  
Distributed Coding Taxonomy

•  No Distribution Coding:
   •  Whole Algorithms, Whole Vector-Math!
   •  REST + JSON: e.g. load data, GLM, get results!
   Read the docs!
•  Simple Data-Parallel Coding:
   •  Per-Row (or neighbor row) Math!
   •  Map/Reduce-style: e.g. Any dense linear algebra!
   This talk!
•  Complex Data-Parallel Coding
   •  K/V Store, Graph Algo's, e.g. PageRank!
   Join our GIT!

Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
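The taxonomy above can be sketched in a few lines of single-JVM Java. This is a toy model under stated assumptions, not H2O's implementation: `TinyVec` is a hypothetical name, chunks are plain `double[]` arrays, NA is assumed to be encoded as `NaN`, and a Frame is just a `TinyVec[]`. It shows how a `long` row index is mapped to a chunk and an offset inside it.

```java
import java.util.Arrays;

// Hypothetical sketch: a Vec as a collection of Chunks of primitive doubles.
final class TinyVec {
    private final double[][] chunks; // each chunk is a primitive array
    private final long[] starts;     // starts[i] = global index of chunk i's first elem

    TinyVec(double[][] chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length + 1];
        for (int i = 0; i < chunks.length; i++) {
            starts[i + 1] = starts[i] + chunks[i].length;
        }
    }

    long length() {
        return starts[chunks.length];
    }

    double at(long idx) {
        if (idx < 0 || idx >= length()) throw new IndexOutOfBoundsException("idx=" + idx);
        int c = Arrays.binarySearch(starts, idx);
        if (c < 0) c = -c - 2;                  // chunk whose start precedes idx
        while (chunks[c].length == 0) c++;      // step over empty chunks
        return chunks[c][(int) (idx - starts[c])];
    }

    boolean isNA(long idx) {
        return Double.isNaN(at(idx));           // assumed NA encoding for this sketch
    }

    // Row i of a Frame: the i'th element of every Vec.
    static double[] row(TinyVec[] frame, long i) {
        double[] r = new double[frame.length];
        for (int v = 0; v < frame.length; v++) r[v] = frame[v].at(i);
        return r;
    }
}
```

Because indices are `long` and each chunk is small, the total length is not limited to what one `int`-indexed array can hold.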

Use cases

Conversion, Retention & Churn!
•  Lead Conversion!
•  Engagement!
•  Product Placement!
•  Recommendations!
Pricing Engine!
Fraud Detection!

qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Srisatish

  • 1. Better Predictions! H2O – The Open Source Math Engine !
  • 2. H2O – Open Source in-memory Machine Learning for Big Data 4/23/13
  • 3. Universe is sparse. Life is messy. 
 Data is sparse & messy.! - Lao Tzu
  • 4. Hadoop = opportunity Not enough Data Scientists Analysts won’t code java
  • 5. Group  By   Grep   Messy   NAs   Classifica-on   Regression   Clustering                           Ensembles 100’s       nanos     models                           H 2O Big Data the Adhoc   Explora-on   Math   Modeling   Real-­‐-me   Scoring   Prediction Engine
  • 6. No New API! Big  Data   Explora-on   Modeling   Scoring   Real-­‐-me     H 2O the Prediction Engine Approximate! results each step!
  • 7. Intellectual   Legacy     Math  needs     to  be  free     Open  Source     Support and Innovation hFps://github.com/0xdata/h2o   H 2O the Prediction Engine
  • 8. All Top 10ʼs are binary! - Anonymous
  • 9. 10      Move Code not Data   Data chunks > code chunks TCP for Data. UDP for Control. >> Generated Java Assist
  • 10. A Chunk, Unit of Parallel Access A Frame: Vec[] age   sex   zip   ID   car   JVM 1 Heap JVM 2 Heap JVM 3 Heap JVM 4 Heap Vecs aligned in heaps l Optimized for concurrent access l Random access any row, any JVM l 
  • 11. 9      Chunk-ing Express!   season for Variable-sized chunks and a season Uniform chunks. Tightly-packed! (chunk is also unit of batch!)
  • 12. 8      Reduce early. Reduce Often!   No Expensive intermediate states. Fine-grain parallelism wins! >> Fork / Join
  • 13. 8      Reduce early. Reduce Often!   Vec   Vec   Vec   Vec   Vec   All CPUs grab Chunks in parallel Map/Reduce & F/J handles all sync JVM 1 Hea p JVM 2 Hea p JVM 3 Hea p JVM 4 Hea p
  • 14. #7 Slow is not different from Dead: Debugging slow >> Heartbeats, Messages. Two Generals' Paradox
  • 15. #6 Memory Manager: in-memory system as good as your memory manager! lazy eviction. compress. align. Corollary: Track down Leaks!
  • 16. #5 Memory Overheads: Use primitives
      // A Distributed Vector
      // much more than 2 billion elements
      class Vec {
        long length();                 // more than an int's worth
        // fast random access
        double at(long idx);           // Get the idx'th elem
        boolean isNA(long idx);
        void set(long idx, double d);  // writable
        void append(double d);         // variable sized
      }
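
As a rough illustration of the "use primitives" advice, the Vec interface on the slide can be backed by plain `double[]` chunks addressed with a `long` index. This is a single-JVM sketch under stated assumptions, not H2O's implementation: the class name `PrimitiveVec`, the uniform chunk size, and the use of NaN to encode NA are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM stand-in for the Vec on the slide; not H2O's real class.
class PrimitiveVec {
    static final int CHUNK = 1 << 16;  // assumed uniform chunk size (65,536 elements)
    private final List<double[]> chunks = new ArrayList<>();
    private long length = 0;           // long, so more than an int's worth of elements

    long length() { return length; }

    // Fast random access: one division, one modulo, two array lookups
    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)];
    }

    // Assumption: NA is encoded as NaN, so no boxed Double or null is needed
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)] = d;
    }

    void append(double d) {
        int off = (int) (length % CHUNK);
        if (off == 0) chunks.add(new double[CHUNK]); // start a new chunk when the last is full
        chunks.get(chunks.size() - 1)[off] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
    }
}
```

Each element costs 8 bytes in a `double[]`, whereas a `List<Double>` would add an object header and a reference per element.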
  • 17. #4 Cache-Oblivious: Tree size, Bin size. Recursively divide till Data → Cache
  • 18. #3 EC2 – Nothing is bounded: User-mode reliability. S3 Readers will TCP Reset. Mux your connections. Not all toolkits are equal. >> JetS3t
  • 19. #2 No Locks, No Cry: Non-Blocking Data Structures.
      // VOLATILE READ before key compare.
      // CAS
      private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
        return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
      }
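
The slide's snippet uses `sun.misc.Unsafe` directly. The same read-then-CAS retry pattern can be sketched with the public `java.util.concurrent.atomic.AtomicReference` API instead; that substitution, and the `LockFreeStack` class itself (a classic Treiber stack), are illustrative choices and not code from H2O.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example of a non-blocking data structure built on CAS.
class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead, newHead;
        do {
            oldHead = head.get();                        // volatile read of current state
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won the race
    }

    // Returns null when the stack is empty
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks: a failed `compareAndSet` only means some other thread made progress, so the loser re-reads and tries again.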
  • 20.
  • 21. #1 endian wars ended! Keep-It-Simple-Serialization. byte[ ]. roll-your-own. fast.
      public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
        while( sofar < length ) {
          int len = Math.min(length - sofar, _bb.remaining());
          _bb.put(ary, sofar, len);
          sofar += len;
          if( sofar < length ) sendPartial();
        }
        return this;
      }
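
To make the `putA1` loop above concrete, here is a minimal self-contained sketch of the same idea: copy as much of the byte array as fits into a fixed-size buffer, flush, and continue. The class `ChunkedWriter`, its `close()` method, the `flushes` counter, and the `ByteArrayOutputStream` that stands in for a network socket are all assumptions for the example; H2O's real `AutoBuffer` is not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the buffer on the slide; the sink replaces a real socket.
class ChunkedWriter {
    private final ByteBuffer bb;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    int flushes = 0; // number of times the buffer was handed to the sink

    ChunkedWriter(int capacity) { bb = ByteBuffer.allocate(capacity); }

    // Same loop shape as the slide: 'length' is the end index, 'sofar' the start
    ChunkedWriter putA1(byte[] ary, int sofar, int length) {
        while (sofar < length) {
            int len = Math.min(length - sofar, bb.remaining());
            bb.put(ary, sofar, len);
            sofar += len;
            if (sofar < length) sendPartial(); // buffer is full, more bytes pending
        }
        return this;
    }

    private void sendPartial() {
        bb.flip();                             // switch the buffer from writing to reading
        sink.write(bb.array(), 0, bb.limit()); // "send" the filled portion
        bb.clear();                            // make the whole buffer writable again
        flushes++;
    }

    // Flush whatever remains and return everything that was sent
    byte[] close() {
        sendPartial();
        return sink.toByteArray();
    }
}
```

Because the payload is raw bytes, there is no per-field encoding step and no byte-order decision to make for this array type.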
  • 22. Data Movement is a Defect. Slowing down helps communication. Got Speed?
  • 23. #0 Math always produces a number: Accuracy rules over speed. Predictive Performance
  • 24. #1 Shuffle: Data presentation bias. Sorted data => interesting results
  • 25. #2 Random acts of Kindness?
  • 26.
  • 27. #3 Convex Problems: ADMM
  • 28. #4 Amdahl strikes: Cholesky / QR Decomposition. Matrix operations: jama, jblas.. all single node. Distributed version needs data transfer!
  • 29. #5 Random Forests: embarrassingly parallel. binning. tree-building. splits
  • 30. #6 Boosting: iterate & stage weak-learners => strong learners. each tree can be parallel. minimize communication
  • 31. #7 Neural Nets & Clustering: embarrassingly parallel. pre-calculate base stats. distance calculation. weight matrices – small footprint
  • 32. #8 Ensembles: Daisy chain a bunch of models. Interleave. JIT – Minimize loops over data.
  • 33. #9 Tools: Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.
  • 34. Replace NAs to improve
 predictive performance by about 10%. - Newton
  • 35. Munging Missing Features
 impute NAs with mean
 impute NAs with kNN
 impute with recursive PCA - Boyd
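
The simplest of the options on the slide above, mean imputation, can be sketched as follows. The class name `MeanImpute` and the use of NaN to mark a missing value are assumptions for the example; the kNN and recursive PCA variants are not shown.

```java
// Hypothetical helper: replaces missing values (encoded as NaN) with the column mean.
class MeanImpute {
    static double[] imputeMean(double[] col) {
        double sum = 0;
        int n = 0;
        for (double d : col) {
            if (!Double.isNaN(d)) { sum += d; n++; } // first pass: mean of observed values
        }
        double[] out = col.clone();                  // leave the input untouched
        if (n == 0) return out;                      // nothing observed, nothing to impute with
        double mean = sum / n;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean; // second pass: fill the gaps
        }
        return out;
    }
}
```

For a column `{1.0, NaN, 3.0}` the observed mean is 2.0, so the missing entry becomes 2.0.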
  • 36. Unbalanced data
 single rare class
 Fraud / No-Fraud. Stratify
  • 37. Unbalanced data
 multiple rare classes
 Browse, Click, Purchase. Stratify
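
One way to read "Stratify" on the two slides above is stratified sampling: sample each class separately so that rare classes are never dropped by chance. The sketch below is an assumption about what that could look like, not the speaker's method; the class `StratifiedSample`, integer class labels, and rounding up per class are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper: draws the same fraction of rows from every class.
class StratifiedSample {
    // Returns row indices holding ceil(fraction * classCount) rows of each class
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);       // seeded, so runs are repeatable
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);  // random order within the class
            int take = (int) Math.ceil(fraction * rows.size());
            picked.addAll(rows.subList(0, take));
        }
        return picked;
    }
}
```

Rounding up means even a class with two rows (say, fraud cases) contributes at least one row to a 10% sample.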
  • 38. #10 Data is the System: Use Customer Data. Algorithms for Sparse vs. Dense. Unbalanced Data. Robustness under noise
  • 39. Before H2O. Velocity: Events, Online Scoring. Volume: HDFS, Rule Engine. Munging, slice n dice, Features, HIVE/SQL. Applications. Exploration, Data Scientist. Modeling, Offline Scoring. Engineer. Business Analyst. Ensemble models. Low latency. Classification, Regression, Clustering. Optimal Model. Predictions
  • 40. Big Data: Exploration, Modeling, Scoring, Real-time. Big Data beats Better Algorithms!
  • 41. Big Data: Exploration, Modeling, Scoring, Real-time. Big Data and Better Algorithms! Scale & Parallelism!
  • 42. Intellectual Legacy. Math needs to be free. Open Source. Support and Innovation. https://github.com/0xdata/h2o H2O, the Prediction Engine
  • 43.
  • 44. Better Predictions! H2O – The Open Source Math Engine!
  • 45. Distributed Coding Taxonomy. No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON, e.g. load data, GLM, get results. Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style, e.g. any dense linear algebra. Complex Data-Parallel Coding: K/V Store, Graph Algos, e.g. PageRank
  • 46. Distributed Coding Taxonomy. No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON, e.g. load data, GLM, get results. Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style, e.g. any dense linear algebra. Complex Data-Parallel Coding: K/V Store, Graph Algos, e.g. PageRank. Callouts: Read the docs! This talk! Join our GIT!
  • 47. Distributed Data Taxonomy. Frame – a collection of Vecs. Vec – a collection of Chunks. Chunk – a collection of 1e3 to 1e6 elems. elem – a java double. Row i – i'th elements of all the Vecs in a Frame
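
The Frame / Vec / Chunk / Row definitions above can be sketched in a few lines. Everything here is hypothetical and single-JVM: `TinyVec` and `TinyFrame` are names invented for the example, chunks are plain `double[]` arrays of arbitrary size, and the lookup of a global row index uses a binary search over chunk start offsets. H2O's real classes distribute chunks across JVM heaps, which this sketch does not attempt.

```java
import java.util.Arrays;

// Hypothetical Vec: a collection of chunks, each chunk a double[]
class TinyVec {
    private final double[][] chunks;
    private final long[] starts; // starts[c] = global index of chunk c's first elem; last entry = length

    TinyVec(double[]... chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length + 1];
        for (int c = 0; c < chunks.length; c++) starts[c + 1] = starts[c] + chunks[c].length;
    }

    long length() { return starts[chunks.length]; }

    double at(long idx) {
        if (idx < 0 || idx >= length()) throw new IndexOutOfBoundsException("idx=" + idx);
        int c = Arrays.binarySearch(starts, idx);
        if (c < 0) c = -c - 2;               // not an exact start: take the chunk just before the insertion point
        while (starts[c + 1] <= idx) c++;    // skip over any empty chunks
        return chunks[c][(int) (idx - starts[c])];
    }
}

// Hypothetical Frame: a collection of equally long Vecs
class TinyFrame {
    private final TinyVec[] vecs;

    TinyFrame(TinyVec... vecs) {
        for (TinyVec v : vecs) {
            if (v.length() != vecs[0].length()) throw new IllegalArgumentException("Vecs differ in length");
        }
        this.vecs = vecs;
    }

    // Row i is the i'th element of every Vec in the Frame
    double[] row(long i) {
        double[] r = new double[vecs.length];
        for (int j = 0; j < vecs.length; j++) r[j] = vecs[j].at(i);
        return r;
    }
}
```

For example, `new TinyFrame(new TinyVec(new double[]{30, 41}, new double[]{25}), new TinyVec(new double[]{1, 2, 3})).row(2)` gathers the third element of each column.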
  • 48. Usecases: Conversion, Retention & Churn; Lead Conversion; Engagement; Product Placement; Recommendations; Pricing Engine; Fraud Detection