SlideShare a Scribd company logo
1 of 17
Big, Practical Recommendations
with Alternating Least Squares

Sean Owen • Apache Mahout / Myrrix.com
WHERE’S BIG LEARNING?
 Next: Application Layer
    Analytics
    Machine Learning
                                     Applications
 Like Apache Mahout
    Common Big Data app today       Processing
    Clustering, recommenders,
     classifiers on Hadoop            Database
    Free, open source; not mature

 Where’s commercialized               Storage
  Big Learning?
A RECOMMENDER SHOULD …
 Answer in Real-time                Accept Diverse Input
    Ingest new data, now               Not just people and products
    Modify recommendations based       Not just explicit ratings
      on newest data                    Clicks, views, buys
    No “cold start” for new data
                                        Side information
 Scale Horizontally                 Be “Pretty Accurate”
    For queries per second
    For size of data set
NEED: 2-TIER ARCHITECTURE
 Real-time Serving Layer
    Quick results based on
      precomputed model
    Incremental update
    Partitionable for scale

 Batch Computation Layer
    Builds model
    Scales out (on Hadoop?)
    Asynchronous, occasional,
      long-lived runs
A PRACTICAL ALGORITHM

MATRIX FACTORIZATION                BENEFITS
 Factor user-item matrix to         Models intuition
  user-feature + feature-item        Factorization is batch
  matrix                              parallelizable
 Well understood in ML, as:         Reconstruction (recs) in
    Principal Component Analysis     low-dimension is fast
    Latent Semantic Indexing
                                     Allows projection of new data
 Several algorithms, like:             Cold start solution
    Singular Value Decomposition       Approximate update solution
    Alternating Least Squares
A PRACTICAL IMPLEMENTATION
ALTERNATING LEAST
SQUARES                              BENEFITS
 Simple factorization P ≈ X YT       Parallelizable by row --
 Approximate: X, Y are                very Hadoop-friendly
  “skinny” (low-rank)                 Iterative: OK answer fast,
 Faster than the SVD                  refine as long as desired
    Trivially parallel, iterative    Yields to “binary” input model
 Dumber than the SVD                    Ratings as regularization
                                           instead
    No singular values,
                                         Sparseness / 0s no longer a
      orthonormal basis
                                           problem
ALS ALGORITHM 1
 Input: (user, item, strength)      1   4   3
  tuples
                                             3
     Anything you can quantify is
       input                             4       3   2
     Strength is positive           5       2       3
 Many tuples per user-item                      5
 R is sparse user-item              2   4               R
  interaction matrix
 rij = total strength of
  interaction between user i
  and item j
ALS ALGORITHM 2
 Follow “Collaborative                    1   1   1   0   0
  Filtering for Implicit
                                           0   0   1   0   0
  Feedback Datasets”
  www2.research.att.com/~yifanhu/PUB/cf.   0   1   0   1   1
  pdf
                                           1   0   1   0   1
 Construct “binary” matrix P
                                           0   0   0   1   0
    1 where R > 0
                                           1   1   0   0   0   P
    0 where R = 0

 Factor P, not R
    R returns in regularization

 Still sparse; implicit 0s fine
ALS ALGORITHM 3
 P is m x n
 Choose k << m, n
 Factor P as Q = X YT, Q ≈ P
    X is m x k ; YT is k x n               YT
 Find best approximation Q
    Minimize L2 norm of diff: || P-Q   X
      ||2
    Minimal squared error:
      “Least Squares”
 Recommendations are
  largest values in Q
ALS ALGORITHM 4
 Optimizing X, Y
  simultaneously is non-
  convex, hard
 If X or Y are fixed, system of
                                       YT
  linear equations:
  convex, easy
 Initialize Y with random         X
  values
 Solve for X
 Fix X, solve for Y
 Repeat (“Alternating”)
ALS ALGORITHM 5
 Define regularization weights cui = 1 + α rui
 Minimize:

  Σ cui(pui – xuTyi)2 + λ(Σ||xu||2 + Σ||yi||2)

 Simple least-squares regression objective, plus
    Weighted least-squared error terms by strength,
      a penalty for not reconstructing 1 at “strong” association is higher
    Standard L2 regularization term
ALS ALGORITHM 6
 With fixed Y, compute optimal X
 Each row xu is independent
 Define Cu as diagonal matrix of cu (user strength weights)
 xu = (YTCuY + λI)-1 YTCupu
 Compare to simple least-squares regression solution (YTY)-1 YTpu
    Adds Tikhonov / ridge regression regularization term λI
    Attaches cu weights to YT

 See paper for how YTCuY is computed efficiently;
  skipping the engineering!
EXAMPLE FACTORIZATION
 k = 3, λ = 2, α = 40, 10 iterations

                                        0.96   0.99   0.99    0.38    0.93
         1    1   1   0   0
                                        0.44   0.39   0.98    -0.11   0.39
         0    0   1   0   0

                               ≈
                                        0.70   0.99   0.42    0.98    0.98
         0    1   0   1   1
         1    0   1   0   1             1.00   1.04   0.99    0.44    0.98   Q = X•YT
                                        0.11   0.51   -0.13   1.00    0.57
         0    0   0   1   0
                                        0.97   1.00   0.68    0.47    0.91
         1    1   0   0   0
FOLD-IN
 Need immediate, if               Note (YTY)(YTY)-1 = I
  approximate, updates for         Gives YT’s right inverse:
  new data                          YT (Y(YTY)-1) = I
 New user u needs new row         Xu = Qu Y(YTY)-1
  Qu = Xu YT
                                   Xu ≈ Pu Y(YTY)-1
 We have Pu ≈ Qu
                                   Recommend as usual:
 Compute Xu via right inverse:     Qu = XuYT
  X YT(YT)-1 = Q(YT)-1 so:
                                   For existing user, instead
  X = Q(YT)-1
                                    add to existing row Xu
 What is   (YT)-1?
THIS IS MYRRIX
 Soft-launched
 Serving Layer available
  as open source download
 Computation Layer available
  as beta
 Ready on Amazon EC2 / EMR
                                srowen@myrrix.com
 Full launch Q4 2012
 myrrix.com
APPENDIX
EXAMPLES

STACKOVERFLOW TAGS               WIKIPEDIA LINKS
 Recommend tags to               Recommend new linked
  questions                        articles from existing links
 Tag questions automatically,    Propose missing, related
  improve tag coverage             links
 3.5M questions x 30K tags       2.5M articles x 1.8M articles
 4.3 hours x 5 machines on       28 hours x 2 PCs on
  Amazon EMR                       Apache Hadoop 1.0.3
 $3.03 ≈ $0.08 per 100,000
  recs

More Related Content

What's hot

Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at ScaleFabian Reinartz
 
Zabbix Performance Tuning
Zabbix Performance TuningZabbix Performance Tuning
Zabbix Performance TuningRicardo Santos
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
 
Proxmox Clustering with CEPH
Proxmox Clustering with CEPHProxmox Clustering with CEPH
Proxmox Clustering with CEPHFahadIbrar5
 
Music Recommendation Tutorial
Music Recommendation TutorialMusic Recommendation Tutorial
Music Recommendation TutorialOscar Celma
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Achieving Continuous Availability for Your Applications with Oracle MAA
Achieving Continuous Availability for Your Applications with Oracle MAAAchieving Continuous Availability for Your Applications with Oracle MAA
Achieving Continuous Availability for Your Applications with Oracle MAAMarkus Michalewicz
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisitedXavier Amatriain
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialTim Vaillancourt
 
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...Demetris Trihinas
 
Oracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention TroubleshootingOracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention TroubleshootingTanel Poder
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San JoseThe Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San JoseNikolay Samokhvalov
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HAharoonm
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Informationrecsysfr
 
C10k and beyond - Uri Shamay, Akamai
C10k and beyond - Uri Shamay, AkamaiC10k and beyond - Uri Shamay, Akamai
C10k and beyond - Uri Shamay, AkamaiCodemotion Tel Aviv
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 

What's hot (20)

Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at Scale
 
Zabbix Performance Tuning
Zabbix Performance TuningZabbix Performance Tuning
Zabbix Performance Tuning
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
Proxmox Clustering with CEPH
Proxmox Clustering with CEPHProxmox Clustering with CEPH
Proxmox Clustering with CEPH
 
Music Recommendation Tutorial
Music Recommendation TutorialMusic Recommendation Tutorial
Music Recommendation Tutorial
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Achieving Continuous Availability for Your Applications with Oracle MAA
Achieving Continuous Availability for Your Applications with Oracle MAAAchieving Continuous Availability for Your Applications with Oracle MAA
Achieving Continuous Availability for Your Applications with Oracle MAA
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisited
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
 
Oracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention TroubleshootingOracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention Troubleshooting
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San JoseThe Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HA
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
 
C10k and beyond - Uri Shamay, Akamai
C10k and beyond - Uri Shamay, AkamaiC10k and beyond - Uri Shamay, Akamai
C10k and beyond - Uri Shamay, Akamai
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 

Similar to Big Practical Recommendations with Alternating Least Squares

Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep LearningSourya Dey
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackarogozhnikov
 
Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsDmitriy Selivanov
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2arogozhnikov
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1arogozhnikov
 
Machine Learning - Regression model
Machine Learning - Regression modelMachine Learning - Regression model
Machine Learning - Regression modelRADO7900
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习AdaboostShocky1
 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Marketsguasoni
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14Sri Ambati
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouthQiqi Wang
 
Optimization Techniques.pdf
Optimization Techniques.pdfOptimization Techniques.pdf
Optimization Techniques.pdfanandsimple
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationIlya Loshchilov
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutData Science London
 
Introduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave MachinesIntroduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave MachinesArithmer Inc.
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learningYogendra Singh
 
MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4arogozhnikov
 

Similar to Big Practical Recommendations with Alternating Least Squares (20)

Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
 
Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender Systems
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
 
Machine Learning - Regression model
Machine Learning - Regression modelMachine Learning - Regression model
Machine Learning - Regression model
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Markets
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouth
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
Optimization Techniques.pdf
Optimization Techniques.pdfOptimization Techniques.pdf
Optimization Techniques.pdf
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
opt_slides_ump.pdf
opt_slides_ump.pdfopt_slides_ump.pdf
opt_slides_ump.pdf
 
Introduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave MachinesIntroduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave Machines
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
 
Partial Derivatives.pdf
Partial Derivatives.pdfPartial Derivatives.pdf
Partial Derivatives.pdf
 
Lec1 01
Lec1 01Lec1 01
Lec1 01
 
MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4
 

More from Data Science London

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Data Science London
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingData Science London
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Data Science London
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysisData Science London
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayData Science London
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignData Science London
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Data Science London
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureData Science London
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryData Science London
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutData Science London
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRData Science London
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersData Science London
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxData Science London
 

More from Data Science London (20)

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Nowcasting Business Performance
Nowcasting Business PerformanceNowcasting Business Performance
Nowcasting Business Performance
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunching
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
 
Survival Analysis of Web Users
Survival Analysis of Web UsersSurvival Analysis of Web Users
Survival Analysis of Web Users
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, Today
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems Design
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Data Science for Live Music
Data Science for Live MusicData Science for Live Music
Data Science for Live Music
 
Research at last.fm
Research at last.fmResearch at last.fm
Research at last.fm
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music Industry
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
 

Recently uploaded

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Big Practical Recommendations with Alternating Least Squares

  • 1. Big, Practical Recommendations with Alternating Least Squares Sean Owen • Apache Mahout / Myrrix.com
  • 2. WHERE’S BIG LEARNING?  Next: Application Layer  Analytics  Machine Learning Applications  Like Apache Mahout  Common Big Data app today Processing  Clustering, recommenders, classifiers on Hadoop Database  Free, open source; not mature  Where’s commercialized Storage Big Learning?
  • 3. A RECOMMENDER SHOULD …  Answer in Real-time  Accept Diverse Input  Ingest new data, now  Not just people and products  Modify recommendations based  Not just explicit ratings on newest data  Clicks, views, buys  No “cold start” for new data  Side information  Scale Horizontally  Be “Pretty Accurate”  For queries per second  For size of data set
  • 4. NEED: 2-TIER ARCHITECTURE  Real-time Serving Layer  Quick results based on precomputed model  Incremental update  Partitionable for scale  Batch Computation Layer  Builds model  Scales out (on Hadoop?)  Asynchronous, occasional, long-lived runs
  • 5. A PRACTICAL ALGORITHM MATRIX FACTORIZATION BENEFITS  Factor user-item matrix to  Models intuition user-feature + feature-item  Factorization is batch matrix parallelizable  Well understood in ML, as:  Reconstruction (recs) in  Principal Component Analysis low-dimension is fast  Latent Semantic Indexing  Allows projection of new data  Several algorithms, like:  Cold start solution  Singular Value Decomposition  Approximate update solution  Alternating Least Squares
  • 6. A PRACTICAL IMPLEMENTATION ALTERNATING LEAST SQUARES BENEFITS  Simple factorization P ≈ X YT  Parallelizable by row --  Approximate: X, Y are very Hadoop-friendly “skinny” (low-rank)  Iterative: OK answer fast,  Faster than the SVD refine as long as desired  Trivially parallel, iterative  Yields to “binary” input model  Dumber than the SVD  Ratings as regularization instead  No singular values,  Sparseness / 0s no longer a orthonormal basis problem
  • 7. ALS ALGORITHM 1  Input: (user, item, strength) 1 4 3 tuples 3  Anything you can quantify is input 4 3 2  Strength is positive 5 2 3  Many tuples per user-item 5  R is sparse user-item 2 4 R interaction matrix  rij = total strength of interaction between user i and item j
  • 8. ALS ALGORITHM 2  Follow “Collaborative 1 1 1 0 0 Filtering for Implicit 0 0 1 0 0 Feedback Datasets” www2.research.att.com/~yifanhu/PUB/cf. 0 1 0 1 1 pdf 1 0 1 0 1  Construct “binary” matrix P 0 0 0 1 0  1 where R > 0 1 1 0 0 0 P  0 where R = 0  Factor P, not R  R returns in regularization  Still sparse; implicit 0s fine
  • 9. ALS ALGORITHM 3  P is m x n  Choose k << m, n  Factor P as Q = X YT, Q ≈ P  X is m x k ; YT is k x n YT  Find best approximation Q  Minimize L2 norm of diff: || P-Q X ||2  Minimal squared error: “Least Squares”  Recommendations are largest values in Q
  • 10. ALS ALGORITHM 4  Optimizing X, Y simultaneously is non- convex, hard  If X or Y are fixed, system of YT linear equations: convex, easy  Initialize Y with random X values  Solve for X  Fix X, solve for Y  Repeat (“Alternating”)
  • 11. ALS ALGORITHM 5  Define regularization weights cui = 1 + α rui  Minimize: Σ cui(pui – xuTyi)2 + λ(Σ||xu||2 + Σ||yi||2)  Simple least-squares regression objective, plus  Weighted least-squared error terms by strength, a penalty for not reconstructing 1 at “strong” association is higher  Standard L2 regularization term
  • 12. ALS ALGORITHM 6  With fixed Y, compute optimal X  Each row xu is independent  Define Cu as diagonal matrix of cu (user strength weights)  xu = (YTCuY + λI)-1 YTCupu  Compare to simple least-squares regression solution (YTY)-1 YTpu  Adds Tikhonov / ridge regression regularization term λI  Attaches cu weights to YT  See paper for how YTCuY is computed efficiently; skipping the engineering!
  • 13. EXAMPLE FACTORIZATION  k = 3, λ = 2, α = 40, 10 iterations 0.96 0.99 0.99 0.38 0.93 1 1 1 0 0 0.44 0.39 0.98 -0.11 0.39 0 0 1 0 0 ≈ 0.70 0.99 0.42 0.98 0.98 0 1 0 1 1 1 0 1 0 1 1.00 1.04 0.99 0.44 0.98 Q = X•YT 0.11 0.51 -0.13 1.00 0.57 0 0 0 1 0 0.97 1.00 0.68 0.47 0.91 1 1 0 0 0
  • 14. FOLD-IN  Need immediate, if  Note (YTY)(YTY)-1 = I approximate, updates for  Gives YT’s right inverse: new data YT (Y(YTY)-1) = I  New user u needs new row  Xu = Qu Y(YTY)-1 Qu = Xu YT  Xu ≈ Pu Y(YTY)-1  We have Pu ≈ Qu  Recommend as usual:  Compute Xu via right inverse: Qu = XuYT X YT(YT)-1 = Q(YT)-1 so:  For existing user, instead X = Q(YT)-1 add to existing row Xu  What is (YT)-1?
  • 15. THIS IS MYRRIX  Soft-launched  Serving Layer available as open source download  Computation Layer available as beta  Ready on Amazon EC2 / EMR srowen@myrrix.com  Full launch Q4 2012  myrrix.com
  • 17. EXAMPLES STACKOVERFLOW TAGS WIKIPEDIA LINKS  Recommend tags to  Recommend new linked questions articles from existing links  Tag questions automatically,  Propose missing, related improve tag coverage links  3.5M questions x 30K tags  2.5M articles x 1.8M articles  4.3 hours x 5 machines on  28 hours x 2 PCs on Amazon EMR Apache Hadoop 1.0.3  $3.03 ≈ $0.08 per 100,000 recs