Simple Matrix Factorization for Recommendation
Sean Owen • Apache Mahout
Apache Mahout
•   Scalable machine learning
•   (Mostly) Hadoop-based
•   Clustering, classification and recommender engines
•   mahout.apache.org

•   Nearest-neighbor
     •   User-based
     •   Item-based
     •   Slope-one
     •   Clustering-based

•   Latent factor
     •   SVD-based
     •   ALS
     •   More!
Matrix = Associations
•   Things are associated
     •   Like people to colors
•   Associations have strengths
     •   Like preferences and dislikes
•   Can quantify associations
     •   Alice loves navy = +4, Carol dislikes olive = -2
•   We don’t know all associations
     •   Many implicit zeroes

         Rose   Navy   Olive
Alice      0     +4      0
Bob        0      0     +2
Carol     -1      0     -2
Dave      +3      0      0
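
A minimal sketch of how the association table above might be held in code, assuming a sparse dictionary keyed by (person, color). The names `associations` and `strength` are illustrative and not from the talk. Unknown pairs are simply absent and read back as implicit zeroes.

```python
# Known associations only; anything missing is an implicit zero.
associations = {
    ("Alice", "Navy"): 4,
    ("Bob", "Olive"): 2,
    ("Carol", "Rose"): -1,
    ("Carol", "Olive"): -2,
    ("Dave", "Rose"): 3,
}

people = ["Alice", "Bob", "Carol", "Dave"]
colors = ["Rose", "Navy", "Olive"]


def strength(person, color):
    """Return the stored strength, or 0 when the association is unknown."""
    return associations.get((person, color), 0)


# Dense view of the same data: one row per person, one column per color.
matrix = [[strength(p, c) for c in colors] for p in people]

for person, row in zip(people, matrix):
    print(f"{person:6}", row)
```

Storing only the known pairs keeps memory proportional to what has been observed, which matters once most cells are unknown.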
From One Matrix, Two
•   Like numbers, matrices can be factored
•   m•n matrix = m•k times k•n
•   Associations can decompose into others
•   Alice likes navy = Alice loves blues, and blues includes navy

P (m × n)  =  X (m × k)  •  Y’ (k × n)
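
The shape rule (m•k times k•n gives m•n) can be checked with a tiny hand-made factorization. This is only a sketch: the two features ("blue-ness" and "green-ness") and every number in `X` and `Yt` are invented for illustration, chosen so the product reproduces Alice's +4 for navy and Bob's +2 for olive.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (both lists of rows)."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[f] * b[f][j] for f in range(k)) for j in range(len(b[0]))]
            for row in a]


# Hypothetical factors with k = 2 features: "blue-ness" and "green-ness".
# X: people x features (how much each person likes each feature).
X = [[2.0, 0.0],   # Alice loves blues
     [0.0, 1.0]]   # Bob likes greens

# Y': features x colors (how much of each feature a color has).
Yt = [[0.0, 2.0, 0.0],   # blue-ness of Rose, Navy, Olive
      [0.0, 0.0, 2.0]]   # green-ness of Rose, Navy, Olive

P = matmul(X, Yt)  # 2 x 3: people x colors
print(P)  # [[0.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
```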
In Terms of Few Features
•   Can explain associations by appealing to underlying intermediate features (e.g. “blue-ness”)
•   Relatively few (one “blue-ness”, but many shades)

[Diagram: (Alice) linked to (Blue), (Blue) linked to (Navy)]
Losing Information is Helpful
•   When k (= features) is small, information is lost
•   Factorization is approximate (Alice appears to like blue-ish periwinkle too)

[Diagram: (Alice) linked to (Blue); (Blue) linked to (Periwinkle) and (Navy)]
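
To make the periwinkle point concrete, here is a sketch with a single feature (k = 1). All numbers are made up. The only thing observed about Alice is navy, yet the reconstruction also scores periwinkle highly because both colors share the "blue-ness" feature.

```python
# One feature only (k = 1): "blue-ness". All numbers are illustrative.
alice_blueness = 2.0
color_blueness = {"Navy": 2.0, "Periwinkle": 1.5, "Olive": 0.0}

# What was actually observed about Alice: only navy.
observed = {"Navy": 4.0}

# The rank-1 reconstruction scores every color, seen or not.
reconstructed = {c: alice_blueness * b for c, b in color_blueness.items()}

for color, score in reconstructed.items():
    seen = observed.get(color, "unknown")
    print(f"{color:10} observed={seen!s:8} reconstructed={score}")
```

The unknown periwinkle cell comes back as 3.0 rather than 0. That "error" relative to the sparse input is exactly the generalization a recommender wants.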
How to Compute?

P (m × n)  =  X (m × k)  •  Y’ (k × n)
Skip the Singular Value Decomposition for now …

A (m × n)  =  S (m × k)  •  Σ (k × k)  •  T’ (k × n)
Alternating Least Squares
•   Collaborative Filtering for Implicit Feedback Datasets
     www2.research.att.com/~yifanhu/PUB/cf.pdf
•   R = matrix of user-item interaction “strengths”
•   P = R reduced to 0 and 1
•   Factor as approximate P ≈ X•Y’
     •   Start with random Y
     •   Compute X such that X•Y’ best approximates P in the Frobenius / L2 norm (Least Squares)
     •   Repeat for Y, holding X fixed (Alternating)
     •   Iterate, Iterate, Iterate
•   Large values in X•Y’ are good recommendations
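
The loop above can be sketched in plain Python. This is not Mahout's implementation. It assumes the weighting from the cited paper, where each cell gets confidence c = 1 + α·r and λ is a regularization weight; the slides name λ and α only in the example, so reading them this way is an assumption. The helper names (`solve`, `update`, `als`) are illustrative, and because Y starts random the numbers will not match the slides.

```python
import random


def solve(a, b):
    """Solve a x = b for a small square matrix via Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(fixed, strengths, k, lam, alpha):
    """Least-squares update of one side while the other side is held fixed.

    strengths[u][i] is the R entry linking row u to fixed[i] (0 if unknown).
    """
    out = []
    for row in strengths:
        a = [[lam if f == g else 0.0 for g in range(k)] for f in range(k)]
        b = [0.0] * k
        for vec, r in zip(fixed, row):
            c = 1.0 + alpha * r          # confidence (assumed, per the paper)
            p = 1.0 if r > 0 else 0.0    # P = R reduced to 0 and 1
            for f in range(k):
                b[f] += c * p * vec[f]
                for g in range(k):
                    a[f][g] += c * vec[f] * vec[g]
        out.append(solve(a, b))
    return out


def als(R, k=3, lam=2.0, alpha=40.0, iterations=10, seed=0):
    """Alternate between solving for X and for Y; expects iterations >= 1."""
    rng = random.Random(seed)
    Y = [[rng.random() for _ in range(k)] for _ in R[0]]  # start with random Y
    Rt = [list(col) for col in zip(*R)]                   # items x users
    for _ in range(iterations):
        X = update(Y, R, k, lam, alpha)    # fix Y, solve for X
        Y = update(X, Rt, k, lam, alpha)   # fix X, solve for Y (alternate)
    return X, Y


# A small user-item strength matrix; 0 means no interaction was recorded.
R = [[1, 4, 3, 0, 0],
     [0, 0, 3, 0, 0],
     [0, 4, 0, 3, 2],
     [5, 0, 2, 0, 3],
     [0, 0, 0, 5, 0],
     [2, 4, 0, 0, 0]]

X, Y = als(R)
scores = [[sum(x * y for x, y in zip(xu, yi)) for yi in Y] for xu in X]
for row in scores:
    print(" ".join(f"{v:5.2f}" for v in row))
```

Each row of X (and each row of Y) is solved from its own small k×k system, so no update reads another row of the side being updated.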
Example

R (. = empty cell)          P
1   4   3   .   .           1   1   1   0   0
.   .   3   .   .           0   0   1   0   0
.   4   .   3   2           0   1   0   1   1
5   .   2   .   3           1   0   1   0   1
.   .   .   5   .           0   0   0   1   0
2   4   .   .   .           1   1   0   0   0
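
The reduction from R to P in this example is a one-line transformation. In this sketch an empty cell of R is written as 0, which is an assumption of the encoding rather than something the slide states.

```python
# R from the example; 0 marks an empty cell (no recorded interaction).
R = [
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
]

# P keeps only whether an interaction exists, not how strong it is.
P = [[1 if r > 0 else 0 for r in row] for row in R]

for r_row, p_row in zip(R, P):
    print(r_row, "->", p_row)
```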
k = 3, λ=2, α=40
1 iteration

P ≈ X • Y’

P                       X
1   1   1   0   0       2.18   -0.01    0.35
0   0   1   0   0       1.83   -0.11   -0.68
0   1   0   1   1       0.79    1.15   -1.80
1   0   1   0   1       0.97   -1.90   -2.12
0   0   0   1   0       1.01   -0.25   -1.77
1   1   0   0   0       2.33   -8.00    1.06

Y’
 0.43    0.48    0.48    0.16    0.10
-0.27    0.39   -0.13    0.03    0.05
-0.03   -0.09   -0.13   -0.47   -0.47
k = 3, λ=2, α=40
1 iteration

P ≈ X•Y’

P                       X•Y’
1   1   1   0   0       0.94    1.00    1.00    0.18    0.07
0   0   1   0   0       0.84    0.89    0.99    0.60    0.50
0   1   0   1   1       0.07    0.99    0.46    1.01    0.98
1   0   1   0   1       1.00   -0.09    1.00    1.08    0.99
0   0   0   1   0       0.55    0.54    0.75    0.98    0.92
1   1   0   0   0       1.01    0.99    0.98   -0.13   -0.25
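
As a sanity check, the X and Y’ printed for one iteration can be multiplied and compared with the X•Y’ shown here. The sketch below copies all three matrices as printed. The first five rows agree to within rounding, while the last row does not when the printed -8.00 is used. That suggests the entry was garbled somewhere along the way, though this is an inference and not something the slides state.

```python
# X (6 x 3) and Y' (3 x 5) as printed for "1 iteration".
X = [[2.18, -0.01, 0.35], [1.83, -0.11, -0.68], [0.79, 1.15, -1.80],
     [0.97, -1.90, -2.12], [1.01, -0.25, -1.77], [2.33, -8.00, 1.06]]
Yt = [[0.43, 0.48, 0.48, 0.16, 0.10],
      [-0.27, 0.39, -0.13, 0.03, 0.05],
      [-0.03, -0.09, -0.13, -0.47, -0.47]]

# X•Y' as printed on the following slide.
shown = [[0.94, 1.00, 1.00, 0.18, 0.07],
         [0.84, 0.89, 0.99, 0.60, 0.50],
         [0.07, 0.99, 0.46, 1.01, 0.98],
         [1.00, -0.09, 1.00, 1.08, 0.99],
         [0.55, 0.54, 0.75, 0.98, 0.92],
         [1.01, 0.99, 0.98, -0.13, -0.25]]

# Recompute the product from the printed factors.
product = [[sum(x * Yt[f][j] for f, x in enumerate(row)) for j in range(5)]
           for row in X]

# Largest absolute disagreement per row between recomputed and printed values.
worst = [max(abs(p - s) for p, s in zip(p_row, s_row))
         for p_row, s_row in zip(product, shown)]

for i, w in enumerate(worst, start=1):
    print(f"row {i}: max |difference| = {w:.3f}")
```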
k = 3, λ=2, α=40
10 iterations

P ≈ X•Y’

P                       X•Y’
1   1   1   0   0       0.96    0.99    0.99    0.38    0.93
0   0   1   0   0       0.44    0.39    0.98   -0.11    0.39
0   1   0   1   1       0.70    0.99    0.42    0.98    0.98
1   0   1   0   1       1.00    1.04    0.99    0.44    0.98
0   0   0   1   0       0.11    0.51   -0.13    1.00    0.57
1   1   0   0   0       0.97    1.00    0.68    0.47    0.91
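
Turning the reconstructed matrix into recommendations means ranking, for each user, the items that user has not interacted with. A sketch using the 10-iteration values printed above follows. The function name `recommend` and the zero-based indexing are choices made for this example.

```python
# P and the 10-iteration X•Y' as printed; rows are users, columns are items.
P = [[1, 1, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 1, 0, 1, 1],
     [1, 0, 1, 0, 1],
     [0, 0, 0, 1, 0],
     [1, 1, 0, 0, 0]]
scores = [[0.96, 0.99, 0.99, 0.38, 0.93],
          [0.44, 0.39, 0.98, -0.11, 0.39],
          [0.70, 0.99, 0.42, 0.98, 0.98],
          [1.00, 1.04, 0.99, 0.44, 0.98],
          [0.11, 0.51, -0.13, 1.00, 0.57],
          [0.97, 1.00, 0.68, 0.47, 0.91]]


def recommend(user, how_many=1):
    """Rank items the user has not interacted with by estimated score."""
    unseen = [i for i, p in enumerate(P[user]) if p == 0]
    ranked = sorted(unseen, key=lambda i: scores[user][i], reverse=True)
    return ranked[:how_many]


for user in range(len(P)):
    print(f"user {user}: recommend item {recommend(user)[0]}")
```

For the first user (index 0), the unseen items are 3 and 4 with scores 0.38 and 0.93, so item 4 is recommended.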
Interesting Because…
•   This is all very parallelizable (by row, by column)
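
Row-wise independence is easy to see in code: with Y fixed, each user's row of X is computed from that user's row of P alone. The sketch below uses a simplified, unweighted regularized least-squares step, x = p•Y•(Y’•Y + λI)^-1 with k = 2, which is an assumption made to keep the example short. `Y`, `P`, and `LAM` are made-up values.

```python
from concurrent.futures import ThreadPoolExecutor

Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # 3 items x 2 features
P = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # 4 users x 3 items
LAM = 0.1                                         # regularization weight

# Y'Y + λI is shared by every row, so invert it once (2x2 closed form).
a = sum(y[0] * y[0] for y in Y) + LAM
b = sum(y[0] * y[1] for y in Y)
d = sum(y[1] * y[1] for y in Y) + LAM
det = a * d - b * b
G_INV = [[d / det, -b / det], [-b / det, a / det]]


def solve_row(p_row):
    """Features for one user; reads only that user's row of P."""
    py = [sum(p * y[f] for p, y in zip(p_row, Y)) for f in range(2)]  # p•Y
    return [py[0] * G_INV[0][f] + py[1] * G_INV[1][f] for f in range(2)]


# The same function run one row at a time, then from a pool of workers.
sequential = [solve_row(row) for row in P]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(solve_row, P))

print(parallel == sequential)
```

In CPython, threads will not speed up this arithmetic. The point is only that no row depends on another, which is the property a distributed job can exploit.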
BONUS: Folding in New Data
•   Model building takes time
•   Sometimes need immediate, if approximate, updates for new data
•   For new user U, need new row XU with XU•Y’ = QU, but have PU
•   What is XU?

•   Apply some right inverse to Q = X•Y’: X•Y’•(Y’)^-1 = Q•(Y’)^-1, so X = Q•(Y’)^-1
•   OK, what is (Y’)^-1?
•   Of course (Y’•Y)•(Y’•Y)^-1 = I
•   So Y’•(Y•(Y’•Y)^-1) = I and the right inverse is Y•(Y’•Y)^-1
•   XU = QU•Y•(Y’•Y)^-1 and so XU ≈ PU•Y•(Y’•Y)^-1
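
The fold-in formula can be checked numerically. The sketch below assumes Y is stored as items × features with k = 2, so that Y’•Y is a 2×2 matrix with a closed-form inverse. The item factors and the new user's row are made up.

```python
# Hypothetical item factors: n = 4 items, k = 2 features.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]


def gram_inverse(Y):
    """(Y'Y)^-1 for k = 2, using the closed-form 2x2 inverse."""
    a = sum(y[0] * y[0] for y in Y)
    b = sum(y[0] * y[1] for y in Y)
    d = sum(y[1] * y[1] for y in Y)
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]


G_inv = gram_inverse(Y)

# Right inverse of Y': Y•(Y'Y)^-1, an n x k matrix.
right_inv = [[y[0] * G_inv[0][f] + y[1] * G_inv[1][f] for f in range(2)]
             for y in Y]

# Check: Y' times the right inverse should be the k x k identity.
identity = [[sum(Y[i][f] * right_inv[i][g] for i in range(len(Y)))
             for g in range(2)] for f in range(2)]


def fold_in(p_u):
    """Approximate features for a new user: Xu ≈ Pu•Y•(Y'Y)^-1."""
    return [sum(p * r[f] for p, r in zip(p_u, right_inv)) for f in range(2)]


x_new = fold_in([1, 0, 1, 0])   # new user interacted with items 0 and 2
scores = [sum(x * y for x, y in zip(x_new, yi)) for yi in Y]

print("identity check:", identity)
print("new user features:", x_new)
print("scores:", scores)
```

No model rebuild is needed: the new row costs one small matrix-vector product against a matrix that can be precomputed once per model.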
In Mahout
•   org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
     •   Alternating least squares
     •   Distributed, Hadoop-based
•   org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender
     •   SVD-based
     •   Non-distributed, not Hadoop
•   MAHOUT-737
     •   Alternate implementation of alternating least squares
•   And more…
     •   DistributedLanczosSolver
     •   SequentialOutOfCoreSvd
     •   …
Myrrix
•   Complete product
     •   Real-time Serving Layer
     •   Hadoop-based Computation Layer
     •   Tuned, documented
•   Free / open: Serving Layer, for small data
•   Commercial: add Computation Layer for big data; Hosting
•   Matrix factorization-based, attractive properties
•   http://myrrix.com
Thank You
srowen at myrrix.com
mahout.apache.org

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Simple Matrix Factorization for Recommendation in Apache Mahout

  • 1. Simple Matrix Factorization for Recommendation
      Sean Owen • Apache Mahout
      [Title graphic: the 3×3 matrix 6 42 8 / 78 14 98 / 1 7 8]
  • 2. Apache Mahout (mahout.apache.org)
      ◦ Scalable machine learning
      ◦ (Mostly) Hadoop-based
      ◦ Clustering, classification and recommender engines
      ◦ Nearest-neighbor: user-based, item-based, slope-one, clustering-based
      ◦ Latent factor: SVD-based, ALS, more!
  • 3. Matrix = Associations
               Rose   Navy   Olive
      Alice      0     +4      0
      Bob        0      0     +2
      Carol     -1      0     -2
      Dave      +3      0      0
      ◦ Things are associated, like people to colors
      ◦ Associations have strengths, like preferences and dislikes
      ◦ Can quantify associations: Alice loves navy = +4, Carol dislikes olive = -2
      ◦ We don’t know all associations: many implicit zeroes
  • 4. From One Matrix, Two
      ◦ Like numbers, matrices can be factored
      ◦ m•n matrix = m•k times k•n
      ◦ Associations can decompose into others
      ◦ Alice likes navy = Alice loves blues, and blues includes navy
      [Diagram: P (m×n) = X (m×k) • Y’ (k×n)]
  • 5. In Terms of Few Features
      ◦ Can explain associations by appealing to underlying intermediate features (e.g. “blue-ness”)
      ◦ Relatively few (one “blue-ness”, but many shades)
      [Diagram: (Alice) → (Blue) → (Navy)]
  • 6. Losing Information is Helpful
      ◦ When k (= features) is small, information is lost
      ◦ Factorization is approximate (Alice appears to like blue-ish periwinkle too)
      [Diagram: (Alice) → (Blue) → (Periwinkle), (Navy)]
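The idea on slides 4 to 6 can be illustrated with a tiny hand-built factorization. This is only a sketch: the two features ("blue-ness" and "red-ness") and every number in `X` and `Yt` are hypothetical values chosen so that the product reproduces Alice/Navy = +4 and Dave/Rose = +3 from the association matrix; Periwinkle is the extra color mentioned on slide 6.

```python
# Hypothetical user-feature matrix X (m x k) with k = 2 features.
users = ["Alice", "Dave"]
X = [
    [2.0, 0.0],  # Alice: strong affinity for blue-ness, none for red-ness
    [0.0, 1.5],  # Dave: affinity for red-ness only
]

# Hypothetical feature-item matrix Y' (k x n).
colors = ["Navy", "Periwinkle", "Rose"]
Yt = [
    [2.0, 1.0, 0.0],  # blue-ness of each color
    [0.0, 0.0, 2.0],  # red-ness of each color
]

def matmul(A, B):
    """Multiply an m x k matrix by a k x n matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# P = X . Y' rebuilds user-color associations from the two features.
P = matmul(X, Yt)
for user, row in zip(users, P):
    print(user, dict(zip(colors, row)))
```

Because Alice's taste is stored only as "blue-ness", the product also assigns her a positive score for Periwinkle, a color she never rated. That loss of detail is what makes the model generalize.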
  • 7. How to Compute?
      [Diagram: P (m×n) = X (m×k) • Y’ (k×n)]
  • 8. Skip the Singular Value Decomposition for now …
      [Diagram: A (m×n) = S (m×k) • Σ (k×k) • T’ (k×n)]
  • 9. Alternating Least Squares
      ◦ Collaborative Filtering for Implicit Feedback Datasets
        www2.research.att.com/~yifanhu/PUB/cf.pdf
      ◦ R = matrix of user-item interactions “strengths”
      ◦ P = R reduced to 0 and 1
      ◦ Factor as approximate P ≈ X•Y’
      ◦ Start with random Y
      ◦ Compute X such that X•Y’ best approximates P (Frobenius / L2 norm) (Least Squares)
      ◦ Repeat for Y (Alternating)
      ◦ Iterate, Iterate, Iterate
      ◦ Large values in X•Y’ are good recommendations
  • 10. Example
      R (. = no interaction)        P
      1  4  3  .  .                 1  1  1  0  0
      .  .  3  .  .                 0  0  1  0  0
      .  4  .  3  2                 0  1  0  1  1
      5  .  2  .  3                 1  0  1  0  1
      .  .  .  5  .                 0  0  0  1  0
      2  4  .  .  .                 1  1  0  0  0
  • 11. k = 3, λ = 2, α = 40, 1 iteration: P ≈ X • Y’
      P                  X
      1  1  1  0  0      2.18  -0.01   0.35
      0  0  1  0  0      1.83  -0.11  -0.68
      0  1  0  1  1      0.79   1.15  -1.80
      1  0  1  0  1      0.97  -1.90  -2.12
      0  0  0  1  0      1.01  -0.25  -1.77
      1  1  0  0  0      2.33  -8.00   1.06

      Y’
       0.43   0.48   0.48   0.16   0.10
      -0.27   0.39  -0.13   0.03   0.05
      -0.03  -0.09  -0.13  -0.47  -0.47
  • 12. k = 3, λ = 2, α = 40, 1 iteration: P ≈ X•Y’
      P                  X•Y’
      1  1  1  0  0      0.94   1.00   1.00   0.18   0.07
      0  0  1  0  0      0.84   0.89   0.99   0.60   0.50
      0  1  0  1  1      0.07   0.99   0.46   1.01   0.98
      1  0  1  0  1      1.00  -0.09   1.00   1.08   0.99
      0  0  0  1  0      0.55   0.54   0.75   0.98   0.92
      1  1  0  0  0      1.01   0.99   0.98  -0.13  -0.25
  • 13. k = 3, λ = 2, α = 40, 10 iterations: P ≈ X•Y’
      P                  X•Y’
      1  1  1  0  0      0.96   0.99   0.99   0.38   0.93
      0  0  1  0  0      0.44   0.39   0.98  -0.11   0.39
      0  1  0  1  1      0.70   0.99   0.42   0.98   0.98
      1  0  1  0  1      1.00   1.04   0.99   0.44   0.98
      0  0  0  1  0      0.11   0.51  -0.13   1.00   0.57
      1  1  0  0  0      0.97   1.00   0.68   0.47   0.91
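The alternating procedure from slide 9 can be sketched on the example matrix R with the same settings (k = 3, λ = 2, α = 40). This is not Mahout's implementation. The slides do not spell out how α and λ enter the computation, so the sketch assumes the formulation of the cited implicit-feedback paper: confidence c = 1 + α·r on each cell and a ridge term λ on each factor row. The Gaussian initialisation, the seed and all function names are illustrative choices, so the numbers it prints will not match the slides exactly.

```python
import random

# R from the example slide: 6 users x 5 items, 0 = no interaction observed.
R = [
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def update(fixed, strengths, lam, alpha):
    """Least-squares step: recompute one side's rows with the other side fixed.

    strengths[u][i] is the interaction strength between the row being solved
    (u) and row i of `fixed`.
    """
    k = len(fixed[0])
    out = []
    for row in strengths:
        A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
        rhs = [0.0] * k
        for r, y in zip(row, fixed):
            c = 1.0 + alpha * r        # assumed confidence weight
            p = 1.0 if r > 0 else 0.0  # P = R reduced to 0 and 1
            for a in range(k):
                rhs[a] += c * p * y[a]
                for b in range(k):
                    A[a][b] += c * y[a] * y[b]
        out.append(solve(A, rhs))
    return out

def als(R, k=3, lam=2.0, alpha=40.0, iterations=10, seed=0):
    rng = random.Random(seed)
    Rt = [list(col) for col in zip(*R)]                      # items x users
    Y = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in Rt]  # start with random Y
    for _ in range(iterations):
        X = update(Y, R, lam, alpha)   # compute X with Y fixed
        Y = update(X, Rt, lam, alpha)  # repeat for Y with X fixed
    return X, Y

X, Y = als(R)
scores = [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]  # X . Y'
for row in scores:
    print(" ".join(f"{v:5.2f}" for v in row))
```

Each row of X (and then of Y) is solved independently of the others, which is why the computation parallelizes by row and column. Large printed values in cells where R has no interaction are the candidate recommendations.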
  • 14. Interesting Because…
      ◦ This is all very parallelizable by row, column
  • 15. BONUS: Folding in New Data
      ◦ Model building takes time
      ◦ Sometimes need immediate, if approximate, updates for new data
      ◦ For new user U, need new row X_U with X_U•Y’ = Q_U, but have P_U
      ◦ What is X_U?
      ◦ Apply some right inverse: X•Y’•(Y’)⁻¹ = Q•(Y’)⁻¹, so X = Q•(Y’)⁻¹
      ◦ OK, what is (Y’)⁻¹?
      ◦ Of course (Y’•Y)•(Y’•Y)⁻¹ = I
      ◦ So Y’•(Y•(Y’•Y)⁻¹) = I and the right inverse is Y•(Y’•Y)⁻¹
      ◦ X_U = Q_U•Y•(Y’•Y)⁻¹ and so X_U ≈ P_U•Y•(Y’•Y)⁻¹
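The fold-in formula X_U ≈ P_U•Y•(Y’•Y)⁻¹ can be sketched as follows. The Y’ values are copied from the one-iteration example slide; the new user's row `p_new` and the function names are hypothetical. Rather than forming the inverse explicitly, the sketch solves the k×k system (Y’•Y)•x = Y’•P_U’, which gives the same row vector because Y’•Y is symmetric.

```python
# Y' after one iteration, from the example slide (k = 3 features, 5 items).
Yt = [
    [0.43, 0.48, 0.48, 0.16, 0.10],
    [-0.27, 0.39, -0.13, 0.03, 0.05],
    [-0.03, -0.09, -0.13, -0.47, -0.47],
]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fold_in(Yt, p_u):
    """Return X_U = P_U . Y . (Y'.Y)^-1 for one new user row P_U."""
    k = len(Yt)
    # Gram matrix Y'.Y (k x k)
    gram = [[sum(a * b for a, b in zip(Yt[i], Yt[j])) for j in range(k)]
            for i in range(k)]
    # Y'.P_U' (length k), the transpose of P_U.Y
    rhs = [sum(y * p for y, p in zip(Yt[i], p_u)) for i in range(k)]
    return solve(gram, rhs)

# Hypothetical new user who interacted with items 1 and 3 only.
p_new = [1, 0, 1, 0, 0]
x_new = fold_in(Yt, p_new)

# Scores for the new user: X_U . Y'
scores_new = [sum(x * Yt[f][i] for f, x in enumerate(x_new)) for i in range(5)]
print([round(x, 3) for x in x_new])
print([round(s, 3) for s in scores_new])
```

This is an unweighted least-squares fit of the new row against the existing item factors, so it ignores the confidence weights and regularization used during model building. That is the "immediate, if approximate" trade-off the slide describes.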
  • 16. In Mahout
      ◦ org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
          Alternating least squares
          Distributed, Hadoop-based
      ◦ org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender
          SVD-based
          Non-distributed, not Hadoop
      ◦ MAHOUT-737
          Alternate implementation of alternating least squares
      ◦ And more…
          DistributedLanczosSolver
          SequentialOutOfCoreSvd
          …
  • 17. Myrrix
      ◦ Complete product
          Real-time Serving Layer
          Hadoop-based Computation Layer
          Tuned, documented
      ◦ Free / open: Serving Layer, for small data
      ◦ Commercial: add Computation Layer for big data; Hosting
      ◦ Matrix factorization-based, attractive properties
      ◦ http://myrrix.com
  • 18. Thank You
      srowen at myrrix.com
      mahout.apache.org