Why Topological Data Analysis Beats Dimension Reduction
The motivation
Instead of asking data-specific questions, we can use traditional tools to join different data sources and prepare a holistic dataset
+
This dataset can be processed automatically using topological data analysis and presented as a map of dependencies and correlations
=
Get answers to questions you haven’t asked yet
Topological invariants
A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: $X \cong Y \implies f(X) = f(Y)$
Homology is a machine that converts local data about a space into global algebraic structure.
Reference: Wikipedia, 2010.
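To make the idea of an invariant concrete, here is a minimal sketch that computes Betti numbers (the ranks of the homology groups) of a small simplicial complex over GF(2). It is not taken from the talk: the function names `rank_gf2` and `betti_numbers` and the toy complexes are illustrative assumptions. Two different labellings of the same hollow triangle give the same numbers, while filling the triangle in changes them.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank of a 0/1 matrix over GF(2); each row is an int bitmask."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        # Clear that bit from every remaining row that has it.
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices):
    """Betti numbers [b_0, b_1, ...] of a complex given as vertex tuples.

    Assumes the list is closed under taking faces (hypothetical input format).
    """
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    ranks = {}  # ranks[k] = rank of the boundary map C_k -> C_(k-1)
    for k in range(1, top + 1):
        index = {face: i for i, face in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):  # the (k-1)-dimensional faces
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    # b_k = dim C_k - rank(boundary_k) - rank(boundary_(k+1))
    return [
        len(by_dim[k]) - ranks.get(k, 0) - ranks.get(k + 1, 0)
        for k in range(top + 1)
    ]


hollow = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
relabelled = [("a",), ("b",), ("c",), ("b", "c"), ("a", "c"), ("a", "b")]
filled = hollow + [(0, 1, 2)]

print(betti_numbers(hollow))      # one component, one loop
print(betti_numbers(relabelled))  # same space, same invariant
print(betti_numbers(filled))      # the loop is filled in
```

The local data are the individual simplices and their faces; the global structure is the count of components and loops that survives the linear algebra.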
Morse Theory and Reeb Graph
Theorem: Suppose $h : X \to \mathbb{R}$ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.
Reference: Teng Ma, Zhuangzhi Wu, Pei Luo, Lu Feng. Reeb graph computation through spectral clustering, 2011.
Case study: Netflix competition
A dataset from Netflix’s open competition for the best collaborative filtering algorithm to predict user ratings for films:
• 100,480,507 ratings
• 480,189 users
• 17,770 movies
• 2.1 GB of CSV data
Case study: Netflix competition
Data Transformation
• Source data: [100,480,507:3], about 300 million elements
• Data format for TDA: [17,770:480,189] (rows: movies, columns: users), about 8.5 billion elements
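The element counts can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the helper `dense_gib` and the choice of element widths are assumptions, and the raw array size it reports is a lower bound that ignores any working copies made while pivoting.

```python
# Figures quoted on the dataset slide.
ratings, users, movies = 100_480_507, 480_189, 17_770

source_elements = ratings * 3      # long format: three columns per rating
pivot_elements = movies * users    # dense movies x users matrix
density = ratings / pivot_elements # share of cells that hold a rating


def dense_gib(n_elements, bytes_per_element):
    """Raw size of a dense array in GiB (hypothetical helper)."""
    return n_elements * bytes_per_element / 2**30


print(f"source elements: {source_elements:,}")
print(f"pivot elements:  {pivot_elements:,}")
print(f"filled cells:    {density:.2%}")
for width in (4, 8):  # e.g. float32 and float64
    print(f"{width} bytes/element: {dense_gib(pivot_elements, width):.1f} GiB")
```

Only about one cell in a hundred of the pivoted matrix holds a rating, which is why the pivot inflates the data so much.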
Case study: Netflix competition
Data Transformation
Challenges:
• During pivoting we’re transforming 300 million data items into 8.5 billion data items, which requires more than 200 GB of RAM
• My current TDA algorithm implementation has O(log(n)) computational and memory complexity, which makes it even more complicated to compute as is
Case study: Netflix competition
Data Transformation: the solution
• Split the dataset into buckets by range of movie_ids
• Pivot each data bucket (rows: movies, columns: users)
• Learn PCA coefficients on a random subset
• Perform serial executions of PCA on each batch using the previously learned PCA vectors
• Merge the batches into the whole dataset
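The bucketed pipeline can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the implementation used in the talk: it uses NumPy's SVD for the PCA step, treats missing ratings as 0, and the names `pivot_bucket`, `learn_pca` and `project` are hypothetical.

```python
import numpy as np


def pivot_bucket(ratings, movie_ids, n_users):
    """Pivot (movie_id, user_id, rating) triples into a movies x users block."""
    row_of = {m: i for i, m in enumerate(movie_ids)}
    block = np.zeros((len(movie_ids), n_users), dtype=np.float32)
    for movie_id, user_id, rating in ratings:
        if movie_id in row_of:
            block[row_of[movie_id], user_id] = rating  # missing stays 0 (assumption)
    return block


def learn_pca(sample, n_components):
    """Learn the mean and principal directions from a sample of rows."""
    mean = sample.mean(axis=0)
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(block, mean, components):
    """Apply previously learned PCA vectors to one block."""
    return (block - mean) @ components.T


# Toy stand-in for the ratings file: 12 movies, 8 users, about half rated.
rng = np.random.default_rng(0)
n_movies, n_users = 12, 8
ratings = [
    (m, u, int(rng.integers(1, 6)))
    for m in range(n_movies)
    for u in range(n_users)
    if rng.random() < 0.5
]

# Learn PCA coefficients on a random subset of movies.
sample_ids = sorted(rng.choice(n_movies, size=6, replace=False).tolist())
mean, components = learn_pca(pivot_bucket(ratings, sample_ids, n_users), 3)

# Pivot and project each bucket in turn, then merge the reduced batches.
buckets = [range(0, 4), range(4, 8), range(8, 12)]
reduced = np.vstack(
    [project(pivot_bucket(ratings, list(b), n_users), mean, components)
     for b in buckets]
)
print(reduced.shape)
```

Because the projection acts row by row, processing the buckets one at a time gives the same result as projecting the full pivoted matrix, while only one bucket has to fit in memory.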
Case study: Netflix competition
[Figure: map of the Netflix movies with labelled regions: Music, Indian, Anime, French, Hong Kong, US Cartoons, Kids Movie, German, US Retro, Horror]
Case study: Netflix competition
Horror movies example
Case study: Netflix competition
Result comparison: PCA
Case study: Netflix competition
Result comparison: Spectral Embedding
Case study: Netflix competition
Result comparison: Locally-linear embedding (LLE)
Case study: Netflix competition
Result comparison: Hessian LLE
Case study: Netflix competition
Result comparison: Local tangent space alignment (LTSA)
Case study: Netflix competition
Result comparison: TDA with other techniques
[Figure panels: LLE, PCA, LTSA, Hessian LLE, Topological Data Analysis, Spectral Embedding]
Case study: Yelp Dataset Challenge
Sample of our data from the greater Phoenix, AZ metropolitan area including:
• 15,585 businesses
• 111,561 business attributes
• 11,434 check-in sets
• 70,817 users
• 151,516 edge social graph
• 113,993 tips
• 335,022 reviews
http://www.yelp.com/dataset_challenge
Case study: Yelp Dataset Challenge
Data Transformation
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for a hour-day block it will not be in the dict
}
Check-ins: [15,585:168] (rows: businesses, columns: hour-day blocks), about 2.6 million elements
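A check-in record can be flattened into one 168-element row (24 hours x 7 days) along these lines. This is a minimal sketch: the column order (day-major, Sunday first) and the use of NaN for blocks missing from the dict are assumptions, and `checkin_vector` is a hypothetical helper name.

```python
import math

HOURS, DAYS = 24, 7


def checkin_vector(checkin_info):
    """Turn {'<hour>-<day>': count} into a flat list of 168 values.

    Blocks that are absent from the dict stay NaN (assumed convention).
    """
    vec = [math.nan] * (HOURS * DAYS)
    for key, count in checkin_info.items():
        hour, day = (int(part) for part in key.split("-"))
        vec[day * HOURS + hour] = float(count)
    return vec


# Hypothetical record with three populated hour-day blocks.
example = {"0-0": 3, "14-4": 7, "23-6": 1}
row = checkin_vector(example)
print(len(row), row[0], row[4 * HOURS + 14], row[-1])
```

Stacking one such row per business gives the businesses x 168 matrix described on the slide.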
Case study: Yelp Dataset Challenge
Visualisation: All categories
Case study: Yelp Dataset Challenge
Visualisation: Food, Restaurants
Case study: Yelp Dataset Challenge
Visualisation: Shopping
Case study: Yelp Dataset Challenge
Visualisation: Nightlife
Case study: Yelp Dataset Challenge
Visualisation: Beauty & Spas, Active Life
Case study: Yelp Dataset Challenge
Visualisation: cluster examination
Cluster characteristics:
• Tuesday, 2:00 is not NaN (that is, the business has check-ins recorded for that slot)
Case study: Yelp Dataset Challenge
Visualisation: cluster examination
Cluster characteristics:
• More than 35 check-ins every day at 10:00
• Fewer than 17 check-ins every day at 15:00
• Most have the category “Breakfast and brunch”
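Cluster characteristics like these can be written as simple predicates over a business's 168-slot check-in row. The sketch below assumes a day-major layout with Sunday as day 0 and NaN for empty slots; the function names are hypothetical, and the thresholds are the ones quoted on the slides.

```python
import math

HOURS = 24
TUESDAY = 2  # day index in the '<hour>-<day>' keys, with 0 = Sunday


def slot(hour, day):
    """Position of an hour-day block in the flat 168-element row."""
    return day * HOURS + hour


def has_tuesday_2am(vec):
    """First cluster: the Tuesday 2:00 slot is populated."""
    return not math.isnan(vec[slot(2, TUESDAY)])


def is_breakfast_like(vec):
    """Second cluster: busy at 10:00 and quiet at 15:00 on every day.

    NaN slots fail both comparisons in this sketch.
    """
    busy_morning = all(vec[slot(10, d)] > 35 for d in range(7))
    quiet_afternoon = all(vec[slot(15, d)] < 17 for d in range(7))
    return busy_morning and quiet_afternoon


# Hypothetical business: 40 check-ins at 10:00 and 5 at 15:00 on each day.
cafe = [math.nan] * (HOURS * 7)
for d in range(7):
    cafe[slot(10, d)] = 40.0
    cafe[slot(15, d)] = 5.0

print(is_breakfast_like(cafe), has_tuesday_2am(cafe))
```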
Case study: Yelp Dataset Challenge
Result comparison: TDA with other techniques
• PCA: 0.19 sec
• Spectral Embedding: 806 sec
• LLE: 366 sec
• Modified LLE: 1206 sec
• Topological Data Analysis: 275 sec
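A timing comparison of this kind could be reproduced roughly as follows. The slides do not say which implementations were used; this sketch assumes scikit-learn and a small synthetic S-curve rather than the Yelp matrix, so its timings are not comparable to the figures above. The TDA step is left out because it is the author's own implementation.

```python
import time

from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding

# Small synthetic dataset (assumption: stands in for the real check-in matrix).
X, _ = make_s_curve(n_samples=300, random_state=0)

methods = {
    "PCA": PCA(n_components=2),
    "Spectral Embedding": SpectralEmbedding(
        n_components=2, n_neighbors=10, random_state=0
    ),
    "LLE": LocallyLinearEmbedding(
        n_neighbors=10, n_components=2, method="standard", eigen_solver="dense"
    ),
    "Modified LLE": LocallyLinearEmbedding(
        n_neighbors=10, n_components=2, method="modified", eigen_solver="dense"
    ),
}

timings, embeddings = {}, {}
for name, model in methods.items():
    start = time.perf_counter()
    embeddings[name] = model.fit_transform(X)  # 2-D embedding of every point
    timings[name] = time.perf_counter() - start

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {seconds:8.3f} sec")
```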
Live Demo
Links
Topology And Data (Gunnar Carlsson):
http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
Discrete Morse Theory and Persistent Homology (Kevin P. Knudson):
http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian):
http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
Netflix Diagram (3200x3200):
http://datarefiner.com/netflix17770movies.png
Netflix Diagram with movie titles (17000x17000, 86MB):
http://datarefiner.com/netflix17770movies_annotation.png
Please sign up for free beta access:
www.datarefiner.com
info@datarefiner.com
