Building Recommendation Engines Using Lambda Architecture
Address Performance Issues – Reporting and
Portal Latency
Implement Product Recommendation Engine
Implement highly-scalable Big Data solution
RDBMS – Data Migration and Decommissioning
Implement Models for Predictive Analytics
Implement Visualization Platform – Standard Dashboards
Right Approach
Datamatics Solution
Why Is It Required?
Increased traffic with the help of discovery tools
Improved decision-making and better ROI
Deliver right content to right audience
Manage Inventory and Supply efficiently
Enhance Customer Experience
Quick turnaround time (TAT) for data analysis
Content-Based Collaborative filtering
Analytical:
• Reports
• Historical data
• Non-volatile
• Fast retrievals
• Larger volumes of data
Operational:
• Real-time
• Reads/writes/updates
• Current/recent data
• Fast inserts/updates
• Large volumes of data
Approach:
A combination of operational and analytical frameworks in a
distributed environment, rather than a single-machine installation,
would best suit the requirement.
Technology Stack:
A hybrid solution that includes an integration of Hive and HBase.
Solution Requirements
Integration between HDFS/Hive and HBase
HBase: Row-level updates solve data duplication
• HBase is a scale-out table store
• HBase does not allow duplicate rows: each row key is unique, so a write to an existing key updates that row
• Supports a very high rate of row-level updates over massive data volumes
• Allows fast random reads and writes
• Keeps recently updated data in memory
• Incrementally rewrites data to new files
• Splits and merges intelligently as the data distribution changes
Hive: Solves the high-volume and analytics need
• Hive storage is based on Hadoop’s underlying append-only file system architecture
• Ideal for capturing and analyzing streams of events
• Can store event information for hundreds of millions of users
HDFS:
• Handles high volumes of multi-structured data
• Highly scalable at very low cost
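A minimal sketch of the contrast drawn above, using plain Python structures as stand-ins: a list plays the role of an append-only Hive/HDFS file and a dict plays the role of an HBase-style table keyed by row key. No HBase or Hive API is used here, and the event data is invented for illustration.

```python
# Stand-ins only: a list models an append-only file,
# a dict models a table with one row per unique row key.
events = [
    ("user1", "viewed", "itemA"),
    ("user2", "viewed", "itemB"),
    ("user1", "viewed", "itemC"),  # a later event for the same user
]

append_only_log = []  # every event is kept; nothing is overwritten
row_store = {}        # a write to an existing key replaces that row

for user, action, item in events:
    append_only_log.append((user, action, item))
    row_store[user] = {"last_action": action, "last_item": item}

print(len(append_only_log))             # 3: full event history for analytics
print(len(row_store))                   # 2: one current row per user
print(row_store["user1"]["last_item"])  # itemC
```

The log keeps the full history that analytics needs, while the keyed store always holds one current row per key, which is why the two are combined rather than substituted for each other.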
Challenges with HDFS/Hive and HBase
HBase: Row-level updates solve data duplication
• Right design: to ensure high throughput and low latency, the schema has to be designed correctly
Other challenges include
• Data model
• Designing based on derived query patterns
• Row key design
• Manual splitting
• Monitoring compactions
Hive: Solves the high-volume and analytics need
• Keeping the warehouse up to date is difficult
• Has an append-only constraint
• Impossible to directly apply individual updates to warehouse tables
• Optionally, snapshots of live data can be pulled periodically and dumped into a new Hive partition, but this is a costly operation
• This can be done at most once a day, which leaves the data stale in between
HDFS:
• Cannot handle a high velocity of random reads and writes
• Fast random reads require the data to be stored in a structured (ordered) form
• Unable to change a file without completely rewriting it
• The only way to modify a stored file without rewriting it is appending
Hive + Storm + HBase + HDFS
Lambda Architecture
• Lambda architecture is a modern big data architecture framework for distributed data processing that is fault-tolerant against both hardware failures and human errors.
• It also serves a wide range of workloads that need low-latency reads and writes at high throughput.
[Diagram labels: DATA STREAM, BATCH, SPEED, SERVING, QUERY]
Serving layer
• The serving layer is used to index batch views.
• Ad hoc batch views can be queried with low latency
• Hive is used for batch views
• HBase is used for real-time views
• Hive and HBase integration happens in this layer
• Batch views and real-time views are merged in this layer
• Merged views are implemented using HBase
Batch layer
Mainly handles two functions
• Managing the master dataset (an append-only set of raw data)
• Pre-computing arbitrary query functions (batch views)
• Hadoop’s HDFS is used in this layer
Speed layer
• The speed layer deals with recent data only.
• The high latency of updates that the batch layer introduces in the serving layer is addressed in the speed layer.
• Storm is used in this layer.
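The division of work between the three layers can be sketched in a few lines of Python. This is an illustration under assumptions, not the deck's implementation: the function names and the purchase-count view are hypothetical, and plain in-memory counters stand in for Hive (batch views), HBase (real-time views) and Storm (stream processing).

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: precompute a view (purchase count per item)
    from the append-only master dataset."""
    return Counter(item for _user, item in master_dataset)

def update_realtime_view(realtime_view, event):
    """Speed layer: fold one recent event into the real-time view."""
    _user, item = event
    realtime_view[item] += 1

def query(batch_view, realtime_view, item):
    """Serving layer: merge the batch view and the real-time view."""
    return batch_view.get(item, 0) + realtime_view.get(item, 0)

# Hypothetical data: (user, purchased item) pairs.
master_dataset = [("u1", "A"), ("u2", "A"), ("u3", "B")]
batch_view = build_batch_view(master_dataset)

realtime_view = Counter()
for event in [("u4", "A"), ("u5", "C")]:  # arrived after the last batch run
    update_realtime_view(realtime_view, event)

print(query(batch_view, realtime_view, "A"))  # 3
print(query(batch_view, realtime_view, "C"))  # 1
```

When the next batch run absorbs the recent events into the master dataset, the real-time view for that period can be discarded, which keeps the speed layer small.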
Proposed Architecture – Option 1 – Open Source
Features | MapR | Apache
Data Protection | Complete snapshot recovery | Inconsistent snapshot recovery
Security | Encrypted transmission of cluster data; permissions checked on file access | Permissions for users are checked on file open only
Disaster Recovery | Out-of-the-box disaster recovery services with simple mirroring | No standard mirroring solution; scripts are hard to manage and administer
Enterprise Integration | Minimal disruption to the ecosystem while integrating | -
Performance | DirectShuffle technology leverages performance advantages | Apache Hadoop’s NFS cannot read or write to an open file
Scalable without single point of failure | MapR clusters don’t use NameNodes and provide stateful high availability for the MapReduce JobTracker and Direct Access NFS | Needs to be specially configured
Proposed Architecture – Option 2 – MapR Distribution
Data Collection
Input Data Processing (ETL, Filtering)
Recommendation Data Building (Mahout)
Loading Final Data to Serving Layer
Recommendation Serving Layer
Output Data Post-Processing (Re-ordering)
Recommendation Engine – Architecture & Strategy
• Metadata from item
• Normalize the metadata into a feature vector
• Compute distances
  - Euclidean Distance Score
  - Cosine Similarity Score
  - Pearson Correlation Score
• Collaborative Filtering
Item-Based Recommendation
User-Based Recommendation
• Group users into different clusters
• Find representative items for each
cluster
• Graph Traversal
• Highest bought
• Most liked
Recommendation Engine – Strategies
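The three distance scores listed for comparing item feature vectors can be sketched as follows. This is a minimal pure-Python illustration using the standard textbook definitions; the function names and the sample vectors are hypothetical, and it assumes both vectors have the same length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine similarity
    of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)

# Hypothetical normalized feature vectors for two items.
item_a = [1.0, 0.0, 0.5]
item_b = [0.8, 0.1, 0.4]
print(euclidean_distance(item_a, item_b))
print(cosine_similarity(item_a, item_b))
print(pearson_correlation(item_a, item_b))
```

Euclidean distance is sensitive to vector magnitude, while cosine similarity and Pearson correlation compare direction only, so the choice depends on whether the scale of the normalized features is meaningful.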
Construct a co-occurrence matrix (product similarity matrix), S [N × N]
Personalized Recommendation
Based on collaborative filtering
• Build the preference vector P
• Multiply the similarity matrix by the preference vector: R = S × P
• Sort the elements of the resulting vector R
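The steps above can be sketched in plain Python. This is an illustration under assumptions rather than Mahout's implementation: the basket data and function names are hypothetical, the diagonal of S is left at zero, and items the user already has are dropped from the ranking, a filtering step the slide does not mention.

```python
from itertools import permutations

def cooccurrence_matrix(baskets, items):
    """Build S [N x N]: S[i][j] counts the baskets containing both item i and item j."""
    index = {item: i for i, item in enumerate(items)}
    size = len(items)
    S = [[0] * size for _ in range(size)]
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            S[index[a]][index[b]] += 1
    return S

def recommend(S, P, items):
    """Compute R = S x P and rank the items the user has not interacted with."""
    R = [sum(s_ij * p_j for s_ij, p_j in zip(row, P)) for row in S]
    # Assumption: items already in the preference vector are excluded.
    candidates = [(item, score) for item, score, p in zip(items, R, P) if p == 0]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

items = ["A", "B", "C", "D"]
baskets = [["A", "B"], ["A", "B", "C"], ["B", "C"], ["C", "D"]]
S = cooccurrence_matrix(baskets, items)

P = [1, 0, 0, 0]  # preference vector: this user bought only item A
print(recommend(S, P, items))  # [('B', 2), ('C', 1), ('D', 0)]
```

Item B ranks first because it appears together with A in two baskets, C in one, and D in none.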
ItemSimilarityJob
• Class to compute co-occurrence matrix
Algorithms
• Alternating Least Squares
• Singular Value Decomposition
Collaborative Filtering
RecommenderJob
• Main class to generate personalized recommendations
• Input file
• Similarities –
• CoOccurenceCountSimilarity
• TanimotoCoefficientSimilarity
• LogLikelihoodSimilarity
Role of Mahout
[P: Preference matrix, S: Similarity matrix, R: Recommendation matrix]
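Two of the similarity measures named above have simple set-based definitions, sketched here for illustration. This is not Mahout's code: the function names and user sets are hypothetical, and each item is represented by the set of users who interacted with it.

```python
def cooccurrence_count(users_a, users_b):
    """Number of users who interacted with both items."""
    return len(set(users_a) & set(users_b))

def tanimoto_coefficient(users_a, users_b):
    """Shared users divided by all users who interacted with either item."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical interaction sets for two items.
item_x_users = {"u1", "u2", "u3"}
item_y_users = {"u2", "u3", "u4"}
print(cooccurrence_count(item_x_users, item_y_users))    # 2
print(tanimoto_coefficient(item_x_users, item_y_users))  # 0.5
```

The raw co-occurrence count favors popular items, whereas the Tanimoto coefficient normalizes by how many users touched either item.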
business@datamatics.com
