SCALE YOUR DATA COLLECTION
ON THE CLOUD LIKE A CHAMP
Moty Michaely, VP R&D Xplenty
SCALING DATA COLLECTION = A PAIN
Plenty of companies are limited by their data collection
methods when it comes to scalability.
Once they need more detailed data and in larger quantities,
scaling the system can become a major pain.
THREE COMMON METHODS FOR COLLECTING BIG
DATA... IS YOUR COMPANY USING THE RIGHT ONE?
▪ Storing directly in the DB
▪ Keeping it in a local file
▪ S3/CloudFront logging
STORING DIRECTLY IN THE DB
This is what companies usually start with. As the name
suggests, data is inserted right into the DB.
There are two ways to do it:
▪ Row by row means the data is added as a row to the DB in
real time.
▪ Bulk insert adds multiple rows to the DB in one transaction.
(It’s faster than row by row, but if the insertion fails, the entire batch fails with it and a
big chunk of data has to be re-inserted.)
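
To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the events table, and the function names are assumptions made for illustration; they are not from the slides. The bulk insert contains one invalid row, so the whole batch is rolled back.

```python
import sqlite3

def insert_row_by_row(conn, events):
    """Insert each event in its own transaction, so it is queryable right away."""
    for event in events:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO events (name, value) VALUES (?, ?)", event)

def insert_bulk(conn, events):
    """Insert all events in one transaction: either every row lands or none does."""
    with conn:
        conn.executemany("INSERT INTO events (name, value) VALUES (?, ?)", events)

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT NOT NULL, value INTEGER NOT NULL)")

insert_row_by_row(conn, [("click", 1), ("view", 2)])

bad_batch = [("click", 3), ("view", None)]  # second row violates NOT NULL
try:
    insert_bulk(conn, bad_batch)
except sqlite3.IntegrityError:
    pass  # the whole batch was rolled back and would have to be re-inserted

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After the failed batch, only the two rows inserted one by one remain in the table.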
PROS FOR STORING DIRECTLY IN THE DB
▪ Faster data insertion than the other methods.
▪ Real-time data available when adding row by row.
CONS FOR STORING DIRECTLY IN THE DB
▪ Schema changes are required to add new types of data.
▪ Scaling is required in two layers - application and database.
Scaling the application is usually easier (using a network load
balancer for example) but scaling the database requires hiring
an expert DBA, partitioning the DB, and scaling up the server.
(Relational DBs that scale out to multiple nodes are expensive and require a lot of
maintenance.)
BOTTOM LINE
Storing directly in the DB gives you fast performance, but it
doesn’t scale.
KEEPING IT LOCAL
Data is dumped in big local files. A program periodically
uploads these files to S3 or inserts them in batches into a
NoSQL DB such as Amazon DynamoDB or a data warehouse
such as Amazon Redshift.
PROS FOR KEEPING IT IN A LOCAL FILE
▪ New types of data can be added easily since no schema
changes are required.
▪ Compatible with all applications because any file format can
be used.
▪ Quicker filtering via customized directory/file names, e.g.
with date/time indication.
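
The first and last points can be sketched together in Python. The directory layout (event type, then year/month/day, then one file per hour) and the JSON-lines format are assumptions chosen for illustration; any naming scheme or file format would do.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def event_log_path(base_dir, event_type, when):
    """Build a path such as <base_dir>/click/2014/01/31/13.log from the event time."""
    return os.path.join(
        base_dir,
        event_type,
        when.strftime("%Y"),
        when.strftime("%m"),
        when.strftime("%d"),
        when.strftime("%H") + ".log",
    )

def append_event(base_dir, event_type, payload, when):
    """Append one event as a JSON line; no schema has to be declared up front."""
    path = event_log_path(base_dir, event_type, when)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")
    return path

base_dir = tempfile.mkdtemp()
when = datetime(2014, 1, 31, 13, 5, tzinfo=timezone.utc)
path = append_event(base_dir, "click", {"user": "alice", "x": 10}, when)
# A new kind of field is added without any schema change.
append_event(base_dir, "click", {"user": "bob", "new_field": True}, when)
```

Filtering by event type or time range then amounts to picking the right directories instead of scanning every file.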
CONS FOR KEEPING IT IN A LOCAL FILE
▪ You need to develop a tracking program to manage the
files: rotating logs while more data is coming in, handling
failures, and ensuring transactionality. Even with the manpower,
time, and money, such a program is hard to build.
▪ Scaling means adding more servers, more maintenance, and
more money.
▪ Data is not as queryable as data stored in a DB.
▪ Staging and production environments require extra servers.
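
As a rough idea of what the rotation part of such a tracking program involves, here is a minimal Python sketch. The ".closed" suffix, the sequence number, and the function names are assumptions for illustration; a real program would also need locking, retries, and upload bookkeeping, which are left out here.

```python
import os
import tempfile

def rotate(active_path, sequence):
    """Rename the active log so the writer starts a fresh file.

    Returns the path of the closed file, or None if there was nothing to rotate.
    """
    if not os.path.exists(active_path) or os.path.getsize(active_path) == 0:
        return None
    closed_path = f"{active_path}.{sequence:06d}.closed"
    # A rename within one filesystem swaps the file in a single step.
    os.replace(active_path, closed_path)
    return closed_path

def pending_uploads(directory):
    """List closed files that are ready to be uploaded or batch-inserted."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".closed"))

log_dir = tempfile.mkdtemp()
active = os.path.join(log_dir, "events.log")

with open(active, "a", encoding="utf-8") as f:
    f.write("event 1\n")
closed = rotate(active, 1)

# New data keeps arriving in a fresh active file while the closed one waits.
with open(active, "a", encoding="utf-8") as f:
    f.write("event 2\n")

uploads = pending_uploads(log_dir)
```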
BOTTOM LINE
More flexible than direct DB storage, but requires more
development, and scaling is still an issue.
S3/CLOUDFRONT LOGGING
This old school solution goes back to the early days when
visitor counters and burning “hot!” animations ruled the web.
To track an event, an HTTP request is sent for a 1x1 pixel image
in the relevant S3 directory. Accessing the image automatically
generates a W3C log entry with all HTTP request parameters: IP
address, browser, date/time, etc. Extra session-level data such as
username or mouse position is passed via the query string. To
differentiate between event types, images are placed in
accordingly named directories, e.g. /click/.
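
A tracking request of this kind can be sketched in Python as follows. The host name tracking.example.com, the file name pixel.gif, and the parameter names are hypothetical and stand in for whatever bucket or distribution is actually used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def tracking_pixel_url(base_url, event_type, params):
    """Build the URL of the 1x1 image for an event, with extra data in the query string."""
    return f"{base_url}/{event_type}/pixel.gif?{urlencode(params)}"

# Hypothetical host and parameters, for illustration only.
url = tracking_pixel_url(
    "https://tracking.example.com",
    "click",
    {"username": "alice", "x": 120, "y": 45},
)

# The directory names the event type; the query string carries the session data.
event_type = urlsplit(url).path.split("/")[1]
extras = parse_qs(urlsplit(url).query)
```

The page only has to request this URL, for example as the source of an image element; nothing else runs on the tracking side.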
PROS FOR S3/CLOUDFRONT LOGGING
▪ No tracking server required - data reaches S3 automatically.
▪ No file management - Amazon handles all file monkey
business.
▪ No servers - Amazon provides them.
▪ Cost effective - only log storage and bandwidth are paid for.
The logs take little space since they are all GZipped and the
bandwidth for 1x1 pixel images is marginal.
PROS FOR S3/CLOUDFRONT LOGGING
CONTINUED
▪ Easily scalable with practically infinite space and firepower.
▪ Quick and easy to implement.
▪ Simple setup for staging/production environments via
additional distributions and a prefix.
▪ Web application performance unharmed, especially using the
CloudFront CDN.
CONS FOR S3/CLOUDFRONT LOGGING
▪ Slower filtering than with local files. Amazon names log files and
directories automatically, and the names cannot be customized.
▪ Not suitable for real-time needs or the impatient. Data is aggregated into a new
file in the bucket only once per hour, and that’s Amazon’s best effort, so
it could take longer.
▪ Data is not as queryable as data stored in a DB.
▪ Vendor dependent. Having your servers outside of Amazon will
decrease performance.
▪ No control over the file format. W3C Extended Log File Format is
mandatory and some applications may not like that.
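
If an application cannot read the format directly, a small converter is one way around it. The sketch below parses W3C Extended Log File Format lines in Python, relying on the "#Fields:" directive to name the columns. The sample lines are made up for illustration and are not real Amazon output; actual logs contain more fields.

```python
from urllib.parse import parse_qs

def parse_w3c_log(lines):
    """Turn W3C extended log lines into dicts keyed by the names in '#Fields:'."""
    fields = []
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Version are skipped
        records.append(dict(zip(fields, line.split())))
    return records

# Made-up sample in the W3C extended format, for illustration only.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query",
    "2014-01-31 13:05:00 192.0.2.10 GET /click/pixel.gif username=alice&x=120&y=45",
]

records = parse_w3c_log(sample)
event = records[0]
event_type = event["cs-uri-stem"].split("/")[1]  # directory name = event type
extras = parse_qs(event["cs-uri-query"])         # session-level data
```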
BOTTOM LINE
Quick, cheap, and scalable, though it doesn’t offer the best
performance or customization.
WHAT’S RIGHT FOR YOU?
So much emphasis has been put on the technologies used
for processing, analyzing, and visualizing data. But the
importance of collecting that data often gets lost in the
shuffle. The two go hand in hand. To get
good output from your data, you must first have proper
input.
Only once the two work together
will you be able to fully tap into your data’s potential.
XPLENTY
WWW.XPLENTY.COM
