When we met with WalkMe, a company that offers helpful in-app walkthroughs (we use it for our app and it’s great), our meeting took a surprising turn. We expected a discussion about crunching big data. They already had a data collection mechanism in place, but their problem preceded any sort of crunching: scaling the data collection process itself. Here's how they solved it.
How to scale your data collection on the cloud like a champ
1. SCALE YOUR DATA COLLECTION
ON THE CLOUD LIKE A CHAMP
Moty Michaely, VP R&D Xplenty
2. SCALING DATA COLLECTION = A PAIN
Plenty of companies are limited by their data collection
methods when it comes to scalability.
Once they need more detailed data and in larger quantities,
scaling the system can become a major pain.
3. THREE COMMON METHODS FOR COLLECTING BIG
DATA... IS YOUR COMPANY USING THE RIGHT ONE?
▪ Storing directly in the DB
▪ Keeping it in a local file
▪ S3/CloudFront logging
4. STORING DIRECTLY IN THE DB
This is what companies usually start with. As the name
suggests, data is inserted right into the DB.
There are two ways to do it:
▪ Row by row means the data is added as a row to the DB in
real time.
▪ Bulk insert adds multiple rows to the DB in one transaction.
(It’s faster than row by row, but if the insertion fails, the entire batch fails with it and a
big chunk of data has to be re-inserted.)
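
To make the trade-off concrete, here is a minimal sketch of both approaches using Python's built-in sqlite3 module. The events table, its columns, and the sample rows are hypothetical, and a production database would behave differently in the details.

```python
import sqlite3

# Hypothetical event table and sample rows, for illustration only.
events = [("click", "alice"), ("view", "bob"), ("click", "carol")]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_type TEXT NOT NULL, username TEXT NOT NULL)"
)

def insert_row_by_row(conn, rows):
    """Commit each row as it arrives, so it is queryable right away."""
    for row in rows:
        conn.execute("INSERT INTO events VALUES (?, ?)", row)
        conn.commit()

def insert_bulk(conn, rows):
    """Insert all rows in one transaction; a failure rolls back the whole batch."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return True
    except sqlite3.Error:
        return False

insert_row_by_row(conn, events)
ok = insert_bulk(conn, events)

# The second row violates NOT NULL, so the first row of this batch is lost too.
bad = insert_bulk(conn, [("click", "dave"), ("view", None)])

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After the failed batch, the table still holds only the six rows from the two successful calls, which is why a failed bulk insert means re-sending the whole batch.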
5. PROS FOR STORING DIRECTLY IN THE DB
▪ Better performance than other methods for inserting data.
▪ Real-time data available when adding row by row.
6. CONS FOR STORING DIRECTLY IN THE DB
▪ Schema changes are required to add new types of data.
▪ Scaling is required in two layers - application and database.
Scaling the application is usually easier (using a network load
balancer for example) but scaling the database requires hiring
an expert DBA, partitioning the DB, and scaling up the server.
(Relational DBs that scale out to multiple nodes are expensive and require a lot of
maintenance.)
8. KEEPING IT LOCAL
Data is dumped into big local files. These files are periodically
uploaded by a program to S3 or inserted in batches into a
NoSQL DB such as Amazon DynamoDB, or a data warehouse
like Amazon Redshift.
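
As a rough sketch of this flow, the code below appends events to one local file per hour and periodically hands every finished file to an uploader. The file naming, the JSON-lines format, and the `upload` callable are all assumptions made for illustration; in practice `upload` would wrap an S3 upload or a batch insert.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def log_path(log_dir, when):
    """One file per hour, e.g. events-2014-03-01-13.log (hypothetical naming)."""
    return Path(log_dir) / f"events-{when:%Y-%m-%d-%H}.log"

def append_event(log_dir, event, when):
    """Append one event as a JSON line to the file for its hour."""
    with open(log_path(log_dir, when), "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def ship_closed_files(log_dir, now, upload):
    """Pass every file except the current hour's to `upload`, then delete it."""
    current_name = log_path(log_dir, now).name
    shipped = []
    for path in sorted(Path(log_dir).glob("events-*.log")):
        if path.name != current_name:
            upload(path)  # hypothetical callable supplied by the caller
            path.unlink()
            shipped.append(path.name)
    return shipped

log_dir = tempfile.mkdtemp()
t1 = datetime(2014, 3, 1, 13, 5, tzinfo=timezone.utc)
t2 = datetime(2014, 3, 1, 14, 10, tzinfo=timezone.utc)
append_event(log_dir, {"type": "click", "user": "alice"}, t1)
append_event(log_dir, {"type": "view", "user": "bob"}, t2)

# Stand-in uploader that just collects file contents in memory.
uploaded = []
shipped = ship_closed_files(
    log_dir, now=t2, upload=lambda p: uploaded.append(p.read_text())
)
```

The sketch leaves out failure handling on purpose: if `upload` raises, the file stays on disk, but retries, partial uploads, and concurrent writers would all need more work.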
9. PROS FOR KEEPING IT IN A LOCAL FILE
▪ New types of data can be added easily since no schema
changes are required.
▪ Compatible with all applications because any file format can
be used.
▪ Quicker filtering via customized directory/file names, e.g.
with date/time indication.
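
For example, if files carry the date and hour in their names (the naming scheme here is an assumption), a day's worth of data can be selected without opening a single file:

```python
from datetime import date

def files_for_day(names, day):
    """Return the file names whose date prefix matches the given day."""
    prefix = f"events-{day:%Y-%m-%d}-"
    return sorted(n for n in names if n.startswith(prefix))

# Hypothetical file names, one per hour.
names = [
    "events-2014-03-01-13.log",
    "events-2014-03-01-14.log",
    "events-2014-03-02-00.log",
]
selected = files_for_day(names, date(2014, 3, 1))
```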
10. CONS FOR KEEPING IT IN A LOCAL FILE
▪ One needs to develop a tracking program to deal with the
files - rotating logs while more data is incoming, handling
failures, and transactionality. Even if you have the manpower,
time, and money, it’s hard to develop such a program.
▪ Scaling means adding more servers, more maintenance, and
more money.
▪ Data is less queryable than it is in a DB.
▪ Staging and production environments require extra servers.
11. BOTTOM LINE
More flexible than direct DB storage, but requires more
development, and scaling is still an issue.
12. S3/CLOUDFRONT LOGGING
This old school solution goes back to the early days when
visitor counters and burning “hot!” animations ruled the web.
To track an event, an HTTP request is sent for a 1x1 pixel image
from a relevant S3 directory. Accessing the image automatically
generates a W3C log with all HTTP request parameters: IP
address, browser, date/time, etc. Extra session level data like
username or mouse position is passed via the query string. To
differentiate between event types, images are placed in
accordingly named directories, e.g. /click/.
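
A minimal sketch of the idea: the event type goes in the directory name and the session data in the query string. The host name and the image file name below are placeholders, not real endpoints.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical distribution host; substitute your own.
BASE_URL = "https://d1234example.cloudfront.net"

def pixel_url(event_type, **params):
    """Build the 1x1 image URL for an event; extra data rides in the query string."""
    return f"{BASE_URL}/{event_type}/pixel.gif?{urlencode(params)}"

url = pixel_url("click", username="alice", x=120, y=48)

# Reading it back, as a log processor would from the logged path and query string.
parts = urlsplit(url)
event_type = parts.path.strip("/").split("/")[0]
data = {key: values[0] for key, values in parse_qs(parts.query).items()}
```

On the page itself, the request is typically fired by pointing an image element at such a URL, so no tracking server of your own is involved.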
13. PROS FOR S3/CLOUDFRONT LOGGING
▪ No tracking server required - data reaches S3 automatically.
▪ No file management - Amazon handles all file monkey
business.
▪ No servers - Amazon provides them.
▪ Cost effective - only log storage and bandwidth are paid for.
The logs take little space since they are all GZipped and the
bandwidth for 1x1 pixel images is marginal.
14. PROS FOR S3/CLOUDFRONT LOGGING
CONTINUED
▪ Easily scalable with practically infinite space and firepower.
▪ Quick and easy to implement.
▪ Simple setup for staging/production environments via
additional distributions and a prefix.
▪ Web application performance unharmed, especially using the
CloudFront CDN.
15. CONS FOR S3/CLOUDFRONT LOGGING
▪ Slower filtering performance compared to local setup. Amazon handles
log file/directory names automatically and no customization is available.
▪ Not suitable for real-time needs or for the impatient. Data is aggregated into a new
file in the bucket only once per hour, and that’s Amazon’s best effort, so
it could take longer.
▪ Data is less queryable than it is in a DB.
▪ Vendor dependent. Having your servers outside of Amazon will
decrease performance.
▪ No control over the file format. W3C Extended Log File Format is
mandatory and some applications may not like that.
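
If an application cannot read the format directly, a small conversion step can bridge the gap. The sketch below turns W3C-style log lines into dictionaries, assuming tab-separated values and a `#Fields:` directive naming the columns. The sample lines are made up and use a reduced set of fields.

```python
def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Assumption: values are tab-separated.
            yield dict(zip(fields, line.split("\t")))

# Made-up sample; real logs carry many more fields.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-uri-stem cs-uri-query",
    "2014-03-01\t13:05:09\t192.0.2.10\t/click/pixel.gif\tusername=alice&x=120",
]
entries = list(parse_w3c_log(sample))
```

From here the entries can be written out as JSON, CSV, or whatever the downstream application expects.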
16. BOTTOM LINE
Quick, cheap, and scalable though it doesn’t provide the best
performance and customization.
17. WHAT’S RIGHT FOR YOU?
So much emphasis is put on the technologies used
for processing, analyzing, and visualizing data that the
importance of collecting it often gets lost in the shuffle.
The two go hand in hand: to get good output from your
data, you must first have proper input.
Only once you have achieved the synergy between the two
will you fully be able to tap into your data’s potential.