Sometimes, some things work better than other things. MongoDB is great for quick, low-latency access to data; Treasure Data is great as an infinitely scalable historical data store. A lambda architecture is also explained.
3. About Me
• A recovering software engineer turned digital artist
once interested in fractals;
• now into data visualization based on large datasets
rendered directly to GPU (RGL, various Python GL
libraries, etc.)
• it’s easier these days to manipulate large datasets
with limited effort
4. Images courtesy of Edureka, 10gen, MongoDB,
clipart panda and aperfectworld.org
9. “not so strengths” of
MongoDB
• Horowitz was also very honest about where and how
MongoDB is lacking in its current offering – most notably in
terms of integration capabilities and some areas of high
performance.
• “In the relational world you’ve got a few big boxes, in the
MongoDB world you could have 2,000 commodity servers,
so you need really great management tools for that.
That’s a huge problem for us.”
• “The other big thing is automation, where you can have
automation tools that let you manage very large clusters all
from a very simple pane of glass.”
http://diginomica.com/2014/11/10/mongodb-cto-mongo-works-doesnt/
From an interview with MongoDB CTO Eliot Horowitz
11. hmmm…moar “not so strengths” ;)
of MongoDB
• The dreaded “Write Lock”
• https://news.ycombinator.com/item?id=1691748
• http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
- is the data actually relational or not?
• Slideshare “where not to use MongoDB”
• Slideshare “Hive vs. Cassandra vs. MongoDB”
• http://www.slideshare.net/johnrjenson/mongodb-pros-and-cons
• “choosing the right NoSQL database” video:
https://www.youtube.com/watch?t=34&v=gJFG04Sy6NY
• limits of MongoDB: http://docs.mongodb.org/manual/reference/limits/
24. Complementing MongoDB
• Operationally?
• Managing MongoDB is hard (spin up instances
with Mongolab)
• MongoDB (monitoring products: Ops Manager
and MMS)
• What’s your pain? What are you missing?
29. DATA ACCESS > ADVANCED ANALYTICS
• Product analytics: data
access is a major issue.
• “Machine learning” is still
simple and “small” in scale
(can be done inside Python)
• Future work:
productized/operationalized
machine learning
[Diagram labels: Bluetooth; iOS/Android SDK; Fluentd; Python/Pandas SDK; Data Science Team (5 people)]
30. SCHEMA(LESS) COUNTS
• Redshift: lots of co-use
cases
• Event data is semi-
structured → Can be
modeled as JSON but
schemas change
• Treasure Data provides a
SQL-accessible, semi-
structured data lake.
[Diagram labels: email; Source of truth/JSON; More intensive data processing; Hourly/Daily load; Big data mart; More interactive data processing; Ad hoc queries for new data; Ad hoc queries for cached data]
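A minimal sketch of the "semi-structured but SQL-accessible" idea from this slide, assuming nothing about how Treasure Data implements it. Events arrive as JSON with fields that change over time; the sketch takes the union of the keys seen so far and loads the events into an in-memory SQLite table (a stand-in chosen only because it ships with Python), leaving missing fields as NULL. The event fields (`user`, `action`, `recipients`, `client`) are made up for illustration, and the keys are assumed to be trusted since they are used as column names.

```python
import json
import sqlite3

# Hypothetical event log: later events carry fields the earlier ones lack.
raw_events = [
    '{"user": "a", "action": "open"}',
    '{"user": "b", "action": "send", "recipients": 3}',
    '{"user": "a", "action": "send", "recipients": 1, "client": "ios"}',
]
events = [json.loads(line) for line in raw_events]

# Union of all keys seen so far, in first-seen order.
columns = []
for event in events:
    for key in event:
        if key not in columns:
            columns.append(key)

conn = sqlite3.connect(":memory:")
column_sql = ", ".join(f'"{name}"' for name in columns)
conn.execute(f"CREATE TABLE events ({column_sql})")

# Fields an event does not have are stored as NULL.
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    f"INSERT INTO events VALUES ({placeholders})",
    [tuple(event.get(name) for name in columns) for event in events],
)

# Plain SQL over data whose schema was never declared up front.
rows = conn.execute(
    """
    SELECT "user", SUM("recipients")
    FROM events
    WHERE "action" = 'send'
    GROUP BY "user"
    ORDER BY "user"
    """
).fetchall()
print(rows)
```

A real system would also have to handle a new key showing up after the table exists; here the whole batch is known before the table is created.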
31. DATA COLLECTION IS HARD
• Want to assume all data is
on S3 or HDFS, but reality is
murkier.
• Sensor readings available as
email attachments
• Provide data collection tools
for 90% of the use cases.
Have APIs ready for 10%.
[Diagram labels: GH SCADA; email; Parse & transform; Import via REST; Import data as JSON; Analyze via SQL; Query; Results; Data-informed maintenance]
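The "sensor readings as email attachments" case on this slide could be handled roughly as follows. This is a sketch using only the Python standard library: it pulls CSV attachments out of a message and turns each row into a JSON-ready record. The sample message, the column names (`sensor`, `temp_c`) and the function name `attachments_to_records` are assumptions for illustration; the REST import step is left out because the slide does not describe the endpoint.

```python
import csv
import io
import json
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a sample message so the sketch is self-contained (contents are made up).
msg = EmailMessage()
msg["Subject"] = "Sensor readings"
msg.set_content("Readings attached.")
msg.add_attachment(
    "sensor,temp_c\nt1,20.5\nt2,21.0\n", subtype="csv", filename="readings.csv"
)
raw = msg.as_bytes()


def attachments_to_records(raw_message: bytes) -> list:
    """Parse a raw email and return one dict per row of each CSV attachment."""
    parsed = message_from_bytes(raw_message, policy=policy.default)
    records = []
    for part in parsed.iter_attachments():
        if part.get_content_type() != "text/csv":
            continue  # ignore attachments this sketch does not understand
        reader = csv.DictReader(io.StringIO(part.get_content()))
        for row in reader:
            # Transform: convert the reading from text to a number.
            records.append({"sensor": row["sensor"], "temp_c": float(row["temp_c"])})
    return records


records = attachments_to_records(raw)
payload = json.dumps(records)  # JSON body that an import API could accept
print(payload)
```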
33. Some revised scenarios
• Revised scenario 1: Using Treasure Data for
ingestion and analytics; exporting results to
MongoDB for reporting
34. Some revised scenarios
• Revised scenario 2: Ingesting data into MongoDB
and exporting to Treasure Data
35. Treasure Data is good for
some of the same things…
• less overhead in setup
• less - make that practically no - effort to scale
• less overhead/effort to use
• but -> less fine-tuned control over outcome
Just a quick review…
Sharding strategies: Range sharding: documents are divided among shards by ranges of the shard key value (e.g. ranges of device ID)
Hash sharding: MongoDB applies an MD5 hash to the key when a hashed shard key is used
Tag-aware sharding: allows a subset of shards to be tagged and assigned to a sub-range of the shard key
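To make the three strategies concrete, here is a toy sketch of how a router might pick a shard for a given device ID. It is not how MongoDB is implemented: the shard names, range bounds and tags are invented, and the modulo step in `hash_shard` is a simplification (MongoDB assigns ranges of hashed values to shards rather than taking a remainder).

```python
import bisect
import hashlib

# Hypothetical cluster layout; names and boundaries are illustrative only.
SHARDS = ["shard0", "shard1", "shard2"]
RANGE_BOUNDS = [1000, 2000]  # <1000 -> shard0, 1000-1999 -> shard1, >=2000 -> shard2

# Tag-aware: sub-ranges of the key are pinned to the shards carrying a tag.
TAG_RANGES = [((0, 1000), "eu"), ((1000, 3000), "us")]
TAGGED_SHARDS = {"eu": ["shard0"], "us": ["shard1", "shard2"]}


def range_shard(device_id: int) -> str:
    """Range sharding: neighbouring keys land on the same shard."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, device_id)]


def hash_shard(device_id: int) -> str:
    """Hash sharding: an MD5 hash of the key spreads neighbouring keys apart."""
    digest = hashlib.md5(str(device_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[bucket]


def tag_aware_shards(device_id: int) -> list:
    """Tag-aware sharding: return the shards allowed to hold this key."""
    for (low, high), tag in TAG_RANGES:
        if low <= device_id < high:
            return TAGGED_SHARDS[tag]
    return SHARDS  # untagged keys may live anywhere


for device_id in (999, 1000, 2500):
    print(device_id, range_shard(device_id), hash_shard(device_id),
          tag_aware_shards(device_id))
```

Range sharding keeps range queries on one shard but can create hot spots when keys arrive in order; hashing trades that locality for a more even spread.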
The folks at Edureka did a comparative study of different database types.
Breakoff discussion: Let’s talk about what kind of databases we’re using, and for what purposes.
Question to audience: How do Mongo and HBase (Plazma?) fall on the boundary between partition tolerance and consistency? (Might consider leaving this slide out.)
BSON looks like JSON and translates nicely to things like Python dictionaries. Working at the Mongo prompt is easy but requires mastering another API/paradigm.
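A quick illustration of the "looks like JSON, maps to a dictionary" point. The standard-library `json` module stands in for BSON here, which is an assumption made to keep the example dependency-free; BSON itself is a binary format with extra types (dates, ObjectIds) that plain JSON lacks. The document contents are made up.

```python
import json

# A made-up document, written the way it might appear at a prompt.
doc_text = '{"device_id": 42, "readings": [20.5, 21.0], "meta": {"site": "A"}}'

doc = json.loads(doc_text)

# Nested objects become nested dicts, arrays become lists.
site = doc["meta"]["site"]
latest = doc["readings"][-1]
print(type(doc).__name__, site, latest)
```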
What are some other strengths of MongoDB?
If you can lose 5 seconds' worth of updates, a MongoDB replication pair is just fine. If you can lose a day's worth of updates (or can easily reconstruct the database contents from other sources), you can try out pretty much anything without bad repercussions. If you can't lose anything, you're pretty much limited to the most conservative databases (the SQL bunch).
Are there any places where this process could be problematic? One is a failure before the cache is written, during finalize. Another could be a failure during any step of the M-R process.
NOTE: Need transitions on this slide to control how things appear with my story
We start in the app, which generates the logs. The app synchronously logs to a fluentd running on the same host. There’s no network latency and the load on each local fluentd is trivial, so we’ve never had problems with these getting slow.
The local fluentd accepts the logs and buffers them on disk for reliability. Periodically, it flushes those buffers to one of the hosts in our fluentd aggregation tier with at-least-once semantics. These run active/active and can be scaled out linearly.
They also buffer on disk and periodically flush into Hadoop. As an added bonus, they also flush into S3 for backup. This tier gives us an easy to monitor & manage conduit for our logs to flow through without imposing extra costs on the app.
To recap, the logs from our app are buffered by a fluentd on the same host. That reliably forwards to a tier of aggregation fluentds, which forward to Hadoop and S3.
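The buffer-then-forward behaviour described above can be sketched in a few lines. This is not Fluentd's implementation, just an illustration of the at-least-once idea: records go to a buffer file on disk first, and the buffer is only deleted after the downstream send succeeds, so a failed flush is retried with the same records. The class name `BufferedForwarder`, the `send` callback and the sample records are hypothetical, and the sketch is single-threaded.

```python
import json
import os
import tempfile
from typing import Callable, List


class BufferedForwarder:
    """Append records to a buffer file on disk; flush forwards them downstream."""

    def __init__(self, buffer_path: str, send: Callable[[List[dict]], None]) -> None:
        self.buffer_path = buffer_path
        self.send = send  # hypothetical downstream call; raises on failure

    def emit(self, record: dict) -> None:
        # Buffer on disk first so a crash does not lose accepted records.
        with open(self.buffer_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def flush(self) -> bool:
        if not os.path.exists(self.buffer_path):
            return True
        with open(self.buffer_path, encoding="utf-8") as f:
            batch = [json.loads(line) for line in f if line.strip()]
        if not batch:
            return True
        try:
            self.send(batch)
        except Exception:
            # Keep the buffer; the same batch is retried on the next flush.
            return False
        # Only drop the buffer once the downstream tier has accepted it.
        os.remove(self.buffer_path)
        return True


received = []
attempts = {"n": 0}


def flaky_send(batch: List[dict]) -> None:
    """Stand-in for the aggregation tier: fails once, then accepts."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("aggregator unavailable")
    received.extend(batch)


with tempfile.TemporaryDirectory() as tmp:
    forwarder = BufferedForwarder(os.path.join(tmp, "buffer.log"), flaky_send)
    forwarder.emit({"event": "login", "user": 1})
    forwarder.emit({"event": "click", "user": 1})
    first = forwarder.flush()   # send fails, buffer is kept
    second = forwarder.flush()  # retry succeeds, buffer is removed

print(first, second, received)
```

If the process died after a successful send but before the buffer was removed, the batch would be sent again on restart. That duplicate is what "at-least-once" permits, so downstream consumers need to tolerate it.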
The sunk cost fallacy is the idea that your sunk (unrecoverable) costs create barriers to adjusting your future spending. For example, “I’m hungry. Therefore I should eat that egg salad in the fridge (even if it’s gone bad) because I’ve already spent the money on it (rather than going for fresh food).”