Learn about recent advances in MongoDB in the area of In-Memory Computing (Apache Spark Integration, In-memory Storage Engine), and how these advances can enable you to build a new breed of applications, and enhance your Enterprise Data Architecture.
2. In-Memory Computing
How can we process data as fast as possible
by leveraging in-memory speed at its best?
What are the possibilities if we could?
3. High-frequency trading (HFT) is a program trading platform that uses
powerful computers to transact a large number of orders at very fast
speeds. It uses complex algorithms to analyze multiple markets and
execute orders based on market conditions.
Typically, the traders with the fastest execution speeds are more
profitable than traders with slower execution speeds.
Source: Investopedia
Speed Matters…
4. Speed Matters…
Amazon found that it increased revenue by 1% for every 100ms of
improvement [source: Amazon]
A 1-second delay in page load time equals 11% fewer page views,
a 16% decrease in customer satisfaction, and 7% loss in
conversions. [Source: Aberdeen Group]
A study found that 27% of the participants who did mobile shopping
were dissatisfied due to the experience being too slow. [Source:
Forrester Consulting]
5. How Fast?
Latency      Unit         Normalized to 1 s
RAM access   100s of ns   ~6 min
SSD access   100s of µs   ~6 days
HDD access   10s of ms    ~12 months
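To make the gap concrete, the ratios behind the table can be checked in a few lines (the latency figures are the order-of-magnitude values above, not measurements):

```python
# Representative access latencies in seconds (order-of-magnitude, per the table).
ram_s = 100e-9  # hundreds of nanoseconds for RAM
ssd_s = 100e-6  # hundreds of microseconds for SSD
hdd_s = 10e-3   # tens of milliseconds for HDD

# Relative slowdown versus RAM access.
print(f"SSD: ~{round(ssd_s / ram_s):,}x slower than RAM")  # ~1,000x
print(f"HDD: ~{round(hdd_s / ram_s):,}x slower than RAM")  # ~100,000x
```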
8. "This will process these data using algorithms for machine
learning and artificial intelligence before sending the data
back to the car.
The zFAS board will in this way continuously extend its
capabilities to master even complex situations increasingly
better," Audi stated. "The piloted cars from Audi thus learn
more every day and with each new situation they
experience.”
Source: T3.com
The possibilities…
11. Challenges: Cost Viability
Storage Type   Avg. Cost ($/GB)   Cost at 100 TB ($)
RAM            5.00               500K
SSD            0.47-1.00          47K to 100K
HDD            0.03               3K
http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
http://www.myce.com/news/ssd-price-per-gb-drops-below-0-50-how-low-can-they-go-70703/
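The 100 TB column follows directly from the per-GB averages; a quick check of the arithmetic (using 1 TB = 1,000 GB):

```python
# Approximate cost per GB in USD, from the table above.
cost_per_gb = {"RAM": 5.00, "SSD": 1.00, "HDD": 0.03}
capacity_gb = 100 * 1_000  # 100 TB

for medium, price in cost_per_gb.items():
    print(f"{medium}: ${round(price * capacity_gb):,} at 100 TB")
# RAM: $500,000 at 100 TB
# SSD: $100,000 at 100 TB
# HDD: $3,000 at 100 TB
```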
12. Challenges: Durability
Volatile Memory
• What happens when things fail,
and what data may be lost?
• How does the system synchronize
with your durable storage? Does it
do this well, and is it simple to
implement?
15. Scenario : ECommerce Modernization
Initiative
Business problem: Customer experience is suffering during high-traffic events.
Technology limitation: Too expensive to scale the system to support spike events; scaling is hard, and engineering teams can't react fast enough to unexpected growth. A caching solution is in place, but it mostly helps only with read performance; synchronizing writes has been a development nightmare.
Business problem: Lack of mobile customers in Europe and Asia has been attributed to latency issues.
Technology limitation: Difficult to extend the data architecture globally, so the effort is on hold.
16. Scenario : ECommerce Modernization
Initiative
Business problem: Below-industry conversion rate performance has been attributed partly to poor personalization.
Technology limitation: Customer info is siloed across the Enterprise, and it's too complicated to bring this data together so effective models can be built to drive personalization. A "Big Data" project to consolidate data for machine learning and cognitive capabilities in the platform failed; data scientists reported the platform was too slow to develop on and its performance was impractical.
Business problem: Business analysts have siloed views of the eCommerce channel, and information isn't getting to them fast enough.
Technology limitation: Related to the limitations above; integrating data into the data warehouse is slow and hard to maintain.
17. Scenario : ECommerce Modernization Initiative
[Architecture diagram: a Platform API fronting platform services and the eCommerce datastores (Orders, Product Catalog, Inventory, and Customer Data: profile, sessions, carts, personalization) on NoSQL/RDBMS, alongside dependent external data sources and integrations: CRM, ERP, PIM, data warehouse, BI tools, and more.]
23. Still Agile, Scalable and Simple
Flexible Data Model: facilitates agile development and continuous delivery methodologies
Scalability: scale out dynamically as demand grows
24. In-Memory Storage Engine
High Performance:
• More predictable, lower latency on less in-memory infrastructure.
Infrastructure Optimization:
• Assign a data subset to the In-Memory SE via Zone Sharding.
• Optimize cost vs. performance without silos.
Rich Query Capability:
• Full MongoDB query and indexing support.
[Diagram: In-Memory SE nodes alongside WiredTiger nodes]
25. Session data geographically localized, with In-Memory engine latency; local reads and writes with strong consistency.
[Diagram: WEST and EAST regions. Shard 1 (tag: WEST, IN_MEM), Shard 2 (tag: WEST, WT), Shard 3 (tag: EAST, IN_MEM), Shard 4 (tag: EAST, WT); updates are routed to the local region's shards.]
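A layout like this can be expressed with MongoDB zone sharding from the mongo shell. The sketch below is illustrative; the shard names, zone names, namespace, and shard key are all hypothetical:

```javascript
// Pin each (hypothetical) shard to a zone.
sh.addShardToZone("shard1", "WEST_IN_MEM"); // in-memory SE nodes, west
sh.addShardToZone("shard2", "WEST_WT");     // WiredTiger nodes, west
sh.addShardToZone("shard3", "EAST_IN_MEM"); // in-memory SE nodes, east
sh.addShardToZone("shard4", "EAST_WT");     // WiredTiger nodes, east

// Route session documents by region so reads and writes stay local,
// landing hot session data on the in-memory shards.
// Assumes "sessions" is sharded on { region: 1, _id: 1 }.
sh.updateZoneKeyRange("ecommerce.sessions",
  { region: "WEST", _id: MinKey }, { region: "WEST", _id: MaxKey },
  "WEST_IN_MEM");
sh.updateZoneKeyRange("ecommerce.sessions",
  { region: "EAST", _id: MinKey }, { region: "EAST", _id: MaxKey },
  "EAST_IN_MEM");
```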
26. In-Memory Storage Engine
Durability and Fault Tolerance:
• Mixed replica sets allow data to be replicated from the In-Memory SE to the WiredTiger SE.
• Full high availability: automatic failover, across geographies.
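Turning a node into an In-Memory SE member is a mongod configuration choice (MongoDB Enterprise). A minimal sketch; the 16 GB size is an arbitrary example:

```yaml
# mongod.conf -- this member keeps all data in RAM.
storage:
  engine: inMemory
  inMemory:
    engineConfig:
      # Upper bound on in-memory data; writes error out once it is exhausted.
      inMemorySizeGB: 16
```

Pairing members like this with WiredTiger members in the same replica set gives the mixed topology above: the in-memory member serves low-latency traffic while replication to WiredTiger members provides durability.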
27. Operational Unified View for Advanced Personalization
[Diagram: platform databases (NoSQL/RDBMS) with dependent external data sources and integrations: CRM, ERP, PIM, partner sources such as supplier databases, and legacy mainframe systems.]
1. Train/re-train ML models
2. Apply models to a real-time stream of interactions
3. Drive targeted content, recommendations, etc.
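The three steps above can be sketched end-to-end with a deliberately tiny stand-in for a real model (the event shape, field names, and counting "model" are all invented for illustration):

```python
# 1. Train/re-train: a toy "model" that just counts interactions per category.
def train(history):
    model = {}
    for event in history:
        model[event["category"]] = model.get(event["category"], 0) + 1
    return model

# 2. Apply the model to a real-time stream of interactions.
def score_stream(model, stream):
    for event in stream:
        model[event["category"]] = model.get(event["category"], 0) + 1
        # 3. Drive targeted content: recommend the strongest category so far.
        yield max(model, key=model.get)

history = [{"category": "shoes"}, {"category": "shoes"}, {"category": "hats"}]
live = [{"category": "shoes"}, {"category": "hats"}]
for rec in score_stream(train(history), live):
    print(rec)  # "shoes" both times (shoes leads 3-1, then 3-2)
```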
28. Why Spark?
Speed. By exploiting in-memory optimizations, Spark
has shown up to 100x higher performance than
MapReduce running on Hadoop.
Simplicity. Easy-to-use APIs for operating on large
datasets. This includes a collection of sophisticated
operators for transforming and manipulating
semi-structured data.
Unified Framework. Packaged with higher-level libraries,
including support for SQL queries, machine learning,
stream and graph processing. These standard libraries
increase developer productivity and can be combined to
create complex workflows.
29. Operational Single View
+Spark Connector
• Native Scala connector,
certified by Databricks
• Exposes all Spark APIs &
libraries
• Efficient data filtering
with predicate
pushdown, secondary
indexes, & in-database
aggregations
• Locality awareness to
reduce data movement
31. Operational Single View
+Spark Connector
Blend client data from multiple
internal and external sources to
drive real time campaign
optimization
32. MongoDB+Spark at China Eastern
180m fare calculations & 1.6
billion searches per day
Their Oracle database peaked at 200
searches per second.
They radically re-architected their
fare engine to meet the required
100x growth in search traffic.
35. Resources for You
Spark Connector
• Download: Spark Packages
GitHub
• Documentation
• Whitepaper:
Turning Analytics into Real-Time
Action
• Education: M233: Getting
Started with Spark and
MongoDB
In-Memory Storage Engine
• Download: Enterprise Server
• Documentation
BI Connector
• Download: BI Connector
• Documentation
Put simply, there are two big questions that I think define and drive in-memory computing:
How can we process data as fast as possible by leveraging in-memory speed at its best?
Secondly, what are the possibilities if we could?
Why do we care about speed? It matters in a lot of cases…
In the Financial world, it matters in areas like High Frequency trading, which is estimated to account for 50-70% of trades in the past 5 years.
HFT platforms transact a large number of orders at very fast speeds, and often use complex algorithms to analyze multiple markets and market conditions
Typically, the traders with the fastest execution speeds are more profitable than traders with slower execution speeds.
Research by Enterprises and Analysts correlating performance, online experiences and revenue are well documented. I list a few here from some Analysts and Amazon, but there are other public studies from Google and Walmart demonstrating the same
Well known study by Aberdeen Group discovered: A 1-second delay in page load time equals 11% fewer page views, a 16% decrease in customer satisfaction, and 7% loss in conversions
Translated to dollars: if your business earns just $100,000 a day, this equates to $2.5M in potential sales annually.
– faster is better. Slow online experiences translate to lost opportunities and we as users and consumers can relate.
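The $2.5M figure above is just the Aberdeen conversion number annualized:

```python
daily_revenue = 100_000   # dollars per day
conversion_loss = 0.07    # 7% fewer conversions from a 1-second delay

annual_loss = daily_revenue * conversion_loss * 365
print(f"~${round(annual_loss):,} in potential sales annually")  # ~$2,555,000
```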
So, how fast is in-memory?
Here are the rough units that best measure data access times across different storage mediums.
If we normalize to 1s, it is clear that the magnitude in speed is drastic between RAM and even fast SSD storage.
Some may already be nodding their heads… RAM isn’t new technology, and we’re aware that the price of RAM has dropped drastically over the decade.
By 2010, the sharp decline in average cost has made RAM “generally affordable” for mainstream use; however, it is far from cheap especially when we consider the data volumes that we work with today.
However, prices continue to fall, and an average price of $4.37/GB in 2015 makes RAM an option even at scale for greenfield projects that need the speed.
IoT is certainly not a space short of innovation and possibilities, and the ability to scale in-memory performance only makes those possibilities more exciting.
I came across an article where Audi discusses plans for their connected self-driving car. They intend to send data collected from the car's various sensors back to the cloud, leverage ML to process it, and send the results back to the car so that it can learn and better adapt to complex situations.
“…machine learning it will mean adverse weather conditions, such as snow, which can affect sensors will be less of a problem as cars will have a thorough understanding of the piece of tarmac it is traversing”
Consider the future, the scale of every vehicle on the road, the amount of data collected that needs to be processed. In-memory computing solutions will be needed to process big data fast especially in the world of smart cars where information will drive important decisions in real-time.
Despite the significant increase in the amount of RAM you can put in a single server over the past couple of years, there are still limits, and the data volumes we work with continue to grow due to the types of applications we build and the types of data sources we analyze and mine.
For many organizations, the bulk of workloads are being moved to or are in the cloud, and the ability to scale on cloud infrastructure is critical.
The ability to scale out and fit large datasets in RAM across servers is critical; if not for data volume, then for the compute needed to support large-scale services in the cloud.
We previously discussed how costs have dropped dramatically, and while RAM is an option at scale, it can still be cost-prohibitive for certain projects.
Consider AWS's X1 instance. It impressively provides nearly 2 TB of RAM, but at a hefty price: at a scale of 100 TB, $1.74M just for infrastructure isn't an option for certain projects.
The question is, does the problem really require having all your data in RAM?
While memory is magnitudes faster than other storage mediums, the difference in relative cost is also significant.
With that said, in-memory solutions shouldn't be designed around needing your Enterprise data architecture, or even your application, to run entirely in-memory. The value of the data and the problem you're solving should dictate the right medium, and an in-memory solution should seamlessly integrate into an Enterprise Data Architecture that supports all storage mediums.
Generally, when we talk about memory we refer to what is readily available: volatile memory. If your server goes down, the data stored in that server's RAM is lost unless it has also been put on durable storage like disk.
Trading off data-loss for speed, in most use cases, isn’t acceptable.
A good in-memory solution needs to provide fault tolerance and synchronize with durable storage, and, just as importantly, it must do so simply and reliably (which often isn't the case for some solutions, like external distributed caches).
As fast as RAM is, it doesn’t remedy bad design.
More importantly, any in-memory computing technology shouldn't introduce new bottlenecks into the architecture, or limit your data architecture's ability to address the biggest performance bottlenecks in your system.
For instance:
Does your in-memory computing solution require you to move large volumes of data around? If so, is that creating bottlenecks in other ways?
How does your solution bring data into RAM? Is there an efficient caching algorithm, and is relevant data selected and filtered efficiently?
How is your data being processed in RAM? Is there an efficient algorithm? Is it introducing inefficiencies and new performance bottlenecks by shuffling data unnecessarily across a distributed system?
So now that we understand the challenges and core requirements around introducing in-memory technologies into your Enterprise Data Architecture, let's look at how MongoDB fits into the big picture and what it can offer in this area.
Let's home in on the product catalog and customer session management parts of the system, where the problem is clearest.
The customer session management component is key to driving customer experience features like personalization, and effective personalization needs to be based on a full picture of the customer. Realistically, in an Enterprise, customer touch points and information are siloed across many systems, and rarely is there one place where an operational system can get everything it needs to know about the customer.
Likewise, with the Product Catalog, information about products will be siloed. Perhaps some info is stored within the eCommerce platform, but it likely has to be synchronized with external systems like PIMs and supplier systems. Additionally, a modern platform should keep availability up to date as part of product search, so problems aren't caused downstream in order fulfillment.
Finally, the business analysts will also need to analyze the same data sources.
Consolidating these systems isn't realistic.
Integration is necessary, and ideally it shouldn’t involve heavy redundancy; for instance, across operational and BI environments.
Federated data access of these systems isn’t an option on many fronts due to performance and scale.
Sufficient integration of data into the DW via traditional ETL is a huge effort and likely too slow to make happen.
This component would be well served by MongoDB, and in fact, is one of the most common use cases for MongoDB.