In this talk, we describe using Redis, an open-source, in-memory key-value store, to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying the database for real-time monitoring and analytics.
* See more of my work at http://www.codehenge.net
3. Overview
• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
– Data Collection
– Data Modeling
– Data Access
– Event Monitoring and Notification
• Conclusions and Future Work
5. Why?
Stuxnet
• Two major components:
  1) Send centrifuges spinning wildly out of control
  2) Record ‘normal operations’ and play them back to operators during the attack 1
• Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound
1 http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
6. The Broader Vision
Quick, flexible out-of-band monitoring
• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
7. The Platform
A CMU research project called Sensor Andrew
• Features:
  – Open-source sensor platform
  – Scalable and generalist system supporting a wide variety of applications
  – Extensible architecture
• Can integrate diverse sensor types
9. Sensor Andrew Overview
[Diagram: Nodes → Gateways → Server → End Users]
10. What is a Node?
A node collects data and sends it to a collector, or gateway
• Environment Node sensors: Light, Audio, Humidity, Pressure, Motion, Temperature, Acceleration
• Power Node sensors: Current, Voltage, True Power, Energy
• Radiation Node sensors: Alpha particle count per minute
• Particulate Node sensors: Small Part. Count, Large Part. Count
11. What is a Gateway?
• A gateway receives UDP data from all nodes registered to it
• An internal service:
  – Receives data continuously
  – Opens a server on a specified port
  – Continually transmits UDP data over this port
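The receive path above can be sketched in a few lines. This is a minimal loopback demo, not the production service; the `node_id:sensor:value` wire format and all names here are hypothetical stand-ins, since the deck does not specify the real packet encoding.

```python
import socket

def parse_packet(data: bytes):
    """Split a hypothetical 'node_id:sensor:value' datagram."""
    node_id, sensor, value = data.decode().split(":")
    return node_id, sensor, float(value)

# Demo on loopback: one "node" datagram sent to the gateway socket.
gateway = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
gateway.bind(("127.0.0.1", 0))          # port 0 = any free ephemeral port
node = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
node.sendto(b"node42:motion:0.87", gateway.getsockname())

packet, _ = gateway.recvfrom(1024)
print(parse_packet(packet))             # ('node42', 'motion', 0.87)
node.close(); gateway.close()
```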
12. Requirements
We need to:
1. Collect data from nodes once per second
2. Scale to 100 gateways each with 64 nodes
3. Detect events in real-time
4. Notify users about events in real-time
5. Retain all data collected for years, at least
14. What Is Big Data?
“When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”
21. Collecting Data
Problem: Store and retrieve immense amounts of data at a high rate.
Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure.
[Diagram: Gateway emitting 8 GB / hour → ?]
22. We Tried PostgreSQL…
• Advantages:
  – Reliable, tested and scalable
  – Relational => complex queries => analytics
• Problems:
  – Performance problems reading while writing at a high rate; real-time event detection suffers
  – ‘COPY FROM’ doesn’t permit horizontal scaling
24. Q: How can we decrease I/O load?
A: Read and write collected data directly from memory
25. Enter Redis
Redis is an in-memory NoSQL database, commonly used as a web application cache or pub/sub server
26. Redis
• Created in 2009
• Fully in-memory key-value store
– Fast I/O: R/W operations are equally fast
– Advanced data structures
• Publish/Subscribe Functionality
– In addition to data store functions
– Separate from stored key-value data
27. Persistence
• Snapshotting
  – Data is asynchronously transferred from memory to disk
• AOF (Append Only File)
  – Each modifying operation is written to a file
  – Can recreate data store by replaying operations
  – Without interrupting service, Redis can rewrite the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
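Both persistence modes above are enabled in `redis.conf`. The directives are real Redis configuration options; the thresholds shown are illustrative values, not the ones used in this project.

```
# redis.conf: persistence (illustrative thresholds)
save 900 1                        # snapshot if >= 1 key changed in 900 s
save 60 10000                     # snapshot if >= 10000 keys changed in 60 s
appendonly yes                    # enable the AOF
appendfsync everysec              # fsync the AOF once per second
auto-aof-rewrite-percentage 100   # rewrite when AOF doubles in size...
auto-aof-rewrite-min-size 64mb    # ...and is at least 64 MB
```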
28. Replication
• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
– Slaves are writeable!
– Potential for data inconsistency
• Fully compatible with Pub/Sub features
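Replication is also configured in `redis.conf` on the replica. The writable-slave pitfall noted above can be closed with one directive (the host/port below are placeholders; older Redis versions use `slaveof` instead of `replicaof`):

```
# replica's redis.conf (placeholder master address)
replicaof 10.0.0.1 6379
# Guard against the writable-slave inconsistency noted above:
replica-read-only yes
```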
29. Redis Features: Advanced Data Structures
• List: [A, B, C, D]
• Set: {A, B, C, D}
• Sorted Set ({value:score}): {C:1, D:2, A:3, B:4}
• Hash ({key:value}): {field1:“A”, field2:“B”, field3:“C”, field4:“D”}
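The four structures can be sketched with pure-Python analogues, annotated with the Redis command that would build each one; no server is needed for this illustration, and the key names are made up.

```python
# Pure-Python analogues of the four Redis structures on this slide.

mylist = ["A", "B", "C", "D"]                       # RPUSH mylist A B C D
myset = {"A", "B", "C", "D"}                        # SADD myset A B C D
myzset = {"C": 1, "D": 2, "A": 3, "B": 4}           # ZADD myzset 1 C 2 D 3 A 4 B
myhash = {"field1": "A", "field2": "B",
          "field3": "C", "field4": "D"}             # HSET myhash field1 A ...

# A sorted set is kept ordered by score, not by insertion order:
by_score = sorted(myzset, key=myzset.get)           # ZRANGE myzset 0 -1
print(by_score)                                     # ['C', 'D', 'A', 'B']
```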
31. Constraints
Our data store must:
– Hold time-series data
– Be flexible in querying (by time, node, sensor)
– Allow efficient querying of many records
– Accept data out of order
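The constraints above point at one natural shape: a sorted set per node+sensor, scored by timestamp, so records can arrive out of order yet still be range-queried by time. This is a pure-Python sketch of that idea (the key schema and a dict standing in for Redis are assumptions for illustration); embedding the timestamp in the member string also keeps identical readings from colliding.

```python
store = {}  # stand-in for Redis: key -> {member: score}

def zadd(key, member, score):
    """Mimic ZADD: one scored member in a per-key sorted set."""
    store.setdefault(key, {})[member] = score

def zrangebyscore(key, lo, hi):
    """Mimic ZRANGEBYSCORE: members with lo <= score <= hi, score order."""
    members = store.get(key, {})
    return [m for m in sorted(members, key=members.get)
            if lo <= members[m] <= hi]

# Out-of-order arrival: the t=1002 reading lands before t=1001.
zadd("node42:motion", "1002:0.10", 1002)
zadd("node42:motion", "1001:0.95", 1001)
zadd("node42:motion", "1005:0.40", 1005)

print(zrangebyscore("node42:motion", 1000, 1003))  # ['1001:0.95', '1002:0.10']
```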
32. Tradeoffs: Efficiency vs. Flexibility
One record per timestamp (motion, light, audio, temperature, pressure, humidity, and acceleration stored together) vs. one record per sensor data type.
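A back-of-envelope comparison makes the tradeoff concrete for one environmental node reporting seven values per second. The byte sizes below are assumptions for illustration, not measurements from the real system.

```python
sensors = ["motion", "light", "audio", "temperature",
           "pressure", "humidity", "acceleration"]
meta_bytes = 24      # hypothetical: timestamp + node type + MAC per record
value_bytes = 8      # hypothetical encoded reading

# One record per timestamp: a single insert carrying all seven values.
per_timestamp_inserts = 1
per_timestamp_bytes = meta_bytes + len(sensors) * value_bytes   # 80

# One record per sensor data type: seven inserts, metadata repeated.
per_sensor_inserts = len(sensors)
per_sensor_bytes = len(sensors) * (meta_bytes + value_bytes)    # 224

print(per_sensor_inserts, per_sensor_bytes / per_timestamp_bytes)  # 7 2.8
```

The per-sensor model buys flexible querying (fetch ‘motion’ without ‘audio’) at the cost of seven times the inserts and nearly triple the storage under these assumed sizes.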
45. We could throw away data…
• If we only cared about current values
• However, our data
  – Must be stored for 1+ years for compliance
  – Must be able to be queried for historical/trend analysis
46. We Still Need Long-term Data Storage
Solution? Migrate data to an archive with expansive storage capacity
49. Yes, Winning
[Diagram: Gateway → Redis → Archiver → PostgreSQL, with an API serving Some Happy Client]
50. Best of Both Worlds
• Redis allows quick access to real-time data, for monitoring and event detection
• PostgreSQL allows complex queries and scalable storage for deep and historical analysis
[Diagram: Gateway → Redis → Archiver → PostgreSQL, behind an API]
51. We Have the Data, Now What?
Incoming data must be monitored and analyzed, to detect significant events
• What is “significant”?
• What about new data types?
54. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; App DB ↔ Django App]
New guy: provide a way to read the data and create rules, e.g. motion > x && pressure < y && audio > z
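A rule like “motion > x && pressure < y && audio > z” can be stored as data and evaluated generically. This is a minimal sketch of that idea; in the real system the rules would live in the app database, and all names and thresholds here are hypothetical.

```python
# Each rule is a list of (sensor, operator, threshold) clauses.
OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b}

def matches(rule, reading):
    """True only when every clause in the rule holds for this reading."""
    return all(OPS[op](reading[sensor], threshold)
               for sensor, op, threshold in rule)

rule = [("motion", ">", 0.5), ("pressure", "<", 101.0), ("audio", ">", 0.2)]

print(matches(rule, {"motion": 0.9, "pressure": 100.2, "audio": 0.6}))  # True
print(matches(rule, {"motion": 0.9, "pressure": 102.0, "audio": 0.6}))  # False
```

Keeping rules as data, rather than code, is what lets users create them through the web app without redeploying the event monitor.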
55. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; Event Monitor; App DB ↔ Django App]
New guy: Event Monitor reads the rules and data, triggers alarms when all conditions are true (motion > x, pressure < y, audio > z)
56. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; multiple Event Monitors; App DB ↔ Django App]
Event monitor services can be scaled independently
58. Getting The Message Out
Considerations
• Event monitor already has a job; avoid re-tasking it as a notification engine
• Notifications most efficiently should be a “push” instead of needing to poll
• Notification system should be generalized, e.g. SMTP, SMS
62. Pub/Sub with synchronized workers is an optimal solution to real-time event notifications.
No need to add another system, Redis offers pub/sub services as well!
[Diagram: Gateway → Redis Data + Redis Pub/Sub → Archiver → API → PostgreSQL; Event Monitor → Notification Workers → SMTP; App DB ↔ Django App]
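The fan-out shape above can be modeled in-process: the event monitor publishes once, and every notification worker gets the message pushed to it rather than polling. This is a sketch of the pattern only, using stdlib queues in place of Redis PUBLISH/SUBSCRIBE; worker transports (SMTP, SMS) are placeholders.

```python
import queue
import threading

subscribers = []          # one inbox per notification worker

def subscribe():
    """Mimic SUBSCRIBE: register an inbox for this worker."""
    inbox = queue.Queue()
    subscribers.append(inbox)
    return inbox

def publish(message):
    """Mimic PUBLISH: push the message to every subscriber."""
    for inbox in subscribers:
        inbox.put(message)

sent = []
def worker(inbox, transport):
    msg = inbox.get()     # blocks until a message is pushed (no polling)
    sent.append((transport, msg))

t1 = threading.Thread(target=worker, args=(subscribe(), "SMTP"))
t2 = threading.Thread(target=worker, args=(subscribe(), "SMS"))
t1.start(); t2.start()
publish("motion alarm on node42")
t1.join(); t2.join()
print(sorted(sent))
```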
63. Conclusions
• Redis is a powerful tool for collecting large amounts of data in real-time
• In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system
• Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
64. Things to watch
• Data persistence
  – If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory 1
  – Redis is unresponsive during startup
1 http://oldblog.antirez.com/post/redis-persistence-demystified.html
65. Future Work
• Improve scalability through:
– Data encoding
– Data compression
– Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics
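The “data compression” item above is cheap to prototype with stdlib zlib on a batch of encoded readings; repetitive sensor strings compress well. The record format below is a hypothetical stand-in, and the ratio is illustrative, not measured on the real data.

```python
import zlib

# A batch of 1000 hypothetical encoded readings for one node/sensor.
batch = "\n".join(
    f"node42:motion:{i % 10 / 10:.2f}" for i in range(1000)
).encode()

compressed = zlib.compress(batch, level=6)
print(len(compressed) < len(batch))   # True: repetitive text shrinks a lot
```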
66. Acknowledgements
• Project engineers Chris Taschner and Jeff
Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab
http://wise.ece.cmu.edu/
• Our organizations
CMU https://www.cmu.edu
CERT http://www.cert.org
SEI http://www.sei.cmu.edu
Cylab https://www.cylab.cmu.edu
Welcome. <Introductions, who we are, where we’re from>
AARON LAST SLIDE
Let’s start with some background. We’ve been working with a CMU research group on applications of a research project called Sensor Andrew. The vision of Sensor Andrew is to provide a generalized environmental sensor network, capable of being leveraged for a wide variety of applications, both academic and commercial.
START TIM
A Sensor Andrew system consists primarily of nodes, like this <hold one up if possible>, each of which contains a variety of embedded sensors, and a gateway with a specialized receiver, allowing it to receive wireless messages from each of up to 64 nodes concurrently. Our collaborators have provided hardware design and gear, firmware on all embedded components, and some baseline software to work from when interfacing with the hardware systems.
Let’s look at some more detail on the type of data we are collecting. We currently have two types of nodes, environmental and power nodes <show samples>. Environmental nodes can be set anywhere, and will detect measures of light, audio, humidity, pressure, motion, temperature, and acceleration (in x,y,z components) relative to the environment immediately surrounding the node. Power nodes must be plugged in to a wall outlet, with a current-drawing device plugged into them to draw power. This allows the power node to detect and transmit numerous measures of data involving current, voltage, power, etc. Data is transmitted from the nodes in UDP format. For reference, an environmental data packet is ____ bytes in size, and a power data packet is ____ bytes.
Packets are UDP and the information is stored as an encoded string, so the network load is already pretty small. Compression, in addition to the encoding of the data, might be an option in the future, but that’s a small hurdle if we need it.
A terabyte per week isn’t tera-bly big, but it adds up when the data needs to stick around for a long time. Compression can ease the pain. Again, not expensive to implement if necessary.
This is an interesting part of the architecture. The nodes are pinging only once per second, and even at the gateway and collector stage, we’re actually limited to 64 pings per second. This pushes the point of convergence to..
.. storage. We need fast writing, but we also need fast reading, simultaneously.
120 loaded gateways = 7680 nodes. 1 record/sec => ~27.6 million records / hour. ~300 bytes / record => ~8.3 GB/hour, ~200 GB/day.
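The capacity figures in this note can be recomputed directly (the 300-byte record size is an assumption; a 300 KB record would give terabytes per hour, so bytes is the only reading consistent with ~8 GB/hour):

```python
gateways, nodes_per_gateway = 120, 64
record_bytes = 300                               # assumed record size

nodes = gateways * nodes_per_gateway             # 7680
records_per_hour = nodes * 3600                  # one record/sec per node
gb_per_hour = records_per_hour * record_bytes / 1e9

print(nodes, records_per_hour, round(gb_per_hour, 1))  # 7680 27648000 8.3
```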
There are two primary I/O bottlenecks in all network applications: 1) Network I/O and 2) Filesystem I/O. In general, we will have no control over the network infrastructures of deployment sites, so we really can’t do anything about Network I/O. That leaves Filesystem I/O.
The best way to mitigate the Filesystem I/O bottleneck is to avoid the filesystem altogether.
TIM LAST SLIDE
START AARON
AARON LAST SLIDE
We originally tried separating out each data value into a separate key (you can talk more about this on the next slide, when you have the example datapoint in front of you). This allowed extremely efficient querying, as we could query ‘motion’ data independently from ‘audio’ data. However, the overhead was significant in two respects: we had to store metadata (timestamp, node type, node MAC address, etc.) with each record, so a lot more data duplication and space inefficiency; and the number of inserts per second skyrocketed, e.g. x7 inserts per second for environmental nodes.
START TIM
If two data packets had exactly the same environmental values, but with a different score, Redis would update the existing set member with the new score, instead of creating a new set member. This silently drops the earlier record, a loss that adds up over millions of records.