In this talk, we describe using Redis, an open-source, in-memory key-value store, to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying the database for real-time monitoring and analytics.
* See more of my work at http://www.codehenge.net
3. Overview
• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
– Data Collection
– Data Modeling
– Data Access
– Event Monitoring and Notification
• Conclusions and Future Work
5. Why?
Stuxnet
• Two major components:
  1) Send centrifuges spinning wildly out of control
  2) Record ‘normal operations’ and play them back to operators during the attack 1
• Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound
1 http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
6. The Broader Vision
Quick, flexible out-of-band monitoring
• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
7. The Platform
A CMU research project called Sensor Andrew
• Features:
  – Open-source sensor platform
  – Scalable and generalist system supporting a wide variety of applications
  – Extensible architecture
• Can integrate diverse sensor types
9. Sensor Andrew Overview
[Diagram: Nodes → Gateways → Server → End Users]
10. What is a Node?
A node collects data and sends it to a collector, or gateway
• Environment Node sensors: Light, Audio, Humidity, Pressure, Motion, Temperature, Acceleration
• Power Node sensors: Current, Voltage, True Power, Energy
• Radiation Node sensors: Alpha particle count per minute
• Particulate Node sensors: Small Part. Count, Large Part. Count
11. What is a Gateway?
• A gateway receives UDP data from all nodes registered to it
• An internal service:
  – Receives data continuously
  – Opens a server on a specified port
  – Continually transmits UDP data over this port
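The receive path above can be sketched in a few lines. This is a minimal loopback demo, not the production service; the `node_id:sensor:value` wire format and all names here are hypothetical stand-ins, since the deck does not specify the real packet encoding.

```python
import socket

def parse_packet(data: bytes):
    """Split a hypothetical 'node_id:sensor:value' datagram."""
    node_id, sensor, value = data.decode().split(":")
    return node_id, sensor, float(value)

# Demo on loopback: one "node" datagram sent to the gateway socket.
gateway = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
gateway.bind(("127.0.0.1", 0))          # port 0 = any free ephemeral port
node = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
node.sendto(b"node42:motion:0.87", gateway.getsockname())

packet, _ = gateway.recvfrom(1024)
print(parse_packet(packet))             # ('node42', 'motion', 0.87)
node.close(); gateway.close()
```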
12. Requirements
We need to:
1. Collect data from nodes once per second
2. Scale to 100 gateways each with 64 nodes
3. Detect events in real-time
4. Notify users about events in real-time
5. Retain all data collected for years, at least
14. What Is Big Data?
“When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”
21. Collecting Data
Problem: Store and retrieve immense amounts of data at a high rate.
Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure.
[Diagram: Gateway emitting 8 GB / hour → ?]
22. We Tried PostgreSQL…
• Advantages:
  – Reliable, tested and scalable
  – Relational => complex queries => analytics
• Problems:
  – Performance problems reading while writing at a high rate; real-time event detection suffers
  – ‘COPY FROM’ doesn’t permit horizontal scaling
24. Q: How can we decrease I/O load?
A: Read and write collected data directly from memory
25. Enter Redis
Redis is an in-memory NoSQL database, commonly used as a web application cache or pub/sub server
26. Redis
• Created in 2009
• Fully in-memory key-value store
– Fast I/O: R/W operations are equally fast
– Advanced data structures
• Publish/Subscribe Functionality
– In addition to data store functions
– Separate from stored key-value data
27. Persistence
• Snapshotting
  – Data is asynchronously transferred from memory to disk
• AOF (Append Only File)
  – Each modifying operation is written to a file
  – Can recreate data store by replaying operations
  – Without interrupting service, Redis can rewrite the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
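Both persistence modes above are enabled in `redis.conf`. The directives are real Redis configuration options; the thresholds shown are illustrative values, not the ones used in this project.

```
# redis.conf: persistence (illustrative thresholds)
save 900 1                        # snapshot if >= 1 key changed in 900 s
save 60 10000                     # snapshot if >= 10000 keys changed in 60 s
appendonly yes                    # enable the AOF
appendfsync everysec              # fsync the AOF once per second
auto-aof-rewrite-percentage 100   # rewrite when AOF doubles in size...
auto-aof-rewrite-min-size 64mb    # ...and is at least 64 MB
```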
28. Replication
• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
– Slaves are writeable!
– Potential for data inconsistency
• Fully compatible with Pub/Sub features
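Replication is also configured in `redis.conf` on the replica. The writable-slave pitfall noted above can be closed with one directive (the host/port below are placeholders; older Redis versions use `slaveof` instead of `replicaof`):

```
# replica's redis.conf (placeholder master address)
replicaof 10.0.0.1 6379
# Guard against the writable-slave inconsistency noted above:
replica-read-only yes
```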
29. Redis Features: Advanced Data Structures
• List: [A, B, C, D]
• Set: {A, B, C, D}
• Sorted Set ({value:score}): {C:1, D:2, A:3, B:4}
• Hash ({key:value}): {field1:“A”, field2:“B”, field3:“C”, field4:“D”}
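The four structures can be sketched with pure-Python analogues, annotated with the Redis command that would build each one; no server is needed for this illustration, and the key names are made up.

```python
# Pure-Python analogues of the four Redis structures on this slide.

mylist = ["A", "B", "C", "D"]                       # RPUSH mylist A B C D
myset = {"A", "B", "C", "D"}                        # SADD myset A B C D
myzset = {"C": 1, "D": 2, "A": 3, "B": 4}           # ZADD myzset 1 C 2 D 3 A 4 B
myhash = {"field1": "A", "field2": "B",
          "field3": "C", "field4": "D"}             # HSET myhash field1 A ...

# A sorted set is kept ordered by score, not by insertion order:
by_score = sorted(myzset, key=myzset.get)           # ZRANGE myzset 0 -1
print(by_score)                                     # ['C', 'D', 'A', 'B']
```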
31. Constraints
Our data store must:
– Hold time-series data
– Be flexible in querying (by time, node, sensor)
– Allow efficient querying of many records
– Accept data out of order
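The constraints above point at one natural shape: a sorted set per node+sensor, scored by timestamp, so records can arrive out of order yet still be range-queried by time. This is a pure-Python sketch of that idea (the key schema and a dict standing in for Redis are assumptions for illustration); embedding the timestamp in the member string also keeps identical readings from colliding.

```python
store = {}  # stand-in for Redis: key -> {member: score}

def zadd(key, member, score):
    """Mimic ZADD: one scored member in a per-key sorted set."""
    store.setdefault(key, {})[member] = score

def zrangebyscore(key, lo, hi):
    """Mimic ZRANGEBYSCORE: members with lo <= score <= hi, score order."""
    members = store.get(key, {})
    return [m for m in sorted(members, key=members.get)
            if lo <= members[m] <= hi]

# Out-of-order arrival: the t=1002 reading lands before t=1001.
zadd("node42:motion", "1002:0.10", 1002)
zadd("node42:motion", "1001:0.95", 1001)
zadd("node42:motion", "1005:0.40", 1005)

print(zrangebyscore("node42:motion", 1000, 1003))  # ['1001:0.95', '1002:0.10']
```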
32. Tradeoffs: Efficiency vs. Flexibility
One record per timestamp (motion, light, audio, temperature, pressure, humidity, and acceleration stored together) vs. one record per sensor data type.
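A back-of-envelope comparison makes the tradeoff concrete for one environmental node reporting seven values per second. The byte sizes below are assumptions for illustration, not measurements from the real system.

```python
sensors = ["motion", "light", "audio", "temperature",
           "pressure", "humidity", "acceleration"]
meta_bytes = 24      # hypothetical: timestamp + node type + MAC per record
value_bytes = 8      # hypothetical encoded reading

# One record per timestamp: a single insert carrying all seven values.
per_timestamp_inserts = 1
per_timestamp_bytes = meta_bytes + len(sensors) * value_bytes   # 80

# One record per sensor data type: seven inserts, metadata repeated.
per_sensor_inserts = len(sensors)
per_sensor_bytes = len(sensors) * (meta_bytes + value_bytes)    # 224

print(per_sensor_inserts, per_sensor_bytes / per_timestamp_bytes)  # 7 2.8
```

The per-sensor model buys flexible querying (fetch ‘motion’ without ‘audio’) at the cost of seven times the inserts and nearly triple the storage under these assumed sizes.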
45. We could throw away data…
• If we only cared about current values
• However, our data
  – Must be stored for 1+ years for compliance
  – Must be able to be queried for historical/trend analysis
46. We Still Need Long-term Data Storage
Solution? Migrate data to an archive with expansive storage capacity
49. Yes, Winning
[Diagram: Gateway → Redis → Archiver → PostgreSQL, with an API serving Some Happy Client]
50. Best of Both Worlds
• Redis allows quick access to real-time data, for monitoring and event detection
• PostgreSQL allows complex queries and scalable storage for deep and historical analysis
[Diagram: Gateway → Redis → Archiver → PostgreSQL, behind an API]
51. We Have the Data, Now What?
Incoming data must be monitored and analyzed, to detect significant events
• What is “significant”?
• What about new data types?
54. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; App DB ↔ Django App]
New guy: provide a way to read the data and create rules, e.g. motion > x && pressure < y && audio > z
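A rule like “motion > x && pressure < y && audio > z” can be stored as data and evaluated generically. This is a minimal sketch of that idea; in the real system the rules would live in the app database, and all names and thresholds here are hypothetical.

```python
# Each rule is a list of (sensor, operator, threshold) clauses.
OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b}

def matches(rule, reading):
    """True only when every clause in the rule holds for this reading."""
    return all(OPS[op](reading[sensor], threshold)
               for sensor, op, threshold in rule)

rule = [("motion", ">", 0.5), ("pressure", "<", 101.0), ("audio", ">", 0.2)]

print(matches(rule, {"motion": 0.9, "pressure": 100.2, "audio": 0.6}))  # True
print(matches(rule, {"motion": 0.9, "pressure": 102.0, "audio": 0.6}))  # False
```

Keeping rules as data, rather than code, is what lets users create them through the web app without redeploying the event monitor.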
55. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; Event Monitor; App DB ↔ Django App]
New guy: Event Monitor reads the rules and data, triggers alarms when all conditions are true (motion > x, pressure < y, audio > z)
56. [Diagram: Gateway → Redis → Archiver → API → PostgreSQL; multiple Event Monitors; App DB ↔ Django App]
Event monitor services can be scaled independently
58. Getting The Message Out
Considerations
• Event monitor already has a job; avoid re-tasking it as a notification engine
• Notifications most efficiently should be a “push” instead of needing to poll
• Notification system should be generalized, e.g. SMTP, SMS
62. Pub/Sub with synchronized workers is an optimal solution to real-time event notifications.
No need to add another system, Redis offers pub/sub services as well!
[Diagram: Gateway → Redis Data + Redis Pub/Sub → Archiver → API → PostgreSQL; Event Monitor → Notification Workers → SMTP; App DB ↔ Django App]
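The fan-out shape above can be modeled in-process: the event monitor publishes once, and every notification worker gets the message pushed to it rather than polling. This is a sketch of the pattern only, using stdlib queues in place of Redis PUBLISH/SUBSCRIBE; worker transports (SMTP, SMS) are placeholders.

```python
import queue
import threading

subscribers = []          # one inbox per notification worker

def subscribe():
    """Mimic SUBSCRIBE: register an inbox for this worker."""
    inbox = queue.Queue()
    subscribers.append(inbox)
    return inbox

def publish(message):
    """Mimic PUBLISH: push the message to every subscriber."""
    for inbox in subscribers:
        inbox.put(message)

sent = []
def worker(inbox, transport):
    msg = inbox.get()     # blocks until a message is pushed (no polling)
    sent.append((transport, msg))

t1 = threading.Thread(target=worker, args=(subscribe(), "SMTP"))
t2 = threading.Thread(target=worker, args=(subscribe(), "SMS"))
t1.start(); t2.start()
publish("motion alarm on node42")
t1.join(); t2.join()
print(sorted(sent))
```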
63. Conclusions
• Redis is a powerful tool for collecting large amounts of data in real-time
• In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system
• Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
64. Things to watch
• Data persistence
  – If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory 1
  – Redis is unresponsive during startup
1 http://oldblog.antirez.com/post/redis-persistence-demystified.html
65. Future Work
• Improve scalability through:
– Data encoding
– Data compression
– Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics
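The “data compression” item above is cheap to prototype with stdlib zlib on a batch of encoded readings; repetitive sensor strings compress well. The record format below is a hypothetical stand-in, and the ratio is illustrative, not measured on the real data.

```python
import zlib

# A batch of 1000 hypothetical encoded readings for one node/sensor.
batch = "\n".join(
    f"node42:motion:{i % 10 / 10:.2f}" for i in range(1000)
).encode()

compressed = zlib.compress(batch, level=6)
print(len(compressed) < len(batch))   # True: repetitive text shrinks a lot
```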
66. Acknowledgements
• Project engineers Chris Taschner and Jeff
Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab
http://wise.ece.cmu.edu/
• Our organizations
CMU https://www.cmu.edu
CERT http://www.cert.org
SEI http://www.sei.cmu.edu
Cylab https://www.cylab.cmu.edu
Welcome. <Introductions, who we are, where we’re from>
AARON LAST SLIDE
Let’s start with some background. We’ve been working with a CMU research group on applications of a research project called Sensor Andrew. The vision of Sensor Andrew is to provide a generalized environmental sensor network, capable of being leveraged for a wide variety of applications, both academic and commercial.
START TIM
A Sensor Andrew system consists primarily of nodes, like this <hold one up if possible>, each of which contains a variety of embedded sensors, and a gateway with a specialized receiver, allowing it to receive wireless messages from each of up to 64 nodes concurrently. Our collaborators have provided hardware design and gear, firmware on all embedded components, and some baseline software to work from when interfacing with the hardware systems.
Let’s look at some more detail on the type of data we are collecting. We currently have two types of nodes, environmental and power nodes <show samples>. Environmental nodes can be set anywhere, and will detect measures of light, audio, humidity, pressure, motion, temperature, and acceleration (in x,y,z components) relative to the environment immediately surrounding the node. Power nodes must be plugged in to a wall outlet, with a current-drawing device plugged into them to draw power. This allows the power node to detect and transmit numerous measures of data involving current, voltage, power, etc. Data is transmitted from the nodes in UDP format. For reference, an environmental data packet is ____ bytes in size, and a power data packet is ____ bytes.
Packets are UDP and the information is stored as an encoded string, so the network load is already pretty small. Compression, in addition to the encoding of the data, might be an option in the future, but that’s a small hurdle if we need it.
A terabyte per week isn’t tera-bly big, but it adds up when the data needs to stick around for a long time. Compression can ease the pain. Again, not expensive to implement if necessary.
This is an interesting part of the architecture. The nodes are pinging only once per second, and even at the gateway and collector stage, we’re actually limited to 64 pings per second. This pushes the point of convergence to..
.. storage. We need fast writing, but we also need fast reading, simultaneously.
120 loaded gateways = 7680 nodes. 1 record/sec => ~27.6 million records / hour. ~300 bytes / record => ~8.3 GB/hour, ~200 GB/day.
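The capacity figures in this note can be recomputed directly (the 300-byte record size is an assumption; a 300 KB record would give terabytes per hour, so bytes is the only reading consistent with ~8 GB/hour):

```python
gateways, nodes_per_gateway = 120, 64
record_bytes = 300                               # assumed record size

nodes = gateways * nodes_per_gateway             # 7680
records_per_hour = nodes * 3600                  # one record/sec per node
gb_per_hour = records_per_hour * record_bytes / 1e9

print(nodes, records_per_hour, round(gb_per_hour, 1))  # 7680 27648000 8.3
```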
There are two primary I/O bottlenecks in all network applications: 1) Network I/O and 2) Filesystem I/O. In general, we will have no control over the network infrastructures of deployment sites, so we really can’t do anything about Network I/O. That leaves Filesystem I/O.
The best way to mitigate the Filesystem I/O bottleneck is to avoid the filesystem altogether.
TIM LAST SLIDE
START AARON
AARON LAST SLIDE
We originally tried separating out each data value into a separate key (you can talk more about this on the next slide, when you have the example datapoint in front of you). This allowed extremely efficient querying, as we could query ‘motion’ data independently from ‘audio’ data. However, the overhead was significant in two respects: we had to store metadata (timestamp, node type, node MAC address, etc.) with each record, so a lot more data duplication and space inefficiency; and the number of inserts per second skyrocketed, e.g. x7 inserts per second for environmental nodes.
START TIM
If two data packets had exactly the same environmental values, but with a different score, Redis would update the existing set member with the new score, instead of creating a new set member. This silently drops the earlier record, a loss that adds up over millions of records.