Large-Scale Data Collection Using Redis
C. Aaron Cois, Ph.D. -- Tim Palko
CMU Software Engineering Institute




                                    © 2011 Carnegie Mellon University
Us
C. Aaron Cois, Ph.D.
Software Architect, Team Lead
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate
@aaroncois

Tim Palko
Senior Software Engineer
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate


Overview
• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
  – Data Collection
  – Data Modeling
  – Data Access
  – Event Monitoring and Notification
• Conclusions and Future Work
The Goal
Critical infrastructure/facility protection via Environmental Monitoring
Why?
Stuxnet
• Two major components:
   1) Send centrifuges spinning wildly out of control
   2) Record ‘normal operations’ and play them back
       to operators during the attack 1
• Environmental monitoring provides secondary
  indicators, such as abnormal
  heat/motion/sound

 1 http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
The Broader Vision
Quick, flexible out-of-band monitoring
• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and
  requires no existing systems other than
  outbound networking
The Platform

A CMU research project called Sensor Andrew

• Features:
  – Open-source sensor platform
  – Scalable and generalist system supporting a
    wide variety of applications
  – Extensible architecture
    •   Can integrate diverse sensor types
Sensor Andrew
[Figure: Sensor Andrew Overview. Components shown: Nodes, Gateways, Server, End Users]
What is a Node?
A node collects data and sends it to a collector, or gateway

Environment Node Sensors
• Light
• Audio
• Humidity
• Pressure
• Motion
• Temperature
• Acceleration

Power Node Sensors
• Current
• Voltage
• True Power
• Energy

Radiation Node Sensors
• Alpha particle count per minute

Particulate Node Sensors
• Small Part. Count
• Large Part. Count
What is a Gateway?

• A gateway receives UDP data
  from all nodes registered to it
• An internal service:
  – Receives data continuously
  – Opens a server on a specified
    port
  – Continually transmits UDP
    data over this port
Requirements
We need to:
1.   Collect data from nodes once per second
2.   Scale to 100 gateways, each with 64 nodes
3.   Detect events in real time
4.   Notify users about events in real time
5.   Retain all collected data for at least a year
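
As a rough back-of-the-envelope check, the write rate implied by requirements 1 and 2 can be computed directly. The figures come from the list above; the function and variable names are only illustrative.

```python
# Back-of-the-envelope write rate implied by requirements 1 and 2.
GATEWAYS = 100          # requirement 2
NODES_PER_GATEWAY = 64  # requirement 2
SAMPLES_PER_SECOND = 1  # requirement 1: each node reports once per second


def records_per_second(gateways: int, nodes_per_gateway: int, rate_hz: int) -> int:
    """Total number of node readings arriving each second."""
    return gateways * nodes_per_gateway * rate_hz


per_second = records_per_second(GATEWAYS, NODES_PER_GATEWAY, SAMPLES_PER_SECOND)
per_day = per_second * 60 * 60 * 24

print(per_second)  # 6400
print(per_day)     # 552960000
```

At full scale the store has to absorb several thousand writes per second while still serving reads for event detection.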
What Is Big Data?

“When your data sets become so large that you have to start innovating
around how to collect, store, organize, analyze and share it.”
Problems
• Size
• Rate
• Transmission
• Storage
• Retrieval
Collecting Data
Problem: Store and retrieve immense amounts of data at a high rate.
Constraints:
• Data cannot remain on the nodes or gateways due to security concerns.
• Limited infrastructure.

[Diagram: Gateway sending 8 GB / hour to an as yet undecided data store (?)]
We Tried PostgreSQL…

• Advantages:
  – Reliable, tested and scalable
  – Relational => complex queries => analytics
• Problems:
  – Performance problems reading while writing at a
    high rate; real-time event detection suffers
  – ‘COPY FROM’ doesn’t permit horizontal scaling
Q: How can we decrease I/O load?

A: Read and write collected data directly from memory
Enter Redis

Redis is an in-memory
NoSQL database



Commonly used as a web application cache or
pub/sub server
Redis
• Created in 2009
• Fully In-memory key-value store
  – Fast I/O: R/W operations are equally fast
  – Advanced data structures
• Publish/Subscribe Functionality
  – In addition to data store functions
  – Separate from stored key-value data
Persistence
• Snapshotting
  – Data is asynchronously transferred from memory
    to disk
• AOF (Append Only File)
  – Each modifying operation is written to a file
  – Can recreate data store by replaying operations
  – Without interrupting service, will rebuild AOF as
    the shortest sequence of commands needed to
    rebuild the current dataset in memory
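
The AOF idea (replay a log to rebuild state, and rewrite the log as a shorter equivalent) can be illustrated with a toy model. This is a pure-Python sketch of the concept only, with made-up `SET`/`DEL` tuples; it is not how Redis implements its append-only file.

```python
def replay(log):
    """Rebuild a key-value store by applying logged operations in order."""
    store = {}
    for op, key, *args in log:
        if op == "SET":
            store[key] = args[0]
        elif op == "DEL":
            store.pop(key, None)
    return store


def rewrite(log):
    """Return a shorter log that rebuilds the same final state."""
    return [("SET", key, value) for key, value in replay(log).items()]


log = [
    ("SET", "a", 1),
    ("SET", "a", 2),   # overwrites the first write
    ("SET", "b", 3),
    ("DEL", "b"),      # cancels the write to b
]

compact = rewrite(log)
print(replay(log))  # {'a': 2}
print(compact)      # [('SET', 'a', 2)]
```

Four logged operations collapse to one, yet replaying either log yields the same dataset.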
Replication

• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
  – Slaves are writeable!
  – Potential for data inconsistency
• Fully compatible with Pub/Sub features
Redis Features: Advanced Data Structures

• List: [A, B, C, D]
• Set: {A, B, C, D}
• Sorted Set {value:score}: {C:1, D:2, A:3, B:4}
• Hash {key:value}: {field1:"A", field2:"B", field3:"C", field4:"D"}
Our Data Model
Constraints
Our data store must:


– Hold time-series data
– Be flexible in querying (by time, node, sensor)
– Allow efficient querying of many records
– Accept data out of order
Tradeoffs: Efficiency vs. Flexibility
One record per timestamp
– A single record holds Motion, Audio, Light, Pressure, Humidity,
  Acceleration and Temperature

VS

One record per sensor data type
– Separate records for Motion, Light, Temperature, Audio, Humidity,
  Pressure and Acceleration
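
To make the tradeoff concrete, here is a minimal sketch of the two layouts for a single environment-node reading. The field names and values are illustrative, not the project's actual schema.

```python
# One reading from an environment node (values are made up for illustration).
reading = {
    "timestamp": 1357542004000,
    "motion": 203, "audio": 460, "light": 820, "pressure": 99007,
    "humidity": 22, "acceleration": 311, "temperature": 523,
}

# Layout 1: one record per timestamp. A single write stores every sensor value.
per_timestamp = [reading]

# Layout 2: one record per sensor data type. Each sensor value is its own record.
per_sensor = [
    {"timestamp": reading["timestamp"], "sensor": name, "value": value}
    for name, value in reading.items()
    if name != "timestamp"
]

print(len(per_timestamp))  # 1
print(len(per_sensor))     # 7
```

The first layout costs one write per node per second; the second costs seven, but lets a single sensor be read without touching the others.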
Our Solution: Sorted Set

Datapoint: sensor:env:101
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216,
        "mac_address": "20f", "humidity": 22, "motion":
        203, "pressure": 99007, "node_type": "env",
        "timestamp": 1357542004000, "audio_p2p":
        460, "light": 820, "acc_z": 464, "acc_y": 351,
        "acc_x": 311}
Sorted Set
1357542004000: {“temp”:523,..}
1357542005000: {“temp”:523,..}

1357542007000: {“temp”:530,..}
1357542008000: {“temp”:531,..}
1357542009000: {“temp”:540,..}
1357542010000: {“temp”:545,..}
…
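
A reading like the datapoint shown earlier could be written to its sorted set roughly as follows. This is a sketch under assumptions: `node_id` and both function names are hypothetical, the key layout is inferred from the `sensor:env:101` example, and `client` is assumed to be a redis-py `redis.Redis` instance.

```python
import json


def to_sorted_set_entry(node_id: str, reading: dict) -> tuple:
    """Map one reading to (key, score, member) for a Redis sorted set.

    The key layout "sensor:<node_type>:<node_id>" is inferred from the
    slides' examples and should be treated as an assumption.
    """
    key = "sensor:{}:{}".format(reading["node_type"], node_id)
    score = reading["timestamp"]                  # milliseconds since the epoch
    member = json.dumps(reading, sort_keys=True)  # the whole reading as JSON text
    return key, score, member


def store_reading(client, node_id: str, reading: dict) -> None:
    """Write one reading; `client` is assumed to be a redis-py client."""
    key, score, member = to_sorted_set_entry(node_id, reading)
    client.zadd(key, {member: score})  # redis-py 3.0+ takes a {member: score} mapping


reading = {"node_type": "env", "mac_address": "20f",
           "timestamp": 1357542004000, "temp": 523, "humidity": 22}
key, score, member = to_sorted_set_entry("101", reading)
print(key)    # sensor:env:101
print(score)  # 1357542004000
```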
Sorted Set
1357542004000: {“temp”:523,..}
1357542005000: {“temp”:523,..}
1357542006000: {“temp”:527,..} <- fits nicely
1357542007000: {“temp”:530,..}
1357542008000: {“temp”:531,..}
1357542009000: {“temp”:540,..}
1357542010000: {“temp”:545,..}
…
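
The "fits nicely" behaviour can be mimicked in plain Python. Redis keeps a sorted set ordered by score on its own; here a list maintained with `bisect` only stands in for that behaviour, and the shortened JSON values are placeholders.

```python
import bisect

# Pure-Python stand-in for a sorted set: (score, member) pairs kept ordered by score.
entries = [
    (1357542004000, '{"temp":523}'),
    (1357542005000, '{"temp":523}'),
    (1357542007000, '{"temp":530}'),
    (1357542008000, '{"temp":531}'),
]

# A reading for ...6000 arrives late, after ...7000 and ...8000 were stored.
late = (1357542006000, '{"temp":527}')
bisect.insort(entries, late)  # inserted at the position its score dictates

print([score for score, _ in entries])
# [1357542004000, 1357542005000, 1357542006000, 1357542007000, 1357542008000]
```

Because position is decided by the score rather than by arrival order, late or reordered UDP data needs no special handling on the write path.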
Know your data structure!
A set is still a set…

Datapoint
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216,
        "mac_address": "20f", "humidity": 22, "motion":
        203, "pressure": 99007, "node_type": "env",
        "timestamp": 1357542004000, "audio_p2p":
        460, "light": 820, "acc_z": 464, "acc_y": 351,
        "acc_x": 311}
Requirement Satisfied
[Diagram: Gateway, Redis]
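
Before moving on, the "a set is still a set" caveat can be made concrete. Sorted-set members are unique, so re-adding an identical value only updates its score. The toy class below models just that rule in plain Python; it is not Redis code, and reading the slide as an argument for keeping the timestamp inside the value is an interpretation.

```python
import json


class ToySortedSet:
    """Tiny model of sorted-set semantics: members are unique, each has one score."""

    def __init__(self):
        self._scores = {}  # member -> score

    def add(self, member: str, score: int) -> None:
        # Re-adding an existing member only updates its score; no new entry appears.
        self._scores[member] = score

    def __len__(self) -> int:
        return len(self._scores)


# Without a timestamp in the value, two identical readings collapse into one member.
bare = ToySortedSet()
bare.add(json.dumps({"temp": 523}), 1357542004000)
bare.add(json.dumps({"temp": 523}), 1357542005000)
print(len(bare))  # 1

# With the timestamp embedded, each reading is a distinct member.
stamped = ToySortedSet()
for ts in (1357542004000, 1357542005000):
    stamped.add(json.dumps({"temp": 523, "timestamp": ts}), ts)
print(len(stamped))  # 2
```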
There is a disturbance in the Force..
Collecting Data



Gateway
                            Redis
“In Memory” Means Many Things

• The data store capacity is aggressively
  capped
  – Redis can only store as much data as the server
    has RAM
Collecting Big Data
[Diagram: Gateway, Redis]
We could throw away data…
• If we only cared about current values
• However, our data
  – Must be stored for 1+ years for compliance
  – Must be able to be queried for historical/trend
    analysis
We Still Need Long-term Data Storage


  Solution? Migrate data to an archive with
         expansive storage capacity
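
One way such a migration step could look is sketched below. It assumes `client` is a redis-py `redis.Redis` instance; `write_rows` is a hypothetical callable that persists rows to the archive (for example with a SQL INSERT) and is not something the slides define.

```python
def archive_older_than(client, key, cutoff_ms, write_rows):
    """Move entries with score <= cutoff_ms out of Redis and into an archive.

    `client` is assumed to be a redis-py client. `write_rows` is a
    hypothetical callable that persists (key, score, member) rows.
    """
    old = client.zrangebyscore(key, "-inf", cutoff_ms, withscores=True)
    if not old:
        return 0
    # Write to the archive first, so a failure here leaves the data in Redis.
    write_rows([(key, int(score), member) for member, score in old])
    client.zremrangebyscore(key, "-inf", cutoff_ms)
    return len(old)
```

Since readings may arrive out of order, the cutoff should sit comfortably in the past; otherwise a late entry written between the read and the delete could be removed without being archived.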
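Winning
[Diagram: Gateway, Redis, Archiver, PostgreSQL]
Winning?
[Diagram: Gateway, Redis, Archiver, PostgreSQL, with question marks between "Some Poor Client" and the data stores]
Yes, Winning
[Diagram: Gateway, Redis, Archiver, PostgreSQL, with an API between "Some Happy Client" and the data stores]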
Best of both worlds
• Redis allows quick access to real-time data, for
  monitoring and event detection
• PostgreSQL allows complex queries and scalable storage
  for deep and historical analysis
[Diagram: Gateway, Redis, Archiver, PostgreSQL, API]
We Have the Data, Now What?

 Incoming data must be monitored and
  analyzed, to detect significant events


         What is “significant”?

     What about new data types?
[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, App DB, Django App]

Example rule:
  motion > x
  && pressure < y
  && audio > z

New guy (Django App): provide a way to read the data and create rules
[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, App DB, Django App, Event Monitors]

All true?
  motion > x
  pressure < y
  audio > z

New guy (Event Monitor): read the rules and data, trigger alarms
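
The "All true?" check could be sketched as follows. The rule representation, the thresholds standing in for x, y and z, and the function name are all assumptions made for illustration; the slides do not show how rules are actually stored.

```python
import operator

# Comparison operators a rule may use.
OPS = {">": operator.gt, "<": operator.lt}

# One rule: every condition must hold for the alarm to fire.
# The thresholds stand in for the x, y, z on the slide and are made up.
rule = [
    ("motion", ">", 200),
    ("pressure", "<", 100000),
    ("audio_p2p", ">", 400),
]


def rule_matches(rule, reading):
    """True when all conditions are true for this reading ("All true?")."""
    return all(
        field in reading and OPS[op](reading[field], threshold)
        for field, op, threshold in rule
    )


reading = {"motion": 203, "pressure": 99007, "audio_p2p": 460}
print(rule_matches(rule, reading))                    # True
print(rule_matches(rule, {**reading, "motion": 10}))  # False
```

Keeping rules as data rather than code is what would let a separate application create them without redeploying the monitor.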
[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, App DB, Django App, multiple Event Monitors]

Event monitor services can be scaled independently
Getting The Message Out
Considerations

• Event monitor already has a job; avoid retasking it
  as a notification engine
• Notifications are most efficient as a “push”, with no
  need to poll
• Notification system should be generalized,
  e.g. SMTP, SMS
If only…
Pub/Sub with synchronized workers is an optimal solution to
real-time event notifications.

No need to add another system, Redis offers pub/sub services as well!

[Diagram: Gateway, Redis Data, Redis Pub/Sub, Archiver, PostgreSQL, API, Event Monitors, App DB, Django App, Notification Workers, SMTP]
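
A minimal sketch of the two sides of this arrangement follows. It assumes `client` is a redis-py `redis.Redis` instance and `pubsub` is the object returned by `client.pubsub()` after subscribing; the function names, the channel name and the `notify` callable are hypothetical.

```python
import json


def publish_event(client, channel, event):
    """Event monitor side: push an event to every subscribed worker.

    `client` is assumed to be a redis-py client.
    """
    return client.publish(channel, json.dumps(event))


def run_worker(pubsub, notify):
    """Notification worker side: block on the subscription and forward events.

    `pubsub` is assumed to come from `client.pubsub()` after
    `pubsub.subscribe(channel)`. `notify` is a hypothetical callable that
    delivers the event, for example by SMTP or SMS.
    """
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe/unsubscribe confirmations
        notify(json.loads(message["data"]))
```

Redis delivers a published message to every current subscriber, so if several workers listen on the same channel they need some coordination to avoid sending duplicate notifications.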
Conclusions

• Redis is a powerful tool for collecting large
  amounts of data in real-time
• In addition to maintaining a rapid pace of
  data insertion, we were able to concurrently
  query, monitor, and detect events on our
  Redis data collection system
• Bonus: Redis also enabled a robust, scalable
  real-time notification system using pub/sub
Things to watch

   • Data persistence
      – if Redis needs to restart, it takes 10-20 seconds
        per gigabyte to re-load all data into memory 1
      – Redis is unresponsive during startup
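
The reload figure above turns into downtime quickly. A small estimate, using the 10-20 seconds per gigabyte quoted on this slide and an arbitrary example dataset size:

```python
SECONDS_PER_GB = (10, 20)  # reload cost range quoted on the slide


def reload_time_seconds(dataset_gb):
    """Return the (best, worst) estimated startup time for a dataset size."""
    low, high = SECONDS_PER_GB
    return dataset_gb * low, dataset_gb * high


# 64 GB is an arbitrary example size, not a figure from the talk.
print(reload_time_seconds(64))  # (640, 1280)
```

That is roughly 11 to 21 minutes during which Redis would not answer requests.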




1 http://oldblog.antirez.com/post/redis-persistence-demystified.html
Future Work

• Improve scalability through:
  – Data encoding
  – Data compression
  – Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics
Acknowledgements

• Project engineers Chris Taschner and Jeff
  Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab
     http://wise.ece.cmu.edu/
• Our organizations
     CMU       https://www.cmu.edu
     CERT      http://www.cert.org
     SEI       http://www.sei.cmu.edu
     Cylab     https://www.cylab.cmu.edu
Thank You

Questions?
Slides of Live Redis Demo
A Closer Look at Redis Data
 redis> keys *

 1)"sensor:environment:f80"
 2)"sensor:environment:f81"
 3)"sensor:environment:f82"
 4)"sensor:environment:f83"
 5)"sensor:environment:f84"
 6)"sensor:power:f85"
 7)"sensor:power:f86"
 8)"sensor:radiation:f87"
 9)"sensor:particulate:f88"
A Closer Look at Redis Data

  redis> keys sensor:power:*

  1)"sensor:power:f85"
  2)"sensor:power:f86"
A Closer Look at Redis Data


redis> zcount sensor:power:f85 -inf +inf

(integer) 3565958
(45.38s)
A Closer Look at Redis Data


redis> zcount sensor:power:f85 1359728113000 +inf

(integer) 47
A Closer Look at Redis Data
redis> zrange sensor:power:f85 -1000 -1

1) "{"long_energy1": 73692453, "total_secs":
   6784, "energy": [49, 175, 62, 0, 0, 0],
   "c2_center": 485, "socket_state": 1,
   "node_type": "power", "c_p2p_low2": 437,
   "socket_state1": 0, "mac_address": "103",
   "c_p2p_low": 494, "rms_current": 6,
   "true_power": 1158, "timestamp":
   1359728143000, "v_p2p_low": 170, "c_p2p_high":
   511, "rms_current1": 113, "freq": 60,
   "long_energy": 4108081, "v_center": 530,
   "c_p2p_high2": 719, "energy1": [37, 117, 100,
   4, 0, 0], "v_p2p_high": 883, "c_center": 509,
   "rms_voltage": 255, "true_power1": 23235}"
2) …
Redis Python API
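import redis

# Connect through a pool to the local Redis server used in the demo
pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
r = redis.Redis(connection_pool=pool)

# Last 50 entries by rank (position in the sorted set)
byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

# Entries whose score (timestamp in ms) falls in the range, inclusive
byscore = r.zrangebyscore("sensor:env:f85",
                           1361423071000,
                           1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

# Number of entries across the whole score range
size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L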

"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Remote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJSRemote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJS
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 

More from cacois

Devopssecfail
DevopssecfailDevopssecfail
Devopssecfailcacois
 
Machine Learning for Modern Developers
Machine Learning for Modern DevelopersMachine Learning for Modern Developers
Machine Learning for Modern Developerscacois
 
Avoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.jsAvoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.jscacois
 
Node.js Patterns for Discerning Developers
Node.js Patterns for Discerning DevelopersNode.js Patterns for Discerning Developers
Node.js Patterns for Discerning Developerscacois
 
Hadoop: The elephant in the room
Hadoop: The elephant in the roomHadoop: The elephant in the room
Hadoop: The elephant in the roomcacois
 
Automate your Development Environments with Vagrant
Automate your Development Environments with VagrantAutomate your Development Environments with Vagrant
Automate your Development Environments with Vagrantcacois
 
Node.js: A Guided Tour
Node.js: A Guided TourNode.js: A Guided Tour
Node.js: A Guided Tourcacois
 

More from cacois (7)

Devopssecfail
DevopssecfailDevopssecfail
Devopssecfail
 
Machine Learning for Modern Developers
Machine Learning for Modern DevelopersMachine Learning for Modern Developers
Machine Learning for Modern Developers
 
Avoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.jsAvoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.js
 
Node.js Patterns for Discerning Developers
Node.js Patterns for Discerning DevelopersNode.js Patterns for Discerning Developers
Node.js Patterns for Discerning Developers
 
Hadoop: The elephant in the room
Hadoop: The elephant in the roomHadoop: The elephant in the room
Hadoop: The elephant in the room
 
Automate your Development Environments with Vagrant
Automate your Development Environments with VagrantAutomate your Development Environments with Vagrant
Automate your Development Environments with Vagrant
 
Node.js: A Guided Tour
Node.js: A Guided TourNode.js: A Guided Tour
Node.js: A Guided Tour
 

Large-Scale Environmental Data Collection and Analysis Using Redis

  • 1. Large-Scale Data Collection Using Redis C. Aaron Cois, Ph.D. -- Tim Palko CMU Software Engineering Institute © 2011 Carnegie Mellon University
  • 2. Us. C. Aaron Cois, Ph.D. (@aaroncois): Software Architect, Team Lead, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate. Tim Palko: Senior Software Engineer, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate.
  • 3. Overview • Problem Statement • Sensor Hardware & System Requirements • System Overview – Data Collection – Data Modeling – Data Access – Event Monitoring and Notification • Conclusions and Future Work
  • 4. The Goal Critical infrastructure/facility protection via Environmental Monitoring
  • 5. Why? Stuxnet • Two major components: 1) Send centrifuges spinning wildly out of control 2) Record ‘normal operations’ and play them back to operators during the attack 1 • Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound 1 http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
  • 6. The Broader Vision Quick, flexible out-of-band monitoring • Set up monitoring in minutes • Versatile sensors, easily repurposed • Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
  • 7. The Platform A CMU research project called Sensor Andrew • Features: – Open-source sensor platform – Scalable and generalist system supporting a wide variety of applications – Extensible architecture • Can integrate diverse sensor types
  • 9. Sensor Andrew Overview (diagram: Nodes, Gateways, Server, End Users)
  • 10. What is a Node? A node collects data and sends it to a collector, or gateway. Environment Node sensors: Light, Audio, Humidity, Pressure, Motion, Temperature, Acceleration. Power Node sensors: Current, Voltage, True Power, Energy. Radiation Node sensors: Alpha particle count per minute. Particulate Node sensors: Small Part. Count, Large Part. Count.
  • 11. What is a Gateway? • A gateway receives UDP data from all nodes registered to it • An internal service: – Receives data continuously – Opens a server on a specified port – Continually transmits UDP data over this port
  • 12. Requirements We need to.. 1. Collect data from nodes once per second 2. Scale to 100 gateways each with 64 nodes 3. Detect events in real-time 4. Notify users about events in real-time 5. Retain all data collected for years, at least
  • 13. What Is Big Data?
  • 14. What Is Big Data? “When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”
  • 15. Problems Size Transmission Rate Storage
  • 20. Problems Size Transmission Rate Storage Retrieval
  • 21. Collecting Data Problem: Store and retrieve immense amounts of data at a high rate. Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure. (Diagram: Gateway, 8 GB / hour, destination shown as “?”)
  • 22. We Tried PostgreSQL… • Advantages: – Reliable, tested and scalable – Relational => complex queries => analytics • Problems: – Performance problems reading while writing at a high rate; real-time event detection suffers – ‘COPY FROM’ doesn’t permit horizontal scaling
  • 23. Q: How can we decrease I/O load?
  • 24. Q: How can we decrease I/O load? A: Read and write collected data directly from memory
  • 25. Enter Redis Redis is an in-memory NoSQL database Commonly used as a web application cache or pub/sub server
  • 26. Redis • Created in 2009 • Fully In-memory key-value store – Fast I/O: R/W operations are equally fast – Advanced data structures • Publish/Subscribe Functionality – In addition to data store functions – Separate from stored key-value data
  • 27. Persistence • Snapshotting – Data is asynchronously transferred from memory to disk • AOF (Append Only File) – Each modifying operation is written to a file – Can recreate data store by replaying operations – Without interrupting service, will rebuild AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
  • 28. Replication • Redis supports master-slave replication • Master-slave replication can be chained • Be careful: – Slaves are writeable! – Potential for data inconsistency • Fully compatible with Pub/Sub features
  • 29. Redis Features: Advanced Data Structures. List: [A, B, C, D]. Set: {A, B, C, D}. Sorted Set {value:score}: {C:1, D:2, A:3, B:4}. Hash {key:value}: {field1:“A”, field2:“B”…}
  • 31. Constraints Our data store must: – Hold time-series data – Be flexible in querying (by time, node, sensor) – Allow efficient querying of many records – Accept data out of order
  • 32. Tradeoffs: Efficiency vs. Flexibility. One record per timestamp (a single record holding Motion, Light, Audio, Temperature, Pressure, Humidity, Acceleration) vs. one record per sensor data type (separate Motion, Light, Audio, Temperature, Pressure, Humidity, and Acceleration records)
  • 33. Our Solution: Sorted Set. Datapoint key: sensor:env:101. Score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
  • 37. Sorted Set 1357542004000: {"temp":523,..} 1357542005000: {"temp":523,..} 1357542007000: {"temp":530,..} 1357542008000: {"temp":531,..} 1357542009000: {"temp":540,..} 1357542010000: {"temp":545,..} …
  • 38. Sorted Set 1357542004000: {"temp":523,..} 1357542005000: {"temp":523,..} 1357542006000: {"temp":527,..} <- fits nicely 1357542007000: {"temp":530,..} 1357542008000: {"temp":531,..} 1357542009000: {"temp":540,..} 1357542010000: {"temp":545,..} …
  • 39. Know your data structure! A set is still a set… Datapoint score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
  • 41. There is a disturbance in the Force..
  • 43. “In Memory” Means Many Things • The data store capacity is aggressively capped – Redis can only store as much data as the server has RAM
  • 45. We could throw away data… • If we only cared about current values • However, our data – Must be stored for 1+ years for compliance – Must be able to be queried for historical/trend analysis
  • 46. We Still Need Long-term Data Storage Solution? Migrate data to an archive with expansive storage capacity
  • 47. Winning (diagram: Gateway, Redis, Archiver, PostgreSQL)
  • 48. Winning? (diagram: Gateway, Redis, Archiver, PostgreSQL, and Some Poor Client surrounded by “?” marks)
  • 49. Yes, Winning (diagram: Gateway, Redis, Archiver, PostgreSQL, API, Some Happy Client)
  • 50. Best of both worlds. Redis allows quick access to real-time data, for monitoring and event detection. PostgreSQL allows complex queries and scalable storage for deep and historical analysis. (Diagram: Gateway, Redis, Archiver, API, PostgreSQL)
  • 51. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events
  • 52. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events What is “significant”?
  • 53. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events What is “significant”? What about new data types?
  • 54. New guy (Django App with App DB): provide a way to read the data and create rules. Example rule: motion > x && pressure < y && audio > z (diagram: Gateway, Redis, Archiver, API, PostgreSQL, App DB, Django App)
  • 55. New guy (Event Monitor): read the rules and data, trigger alarms. All true? motion > x, pressure < y, audio > z (diagram: Gateway, Redis, Archiver, API, PostgreSQL, Event Monitors, App DB, Django App)
  • 56. Event monitor services can be scaled independently (diagram: Gateway, Redis, Archiver, API, PostgreSQL, Event Monitors, App DB, Django App)
  • 58. Getting The Message Out Considerations • Event monitor already has a job, avoid re-tasking as a notification engine
  • 59. Getting The Message Out Considerations • Event monitor already has a job, avoid re-tasking as a notification engine • Notifications are most efficient as a “push”, with no need to poll
  • 60. Getting The Message Out Considerations • Event monitor already has a job, avoid re-tasking as a notification engine • Notifications are most efficient as a “push”, with no need to poll • Notification system should be generalized, e.g. SMTP, SMS
  • 62. Pub/Sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system, Redis offers pub/sub services as well! (Diagram: Gateway, Redis Data, Redis Pub/Sub, Archiver, API, PostgreSQL, Workers, Notification Worker, SMTP, Event Monitors, App DB, Django App)
  • 63. Conclusions • Redis is a powerful tool for collecting large amounts of data in real-time • In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system • Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
  • 64. Things to watch • Data persistence – if Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory 1 – Redis is unresponsive during startup 1 http://oldblog.antirez.com/post/redis-persistence-demystified.html
  • 65. Future Work • Improve scalability through: – Data encoding – Data compression – Parallel batch inserts for all nodes on a gateway • Deep historical data analytics
  • 66. Acknowledgements • Project engineers Chris Taschner and Jeff Hamed @ CMU SEI • Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/ • Our organizations CMU https://www.cmu.edu CERT http://www.cert.org SEI http://www.sei.cmu.edu Cylab https://www.cylab.cmu.edu
  • 69. Slides of Live Redis Demo
  • 70. A Closer Look at Redis Data
        redis> keys *
        1) "sensor:environment:f80"
        2) "sensor:environment:f81"
        3) "sensor:environment:f82"
        4) "sensor:environment:f83"
        5) "sensor:environment:f84"
        6) "sensor:power:f85"
        7) "sensor:power:f86"
        8) "sensor:radiation:f87"
        9) "sensor:particulate:f88"
  • 71. A Closer Look at Redis Data
        redis> keys sensor:power:*
        1) "sensor:power:f85"
        2) "sensor:power:f86"
  • 72. A Closer Look at Redis Data
        redis> zcount sensor:power:f85 -inf +inf
        (integer) 3565958
        (45.38s)
  • 73. A Closer Look at Redis Data
        redis> zcount sensor:power:f85 1359728113000 +inf
        (integer) 47
  • 74. A Closer Look at Redis Data
        redis> zrange sensor:power:f85 -1000 -1
        1) "{"long_energy1": 73692453, "total_secs": 6784, "energy": [49, 175, 62, 0, 0, 0], "c2_center": 485, "socket_state": 1, "node_type": "power", "c_p2p_low2": 437, "socket_state1": 0, "mac_address": "103", "c_p2p_low": 494, "rms_current": 6, "true_power": 1158, "timestamp": 1359728143000, "v_p2p_low": 170, "c_p2p_high": 511, "rms_current1": 113, "freq": 60, "long_energy": 4108081, "v_center": 530, "c_p2p_high2": 719, "energy1": [37, 117, 100, 4, 0, 0], "v_p2p_high": 883, "c_center": 509, "rms_voltage": 255, "true_power1": 23235}"
        2) …
  • 75. Redis Python API
        import redis
        pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
        r = redis.Redis(connection_pool=pool)
        # Last 50 datapoints by index
        byindex = r.zrange("sensor:env:f85", -50, -1)
        # ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…
        # Datapoints whose score (timestamp) falls in the given range
        byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
        # ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…
        # Total number of datapoints stored for this node
        size = r.zcount("sensor:env:f85", "-inf", "+inf")
        # 237327L
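
Slides 37 to 39 rely on two sorted-set behaviours: members are kept ordered by score, so a late datapoint slots into place, and each member is unique, so re-adding an identical value only updates its score. The sketch below is a minimal pure-Python model of those two behaviours, not Redis itself; `ToySortedSet` and its method names are hypothetical, chosen to mirror ZADD and ZRANGEBYSCORE.

```python
import json


class ToySortedSet:
    """Toy stand-in for a Redis sorted set: unique members, ordered by score."""

    def __init__(self):
        self._scores = {}  # member (str) -> score (number)

    def zadd(self, member, score):
        """Add a member; returns 1 if new, 0 if only its score was updated."""
        is_new = member not in self._scores
        self._scores[member] = score
        return 1 if is_new else 0

    def zrangebyscore(self, lo, hi):
        """Return members with lo <= score <= hi, ordered by score."""
        hits = [(s, m) for m, s in self._scores.items() if lo <= s <= hi]
        return [m for s, m in sorted(hits)]


readings = ToySortedSet()

# Datapoints arrive out of order; the 1357542006000 reading comes last.
for ts, temp in [(1357542004000, 523), (1357542007000, 530), (1357542006000, 527)]:
    value = json.dumps({"timestamp": ts, "temp": temp})
    readings.zadd(value, ts)

ordered = [json.loads(v)["timestamp"] for v in readings.zrangebyscore(0, float("inf"))]

# Without a timestamp inside the value, two identical readings collapse
# into one member and only the score changes.
bare = ToySortedSet()
first = bare.zadd(json.dumps({"temp": 523}), 1357542004000)
second = bare.zadd(json.dumps({"temp": 523}), 1357542005000)
```

Here `ordered` comes back sorted by timestamp even though the middle reading was inserted last, and `second` reports that no new member was created for the repeated value.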

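The rule on slides 54 and 55 (motion > x && pressure < y && audio > z) can be read as a list of per-sensor comparisons that must all hold for a single datapoint. This is a minimal sketch, assuming rules are stored as (field, operator, threshold) tuples; that representation, the `rule_matches` name, and the threshold values are illustrative and not taken from the slides. The field names come from the datapoint on slide 33.

```python
import operator

# Comparison operators a rule is allowed to use.
OPS = {">": operator.gt, "<": operator.lt}


def rule_matches(rule, datapoint):
    """Return True only if every condition in the rule holds for the datapoint."""
    return all(
        field in datapoint and OPS[op](datapoint[field], threshold)
        for field, op, threshold in rule
    )


# Hypothetical thresholds standing in for x, y and z.
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]

datapoint = {"motion": 203, "pressure": 99007, "audio_p2p": 460, "temp": 523}
quiet = dict(datapoint, audio_p2p=120)

alarm = rule_matches(rule, datapoint)   # all three conditions hold
no_alarm = rule_matches(rule, quiet)    # the audio condition fails
```

Keeping rules as plain data rather than code is one way an application could let users create them and an event monitor read them, as the slides describe.
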
Editor's Notes

  1. START AARON
  2. Welcome. <Introductions, who we are, where we’re from>
  3. AARON LAST SLIDE. Let’s start with some background. We’ve been working with a CMU research group on applications of a research project called Sensor Andrew. The vision of Sensor Andrew is to provide a generalized environmental sensor network, capable of being leveraged for a wide variety of applications, both academic and commercial.
  4. START TIM
  5. A Sensor Andrew system consists primarily of nodes, like this <hold one up if possible>, each of which contains a variety of embedded sensors, and a gateway with a specialized receiver, allowing it to receive wireless messages from each of up to 64 nodes concurrently. Our collaborators have provided hardware design and gear, firmware on all embedded components, and some baseline software to work from when interfacing with the hardware systems.
  6. Let’s look at some more detail on the type of data we are collecting. We currently have two types of nodes, environmental and power nodes <show samples>. Environmental nodes can be set anywhere, and will detect measures of light, audio, humidity, pressure, motion, temperature, and acceleration (in x,y,z components) relative to the environment immediately surrounding the node. Power nodes must be plugged in to a wall outlet, with a current-drawing device using it to draw power. This allows the power node to detect and transmit numerous measures of data involving current, voltage, power, etc. Data is transmitted from the nodes in UDP format. For reference, an environmental data packet is ____ bytes in size, and a power data packet is ____ bytes.
  7. Packets are UDP and the information is stored as an encoded string, so the network load is already pretty small. Compression, in addition to the encoding of the data, might be an option in the future, but that’s a small hurdle if we need it.
  8. A terabyte per week isn’t tera-bly big, but it adds up when the data needs to stick around for a long time. Compression can ease the pain. Again, not expensive to implement if necessary.
  9. This is an interesting part of the architecture. The nodes are pinging only once per second, and even at the gateway and collector stage, we’re actually limited to 64 pings per second. This pushes the point of convergence to..
  10. .. storage. We need fast writing, but we also need fast reading.
  11. and we also need fast reading, simultaneously.
  12. 120 loaded gateways = 7680 nodes. 1 record/sec => about 27.6 million records / hour. 300 kb / record => 8GB/hour / 184GB/day
  13. There are two primary I/O bottlenecks in all network applications: 1) Network I/O and 2) Filesystem I/O. In general, we will have no control over the network infrastructures of deployment sites, so we really can’t do anything about Network I/O. That leaves Filesystem I/O.
  14. The best way to mitigate the Filesystem I/O bottleneck is to avoid the filesystem altogether.
  15. TIM LAST SLIDE
  16. START AARON
  17. AARON LAST SLIDE. We originally tried separating out each data value into a separate key (you can talk more about this on the next slide, when you have the example datapoint in front of you). This allowed extremely efficient querying, as we could query ‘motion’ data independently from ‘audio’ data. However, the overhead was significant in two respects: (1) We had to store metadata (timestamp, node type, node MAC address, etc.) with each record, so a lot more data duplication and space inefficiency. (2) The number of inserts per second skyrocketed, e.g. x7 inserts per second for environmental nodes.
  18. START TIM
  19. If two data packets had exactly the same environmental values, but with a different score, Redis would update the existing set member with the new score, instead of creating a new set member. Storing the timestamp inside the value as well as in the score keeps each member distinct, which leads to some data duplication that adds up over millions of records.
  20. TIM LAST SLIDE
  21. START AARON
  22. Which can make Redis burst at the seams…
  23. AARON LAST SLIDE
  24. START TIM
  25. AARON
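
The sizing in note 12 can be checked with a few lines of arithmetic. This sketch assumes each record is roughly 300 bytes, which is the size that makes the note's 8 GB/hour figure work out; the note's own unit ("300 kb") is ambiguous, so treat the record size as an assumption rather than a measured value.

```python
GATEWAYS = 120           # "120 loaded gateways" from note 12
NODES_PER_GATEWAY = 64   # each gateway serves up to 64 nodes
RECORDS_PER_SECOND = 1   # each node reports once per second
RECORD_BYTES = 300       # assumed record size; see the lead-in

nodes = GATEWAYS * NODES_PER_GATEWAY
records_per_hour = nodes * RECORDS_PER_SECOND * 3600
gb_per_hour = records_per_hour * RECORD_BYTES / 1e9
gb_per_day = gb_per_hour * 24

print(f"{nodes} nodes, {records_per_hour:,} records/hour")
print(f"{gb_per_hour:.1f} GB/hour, {gb_per_day:.0f} GB/day")
```

Under these assumptions the hourly volume lands close to the 8 GB/hour quoted in the note; the daily total is sensitive to the exact record size and to whether GB means 10^9 or 2^30 bytes.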