1. How to extract valuable information from real-time data feeds
Gene Leybzon, February 2016
2. “The critical challenge is using
this data when it is still in
motion – and extracting
valuable information from it.”
- Frédéric Combaneyre, SAS
IoT Challenge
3. Detect events of interest and trigger appropriate
actions
Aggregate information for monitoring
Sensor data cleansing and validation
Real-time predictive and optimized operations
(support for real-time decision making)
Role of Data Streams
8. Transform data — convert the data into another format, for example,
converting a captured device signal voltage to a calibrated unit measure of
temperature
Aggregate and compute data — by combining data you can add safeguards, such
as averaging readings across multiple devices to avoid acting on a single
spurious device, or ensuring you still have actionable data if a single
device goes offline. By adding computation to your pipeline, you can apply
streaming analytics to data while it is still in the processing pipeline.
Enrich data — You can combine the device-generated data with other
metadata about the device, or with other datasets, such as weather or
traffic data, for use in subsequent analysis.
Move data — You can store the processed data in one or more final storage
locations.
Role of “Pipelines”
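The four pipeline stages above (transform, aggregate/compute, enrich, move) can be sketched with plain Python generators. All names and the voltage-to-temperature scale factor are illustrative assumptions, not a real pipeline API:

```python
# Illustrative sketch of the four pipeline stages; names and the linear
# voltage-to-temperature conversion are assumptions for the example.

def transform(readings):
    # Transform: convert a raw device voltage to a calibrated temperature.
    for r in readings:
        yield {**r, "temp_c": r["voltage"] * 100.0}

def aggregate(readings, group_size=3):
    # Aggregate: average across several devices so one spurious reading is damped.
    batch = []
    for r in readings:
        batch.append(r)
        if len(batch) == group_size:
            avg = sum(x["temp_c"] for x in batch) / group_size
            yield {"temp_c_avg": avg}
            batch = []

def enrich(aggregates, metadata):
    # Enrich: attach device/site metadata for use in later analysis.
    for a in aggregates:
        yield {**a, **metadata}

def move(records, sink):
    # Move: write processed records to their final storage location.
    for rec in records:
        sink.append(rec)

raw = [{"voltage": 0.21}, {"voltage": 0.22}, {"voltage": 0.20}]
store = []
move(enrich(aggregate(transform(raw)), {"site": "plant-7"}), store)
print(store)  # one averaged record, enriched with site metadata
```

In a real deployment each stage would typically be a separate operator in a stream-processing framework rather than a generator in one process; the point here is only the stage ordering.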
10. Fault-tolerance against hardware failures and human errors
Support for a variety of use cases that include low latency
querying as well as updates
Linear scale-out capabilities, meaning that throwing more
machines at the problem should help with getting the job done
Extensibility so that the system is manageable and can
accommodate newer features easily
Consistency - data is the same across the cluster
Availability - ability to access the cluster even if a node in the
cluster goes down
Partition-tolerance - cluster continues to function even if there is
a "partition" (communications break) between two nodes
What do we want from a stream
architecture?
11. “It is impossible for a distributed computer system to
simultaneously provide all three of the following
guarantees:
Consistency (all nodes see the same data at the same
time)
Availability (a guarantee that every request receives a
response about whether it succeeded or failed)
Partition tolerance (the system continues to operate
despite arbitrary partitioning due to network
failures)”
CAP Theorem
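A toy two-replica sketch makes the quoted trade-off concrete: during a network partition, a replica must either answer with possibly stale data (keeping availability, giving up consistency) or refuse to answer (keeping consistency, giving up availability). This is an illustration, not a real distributed system:

```python
# Toy illustration of the CAP trade-off: once a partition exists, a replica
# must choose between availability (answer, possibly stale) and consistency
# (refuse to answer rather than return stale data).

class Replica:
    def __init__(self):
        self.value = "v1"
        self.partitioned = False

    def read(self, prefer):
        if not self.partitioned:
            return self.value
        if prefer == "availability":
            return self.value  # may be stale -> gives up consistency
        raise TimeoutError("unavailable during partition")  # gives up availability

a, b = Replica(), Replica()
b.partitioned = True   # b can no longer see writes applied to a
a.value = "v2"         # the write reaches replica a only

print(a.read("availability"))  # "v2"
print(b.read("availability"))  # "v1" — stale: available but inconsistent
```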
12. Facing the CAP Theorem
[CAP triangle diagram: vertices Consistency, Availability, Partition
Tolerance; placed on its edges are Cassandra, Riak, CouchBase, MongoDB,
the consensus protocols Paxos, Zab, and Raft, the λ-architecture, and ∅
on the consistency–availability side.]
14. One-way data flow (doesn’t transact and make per-event decisions on the
streaming data, nor does it respond immediately to the events coming in)
Eventual consistency
NoSQL
Complexity
Limitations of the λ-Architecture
18. Distributed stream processing framework
Simple API
Fault tolerance
Manages stream state
Guarantee that messages are processed in the order
they were written to a partition, and that no
messages are ever lost.
Apache Samza
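The per-partition ordering and no-loss guarantee can be sketched as a consumer that reads one partition's log in offset order and checkpoints the committed offset. This is plain Python mimicking the idea, not the Samza API:

```python
# Minimal sketch (not the Samza API): within one partition, messages are
# handled strictly in offset order, and a checkpointed offset lets a
# restarted consumer resume without losing or reordering messages.

partition = ["m0", "m1", "m2", "m3"]  # ordered log for a single partition

def process(log, start_offset, checkpoint):
    seen = []
    for offset in range(start_offset, len(log)):
        seen.append(log[offset])            # handled in written order
        checkpoint["offset"] = offset + 1   # commit after processing
    return seen

ckpt = {"offset": 0}
first = process(partition, ckpt["offset"], ckpt)   # ["m0", "m1", "m2", "m3"]
# After a crash and restart, resuming from the checkpoint loses nothing:
second = process(partition, ckpt["offset"], ckpt)  # []
```

Note that ordering holds only *within* a partition; messages in different partitions are processed independently.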
25. Apache Cassandra
Decentralized (Every node in the cluster has the same role.)
No single point of failure.
Scalable
Read and write throughput both increase linearly as new machines
are added, with no downtime or interruption to applications.
Fault-tolerant
Tunable level of consistency, all the way from "writes never fail" to
"block for all replicas to be readable"
Hadoop integration, integration with MapReduce
Query language
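Cassandra-style tunable consistency is often summarized by the replica-overlap rule: with replication factor N, write consistency W, and read consistency R, a read is guaranteed to see the latest write when R + W > N. A minimal sketch of that arithmetic:

```python
# Replica-overlap rule behind tunable consistency: a read and a write are
# guaranteed to intersect on at least one replica when R + W > N.

def strongly_consistent(n, w, r):
    return r + w > n

N = 3  # replication factor
print(strongly_consistent(N, w=1, r=1))  # False: fast, eventually consistent
print(strongly_consistent(N, w=2, r=2))  # True: quorum reads and writes
print(strongly_consistent(N, w=3, r=1))  # True: "block for all replicas"
```

The two slide extremes map onto this knob: "writes never fail" corresponds to low W, while "block for all replicas to be readable" corresponds to W = N.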
26. Apache Flink
• High performance
• Low latency
• Support for out-of-order events
• Flexible streaming windows
• Fault tolerance
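The out-of-order and windowing bullets can be illustrated with event-time tumbling windows in plain Python (not the Flink API): events carry their own timestamps, so the window they land in depends on event time, not arrival order.

```python
# Sketch of event-time tumbling windows: assignment uses each event's own
# timestamp, so out-of-order arrival does not change the window contents.
from collections import defaultdict

def tumbling_windows(events, size):
    # Assign each (event_time, value) pair to the window containing its event time.
    windows = defaultdict(list)
    for ts, value in events:
        windows[ts // size * size].append(value)
    return dict(windows)

in_order = [(1, "a"), (3, "b"), (7, "c")]
out_of_order = [(7, "c"), (1, "a"), (3, "b")]  # same events, shuffled arrival

w1 = tumbling_windows(in_order, size=5)
w2 = tumbling_windows(out_of_order, size=5)
print(w1)        # {0: ['a', 'b'], 5: ['c']}
print(w1 == w2)  # True: per-window contents are identical
```

A real engine additionally needs watermarks to decide when a window can be closed despite stragglers; that part is omitted here.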
30. Takes recent history into account
ML model is updatable (“evolves”
as new data comes in)
How is ML on stream data
different from traditional ML
techniques?
31. Incremental algorithms (both support vector
machines and neural networks can be trained
incrementally)
Periodic retraining with a new data batch
Two Approaches to Adapting ML to
Stream Data
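The first approach (incremental algorithms) can be sketched as an online model updated one event at a time with a stochastic-gradient step, so it "evolves" as stream data arrives instead of being retrained from scratch. The model, learning rate, and data are illustrative assumptions:

```python
# Incremental learning sketch: a one-weight linear model y ≈ w * x updated
# per event with a stochastic-gradient step on squared loss.

def sgd_step(w, x, y, lr=0.1):
    # One online update: move w against the gradient of (w*x - y)^2 / 2.
    error = w * x - y
    return w - lr * error * x

w = 0.0
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50  # true relation: y = 2x
for x, y in stream:
    w = sgd_step(w, x, y)   # model evolves with every arriving event
print(round(w, 3))          # converges toward 2.0
```

The second approach would instead buffer the stream into batches and rerun a conventional training job on each batch; the incremental version trades some accuracy control for constant memory and per-event latency.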