Deck36 is a small team of engineers who specialize in designing, implementing, and operating complex web systems. They discuss their approach to logging everything through a data pipeline that ingests data from producers, transports it via RabbitMQ, stores it in Hadoop HDFS and Amazon S3, runs analytics with Hadoop MapReduce and Amazon EMR, and performs realtime stream processing with Twitter Storm. They also give live demos of their JavaScript data collector client and a PHP/Storm example that processes clickstream data.
3. Stefan & Mike
Dr. Stefan Schadwinkel, Co-Founder / Analytics Engineer, stefan.schadwinkel@deck36.de
Mike Lohmann, Co-Founder / Software Engineer, mike.lohmann@deck36.de
4. ABOUT DECK36
Who We Are
– DECK36 is a young spin-off from ICANS
– Small team of 7 engineers
– Longstanding expertise in designing, implementing, and operating complex web systems
– Developing our own data-intelligence-focused tools and web services
– Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics
5. WHAT WE WILL TALK ABOUT
Topics
– Log everything! – The Data Pipeline.
– Tackling the Leviathan – Realtime Stream Processing with Storm.
– JS Client DataCollector: Live Demo
– Storm Processing with PHP: Live Demo
7. THE DATA PIPELINE
Requirements
Background: Building and operating multiple education communities
Baseline: PokerStrategy.com KPIs
– 6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day
New products → new business models → new questions
– Extendable, generic solution
– Storage and accessibility are more important than specific, optimized applications
9. THE DATA PIPELINE
Logging Pipeline
Producer → Transport → Storage → Analytics / Realtime Stream Processing

Analytics
- Hadoop MapReduce → Amazon EMR, Python, R
- Exports to Excel (CSV), QlikView → Amazon Redshift

Realtime Stream Processing
- Twitter Storm
10. THE DATA PIPELINE
Unified Message Format
- Fixed, guaranteed envelope (sketched below)
- Processing driven by message content
- A single message compresses (LZOP) to about 70% of its original size (1,184 B → 817 B)
- Message bulks compress to about 12-14% of the original size (measured at 42k and 325k messages)
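The slides do not spell out the envelope fields, so the following JavaScript object is only a sketch of what such a unified message could look like; every field name here is hypothetical:

// Hypothetical unified message: a fixed envelope carrying routing
// metadata, plus an event-specific payload.
var message = {
  version: "1.0",
  id: "550e8400-e29b-41d4-a716-446655440000",
  timestamp: "2012-10-01T12:34:56Z",
  type: "icans.content", // processing is driven by this content
  payload: { userId: 12345, action: "post.created" }
};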
12. THE DATA PIPELINE
Compaction
RabbitMQ consumer (Erlang) stores data to the cloud
- Relatively large number of files
- Mixed messages
We want
- A few files
- Messages grouped by "Event Type" and "Time Partition"
- Data transformation
All determined by message content
s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
Hive partitioning!
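Purely as an illustration (the real pipeline does this in the Erlang consumer and the Cascalog compaction job), a JavaScript sketch of how a message could be mapped to its Hive-partitioned S3 key:

// Illustrative only: derive the Hive-partitioned S3 key of a message
// from its event type and timestamp.
function s3KeyFor(message, bucket, website) {
  var d = new Date(message.timestamp);
  var pad = function (n) { return (n < 10 ? '0' : '') + n; };
  return 's3://' + bucket + '/icanslog/' + website + '/' + message.type +
         '/year=' + d.getUTCFullYear() +
         '/month=' + pad(d.getUTCMonth() + 1) +
         '/day=' + pad(d.getUTCDate()) +
         '/part-00000.lzo';
}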
13. THE DATA PIPELINE
Compaction
Using Cascalog
- Based on Clojure (Lisp) and Cascading
- Provides a Datalog-like query language
- Don't do Lisp? → JCascalog
Very handy features (unavailable in Hive or Pig)
- Cascading output taps can be parameterized by data records
- Trap location for corrupted records (the job still finishes for all correct messages)
- Runs within the JVM → large available codebase, arbitrary processing is simple
14. Cascalog Query Syntax
Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) (< ?age 30))

- ?<- is the query operator
- (stdout) is the Cascading output tap
- [?person] declares the columns of the dataset generated by the query
- (age ?person ?age) is a "generator", (< ?age 30) is a "predicate"
- Use as many generators and predicates as you want
- Both can be any Clojure function, and Clojure can call anything that is available within a JVM
15. Cascalog Query Syntax
Run the Cascalog processing on Amazon EMR:
./elastic-mapreduce [standard parameters omitted]
--jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar
--main-class icans.cascalogjobs.processing.compaction
--args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error
16. The Data Pipeline
Data Queries with Hive
Hive is table-based and provides an SQL-like syntax
- Assumes one storage location (directory) per table
- Simple to use if you know SQL
- Widely used, rapid development for "simple" queries
Hive @ Amazon
- Table locations can be S3
- "Cluster on demand" → requires rebuilding the Hive metadata
- CREATE TABLE for source and target S3 locations
- Import table metadata (auto-discovery for partitions)
- INSERT OVERWRITE to query the source table(s) and store to the target S3 location (sketched below)
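A sketch of those steps in HiveQL; table names and schema are hypothetical, the LZO input format configuration is omitted, and RECOVER PARTITIONS is the EMR-specific way to auto-discover the year=/month=/day= partitions:

-- Source table over the compacted, partitioned S3 data (hypothetical schema).
CREATE EXTERNAL TABLE icans_content (message STRING)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';

-- Rebuild partition metadata on the on-demand cluster (EMR extension).
ALTER TABLE icans_content RECOVER PARTITIONS;

-- Target table, also in S3, filled from a query over the source.
CREATE EXTERNAL TABLE messages_per_day (year INT, month INT, day INT, messages BIGINT)
LOCATION 's3://[BUCKET]/results/messages_per_day/';

INSERT OVERWRITE TABLE messages_per_day
SELECT year, month, day, COUNT(*) FROM icans_content GROUP BY year, month, day;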
18. Hive @ Amazon (2)
We can now simply copy the data from S3 and import it into any local analytical tool, e.g. Excel, Redshift, QlikView, R, etc.
19. Further Reading
- More details in the Log Everything! ebook
- Available at Amazon and DeveloperPress
20. THE DATA PIPELINE
Still: It’s Batch Processing
- While quite efficient in flight, the logistics of getting the job started are significant.
- Only cost-efficient for long-distance travel.
21. THE DATA PIPELINE
Instant Insight through Stream Processing
- Often, only updates for the recent day, week, or month are necessary
- Time is of the essence when direct feedback or user interaction is desired
23. REALTIME STREAM PROCESSING
Instant Insight through Stream Processing
- Distributed realtime processing framework
- Battle-proven by Twitter
- All *BINGO abilities fulfilled!
- Hadoop = batch data processing; Storm = realtime data processing
- More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology (see the bolt sketch below)
- Easy to use (Really!)
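Storm itself runs on the JVM; other languages (such as the PHP used in the demo, or JavaScript) attach as shell components through Storm's multilang protocol: JSON messages terminated by a line containing only "end", exchanged over stdin/stdout. A much-simplified JavaScript sketch of the bolt side; the real protocol also includes a startup handshake in which the component reports its PID, which is skipped here:

process.stdin.resume();
process.stdin.setEncoding('utf8');

var buffered = '';
process.stdin.on('data', function (chunk) {
  buffered += chunk;
  var idx;
  // Messages arrive as JSON followed by a line containing only "end".
  while ((idx = buffered.indexOf('\nend\n')) !== -1) {
    var msg = JSON.parse(buffered.slice(0, idx));
    buffered = buffered.slice(idx + 5);
    if (msg.tuple) handleTuple(msg); // ignore handshake messages here
  }
});

function send(obj) {
  process.stdout.write(JSON.stringify(obj) + '\nend\n');
}

function handleTuple(msg) {
  // Pass the tuple downstream, anchored for reliability, then ack it.
  send({ command: 'emit', anchors: [msg.id], tuple: msg.tuple });
  send({ command: 'ack', id: msg.id });
}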
24. Realtime Stream Processing Infrastructure with Storm
[Architecture diagram] Apps & servers (producers) publish into the queue (transport). The Storm cluster consists of Nimbus (master), Zookeeper, and Supervisor nodes running Workers. Results flow onward to NodeJS, S3, the DB, Zabbix, and Graylog.
25. REALTIME STREAM PROCESSING
JS Client Features
- Event system
- Master/Slave Tabs
- Local queuing of data
- Ability to use node modules
- Easy to extend
- Complete development suite
- Deliver bundles with or without vendor libraries
26. Realtime Stream Processing - Loading the JS Client
<script .. src=“https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>
https://../starlog-client.min.js
Create signed
cookie
starlog-client.min.js
Set-Cookie:UUID
/socket.io/1/websockets
Upgrade: websockets
Cookie: UUID
Established connection
Check cookie
HTTP 101 – Protocol Change
Connection: Upgrade
Upgrade: websocket
Collecting Data
Sending data in UMF
Sending data to the client
UMF
NodeJS
Counts
Queue
Backend
Magic
Queue
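On the server side, the cookie check can hook into socket.io's handshake authorization (the socket.io 0.9 API of the time). A sketch; the "umf" event name and the signature check are assumptions, not the actual DECK36 code:

var io = require('socket.io').listen(8080);

// Accept only handshakes that carry the signed UUID cookie.
io.set('authorization', function (handshake, callback) {
  var match = /UUID=([^;]+)/.exec(handshake.headers.cookie || '');
  callback(null, match !== null && verifySignature(match[1]));
});

io.sockets.on('connection', function (socket) {
  socket.on('umf', function (message) {
    // Count and forward the UMF message to the queue (omitted here).
  });
});

// Stand-in for whatever check the backend performs on the signed cookie.
function verifySignature(token) { return token.length > 0; }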
27. Realtime Stream Processing - JS Client in action
Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge

Client-side flow:
1. The ClickEvent collector registers an onclick event handler
2. Clicked data is observed and queued in localstorage
3. The clicked data is wrapped into a UMF message
4. The Clicked-Data UMF is sent over the socket connection to NodeJS
28. Realtime Stream Processing - JS Client in action
function ClickFetcher()
{
    this.collectData = function (callback)
    {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        // Count every click on the page and persist it locally.
        window.onclick = function () {
            var collectedData = {
                // host + path identifies where the click happened
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            // Queue the data in localstorage; the client flushes it
            // to NodeJS over the socket connection.
            localstorage.set(collectedData, function (storageResult)
            {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);
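The backend half of the demo (the Storm/PHP bolt) is not shown in the slides. A hypothetical JavaScript sketch of its logic, keyed to the data format that ClickFetcher above produces:

// Hypothetical backend logic: count clicks per domain and hand out
// the badge on every 10th click, per the use case above.
var clicksPerDomain = {};

function onClickEvent(umf) {
  // ClickFetcher sets key = host + pathname; keep only the host part.
  var domain = umf.key.split('/')[0];
  clicksPerDomain[domain] = (clicksPerDomain[domain] || 0) + 1;
  if (clicksPerDomain[domain] % 10 === 0) {
    sendToClient(umf, { badge: 'Star Trek Commander' });
  }
}

function sendToClient(umf, data) {
  // Push the badge back to the browser via NodeJS / the queue (omitted).
}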