Deck36 is a small team of engineers who specialize in designing, implementing, and operating complex web systems. They discuss their approach to logging everything through a data pipeline that ingests data from producers, transports it via RabbitMQ, stores it in Hadoop HDFS and Amazon S3, runs analytics with Hadoop MapReduce and Amazon EMR, and performs realtime stream processing with Twitter Storm. They also give live demos of their JavaScript data collector client and a PHP/Storm example that processes clickstream data.
3. Stefan & Mike
Dr. Stefan Schadwinkel, Co-Founder / Analytics Engineer, stefan.schadwinkel@deck36.de
Mike Lohmann, Co-Founder / Software Engineer, mike.lohmann@deck36.de
4. ABOUT DECK36
Who We Are
– DECK36 is a young spin-off from ICANS
– Small team of 7 engineers
– Longstanding expertise in designing, implementing, and operating complex web systems
– Developing our own data-intelligence-focused tools and web services
– Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics
5. WHAT WE WILL TALK ABOUT
Topics
– Log everything! – The Data Pipeline.
– Tackling the Leviathan – Realtime Stream Processing with Storm.
– JS Client DataCollector: Live Demo
– Storm Processing with PHP: Live Demo
7. THE DATA PIPELINE
Requirements
Background: Building and operating multiple education communities
Baseline: PokerStrategy.com KPIs
– 6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day
New products → new business models → new questions
– Extendable, generic solution
– Storage and accessibility are more important than specific, optimized applications
9. THE DATA PIPELINE
Logging Pipeline
Producer → Transport → Storage → Analytics / Realtime Stream Processing

Analytics
- Hadoop MapReduce → Amazon EMR, Python, R
- Exports to Excel (CSV), QlikView → Amazon Redshift

Realtime Stream Processing
- Twitter Storm
10. THE DATA PIPELINE
Unified Message Format
- Fixed, guaranteed envelope (sketched below)
- Processing driven by message content
- A single message compresses (LZOP) to about 70% of its original size (1,184 B → 817 B)
- Message bulks compress to about 12-14% of the original size (measured at 42k and 325k messages)
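The slides do not spell out the envelope fields, so the following JavaScript object is only a sketch of what such a unified message could look like; every field name here is hypothetical:

// Hypothetical unified message: a fixed envelope carrying routing
// metadata, plus an event-specific payload.
var message = {
  version: "1.0",
  id: "550e8400-e29b-41d4-a716-446655440000",
  timestamp: "2012-10-01T12:34:56Z",
  type: "icans.content", // processing is driven by this content
  payload: { userId: 12345, action: "post.created" }
};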
12. THE DATA PIPELINE
Compaction
RabbitMQ consumer (Erlang) stores data to the cloud
- Relatively large number of files
- Mixed messages
We want
- A few files
- Messages grouped by "Event Type" and "Time Partition"
- Data transformation
All determined by message content
s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
Hive partitioning!
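Purely as an illustration (the real pipeline does this in the Erlang consumer and the Cascalog compaction job), a JavaScript sketch of how a message could be mapped to its Hive-partitioned S3 key:

// Illustrative only: derive the Hive-partitioned S3 key of a message
// from its event type and timestamp.
function s3KeyFor(message, bucket, website) {
  var d = new Date(message.timestamp);
  var pad = function (n) { return (n < 10 ? '0' : '') + n; };
  return 's3://' + bucket + '/icanslog/' + website + '/' + message.type +
         '/year=' + d.getUTCFullYear() +
         '/month=' + pad(d.getUTCMonth() + 1) +
         '/day=' + pad(d.getUTCDate()) +
         '/part-00000.lzo';
}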
13. THE DATA PIPELINE
Compaction
Using Cascalog
- Based on Clojure (Lisp) and Cascading
- Provides a Datalog-like query language
- Don't do Lisp? → JCascalog
Very handy features (unavailable in Hive or Pig)
- Cascading output taps can be parameterized by data records
- Trap location for corrupted records (the job still finishes for all correct messages)
- Runs within the JVM → large available codebase, arbitrary processing is simple
14. Cascalog Query Syntax
Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) (< ?age 30))

- ?<- is the query operator
- (stdout) is the Cascading output tap
- [?person] declares the columns of the dataset generated by the query
- (age ?person ?age) is a "generator", (< ?age 30) is a "predicate"
- Use as many generators and predicates as you want
- Both can be any Clojure function, and Clojure can call anything that is available within a JVM
15. Cascalog Query Syntax
Run the Cascalog processing on Amazon EMR:
./elastic-mapreduce [standard parameters omitted]
--jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar
--main-class icans.cascalogjobs.processing.compaction
--args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error
16. The Data Pipeline
Data Queries with Hive
Hive is table-based and provides an SQL-like syntax
- Assumes one storage location (directory) per table
- Simple to use if you know SQL
- Widely used, rapid development for "simple" queries
Hive @ Amazon
- Table locations can be S3
- "Cluster on demand" → requires rebuilding the Hive metadata
- CREATE TABLE for source and target S3 locations
- Import table metadata (auto-discovery for partitions)
- INSERT OVERWRITE to query the source table(s) and store to the target S3 location (sketched below)
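A sketch of those steps in HiveQL; table names and schema are hypothetical, the LZO input format configuration is omitted, and RECOVER PARTITIONS is the EMR-specific way to auto-discover the year=/month=/day= partitions:

-- Source table over the compacted, partitioned S3 data (hypothetical schema).
CREATE EXTERNAL TABLE icans_content (message STRING)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';

-- Rebuild partition metadata on the on-demand cluster (EMR extension).
ALTER TABLE icans_content RECOVER PARTITIONS;

-- Target table, also in S3, filled from a query over the source.
CREATE EXTERNAL TABLE messages_per_day (year INT, month INT, day INT, messages BIGINT)
LOCATION 's3://[BUCKET]/results/messages_per_day/';

INSERT OVERWRITE TABLE messages_per_day
SELECT year, month, day, COUNT(*) FROM icans_content GROUP BY year, month, day;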
18. Hive @ Amazon (2)
We can now simply copy the data from S3 and import it into any local analytical tool, e.g. Excel, Redshift, QlikView, R, etc.
19. Further Reading
- More details in the Log Everything! ebook
- Available at Amazon and DeveloperPress
20. THE DATA PIPELINE
Still: It’s Batch Processing
- While quite efficient in flight, the logistics of getting the job started are significant.
- Only cost-efficient for long-distance travel.
21. THE DATA PIPELINE
Instant Insight through Stream Processing
- Often, only updates for the recent day, week, or month are necessary
- Time is of the essence when direct feedback or user interaction is desired
23. REALTIME STREAM PROCESSING
Instant Insight through Stream Processing
- Distributed realtime processing framework
- Battle-proven by Twitter
- All *BINGO abilities fulfilled!
- Hadoop = batch data processing; Storm = realtime data processing
- More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology (see the bolt sketch below)
- Easy to use (Really!)
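Storm itself runs on the JVM; other languages (such as the PHP used in the demo, or JavaScript) attach as shell components through Storm's multilang protocol: JSON messages terminated by a line containing only "end", exchanged over stdin/stdout. A much-simplified JavaScript sketch of the bolt side; the real protocol also includes a startup handshake in which the component reports its PID, which is skipped here:

process.stdin.resume();
process.stdin.setEncoding('utf8');

var buffered = '';
process.stdin.on('data', function (chunk) {
  buffered += chunk;
  var idx;
  // Messages arrive as JSON followed by a line containing only "end".
  while ((idx = buffered.indexOf('\nend\n')) !== -1) {
    var msg = JSON.parse(buffered.slice(0, idx));
    buffered = buffered.slice(idx + 5);
    if (msg.tuple) handleTuple(msg); // ignore handshake messages here
  }
});

function send(obj) {
  process.stdout.write(JSON.stringify(obj) + '\nend\n');
}

function handleTuple(msg) {
  // Pass the tuple downstream, anchored for reliability, then ack it.
  send({ command: 'emit', anchors: [msg.id], tuple: msg.tuple });
  send({ command: 'ack', id: msg.id });
}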
24. Realtime Stream Processing Infrastructure with Storm
[Architecture diagram] Apps & servers (producers) publish into the queue (transport). The Storm cluster consists of Nimbus (master), Zookeeper, and Supervisor nodes running Workers. Results flow onward to NodeJS, S3, the DB, Zabbix, and Graylog.
25. REALTIME STREAM PROCESSING
JS Client Features
- Event system
- Master/Slave Tabs
- Local queuing of data
- Ability to use node modules
- Easy to extend
- Complete development suite
- Deliver bundles with or without vendor libraries
26. Realtime Stream Processing - Loading the JS Client
<script .. src=“https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>
https://../starlog-client.min.js
Create signed
cookie
starlog-client.min.js
Set-Cookie:UUID
/socket.io/1/websockets
Upgrade: websockets
Cookie: UUID
Established connection
Check cookie
HTTP 101 – Protocol Change
Connection: Upgrade
Upgrade: websocket
Collecting Data
Sending data in UMF
Sending data to the client
UMF
NodeJS
Counts
Queue
Backend
Magic
Queue
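On the server side, the cookie check can hook into socket.io's handshake authorization (the socket.io 0.9 API of the time). A sketch; the "umf" event name and the signature check are assumptions, not the actual DECK36 code:

var io = require('socket.io').listen(8080);

// Accept only handshakes that carry the signed UUID cookie.
io.set('authorization', function (handshake, callback) {
  var match = /UUID=([^;]+)/.exec(handshake.headers.cookie || '');
  callback(null, match !== null && verifySignature(match[1]));
});

io.sockets.on('connection', function (socket) {
  socket.on('umf', function (message) {
    // Count and forward the UMF message to the queue (omitted here).
  });
});

// Stand-in for whatever check the backend performs on the signed cookie.
function verifySignature(token) { return token.length > 0; }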
27. Realtime Stream Processing - JS Client in action
Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge

Client-side flow:
1. The ClickEvent collector registers an onclick event handler
2. Clicked data is observed and queued in localstorage
3. The clicked data is wrapped into a UMF message
4. The Clicked-Data UMF is sent over the socket connection to NodeJS
28. Realtime Stream Processing - JS Client in action
function ClickFetcher()
{
    this.collectData = function (callback)
    {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        // Count every click on the page and persist it locally.
        window.onclick = function () {
            var collectedData = {
                // host + path identifies where the click happened
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            // Queue the data in localstorage; the client flushes it
            // to NodeJS over the socket connection.
            localstorage.set(collectedData, function (storageResult)
            {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);
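The backend half of the demo (the Storm/PHP bolt) is not shown in the slides. A hypothetical JavaScript sketch of its logic, keyed to the data format that ClickFetcher above produces:

// Hypothetical backend logic: count clicks per domain and hand out
// the badge on every 10th click, per the use case above.
var clicksPerDomain = {};

function onClickEvent(umf) {
  // ClickFetcher sets key = host + pathname; keep only the host part.
  var domain = umf.key.split('/')[0];
  clicksPerDomain[domain] = (clicksPerDomain[domain] || 0) + 1;
  if (clicksPerDomain[domain] % 10 === 0) {
    sendToClient(umf, { badge: 'Star Trek Commander' });
  }
}

function sendToClient(umf, data) {
  // Push the badge back to the browser via NodeJS / the queue (omitted).
}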