In this talk, Adam Laiacano from Tumblr gives an "Introduction to Digital Signal Processing in Hadoop". Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and he works through a few simple examples of low-pass filter design and application. It's much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.
Adam also works through how they can be used either in a real-time stream or in batch-mode in Hadoop (with Scalding). I'll hopefully have some examples of how to detect trendy meme-ish blogs on Tumblr.
Bio: Adam Laiacano is a Data Scientist and Engineer at Tumblr, a blogging network with over 140 million blogs, where he's responsible for collecting and analyzing large volumes of data to gain a better understanding of trends and activity within the Tumblr community. He holds a Bachelor of Science degree in Electrical Engineering from Northeastern University, and designed signal detection systems for low-power atomic clocks before joining Tumblr.
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Introduction to Digital Signal Processing in Hadoop by Adam Laiacano
1. digital signal processing
in hadoop
with scalding
Adam Laiacano
adam@tumblr.com
@adamlaiacano
Thursday, October 17, 13
2. Overview
• Intro to digital signals and filters
• sampling
• frequency domain
• FIR / IIR filters
• Very quick intro to Scalding
• Filtering tons of signals at once
• Application: Finding trending blogs on tumblr
Thursday, October 17, 13
5. Some Definitions
Signal - Any series of data (Volts, posts, etc) that is measured at regular intervals.
Sampling period, Ts - Time between samples (my example was Ts = 1 day)
Sampling frequency fs - 1 / Ts
Nyquist frequency - Highest frequency that can be represented = fs/2
Filter - A system to reduce or enhance certain aspects (phase, magnitude) of a
signal.
Stopband - The frequency range we want to eliminate
Passband - The frequency range we want to preserve
Cutoff frequency, fc - The boundary of the stopband/passband
Thursday, October 17, 13
23. Recap - FIR Filters
• FIR filters are weighted sums of previous input.
• Can think of them as a generalized Moving
Average
• Required to apply:
• Filter h of length N
• Previous N inputs x
Thursday, October 17, 13
25. Super General Overview
• DSL on top of Cascading, written in scala
• Cascading: Workflow language for dealing
with lots of data. Often in hadoop.
• Similar to pig or hive, but easier to extend
(no UDFs! one language!).
• Feels like “real programming” - compiler!
types!
• Is awesome
Thursday, October 17, 13
26. Less General Overview
• Similar to split/apply/combine paradigm (plyr, pandas)
• Load data into Pipes (like data.frames)
• Each pipe has one or more Fields (columns)
• Perform row-wise operations with map (d$a+d$b)
• Perform field-wise operations in groupBy
Thursday, October 17, 13
28. Scalding Resources
• The best resource is the Scalding wiki page.
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
• Edwin Chen’s post about recommendations.
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
• Source code is FULL of undocumented features!
https://github.com/twitter/scalding
Thursday, October 17, 13
38. import Matrix._
• Scalding has a Matrix library!
• Stores data in a Pipe as ('row,
• Ideal for sparse matricies
'col, 'val)
• L0, L1, L2 norm, inverse, +, -, *
• QR Factorization: http://bit.ly/1hxWF17
• More!
Thursday, October 17, 13
39. Tumblr Social Graph
•
•
•
•
140+ Million Nodes
3.5 Billion Edges
About 100GB of raw text data
3 columns: fromId, toId, timestamp
GOAL: Calculate followers / day for every blog,
apply 1-week moving average.
Thursday, October 17, 13
41. Find blogs who have accelerating follower counts
for the most consecutive days.
“Accelerating”:
New Followers Today > New Followers Yesterday
Thursday, October 17, 13
45. Days of consecutive acceleration
• Binary input: 1 if more followers today than
yesterday, otherwise 0
• Filter the binary signal to produce a value
between 0 and 1
• Anything above a threshold (0.75) is
“accelerating”
Thursday, October 17, 13
46. Days of consecutive acceleration
(filtered signal)
15 days
36 days
48 days
Thursday, October 17, 13
47. Days of consecutive acceleration
(filtered signal)
15 days
36 days
48 days
Thursday, October 17, 13