Warp 10 - Time Series Analysis on top of Hadoop - HUG France - Paris Spark Meetup 2017-05-16
1. Spark Meetup - 2017-05-16, Paris
Mathias Herberts (@herberts) - CTO, Cityzen Data
Warp 10 - Simplifying analysis of time series data on top of Hadoop
2. `whoami`
Former Senior SRE on Bigtable at Google
Former head of Big Data at Crédit Mutuel Arkéa
Pioneer in the use of Hadoop & HBase in production since 2009
Co-Founder and CTO of Cityzen Data, maker of Warp 10
@herberts
20. Advanced stack-based language
■ Result is a JSON array of the various stack levels
■ Support for variables and context saving
■ Code serialization
■ Loops, conditionals, macros - Data Flow model
■ Secure code execution, resource limits
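To make the stack model concrete, here is a minimal toy evaluator in Python, illustrative only: real WarpScript has a far richer function set, typed stacks, and resource limits. The token names (`DUP`, `SWAP`, `+`) mirror common stack-language operations; everything else is an assumption of this sketch.

```python
# Toy stack-based evaluator sketch (NOT the WarpScript implementation).
# Tokens are consumed left to right against a single data stack; the
# final result is the whole stack, which serializes naturally to a JSON array.

def evaluate(tokens):
    stack = []
    for tok in tokens:
        if tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "DUP":
            stack.append(stack[-1])
        elif tok == "SWAP":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            stack.append(float(tok))  # anything else is pushed as a number
    return stack

print(evaluate("1 2 + DUP".split()))  # [3.0, 3.0]
```

The key property shown is the last bullet of the slide above the first: since the result is simply the stack itself, returning it as a JSON array of the stack levels is trivial.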
21. 5 high level frameworks
■ BUCKETIZE - transform a series so it has regularly spaced ticks
■ MAP - apply a function on a sliding window
■ REDUCE - tick by tick computation on multiple series, producing a single one
■ FILTER - select series based on various criteria
■ APPLY - tick by tick application of an n-ary function
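A rough Python sketch of three of these frameworks on toy `(timestamp, value)` series may help fix the semantics; the function names and signatures here are invented for illustration and do not match Warp 10's actual API.

```python
# Illustrative semantics of BUCKETIZE / MAP / REDUCE on toy time series,
# where a series is a sorted list of (timestamp, value) tuples.

def bucketize(series, span, aggregate):
    """BUCKETIZE: regularize ticks by grouping values into fixed-width buckets."""
    buckets = {}
    for ts, v in series:
        buckets.setdefault((ts // span) * span, []).append(v)
    return sorted((ts, aggregate(vs)) for ts, vs in buckets.items())

def sliding_map(series, width, fn):
    """MAP: apply fn over a sliding window of `width` points."""
    return [(series[i][0], fn([v for _, v in series[i - width + 1:i + 1]]))
            for i in range(width - 1, len(series))]

def reduce_series(many, fn):
    """REDUCE: tick-by-tick combination of several aligned series into one."""
    return [(ticks[0][0], fn([v for _, v in ticks])) for ticks in zip(*many)]

s = [(0, 1.0), (7, 3.0), (12, 5.0)]
print(bucketize(s, 10, max))       # [(0, 3.0), (10, 5.0)]
print(sliding_map(s, 2, sum))      # [(7, 4.0), (12, 8.0)]
print(reduce_series([s, s], sum))  # [(0, 2.0), (7, 6.0), (12, 10.0)]
```

Note that `reduce_series` assumes the input series are already aligned on the same ticks, which is exactly what BUCKETIZE provides first in a typical pipeline.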
25. WarpScript Server Side Macros
<%
<'
This macro does such and such…
@param xxx
@param yyy
'>
DOC
// Store the current context so we can create symbols freely
SAVE '_context' STORE
// Insert your code here
// Restore original context
$_context RESTORE
%> 'macro' STORE
// Unit tests
// Leave the macro on the stack
$macro
// Use via @path/to/macro in your scripts
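The SAVE/RESTORE pattern above can be sketched in Python as a snapshot-and-restore of a symbol table; the names below (`run_macro`, `symbols`) are hypothetical and only illustrate why the pattern lets a macro define variables without clobbering its caller.

```python
# Sketch of the SAVE ... STORE / RESTORE idiom: snapshot the caller's
# symbol table, let the macro body create symbols freely, then put the
# original context back even if the body raises.

def run_macro(symbols, body):
    saved = dict(symbols)    # SAVE '_context' STORE
    try:
        body(symbols)        # macro body may STORE anything it likes
    finally:
        symbols.clear()      # $_context RESTORE
        symbols.update(saved)

env = {"x": 1}
run_macro(env, lambda s: s.update(tmp=42, x=99))
print(env)  # {'x': 1} -- the caller's context is untouched
```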
26. WarpScript Extensions
import io.warp10.script.sdk.WarpScriptExtension;
import io.warp10.script.NamedWarpScriptFunction;
import io.warp10.script.WarpScriptException;
import io.warp10.script.WarpScriptStack;
import io.warp10.script.WarpScriptStackFunction;
public class MyExtension extends WarpScriptExtension {
private static Map<String,Object> functions = new HashMap<String,Object>();
private static class MyStackFunction extends NamedWarpScriptFunction
implements WarpScriptStackFunction {
@Override
public Object apply(WarpScriptStack stack) throws WarpScriptException {
….
return stack;
}
}
static { functions.put("XXX", new MyStackFunction("XXX")); }
@Override
public Map<String, Object> getFunctions() {
return functions;
}
}
27. CALLing external programs
#!/usr/bin/env python -u
import cPickle, sys, urllib, base64
# Output the maximum number of instances of this 'callable' to spawn
print 10
# Loop, reading stdin, doing our stuff and outputting to stdout
while True:
try:
line = sys.stdin.readline()
line = line.strip()
line = urllib.unquote(line.decode('utf-8'))
# Remove Base64 encoding
str = base64.b64decode(line)
args = cPickle.loads(str)
# Do our stuff
output = ….
# Output result (URL encoded UTF-8).
print urllib.quote(output.encode('utf-8'))
except Exception as err:
print ' ' + urllib.quote(repr(err).encode('utf-8'))
...
->PICKLE 'UTF-8'
BYTES-> ->B64
'path/to/file' CALL
B64-> PICKLE->
....
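The framing between the WarpScript side and the callable is pickle wrapped in Base64 wrapped in URL-encoding, one line per invocation. Here is a Python 3 round-trip sketch of that wire format (the slide's script is Python 2); the helper names are invented for illustration.

```python
# Round-trip sketch of the line format the 'callable' above reads on stdin:
# pickle -> Base64 -> URL-encode on the way in, and the reverse on the way out.
import base64
import pickle
import urllib.parse

def encode_args(args):
    """Frame arbitrary arguments as a single URL-safe text line."""
    return urllib.parse.quote(base64.b64encode(pickle.dumps(args)))

def decode_line(line):
    """Reverse the framing, as the callable does for each stdin line."""
    raw = base64.b64decode(urllib.parse.unquote(line.strip()))
    return pickle.loads(raw)

wire = encode_args({"op": "mean", "values": [1, 2, 3]})
print(decode_line(wire))  # {'op': 'mean', 'values': [1, 2, 3]}
```

Because every payload is reduced to one percent-encoded ASCII line, the protocol stays trivially line-oriented, which is what lets the callable be a plain loop over stdin.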
43. Warp10InputFormat
■ Read data stored in Warp 10 at millions of datapoints per second
■ Standard Hadoop InputFormat
■ Compatible with any tool relying on such an InputFormat
■ Compact representation of time series, lower memory footprint
44. Integration with Spark
■ Enable the use of WarpScript code in the Spark DAG
■ Provide both WarpScriptFunction and WarpScriptFlatMapFunction
■ Manipulate RDD/DataSet/DataFrame elements on the WarpScript stack
■ Extend WarpScript to support custom types if needed
■ Load time series data from any source (Parquet, SQL, …)
45. DataFrame df = sqlc.read().parquet(...);
RDD<Row> rdd = df.rdd();
JavaRDD<Row> jrdd = rdd.toJavaRDD();
JavaRDD<Row> out = jrdd.mapPartitions(
    new WarpScriptFlatMapFunction<Iterator<Row>, Row>("@ext-macro.mc2"));
JavaPairRDD<Row, Iterable<Row>> grouped = out.groupBy(
    new WarpScriptFunction<Row, Row>("[ 0 1 ] SUBLIST ->SPARKROW"));
JavaRDD<Row> merged = grouped.map(
    new WarpScriptFunction<Tuple2<Row, Iterable<Row>>, Row>(
        "LIST-> DROP 0 GET [] SWAP <% SPARK-> 2 GET UNWRAP +! %> FOREACH MERGE WRAPRAW + 2 GET 1 ->LIST ->SPARKROW"));
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("wrapper", DataTypes.BinaryType, false));
StructType st = new StructType(fields.toArray(new StructField[0]));
DataFrame df2 = sqlc.createDataFrame(merged, st);
df2.write().parquet("/path/to/output/parquetfile");
46. Integration with Pig
■ Enable the use of WarpScript code in Pig scripts
■ Provide a WarpScriptRun UDF
■ Manipulate Pig types (tuples, bags, …) on the WarpScript stack
■ Represent time series in a very compact form to speed up processing
■ Load time series data from any source
47. REGISTER warp10-pig-0.0.10-rc2.jar;
SET warp.timeunits 'us';
DEFINE WarpScriptRun io.warp10.pig.WarpScriptRun();
GTS = LOAD '$input' USING PigStorage() AS (gts: chararray);
-- Retain only the 'frequency' GTS and chunk them by 5 minutes
FREQCHUNKS = FOREACH GTS GENERATE
FLATTEN( WarpScriptRun('DUP UNWRAPEMPTY NAME "frequency" == <% UNWRAP 0 5 m 0 0 "chunkid" false CHUNK WRAP %> <% [] %> IFTE ->V ', gts));
-- Flatten the bag
CHUNKS = FOREACH FREQCHUNKS GENERATE FLATTEN($0);
-- Generate station id, chunk id, gts
BYSTATIONCHUNK = FOREACH CHUNKS GENERATE FLATTEN( WarpScriptRun('DUP UNWRAP LABELS DUP "chunkid" GET SWAP "stationid" GET ', $0))
AS (stationid: chararray, chunkid: chararray, gts: chararray);
-- Group by station id, chunk id
STATIONCHUNKGROUP = GROUP BYSTATIONCHUNK BY (stationid, chunkid) PARALLEL 20;
-- Merge the GTS to reconstruct the chunk and emit station id, chunk id, gts
FULLCHUNKS = FOREACH STATIONCHUNKGROUP GENERATE
FLATTEN(
WarpScriptRun('V-> <% DROP 2 GET UNWRAP %> LMAP MERGE DUP LABELS SWAP WRAP SWAP DUP "chunkid" GET SWAP "stationid" GET ', BYSTATIONCHUNK))
AS (stationid: chararray, chunkid: chararray, gts: chararray);
STORE FULLCHUNKS INTO '$output' USING PigStorage('\t');
49. And also...
■ Integration with Flink
■ Integration with Zeppelin via a WarpScript interpreter
■ Warp 10 sink to push data to Warp 10 once it has been processed
■ Coherent approach in ad-hoc, batch, and streaming modes
■ Reduce the amount of code you need to write, focus on business problems
51. Thank you!
curl -O -L https://dl.bintray.com/cityzendata/generic/io/warp10/warp10/1.2.7/warp10-1.2.7.tar.gz
tar zxpf warp10-1.2.7.tar.gz
export JAVA_HOME=/path/to/java/home; cd warp10-1.2.7; ./bin/warp10-standalone.init start
3 steps to get you started with Warp 10
A set of resources to learn, ask and share
@warp10io
http://www.warp10.io/
http://groups.google.com/forum/#!forum/warp10-users
https://github.com/cityzendata