Why Scala Is Taking Over
the Big Data World
Data Day Texas Jan. 10, 2015
dean.wampler@typesafe.com
Sydney Opera House photos Copyright © Dean Wampler, 2011-2015, Some Rights Reserved.
The content is free to reuse, but attribution is requested.
http://creativecommons.org/licenses/by-nc-sa/2.0/legalcode
<plug>
<shameless>
</shameless>
</plug>
http://programming-scala.org
[Diagram: a typical big-data environment. Servers and a network host batch (ETL) ingestion services and event-handling services, feeding Hadoop and DBs. The people: developers ("You and me!"), DBAs and analysts, and data scientists.]
DBAs, traditional data analysts: SQL, SAS. "DBs" could also be distributed file systems.
Data Scientists - Statistics experts. Some programming, especially Python, R, Julia, maybe Matlab, etc.
Developers like us, who figure out the infrastructure (but don't usually manage it), and write the programs that do batch-oriented ETL (extract, transform, and load), and more real-time event handling.
Often this data gets pumped into Hadoop or other compute engines and data stores, including various SQL and NoSQL DBs, and file systems.
[The same diagram, highlighting the developer-focused portion.]
What we'll focus on.
You are here...
Let's put all this into perspective...
http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg
... and it’s 2008.
Hadoop
Let's drill down to Hadoop, which first gained widespread awareness in 2008-2009, when Yahoo! announced they were running a 10K-core cluster with it, Hadoop became a top-level Apache project, etc.
Quaint, by today's standards:
Tue, Sep 30, 2008 - a 4000-node cluster with 16PB of storage.
The largest clusters today are roughly 10x that size.
[Diagram: a Hadoop v2 cluster. A master node runs the Resource Manager and the Name Node; each slave node runs a Node Manager and a Data Node over its local disks.]
The schematic view of a Hadoop v2 cluster, with YARN (Yet Another Resource Negotiator) handling resource allocation and job scheduling. (V2 is actually circa 2013, but this detail is unimportant for this discussion.) The master services are normally federated for failover (not shown), and there would usually be more than two slave nodes.
The Name Node is the master for the Hadoop Distributed File System. Blocks are managed on each slave by Data Node services.
The Resource Manager decomposes each job into tasks, which are distributed to slave nodes and managed by the Node Managers. There are other services I'm omitting for simplicity.
[The same cluster diagram, now labeling the Name Node and Data Nodes collectively as HDFS.]
[The same cluster diagram, now showing MapReduce jobs submitted to the Resource Manager.]
You submit MapReduce jobs to the Resource Manager. Those jobs could be written in the Java API, or higher-level APIs like Cascading, Scalding, Pig, and Hive.
MapReduce
The classic
compute model
for Hadoop
Historically, up to 2013, MapReduce was the officially-supported compute engine for writing all compute jobs.
Example: Inverted Index
[Diagram: three documents (wikipedia.org/hadoop: "Hadoop provides MapReduce and HDFS"; wikipedia.org/hbase: "HBase stores data in HDFS"; wikipedia.org/hive: "Hive queries HDFS files and HBase tables with SQL") and the inverse index computed from them, stored in blocks, mapping each word, e.g., "hadoop", "hdfs", "hive", "hbase", "and", to (document, count) pairs.]
[Diagram: the full pipeline. A web crawl produces an index of (document, contents) records stored in blocks; a step labeled "Miracle!! Compute Inverted Index" turns that index into the inverse index.]
1 Map step + 1 Reduce step
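Conceptually, the two steps have these shapes (Scala-like signatures, a sketch only; Hadoop's actual Java API appears on the following slides):

// Conceptual shapes of the two steps, not Hadoop's real interfaces:
// map:    each input record yields zero or more key-value pairs.
// reduce: each key, with all the values collected for it, yields outputs.
def map[K1, V1, K2, V2](record: (K1, V1)): Seq[(K2, V2)] = ???
def reduce[K2, V2, K3, V3](key: K2, values: Seq[V2]): Seq[(K3, V3)] = ???

The framework shuffles the mappers' output so that all values for a given key arrive at a single reducer call.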
Problems
Hard to
implement
algorithms...
Nontrivial algorithms are hard to convert to just map and reduce steps, even though you can sequence multiple map+reduce "jobs". It takes specialized expertise in the tricks of the trade.
Problems
... and the
Hadoop API is
horrible:
The Hadoop API is very low-level and tedious...
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LineIndexer {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf =
      new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf,
      new Path("input"));
    FileOutputFormat.setOutputPath(conf,
      new Path("output"));
For example, the classic inverted index, used to convert an index of document locations (e.g., URLs) to words into the reverse: an index from words to doc locations. It's the basis of search engines.
I'm not going to explain the details. The point is to notice all the boilerplate that obscures the problem logic.
Everything is in one outer class. We start with a main routine that sets up the job.
I used yellow for method calls, because methods do the real work!! But notice that most of the functions in this code don't really do a whole lot of work for us...
    JobClient client = new JobClient();
    JobConf conf =
      new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf,
      new Path("input"));
    FileOutputFormat.setOutputPath(conf,
      new Path("output"));
    conf.setMapperClass(
      LineIndexMapper.class);
    conf.setReducerClass(
      LineIndexReducer.class);
    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
Boilerplate
      new Path("output"));
    conf.setMapperClass(
      LineIndexMapper.class);
    conf.setReducerClass(
      LineIndexReducer.class);
    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static class LineIndexMapper
      extends MapReduceBase
      implements Mapper<LongWritable, Text,
                        Text, Text> {

    private final static Text word =
      new Text();

main ends with a try-catch clause to run the job.
  public static class LineIndexMapper
      extends MapReduceBase
      implements Mapper<LongWritable, Text,
                        Text, Text> {

    private final static Text word =
      new Text();
    private final static Text location =
      new Text();

    public void map(
        LongWritable key, Text val,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      FileSplit fileSplit =
        (FileSplit)reporter.getInputSplit();
      String fileName =
        fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();

This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
      FileSplit fileSplit =
        (FileSplit)reporter.getInputSplit();
      String fileName =
        fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new
        StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

  public static class LineIndexReducer
      extends MapReduceBase
      implements Reducer<Text, Text,
                         Text, Text> {

The rest of the LineIndexMapper class and map method.
  public static class LineIndexReducer
      extends MapReduceBase
      implements Reducer<Text, Text,
                         Text, Text> {

    public void reduce(Text key,
        Iterator<Text> values,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      boolean first = true;
      StringBuilder toReturn =
        new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(
          values.next().toString());
      }
      output.collect(key,
        new Text(toReturn.toString()));
    }

The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word, list-string) output.
        Reporter reporter) throws IOException {

      boolean first = true;
      StringBuilder toReturn =
        new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(
          values.next().toString());
      }
      output.collect(key,
        new Text(toReturn.toString()));
    }
  }
}
EOF
Altogether

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LineIndexer {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf =
      new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf,
      new Path("input"));
    FileOutputFormat.setOutputPath(conf,
      new Path("output"));
    conf.setMapperClass(
      LineIndexMapper.class);
    conf.setReducerClass(
      LineIndexReducer.class);
    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static class LineIndexMapper
      extends MapReduceBase
      implements Mapper<LongWritable, Text,
                        Text, Text> {

    private final static Text word =
      new Text();
    private final static Text location =
      new Text();

    public void map(
        LongWritable key, Text val,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      FileSplit fileSplit =
        (FileSplit)reporter.getInputSplit();
      String fileName =
        fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new
        StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

  public static class LineIndexReducer
      extends MapReduceBase
      implements Reducer<Text, Text,
                         Text, Text> {

    public void reduce(Text key,
        Iterator<Text> values,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      boolean first = true;
      StringBuilder toReturn =
        new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(
          values.next().toString());
      }
      output.collect(key,
        new Text(toReturn.toString()));
    }
  }
}

The whole shebang (6pt. font). This would take a few hours to write, test, etc., assuming you already know the API and the idioms for using it.
Early 2012.
Let's put all this into perspective, circa 2012...
http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg
Dean Wampler
“Trolling the
Hadoop community
since 2012...”
In which I claimed that:
Salvation!
Scalding
Twitter wrote a Scala API, https://github.com/twitter/scalding, to hide the mess. Actually, Scalding sits on top of Cascading (http://cascading.org), a higher-level Java API that exposes more sensible "combinators" of operations, but is still somewhat verbose due to the pre-Java 8 conventions it must use. Scalding gives us the full benefits of Scala syntax and functional operations, "combinators".
MapReduce (Java)
Cascading (Java)
Scalding (Scala)
import com.twitter.scalding._

class InvertedIndex(args: Args)
    extends Job(args) {

  val texts = Tsv("texts.tsv", ('id, 'text))

  val wordToIds = texts
    .flatMap(('id, 'text) -> ('word, 'id2)) {
      fields: (Long, String) =>
        val (id2, text) = fields
        text.split("\\s+").map {
          word => (word, id2)
        }
    }

  val invertedIndex =
    wordToIds.groupBy('word) {
      _.toList[Long]('id2 -> 'ids)
    }
Dramatically smaller, succinct code! (https://github.com/echen/rosetta-scone/blob/master/inverted-index/InvertedIndex.scala) Note that this example assumes a slightly different input data format: more than one document per file, with each document id followed by the text, all on a single line, tab-separated.
That's it!

    .flatMap(('id, 'text) -> ('word, 'id2)) {
      fields: (Long, String) =>
        val (id2, text) = fields
        text.split("\\s+").map {
          word => (word, id2)
        }
    }

  val invertedIndex =
    wordToIds.groupBy('word) {
      _.toList[Long]('id2 -> 'ids)
    }

  invertedIndex.write(Tsv("output.tsv"))
}
Problems
MapReduce is
“batch-mode”
only
You can't do event-stream processing (a.k.a. "real-time") with MapReduce, only batch-mode processing.
Storm!
Event stream
processing.
(actually 2011)
Storm is a popular framework for scalable, resilient, event-stream processing.
Storm!
Twitter wrote
Summingbird
for Storm +
Scalding
Twitter wrote a Scalding-like API called Summingbird (https://github.com/twitter/summingbird) that abstracts over Storm and Scalding, so you can write one program that can run in batch mode or process events. (For time's sake, I won't show an example.)
Problems
Flush to disk,
then reread
between jobs
100x perf. hit!
While your algorithm may be implemented using a sequence of MR jobs (which takes specialized skills to write...), the runtime system doesn't understand this, so the output of each job is flushed to disk (HDFS), even if it's TBs of data. Then it is read back into memory as soon as the next job in the sequence starts!
This problem plagues Scalding (and Cascading), too, since they run on top of MapReduce. However, as of mid-2014, Cascading is being ported to a new, faster runtime called Apache Tez, and it might be ported to Spark, which we'll discuss next. Twitter is working on its own optimizations within Scalding. So the perf. issues should go away by the end of 2014.
It’s 2013.
Let's put all this into perspective, circa 2013...
http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg
Salvation v2.0!
Use Spark
The Hadoop community has realized over the last several years that a replacement for MapReduce is needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors followed.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext(
      "local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) =>
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
This implementation is more sophisticated than the Scalding example. It also computes the count/document of each word. Hence, there are more steps (some of which could be merged).
It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java).
The methods are colored yellow again. Note how dense with meaning they are this time.
You begin the workflow by declaring a SparkContext (in "local" mode, in this case). The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to perform the data flow.
Next we read one or more text files. If "data/crawl" has 1 or more Hadoop-style "part-NNNNN" files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) =>
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
      .map {
        case (w, p) => ((w, p), 1)
      }
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      .map {
        case ((w, p), n) => (w, (p, n))
      }
      .groupBy {
        case (w, (p, n)) => w
      }
Now we begin a sequence of transformations on the input data.
First, we map over each line, a string, to extract the original document id (i.e., file name, UUID), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having a schema "String fileName, String text".
flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs.
Then we map over these pairs and add a single "seed" count of 1.
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      .map {
        case ((w, p), n) => (w, (p, n))
      }
      .groupBy {
        case (w, (p, n)) => w
      }
      .map {
        case (w, seq) =>
          val seq2 = seq.toSeq.map {
            case (_, (p, n)) => (p, n)
          }.sortBy {
            case (path, n) => (-n, path)
          }
          (w, seq2.mkString(", "))
      }
      .saveAsTextFile(argz.outpath) // argz: command-line options, parsed elsewhere (not shown)

    sc.stop()
  }
}

((word1, path1), n1)
((word2, path2), n2)
...
reduceByKey does an implicit "group by" to bring together all occurrences of the same (word, path) and then sums up their counts.
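To see the semantics in miniature, here is the same operation emulated with ordinary Scala collections on made-up data (reduceByKey itself is a Spark RDD method):

// Emulating reduceByKey with plain Scala collections, on toy data:
val pairs = Seq(
  (("scala", "doc1"), 1),
  (("scala", "doc1"), 1),
  (("spark", "doc2"), 1))
val reduced = pairs.groupBy(_._1).map {
  case (key, group) => (key, group.map(_._2).sum)
}
// Map(("scala","doc1") -> 2, ("spark","doc2") -> 1)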
(word, Seq((word, (path1, n1)), (word, (path2, n2)), ...))
...
Now we do an explicit group by to bring all the same words together. The output will be (word, Seq((word, (path1, n1)), (word, (path2, n2)), ...)), where Seq is a Scala abstraction for sequences, e.g., Lists, Vectors, etc.
(word, "(path4, n4), (path3, n3), (path2, n2), ...")
...
Back to the code: the last map step looks complicated, but all it does is sort the sequence by the count, descending (because you would want the first elements in the list to be the documents that mention the word most frequently), then it converts the sequence into a string.
We finish by saving the output as text file(s) and stopping the workflow.
Altogether

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext(
      "local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) =>
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
      .map {
        case (w, p) => ((w, p), 1)
      }
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      .map {
        case ((w, p), n) => (w, (p, n))
      }
      .groupBy {
        case (w, (p, n)) => w
      }
      .map {
        case (w, seq) =>
          val seq2 = seq map {
            case (_, (p, n)) => (p, n)
          }
          (w, seq2.mkString(", "))
      }
      .saveAsTextFile(argz.outpath)

    sc.stop()
  }
}

The whole shebang (14pt. font, this time).
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
      .map {
        case (w, p) => ((w, p), 1)
      }
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      .map {
        case ((w, p), n) => (w, (p, n))
      }
      .groupBy {
        case (w, (p, n)) => w
      }
      .map {
        case (w, seq) =>
          val seq2 = seq map {
            case (_, (p, n)) => (p, n)
          }
          (w, seq2.mkString(", "))
      }
      .saveAsTextFile(argz.outpath)

Powerful, beautiful combinators

Stop for a second and admire the simplicity and elegance of this code, even if you don't understand the details. This is what coding should be, IMHO: very concise, to the point, elegant to read. Hence, a highly-productive way to work!!
Another example of a beautiful and profound DSL, in this case from the world of my first profession, Physics: Maxwell's equations that unified Electricity and Magnetism: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
Spark also has a streaming
mode for “mini-batch”
event handling.
(We’ll come back to it...)
Spark Streaming operates on data "slices" with granularity as small as 0.5-1 second. Not quite the same as single-event handling, but possibly all that's needed for ~90% (?) of scenarios.
Winning!
Let's recap why Scala is taking over the big data world.
Elegant DSLs
...
  .map {
    case (w, p) => ((w, p), 1)
  }
  .reduceByKey {
    (n1, n2) => n1 + n2
  }
  .map {
    case ((w, p), n) => (w, (p, n))
  }
  .groupBy {
    case (w, (p, n)) => w
  }
...

You would be hard pressed to find a more concise DSL for big data!!
The JVM

You have the rich Java ecosystem at your fingertips.
Functional Combinators

SQL Analogs

CREATE TABLE inverted_index (
  word    CHARACTER(64),
  id1     INTEGER,
  count1  INTEGER,
  id2     INTEGER,
  count2  INTEGER);

val inverted_index:
  Stream[(String,Int,Int,Int,Int)]
You have functional "combinators", side-effect-free functions that combine/compose together to create complex algorithms with minimal effort.
For simplicity, assume we only keep the two documents where the word appears most frequently, along with the counts in each doc, and we'll assume integer ids for the documents.
We'll model the same data set in Scala with a Stream, because we're going to process it in "batch".
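To make the following snippets concrete, here is hypothetical sample data matching that (word, id1, count1, id2, count2) shape (invented for illustration, not from the talk):

val inverted_index: Stream[(String, Int, Int, Int, Int)] = Stream(
  ("scala",    1, 20, 5, 12),
  ("scalding", 1,  9, 4,  4),
  ("spark",    2, 18, 3,  7))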
Functional Combinators

Restrict

SELECT * FROM inverted_index
WHERE word LIKE 'sc%';

inverted_index.filter {
  case (word, _, _, _, _) =>
    word startsWith "sc"
}

The equivalent of a SELECT ... WHERE query in both SQL and Scala.
Functional Combinators

Projection

SELECT word FROM inverted_index;

inverted_index.map {
  case (word, _, _, _, _) =>
    word
}

Projecting just the fields we want.
Functional Combinators

Group By and Order By

SELECT count1, COUNT(*) AS size
FROM inverted_index
GROUP BY count1
ORDER BY size DESC;

inverted_index.groupBy {
  case (_, _, count1, _, _) => count1
}.map {
  case (count1, words) => (count1, words.size)
}.toSeq.sortBy {  // toSeq: a Map has no sortBy
  case (count, size) => -size
}

Group By: group by the frequency of occurrence for the first document, then order by the group size, descending.
Unification?
Can we unify SQL and Spark?
Spark Core +
Spark SQL +
Spark Streaming
https://spark.apache.org/sql/
https://spark.apache.org/streaming/
val sparkContext =
  new SparkContext("local[*]", "Much Wow!")
val streamingContext =
  new StreamingContext(
    sparkContext, Seconds(60))
val sqlContext =
  new SQLContext(sparkContext)
import sqlContext._

case class Flight(
  number:      Int,
  carrier:     String,
  origin:      String,
  destination: String,
  ...)

object Flight {
  def parse(str: String): Option[Flight] =
    {...}
}
Adapted from Typesafe's Spark Workshop exercises. Copyright (c) 2014, Typesafe. All Rights Reserved.
Create the SparkContext that the driver program uses, followed by a context object for streaming and another for the SQL extensions.
Note that the latter two take the SparkContext as an argument. The StreamingContext is constructed with an argument for the size of each batch of events to capture: every 60 seconds here.
For some APIs, Spark uses the idiom of importing the members of an instance.
import sqlContext._

case class Flight(
  number:      Int,
  carrier:     String,
  origin:      String,
  destination: String,
  ...)

object Flight {
  def parse(str: String): Option[Flight] =
    {...}
}

val server = ... // IP address or name
val port   = ... // integer
val dStream =
  streamingContext.socketTextStream(
    server, port)

val flights = for {
Define a case class to represent a schema, and a companion object to define a parse method for parsing a string into an instance of the class. Return an option in case a string can't be parsed.
In this case, we'll simulate a stream of data about airline flights, where the records contain the flight number, carrier, and the origin and destination airports, plus other data we'll ignore for this example, like times.
val server = ... // IP address or name
val port   = ... // integer
val dStream =
  streamingContext.socketTextStream(
    server, port)

val flights = for {
  line   <- dStream
  flight <- Flight.parse(line)
} yield flight

flights.foreachRDD { (rdd, time) =>
  rdd.registerTempTable("flights")
  sql(s"""
    SELECT $time, carrier, origin,
      destination, COUNT(*)
    FROM flights
    GROUP BY carrier, origin, destination
    ORDER BY c4 DESC
Read the data stream from a socket originating at a given server:port address.
Parse the text stream into Flight records.
flights.foreachRDD { (rdd, time) =>
  rdd.registerTempTable("flights")
  sql(s"""
    SELECT $time, carrier, origin,
      destination, COUNT(*)
    FROM flights
    GROUP BY carrier, origin, destination
    ORDER BY c4 DESC
    LIMIT 20""").foreach(println)
}

streamingContext.start()
streamingContext.awaitTermination()
streamingContext.stop()
A DStream is a collection of RDDs, so for each RDD (effectively, during each batch interval), invoke the anonymous function, which takes as arguments the RDD and the current timestamp (epoch milliseconds). Then we register the RDD as a "SQL" table named "flights" and run a query over it that groups by the carrier, origin, and destination, selects those fields, plus the hard-coded timestamp (i.e., "hardcoded" for each batch interval), and the count of records in the group. Also order by the count descending, and return only the first 20 records.
Start processing the stream, await termination, then close it all down.
We Won!
The title of this talk is in the present tense (present participle, to be precise?), but has Scala already won? Is the game over?
dean.wampler@typesafe.com
polyglotprogramming.com/talks
@deanwampler
See also the bonus slides that follow.
Bonus Slides
Scala for Mathematics

Spire and Algebird. ScalaZ also has some of these data structures and algorithms.
Algebird
Large-scale Analytics

https://github.com/twitter/algebird
Algebraic types like Monoids, which generalize addition:
– A set of elements.
– An associative binary operation.
– An identity element.
For big data, you're often willing to trade 100% accuracy for much faster approximations. Algebird implements many well-known approximation algorithms, all of which can be modeled generically using Monoids or similar data structures.
Efficient approximation algorithms.
– "Add All the Things", infoq.com/presentations/abstract-algebra-analytics
Hash, don't Sample!
-- Twitter
Sampling was a common way to deal with excessive amounts of data. The new mantra, exemplified by this catch phrase from Twitter's data teams, is to use approximation algorithms, where the data is usually hashed into space-efficient data structures. You make a space vs. accuracy trade-off. Often, approximate answers are good enough.
Spire
Fast Numerics

https://github.com/non/spire
• Types: Complex, Quaternion, Rational, Real, Interval, ...
• Algebraic types: Semigroups, Monoids, Groups, Rings, Fields, Vector Spaces, ...
• Trigonometric Functions.
• ...
It exploits macros with generics to implement efficient computation without requiring custom types for particular primitives.
What to Fix?

What could be improved?
Schema Management

Tuples limited to 22 fields.

Spark and other tools use Tuples or Case classes to define schemas. In 2.11, case classes are no longer limited to 22 fields, but it's not always convenient to define a case class when a tuple would do.
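For example (hypothetical fields), the two ways to express a record schema:

// A tuple schema: quick, but positional, and capped at 22 fields:
type FlightRow = (Int, String, String, String)

// A case class schema: named fields, at the cost of a definition
// (and, before Scala 2.11, the same 22-field cap):
case class FlightRecord(
  number: Int, carrier: String, origin: String, destination: String)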
Schema Management

Instantiate an object for each record?

Do we want to create an instance of an object for each record? We already have support for data formats like Parquet that implement columnar storage. Can our Scala APIs transparently use arrays of primitives for columns, for better performance? Spark has support for Parquet, which does this. Can we do it, and should we do it, for all data sets?
Schema Management

Remaining boxing/specialization limitations.

Speaking of primitives, we need to solve the remaining issues with specialization of collections for primitives. Miniboxing??
iPython Notebooks

Need an equivalent for Scala.

iPython Notebooks are very popular with data scientists, because they integrate data visualization, etc., with code. github.com/Bridgewater/scala-notebook is a start.

 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Apache drill
Apache drillApache drill
Apache drill
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
Viridians on Rails
Viridians on RailsViridians on Rails
Viridians on Rails
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 

Recently uploaded (20)

Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 

Why Scala Is Taking Over the Big Data World

to slave nodes and managed by the Node Managers. There are other services I’m omitting for simplicity.
  • 10. 10 Hadoop master Resource Mgr Name Node slave DiskDiskDiskDiskDisk Data Node Node Mgr slave DiskDiskDiskDiskDisk Data Node Node Mgr HDFS HDFS Saturday, January 10, 15 The schematic view of a Hadoop v2 cluster, with YARN (Yet Another Resource Negotiator) handling resource allocation and job scheduling. (V2 is actually circa 2013, but this detail is unimportant for this discussion.) The master services are federated for failover, normally (not shown), and there would usually be more than two slave nodes. The Name Node is the master for the Hadoop Distributed File System. Blocks are managed on each slave by Data Node services. The Resource Manager decomposes each job into tasks, which are distributed to slave nodes and managed by the Node Managers. There are other services I’m omitting for simplicity.
  • 11. 11 master Resource Mgr Name Node slave DiskDiskDiskDiskDisk Data Node Node Mgr slave DiskDiskDiskDiskDisk Data Node Node Mgr HDFS HDFS MapReduce Job MapReduce Job MapReduce Job Saturday, January 10, 15 You submit MapReduce jobs to the Resource Manager. Those jobs could be written in the Java API, or higher-level APIs like Cascading, Scalding, Pig, and Hive.
  • 12. MapReduce The classic compute model for Hadoop 12 Saturday, January 10, 15 Historically, up to 2013, MapReduce was the officially-supported compute engine for writing all compute jobs.
  • 13. 13 Example: Inverted Index inverse index block hadoop (.../hadoop,1) (.../hadoop,1),(.../hbase,1),(.../hive,1)hdfs (.../hive,1)hive (.../hbase,1),(.../hive,1)hbase ...... ...... block ...... block ...... block ...... (.../hadoop,1),(.../hive,1)and ...... wikipedia.org/hadoop Hadoop provides MapReduce and HDFS wikipedia.org/hbase HBase stores data in HDFS wikipedia.org/hive Hive queries HDFS files and HBase tables with SQL ... ... Saturday, January 10, 15
  • 14. 14 Example: Inverted Index wikipedia.org/hadoop Hadoop provides MapReduce and HDFS wikipedia.org/hbase HBase stores data in HDFS wikipedia.org/hive Hive queries HDFS files and HBase tables with SQL ... ... Web Crawl index block ...... Hadoop provides...wikipedia.org/hadoop ...... block ...... HBase stores...wikipedia.org/hbase ...... block ...... Hive queries...wikipedia.org/hive ...... inverse index block hadoop (.../hadoop,1) (.../hadoop,1),(.../hbase,1),(.../hive,1)hdfs (.../hive,1)hive (.../hbase,1),(.../hive,1)hbase ...... ...... block ...... block ...... block ...... (.../hadoop,1),(.../hive,1)and ...... Miracle!! Compute Inverted Index Saturday, January 10, 15
  • 15. 15 wikipedia.org/hadoop Hadoop provides MapReduce and HDFS wikipedia.org/hbase HBase stores data in HDFS wikipedia.org/hive Hive queries HDFS files and HBase tables with SQL ... ... Web Crawl index block ...... Hadoop provides...wikipedia.org/hadoop ...... block ...... HBase stores...wikipedia.org/hbase ...... block ...... Hive queries...wikipedia.org/hive ...... inv bl bl bl bl Miracle!! Compute Inverted Index Saturday, January 10, 15
  • 16. 16 ... Hadoop provides....org/hadoop ... ... HBase stores....org/hbase ... ... Hive queries....org/hive ... inverse index block hadoop (.../hadoop,1) (.../hadoop,1),(.../hbase,1),(.../hive,1)hdfs (.../hive,1)hive (.../hbase,1),(.../hive,1)hbase ...... ...... block ...... block ...... block ...... (.../hadoop,1),(.../hive,1)and ...... Miracle!! Compute Inverted Index Saturday, January 10, 15
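Before seeing how MapReduce fills in the “Miracle!!” box, here is the same inversion on plain, in-memory Scala collections, using the three documents from the diagram (a toy sketch that ignores scale entirely):

// Toy inversion: (path -> text) becomes (word -> (path, count) pairs).
val crawl = Seq(
  "wikipedia.org/hadoop" -> "Hadoop provides MapReduce and HDFS",
  "wikipedia.org/hbase"  -> "HBase stores data in HDFS",
  "wikipedia.org/hive"   -> "Hive queries HDFS files and HBase tables with SQL")

val inverted: Map[String, Seq[(String, Int)]] =
  crawl
    .flatMap { case (path, text) =>
      text.toLowerCase.split("""\W+""").map(word => (word, path))
    }
    .groupBy { case (word, _) => word }          // word -> Seq((word, path), ...)
    .map { case (word, pairs) =>
      word -> pairs.groupBy { case (_, path) => path }
                   .map { case (path, ps) => (path, ps.size) }
                   .toSeq
    }

// inverted("hdfs") contains ("wikipedia.org/hadoop", 1), ("wikipedia.org/hbase", 1),
// and ("wikipedia.org/hive", 1), matching the hdfs row in the diagram.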
  • 17. 17 1 Map step + 1 Reduce step Saturday, January 10, 15
  • 18. 18 1 Map step + 1 Reduce step Saturday, January 10, 15
  • 19. Problems Hard to implement algorithms... 19 Saturday, January 10, 15 Nontrivial algorithms are hard to convert to just map and reduce steps, even though you can sequence multiple map+reduce “jobs”. It takes specialized expertise in the tricks of the trade.
  • 20. Problems ... and the Hadoop API is horrible: 20 Saturday, January 10, 15 The  Hadoop  API  is  very  low  level  and  tedious...
  • 21. 21 import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); Saturday, January 10, 15 For example, the classic inverted index, used to convert an index of document locations (e.g., URLs) to words into the reverse: an index from words to doc locations. It’s the basis of search engines. I’m not going to explain the details. The point is to notice all the boilerplate that obscures the problem logic. Everything is in one outer class. We start with a main routine that sets up the job. I used yellow for method calls, because methods do the real work!! But notice that most of the functions in this code don’t really do a whole lot of work for us...
  • 22. 22 JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { Saturday, January 10, 15 Boilerplate
  • 23. 23 new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); Saturday, January 10, 15 main ends with a try-catch clause to run the job.
  • 24. 24 public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); Saturday, January 10, 15 This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
  • 25. 25 FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { Saturday, January 10, 15 The rest of the LineIndexMapper class and map method.
  • 26. 26 public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } Saturday, January 10, 15 The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
  • 27. 27 Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Saturday, January 10, 15 EOF
  • 28. 28 Altogether import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Saturday, January 10, 15 The  whole  shebang  (6pt.  font)  This  would  take  a  few  hours  to  write,  test,  etc.  assuming  you  already  know  the  API  and  the  idioms  for  using  it.
  • 29. Early 2012. 29 Saturday, January 10, 15 Let’s put all this into perspective, circa 2012... http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg
  • 30. 30 Dean Wampler “Trolling the Hadoop community since 2012...” Saturday, January 10, 15
  • 32. 32 In which I claimed that: Saturday, January 10, 15
  • 33. Salvation! Scalding 33 Saturday, January 10, 15 Twitter wrote a Scala API, https://github.com/twitter/scalding, to hide the mess. Actually, Scalding sits on top of Cascading (http://cascading.org), a higher-level Java API that exposes more sensible “combinators” of operations, but is still somewhat verbose due to the pre-Java 8 conventions it must use. Scalding gives us the full benefits of Scala syntax and functional operations, “combinators”.
  • 34. 34 MapReduce (Java) Cascading (Java) Scalding (Scala) Saturday, January 10, 15 Twitter wrote a Scala API, https://github.com/twitter/scalding, to hide the mess. Actually, Scalding sits on top of Cascading (http://cascading.org), a higher-level Java API that exposes more sensible “combinators” of operations, but is still somewhat verbose due to the pre-Java 8 conventions it must use. Scalding gives us the full benefits of Scala syntax and functional operations, “combinators”.
  • 35. 35 import com.twitter.scalding._ class InvertedIndex(args: Args) extends Job(args) { val texts = Tsv("texts.tsv", ('id, 'text)) val wordToIds = texts .flatMap(('id, 'text) -> ('word, 'id2)) { fields: (Long, String) => val (id2, text) = fields text.split("""\s+""").map { word => (word, id2) } } val invertedIndex = wordToIds.groupBy('word) { _.toList[Long]('id2 -> 'ids) } Saturday, January 10, 15 Dramatically smaller, succinct code! (https://github.com/echen/rosetta-scone/blob/master/inverted-index/InvertedIndex.scala) Note that this example assumes a slightly different input data format: more than one document per file, with each document id followed by the text, all on a single line, tab separated.
  • 36. 36 That’s it! .flatMap(('id, 'text) -> ('word, 'id2)) { fields: (Long, String) => val (id2, text) = fields text.split("""\s+""").map { word => (word, id2) } } val invertedIndex = wordToIds.groupBy('word) { _.toList[Long]('id2 -> 'ids) } invertedIndex.write(Tsv("output.tsv")) } Saturday, January 10, 15 Dramatically smaller, succinct code! (https://github.com/echen/rosetta-scone/blob/master/inverted-index/InvertedIndex.scala) Note that this example assumes a slightly different input data format: more than one document per file, with each document id followed by the text, all on a single line, tab separated.
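Reassembled from the two fragments above into a single listing (a sketch in the fields-based Scalding API of the era, based on the rosetta-scone example the notes cite):

import com.twitter.scalding._

// Inverted index in Scalding. Input: one document per line, "id<TAB>text";
// output: word followed by the list of document ids that contain it.
class InvertedIndex(args: Args) extends Job(args) {
  val texts = Tsv("texts.tsv", ('id, 'text))

  val wordToIds = texts
    .flatMap(('id, 'text) -> ('word, 'id2)) { fields: (Long, String) =>
      val (id2, text) = fields
      text.split("""\s+""").map { word => (word, id2) }
    }

  val invertedIndex = wordToIds.groupBy('word) { _.toList[Long]('id2 -> 'ids) }
  invertedIndex.write(Tsv("output.tsv"))
}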
  • 37. Problems MapReduce is “batch-mode” only 37 Saturday, January 10, 15 You can’t do event-stream processing (a.k.a. “real-time”) with MapReduce, only batch mode processing.
  • 38. 38 Storm! Event stream processing. (actually 2011) Saturday, January 10, 15 Storm is a popular framework for scalable, resilient, event-stream processing.
  • 39. 39 Storm! Twitter wrote Summingbird for Storm + Scalding Saturday, January 10, 15 Storm is a popular framework for scalable, resilient, event-stream processing. Twitter wrote a Scalding-like API called Summingbird (https://github.com/twitter/summingbird) that abstracts over Storm and Scalding, so you can write one program that can run in batch mode or process events. (For time’s sake, I won’t show an example.)
  • 40. Problems Flush to disk, then reread between jobs 40 100x perf. hit! Saturday, January 10, 15 While your algorithm may be implemented using a sequence of MR jobs (which takes specialized skills to write...), the runtime system doesn’t understand this, so the output of each job is flushed to disk (HDFS), even if it’s TBs of data. Then it is read back into memory as soon as the next job in the sequence starts! This problem plagues Scalding (and Cascading), too, since they run on top of MapReduce. However, as of mid-2014, Cascading is being ported to a new, faster runtime called Apache Tez, and it might be ported to Spark, which we’ll discuss next. Twitter is working on its own optimizations within Scalding. So the perf. issues should go away by the end of 2014.
  • 41. It’s 2013. 41 Saturday, January 10, 15 Let’s put all this into perspective, circa 2013... http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg
  • 42. Salvation v2.0! Use Spark 42 Saturday, January 10, 15 The Hadoop community has realized over the last several years that a replacement for MapReduce is needed. While MR has served the community well, it’s a decade old and shows clear limitations and problems, as we’ve seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors followed.
  • 43. 43 import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } Saturday, January 10, 15 This implementation is more sophisticated than the Scalding example. It also computes the count/document of each word. Hence, there are more steps (some of which could be merged). It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
  • 44. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } 44 Saturday, January 10, 15 This implementation is more sophisticated than the Scalding example. It also computes the count/document of each word. Hence, there are more steps (some of which could be merged). It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
  • 45. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } 45 Saturday, January 10, 15 You begin the workflow by declaring a SparkContext (in “local” mode, in this case). The rest of the program is a sequence of function calls, analogous to “pipes” we connect together to perform the data flow.
  • 46. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } 46 Saturday, January 10, 15 Next we read one or more text files. If “data/crawl” has 1 or more Hadoop-style “part-NNNNN” files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
  • 47. sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .groupBy { case ((w, p), n) => w } 47 Saturday, January 10, 15 Now we begin a sequence of transformations on the input data. First, we map over each line, a string, to extract the original document id (i.e., file name, UUID), followed by the text in the document, all on one line. We assume tab is the separator. “(array(0), array(1))” returns a two-element “tuple”. Think of the output RDD as having a schema “String fileName, String text”. flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final “key”) and the path. Each line is converted to a collection of (word,path) pairs, so flatMap converts the collection of collections into one long “flat” collection of (word,path) pairs.
  • 48. sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .groupBy { case ((w, p), n) => w } 48 Saturday, January 10, 15 Next, flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final “key”) and the path. Each line is converted to a collection of (word,path) pairs, so flatMap converts the collection of collections into one long “flat” collection of (word,path) pairs.
  • 49. sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .groupBy { case ((w, p), n) => w } 49 Saturday, January 10, 15 Then we map over these pairs and add a single “seed” count of 1.
  • 50. } .reduceByKey { (n1, n2) => n1 + n2 } .groupBy { case ((w, p), n) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) }.sortBy { case (path, n) => (-n, path) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } 50 ((word1,  path1),  n1) ((word2,  path2),  n2) ... Saturday, January 10, 15 reduceByKey  does  an  implicit  “group  by”  to  bring  together  all  occurrences  of  the  same  (word,  path)  and  then  sums  up  their  counts.
  • 51. } .reduceByKey { (n1, n2) => n1 + n2 } .groupBy { case ((w, p), n) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) }.sortBy { case (path, n) => (-n, path) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } 51 (word, Seq((word, (path1, n1)), (word, (path2, n2)), ...)) ... Saturday, January 10, 15 Now we do an explicit group by to bring all the same words together. The output will be (word, Seq((word, (path1, n1)), (word, (path2, n2)), ...)), where Seq is a Scala abstraction for sequences, e.g., Lists, Vectors, etc.
  • 52. case ((w, p), n) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) }.sortBy { case (path, n) => (-n, path) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } 52 (word, “(path4, n4), (path3, n3), (path2, n2), ...”) ... Saturday, January 10, 15 Back to the code, the last map step looks complicated, but all it does is sort the sequence by the count descending (because you would want the first elements in the list to be the documents that mention the word most frequently), then it converts the sequence into a string.
  • 53. case ((w, p), n) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) }.sortBy { case (path, n) => (-n, path) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } 53 Saturday, January 10, 15 We  finish  by  saving  the  output  as  text  file(s)  and  stopping  the  workflow.  
  • 54. 54 import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } Altogether Saturday, January 10, 15 The whole shebang (14pt. font, this time).
  • 55. 55 text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) Powerful, beautiful combinators Saturday, January 10, 15 Stop for a second and admire the simplicity and elegance of this code, even if you don’t understand the details. This is what coding should be, IMHO: very concise, to the point, elegant to read. Hence, a highly-productive way to work!!
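For reference, here is a self-contained variant of the flattened listing above (a sketch against the Spark 1.x API of the talk; the elided argz argument parsing is replaced by a hard-coded "output" path, and groupByKey replaces the groupBy over pairs, which is equivalent here):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Inverted Index")
    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)           // line = "path<TAB>text"
        (array(0), array(1))
      }
      .flatMap { case (path, text) =>
        text.split("""\W+""").map(word => (word, path))
      }
      .map { case (w, p) => ((w, p), 1) }         // seed each (word, path) with 1
      .reduceByKey(_ + _)                         // count occurrences per (word, path)
      .map { case ((w, p), n) => (w, (p, n)) }    // re-key by word alone
      .groupByKey()                               // word -> all its (path, count) pairs
      .map { case (w, pairs) =>
        // Most frequently mentioning documents first, ties broken by path.
        val sorted = pairs.toSeq.sortBy { case (path, n) => (-n, path) }
        (w, sorted.mkString(", "))
      }
      .saveAsTextFile("output")
    sc.stop()
  }
}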
  • 56. 56 Saturday, January 10, 15 Another example of a beautiful and profound DSL, in this case from the world of my first profession, Physics: Maxwell’s equations that unified Electricity and Magnetism: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
  • 57. 57 Spark also has a streaming mode for “mini-batch” event handling. (We’ll come back to it...) Saturday, January 10, 15 Spark Streaming operates on data “slices” of granularity as small as 0.5-1 second. Not quite the same as single event handling, but possibly all that’s needed for ~90% (?) of scenarios.
  • 58. Winning! 58 Saturday, January 10, 15 Let’s recap why Scala is taking over the big data world.
  • 59. 59 Elegant DSLs ... .map { case (w, p) => ((w, p), 1) } .reduceByKey { (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } ... Saturday, January 10, 15 You  would  be  hard  pressed  to  find  a  more  concise  DSL  for  big  data!!
  • 60. 60 The JVM Saturday, January 10, 15 You have the rich Java ecosystem at your fingertips.
  • 61. 61 Functional Combinators CREATE TABLE inverted_index ( word CHARACTER(64), id1 INTEGER, count1 INTEGER, id2 INTEGER, count2 INTEGER); val inverted_index: Stream[(String,Int,Int,Int,Int)] SQL Analogs Saturday, January 10, 15 You have functional “combinators”: side-effect-free functions that combine/compose together to create complex algorithms with minimal effort. For simplicity, assume we only keep the two documents where the word appears most frequently, along with the counts in each doc, and we’ll assume integer ids for the documents. We’ll model the same data set in Scala with a Stream, because we’re going to process it in “batch”.
  • 62. 62 Functional Combinators Restrict SELECT * FROM inverted_index WHERE word LIKE 'sc%'; inverted_index.filter { case (word, _, _, _, _) => word startsWith "sc" } Saturday, January 10, 15 The equivalent of a SELECT ... WHERE query in both SQL and Scala.
  • 63. 63 Functional Combinators Projection SELECT word FROM inverted_index; inverted_index.map { case (word, _, _, _, _) => word } Saturday, January 10, 15 Projecting just the fields we want.
  • 64. 64 Functional Combinators Group By and Order By SELECT count1, COUNT(*) AS size FROM inverted_index GROUP BY count1 ORDER BY size DESC; inverted_index.groupBy { case (_, _, count1, _, _) => count1 }.toSeq.map { case (count1, words) => (count1, words.size) }.sortBy { case (count, size) => -size } Saturday, January 10, 15 Group By: group by the frequency of occurrence for the first document, then order by the group size, descending.
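To make the three analogs concrete, here they are run over a tiny in-memory table (the sample rows are invented for illustration):

// A tiny in-memory stand-in for the inverted_index table:
// (word, id1, count1, id2, count2)
val inverted_index = Stream(
  ("scala",    1, 5, 2, 3),
  ("scalding", 3, 2, 1, 1),
  ("spark",    2, 5, 3, 2))

// SELECT * FROM inverted_index WHERE word LIKE 'sc%';
val restricted = inverted_index.filter { case (word, _, _, _, _) => word startsWith "sc" }

// SELECT word FROM inverted_index;
val projected = inverted_index.map { case (word, _, _, _, _) => word }

// SELECT count1, COUNT(*) AS size FROM inverted_index
// GROUP BY count1 ORDER BY size DESC;
val grouped = inverted_index
  .groupBy { case (_, _, count1, _, _) => count1 }  // Map[Int, Stream[...]]
  .toSeq                                            // so the groups can be sorted
  .map    { case (count1, words) => (count1, words.size) }
  .sortBy { case (_, size) => -size }
// grouped == Seq((5, 2), (2, 1)): count1=5 appears twice, count1=2 once.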
  • 65. Unification? 65 Saturday, January 10, 15 Can we unify SQL and Spark?
  • 66. Spark Core + Spark SQL + Spark Streaming 66 Saturday, January 10, 15 https://spark.apache.org/sql/ https://spark.apache.org/streaming/
  • 67. 67 val sparkContext = new SparkContext("local[*]", "Much Wow!") val streamingContext = new StreamingContext( sparkContext, Seconds(60)) val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Flight( number: Int, carrier: String, origin: String, destination: String, ...) object Flight { def parse(str: String): Option[Flight]= {...} } Saturday, January 10, 15 Adapted  from  Typesafe's  Spark  Workshop  exercises.  Copyright  (c)  2014,  Typesafe.  All  Rights  Reserved.
  • 68. val sparkContext = new SparkContext("local", "connections") val streamingContext = new StreamingContext( sparkContext, Seconds(60)) val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Flight( number: Int, carrier: String, origin: String, destination: String, ...) object Flight { def parse(str: String): Option[Flight]= {...} } 68 Saturday, January 10, 15 Create the SparkContext for the driver program, followed by a context object for streaming and another for the SQL extensions. Note that the latter two take the SparkContext as an argument. The StreamingContext is constructed with an argument for the size of each batch of events to capture, every 60 seconds here.
  • 69. val sparkContext = new SparkContext("local", "connections") val streamingContext = new StreamingContext( sparkContext, Seconds(60)) val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Flight( number: Int, carrier: String, origin: String, destination: String, ...) object Flight { def parse(str: String): Option[Flight]= {...} } 69 Saturday, January 10, 15 For some APIs, Spark uses the idiom of importing the members of an instance.
  • 70. 70 import sqlContext._ case class Flight( number: Int, carrier: String, origin: String, destination: String, ...) object Flight { def parse(str: String): Option[Flight]= {...} } val server = ... // IP address or name val port = ... // integer val dStream = streamingContext.socketTextStream( server, port) val flights = for { Saturday, January 10, 15 Define a case class to represent a schema, and a companion object to define a parse method for parsing a string into an instance of the class. Return an Option in case a string can’t be parsed. In this case, we’ll simulate a stream of data about airline flights, where the records contain the flight number, carrier, and the origin and destination airports, plus other data we’ll ignore for this example, like times.
  • 71. } val server = ... // IP address or name val port = ... // integer val dStream = streamingContext.socketTextStream( server, port) val flights = for { line <- dStream flight <- Flight.parse(line) } yield flight flights.foreachRDD { (rdd, time) => rdd.registerTempTable("flights") sql(s""" SELECT $time, carrier, origin, destination, COUNT(*) FROM flights GROUP BY carrier, origin, destination ORDER BY c4 DESC 71 Saturday, January 10, 15 Read the data stream from a socket originating at a given server:port address.
  • 72. } val server = ... // IP address or name val port = ... // integer val dStream = streamingContext.socketTextStream( server, port) val flights = for { line <- dStream flight <- Flight.parse(line) } yield flight flights.foreachRDD { (rdd, time) => rdd.registerTempTable("flights") sql(s""" SELECT $time, carrier, origin, destination, COUNT(*) FROM flights GROUP BY carrier, origin, destination ORDER BY c4 DESC 72 Saturday, January 10, 15 Parse  the  text  stream  into  Flight  records.
  • 73. flights.foreachRDD { (rdd, time) => rdd.registerTempTable("flights") sql(s""" SELECT $time, carrier, origin, destination, COUNT(*) FROM flights GROUP BY carrier, origin, destination ORDER BY c4 DESC LIMIT 20""").foreach(println) } streamingContext.start() streamingContext.awaitTermination() streamingContext.stop() 73 Saturday, January 10, 15 A DStream is a collection of RDDs, so for each RDD (effectively, during each batch interval), invoke the anonymous function, which takes as arguments the RDD and the current timestamp (epoch milliseconds). Then we register the RDD as a “SQL” table named “flights” and run a query over it that groups by the carrier, origin, and destination, selects those fields plus the hard-coded timestamp (i.e., “hard-coded” for each batch interval) and the count of records in the group. Also order by the count descending, and return only the first 20 records.
  • 74. 74 flights.foreachRDD { (rdd, time) => rdd.registerTempTable("flights") sql(s""" SELECT $time, carrier, origin, destination, COUNT(*) FROM flights GROUP BY carrier, origin, destination ORDER BY c4 DESC LIMIT 20""").foreach(println) } streamingContext.start() streamingContext.awaitTermination() streamingContext.stop() Saturday, January 10, 15 Start processing the stream, await termination, then close it all down.
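For reference, here is a condensed, self-contained version of that walkthrough (a sketch, not the talk’s exact code: the Flight fields beyond the four shown, the parse implementation, and the localhost:9999 address are assumptions filled in for illustration; the talk elides them):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Only the four fields the slides show; the real deck adds more ("...").
case class Flight(number: Int, carrier: String, origin: String, destination: String)

object Flight {
  // Hypothetical parser for "number,carrier,origin,destination" lines.
  def parse(str: String): Option[Flight] = str.split(",") match {
    case Array(n, c, o, d) =>
      scala.util.Try(Flight(n.trim.toInt, c.trim, o.trim, d.trim)).toOption
    case _ => None
  }
}

object FlightCounts {
  def main(args: Array[String]): Unit = {
    val sparkContext     = new SparkContext("local[*]", "Flight Counts")
    val streamingContext = new StreamingContext(sparkContext, Seconds(60))
    val sqlContext       = new SQLContext(sparkContext)
    import sqlContext._   // brings in the RDD-to-table conversions

    // Each 60-second batch of lines from the socket becomes an RDD of Flights.
    val dStream = streamingContext.socketTextStream("localhost", 9999)
    val flights = for {
      line   <- dStream
      flight <- Flight.parse(line)
    } yield flight

    // For each batch: register it as a table and run the aggregation query.
    // "c4" is the name Spark SQL gives the COUNT(*) column here, per the talk.
    flights.foreachRDD { (rdd, time) =>
      rdd.registerTempTable("flights")
      sql(s"""
        SELECT $time, carrier, origin, destination, COUNT(*)
        FROM flights
        GROUP BY carrier, origin, destination
        ORDER BY c4 DESC LIMIT 20""").foreach(println)
    }

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}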
  • 75. 75 We Won! Saturday, January 10, 15 The title of this talk is in the present tense (present participle to be precise?), but has Scala already won? Is the game over?
  • 78. Scala for Mathematics 78 Saturday, January 10, 15 Spire and Algebird. Scalaz also has some of these data structures and algorithms.
  • 79. Algebird Large-scale Analytics 79 Saturday, January 10, 15 https://github.com/twitter/algebird
  • 80. Algebraic types like Monoids, which generalize addition. –A set of elements. –An associative binary operation. –An identity element. 80 Saturday, January 10, 15 For big data, you’re often willing to trade 100% accuracy for much faster approximations. Algebird implements many well-known approx. algorithms, all of which can be modeled generically using Monoids or similar data structures.
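That three-part definition is short enough to sketch directly in Scala (illustrative names; Algebird’s own Monoid trait has essentially this shape, with zero and plus):

// The Monoid abstraction from the slide: a set of elements (T),
// an associative binary operation (plus), and an identity element (zero).
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object IntAddition extends Monoid[Int] {
  def zero = 0
  def plus(a: Int, b: Int) = a + b
}

// Associativity is what lets a big data engine split this fold across
// machines and combine the partial results in any grouping.
def combineAll[T](xs: Seq[T], m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

// combineAll(Seq(1, 2, 3, 4), IntAddition) == 10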
  • 81. Efficient approximation algorithms. – “Add All the Things”, infoq.com/presentations/ abstract-algebra-analytics 81 Saturday, January 10, 15 For big data, you’re often willing to trade 100% accuracy for much faster approximations. Algebird implements many well-known approx. algorithms, all of which can be modeled generically using Monoids or similar data structures.
  • 82. Hash, don’t Sample! -- Twitter 82 Saturday, January 10, 15 Sampling was a common way to deal with excessive amounts of data. The new mantra, exemplified by this catch phrase from Twitter’s data teams, is to use approximation algorithms, where the data is usually hashed into space-efficient data structures. You make a space vs. accuracy trade-off. Often, approximate answers are good enough.
  • 83. Spire Fast Numerics 83 Saturday, January 10, 15 https://github.com/non/spire
  • 84. •Types: Complex, Quaternion, Rational, Real, Interval, ... •Algebraic types: Semigroups, Monoids, Groups, Rings, Fields, Vector Spaces, ... •Trigonometric Functions. •... 84 Saturday, January 10, 15 It exploits macros with generics to implement efficient computation without requiring custom types for particular primitives.
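Spire encodes these algebraic structures as type classes, so numeric code is written once for any suitable element type. The standard library’s Numeric gives a small taste of the pattern (a plain-Scala sketch, not Spire’s actual API):

// Generic summation over any type with a Numeric instance: the same code
// works for Int, Double, BigDecimal, etc. Spire generalizes this idea to
// Semigroups, Rings, Fields, and so on, using macros to avoid boxing costs.
def sumAll[A](xs: Seq[A])(implicit num: Numeric[A]): A =
  xs.foldLeft(num.zero)(num.plus)

// sumAll(Seq(1, 2, 3))  == 6
// sumAll(Seq(1.5, 2.5)) == 4.0
// sumAll(Seq(BigDecimal("0.1"), BigDecimal("0.2"))) == BigDecimal("0.3")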
  • 85. 85 What to Fix? Saturday, January 10, 15 What could be improved?
  • 86. Schema Management Tuples limited to 22 fields. 86 Saturday, January 10, 15 Spark  and  other  tools  use  Tuples  or  Case  classes  to  define  schemas.  In  2.11,  case  classes  are  no  longer  limited  to  22  fields,  but  it’s  not  always  convenient  to  define  a  case  class  when  a  tuple  would  do.
  • 87. Schema Management Instantiate an object for each record? 87 Saturday, January 10, 15 Do we want to create an instance of an object for each record? We already have support for data formats like Parquet that implement columnar storage. Can our Scala APIs transparently use arrays of primitives for columns, for better performance? Spark has support for Parquet which does this. Can we do it, and should we do it, for all data sets?
  • 88. Schema Management Remaining boxing/ specialization limitations 88 Saturday, January 10, 15 Speaking of primitives, we need to solve the remaining issues with specialization of collections for primitives. Miniboxing??
  • 89. iPython Notebooks Need an equivalent for Scala 89 Saturday, January 10, 15 iPython Notebooks are very popular with data scientists, because they integrate data visualization, etc. with code. github.com/Bridgewater/scala-notebook is a start.