Spark and Shark
High-Speed In-Memory Analytics over Hadoop and Hive Data

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin

UC Berkeley
spark-project.org
  
What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
 » In-memory data storage for very fast iterative queries
 » General execution graphs and powerful optimizations
 » Up to 40x faster than Hadoop

Compatible with Hadoop’s storage APIs
 » Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
  
What is Shark?

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x
  
Project History

Spark project started in 2009, open sourced 2010

Shark started summer 2011, alpha April 2012

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others

200+ member meetup, 500+ watchers on GitHub
  
This Talk

Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark
  
Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
  » More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
  » More interactive ad-hoc queries

Both multi-stage and interactive apps require faster data sharing across parallel jobs
  
Data Sharing in MapReduce

[Figure: iterative jobs: Input -> HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> . . .; interactive queries: Input -> HDFS read -> query 1, query 2, query 3, . . . -> result 1, result 2, result 3, . . .]

Slow due to replication, serialization, and disk IO
  
Data Sharing in Spark

[Figure: iterative jobs: Input -> iter. 1 -> iter. 2 -> . . .; interactive queries: Input -> one-time processing -> distributed memory -> query 1, query 2, query 3, . . .]

10-100× faster than network and disk
  
Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
  » Distributed collections of objects that can be cached in memory across cluster nodes
  » Manipulated through various parallel operators
  » Automatically rebuilt on failure

Interface
  » Clean language-integrated API in Scala
  » Can be used interactively from Scala console
  
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")             // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count           // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the Driver sends tasks to three Workers and collects results; each Worker reads one block (Block 1, 2, 3) and caches its part of the data (Cache 1, 2, 3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

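The log-mining pipeline can be tried without a cluster. The following is a minimal sketch, assuming a plain Scala `Seq[String]` stands in for the RDD and that every log line has at least three tab-separated fields. `LogMiningSketch`, `sampleLog`, and `countMatching` are hypothetical names chosen for illustration and are not part of Spark.

```scala
object LogMiningSketch {
  // Hypothetical sample log: level, date, message (tab-separated).
  val sampleLog: Seq[String] = Seq(
    "ERROR\t2012-06-01\tfoo failed to start",
    "INFO\t2012-06-01\tfoo started",
    "ERROR\t2012-06-02\tbar timed out",
    "ERROR\t2012-06-03\tfoo and bar both failed"
  )

  // Same steps as the slide: keep the error lines, then extract the third field.
  // A local Seq already lives in memory, so there is no cache() call here.
  val messages: Seq[String] =
    sampleLog.filter(_.startsWith("ERROR")).map(_.split('\t')(2))

  // Local counterpart of cachedMsgs.filter(_.contains(pattern)).count
  def countMatching(pattern: String): Int =
    messages.count(_.contains(pattern))
}
```

Each call to `LogMiningSketch.countMatching` reuses the already extracted `messages`, which mirrors how the repeated queries on the slide reuse the cached RDD instead of re-reading the log.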
  
Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))

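The lineage idea can be illustrated in a few lines of plain Scala. This is a minimal sketch under stated assumptions, not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, and `Cached` are hypothetical types, the data is a local `Seq`, and "loss" is simulated by dropping the in-memory copy.

```scala
object LineageSketch {
  // A dataset is described by how to compute it, not by its contents.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: re-reads the stable source on every call.
  final case class Source[A](read: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = read()
  }

  // Each derived dataset remembers its parent and the function applied to it.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // In-memory copy that can be lost; the lineage is replayed to rebuild it.
  final class Cached[A](lineage: Dataset[A]) {
    private var data: Option[Seq[A]] = None
    var recomputations: Int = 0

    def get(): Seq[A] = data.getOrElse {
      recomputations += 1
      val rebuilt = lineage.compute()
      data = Some(rebuilt)
      rebuilt
    }

    def simulateLoss(): Unit = { data = None }
  }
}
```

Calling `get()` after `simulateLoss()` replays the filter and map steps from the source, which corresponds to the HadoopRDD, FilteredRDD, MappedRDD chain shown on the slide.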
  
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // Load data in memory once

var w = Vector.random(D)                                // Initial parameter vector

for (i <- 1 to ITERATIONS) {
  // Repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

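The same gradient-descent loop can be written against local collections to see the map and reduce steps in isolation. This is a minimal sketch, assuming labels `y` in {-1.0, +1.0} and using Scala's immutable `Vector[Double]` in place of the slide's `Vector` class; `Point`, `dot`, `add`, `scale`, `gradient`, and `train` are hypothetical helpers introduced here.

```scala
object LogisticRegressionSketch {
  import scala.math.exp

  // Hypothetical point type: features x and a label y in {-1.0, +1.0}.
  final case class Point(x: Vector[Double], y: Double)

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (u, v) => u + v }

  def scale(c: Double, a: Vector[Double]): Vector[Double] = a.map(_ * c)

  // One pass over the data: the map computes each point's gradient term and
  // the reduce sums the terms, as in the slide's data.map(...).reduce(_ + _).
  def gradient(data: Seq[Point], w: Vector[Double]): Vector[Double] =
    data
      .map(p => scale((1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y, p.x))
      .reduce((a, b) => add(a, b))

  // Repeats the step w -= gradient. The starting vector is passed in
  // (the slide uses a random one) so that runs are reproducible.
  def train(data: Seq[Point], w0: Vector[Double], iterations: Int): Vector[Double] =
    (1 to iterations).foldLeft(w0) { (w, _) =>
      add(w, scale(-1.0, gradient(data, w)))
    }
}
```

In Spark the `data` collection would be a cached RDD, so only the first pass pays the cost of loading it and each later iteration runs over memory.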
Logistic Regression Performance

[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
  
Supported Operators

map              reduce        sample
filter           count         cogroup
groupBy          reduceByKey   take
sort             groupByKey    partitionBy
join             first         pipe
leftOuterJoin    union         save
rightOuterJoin   cross         ...

Other Engine Features

General graphs of operators (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop’s sort)
Controlled data partitioning to lower communication

PageRank Performance
[Figure: iteration time (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23]

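The difference between a hash-based and a sort-based reduce can be sketched on a local list of key-value pairs. This is an illustration of the general idea only, not Spark's or Hadoop's implementation; `reduceByKeyHash` and `reduceByKeySort` are hypothetical names.

```scala
object HashReduceSketch {
  import scala.collection.mutable

  // Hash-based: one pass, merging each value into a hash table keyed by K.
  def reduceByKeyHash[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] = {
    val table = mutable.HashMap.empty[K, V]
    for ((k, v) <- pairs) {
      table(k) = table.get(k) match {
        case Some(acc) => combine(acc, v)
        case None      => v
      }
    }
    table.toMap
  }

  // Sort-based: sort by key first, then merge runs of equal adjacent keys.
  def reduceByKeySort[K: Ordering, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] = {
    val sorted = pairs.sortBy(_._1)
    sorted.foldLeft(List.empty[(K, V)]) {
      case ((k0, acc) :: rest, (k, v)) if k0 == k => (k, combine(acc, v)) :: rest
      case (done, kv)                             => kv :: done
    }.toMap
  }
}
```

Both functions return the same result; the hash-based version skips the sort and only needs key equality and hashing, while the sort-based version additionally needs an `Ordering` on the keys.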
  
Spark Users

User Applications

In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
  
Conviva GeoReport

[Figure: time (hours): Hive 20, Spark 0.5]

Group aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated reading, deserialization and filtering
  
Mobile Millennium Project

Estimate city traffic from crowdsourced GPS data

Iterative EM algorithm scaling to 160 nodes

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
  
Shark: Hive on Spark

Motivation

Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?
  
Hive Architecture

[Figure: Client (CLI, JDBC) -> Driver (SQL Parser -> Query Optimizer -> Physical Plan -> Execution), with the Meta store alongside; execution runs on MapReduce over HDFS]
  
Shark Architecture

[Figure: Client (CLI, JDBC) -> Driver (SQL Parser -> Query Optimizer -> Physical Plan -> Execution) plus Cache Mgr., with the Meta store alongside; execution runs on Spark over HDFS]

[Engle et al, SIGMOD 2012]
  
Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage            Column Storage
1   john    4.1        1      2      3
2   mike    3.5        john   mike   sally
3   sally   6.4        4.1    3.5    6.4

Benefit: similarly compact size to serialized data, but >5x faster to access

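The two layouts in the table can be written down directly in Scala. This is a minimal sketch of the general row versus column idea and not Shark's actual storage code; `Row`, `Columns`, `toColumns`, and `averageScore` are hypothetical names, and the three columns follow the example table (an integer id, a name, and a floating-point value).

```scala
object ColumnStoreSketch {
  // Row-oriented: one object per record.
  final case class Row(id: Int, name: String, score: Double)

  // Column-oriented: one array per column; numeric columns are primitive arrays.
  final class Columns(val ids: Array[Int], val names: Array[String], val scores: Array[Double]) {
    def size: Int = ids.length
    // Reassemble a single record on demand.
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))
  }

  def toColumns(rows: Seq[Row]): Columns =
    new Columns(rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.score).toArray)

  // A scan over one column touches only that column's array.
  def averageScore(c: Columns): Double =
    if (c.size == 0) 0.0 else c.scores.sum / c.size
}
```

With the column layout, three records cost three arrays instead of three separate record objects, and an aggregate such as `averageScore` reads a single `Array[Double]`.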
  
Using Shark

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
  » A few esoteric features are not yet supported

Can also call from Scala to mix with Spark

Early alpha release at shark.cs.berkeley.edu
  
Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

[Figure: execution time (secs)]
Shark (cached)    12s
Shark            182s
Hive             207s

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

[Figure: execution time (secs)]
Shark (cached)   126s
Shark            270s
Hive             447s

Demo

What’s Next?

Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
  » Track and update state in memory as events arrive
  » Large-scale reporting, click analysis, spam filtering, etc.
  
Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow(5, _ + _)

[Figure: at each interval (T=1, T=2, …) a map step feeds reduceByWindow]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

Alpha coming this summer

[Zaharia et al, HotCloud 2012]

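The "series of small batch jobs" model can be sketched with local collections. This is a minimal illustration under stated assumptions and not the Streaming Spark API: each inner `Seq[String]` stands for the tweets received in one interval, words are split on whitespace, and the window is measured in number of batches. `countBatch`, `merge`, and `windowedCounts` are hypothetical names.

```scala
object WindowedCountSketch {
  type Counts = Map[String, Int]

  // Batch job for one interval: split tweets into words and count them.
  def countBatch(tweets: Seq[String]): Counts =
    tweets
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Merge two count tables by adding the counts of shared words.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (word, n)) => acc.updated(word, acc.getOrElse(word, 0) + n) }

  // For every interval, combine the counts of the last `window` batches.
  def windowedCounts(batches: Seq[Seq[String]], window: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch)
    perBatch.indices.map { t =>
      perBatch.slice(math.max(0, t - window + 1), t + 1).foldLeft(Map.empty[String, Int])(merge)
    }
  }
}
```

Each element of the result is the word count over the most recent `window` intervals, so words from batches that have slid out of the window no longer appear.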
  
Conclusion

Spark and Shark speed up your interactive and complex analytics on Hadoop data

Download and docs: www.spark-project.org
  » Easy to run locally, on EC2, or on Mesos and soon YARN

User meetup: meetup.com/spark-users

Training camp at Berkeley in August!

matei@berkeley.edu / @matei_zaharia
  
Behavior with Not Enough RAM

[Figure: iteration time (s) vs. % of working set in memory]
Cache disabled   68.8
25%              58.1
50%              40.7
75%              29.7
Fully cached     11.5
  
Software Stack

[Figure: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, … run on top of Spark, which runs in local mode, on EC2, on Apache Mesos, or on YARN]
  
Thank You!


             Page 37

More Related Content

What's hot

BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsSatya Narayan
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsCheng Lian
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanTaro L. Saito
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 

What's hot (20)

BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Viewers also liked

Big Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend StoryBig Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend StoryAmazon Web Services
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­TimeSeven Nguyen
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
How we solved Real-time User Segmentation using HBase
How we solved Real-time User Segmentation using HBaseHow we solved Real-time User Segmentation using HBase
How we solved Real-time User Segmentation using HBaseDataWorks Summit
 
Twitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in EchtzeitTwitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in EchtzeitGuido Schmutz
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBGeoffrey Anderson
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013Nathan Bijnens
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaDataStax Academy
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibpumaranikar
 

Viewers also liked (20)

Big Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend StoryBig Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend Story
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­Time
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
How we solved Real-time User Segmentation using HBase
How we solved Real-time User Segmentation using HBaseHow we solved Real-time User Segmentation using HBase
How we solved Real-time User Segmentation using HBase
 
Intro to Pig UDF
Intro to Pig UDFIntro to Pig UDF
Intro to Pig UDF
 
Twitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in EchtzeitTwitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in Echtzeit
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Spark and shark

  • 1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
    Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
    UC Berkeley
    spark-project.org
  • 2. What is Spark?
    Not a modified version of Hadoop
    Separate, fast, MapReduce-like engine
     » In-memory data storage for very fast iterative queries
     » General execution graphs and powerful optimizations
     » Up to 40x faster than Hadoop
    Compatible with Hadoop’s storage APIs
     » Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
  • 3. What is Shark?
    Port of Apache Hive to run on Spark
    Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
    Similar speedups of up to 40x
  • 4. Project History
    Spark project started in 2009, open sourced 2010
    Shark started summer 2011, alpha April 2012
    In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
    200+ member meetup, 500+ watchers on GitHub
  • 5. This Talk
    Spark programming model
    User applications
    Shark overview
    Demo
    Next major addition: Streaming Spark
  • 6. Why a New Programming Model?
    MapReduce greatly simplified big data analysis
    But as soon as it got popular, users wanted more:
     » More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
     » More interactive ad-hoc queries
    Both multi-stage and interactive apps require faster data sharing across parallel jobs
  • 7. Data Sharing in MapReduce
    [Diagram: each iteration (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS; each query (query 1, query 2, query 3, ...) reads the input from HDFS again to produce its result]
    Slow due to replication, serialization, and disk IO
  • 8. Data Sharing in Spark
    [Diagram: iterations (iter. 1, iter. 2, ...) pass data through distributed memory; after one-time processing of the input, query 1, query 2, query 3, ... read from distributed memory]
    10-100× faster than network and disk
  • 9. Spark Programming Model
    Key idea: resilient distributed datasets (RDDs)
     » Distributed collections of objects that can be cached in memory across cluster nodes
     » Manipulated through various parallel operators
     » Automatically rebuilt on failure
    Interface
     » Clean language-integrated API in Scala
     » Can be used interactively from Scala console
  • 10. Example: Log Mining
    Load error messages from a log into memory, then interactively search for various patterns
        val lines = spark.textFile("hdfs://...")                  // base RDD
        val errors = lines.filter(_.startsWith("ERROR"))          // transformed RDD
        val messages = errors.map(_.split('\t')(2))
        val cachedMsgs = messages.cache()
        cachedMsgs.filter(_.contains("foo")).count                // action
        cachedMsgs.filter(_.contains("bar")).count
        . . .
    [Diagram: the Driver sends tasks to three Workers and receives results; each Worker reads one block (Block 1-3) and keeps its part of the data in a cache (Cache 1-3)]
    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
    Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
  • 11. Fault Tolerance
    RDDs track the series of transformations used to build them (their lineage) to recompute lost data
    E.g.:
        messages = textFile(...).filter(_.contains("error"))
                                .map(_.split('\t')(2))
    Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
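The lineage idea on slide 11 can be illustrated without a cluster. The following is a minimal sketch using plain Scala collections, not Spark's actual classes: `Dataset`, `FromStorage`, `Filtered`, and `Mapped` are hypothetical names invented here, and the in-memory `log` stands in for a file on HDFS. Each derived dataset stores only its parent and a function, so its contents can be recomputed at any time.

```scala
// Toy model of lineage-based recovery; these are NOT Spark's classes.
// Each dataset remembers its parent and the function that derives it,
// so lost contents can be recomputed instead of restored from a replica.
trait Dataset[A] {
  def compute(): Seq[A]          // rebuild the contents by replaying the lineage
  def lineage: List[String]      // chain of steps, source first
  def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  def map[B](f: A => B): Dataset[B] = Mapped(this, f)
}

// Stand-in for data read from stable storage (hypothetical).
final case class FromStorage[A](path: String, read: () => Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = read()
  def lineage: List[String] = List(s"FromStorage($path)")
}

final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
  def lineage: List[String] = parent.lineage :+ "Filtered"
}

final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f)
  def lineage: List[String] = parent.lineage :+ "Mapped"
}

// Made-up log lines: level, component, message, separated by tabs.
val log = Seq("INFO\tweb\tstarted", "error\tdb\ttimeout", "error\tweb\trefused")

val messages = FromStorage("hdfs://...", () => log)
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

val firstResult = messages.compute()
val rebuilt = messages.compute()   // simulates recomputing after the first result was lost
```

Because the three definitions refer to each other, they need to be compiled together (for example with `:paste` in the Scala console). The sketch recomputes everything from the source; a real engine would recompute only the lost partitions.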
  • 12. Example: Logistic Regression
        val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
        var w = Vector.random(D)                                 // initial parameter vector
        for (i <- 1 to ITERATIONS) {                             // repeated MapReduce steps to do gradient descent
          val gradient = data.map(p =>
            (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
          ).reduce(_ + _)
          w -= gradient
        }
        println("Final w: " + w)
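The slide's loop depends on a Spark context and on helpers (`readPoint`, `Vector.random`, `dot`) that the transcript does not define. As a rough local sketch of the same gradient-descent update, the version below uses plain arrays and an in-memory `Seq`; `Point`, `dot`, `add`, `gradient`, `train`, and the four data points are assumptions made for illustration, and labels are taken to be -1.0 or +1.0 as the slide's formula implies.

```scala
import scala.math.exp
import scala.util.Random

// Labeled point; y is assumed to be -1.0 or +1.0.
final case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.indices.map(i => a(i) * b(i)).sum

def add(a: Array[Double], b: Array[Double]): Array[Double] =
  a.indices.map(i => a(i) + b(i)).toArray

// The "map" step computes each point's gradient term; the "reduce" step sums them.
def gradient(data: Seq[Point], w: Array[Double]): Array[Double] =
  data
    .map { p =>
      val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }
    .reduce((a, b) => add(a, b))

def train(data: Seq[Point], dims: Int, iterations: Int, seed: Long): Array[Double] = {
  val rng = new Random(seed)
  var w = Array.fill(dims)(2 * rng.nextDouble() - 1)   // random initial parameter vector
  for (_ <- 1 to iterations) {
    val g = gradient(data, w)
    w = w.indices.map(i => w(i) - g(i)).toArray        // w -= gradient (no learning rate, as on the slide)
  }
  w
}

// Tiny made-up, linearly separable data set.
val points = Seq(
  Point(Array(1.0, 0.5), 1.0),
  Point(Array(0.5, 1.0), 1.0),
  Point(Array(-1.0, -0.5), -1.0),
  Point(Array(-0.5, -1.0), -1.0)
)

val w = train(points, dims = 2, iterations = 10, seed = 42L)
val allCorrect = points.forall(p => p.y * dot(w, p.x) > 0)
```

In Spark the `data` collection would be a cached RDD, so every iteration after the first reads the points from memory rather than from disk.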
  • 13. Logistic Regression Performance
    [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
    Hadoop: 127 s / iteration
    Spark: first iteration 174 s, further iterations 6 s
  • 14. Supported Operators
    map, reduce, sample, filter, count, cogroup, groupBy, reduceByKey, take, sort, groupByKey, partitionBy, join, first, pipe, leftOuterJoin, union, save, rightOuterJoin, cross, ...
  • 15. Other Engine Features
    General graphs of operators (e.g. map-reduce-reduce)
    Hash-based reduces (faster than Hadoop’s sort)
    Controlled data partitioning to lower communication
    [Chart: PageRank performance, iteration time (s)] Hadoop: 171; Basic Spark: 72; Spark + controlled partitioning: 23
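The slide does not spell out how controlled partitioning lowers communication. One common reading, assumed here, is that two keyed datasets are hash-partitioned with the same function so that matching keys land in the same partition and a join needs no data movement between partitions. The sketch below shows that idea on local collections; `partitionOf`, `partitionBy`, `localJoin`, and the link and rank data are hypothetical and are not Spark's API.

```scala
// Assign a key to one of numPartitions buckets by hash (hypothetical helper).
def partitionOf[K](key: K, numPartitions: Int): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h     // keep the bucket index non-negative
}

// Split key-value pairs into buckets; both datasets must use the same function.
def partitionBy[K, V](pairs: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] =
  Vector.tabulate(numPartitions) { i =>
    pairs.filter { case (k, _) => partitionOf(k, numPartitions) == i }
  }

// Join co-partitioned datasets bucket by bucket: bucket i of the left side
// only ever looks at bucket i of the right side.
def localJoin[K, V, W](left: Vector[Seq[(K, V)]], right: Vector[Seq[(K, W)]]): Seq[(K, (V, W))] =
  left.zip(right).flatMap { case (l, r) =>
    val index = r.groupBy(_._1)
    for ((k, v) <- l; (_, w) <- index.getOrElse(k, Seq.empty)) yield (k, (v, w))
  }

// Made-up PageRank-style inputs: page -> outgoing link, page -> current rank.
val links = Seq("a" -> "b", "a" -> "c", "b" -> "c", "c" -> "a")
val ranks = Seq("a" -> 1.0, "b" -> 1.0, "c" -> 1.0)

val joined = localJoin(partitionBy(links, 2), partitionBy(ranks, 2))
```

In an iterative job such as PageRank the link table would be partitioned once and reused, so only the much smaller rank updates move between nodes on each iteration.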
  • 17. User Applications
    In-memory analytics & anomaly detection (Conviva)
    Interactive queries on data streams (Quantifind)
    Exploratory log analysis (Foursquare)
    Traffic estimation w/ GPS data (Mobile Millennium)
    Twitter spam classification (Monarch)
    . . .
  • 18. Conviva GeoReport
    Time (hours): Hive 20, Spark 0.5
    Group aggregations on many keys w/ same filter
    40× gain over Hive from avoiding repeated reading, deserialization and filtering
  • 19. Mobile Millennium Project
    Estimate city traffic from crowdsourced GPS data
    Iterative EM algorithm scaling to 160 nodes
    Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
  • 20. Shark: Hive on Spark
  • 21. Motivation
    Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
    Scala is good for programmers, but many data users only know SQL
    Can we extend Hive to run on Spark?
  • 22. Hive Architecture
    [Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Metastore; MapReduce; HDFS]
  • 23. Shark Architecture
    [Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Cache Mgr.; Metastore; Spark; HDFS]
    [Engle et al, SIGMOD 2012]
  • 24. Efficient In-Memory Storage
    Simply caching Hive records as Java objects is inefficient due to high per-object overhead
    Instead, Shark employs column-oriented storage using arrays of primitive types
    Row Storage          Column Storage
    1  john   4.1        1     2     3
    2  mike   3.5        john  mike  sally
    3  sally  6.4        4.1   3.5   6.4
  • 25. Efficient In-Memory Storage
    Simply caching Hive records as Java objects is inefficient due to high per-object overhead
    Instead, Shark employs column-oriented storage using arrays of primitive types
    Row Storage          Column Storage
    1  john   4.1        1     2     3
    2  mike   3.5        john  mike  sally
    3  sally  6.4        4.1   3.5   6.4
    Benefit: similarly compact size to serialized data, but >5x faster to access
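The row-versus-column contrast can be made concrete in a few lines. This sketch uses the slide's three-record table; `Row`, `ColumnBatch`, and the field names `id`, `name`, and `score` are assumptions made for illustration, not Shark's internal classes.

```scala
object ColumnarSketch {
  // Row storage: one JVM object per record.
  final case class Row(id: Int, name: String, score: Double)

  // Column storage: one array per column; numeric columns are primitive arrays.
  final case class ColumnBatch(ids: Array[Int], names: Array[String], scores: Array[Double]) {
    def length: Int = ids.length

    // Rebuild a single record on demand.
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))
  }

  def toColumns(rows: Seq[Row]): ColumnBatch =
    ColumnBatch(rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.score).toArray)

  // A scan over one column touches a single primitive array.
  def averageScore(batch: ColumnBatch): Double =
    if (batch.length == 0) 0.0 else batch.scores.sum / batch.length

  // The table shown on the slide.
  val sample: Seq[Row] = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))
}
```

With N records, the row layout allocates N record objects, whereas the column layout allocates one array per column regardless of N, which is the per-object overhead the slide refers to.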
  • 26. Using Shark
      CREATE TABLE mydata_cached AS SELECT …
    Run standard HiveQL on it, including UDFs
     » A few esoteric features are not yet supported
    Can also call from Scala to mix with Spark
    Early alpha release at shark.cs.berkeley.edu
  • 27. Benchmark Query 1
      SELECT * FROM grep WHERE field LIKE '%XYZ%';
    Execution time (secs): Shark (cached) 12, Shark 182, Hive 207
  • 28. Benchmark Query 2
      SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
      FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
      WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
      GROUP BY V.sourceIP
      ORDER BY earnings DESC
      LIMIT 1;
    Execution time (secs): Shark (cached) 126, Shark 270, Hive 447
  • 30. What’s Next?
    Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps)
    Another emerging use case that needs fast data sharing is stream processing
     » Track and update state in memory as events arrive
     » Large-scale reporting, click analysis, spam filtering, etc
  • 31. Streaming Spark
    Extends Spark to perform streaming computations
    Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
    Intermix seamlessly with batch and ad-hoc queries
      tweetStream
        .flatMap(_.toLower.split)
        .map(word => (word, 1))
        .reduceByWindow(5, _ + _)
    [Diagram: map and reduceByWindow stages applied to the batches at T=1, T=2, …]
    [Zaharia et al, HotCloud 2012]
  • 32. Streaming Spark
    Extends Spark to perform streaming computations
    Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
    Intermix seamlessly with batch and ad-hoc queries
      tweetStream
        .flatMap(_.toLower.split)
        .map(word => (word, 1))
        .reduceByWindow(5, _ + _)
    [Diagram: map and reduceByWindow stages applied to the batches at T=1, T=2, …]
    Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
    [Zaharia et al, HotCloud 2012]
  • 33. Streaming Spark
    Extends Spark to perform streaming computations
    Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
    Intermix seamlessly with batch and ad-hoc queries
      tweetStream
        .flatMap(_.toLower.split)
        .map(word => (word, 1))
        .reduceByWindow(5, _ + _)
    [Diagram: map and reduceByWindow stages applied to the batches at T=1, T=2, …]
    Alpha coming this summer
    [Zaharia et al, HotCloud 2012]
    
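The execution model on these slides, a stream cut into small batches with a reduction over a sliding window of recent batches, can be mimicked with ordinary collections. The sketch below is not the Streaming Spark API: `wordCounts`, `merge`, `windowedCounts`, and the representation of a stream as a `Seq` of batches are assumptions made for illustration.

```scala
object WindowedWordCount {
  // Count words in one micro-batch of tweets (flatMap, then count per word).
  def wordCounts(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Merge two count maps by adding the counts of shared words.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) => acc.updated(word, acc.getOrElse(word, 0) + n) }

  // For each time step, combine the counts of the last `window` batches.
  def windowedCounts(batches: Seq[Seq[String]], window: Int): Seq[Map[String, Int]] = {
    require(window >= 1, "window must cover at least one batch")
    val perBatch = batches.map(wordCounts)
    perBatch.indices.map { t =>
      perBatch.slice(math.max(0, t - window + 1), t + 1).reduce(merge)
    }
  }
}
```

Each per-batch result here is an immutable value derived from its input batch, which is loosely analogous to keeping streaming state as RDDs: a lost window result can be rebuilt from the batches it covers.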
  • 34. Conclusion
    Spark and Shark speed up your interactive and complex analytics on Hadoop data
    Download and docs: www.spark-project.org
     » Easy to run locally, on EC2, or on Mesos and soon YARN
    User meetup: meetup.com/spark-users
    Training camp at Berkeley in August!
    matei@berkeley.edu / @matei_zaharia
  • 35. Behavior with Not Enough RAM
    Iteration time (s) by % of working set in memory: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5
  • 36. Software Stack
    Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, …
    Spark
    Local mode, Apache Mesos, EC2, YARN
  • 37. Thank You!