Best Practices for Building and Deploying Predictive Models Over Big Data

Robert Grossman
Open Data Group & Univ. of Chicago

Collin Bennett
Open Data Group

October 23, 2012
1.  Introduction
2.  Building Predictive Models Using R – EDA, Features
3.  Case Study: MalStone
4.  Deploying Predictive Models
5.  Working with Multiple Models
6.  Case Study: CTA
7.  Building Models over Hadoop
8.  Case Study: Building Trees over Big Data
9.  Analytic Operations – SAMS
10. Case Study: AdReady
11. Building Predictive Models – Lift of a Model
12. Case Study: Matsu


The tutorial is divided into 12 Modules. You can download all the
modules and related materials from tutorials.opendatagroup.com.
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 1: Introduction

Robert Grossman
Open Data Group & Univ. of Chicago

Collin Bennett
Open Data Group

October 23, 2012
What Is Small Data?

•  100 million movie ratings
•  480 thousand customers
•  17,000 movies
•  From 1998 to 2005
•  Less than 2 GB of data
•  Fits into memory, but very sophisticated models required to win
What is Big Data?

•  The leaders in big data measure data in Megawatts.
   –  As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
   –  Facebook's new Prineville data center is 30 MW.
•  At the other end of the spectrum, databases can manage the data required for many projects.
An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) allows you to compute over more data in the same time.
What is Predictive Analytics?

Short Definition
•  Using data to make decisions.

Longer Definition
•  Using data to take actions and make decisions using models that are statistically valid and empirically derived.
[Diagram: Computationally Intensive Statistics (1984) → Data Mining & KDD (1993) → Predictive Analytics (2004)]
Key Modeling Activities

§  Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
§  Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
§  Building models – goal is to derive statistically valid models from a learning set
§  Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model
[Diagram: the Analytic Diamond – analytic strategy & governance at the top; analytic algorithms & models on one side; analytic operations, security & compliance on the other; and analytic infrastructure (the IT organization) at the base]
What Is Analytic Infrastructure?

Analytic infrastructure refers to the software components, software services, applications and platforms for managing data, processing data, producing analytic models, and using analytic models to generate alerts, take actions and make decisions.
[Diagram: the Modeling Group builds models – analytic models & algorithms]
Building models:

[Diagram: Data → Model Producer (CART, SVM, k-means, etc.) → Model (PMML)]
[Diagram: Operations, product – Analytic Operations runs the deployed models]
[Diagram: the end-to-end pipeline.
Analytic Infrastructure: operational data → data warehouse → analytic datamart (via ETL).
Analytic Modeling: cleaning data & exploratory data analysis → build features → candidate model → validated model.
Analytic Operations: deployment → deployed model (SAM) → scores, actions, measures, dashboards.]
Life Cycle of a Predictive Model

[Diagram: Exploratory Data Analysis (R) and processing the data (Hadoop) → build model in dev/modeling environment (analytic modeling) → export as PMML → deploy model in operational systems with a scoring application (Hadoop, streams & Augustus) (operations) → refine model using log files → retire model and deploy improved model]
Key Abstractions

Abstraction         | Credit Card Fraud             | Online advertising                        | Change Detection
Events              | Credit card transactions      | Impressions and clicks                    | Status updates
Entities            | Credit card accounts          | Cookies, user IDs                         | Computers, devices, etc.
Features (examples) | # transactions past n minutes | # impressions per category past n minutes, RFM related, etc. | # events in a category previous time period
Models              | CART                          | Clusters, trees, recommendation, etc.     | Baseline models
Scores              | Indicates likelihood of fraudulent account | Likelihood of clicking           | Indicates likelihood that a change has taken place
Event Based Updating of Feature Vectors

Event loop:
1.  take next event
2.  update one or more associated FVs
3.  rescore updated FVs to compute alerts

[Diagram: events stream in, update feature vectors (FVs) in R^n, and produce scores]
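A minimal sketch of the event loop above. The event format (entity id, feature index, increment), the scoring rule, and the alert threshold are hypothetical stand-ins chosen for illustration; a production system would read events from a stream and score with a deployed model.

```python
from collections import defaultdict

def run_event_loop(events, score, threshold):
    """Process events one at a time, updating per-entity feature
    vectors (FVs) and rescoring only the FVs that changed.

    events    -- iterable of (entity_id, feature_index, increment) tuples
    score     -- function mapping a feature vector (dict) to a float
    threshold -- scores above this value raise an alert
    """
    fvs = defaultdict(lambda: defaultdict(float))  # entity -> feature vector
    alerts = []
    for entity_id, feature_index, increment in events:
        # 1. take next event; 2. update the associated FV
        fvs[entity_id][feature_index] += increment
        # 3. rescore the updated FV to compute alerts
        if score(fvs[entity_id]) > threshold:
            alerts.append(entity_id)
    return alerts

# Toy scoring rule: sum of the feature values.
events = [("acct-1", 0, 1.0), ("acct-2", 0, 0.5), ("acct-1", 1, 2.5)]
alerts = run_event_loop(events, score=lambda fv: sum(fv.values()), threshold=3.0)
print(alerts)  # ['acct-1']
```

Note that only the feature vectors touched by the current event are rescored, which is what makes the loop practical over high-volume event streams.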
Example Architecture

•  Data lives in the Hadoop Distributed File System (HDFS) or other distributed file systems.
•  Hadoop streams and MapReduce are used to process the data.
•  R is used for Exploratory Data Analysis (EDA) and models are exported as PMML.
•  PMML models are imported, and Hadoop, Python Streams and Augustus are used to score data in production.
Questions?

For the most current version of these slides,
please see tutorials.opendatagroup.com
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 2: Building Models – Data Exploration and Feature Generation

Robert Grossman
Open Data Group & Univ. of Chicago

Collin Bennett
Open Data Group

October 23, 2012
Key Modeling Activities

§  Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
§  Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
§  Building models – goal is to derive statistically valid models from a learning set
§  Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model
?      Naïve Bayes
?      K nearest neighbor
1957   K-means
1977   EM Algorithm
1984   CART
1993   C4.5
1994   A Priori
1995   SVM
1997   AdaBoost
1998   PageRank

Beware of any vendor whose unique value proposition includes some "secret analytic sauce."

Source: X. Wu, et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
Pessimistic View: We Get a Significantly
New Algorithm Every Decade or So

1970's   neural networks
1980's   classification and regression trees
1990's   support vector machines
2000's   graph algorithms
In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
Questions About Datasets

•  Number of records
•  Number of data fields
•  Size
   –  Does it fit into memory?
   –  Does it fit on a single disk or disk array?
•  How many missing values?
•  How many duplicates?
•  Are there labels (for the dependent variable)?
•  Are there keys?
Questions About Data Fields

•  Data fields
   –  What is the mean, variance, …?
   –  What is the distribution? Plot the histograms.
   –  What are extreme values, missing values, unusual values?
•  Pairs of data fields
   –  What is the correlation? Plot the scatter plots.
•  Visualize the data
   –  Histograms, scatter plots, etc.
…   	
  
Computing Count Distributions

def valueDistrib(b, field, select=None):
    # given a dataframe b and an attribute field,
    # returns the class distribution of the field
    ccdict = {}
    if select:
        for i in select:
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    else:
        for i in range(len(b)):
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    return ccdict
Features

•  Normalize
   –  [0, 1] or [-1, 1]
•  Aggregate
•  Discretize or bin
•  Transform continuous to continuous or discrete to discrete
•  Compute ratios
•  Use percentiles
•  Use indicator variables
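A minimal sketch of two of the transformations above: min-max normalization to [0, 1] and equal-width binning. The function names and the bin count are illustrative, not from the slides.

```python
def normalize(values):
    """Min-max normalize a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values equal
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-indexed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values equal
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

values = [1.0, 2.0, 3.0, 5.0]
print(normalize(values))      # [0.0, 0.25, 0.5, 1.0]
print(discretize(values, 2))  # [0, 0, 1, 1]
```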
  
Predicting Office Building Renewal
Using Text-based Features
Classification   Avg R    Classification   Avg R
imports            5.17   technology        2.03
clothing           4.17   apparel           2.03
van                3.18   law               2.00
fashion            2.50   casualty          1.91
investment         2.42   bank              1.88
marketing          2.38   technologies      1.84
oil                2.11   trading           1.82
air                2.09   associates        1.81
system             2.06   staffing          1.77
foundation         2.04   securities        1.77
Twitter Intention & Prediction

•  Use machine learning classification techniques:
   –  Support Vector Machines
   –  Trees
   –  Naïve Bayes
•  Models require clean data and lots of hand-scored examples
What is the Size of Your Data?

•  Small
   –  Fits into memory
•  Medium
   –  Too large for memory
   –  But fits into a database
•  Large
   –  Too large for a database
   –  But you can use a NoSQL database (HBase, etc.)
   –  Or a DFS (Hadoop DFS)
If You Can Ask For Something

•  Ask for more data
•  Ask for "orthogonal data"
Questions?

For the most current version of these slides,
please see tutorials.opendatagroup.com
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 3: Case Study – MalStone

Robert Grossman
Open Data Group & Univ. of Chicago

Collin Bennett
Open Data Group

October 23, 2012
Part 1. Statistics over Log Files

•  Log files are everywhere
•  Advertising systems
•  Network and system logs for intrusion detection
•  Health and status monitoring
What are the Common Elements?

•  Time stamps
•  Sites
   –  e.g. Web sites, computers, network devices
•  Entities
   –  e.g. visitors, users, flows
•  Log files fill disks, many, many disks
•  Behavior occurs at all scales
•  Want to identify phenomena at all scales
•  Need to group "similar behavior"
•  Need to do statistics (not just sorting)
Abstract the Problem Using Site-Entity Logs

Example                      | Sites                 | Entities
Measuring online advertising | Web sites             | Consumers
Drive-by exploits            | Web sites             | Computers (identified by cookies or IP)
Compromised systems          | Compromised computers | User accounts
MalStone Schema

•  Event ID
•  Time stamp
•  Site ID
•  Entity ID
•  Mark (categorical variable)
•  Fits into 100 bytes
Toy Example

[Diagram: map/shuffle → reduce. Events collected by device or processor in time order → map events by site → for each site, compute counts and ratios of events by type]
Distributions

•  Tens of millions of sites
•  Hundreds of millions of entities
•  Billions of events
•  Most sites have a small number of events
•  Some sites have many events
•  Most entities visit a few sites
•  Some visitors visit many sites
MalStone B

[Diagram: sites and entities linked across time periods dk-2, dk-1, dk]
The Mark Model

•  Some sites are marked (the percent of marks is a parameter and the set of sites marked is a draw from a distribution)
•  Some entities become marked after visiting a marked site (this is a draw from a distribution)
•  There is a delay between the visit and when the entity becomes marked (this is a draw from a distribution)
•  There is a background process that marks some entities independent of visits (this adds noise to the problem)
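A toy simulation of the mark model above. The slides only say each step is "a draw from a distribution"; here we assume Bernoulli draws for site marking, infection, and background noise, and a geometric draw for the delay. All parameter names and values are illustrative, not the MalStone/MalGen specification.

```python
import random

def simulate_mark_model(n_sites, visits, p_site_marked, p_infect,
                        p_delay, p_background, seed=0):
    """Simulate the mark model; returns {entity: mark_time}.

    visits -- list of (time, site, entity) tuples in time order
    """
    rng = random.Random(seed)
    # some sites are marked: a draw from a distribution
    marked_sites = {s for s in range(n_sites) if rng.random() < p_site_marked}
    mark_time = {}
    for t, site, entity in visits:
        if entity in mark_time:
            continue  # already marked
        if site in marked_sites and rng.random() < p_infect:
            # delay between visit and mark: geometric draw
            delay = 1
            while rng.random() > p_delay:
                delay += 1
            mark_time[entity] = t + delay
        elif rng.random() < p_background:
            # background process marks entities independent of visits (noise)
            mark_time[entity] = t
    return mark_time

visits = [(t, t % 5, "e%d" % (t % 7)) for t in range(100)]
marks = simulate_mark_model(5, visits, 0.4, 0.3, 0.5, 0.01)
print(sorted(marks))
```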
  
[Diagram: an Exposure Window followed by a Monitor Window over time periods dk-2, dk-1, dk]
Notation

•  Fix a site s[j]
•  Let A[j] be the entities that transact during ExpWin and, if an entity is marked, the visit occurs before the mark
•  Let B[j] be all entities in A[j] that become marked sometime during the MonWin
•  The subsequent proportion of marks is

        r[j] = | B[j] | / | A[j] |
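The per-site statistic r[j] = |B[j]| / |A[j]| can be computed with a single map/shuffle/reduce pass over the site-entity log. The sketch below is a single-process stand-in for those steps; the record layout and function name are illustrative, not the MalStone reference implementation.

```python
from collections import defaultdict

def malstone_ratios(records):
    """Compute r[j] = |B[j]| / |A[j]| for each site.

    records -- iterable of (site_id, entity_id, marked) tuples, where
               marked is True if the entity became marked during MonWin.
    """
    # "map/shuffle": group entities (and their mark status) by site
    by_site = defaultdict(dict)
    for site_id, entity_id, marked in records:
        # an entity counts as marked for a site if any of its visits is marked
        by_site[site_id][entity_id] = by_site[site_id].get(entity_id, False) or marked

    # "reduce": for each site, the subsequent proportion of marks
    return {
        site_id: sum(entities.values()) / len(entities)
        for site_id, entities in by_site.items()
    }

records = [
    ("s1", "e1", True), ("s1", "e2", False),
    ("s2", "e1", True), ("s2", "e3", True), ("s2", "e3", True),
]
print(malstone_ratios(records))  # {'s1': 0.5, 's2': 1.0}
```

In the benchmark itself the "map" and "reduce" steps run as Hadoop jobs over billions of such records; the arithmetic per site is exactly this.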
  
[Diagram: an ExpWin followed by MonWin 1 and MonWin 2 over time periods dk-2, dk-1, dk]

B[j, t] are the entities that become marked during MonWin t

r[j, t] = | B[j, t] | / | A[j] |
Part 2. MalStone Benchmarks

code.google.com/p/malgen/

MalGen and MalStone implementations are open source
MalStone Benchmark

•  Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
•  Code to generate the required synthetic data is available from code.google.com/p/malgen
•  Stylized analytic computation that is easy to implement in MapReduce and its generalizations.
MalStone A & B

Benchmark       | Statistic | # records   | Size
MalStone A-10   | r[j]      | 10 billion  | 1 TB
MalStone A-100  | r[j]      | 100 billion | 10 TB
MalStone A-1000 | r[j]      | 1 trillion  | 100 TB
MalStone B-10   | r[j, t]   | 10 billion  | 1 TB
MalStone B-100  | r[j, t]   | 100 billion | 10 TB
MalStone B-1000 | r[j, t]   | 1 trillion  | 100 TB
•  MalStone B running on 10 billion 100-byte records
•  Hadoop version 0.18.3
•  20 nodes in the Open Cloud Testbed
•  MapReduce required 799 minutes
•  Hadoop streams required 142 minutes
Importance of Hadoop Streams

                         | MalStone B
Hadoop v0.18.3           | 799 min
Hadoop Streaming v0.18.3 | 142 min
Sector/Sphere            | 44 min
# Nodes                  | 20 nodes
# Records                | 10 Billion
Size of Dataset          | 1 TB
Questions?

For the most current version of these slides,
please see tutorials.opendatagroup.com
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 4: Deploying Analytics

Robert Grossman
Open Data Group & Univ. of Chicago

Collin Bennett
Open Data Group

October 23, 2012
Deploying analytic models

[Diagram: the Analytic Diamond – analytic algorithms & models; analytic operations, security & compliance; analytic infrastructure]
Problems with Current Techniques

•  Analytics built in modeling group and deployed by IT.
•  Time required to deploy models and to integrate models with other applications can be long.
•  Models are deployed in proprietary formats
•  Models are application dependent
•  Models are system dependent
•  Models are architecture dependent
Model Independence

•  A model should be independent of the environment:
   o  Production / Staging / Development
   o  MapReduce / Elastic MapReduce / Standalone
   o  Developer's desktop / Analyst's laptop
   o  Production Workflow / Historical Data Store
•  A model should be independent of coding language.
(Very Simplified) Architectural View

[Diagram: Data → Model Producer (CART, SVM, k-means, etc.) → PMML Model]

•  The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).
•  With PMML, it is easy to move models between applications and platforms.
Predictive Model Markup Language (PMML)

•  Based on XML
•  Benefits of PMML
   –  Open standard for Data Mining & Statistical Models
   –  Provides independence from application, platform, and operating system
•  PMML Producers
   –  Some applications create predictive models
•  PMML Consumers
   –  Other applications read or consume models
PMML

•  PMML is the leading standard for statistical and data mining models and is supported by over 20 vendors and organizations.
•  With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.
•  Applications exist in R, Python, Java, SAS, SPSS, …
Vendor Support

•  SPSS
•  SAS
•  IBM
•  Microstrategy
•  Salford
•  Tibco
•  Zementis
•  & other vendors

•  R
•  Augustus
•  KNIME
•  & other open source applications
Philosophy

•  Very important to understand what PMML is not concerned with …
•  PMML is a specification of a model, not an implementation of a model
•  PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations
•  Also, PMML includes the metadata required to deploy models
PMML Document Components

•  Data dictionary
•  Mining schema
•  Transformation dictionary
•  Multiple models, including segments and ensembles
•  Model verification, …
•  Univariate statistics (ModelStats)
•  Optional extensions
PMML Models

•  trees
•  associations
•  neural nets
•  naïve Bayes
•  sequences
•  text models
•  support vector machines
•  rulesets
•  polynomial regression
•  logistic regression
•  general regression
•  center based clusters
•  density based clusters
PMML Data Flow

•  Data Dictionary defines the data
•  Mining Schema defines the specific inputs (MiningFields) required for the model
•  Transformation Dictionary defines optional additional derived fields
•  Attributes of feature vectors are of two types
   –  attributes defined by the mining schema
   –  derived attributes defined via transformations
•  Models themselves can also support certain transformations
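As a minimal illustration of this flow, the sketch below parses a hand-written PMML fragment with Python's standard library and lists the DataDictionary fields and the MiningFields a model declares. The tiny document is invented for the example and is far from a complete PMML file.

```python
import xml.etree.ElementTree as ET

PMML_DOC = """\
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="2">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="fraud" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="toy" functionName="classification">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="fraud" usageType="predicted"/>
    </MiningSchema>
  </TreeModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML_DOC)

# Data Dictionary: every field the document defines
data_fields = [f.get("name") for f in root.findall("p:DataDictionary/p:DataField", NS)]
# Mining Schema: the inputs (and targets) this particular model uses
mining_fields = [f.get("name")
                 for f in root.findall("p:TreeModel/p:MiningSchema/p:MiningField", NS)]

print(data_fields)    # ['amount', 'fraud']
print(mining_fields)  # ['amount', 'fraud']
```

A real consumer (Augustus, for example) would continue past the MiningSchema into the model body and any TransformationDictionary before scoring.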
  
(Simplified) Architectural View

[Diagram: Data → Data Pre-processing → features → Model Producer (algorithms to estimate models) → PMML Model]

•  PMML also supports XML elements to describe data preprocessing.
Three Important Interfaces

[Diagram: in the Modeling Environment, Data → Data Pre-processing (1) → Model Producer (1) → PMML Model (2). In the Deployment Environment, the PMML Model (2) and incoming data (1) feed a Model Consumer, which emits scores (3) to Post Processing, which emits actions (3).]
Single Model PMML

Order of evaluation: DD → MS → TD → LT → Model

[Diagram: within the (global) PMML Document, the Data Dictionary (DD)
defines Data Fields, which map 1-to-1 to the Mining Schema (MS) and its
MiningFields (MF). The Transformation Dictionary (TD) and Local
Transformations (LT) define Derived Fields (DF). Model execution then
processes the elements for PMML-supported models, such as trees,
regression, Naïve Bayes, baselines, etc., producing Targets (Target),
Outputs (Output Fields, OF), and Model Verification (MV) fields.]

PMML Field Diagram (v8)
Augustus PMML Producer

[Diagram: various data formats, including csv files, http streams, &
databases, feed the Model Producer, which is driven by a Producer
Configuration File and a Producer Schema (part of PMML) and emits a
PMML Model.]

Augustus (augustus.googlecode.com)
Augustus PMML Consumer

[Diagram: various data formats, including csv files, http streams, &
databases, feed the Model Consumer, which is driven by a Consumer
Configuration File and a PMML Model; scores flow through Post
Processing to produce actions, alerts, reports, etc.]

Augustus (augustus.googlecode.com)
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 5: Multiple Models

Robert Grossman
Open Data Group & Univ. of Chicago
Collin Bennett
Open Data Group

October 23, 2012
1. Trees




For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone,
Classification and Regression Trees, 1984, Chapman & Hall.
Classification Trees
Petal Len.   Petal Width   Sepal Len.   Sepal Width   Species
02           14            33           50            A
24           56            31           67            C
23           51            31           69            C
13           45            28           57            B

•  Want a function Y = g(X), which predicts the (red) variable Y,
   Species, using one or more of the (blue) variables X[1], …, X[4]
•  Assume each row is classified A, B, or C
Trees Partition Feature Space

[Left: the Petal Width vs. Petal Length plane, partitioned into regions
A, B, and C by the splits Petal Length = 2.45, Petal Length = 4.95, and
Petal Width = 1.75. Right: the corresponding tree — Petal Length < 2.45
gives A; otherwise ask Petal Width < 1.75? If not, C; if so, Petal
Len < 4.95 gives B, else C.]

•  Trees partition the feature space into regions by asking whether an
   attribute is less than a threshold.
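The partition above reads directly as code. A minimal sketch in Python,
with the class labels and thresholds taken from the figure:

```python
def classify(petal_len, petal_width):
    """Toy classifier implementing the tree in the figure:
    each split asks whether an attribute is below a threshold."""
    if petal_len < 2.45:
        return "A"
    if petal_width < 1.75:
        # inside the narrow-petal-width band, split again on length
        return "B" if petal_len < 4.95 else "C"
    return "C"
```

Each leaf of the tree corresponds to one rectangular region of the
feature space on the left of the figure.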
2. Multiple Models: Ensembles
Key Idea: Combine Weak Learners

[Diagram: three overlapping weak models — Model 1, Model 2, Model 3.]

•  Average several weak models, rather than build one complex model.
Combining Weak Learners

1 Classifier   3 Classifiers   5 Classifiers
55%            57.40%          59.30%
60%            64.0%           68.20%
65%            71.00%          76.50%

(Pascal’s triangle of binomial coefficients, which govern the
majority-vote calculation:)

          1
        1   1
      1   2   1
    1   3   3   1
  1   4   6   4   1
1   5  10  10   5   1
Ensembles

•  Ensembles are collections of classifiers with a rule for combining
   the results.
•  The simplest combining rules:
   –  Use the average for regression trees.
   –  Use majority voting for classification.
•  Ensembles are unreasonably effective at building models.
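Under the (strong) assumption that the classifiers err independently,
the gain from majority voting is a binomial calculation; the figures in
the table above are approximately what this gives. A sketch:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the correct class
    (n is odd, so there are no ties)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

# e.g. majority_accuracy(0.55, 3) is about 0.575
```

Real classifiers built from the same data are correlated, so the actual
gain is smaller than this independent-error bound suggests.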
Building ensembles over clusters of commodity workstations has been
used since the 1990s.

Ensemble Models

1.  Partition data and scatter
2.  Build models (e.g. a tree-based model)
3.  Gather models
4.  Form the collection of models into an ensemble (e.g. majority vote
    for classification & averaging for regression)
Example: Fraud

•  Ensembles of trees have proved remarkably effective in detecting
   fraud
•  Trees can be very large
•  Very fast to score
•  Lots of variants
   –  Random forests
   –  Random trees
•  Sometimes need to reverse engineer reason codes.
3. CUSUM

[Figure: the baseline density f0(x) and the observed density f1(x).]

•  Assume that we have a mean and standard deviation for both
   distributions.

   f0(x) = baseline density
   f1(x) = observed density
   g(x)  = log(f1(x)/f0(x))
   Z0    = 0
   Zn+1  = max{Zn + g(x), 0}
   Alert if Zn+1 > threshold
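The update rule above fits in a few lines of Python. In this sketch the
Gaussian densities and the reset-after-alert convention are assumptions
for illustration:

```python
from math import exp, log, pi, sqrt

def gaussian(mean, var):
    """Return a Gaussian density with the given mean and variance."""
    return lambda x: exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def cusum_alerts(xs, f0, f1, threshold):
    """Z0 = 0; Zn+1 = max{Zn + g(x), 0} with g(x) = log(f1(x)/f0(x));
    record an alert whenever Zn+1 exceeds the threshold."""
    z, alerts = 0.0, []
    for i, x in enumerate(xs):
        z = max(z + log(f1(x) / f0(x)), 0.0)
        if z > threshold:
            alerts.append(i)
            z = 0.0  # reset after an alert (one common convention)
    return alerts
```

While the stream looks like f0, the log-likelihood ratio is negative and
Z stays pinned near 0; a shift toward f1 accumulates positive increments
until the threshold trips.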
4. Cubes of Models

Example 1: Cubes of Models for Change Detection for a Payments System

•  Build a separate segmented model for every entity of interest
•  Divide & conquer the data (segment) using multidimensional data
   cubes
•  For each distinct cube cell, establish separate baselines for each
   quantity of interest
•  Detect changes from the baselines

[Diagram: a data cube with axes Geospatial region x Type of Transaction
x Entity (bank, etc.). Estimate separate baselines for each quantity of
interest.]
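One way to sketch the segmentation above is a dictionary keyed by cube
cell, with one running baseline per cell; the field names and the
(count, sum, sum-of-squares) accumulator are illustrative assumptions,
not the payments system's actual schema:

```python
from collections import defaultdict

# one accumulator [count, sum, sum of squares] per cube cell
cells = defaultdict(lambda: [0, 0.0, 0.0])

def observe(region, txn_type, entity, amount):
    """Route a transaction to its (region, type, entity) cell and
    update that cell's running statistics."""
    c = cells[(region, txn_type, entity)]
    c[0] += 1
    c[1] += amount
    c[2] += amount * amount

def baseline(region, txn_type, entity):
    """Return (mean, variance) for one cell of the cube."""
    n, s, ss = cells[(region, txn_type, entity)]
    mean = s / n
    return mean, ss / n - mean * mean
```

New cells appear automatically as new combinations of region, type, and
entity are seen, which is what makes the cube-of-models approach scale.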
Example 2 – Cubes of Models for Detecting Anomalies in Traffic

•  Highway Traffic Data
   –  each day (7) x each hour (24) x each sensor (hundreds) x each
      weather condition (5) x each special event (dozens)
   –  50,000 baseline models used in the current testbed
5. Multiple Models in Hadoop

•  Trees in the Mapper
•  Build lots of small trees (over historical data in Hadoop)
•  Load all of them into the mapper
•  Mapper
   –  For each event, score against each tree
   –  Map emits
      •  key \t value = event id \t tree id, score
•  Reducer
   –  Aggregate the scores for each event
   –  Select a “winner” for each event
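A minimal sketch of the mapper step described above; the nested-tuple
tree format, the tab-delimited input records, and `score_tree` are
illustrative assumptions:

```python
def score_tree(tree, event):
    """Walk a tree stored as nested (feature_index, threshold, left,
    right) tuples; a leaf is just a score."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if event[feature] < threshold else right
    return tree

def mapper(trees, lines):
    """For each event, score against each tree and emit
    key \t value = event id \t tree id,score."""
    out = []
    for line in lines:
        event_id, *fields = line.rstrip("\n").split("\t")
        event = [float(f) for f in fields]
        for tree_id, tree in enumerate(trees):
            out.append(f"{event_id}\t{tree_id},{score_tree(tree, event)}")
    return out
```

In a real Streaming job the emitted lines go to stdout; Hadoop then
shuffles them by event id so the reducer sees all tree scores for one
event together.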
Questions?

For the most current version of these slides,
please see tutorials.opendatagroup.com
Best Practices for Building and
Deploying Predictive Models Over Big Data

Module 6: Case Study - CTA

Robert Grossman
Open Data Group & Univ. of Chicago
Collin Bennett
Open Data Group

October 23, 2012
Chicago Transit Authority

•  The Chicago Transit Authority provides service to Chicago and the
   surrounding suburbs
•  CTA operates 24 hours each day and on an average weekday provides
   1.7 million rides on buses and trains
•  It has 1,800 buses, operating over 140 routes, along 2,230 route
   miles
•  Buses provide approximately 1 million rides a day and serve more
   than 12,000 posted bus stops
CTA Data

•  The City of Chicago has a public data portal
   •  https://data.cityofchicago.org/

•  Some of the available categories are:
   •  Administration & Finance
   •  Community & Economic Development
   •  Education
   •  Facilities & Geographic Boundaries
   •  Transportation
Moving to Hadoop

•  As an example, we model ridership by bus route
      Data set: CTA - Ridership - Bus Routes - Daily Totals by Route
•  MapReduce
   –  Map Input:
      •  Route number, Day type, Number of riders
   –  Map Output / Reduce Input:
      •  key \t value = route \t day, date, riders
   –  Reduce Output:
      •  Model for each route
Example Data Set

•  Data goes back to 2001
   –  Route Number
   –  Date
   –  Day of Week Type
   –  Number of riders
•  Demo size:
   –  11 MB
   –  534,208 records
•  Saved as a csv file in HDFS
Workflow

[Three slides of workflow diagrams.]
Mapper

•  Using Hadoop’s Streaming Interface, we can use a language
   appropriate for each step
•  The mapper needs to read the records quickly
•  We use Python
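A Streaming mapper is just a program that reads stdin and writes
key/value lines to stdout. A sketch for the daily-totals records, where
the CSV column order (route, date, daytype, rides) is an assumption
about the export:

```python
import sys

def map_line(line):
    """Turn one CSV record into key \t value =
    route|daytype \t date,rides.
    (Assumed column order: route,date,daytype,rides.)"""
    route, date, daytype, rides = line.rstrip("\n").split(",")
    return f"{route}|{daytype}\t{date},{rides}"

def main(stream=sys.stdin):
    # Hadoop Streaming runs this over each input split
    for line in stream:
        if line.strip():
            print(map_line(line))
```

Keying on route|daytype means the shuffle delivers all records for one
(route, day-of-week-type) segment to a single reducer call.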
Reducer

•  For this example, we calculate a simple statistic:
   –  a Gaussian distribution of riders for (route, day of week) pairs,
      described by the mean and variance
•  We use the Baseline Model in PMML to describe this.
•  The Baseline model is often useful for Change Detection.
   •  Do the records matching a segment fit the expected distribution
      describing the segment?
Describing a Segment

•  Baseline Model

       <BaselineModel functionName="regression">
           <MiningSchema>
               <MiningField usageType="active"
                            name="rides" />
           </MiningSchema>
           <TestDistributions field="rides"
                              testStatistic="zValue">
               <Baseline>
                   <GaussianDistribution
                        variance="64308.1625973"
                        mean="1083.88950936">
                   </GaussianDistribution>
               </Baseline>
           </TestDistributions>
       </BaselineModel>
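Scoring against this segment reduces to the zValue statistic:
standardize each observation against the segment's Gaussian. A minimal
sketch using the numbers from the PMML above:

```python
from math import sqrt

# mean and variance from the GaussianDistribution element above
MEAN = 1083.88950936
VARIANCE = 64308.1625973

def z_value(rides, mean=MEAN, variance=VARIANCE):
    """zValue test statistic for one observation against the
    segment's Gaussian baseline."""
    return (rides - mean) / sqrt(variance)
```

A day with 2,000 rides on this segment scores roughly 3.6 standard
deviations above baseline, which would typically flag a change.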
Building PMML in the Reducer

•  You can embed a PMML engine into the reducer if you are using
   –  Java
   –  R
   –  Python
•  We present a light-weight solution which builds valid models, but
   also allows you to construct invalid ones
Using a Valid Model Template

Template in R

•  Similar, but a little less elegant, in R
Populating the Template (Python)

•  This is done in an XML / XPath way, rather than as PMML

Populating the Template (R)
Reducer Trade Offs

•  Standard trade offs apply
   –  Holding events in the reducer (easier, RAM) vs.
   –  Computing running statistics (more coding, CPU)
•  We present two techniques using R
   –  Holding all events for a given key in a dataframe and then
      calculating the statistic when a key transition is identified
   –  Holding the minimum data needed to calculate running statistics,
      computed as each event is seen
Two Approaches

•  Holding events requires extra memory and may get slow as the number
   of events shuffled to the reducer for a given key grows large
   –  But there are a lot of statistical and data mining methods
      available in R that operate on a dataframe
•  Tracking running statistics requires more coding
   –  But it may be the only way to process a batch without spilling
      to disk
Dataframe

•  Accumulate each mapped record
•  For each event, add it to a temporary table

     doAny <- function(v, date, rides) {
         # append a row to the temporary table
         v[["rows"]][[length(v[["rows"]]) + 1]]
             <- list(date, rides)
     }

•  Create the dataframe when we have them all
Dataframe continued

•  When you have seen all of them, process and compute
doLast <- function(v) {
  # make a data frame out of the temporary table
  dataframe <- do.call("rbind", v[["rows"]])
  colnames(dataframe) <- c("date", "rides")

    # pull a column out and call the function on it
    rides <- unlist(dataframe[,"rides"])
    result <- ewup2(rides, alpha)

    count <- result[["count"]][[length(rides)]]
    mean <- result[["mean"]][[length(rides)]]
    variance <- result[["variance"]][[length(rides)]]
    varn <- result[["varn"]][[length(rides)]]
}
Running Totals

•  Update statistics as each event is encountered
     doNonFirst <- function(v, x) {
         # update the counters
         v[["count"]] <- v[["count"]] + 1
         diff <- v[["prevx"]] - v[["mean"]]
         incr <- alpha * diff
         v[["mean"]] <- v[["mean"]] + incr
         v[["varn"]] <- (1.0 - alpha)*
            (v[["varn"]] + incr*diff)
         v[["prevx"]] <- x
     }
Running Totals continued

•  Correct the variance once the last event has been seen
     doLast <- function(v) {
       if (v[["count"]] >= 2) {
           # when N >= 2, correct for 1/(N - 1)
           variance <- v[["varn"]] /
               (v[["count"]] - 1.0)
       }
       else {
           variance <- v[["varn"]]
       }
     }
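The same running update is easy to express in Python; the
initialization below is an assumption, since the slides' doFirst is
not shown:

```python
def ew_update(state, x, alpha):
    """Exponentially weighted running mean/variance mirroring the R
    reducer above (note it folds in the previous observation)."""
    if state is None:  # assumed initialization; doFirst is not shown
        return {"count": 1, "mean": x, "varn": 0.0, "prevx": x}
    state["count"] += 1
    diff = state["prevx"] - state["mean"]
    incr = alpha * diff
    state["mean"] += incr
    state["varn"] = (1.0 - alpha) * (state["varn"] + incr * diff)
    state["prevx"] = x
    return state

def ew_variance(state):
    # when N >= 2, correct by 1/(N - 1), as in doLast above
    n = state["count"]
    return state["varn"] / (n - 1.0) if n >= 2 else state["varn"]
```

Only four numbers are held per key, which is the whole point of the
running-statistics approach: memory stays constant no matter how many
events the shuffle delivers.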
Results

[Diagram: a collection of models, one per route segment.]

•  As new data are seen, models can be updated
•  New routes are added as new segments
•  The model is not bound to Hadoop
EDA

•  Write plots to HDFS as SVG files (text, xml)
•  Graphical view of the reduce calculation or statistic
 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/
 svg11.dtd">
 <svg style="stroke-linejoin:miter; stroke:black; stroke-width:2.5; text-anchor:middle;
 fill:none" xmlns="http://www.w3.org/2000/svg" font-family="Helvetica, Arial, FreeSans, Sans,
 sans, sans-serif" width="2000px" height="500px" version="1.1" xmlns:xlink="http://www.w3.org/
 1999/xlink" viewBox="0 0 2000 500">
 <defs>
     <clipPath id="Legend_0_clip">
         <rect x="1601.2" y="25" width="298.8" height="215" />
     </clipPath>
     <clipPath id="TimeSeries_0_clip">
         <rect x="240" y="25" width="1660" height="435" />
     </clipPath>
     <clipPath id="TimeSeries_10_clip">
         <rect x="240" y="25" width="1660" height="435" />
     </clipPath>
     <clipPath id="TimeSeries_11_clip">
         <rect x="240" y="25" width="1660" height="435" />
     …	
  
EDA	
  
Questions?

For the most current version of these slides,
please see tutorials.opendatagroup.com
About Robert Grossman

Robert Grossman is the Founder and a Partner of Open Data Group, which
specializes in building predictive models over big data. He is the
Director of Informatics at the Institute for Genomics and Systems
Biology (IGSB) and a faculty member in the Computation Institute (CI)
at the University of Chicago. His areas of research include: big data,
predictive analytics, bioinformatics, data intensive computing and
analytic infrastructure. He has led the development of new open source
software tools for analyzing big data, cloud computing, data mining,
distributed computing and high performance networking. Prior to
starting Open Data Group, he founded Magnify, Inc. in 1996, which
provides data mining solutions to the insurance industry. Grossman was
Magnify’s CEO until 2001 and its Chairman until it was sold to
ChoicePoint in 2005. He blogs about big data, data science, and data
engineering at rgrossman.com.
About Robert Grossman (cont’d)

•  You can find more information about big data and related areas on
   my blog: rgrossman.com.
•  Some of my technical papers are available online at rgrossman.com
•  I just finished a book for the non-technical reader called The
   Structure of Digital Computing: From Mainframes to Big Data, which
   is available from Amazon.
About Collin Bennett

Collin Bennett is a principal at Open Data Group. In three and a half
years with the company, Collin has worked on the open source Augustus
scoring engine and a cloud-based environment for rapid analytic
prototyping called RAP. Additionally, he has released open source
projects for the Open Cloud Consortium. One of these, MalGen, has been
used to benchmark several parallel computation frameworks. Previously,
he led software development for the Product Development Team at Acquity
Group, an IT consulting firm headquartered in Chicago. He also worked
at the startups Orbitz (when it was still one) and Business Logic
Corporation. He has co-authored papers on Weyl tensors, large data
clouds, and high performance wide area cloud testbeds. He holds degrees
in English, Mathematics and Computer Science.
About Open Data

•  Open Data began operations in 2001 and has built predictive models
   for companies for over ten years
•  Open Data provides management consulting, outsourced analytic
   services, & analytic staffing
•  For more information
   •  www.opendatagroup.com
   •  info@opendatagroup.com
About Open Data Group (cont’d)

Open Data Group has been building predictive models over big data for
its clients for over ten years. Open Data Group is one of the pioneers
using technologies such as Hadoop and NoSQL databases so that companies
can build predictive models efficiently over all of their data. Open
Data has also participated in the development of the Predictive Model
Markup Language (PMML) so that companies can more quickly deploy
analytic models into their operational systems.

For the past several years, Open Data has also helped companies build
predictive models and score data in real time using elastic clouds,
such as those provided with Amazon’s AWS.

Open Data’s consulting philosophy is nicely summed up by Marvin Bower,
one of McKinsey & Company’s founders. Bower led his firm using three
rules: (1) put the interests of the client ahead of revenues; (2) tell
the truth and don’t be afraid to challenge a client’s opinion; and (3)
only agree to perform work that is necessary and something [you] can
do well.

www.opendatagroup.com

THE FUTURE OF DATA: PROVISIONING ANALYTICS-READY DATA AT SPEED
 
Fbdl enabling comprehensive_data_services
Fbdl enabling comprehensive_data_servicesFbdl enabling comprehensive_data_services
Fbdl enabling comprehensive_data_services
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
Implementing bi in proof of concept techniques
Implementing bi in proof of concept techniquesImplementing bi in proof of concept techniques
Implementing bi in proof of concept techniques
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
The Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | QuboleThe Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | Qubole
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
 
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lake
 
Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects
 
A blueprint for data in a multicloud world
A blueprint for data in a multicloud worldA blueprint for data in a multicloud world
A blueprint for data in a multicloud world
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
JDV Big Data Webinar v2
JDV Big Data Webinar v2JDV Big Data Webinar v2
JDV Big Data Webinar v2
 

Viewers also liked

Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...NICSA
 
Flight Delay Prediction Model (2)
Flight Delay Prediction Model (2)Flight Delay Prediction Model (2)
Flight Delay Prediction Model (2)Shubham Gupta
 
Airline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectAirline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectHaozhe Wang
 
Flight Arrival Delay Prediction
Flight Arrival Delay PredictionFlight Arrival Delay Prediction
Flight Arrival Delay PredictionShabnam Abghari
 

Viewers also liked (7)

Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
 
Flight Delay Prediction Model (2)
Flight Delay Prediction Model (2)Flight Delay Prediction Model (2)
Flight Delay Prediction Model (2)
 
Airline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectAirline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining Project
 
Flight Arrival Delay Prediction
Flight Arrival Delay PredictionFlight Arrival Delay Prediction
Flight Arrival Delay Prediction
 
Big Data For Flight Delay Report
Big Data For Flight Delay ReportBig Data For Flight Delay Report
Big Data For Flight Delay Report
 
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPTBIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
 
Flight Delay Prediction
Flight Delay PredictionFlight Delay Prediction
Flight Delay Prediction
 

Similar to Best practices for building and deploying predictive models over big data presentation

Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
Big Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureBig Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureOdinot Stanislas
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5Robert Grossman
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Supply Chain Council Presentation For Indianapolis 2 March 2012
Supply Chain Council Presentation For Indianapolis 2 March 2012Supply Chain Council Presentation For Indianapolis 2 March 2012
Supply Chain Council Presentation For Indianapolis 2 March 2012Arnold Mark Wells
 
An Integrated Framework for Parameter-based Optimization of Scientific Workflows
An Integrated Framework for Parameter-based Optimization of Scientific WorkflowsAn Integrated Framework for Parameter-based Optimization of Scientific Workflows
An Integrated Framework for Parameter-based Optimization of Scientific Workflowsvijayskumar
 
Queuing model based load testing of large enterprise applications
Queuing model based load testing of large enterprise applicationsQueuing model based load testing of large enterprise applications
Queuing model based load testing of large enterprise applicationsLeonid Grinshpan, Ph.D.
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Big Data Needs Big Analytics
Big Data Needs Big AnalyticsBig Data Needs Big Analytics
Big Data Needs Big AnalyticsDeepak Ramanathan
 
IBM Cognos - IBM informations-integration för IBM Cognos användare
IBM Cognos - IBM informations-integration för IBM Cognos användareIBM Cognos - IBM informations-integration för IBM Cognos användare
IBM Cognos - IBM informations-integration för IBM Cognos användareIBM Sverige
 
Kahn.theodore
Kahn.theodoreKahn.theodore
Kahn.theodoreNASAPMC
 
Practical data science
Practical data sciencePractical data science
Practical data scienceDing Li
 
Egypt hackathon 2014 analytics & spss session
Egypt hackathon 2014   analytics & spss sessionEgypt hackathon 2014   analytics & spss session
Egypt hackathon 2014 analytics & spss sessionM Baddar
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Build, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleBuild, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleAmazon Web Services
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro sessionAvinash Patil
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesDataWorks Summit
 

Similar to Best practices for building and deploying predictive models over big data presentation (20)

Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Big Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureBig Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the Future
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Supply Chain Council Presentation For Indianapolis 2 March 2012
Supply Chain Council Presentation For Indianapolis 2 March 2012Supply Chain Council Presentation For Indianapolis 2 March 2012
Supply Chain Council Presentation For Indianapolis 2 March 2012
 
An Integrated Framework for Parameter-based Optimization of Scientific Workflows
An Integrated Framework for Parameter-based Optimization of Scientific WorkflowsAn Integrated Framework for Parameter-based Optimization of Scientific Workflows
An Integrated Framework for Parameter-based Optimization of Scientific Workflows
 
Queuing model based load testing of large enterprise applications
Queuing model based load testing of large enterprise applicationsQueuing model based load testing of large enterprise applications
Queuing model based load testing of large enterprise applications
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Big Data Needs Big Analytics
Big Data Needs Big AnalyticsBig Data Needs Big Analytics
Big Data Needs Big Analytics
 
IBM Cognos - IBM informations-integration för IBM Cognos användare
IBM Cognos - IBM informations-integration för IBM Cognos användareIBM Cognos - IBM informations-integration för IBM Cognos användare
IBM Cognos - IBM informations-integration för IBM Cognos användare
 
Kahn.theodore
Kahn.theodoreKahn.theodore
Kahn.theodore
 
Practical data science
Practical data sciencePractical data science
Practical data science
 
Egypt hackathon 2014 analytics & spss session
Egypt hackathon 2014   analytics & spss sessionEgypt hackathon 2014   analytics & spss session
Egypt hackathon 2014 analytics & spss session
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Build, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleBuild, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at Scale
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive Technologies
 

More from Kun Le

Ibm big data and analytics for a holistic customer journey
Ibm big data and analytics for a holistic customer journeyIbm big data and analytics for a holistic customer journey
Ibm big data and analytics for a holistic customer journeyKun Le
 
University of Washington Computer Science & Engineering CSE 403: Software Eng...
University of Washington Computer Science & Engineering CSE 403: Software Eng...University of Washington Computer Science & Engineering CSE 403: Software Eng...
University of Washington Computer Science & Engineering CSE 403: Software Eng...Kun Le
 
Business Intelligence and Retail
Business Intelligence and RetailBusiness Intelligence and Retail
Business Intelligence and RetailKun Le
 
The “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedInThe “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedInKun Le
 
Marketo - Definitive guide to marketing metrics marketing analytics
Marketo - Definitive guide to marketing metrics marketing analyticsMarketo - Definitive guide to marketing metrics marketing analytics
Marketo - Definitive guide to marketing metrics marketing analyticsKun Le
 
Under the hood_of_mobile_marketing
Under the hood_of_mobile_marketingUnder the hood_of_mobile_marketing
Under the hood_of_mobile_marketingKun Le
 
IBM-Infoworld Big Data deep dive
IBM-Infoworld Big Data deep diveIBM-Infoworld Big Data deep dive
IBM-Infoworld Big Data deep diveKun Le
 
Smarter analytics for retailers Delivering insight to enable business success
Smarter analytics for retailers Delivering insight to enable business successSmarter analytics for retailers Delivering insight to enable business success
Smarter analytics for retailers Delivering insight to enable business successKun Le
 
Quoc Le, Stanford & Google - Tera Scale Deep Learning
Quoc Le, Stanford & Google - Tera Scale Deep LearningQuoc Le, Stanford & Google - Tera Scale Deep Learning
Quoc Le, Stanford & Google - Tera Scale Deep LearningKun Le
 
IBM - Using Predictive Analytics to Segment, Target and Optimize Marketing
IBM - Using Predictive Analytics to Segment, Target and Optimize MarketingIBM - Using Predictive Analytics to Segment, Target and Optimize Marketing
IBM - Using Predictive Analytics to Segment, Target and Optimize MarketingKun Le
 
IBM-Why Big Data?
IBM-Why Big Data?IBM-Why Big Data?
IBM-Why Big Data?Kun Le
 
Big data that drives marketing roi across all channels & campaigns
Big data that drives marketing roi across all channels & campaignsBig data that drives marketing roi across all channels & campaigns
Big data that drives marketing roi across all channels & campaignsKun Le
 

More from Kun Le (12)

Ibm big data and analytics for a holistic customer journey
Ibm big data and analytics for a holistic customer journeyIbm big data and analytics for a holistic customer journey
Ibm big data and analytics for a holistic customer journey
 
University of Washington Computer Science & Engineering CSE 403: Software Eng...
University of Washington Computer Science & Engineering CSE 403: Software Eng...University of Washington Computer Science & Engineering CSE 403: Software Eng...
University of Washington Computer Science & Engineering CSE 403: Software Eng...
 
Business Intelligence and Retail
Business Intelligence and RetailBusiness Intelligence and Retail
Business Intelligence and Retail
 
The “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedInThe “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedIn
 
Marketo - Definitive guide to marketing metrics marketing analytics
Marketo - Definitive guide to marketing metrics marketing analyticsMarketo - Definitive guide to marketing metrics marketing analytics
Marketo - Definitive guide to marketing metrics marketing analytics
 
Under the hood_of_mobile_marketing
Under the hood_of_mobile_marketingUnder the hood_of_mobile_marketing
Under the hood_of_mobile_marketing
 
IBM-Infoworld Big Data deep dive
IBM-Infoworld Big Data deep diveIBM-Infoworld Big Data deep dive
IBM-Infoworld Big Data deep dive
 
Smarter analytics for retailers Delivering insight to enable business success
Smarter analytics for retailers Delivering insight to enable business successSmarter analytics for retailers Delivering insight to enable business success
Smarter analytics for retailers Delivering insight to enable business success
 
Quoc Le, Stanford & Google - Tera Scale Deep Learning
Quoc Le, Stanford & Google - Tera Scale Deep LearningQuoc Le, Stanford & Google - Tera Scale Deep Learning
Quoc Le, Stanford & Google - Tera Scale Deep Learning
 
IBM - Using Predictive Analytics to Segment, Target and Optimize Marketing
IBM - Using Predictive Analytics to Segment, Target and Optimize MarketingIBM - Using Predictive Analytics to Segment, Target and Optimize Marketing
IBM - Using Predictive Analytics to Segment, Target and Optimize Marketing
 
IBM-Why Big Data?
IBM-Why Big Data?IBM-Why Big Data?
IBM-Why Big Data?
 
Big data that drives marketing roi across all channels & campaigns
Big data that drives marketing roi across all channels & campaignsBig data that drives marketing roi across all channels & campaigns
Big data that drives marketing roi across all channels & campaigns
 

Recently uploaded

Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 

Recently uploaded (20)

Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 

Best practices for building and deploying predictive models over big data presentation

  • 1. Best Practices for Building and Deploying Predictive Models Over Big Data. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012.
  • 2. The tutorial is divided into 12 modules: 1. Introduction; 2. Building Predictive Models – EDA, Features; 3. Case Study: MalStone; 4. Deploying Predictive Models; 5. Working with Multiple Models; 6. Case Study: CTA; 7. Building Models over Hadoop Using R; 8. Case Study: Building Trees over Big Data; 9. Analytic Operations – SAMS; 10. Case Study: AdReady; 11. Building Predictive Models – Life of a Model; 12. Case Study: Matsu. You can download all the modules and related materials from tutorials.opendatagroup.com.
  • 3. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 1: Introduction. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012.
  • 4. What Is Small Data? 100 million movie ratings; 480 thousand customers; 17,000 movies; from 1998 to 2005. Less than 2 GB of data: it fits into memory, but very sophisticated models were required to win.
  • 5. What is Big Data? The leaders in big data measure data in Megawatts. As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW, and Facebook's new Prineville data center is 30 MW. At the other end of the spectrum, databases can manage the data required for many projects.
  • 6. An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) allows you to compute over more data in the same time.
  • 7. What is Predictive Analytics? Short definition: using data to make decisions. Longer definition: using data to take actions and make decisions using models that are statistically valid and empirically derived.
  • 8. [Timeline diagram] Predictive Analytics (2004); Data Mining & KDD (1993); Computationally Intensive Statistics (1984).
  • 9. Key Modeling Activities. Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated. Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable. Building models – goal is to derive statistically valid models from a learning set. Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model.
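The "building models" and "evaluating models" activities above can be sketched in a few lines. This is an illustrative toy (the tutorial itself uses R and Hadoop): a trivial threshold classifier is derived from a learning set, then scored on a held-out test set so its accuracy can be compared against another model's.

```python
# Toy sketch of "build models" and "evaluate models" on synthetic data.
# Not from the tutorial; names and data are illustrative.
import random

def build_model(learning_set):
    """Derive a threshold classifier from (feature, label) pairs."""
    positives = [x for x, y in learning_set if y == 1]
    negatives = [x for x, y in learning_set if y == 0]
    # Place the threshold halfway between the two class means.
    return (sum(positives) / len(positives) +
            sum(negatives) / len(negatives)) / 2.0

def evaluate_model(threshold, test_set):
    """Accuracy on a held-out set, so models can be compared."""
    correct = sum(1 for x, y in test_set if (x > threshold) == (y == 1))
    return correct / len(test_set)

random.seed(0)
data = [(random.gauss(0, 1), 0) for _ in range(100)] + \
       [(random.gauss(3, 1), 1) for _ in range(100)]
random.shuffle(data)
learning_set, test_set = data[:150], data[150:]
model = build_model(learning_set)
print(evaluate_model(model, test_set))
```

The same split discipline (fit on the learning set, report on the held-out set) carries over unchanged to the CART, SVM, and k-means models named later in the deck.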
  • 10. [Diagram] The Analytic Diamond: analytic strategy & governance; analytic algorithms & models; analytic operations, security & compliance; analytic infrastructure.
• 11. IT Organization: Analytic Infrastructure
• 12. What Is Analytic Infrastructure? Analytic infrastructure refers to the software components, software services, applications and platforms for managing data, processing data, producing analytic models, and using analytic models to generate alerts, take actions and make decisions.
• 13. Modeling Group: build models (analytic models & algorithms)
• 14. Building models: Data → Model Producer (CART, SVM, k-means, etc.) → PMML Model
• 15. Analytic Operations (operations, product): deployed models
• 16. Deployed model (SAM, diagram): Measures, Dashboards, Actions, Models, Scores
• 17. (Pipeline diagram) Analytic Infrastructure: operational data, data warehouse, ETL, cleaning data & exploratory data analysis, analytic datamart. Analytic Modeling: build features, candidate model, validated model. Analytic Operations: deployment, scores, actions, measures.
• 18. Life Cycle of a Predictive Model. Analytic modeling: Exploratory Data Analysis (R); process the data (Hadoop); build the model in a dev/modeling environment; export as PMML. Operations: deploy the model in operational systems with a scoring application (Hadoop, streams & Augustus) over log files; refine the model; retire the model and deploy an improved model.
• 19. Key Abstractions

Abstraction         | Credit Card Fraud                            | Online Advertising                                            | Change Detection
Events              | Credit card transactions                     | Impressions and clicks                                        | Status updates
Entities            | Credit card accounts                         | Cookies, user IDs                                             | Computers, devices, etc.
Features (examples) | # transactions past n minutes                | # impressions per category past n minutes, RFM related, etc.  | # events in a category in the previous time period
Models              | CART                                         | Clusters, trees, recommendation, etc.                         | Baseline models
Scores              | Indicates likelihood of a fraudulent account | Likelihood of clicking                                        | Indicates likelihood that a change has taken place
• 20. Event Based Updating of Feature Vectors. Event loop: 1. take next event 2. update one or more associated FVs 3. rescore updated FVs to compute alerts. (Diagram: events → FVs in R^n → scores.)
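The three-step event loop above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the tutorial's code: the event format, the single count/total feature vector, and the threshold-based scoring rule are all assumptions made for the example.

```python
# Sketch of the slide's event loop: each event updates the feature vector
# (FV) for its entity, and the updated FV is immediately rescored to
# decide whether to raise an alert.

def score(fv, threshold=3):
    """Toy scoring rule: alert once an entity has seen enough events."""
    return 1 if fv["count"] >= threshold else 0

def run_event_loop(events, threshold=3):
    fvs = {}       # entity id -> feature vector
    alerts = []
    for entity_id, value in events:           # 1. take next event
        fv = fvs.setdefault(entity_id, {"count": 0, "total": 0.0})
        fv["count"] += 1                      # 2. update the associated FV
        fv["total"] += value
        if score(fv, threshold):              # 3. rescore the updated FV
            alerts.append(entity_id)
    return fvs, alerts

events = [("a", 1.0), ("b", 2.0), ("a", 0.5), ("a", 3.0)]
fvs, alerts = run_event_loop(events)
```

In a production system the FVs would live in a key-value store and the scoring rule would be a deployed model; the loop structure stays the same.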
• 21. Example Architecture • Data lives in the Hadoop Distributed File System (HDFS) or other distributed file systems. • Hadoop streams and MapReduce are used to process the data. • R is used for Exploratory Data Analysis (EDA) and models are exported as PMML. • PMML models are imported, and Hadoop, Python streams and Augustus are used to score data in production.
• 22. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
• 23. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 2: Building Models – Data Exploration and Feature Generation. Robert Grossman, Open Data Group & Univ. of Chicago. Collin Bennett, Open Data Group. October 23, 2012
• 24. Key Modeling Activities § Exploratory data analysis – goal is to understand the data so that features can be generated and the model evaluated § Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable § Building models – goal is to derive statistically valid models from a learning set § Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model
• 25. ? Naïve Bayes; ? K nearest neighbor; 1957 K-means; 1977 EM Algorithm; 1984 CART; 1993 C4.5; 1994 A Priori; 1995 SVM; 1997 AdaBoost; 1998 PageRank. Beware of any vendor whose unique value proposition includes some "secret analytic sauce." Source: X. Wu, et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
• 26. Pessimistic View: We Get a Significantly New Algorithm Every Decade or So. 1970's: neural networks. 1980's: classification and regression trees. 1990's: support vector machines. 2000's: graph algorithms.
• 27. In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
• 28. Questions About Datasets • Number of records • Number of data fields • Size – Does it fit into memory? – Does it fit on a single disk or disk array? • How many missing values? • How many duplicates? • Are there labels (for the dependent variable)? • Are there keys?
• 29. Questions About Data Fields • Data fields – What is the mean, variance, …? – What is the distribution? Plot the histograms. – What are extreme values, missing values, unusual values? • Pairs of data fields – What is the correlation? Plot the scatter plots. • Visualize the data – Histograms, scatter plots, etc.
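The per-field checks listed above are easy to prototype before reaching for plotting tools. A small stdlib-only sketch (the sample values are made up for illustration):

```python
# Compute the basic per-field summaries the slide asks about (mean,
# variance, extremes, missing values) and the Pearson correlation of a
# pair of fields.
import statistics

def field_summary(values):
    present = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),
        "min": min(present),
        "max": max(present),
    }

def correlation(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

summary = field_summary([1.0, 2.0, None, 3.0])
r = correlation([1, 2, 3], [2, 4, 6])
```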
• 32. Computing Count Distributions

def valueDistrib(b, field, select=None):
    # given a dataframe b and an attribute field,
    # returns the class distribution of the field
    ccdict = {}
    if select:
        for i in select:
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    else:
        for i in range(len(b)):
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    return ccdict
• 33. Features • Normalize – [0, 1] or [-1, 1] • Aggregate • Discretize or bin • Transform continuous to continuous or discrete to discrete • Compute ratios • Use percentiles • Use indicator variables
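Three of the transformations listed above can be sketched directly; the bin count, ranges, and values below are illustrative choices, not anything prescribed by the tutorial.

```python
# Hedged examples of common feature transformations: min-max
# normalization to [0, 1], equal-width binning, and a ratio feature.

def normalize(values):
    """Scale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def bin_index(value, lo, hi, n_bins):
    """Equal-width binning; values at the upper edge go in the last bin."""
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

def ratio(numer, denom):
    """Ratio feature with a guard against a zero denominator."""
    return numer / denom if denom else 0.0

normed = normalize([0.0, 5.0, 10.0])
b = bin_index(7.5, 0.0, 10.0, 4)
r = ratio(3, 12)
```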
• 34. Predicting Office Building Renewal Using Text-based Features

Classification  Avg R    Classification  Avg R
imports         5.17     technology      2.03
clothing        4.17     apparel         2.03
van             3.18     law             2.00
fashion         2.50     casualty        1.91
investment      2.42     bank            1.88
marketing       2.38     technologies    1.84
oil             2.11     trading         1.82
air             2.09     associates      1.81
system          2.06     staffing        1.77
foundation      2.04     securities      1.77
• 35. Twitter Intention & Prediction • Use machine learning classification techniques: – Support Vector Machines – Trees – Naïve Bayes • Models require clean data and lots of hand-scored examples
• 36. What is the Size of Your Data? • Small – Fits into memory • Medium – Too large for memory – But fits into a database • Large – Too large for a database – But you can use a NoSQL database (HBase, etc.) – Or a DFS (Hadoop DFS)
• 37. If You Can Ask For Something • Ask for more data • Ask for "orthogonal data"
• 38. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
• 39. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 3: Case Study – MalStone. Robert Grossman, Open Data Group & Univ. of Chicago. Collin Bennett, Open Data Group. October 23, 2012
• 40. Part 1. Statistics over Log Files • Log files are everywhere • Advertising systems • Network and system logs for intrusion detection • Health and status monitoring
• 41. What are the Common Elements? • Time stamps • Sites – e.g. web sites, computers, network devices • Entities – e.g. visitors, users, flows • Log files fill disks, many, many disks • Behavior occurs at all scales • Want to identify phenomena at all scales • Need to group "similar behavior" • Need to do statistics (not just sorting)
• 42. Abstract the Problem Using Site-Entity Logs

Example                      | Sites                 | Entities
Measuring online advertising | Web sites             | Consumers
Drive-by exploits            | Web sites             | Computers (identified by cookies or IP)
Compromised systems          | Compromised computers | User accounts
• 43. MalStone Schema • Event ID • Time stamp • Site ID • Entity ID • Mark (categorical variable) • Fits into 100 bytes
• 44. Toy Example. Events are collected by device or processor in time order. Map/shuffle: map events by site. Reduce: for each site, compute counts and ratios of events by type.
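The toy example above can be mimicked in memory without a cluster. The sketch below is an illustration of the map/shuffle/reduce pattern only; the `(site, entity, mark)` event tuples and the mark-ratio statistic are simplified stand-ins for the full MalStone schema.

```python
# In-memory sketch of the slide's pipeline: the mapper keys each event
# by site, the shuffle groups values by site, and the reducer computes
# per-site counts and the ratio of "marked" events.
from collections import defaultdict

def map_events(events):
    for site, entity, mark in events:
        yield site, (entity, mark)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_site(values):
    total = len(values)
    marked = sum(1 for _, mark in values if mark)
    return {"events": total, "mark_ratio": marked / total}

events = [("s1", "e1", True), ("s1", "e2", False),
          ("s2", "e3", False), ("s1", "e4", True)]
results = {site: reduce_site(vals)
           for site, vals in shuffle(map_events(events)).items()}
```

On a real cluster the shuffle is done by the framework; the mapper and reducer bodies carry over almost unchanged.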
• 45. Distributions • Tens of millions of sites • Hundreds of millions of entities • Billions of events • Most sites have a small number of events • Some sites have many events • Most entities visit a few sites • Some visitors visit many sites
• 46. MalStone B (diagram: sites and entities over the time periods dk-2, dk-1, dk)
• 47. The Mark Model • Some sites are marked (the percent of marks is a parameter and the type of sites marked is a draw from a distribution) • Some entities become marked after visiting a marked site (this is a draw from a distribution) • There is a delay between the visit and when the entity becomes marked (this is a draw from a distribution) • There is a background process that marks some entities independent of visits (this adds noise to the problem)
• 48. (Diagram: Exposure Window and Monitor Window over the time periods dk-2, dk-1, dk)
• 49. Notation • Fix a site s[j] • Let A[j] be the entities that transact during ExpWin and, if an entity is marked, the visit occurs before the mark • Let B[j] be all entities in A[j] that become marked sometime during the MonWin • The subsequent proportion of marks is r[j] = | B[j] | / | A[j] |
• 50. (Diagram: ExpWin, MonWin 1, MonWin 2 over the time periods dk-2, dk-1, dk.) B[j, t] are the entities that become marked during MonWin[t]; r[j, t] = | B[j, t] | / | A[j] |
• 51. Part 2. MalStone Benchmarks. The MalGen and MalStone implementations are open source: code.google.com/p/malgen/
• 52. MalStone Benchmark • Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing. • Code to generate the required synthetic data is available from code.google.com/p/malgen • Stylized analytic computation that is easy to implement in MapReduce and its generalizations.
• 53. MalStone A & B

Benchmark       | Statistic | # records   | Size
MalStone A-10   | r[j]      | 10 billion  | 1 TB
MalStone A-100  | r[j]      | 100 billion | 10 TB
MalStone A-1000 | r[j]      | 1 trillion  | 100 TB
MalStone B-10   | r[j, t]   | 10 billion  | 1 TB
MalStone B-100  | r[j, t]   | 100 billion | 10 TB
MalStone B-1000 | r[j, t]   | 1 trillion  | 100 TB
• 54. • MalStone B running on 10 billion 100-byte records • Hadoop version 0.18.3 • 20 nodes in the Open Cloud Testbed • MapReduce required 799 minutes • Hadoop streams required 142 minutes
• 55. Importance of Hadoop Streams

MalStone B               |
Hadoop v0.18.3           | 799 min
Hadoop Streaming v0.18.3 | 142 min
Sector/Sphere            | 44 min
# Nodes                  | 20 nodes
# Records                | 10 billion
Size of Dataset          | 1 TB
• 56. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
• 57. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 4: Deploying Analytics. Robert Grossman, Open Data Group & Univ. of Chicago. Collin Bennett, Open Data Group. October 23, 2012
• 58. Deploying analytic models. Analytic Diamond (diagram): analytic algorithms & models; analytic operations, security & compliance; analytic infrastructure.
• 59. Problems with Current Techniques • Analytics are built in the modeling group and deployed by IT. • The time required to deploy models and to integrate models with other applications can be long. • Models are deployed in proprietary formats • Models are application dependent • Models are system dependent • Models are architecture dependent
• 60. Model Independence • A model should be independent of the environment: o Production / Staging / Development o MapReduce / Elastic MapReduce / Standalone o Developer's desktop / Analyst's laptop o Production Workflow / Historical Data Store • A model should be independent of coding language.
• 61. (Very Simplified) Architectural View: Data → Model Producer (CART, SVM, k-means, etc.) → PMML Model • The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org). • With PMML, it is easy to move models between applications and platforms.
• 62. Predictive Model Markup Language (PMML) • Based on XML • Benefits of PMML – Open standard for Data Mining & Statistical Models – Provides independence from application, platform, and operating system • PMML Producers – Some applications create predictive models • PMML Consumers – Other applications read or consume models
• 63. PMML • PMML is the leading standard for statistical and data mining models and is supported by over 20 vendors and organizations. • With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application • Applications exist in R, Python, Java, SAS, SPSS …
• 64. Vendor Support • SPSS • SAS • IBM • Microstrategy • Salford • Tibco • Zementis • & other vendors • R • Augustus • KNIME • & other open source applications
• 65. Philosophy • Very important to understand what PMML is not concerned with … • PMML is a specification of a model, not an implementation of a model • PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations • Also, PMML includes the metadata required to deploy models
• 66. PMML Document Components • Data dictionary • Mining schema • Transformation dictionary • Multiple models, including segments and ensembles. • Model verification, … • Univariate statistics (ModelStats) • Optional extensions
• 67. PMML Models • trees • associations • neural nets • naïve Bayes • sequences • text models • support vector machines • ruleset • polynomial regression • logistic regression • general regression • center based clusters • density based clusters
• 68. PMML Data Flow • The Data Dictionary defines the data • The Mining Schema defines the specific inputs (MiningFields) required for the model • The Transformation Dictionary defines optional additional derived fields • Attributes of feature vectors are of two types – attributes defined by the mining schema – derived attributes defined via transformations • Models themselves can also support certain transformations
• 69. (Simplified) Architectural View: Data → Data Pre-processing (features) → Model Producer (algorithms to estimate models) → PMML Model • PMML also supports XML elements to describe data preprocessing.
• 70. Three Important Interfaces. Modeling environment: (1) Data → Data Pre-processing → Model Producer → (2) PMML Model. Deployment environment: PMML Model → (3) Model Consumer → Post Processing; data in, scores and actions out.
• 71. Single Model PMML. Order of evaluation: DD → MS → TD → LT → Model. (Diagram: the PMML document (global) contains the Data Dictionary (DD); within the model, the Mining Schema (MS) maps data fields 1-to-1 to MiningFields, the Transformation Dictionary (TD) and Local Transformations (LT) define DerivedFields (DF), and Targets, Outputs (OF), and Model Verification (MV) fields feed the model execution for PMML supported models such as trees, regression, Naïve Bayes, baselines, etc.)
• 72. Augustus PMML Producer. (Diagram: various data formats, including csv files, http streams, & databases → Model Producer, driven by a Producer Configuration File and a Producer Schema (part of PMML) → PMML Model.) Augustus (augustus.googlecode.com)
• 73. Augustus PMML Consumer. (Diagram: various data formats, including csv files, http streams, & databases → Model Consumer, driven by a Consumer Configuration File and the PMML Model → scores → Post Processing → actions, alerts, reports, etc.) Augustus (augustus.googlecode.com)
• 74. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 5: Multiple Models. Robert Grossman, Open Data Group & Univ. of Chicago. Collin Bennett, Open Data Group. October 23, 2012
  • 75. 1.  Trees   For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, 1984, Chapman & Hall.
• 76. Classification Trees

Petal Len.  Petal Width  Sepal Len.  Sepal Width  Species
02          14           33          50           A
24          56           31          67           C
23          51           31          69           C
13          45           28          57           B

• Want a function Y = g(X), which predicts the red variable Y using one or more of the blue variables X[1], …, X[4]
• Assume each row is classified A, B, or C
• 77. Trees Partition Feature Space • Trees partition the feature space into regions by asking whether an attribute is less than a threshold. (Diagram: the tree "Petal Length < 2.45?" → A, else "Petal Width < 1.75?" → C, else "Petal Len < 4.95?" → B or C, alongside the corresponding rectangular regions in the Petal Length × Petal Width plane with thresholds 2.45, 4.95, and 1.75.)
• 78. 2. Multiple Models: Ensembles
• 79. Key Idea: Combine Weak Learners • Average several weak models, rather than build one complex model. (Diagram: Model 1, Model 2, Model 3.)
• 80. Combining Weak Learners

1 Classifier  3 Classifiers  5 Classifiers
55%           57.40%         59.30%
60%           64.00%         68.20%
65%           71.00%         76.50%

(Rows of Pascal's triangle: 1; 1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; 1 5 10 10 5 1)
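The table follows from the binomial distribution, which is why the rows of Pascal's triangle appear on the slide: with n independent classifiers each correct with probability p, a majority vote is correct with probability the sum over k > n/2 of C(n, k) p^k (1-p)^(n-k). A short sketch that approximately reproduces the table entries (the slide's figures are rounded, so the match is close rather than exact):

```python
# Majority-vote accuracy of n independent base classifiers, each correct
# with probability p (odd n assumed so there are no ties).
from math import comb

def majority_vote_accuracy(p, n):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

acc3 = majority_vote_accuracy(0.55, 3)   # ~0.574, the 57.40% entry
acc5 = majority_vote_accuracy(0.55, 5)   # ~0.593, the 59.30% entry
```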
• 81. Ensembles • Ensembles are collections of classifiers with a rule for combining the results. • The simplest combining rules: – Use the average for regression trees. – Use majority voting for classification. • Ensembles are unreasonably effective at building models.
  • 82. Building ensembles over clusters of commodity workstations has been used since the 1990’s.
• 83. Ensemble Models 1. Partition data and scatter 2. Build models (e.g. tree-based models) 3. Gather models 4. Form the collection of models into an ensemble (e.g. majority vote for classification & model averaging for regression)
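The four steps above can be sketched in a single process. This is only an illustration of the pattern: the "model" built per partition is a deliberately trivial constant predictor (the partition mean), a stand-in for the tree-based models the slide has in mind, and the combining rule shown is model averaging for regression.

```python
# Partition -> build one model per partition -> gather -> average.
import statistics

def partition(data, n_parts):
    """Step 1: round-robin split of the data into n_parts pieces."""
    return [data[i::n_parts] for i in range(n_parts)]

def build_model(part):
    """Step 2: a toy model -- predict the partition mean for any input."""
    mean = statistics.mean(part)
    return lambda x: mean

def ensemble_predict(models, x):
    """Step 4: model averaging over the gathered models."""
    return statistics.mean(m(x) for m in models)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
models = [build_model(p) for p in partition(data, 3)]   # steps 1-3
pred = ensemble_predict(models, x=None)
```

On a cluster, step 2 runs in parallel on the scattered partitions and step 3 collects the (small) fitted models rather than the data.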
• 84. Example: Fraud • Ensembles of trees have proved remarkably effective in detecting fraud • Trees can be very large • Very fast to score • Lots of variants – Random forests – Random trees • Sometimes need to reverse engineer reason codes.
• 85. 3. CUSUM. f0(x) = baseline density, f1(x) = observed density, g(x) = log(f1(x)/f0(x)). Z0 = 0; Zn+1 = max{Zn + g(x), 0}; alert if Zn+1 > threshold. § Assume that we have a mean and standard deviation for both distributions.
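The recursion above is easy to implement once f0 and f1 are fixed. The sketch below takes both to be Gaussian with a common standard deviation (consistent with the slide's assumption of a mean and standard deviation for each distribution); the particular means, sigma, and threshold are illustrative choices.

```python
# CUSUM with Gaussian baseline f0 and observed f1:
#   g(x) = log(f1(x) / f0(x)),  Z0 = 0,  Zn+1 = max{Zn + g(x), 0},
#   alert when Zn+1 > threshold.
from math import log, exp, pi, sqrt

def gaussian_pdf(x, mean, sd):
    return exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def cusum(xs, mean0, mean1, sd, threshold):
    """Return (final Z, index of the first alert or None)."""
    z = 0.0                                   # Z0 = 0
    for i, x in enumerate(xs):
        g = log(gaussian_pdf(x, mean1, sd) / gaussian_pdf(x, mean0, sd))
        z = max(z + g, 0.0)                   # Zn+1 = max{Zn + g(x), 0}
        if z > threshold:
            return z, i
    return z, None

# Data sitting at the baseline mean never alerts; shifted data does.
z_ok, alert_ok = cusum([0.0] * 10, mean0=0.0, mean1=3.0, sd=1.0,
                       threshold=5.0)
z_bad, alert_bad = cusum([3.0] * 10, mean0=0.0, mean1=3.0, sd=1.0,
                         threshold=5.0)
```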
  • 86. 4.  Cubes  of  Models  
• 87. Example 1: Cubes of Models for Change Detection for a Payments System. Build a separate segmented model for every entity of interest. Divide & conquer the data (segment) using multidimensional data cubes (geospatial region × type of transaction × entity (bank, etc.)). For each distinct cube, establish separate baselines for each quantity of interest. Detect changes from the baselines.
• 88. Example 2 – Cubes of Models for Detecting Anomalies in Traffic • Highway Traffic Data – each day (7) x each hour (24) x each sensor (hundreds) x each weather condition (5) x each special event (dozens) – 50,000 baseline models used in the current testbed
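The "cube of models" idea reduces, mechanically, to a dictionary keyed by the cell coordinates, with one small baseline per cell updated online. A sketch under assumptions of my own (the cell key layout and the Welford-style running count/mean/M2 baseline are illustrative, not the tutorial's implementation):

```python
# One baseline per distinct (day, hour, sensor, weather) cell, updated
# online with Welford's running count/mean/M2 algorithm.

def update_cell(cube, key, x):
    count, mean, m2 = cube.get(key, (0, 0.0, 0.0))
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    cube[key] = (count, mean, m2)

cube = {}
for reading in [100.0, 110.0, 120.0]:
    update_cell(cube, ("Mon", 8, "sensor42", "clear"), reading)
for reading in [10.0, 20.0]:
    update_cell(cube, ("Sun", 3, "sensor42", "rain"), reading)

count, mean, m2 = cube[("Mon", 8, "sensor42", "clear")]
variance = m2 / (count - 1)
```

With 50,000 cells, each baseline is just a few numbers, so the whole cube fits comfortably in memory on one node.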
• 89. 5. Multiple Models in Hadoop • Trees in the Mapper • Build lots of small trees (over historical data in Hadoop) • Load all of them into the mapper • Mapper – For each event, score against each tree – Map emits key = event id, value = (tree id, score) • Reducer – Aggregate the scores for each event – Select a "winner" for each event
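The mapper/reducer pair above can be mimicked in memory. The threshold "trees" below are toy stand-ins for real tree models loaded into the mapper, and the averaging aggregation is one illustrative choice of "winner" rule:

```python
# Mapper scores every event against every tree and emits
# (event id, (tree id, score)); the reducer aggregates per event.
from collections import defaultdict

trees = {"t1": lambda e: 1.0 if e > 5 else 0.0,
         "t2": lambda e: 1.0 if e > 2 else 0.0,
         "t3": lambda e: 0.0}

def mapper(events):
    for event_id, value in events.items():
        for tree_id, tree in trees.items():
            yield event_id, (tree_id, tree(value))

def reducer(pairs):
    scores = defaultdict(list)
    for event_id, (tree_id, score) in pairs:
        scores[event_id].append(score)
    # "winner" here = the average score, an illustrative rule
    return {eid: sum(s) / len(s) for eid, s in scores.items()}

results = reducer(mapper({"e1": 7, "e2": 1}))
```

In Hadoop streaming the yield becomes a tab-separated line on stdout and the framework performs the grouping; the logic is otherwise the same.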
• 90. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
• 91. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 6: Case Study – CTA. Robert Grossman, Open Data Group & Univ. of Chicago. Collin Bennett, Open Data Group. October 23, 2012
• 92. Chicago Transit Authority • The Chicago Transit Authority provides service to Chicago and surrounding suburbs • The CTA operates 24 hours each day and on an average weekday provides 1.7 million rides on buses and trains • It has 1,800 buses, operating over 140 routes, along 2,230 route miles • Buses provide approximately 1 million rides a day and serve more than 12,000 posted bus stops
• 93. CTA Data • The City of Chicago has a public data portal • https://data.cityofchicago.org/ • Some of the available categories are: • Administration & Finance • Community & Economic Development • Education • Facilities & Geographic Boundaries • Transportation
• 94. Moving to Hadoop • As an example, we model ridership by bus route. Data set: CTA - Ridership - Bus Routes - Daily Totals by Route • MapReduce – Map input: • Route number, day type, number of riders – Map output / reduce input: • Key = route, value = (day, date, riders) – Reduce output: • A model for each route
• 95. Example Data Set • Data goes back to 2001 – Route number – Date – Day of week type – Number of riders • Demo size: – 11 MB – 534,208 records • Saved as a csv file in HDFS
• 99. Mapper • Using Hadoop's Streaming Interface, we can use a language appropriate for each step • The mapper needs to read the records quickly • We use Python
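A mapper of the kind the slide describes might look like the sketch below: read CSV records from stdin and emit tab-separated key/value pairs keyed on (route, day type). The exact column layout `(route, date, daytype, rides)` is an assumption about the CTA file, not taken from the tutorial's code.

```python
# Hadoop-streaming-style mapper: one CSV record in, one
# "key<TAB>value" line out, keyed on (route, day type).
import sys

def map_line(line):
    route, date, daytype, rides = line.strip().split(",")
    key = "%s|%s" % (route, daytype)
    return "%s\t%s,%s" % (key, date, rides)

def main(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        if line.strip():
            out.write(map_line(line) + "\n")

out_line = map_line("3,01/01/2001,U,7354")
```

Hadoop streaming shuffles on everything before the first tab, so all records for one (route, day type) pair arrive at the same reducer.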
• 100. Reducer • For this example, we calculate a simple statistic: – the Gaussian distribution of riders for (route, day of week) pairs, described by the mean and variance • We use the Baseline Model in PMML to describe this. • The Baseline model is often useful for Change Detection. • Do the records matching a segment fit the expected distribution describing the segment?
• 101. Describing a Segment • Baseline Model

<BaselineModel functionName="regression">
  <MiningSchema>
    <MiningField usageType="active" name="rides"/>
  </MiningSchema>
  <TestDistributions field="rides" testStatistic="zValue">
    <Baseline>
      <GaussianDistribution variance="64308.1625973"
                            mean="1083.88950936"/>
    </Baseline>
  </TestDistributions>
</BaselineModel>
• 102. Building PMML in the Reducer • You can embed a PMML engine into the reducer if you are using – Java – R – Python • We present a light-weight solution which builds valid models, but also allows you to construct invalid ones
  • 103. Using  a  Valid  Model  Template  
• 104. Template in R • Similar, but a little less elegant, in R
• 105. Populating the Template (Python) • This is done in an XML / XPath way, rather than as PMML
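A sketch of what populating a template "in an XML / XPath way" can look like with the standard library. The template below is a trimmed, illustrative fragment modeled on the BaselineModel shown earlier, not a complete valid PMML document, and `populate` is a hypothetical helper, not the tutorial's code.

```python
# Parse a PMML-like template with placeholder attributes and overwrite
# the mean and variance computed in the reducer, using an XPath-style
# lookup via xml.etree.ElementTree.
import xml.etree.ElementTree as ET

TEMPLATE = """<BaselineModel functionName="regression">
  <TestDistributions field="rides" testStatistic="zValue">
    <Baseline>
      <GaussianDistribution mean="0" variance="1"/>
    </Baseline>
  </TestDistributions>
</BaselineModel>"""

def populate(template, mean, variance):
    root = ET.fromstring(template)
    dist = root.find(".//GaussianDistribution")   # XPath-style lookup
    dist.set("mean", str(mean))
    dist.set("variance", str(variance))
    return root

model = populate(TEMPLATE, 1083.88950936, 64308.1625973)
mean_attr = model.find(".//GaussianDistribution").get("mean")
var_attr = model.find(".//GaussianDistribution").get("variance")
```

Starting from a known-valid template and only setting attribute values is what keeps the light-weight approach from producing structurally broken documents, although nothing here validates the result against the PMML schema.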
• 107. Reducer Trade-offs • The standard trade-offs apply – Holding events in the reducer (easier, RAM) vs. – Computing running statistics (more coding, CPU) • We present two techniques using R – Holding all events for a given key in a dataframe and then calculating the statistic when a key transition is identified – Holding the minimum data needed to calculate running statistics, computed as each event is seen
• 108. Two Approaches • Holding events requires extra memory and may get slow as the number of events shuffled to the reducer for a given key grows large – But there are a lot of statistical and data mining methods available in R that operate on a dataframe • Tracking running statistics requires more coding – But may be the only way to process a batch without spilling to disk
• 109. Dataframe • Accumulate each mapped record • For each event, add it to a temporary table

doAny <- function(v, date, rides) {
    # add a row to the temporary table
    v[["rows"]][[length(v[["rows"]]) + 1]] <- list(date, rides)
}

• Create the dataframe when we have them all
• 110. Dataframe continued • When you have seen all of the events, process them and compute the statistics

doLast <- function(v) {
    # make a data frame out of the temporary table
    dataframe <- do.call("rbind", v[["rows"]])
    colnames(dataframe) <- c("date", "rides")

    # pull a column out and call the function on it
    rides <- unlist(dataframe[, "rides"])
    result <- ewup2(rides, alpha)
    count <- result[["count"]][[length(rides)]]
    mean <- result[["mean"]][[length(rides)]]
    variance <- result[["variance"]][[length(rides)]]
    varn <- result[["varn"]][[length(rides)]]
}
• 111. Running Totals • Update the statistics as each event is encountered

doNonFirst <- function(v, x) {
    # update the counters
    v[["count"]] <- v[["count"]] + 1
    diff <- v[["prevx"]] - v[["mean"]]
    incr <- alpha * diff
    v[["mean"]] <- v[["mean"]] + incr
    v[["varn"]] <- (1.0 - alpha) * (v[["varn"]] + incr * diff)
    v[["prevx"]] <- x
}
• 112. Running Totals continued • Finalize the statistics after the last event

doLast <- function(v) {
    if (v[["count"]] >= 2) {
        # when N >= 2, correct by 1/(N - 1)
        variance <- v[["varn"]] / (v[["count"]] - 1.0)
    } else {
        variance <- v[["varn"]]
    }
}
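For comparison, the same exponentially weighted running update fits in a few lines of Python. This sketch applies the alpha-weighted increment to the current event directly (ignoring the one-event lag the R version introduces via prevx), so it is an assumption-laden variant rather than a line-for-line port:

```python
# Exponentially weighted running mean and unnormalized variance:
# each event nudges the mean by alpha times its deviation, and varn
# accumulates the corresponding squared-deviation term.

def update(state, x, alpha):
    count, mean, varn = state
    if count == 0:
        return 1, x, 0.0          # first event initializes the mean
    diff = x - mean
    incr = alpha * diff
    return count + 1, mean + incr, (1.0 - alpha) * (varn + incr * diff)

state = (0, 0.0, 0.0)
for x in [4.0, 8.0]:
    state = update(state, x, alpha=0.5)
count, mean, varn = state
```

As in the R code, varn would be divided by count - 1 at the end of the key group to get a variance estimate.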
• 113. Results • As new data are seen, models can be updated • New routes are added as new segments • The model is not bound to Hadoop (diagram: a collection of per-segment models)
• 114. EDA • Write plots to HDFS as SVG files (text, xml) • A graphical view of the reduce calculation or statistic

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg style="stroke-linejoin:miter; stroke:black; stroke-width:2.5; text-anchor:middle; fill:none"
     xmlns="http://www.w3.org/2000/svg"
     font-family="Helvetica, Arial, FreeSans, Sans, sans, sans-serif"
     width="2000px" height="500px" version="1.1"
     xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 2000 500">
<defs>
  <clipPath id="Legend_0_clip">
    <rect x="1601.2" y="25" width="298.8" height="215" />
  </clipPath>
  <clipPath id="TimeSeries_0_clip">
    <rect x="240" y="25" width="1660" height="435" />
  </clipPath>
  <clipPath id="TimeSeries_10_clip">
    <rect x="240" y="25" width="1660" height="435" />
  </clipPath>
  <clipPath id="TimeSeries_11_clip">
    <rect x="240" y="25" width="1660" height="435" />
  </clipPath>
…
  • 115. EDA  
• 116. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
• 117. About Robert Grossman. Robert Grossman is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is the Director of Informatics at the Institute for Genomics and Systems Biology (IGSB) and a faculty member in the Computation Institute (CI) at the University of Chicago. His areas of research include: big data, predictive analytics, bioinformatics, data intensive computing and analytic infrastructure. He has led the development of new open source software tools for analyzing big data, cloud computing, data mining, distributed computing and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc. in 1996, which provides data mining solutions to the insurance industry. Grossman was Magnify's CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs about big data, data science, and data engineering at rgrossman.com.
• 118. About Robert Grossman (cont'd) • You can find more information about big data and related areas on my blog: rgrossman.com. • Some of my technical papers are available online at rgrossman.com • I just finished a book for the non-technical reader called The Structure of Digital Computing: From Mainframes to Big Data, which is available from Amazon.
• 119. About Collin Bennett. Collin Bennett is a principal at Open Data Group. In three and a half years with the company, Collin has worked on the open source Augustus scoring engine and a cloud-based environment for rapid analytic prototyping called RAP. Additionally, he has released open source projects for the Open Cloud Consortium. One of these, MalGen, has been used to benchmark several parallel computation frameworks. Previously, he led software development for the Product Development Team at Acquity Group, an IT consulting firm headquartered in Chicago. He also worked at the startups Orbitz (when it still was one) and Business Logic Corporation. He has co-authored papers on Weyl tensors, large data clouds, and high performance wide area cloud testbeds. He holds degrees in English, Mathematics and Computer Science.
• 120. About Open Data • Open Data began operations in 2001 and has built predictive models for companies for over ten years • Open Data provides management consulting, outsourced analytic services, & analytic staffing • For more information • www.opendatagroup.com • info@opendatagroup.com
• 121. About Open Data Group (cont'd). Open Data Group has been building predictive models over big data for its clients for over ten years. Open Data Group is one of the pioneers using technologies such as Hadoop and NoSQL databases so that companies can build predictive models efficiently over all of their data. Open Data has also participated in the development of the Predictive Model Markup Language (PMML) so that companies can more quickly deploy analytic models into their operational systems. For the past several years, Open Data has also helped companies build predictive models and score data in real time using elastic clouds, such as provided with Amazon's AWS. Open Data's consulting philosophy is nicely summed up by Marvin Bower, one of McKinsey & Company's founders. Bower led his firm using three rules: (1) put the interests of the client ahead of revenues; (2) tell the truth and don't be afraid to challenge a client's opinion; and (3) only agree to perform work that is necessary and something [you] can do well. www.opendatagroup.com