How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments

Robert L. Grossman
University of Chicago and Open Data Group
O’Reilly Strata Conference
March 30, 2016
rgrossman.com
@bobgrossman
  
Introduction to AnalyticOps
  	
  
DevOps
[Diagram labels: Software Development; Quality Assurance; Operations; DevOps]

The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*

*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
  
AnalyticOps
[Diagram labels: Analytic Modeling; Quality Assurance; Analytic Operations; AnalyticOps]

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
  
  
• Software
• Model
• Data
  
[Diagram labels: Analytic strategy and planning; Analytic models & algorithms; Analytic operations; Analytic Infrastructure]

*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O’Reilly, 2016, to appear.
  
A Problem

There are platforms and tools for managing and processing big data (Hadoop), for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.
  
[Diagram labels: Data scientists developing analytic models & algorithms; Analytic infrastructure; Enterprise IT deploying analytics into products, services and operations; Deploying analytics]
  
More Problems

[Same diagram as the previous slide, with two more labels added]
• Deploying analytics
• Monitoring operational analytics
• ETL and datamarts for the modelers
  
Case Study 1: Scoring Engines for Critical Systems
  
Life Cycle of Predictive Model

Analytic modeling
• Select analytic problem & approach
• Get and clean the data
• Exploratory Data Analysis
• Build model in dev/modeling environment

Analytic operations
• Deploy model in operational systems with scoring application
• Scale up deployment
• Monitor performance and employ champion-challenger methodology to develop improved model
• Retire model and deploy improved model

Links between the two: Deploy model; Perf. data
  
[Same life cycle diagram, with two more labels added: ModelDev and AnalyticOps]
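The champion-challenger step in the life cycle above can be sketched as follows. This is a minimal illustration, not the method used in the talk: the accuracy metric, the `min_gain` promotion threshold, the two threshold models, and the performance records are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], int]          # maps a feature value to a class label
Record = Tuple[float, int]              # (feature value, observed label)


def accuracy(model: Model, data: Sequence[Record]) -> float:
    """Fraction of labeled performance records the model classifies correctly."""
    if not data:
        raise ValueError("performance data is empty")
    return sum(model(x) == y for x, y in data) / len(data)


def select_champion(champion: Model, challenger: Model,
                    data: Sequence[Record], min_gain: float = 0.01) -> Model:
    """Keep the champion unless the challenger beats it by at least min_gain."""
    gain = accuracy(challenger, data) - accuracy(champion, data)
    return challenger if gain >= min_gain else champion


# Hypothetical models: the deployed champion and a retrained challenger.
def champion(x: float) -> int:
    return int(x > 0.7)


def challenger(x: float) -> int:
    return int(x > 0.5)


# Hypothetical performance data collected from the operational system.
perf_data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.65, 1), (0.8, 1), (0.9, 1)]

winner = select_champion(champion, challenger, perf_data)
print("promote challenger" if winner is challenger else "keep champion")
```

In practice the comparison metric and the promotion rule depend on the business problem; accuracy is used here only to keep the sketch short.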
  
Differences Between the Modeling and Deployment Environments

• Typically modelers use specialized languages such as SAS, SPSS or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
  
Ways to Deploy Models into Products/Services/Operations

• Export and import tables of scores
• Export and import tables of parameters
• Have the product/service interact with the model as a web or message service.
• Import the models into a database
• Embed the model into a product or service.
• Push code.

How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
  
What is a Scoring Engine?

• A scoring engine is a component that is integrated into products or enterprise IT that deploys analytic models in operational workflows for products and services.
• A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application.
• Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
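The "integrated once, updated by reading a file" idea can be sketched as below. The JSON layout (`model`, `intercept`, `weights`) is a made-up in-house format used only for illustration; it is not PMML or PFA, and the `ScoringEngine` class is hypothetical.

```python
import json
import os
import tempfile


class ScoringEngine:
    """Toy scoring engine: integrated once, re-pointed at new model files."""

    def __init__(self) -> None:
        self._intercept = 0.0
        self._weights = {}

    def load_model(self, path: str) -> None:
        """Read a model file and swap it in only after it parses cleanly."""
        with open(path) as f:
            spec = json.load(f)
        if spec.get("model") != "linear":
            raise ValueError("unsupported model type: %r" % spec.get("model"))
        intercept = float(spec["intercept"])
        weights = {name: float(w) for name, w in spec["weights"].items()}
        self._intercept, self._weights = intercept, weights

    def score(self, record: dict) -> float:
        """Linear score: intercept plus the weighted sum of named inputs."""
        return self._intercept + sum(
            w * record[name] for name, w in self._weights.items())


# The "model producer" side: export two versions of a hypothetical model.
workdir = tempfile.mkdtemp()
v1_path = os.path.join(workdir, "model_v1.json")
v2_path = os.path.join(workdir, "model_v2.json")
with open(v1_path, "w") as f:
    json.dump({"model": "linear", "intercept": 1.0, "weights": {"x": 2.0}}, f)
with open(v2_path, "w") as f:
    json.dump({"model": "linear", "intercept": 0.0,
               "weights": {"x": 0.5, "y": 1.0}}, f)

# The "model consumer" side: same engine, model updated by reading a file.
engine = ScoringEngine()
engine.load_model(v1_path)
score_v1 = engine.score({"x": 3.0})
engine.load_model(v2_path)
score_v2 = engine.score({"x": 3.0, "y": 2.0})
print(score_v1, score_v2)
```

Note that the second model adds a feature (`y`) without any change to the engine code, which is the point of separating the engine from the interchange file.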
  
Deploying analytic models

• Model Producer (analytic algorithms & models): export model
• Model Consumer (analytic operations): import model
• Interchange formats: PMML & PFA
• Analytic Infrastructure
  
Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*

*This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
  
TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.

2.5+ PB of cancer genomics data
+
Bionimbus data commons technology running multiple community developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
  
AnalyticOps for the Genomic Data Commons

DevOps
• Virtualization and the requirement for massive scale out spawned infrastructure automation (“infrastructure as code”).
• The requirement to reduce the time to deploy code created tools for continuous integration and testing.

ModelDev / AnalyticOps
• Use virtualization / containers, infrastructure automation and scale out to support large scale analytics.
• Requirement: reduce the time and cost to do high quality analytics over large amounts of data.
  
Genomic Data Commons (GDC) Files Vary Over 9 Orders of Magnitude in Size

GDC Pipelines Are Complex and Are Mostly Written by Others

Computations for a Single Genome Can Take Over a Week
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  
System Loads Vary Significantly

Ten Factors Affecting AnalyticOps

• Model quality (confusion matrix)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Workflow with monitoring
• Scheduling
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems*
• System failures
• Human errors

*DMS = data-model-system
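The first factor names the confusion matrix as the measure of model quality. A minimal sketch of how one might compute it from monitored predictions follows; the labels and the six observations are illustrative, not data from the talk.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows are actual labels, columns are predicted labels, in `labels` order."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Illustrative observations: ground truth versus what the deployed model said.
actual = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

matrix = confusion_matrix(actual, predicted, ["pos", "neg"])
true_pos, false_neg = matrix[0]
false_pos, true_neg = matrix[1]

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(matrix, round(precision, 3), round(recall, 3))
```

Tracking these counts over time, rather than once at validation, is what makes them usable on a monitoring dashboard. When ground truth is missing (the third factor), the matrix cannot be filled in and a proxy measure is needed.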
  
Monitor Data Quality and Model Performance and Summarize With Dashboards
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

AnalyticOps Dashboard
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Data Quality: Batch Effects Can Be Significant
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Model Quality: Differences in Three Somatic Mutation Detection Algorithms
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  
Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments

• Generally, community software in bioinformatics is designed to be run manually over local clusters.
• Example
– We patched one piece of software over 400 times so that it could run over 12,000 genomes
– Although only 3.3% of genomes had problems, it required significant manual effort.
• AnalyticOps requires operating the software in automated environments.
  
Decide What Not to Compute
[Figure: histogram titled "VarScan Rate"; x-axis: Rate (GB/hour), 0.0 to 2.0; y-axis: Frequency, 0 to 1200. Annotation: Manage these cases carefully.]

Model Expected Performance
[Figure: Processing time versus Tumor BAM size (GB).]
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
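One way to "model expected performance" is to fit processing time against input size on past runs and flag new runs that take much longer than predicted. The sketch below assumes a straight-line relationship, which the slide does not state; the run data and the `tolerance` factor are invented for illustration.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def flag_slow_runs(sizes_gb: Sequence[float], hours: Sequence[float],
                   slope: float, intercept: float,
                   tolerance: float = 1.5) -> List[int]:
    """Indices of runs that took more than `tolerance` times the expected time."""
    return [i for i, (size, took) in enumerate(zip(sizes_gb, hours))
            if took > tolerance * (slope * size + intercept)]


# Invented history: tumor BAM size (GB) and processing time (hours).
history_sizes = [10.0, 20.0, 30.0, 40.0]
history_hours = [5.0, 10.0, 15.0, 20.0]
slope, intercept = fit_line(history_sizes, history_hours)

# Invented new runs; the second one is far slower than its size predicts.
new_sizes = [20.0, 40.0, 30.0]
new_hours = [11.0, 45.0, 14.0]
flagged = flag_slow_runs(new_sizes, new_hours, slope, intercept)
print(slope, intercept, flagged)
```

Flagged runs are candidates for the "manage these cases carefully" treatment: inspect them, or decide not to compute them at all.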
  
Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*

*Thanks to the DMG PMML and PFA Working Groups.
  	
  
Portable Format for Analytics (PFA) Standard
www.dmg.org
  
PFA is Based Upon Defining Primitives for Analytic Models

• What would a standard look like that…
– Defines primitives for data transformations, data aggregations, and statistical and analytic models.
– Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
– Is extensible.
– Is “safe” to deploy in enterprise IT operational environments.
• This philosophy is different from, and complementary to, Predictive Model Markup Language (PMML).
  
Benefits of PFA

• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed.
• PFA allows developers and users of analytic systems to pre-process inputs to models and to post-process their outputs.
• PFA is easily integrated with Storm, Akka and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
  
Gaussian Process Model

Example of a PFA model

input: {type: array, items: double}
output: {type: array, items: double}
cells:
  table:
    type:
      {type: array, items: {type: record, name: GP, fields: [
        {name: x, type: {type: array, items: double}},
        {name: to, type: {type: array, items: double}},
        {name: sigma, type: {type: array, items: double}}]}}
    init:
      - {x: [0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      # ...
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

input, output: input and output of the scoring engine, expressed as Avro schemas
  
cells, table: the Gaussian Process model parameters, given as a type (also Avro) and a value (as JSON, truncated)
  
action: the calling method, with parameters expressed as JSON
• input: get interpolation point from input
• {cell: table}: get parameters from table
• null: no explicit Kriging weight (universal)
• {fcn: …}: kernel function
  
Example of a PFA model

• Appears declarative, but this is a function call.
– Fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVM, but is reusable anywhere.
– One argument (gamma) preapplied so that it fits the signature for model.reg.gaussianProcess.
• Any kernel function could be used, including user-defined functions written with PFA “code.”
• The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.

model.reg.gaussianProcess:
  - input
  - {cell: table}
  - null
  - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
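The "one argument preapplied" point is ordinary partial application. The Python sketch below mirrors it with `functools.partial`. It is an analogy, not PFA: the `rbf` formula exp(-gamma * ||x - y||^2) is the usual radial basis definition and is assumed here to correspond to m.kernel.rbf, and `kernel_row` is a hypothetical stand-in for a consumer that expects a two-argument kernel. It does not perform Gaussian Process regression.

```python
import math
from functools import partial
from typing import Callable, List, Sequence

Vector = Sequence[float]


def rbf(x: Vector, y: Vector, gamma: float) -> float:
    """Radial basis (squared exponential) kernel: exp(-gamma * ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_row(point: Vector, training_xs: Sequence[Vector],
               kernel: Callable[[Vector, Vector], float]) -> List[float]:
    """Evaluate a two-argument kernel between one point and each training x."""
    return [kernel(point, x) for x in training_xs]


# Pre-apply gamma so the three-argument rbf fits the two-argument signature,
# as `fill: {gamma: 2.0}` does in the PFA document.
kernel = partial(rbf, gamma=2.0)

training_xs = [[0.0, 0.0], [1.0, 0.0]]   # illustrative locations
row = kernel_row([0.0, 0.0], training_xs, kernel)
print(row)
```

Because `kernel_row` only requires a callable of two vectors, any other kernel, including a user-defined one, can be passed in the same way.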
Summary

Ten AnalyticOps Rules

1. Team a modeler, software engineer, and systems engineer.
2. Instrument and monitor analytics, software and systems and populate an AnalyticOps dashboard.
3. Use an automated testing and deployment environment to improve the model quality.
4. Use scoring engines with languages such as PFA & PMML.
5. Put in place a data quality program.
6. For complex workloads, use workflow and schedulers (even if you think you don’t need them initially) and model scale up.
7. Optimize the end to end performance of the AnalyticOps, not individual analytics.
8. Distinguish scores from actions.
9. Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.
  
Questions?
  

Grupo Visión México 2030, El municipio: una institución diseñada para el frac...
 
Catalogo
CatalogoCatalogo
Catalogo
 
Star wars trailer
Star wars trailerStar wars trailer
Star wars trailer
 
Ventaja y desventaja de formacion virtual y presencial
Ventaja y desventaja de formacion virtual y presencialVentaja y desventaja de formacion virtual y presencial
Ventaja y desventaja de formacion virtual y presencial
 
Taller de negociacion
Taller de negociacionTaller de negociacion
Taller de negociacion
 

AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments

  • 1. How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments. Robert L. Grossman, University of Chicago and Open Data Group. O’Reilly Strata Conference, March 30, 2016. rgrossman.com, @bobgrossman
  • 3. Software Development, Quality Assurance, Operations: DevOps. The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.* *Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
  • 4. Analytic Modeling, Quality Assurance, Analytic Operations: AnalyticOps. The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
  • 5. Analytic Modeling, Quality Assurance, Analytic Operations: AnalyticOps. The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably. • Software • Model • Data
  • 6. Analytic strategy and planning; Analytic models & algorithms; Analytic operations; Analytic infrastructure. *Source: Robert L. Grossman, The Strategy and Practice of Analytics, O’Reilly, 2016, to appear.
  • 7. A Problem: There are platforms and tools for managing and processing big data (Hadoop), for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services. Diagram labels: Data scientists developing analytic models & algorithms; Analytic infrastructure; Enterprise IT deploying analytics into products, services and operations; Deploying analytics.
  • 8. More Problems. Diagram labels: Data scientists developing analytic models & algorithms; Analytic infrastructure; Enterprise IT deploying analytics into products, services and operations; Deploying analytics; Monitoring operational analytics; ETL and datamarts for the modelers.
  • 9. Case Study 1: Scoring Engines for Critical Systems
  • 10. Life Cycle of Predictive Model. Diagram labels: Exploratory Data Analysis; Get and clean the data; Build model in dev/modeling environment; Deploy model in operational systems with scoring application; Monitor performance and employ champion-challenger methodology to develop improved model; Analytic modeling; Analytic operations; Deploy model; Perf. data; Retire model and deploy improved model; Select analytic problem & approach; Scale up deployment.
  • 11. Diagram labels: Exploratory Data Analysis; Get and clean the data; Build model in dev/modeling environment; Deploy model in operational systems with scoring application; Monitor performance and employ champion-challenger methodology to develop improved model; Analytic modeling; Analytic operations; Deploy model; Retire model and deploy improved model; Select analytic problem & approach; Scale up deployment; ModelDev; AnalyticOps; Perf. data.
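The champion-challenger step named in the life cycle can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not from the slides: the models are plain threshold functions, accuracy is the comparison metric, and the names `accuracy`, `pick_champion` and the 0.02 margin are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical model type: maps one feature value to a 0/1 label.
Model = Callable[[float], int]


def accuracy(model: Model, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of performance-data records the model labels correctly."""
    correct = sum(1 for x, y in zip(xs, ys) if model(x) == y)
    return correct / len(ys)


def pick_champion(champion: Model, challenger: Model,
                  xs: Sequence[float], ys: Sequence[int],
                  margin: float = 0.02) -> Model:
    """Keep the champion unless the challenger beats it by at least `margin`."""
    if accuracy(challenger, xs, ys) >= accuracy(champion, xs, ys) + margin:
        return challenger  # retire the old model, deploy the improved one
    return champion


# Toy performance data collected from the operational system (made up).
xs = [0.1, 0.35, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1, 1]


def champion(x: float) -> int:
    return int(x > 0.5)


def challenger(x: float) -> int:
    return int(x > 0.3)


winner = pick_champion(champion, challenger, xs, ys)
```

The margin is one possible guard against replacing a deployed model on the strength of noise in a small sample; a real deployment would likely use a statistical test and a metric suited to the problem.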
  • 12. Differences Between the Modeling and Deployment Environments • Typically modelers use specialized languages such as SAS, SPSS or R. • Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc. • This can result in significant effort moving the model from the modeling environment to the deployment environment.
  • 13. Ways to Deploy Models into Products/Services/Operations • Export and import tables of scores. • Export and import tables of parameters. • Have the product/service interact with the model as a web or message service. • Import the models into a database. • Embed the model into a product or service. • Push code. How quickly can the model be updated? • Model parameters? • New features? • New pre- & post-processing?
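The first option in the list above, exporting and importing tables of scores, can be sketched as follows. This is an illustrative assumption, not the speaker's implementation: the CSV layout, the column names `entity_id` and `score`, and both function names are hypothetical.

```python
import csv
import io


def export_scores(scores: dict) -> str:
    """Modeling side: write precomputed scores as a CSV table."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["entity_id", "score"])  # hypothetical column names
    for entity_id, score in sorted(scores.items()):
        writer.writerow([entity_id, score])
    return buf.getvalue()


def import_scores(text: str) -> dict:
    """Deployment side: load the table into a lookup keyed by entity id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["entity_id"]: float(row["score"]) for row in reader}


table = export_scores({"a17": 0.91, "b42": 0.08})
lookup = import_scores(table)
```

With this approach the product never runs the model; it only looks up a score. That keeps integration simple, but any change to parameters, features, or pre- and post-processing requires rescoring and re-exporting the whole table.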
  • 14. What is a Scoring Engine? • A scoring engine is a component that is integrated into products or enterprise IT that deploys analytic models in operational workflows for products and services. • A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application. • Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats. • Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
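The idea that a scoring engine is integrated once and then updated by reading a new interchange file can be sketched as below. The sketch assumes a made-up in-house JSON format holding a linear model (`intercept` plus named `weights`); it is not PMML or PFA, and the `ScoringEngine` class and `write_model` helper are hypothetical names.

```python
import json
import os
import tempfile


class ScoringEngine:
    """Integrated once; models are swapped by reading a new interchange file."""

    def __init__(self):
        self.model = None

    def load(self, path: str) -> None:
        # Hypothetical in-house format: {"intercept": float, "weights": {name: float}}
        with open(path) as f:
            self.model = json.load(f)

    def score(self, record: dict) -> float:
        weights = self.model["weights"]
        return self.model["intercept"] + sum(
            w * record[name] for name, w in weights.items()
        )


def write_model(path: str, model: dict) -> None:
    """Stand-in for the model producer exporting a model file."""
    with open(path, "w") as f:
        json.dump(model, f)


path = os.path.join(tempfile.mkdtemp(), "model.json")
engine = ScoringEngine()

write_model(path, {"intercept": 1.0, "weights": {"x": 2.0}})
engine.load(path)
first = engine.score({"x": 3.0})

# The producer exports an updated model; the engine code does not change.
write_model(path, {"intercept": 0.0, "weights": {"x": 0.5}})
engine.load(path)
second = engine.score({"x": 3.0})
```

The design point is that the application code calling `score` stays fixed while the model behind it changes, so an update costs a file read rather than a redeployment.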
  • 15. Deploying analytic models
      [Diagram] Analytic algorithms & models (Model Producer): Export model. Analytic operations (Model Consumer): Import model. Exchange formats: PMML & PFA. Both sit on the Analytic Infrastructure.
  • 16. Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*
      *This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
  • 17. AnalyticOps for the Genomic Data Commons
      TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
      2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community-developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
  • 18. DevOps
      • Virtualization and the requirement for massive scale-out spawned infrastructure automation (“infrastructure as code”).
      • The requirement to reduce the time to deploy code created tools for continuous integration and testing.
  • 19. ModelDev AnalyticOps
      • Use virtualization / containers, infrastructure automation, and scale-out to support large-scale analytics.
      • Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
  • 20. Genomic  Data  Commons  (GDC)  Files  Vary  Over  9   Orders  of  Magnitude  in  Size  
  • 21. GDC Pipelines Are Complex and Are Mostly Written by Others
  • 22. Computations for a Single Genome Can Take Over a Week Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  • 23. System  Loads  Vary  Significantly  
  • 24. Ten Factors Affecting AnalyticOps
      • Model quality (confusion matrix)
      • Data quality (six dimensions)
      • Lack of ground truth
      • Software errors
      • Workflow with monitoring
      • Scheduling
      • Bottlenecks, stragglers, hot spots, etc.
      • Analytic configuration problems*
      • System failures
      • Human errors
    *DMS = data-model-system
  • 25. Monitor Data Quality and Model Performance and Summarize With Dashboards Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
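As a rough sketch of what could feed such a dashboard, the code below computes one model-quality measure (a confusion matrix and the accuracy derived from it, as named on slide 24) and one data-quality measure. Completeness is used here only as an assumed example of a data-quality dimension; the slides do not list the six dimensions. The labels, records, and field name are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of pairs on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else float("nan")

def completeness(records, field):
    """Fraction of records with a non-missing value for `field`."""
    if not records:
        return float("nan")
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

# Invented labels: this requires ground truth, which slide 24 notes is often lacking.
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
matrix = confusion_matrix(actual, predicted)

# Invented input records with a hypothetical "size_gb" field.
records = [{"size_gb": 10.0}, {"size_gb": None}, {"size_gb": 3.5}, {}]

summary = {
    "accuracy": accuracy(matrix),
    "completeness_size_gb": completeness(records, "size_gb"),
}
```

A dashboard would typically track values like those in `summary` over time rather than as a single snapshot.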
  • 26. AnalyticOps Dashboard Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  • 27. Data Quality: Batch Effects Can Be Significant Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  • 28. Model Quality: Differences in Three Somatic Mutation Detection Algorithms Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
  • 29. Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments
      • Generally, community software in bioinformatics is designed to be run manually over local clusters.
      • Example
          – We patched one piece of software over 400 times so that it could run over 12,000 genomes.
          – Although only 3.3% of genomes had problems, it required significant manual effort.
      • AnalyticOps requires operating the software in automated environments.
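One ingredient of running software unattended is a driver that records failures instead of stopping at the first one, so the small fraction of problem cases can be triaged afterwards. The sketch below assumes each job is an external command; the job names are hypothetical, and trivial Python one-liners stand in for real pipeline tools.

```python
import subprocess
import sys

def run_batch(commands, timeout_seconds=60):
    """Run each command unattended; collect failures instead of stopping."""
    succeeded, failed = [], []
    for name, argv in commands.items():
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout_seconds
            )
        except subprocess.TimeoutExpired:
            failed.append((name, "timeout"))
            continue
        if proc.returncode == 0:
            succeeded.append(name)
        else:
            failed.append((name, "exit code %d" % proc.returncode))
    return succeeded, failed

# Stand-in commands: one exits cleanly, one fails with exit code 3.
commands = {
    "sample-1": [sys.executable, "-c", "pass"],
    "sample-2": [sys.executable, "-c", "import sys; sys.exit(3)"],
}
succeeded, failed = run_batch(commands)
```

The `failed` list is the input to manual follow-up; everything in `succeeded` needed no human attention.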
  • 30. Decide What Not to Compute [Figure: histogram of VarScan processing rate; x-axis: Rate (GB/hour); y-axis: Frequency] Manage these cases carefully.
  • 31. Model Expected Performance [Figure: processing time vs. tumor BAM size (GB)] Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
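One simple way to model expected performance is to fit processing time against input size and flag runs that take much longer than predicted. The sketch below assumes a linear relationship, which the slide does not state, and uses made-up numbers purely to show the mechanics; the run ids, function names, and the 1.5x tolerance are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_slow_runs(runs, slope, intercept, tolerance=1.5):
    """Return ids whose observed time exceeds tolerance x the expected time."""
    return [
        run_id
        for run_id, size_gb, hours in runs
        if hours > tolerance * (slope * size_gb + intercept)
    ]

# Made-up history of (BAM size in GB, processing time in hours).
history_sizes = [10.0, 20.0, 30.0, 40.0]
history_hours = [5.0, 10.0, 15.0, 20.0]
slope, intercept = fit_line(history_sizes, history_hours)

# Made-up new runs: (run id, BAM size in GB, observed hours).
new_runs = [("run-a", 50.0, 26.0), ("run-b", 50.0, 60.0)]
slow = flag_slow_runs(new_runs, slope, intercept)
```

Runs flagged this way are candidates for the straggler and hot-spot investigation mentioned on slide 24.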
  • 32. Case  Study  3:  Deploying  Gaussian  Process   Models  to  the  Industrial  Internet*   *Thanks  to  the  DMG  PMML  and  PFA  Working  Groups.    
  • 33. Portable Format for Analytics (PFA) Standard www.dmg.org
  • 34. PFA is Based Upon Defining Primitives for Analytic Models
      • What would a standard look like that…
          – Defines primitives for data transformations, data aggregations, and statistical and analytic models.
          – Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
          – Is extensible.
          – Is “safe” to deploy in enterprise IT operational environments.
      • This philosophy is different from, and complementary to, the Predictive Model Markup Language (PMML).
  • 35. Benefits of PFA
      • PFA is based upon JSON and Avro and integrates easily into modern big data environments.
      • PFA allows models to be easily chained and composed.
      • PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs to models.
      • PFA is easily integrated with Storm, Akka, and other streaming environments.
      • PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
  • 37. Example of a PFA model
      input: {type: array, items: double}
      output: {type: array, items: double}
      cells:
        table:
          type:
            type: array
            items:
              type: record
              name: GP
              fields:
                - {name: x, type: {type: array, items: double}}
                - {name: to, type: {type: array, items: double}}
                - {name: sigma, type: {type: array, items: double}}
          init:
            - {x: [  0,   0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
            - {x: [  0,  36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
            - {x: [  0,  72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
            # ...
            - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
      action:
        model.reg.gaussianProcess:
          - input
          - {cell: table}
          - null
          - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
    Callout on input and output: input and output of the scoring engine, expressed as Avro schemas.
  • 38. Example of a PFA model (the same model as on slide 37)
      Callout on cells: table: the Gaussian Process model parameters, given as a type (also Avro) and a value (init, as JSON, truncated).
  • 39. Example of a PFA model (the same model as on slide 37)
      Callout on action: the calling method, with parameters expressed as JSON.
      • input: get interpolation point from input
      • {cell: table}: get parameters from table
      • null: no explicit Kriging weight (universal)
      • {fcn: …}: kernel function
  • 40. Example of a PFA model
      model.reg.gaussianProcess:
        - input
        - {cell: table}
        - null
        - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
      • Appears declarative, but this is a function call.
          – Fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
          – m.kernel.rbf was intended for SVM, but is reusable anywhere.
          – One argument (gamma) preapplied so that it fits the signature for model.reg.gaussianProcess.
      • Any kernel function could be used, including user-defined functions written with PFA “code.”
      • The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
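The "preapplied argument" idea is ordinary partial application. The Python sketch below is an analogy, not PFA itself: `rbf` uses the conventional radial basis kernel exp(-gamma * ||x - y||^2), and `kernel_matrix` is a hypothetical consumer that, like model.reg.gaussianProcess, expects a kernel taking only two arguments.

```python
import math
from functools import partial

def rbf(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def kernel_matrix(points, kernel):
    """A consumer that expects a two-argument kernel(x, y)."""
    return [[kernel(p, q) for q in points] for p in points]

# Pre-apply gamma so the three-argument rbf fits the two-argument signature,
# analogous to {fcn: m.kernel.rbf, fill: {gamma: 2.0}}.
kernel = partial(rbf, gamma=2.0)
matrix = kernel_matrix([[0.0, 0.0], [1.0, 0.0]], kernel)
```

Because the consumer only sees a two-argument function, any other kernel with the same shape could be substituted without changing `kernel_matrix`.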
  • 42. Ten AnalyticOps Rules
      1. Team a modeler, software engineer, and systems engineer.
      2. Instrument and monitor analytics, software, and systems, and populate an AnalyticOps dashboard.
      3. Use an automated testing and deployment environment to improve the model quality.
      4. Use scoring engines with languages such as PFA & PMML.
      5. Put in place a data quality program.
      6. For complex workloads, use workflow and schedulers (even if you think you don’t need them initially) and model scale-up.
      7. Optimize the end-to-end performance of the AnalyticOps, not individual analytics.
      8. Distinguish scores from actions.
      9. Identify and eliminate performance hot spots, system stragglers, etc.
      10. Invest in root cause analysis of AnalyticOps problems.
  • 43. Questions? rgrossman.com @bobgrossman