
1. Introduction to the Course "Designing Data Bases with Advanced Data Models (NoSQL & Hadoop)"


Information technology has led us into an era in which producing, sharing and using information is part of everyday life, often without our being fully aware of it: it is now almost impossible not to leave a digital trail of many of our daily actions, for example through digital content such as photos, videos and blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). On top of this, with the "Internet of Things" we see a growing number of devices (watches, bracelets, thermostats and many other items) that can connect to the network and therefore generate large data streams. This explosion of data explains the birth of the term Big Data: data produced in large volumes, at remarkable speed and in varied formats, which require processing technologies and resources that go far beyond conventional systems for managing and storing data. It is immediately clear that, in these contexts, 1) storage models based on the relational model and 2) processing systems based on stored procedures and computational grids are no longer adequate. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the drawbacks: very often, when facing the management of big data, variability, that is, the lack of a fixed structure, is also a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines them as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, free of a predetermined schema (key-value, column-oriented, document-based and graph-based), easily replicable, not bound to full ACID guarantees, and able to handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open-source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations with respect to the demands of new applications based on Big Data, which use NoSQL databases to store data and MapReduce to process large amounts of it.
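The MapReduce paradigm mentioned above can be sketched in a few lines. The following is a minimal single-process illustration, not Hadoop code: a map function emits key-value pairs, a shuffle step groups them by key, and a reduce function aggregates each group. All function names here are illustrative, not part of any framework API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)

documents = ["Big Data needs new tools", "Hadoop processes Big Data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # counted across all documents -> 2
```

In a real Hadoop job the map and reduce functions run in parallel on many cluster nodes and the shuffle moves data over the network; the contract between the phases, however, is exactly this one.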

Course Website

Published in: Data & Analytics

  1. Course Introduction: Designing Data Bases with Advanced Data Models. Dr. Fabio Fumarola
  2. Enterprise computing evolution
     • We've spent several years in the world of enterprise computing.
     • We've seen many things change in: languages, architectures, platforms, and processes.
     • But one thing stayed constant: "the relational database stored the data".
     • The data storage question for architects was: "which relational database to use?"
  3. The stability of the reign
     • Why?
     • An organization's data lasts much longer than its programs (COBOL?).
     • It is valuable to have stable data storage that is accessible from many applications.
     • In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
  4. FROM BUSINESS TO DECISION SUPPORT
     Before diving into the course topics, let's review the evolution of decision support systems.
  5. From business to decision support: '60s
     • Starting from the '60s, data were stored on magnetic disks.
     • The supported analyses were static, aggregate-only and pretty limited.
     • For instance, it was possible to extract the total amount of last month's sales.
  6. From business to decision support: '80s
     • With relational databases and SQL, data analysis started to become somewhat dynamic.
     • SQL allows us to extract data at both detailed and aggregated levels.
     • Transactional activities are stored in Online Transaction Processing (OLTP) databases.
     • OLTP databases are used in several applications such as orders, salaries, invoices…
  7. From business to decision support: '80s
     • Ideally, the modules described are included in Enterprise Resource Planning (ERP) software.
     • Examples of such vendors are SAP, Microsoft, HP and Oracle.
     • Normally, what happens is that each module is implemented as ad-hoc software with its own database.
     • Cons: data representation and integration.
  8. OLTP design
     • Such databases are designed to be:
       – strongly normalized,
       – fast at inserting data.
     • However, data normalization:
       – does not favor reading huge quantities of data,
       – increases the number of tables used to store records.
  9. Fostering decision support
     • In order to support "decisions" we need to extract de-normalized data via several JOINs.
     • Moreover, operational databases offer no, or limited, visibility on historical data.
     • Considerations: these factors make data analysis hard on OLTP databases…
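The cost of normalization for analysis can be made concrete with Python's built-in sqlite3 module. The two-table schema and the data below are invented for illustration: even the trivial question "total sales per city" already forces a JOIN plus an aggregation once customers and orders live in separate normalized tables.

```python
import sqlite3

# Hypothetical normalized OLTP schema: orders reference customers by id.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL,
                         FOREIGN KEY(customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'ACME', 'Bari'), (2, 'Foo Srl', 'Rome');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);
""")

# Answering "total sales per city" needs a JOIN to de-normalize the data:
rows = db.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.city ORDER BY c.city
""").fetchall()
print(rows)  # [('Bari', 150.0), ('Rome', 70.0)]
```

With a realistic schema the same report can involve many tables and therefore many JOINs, which is exactly what makes analysis on OLTP databases expensive.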
  10. From business to decision support: '90s
     • Thus, in the '90s, databases designed to support analysis on top of OLTP databases were born.
     • This is the rise of Data Warehouses.
     • DWs are central repositories of integrated data from one or more disparate sources.
     • They store current and historical data and are used for creating trending reports for management reporting, such as annual and quarterly comparisons.
  11. Types of DW systems
     • Data Mart:
       – a simple form of data warehouse focused on a single subject (or functional area), such as sales, finance or marketing.
     • Online analytical processing (OLAP):
       – characterized by a relatively low volume of transactions,
       – queries are often very complex and involve aggregations,
       – OLAP databases store aggregated, historical data in multi-dimensional schemas.
  12. Types of DW systems
     • Predictive analysis:
       – Predictive analysis is about finding and quantifying hidden patterns in the data, using complex mathematical models that can be used to predict future outcomes.
       – It differs from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future.
  13. Data Warehouse
     • Data stored in DWs are the starting point for Business Intelligence (BI).
     • Def.: "Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes."
     • With the evolution of BI systems we moved from SQL-based analysis to visual instruments.
  14. Example DW Cube
  15. OLAP features
     • OLAP systems support data exploration through drill-down, drill-up, slicing and dicing operations.
     • However, we still have a historical view of what happened in the business, not of what is happening now.
     • We cannot make predictions about the future.
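The slicing and roll-up operations mentioned above can be mimicked over a tiny in-memory fact table. This is a hedged, pure-Python sketch with made-up data, not a real OLAP engine: a slice fixes one dimension, a roll-up aggregates the measure over the remaining ones.

```python
from collections import defaultdict

# Tiny fact table: (year, region, product, sales)
facts = [
    (2014, "EU", "bikes", 100), (2014, "EU", "cars", 300),
    (2014, "US", "bikes", 150), (2015, "EU", "bikes", 120),
]

def roll_up(rows, *dims):
    """Aggregate the sales measure over the chosen dimensions (drill-up)."""
    totals = defaultdict(int)
    for year, region, product, sales in rows:
        values = {"year": year, "region": region, "product": product}
        totals[tuple(values[d] for d in dims)] += sales
    return dict(totals)

# Slice: fix one dimension (region == "EU"), then roll up by year.
eu_by_year = roll_up([f for f in facts if f[1] == "EU"], "year")
print(eu_by_year)  # {(2014,): 400, (2015,): 120}
```

Drilling down is simply rolling up by more dimensions, e.g. `roll_up(facts, "year", "product")`; a real OLAP engine precomputes and stores these aggregates in multi-dimensional schemas instead of scanning the facts each time.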
  16. From business to decision support: '00s
     • Starting from 2000, the need to do predictive analysis arose.
     • The techniques in this scenario belong to the field of Data Mining.
     • Data Mining is the computational process of discovering patterns in large datasets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
  17. From business to decision support: '00s
     • The overall goal of the data mining process is to extract novel information from a dataset and transform it into understandable knowledge.
     • Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
  18. From business to decision support: recap
     • In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
  19. From business to decision support: recap
     • Vendors such as Oracle, Vertica, Teradata, Microsoft and IBM proposed solutions based on relational algebra and SQL.
     • With Data Mining we are able to extract knowledge from data, which can be used to support predictive analysis (Data Mining is not only prediction!).
  20. Challenges of Scale Differ
  21. But… THERE IS SOMETHING THAT DOES NOT WORK!
  22. 1. Scaling up databases
     "A question I'm often asked about Heroku is: 'How do you scale the SQL database?' There's a lot of things I can say about using caching, sharding, and other techniques to take load off the database. But the actual answer is: we don't. SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale."
     Adam Wiggins, Heroku. Quoted in: Patterson, David; Fox, Armando (2012-07-11). Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations 1285-1288). Strawberry Canyon LLC. Kindle Edition.
  23. 2. Data variety
     • RDBMSs have problems with unstructured and semi-structured (varied) data.
  24. 3. Connectivity
  25. 4. P2P Knowledge
  26. 5. Concurrency
  27. 6. Concurrency
  28. 6. Diversity
  29. 7. Cloud
  30. What is the problem with RDBMSs?
     Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs
  31. What is the problem with RDBMSs?
     • RDBMSs can somehow deal with these aspects, but they have issues related to:
       – expensive licensing,
       – requiring complex application logic,
       – dealing with evolving data models.
     • There was a need for systems that could:
       – work with different kinds of data formats,
       – not require a strict schema,
       – and scale easily.
  32. NOSQL: THE NEW CHALLENGER! Help!!!
  33. NoSQL
     • It was born out of a need to handle large data volumes.
     • It forces a fundamental shift to building large hardware platforms out of clusters of commodity servers.
     • This need also arises from the difficulties of making application code play well with relational databases.
  34. NoSQL
     • The term "NoSQL" is very ill-defined.
     • It is generally applied to a number of recent non-relational databases: Cassandra, Mongo, Neo4j, HBase, Redis,…
     • They embrace:
       – schemaless data,
       – running on a cluster,
       – and the ability to trade off traditional consistency for other useful properties.
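The "schemaless data" point can be seen with plain dictionaries standing in for documents: two records in the same collection need not share fields, and one can carry nested structure without any schema change. A minimal sketch, where the in-memory "collection" is just a list and `find` is a toy query function, not a real document-store API:

```python
# Two "documents" in the same collection, with different shapes:
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Bob", "phones": ["555-0101", "555-0102"],
     "address": {"city": "Bari"}},  # nested structure, no schema migration needed
]

def find(collection, **criteria):
    """Naive query: return documents matching all the given field values."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(users, name="Bob")[0]["address"]["city"])  # Bari
```

Document databases such as MongoDB offer essentially this model, with persistence, indexing and a richer query language on top; the freedom in record shape is what the slide calls "schemaless".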
  35. Why are NoSQL databases interesting?
     1. Application development productivity:
       – A lot of application development time is spent mapping data between in-memory data structures and relational databases.
       – A NoSQL database may provide a data model that simplifies that interaction, resulting in less code to write, debug, and evolve.
  36. Why are NoSQL databases interesting?
     2. Large-scale data:
       – Organizations are finding it valuable to capture more data and process it more quickly.
       – They are finding it expensive to do so with relational databases.
       – NoSQL databases are more economical when run on large clusters of many smaller and cheaper machines.
       – Many NoSQL databases are designed to run on clusters, so they better fit Big Data scenarios.
  37. WHY NOSQL
     Internet: hypertext, RSS, wikis, blogs, tagging, user-generated content, RDF, ontologies. Connectedness.
  39. The value of relational databases
     • Relational databases have become an embedded part of our computing culture.
     • What are the benefits they provide?
  40. Getting at persistent data
     • The most obvious value of a database is keeping large amounts of persistent data.
     • Two areas of memory:
       – Main memory: fast, volatile, limited in space, and loses its data when it loses power.
       – Backing store: larger but slower, commonly seen as a disk.
  42. Getting at persistent data
     • The backing store can be organized in all sorts of ways.
     • For many productivity applications (such as word processors) it is a file in the file system.
     • For most enterprise applications, however, the backing store is a database.
     • A database allows more flexibility than a file system.
  43. Concurrency
     • Concurrency is notoriously difficult to get right.
     • Object orientation is not the right programming model to deal with concurrency.
     • Since enterprise applications can have a lot of concurrent users, there is a lot of room for bad things to happen.
     • Relational databases have transactions that help mitigate this problem, but…
  44. Concurrency
     • You still have to deal with transactional errors, e.g. when you try to book a room that has just gone.
     • In practice, the transactional mechanism has worked well to contain the complexity of concurrency.
     • Transactions with rollback allow us to deal with errors.
  45. Integration
     • Enterprise applications live in a rich ecosystem:
     • multiple applications written by different teams need to collaborate in order to get things done.
     • This collaboration is done via data sharing.
     • A common way to do this is shared database integration [Hohpe and Woolf], where multiple applications store their data in a single database.
     • Using a single database allows all the applications to share data easily, while the database's concurrency control handles multiple applications just as it handles multiple users.
  46. A (mostly) standard model
     • Relational databases have succeeded because they have a standard model.
     • As a result, developers and database professionals can apply the same knowledge across several projects.
     • Although there are differences between different RDBMSs, the core mechanisms remain the same.
  47. Impedance mismatch
     • It is the difference between the relational model and in-memory data structures.
     • The relational model organizes data into tables and rows (relations and tuples):
       – a tuple is a set of name-value pairs,
       – a relation is a set of tuples.
     • All SQL operations consume and return relations.
  48. Impedance mismatch: example
  49. Impedance mismatch
     • Tuples and relations provide elegance and simplicity, but they also introduce limitations.
     • In particular, the values in a relational tuple have to be simple.
     • They cannot contain any structure, such as a nested record or a list.
     • This limitation does not hold for in-memory data structures.
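The mismatch described above can be made concrete: an in-memory order naturally nests its line items as a list, but storing it relationally forces a flat translation into two sets of simple tuples, one per table. A hedged sketch with invented field and table names:

```python
# Rich in-memory structure: an order with a nested list of line items.
order = {
    "id": 42,
    "customer": "ACME",
    "lines": [{"product": "bolt", "qty": 10}, {"product": "nut", "qty": 20}],
}

def to_relational(order):
    """Flatten into simple tuples for hypothetical `orders` and `order_lines` tables."""
    order_row = (order["id"], order["customer"])
    line_rows = [(order["id"], line["product"], line["qty"])
                 for line in order["lines"]]
    return order_row, line_rows

order_row, line_rows = to_relational(order)
print(order_row)  # (42, 'ACME')
print(line_rows)  # [(42, 'bolt', 10), (42, 'nut', 20)]
```

Reading the order back requires a JOIN and re-nesting; this translation code in both directions is exactly what ORM frameworks automate, and what many NoSQL data models avoid by storing the nested structure directly.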
  50. Impedance mismatch
     • As a result, if we want to use a richer in-memory data structure, we have to translate it to a relational representation to store it on disk.
     • While object-oriented languages succeeded, object-oriented databases faded into obscurity.
     • The impedance mismatch has been made much easier to deal with by Object-Relational Mapping (ORM) frameworks such as EclipseLink, Hibernate and other JPA implementations.
  51. Impedance mismatch
     • ORMs remove a lot of work, but they can become a problem when people try to ignore the database, and query performance suffers.
     • This is where NoSQL databases work great. Why?
  52. Application and integration DBs
     • This is a situation that occurs many times in software projects.
     • In this scenario, the database acts as an integration database.
     • Sharing a database in this way has downsides.
  53. Application and integration DBs
     • The downsides of a shared database are:
       – its structure tends to be more complex than any single application needs,
       – if an application wants to change its data storage, it needs to coordinate with all the other applications,
       – performance degrades due to the huge number of accesses,
       – errors arise in database usage, since it is accessed by applications written by different teams.
     • This is different from single-application databases.
  54. Application and integration DBs
     • Interoperability concerns can now shift to the interfaces of the application, allowing interaction over HTTP.
     • This is what happens with micro-services (http://…services/).
     • Micro-services, and Web services in general, enable a form of communication based on data.
     • Data is represented as documents, formerly in XML and now in JSON format.
  55. Application and integration DBs
     • If you are going to use service integration, text over HTTP is the way to go.
     • However, when performance matters, there are binary protocols.
  56. Attack of the clusters
     • In the 2000s several large web properties dramatically increased in scale:
       – websites started tracking activity and structure in detail (analytics),
       – large sets of data appeared: links, social networks, activity logs, mapping data,
       – with this growth in data came a growth in users.
     • Coping with this increase in data and traffic required more computing resources.
  57. Attack of the clusters
     • There were two choices:
       – scaling up,
       – scaling out.
     • Scaling up implies bigger machines: more processors, disk storage and memory ($$$).
     • Scaling out was the alternative:
       – use a lot of small machines in a cluster,
       – it is cheap and also more resilient to individual failures.
  58. Attack of the clusters
     • This revealed a new problem: relational databases are not designed to run on clusters.
     • Clustered relational databases (e.g. Oracle RAC or Microsoft SQL Server) work on the concept of a shared disk subsystem.
     • RDBMSs can also run on separate servers with different sets of data (sharding).
  59. Attack of the clusters
     • However, sharding requires the application to control the sharded database.
     • We also lose querying, referential integrity, transactions and consistency control across shards.
     • These technical issues are exacerbated by licensing costs.
     • This mismatch between DBs and clusters led some organizations to consider different solutions.
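Application-controlled sharding can be sketched with a toy hash-based router. The shard count and the data are invented; the point is that a per-key lookup touches one shard, while any operation spanning keys must visit every shard, which is exactly the cross-shard cost the slide mentions.

```python
N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]  # each dict stands in for one database

def shard_for(key):
    """The application, not the database, decides where a key lives."""
    return shards[hash(key) % N_SHARDS]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    """Single-shard access: cheap."""
    return shard_for(key).get(key)

def scan_all():
    """Cross-shard access: must visit every shard."""
    return {k: v for shard in shards for k, v in shard.items()}

for user in ["ada", "bob", "eve"]:
    put(user, {"name": user.title()})

print(get("ada")["name"])  # Ada: routed to exactly one shard
print(len(scan_all()))     # 3: required touching all shards
```

Notice what the sketch does not give you: no JOINs, referential integrity or transactions across shards, and any change in `N_SHARDS` forces the application to re-route (and re-balance) its own data.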
  60. The emergence of NoSQL
     • Two companies in particular, Google and Amazon, have been very influential.
     • They were capturing large amounts of data, and their business depends on data management.
     • Both companies produced influential papers:
       – BigTable from Google,
       – Dynamo from Amazon.
  61. The emergence of NoSQL
     • As part of the innovation in data management systems, several new technologies were built:
       – 2003 - Google File System,
       – 2004 - MapReduce,
       – 2006 - BigTable,
       – 2007 - Amazon Dynamo,
       – 2012 - Google Compute Engine.
     • Each solved different use cases and had a different set of assumptions.
     • All of these mark the beginning of a different way of thinking about data management.
  62. The emergence of NoSQL
     • It is ironic that the term "NoSQL" first appeared in the late '90s as the name of a relational database made by Carlo Strozzi.
     • That name came from the fact that it did not use SQL as its query language.
     • However, the usage of "NoSQL" as we understand it today comes from a meetup held in 2009 in San Francisco.
     • The organizers wanted a term that could be used as a Twitter hashtag: #NoSQL.
  63. NoSQL characteristics
     1. They don't use SQL (HBase, Cassandra, Redis…).
     2. They are generally open-source projects.
     3. Most of them are designed to run on clusters.
     4. RDBMSs use ACID transactions to handle consistency across the whole database; NoSQL databases resort to other options (CAP theorem).
     5. Not all are cluster-oriented (graph DBs).
     6. NoSQL databases operate without a schema (schema-free).
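Point 4 above, trading full ACID consistency for other options, can be illustrated with a toy last-write-wins merge, one common conflict-resolution strategy in Dynamo-style stores. The replica structure and timestamps are invented for illustration: two replicas accept writes independently (no global lock), and a later sync keeps only the newest value per key.

```python
# Two replicas of the same key space accept writes independently.
# Each value carries a timestamp; on sync, the newest one wins.

replica_a = {"cart": ("milk", 1)}       # (value, timestamp)
replica_b = {"cart": ("milk", 1)}

# Concurrent updates land on different replicas:
replica_a["cart"] = ("milk,bread", 2)   # client 1, at t=2
replica_b["cart"] = ("milk,eggs", 3)    # client 2, at t=3

def merge(a, b):
    """Anti-entropy sync: for each key, keep the value with the newest timestamp."""
    merged = {}
    for key in a.keys() | b.keys():
        candidates = [r[key] for r in (a, b) if key in r]
        merged[key] = max(candidates, key=lambda vt: vt[1])
    return merged

converged = merge(replica_a, replica_b)
print(converged["cart"][0])  # milk,eggs: client 1's write was silently lost
```

An ACID database would have serialized the two updates; here availability is preserved during the concurrent writes, at the cost of losing one of them, which is the kind of trade-off the CAP theorem formalizes. Real systems often prefer vector clocks or CRDTs precisely to avoid this silent loss.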
  64. The emergence of NoSQL
     • NoSQL does not stand for "Not-Only SQL".
     • It is better to see NoSQL as a movement rather than a technology.
     • RDBMSs are not going away.
     • The change is that relational databases are now one option among others.
     • This point of view is often referred to as polyglot persistence.
  65. The emergence of NoSQL
     • Instead of just picking a relational database, we need to understand:
       1. the nature of the data we are storing, and
       2. how we want to manipulate it.
     • To deal with this change, most organizations need to shift from integration databases to application databases.
     • In this course we concentrate on Big Data running on clusters.
  66. The emergence of NoSQL
     • Big Data concerns have created an opportunity for people to think freshly about their storage needs.
     • NoSQL helps developer productivity by simplifying database access, even when there is no need to scale beyond a single machine.