Automated Provisioning of a Data Lab Environment for Data Scientists

  1. Data Lab Environment for Data Scientists – Automation with Infrastructure as Code
     Fabian Hardt, Senior Consultant – Big Data / Analytics
     © OPITZ CONSULTING 2020
  2. #Figures
     - Turnover: €49.0 million (2017), €55.5 million (2018), €56.0 million (2019)
     - Employees: 455 (2017), 482 (2018), 555 (2019)
     - Sector distribution: retail / logistics / services, chemical & pharma, public, industry, finance, other
     - 91% customer satisfaction, recommendation: NPS 40.26
     - > 130 actively managed service customers, > 600 active customers
     - > 5,000 databases and > 4,000 systems in support, > 99.5% SLA compliance
     - > 90% of our customers' projects use agile methods
     "Sound management in conjunction with a long-term customer-oriented strategy guarantees financial success." – Torsten Schlautmann
  3. Agenda: 1 Data Lake, 2 Data Lab vs. Data Factory, 3 Infrastructure as Code, 4 Demo, 5 Summary
  4. Section 1 – Data Lake: purpose, requirements, architecture
  5. Data Lake – Purpose
     - Primary task: centralized data provisioning
     - All data of a company is collected in one centralized data platform
     - External data is also collected (social media, weather, stock market prices, etc.)
     - Similar to DWH systems: integration of different data sources
     - Raw data is stored without knowing the intended purpose, both structured and unstructured
     - Suitable for very large amounts of data
     - Mostly implemented with big data technologies
  6. Data Lake – Enterprise Requirements (Data Factory)
     - Usability of stored data: searchability, evaluability, availability
     - Data security / data protection regulations: authentication and authorization of users (users should only see the data they are allowed to see)
     - Data governance: quality assurance of data, governs the compliance requirements
  7. Core components of a Data Lake
     - Raw Data: sensor data, streaming, social media, documents, images, ...
     - Data Refinery: preprocessing area
     - Refined Data: quality-assured data, typically data that could transition into a classical DWH
     - Metadata
  8. Section 2 – Data Lab vs. Data Factory: purpose, requirements, architecture
  9. Data Lab – Purpose
     - Develop processes and algorithms to gain data insights
     - Development area for Data Scientists
     - Allows Data Scientists to install and use any software of their choice
     - Allows Data Scientists to run experimental processes with external data
     - Allows Data Scientists to manage all data and processes
  10. Data Lab – Requirements
     - Should give Data Scientists full freedom: use all kinds of technologies and software (in containers), use data as close as possible to real enterprise data, use data from external sources
     - Should meet all requirements of a software development environment: software artifacts should be runnable in the Data Factory
     - Should prevent Data Scientists from seeing data they are not allowed to see
     - Should ensure the Data Lake does not get messed up
     - Should ensure that even resource-hungry processes do not influence the Data Lake
  11. Data Lab – Requirements implementation
     - Data Lab as a sandbox dedicated to the Data Scientists: no authorization needed, gives the Data Scientists full freedom
     - Access to the Data Lake to use data from there: read access at most, sometimes only working on copies of data from the Data Lake
     - The sandbox architecture prevents the Data Lab from influencing Data Lake processes
  12. Data Lab – Architecture
     - Data Lake core: Raw Data (sensor data, streaming, social media, documents, images, ...), Data Refinery (preprocessing area), Refined Data (quality-assured data, typically data that could transition into a classical DWH), Metadata
     - Analytics Sandbox alongside the lake, fed with data from the lake and with external data
  13. Data Factory – Requirements
     - New models and algorithms should be integrated easily
     - Established algorithms should be updated easily
     - Found insights should be integrated into the Data Lake so they can be evaluated
     - Challenge: design an extensible data model
     - Single Data Factory components should not influence other Data Lake processes
     - Must conform to the compliance requirements
  14. Data Science vs. Enterprise
     Data Lab:
     - Looking for business insights in a huge amount of data
     - Mainly used to train algorithms
     - Not a production system, so no operating team is necessary
     - Usually no other target groups
     - Often no further authorization is needed, usually only authentication is enabled
     Enterprise Data Lake:
     - Deeply integrated into the daily business
     - Production system
     - Typically many target applications are based on this collection of data
     - High availability, backup & recovery
  15. Data Science vs. Enterprise
     Data Lab:
     - Explorative approach, generating insights
     - Works with near-production data, uses sandboxes
     - Data Scientists and experts from the specialist departments
     - Goal: train models and algorithms
     Data Factory:
     - Generates value from trained models and algorithms
     - Automatic processing of data from the Data Lake
     - Further development as in other software development products
     - Makes use of trained models and algorithms
  16. Deployment from Data Lab to Data Factory
     - The Data Factory contains the Data Lake core (raw data, data refinery, refined data), the DWH and additional data sources
     - Only trained algorithms and the insights generated from the analyses are deployed from the Data Lab to the Data Factory
     - No data transfer from Lab to Factory
  17. Conclusion – Why do we need a Data Lab?
     - Where is the problem? Why should we do this?
     - "Playground" / sandbox for Data Scientists
     - No risk of breaking something in the EDL / Data Factory
     - No dependency on release cycles, etc.
     - "Free" choice of tools
     - Made possible by Infrastructure as Code: setting up entire environments in minutes, with a very similar or identical setup to the EDL / Data Factory
  18. Section 3 – Infrastructure as Code (IaC): what is IaC, advantages
  19. Keyword – Infrastructure as Code
     - Called IaC for short
     - The process of managing and provisioning computers, VMs and networks
     - Version-controlled infrastructure definitions
     - Repeatable and reliable
     - Frameworks: HashiCorp Terraform, Puppet, SaltStack, Red Hat Ansible, ... (a minimal sketch follows below)
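To make the idea concrete, here is a minimal sketch of what such a version-controlled infrastructure definition can look like, written as an Ansible playbook (one of the frameworks listed above). The package and host selection are assumptions chosen only for illustration; the point is that the file describes a desired state, can be re-run safely and lives in Git like any other code.

```yaml
# Minimal IaC sketch (assumed example, not from the talk): the playbook describes
# the desired state of a host instead of a sequence of manual steps. Running it
# twice leaves the system unchanged after the first run (idempotent).
- name: Minimal infrastructure definition
  hosts: all
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running and starts at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```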
  20. Advantages of IaC
     - Costs: reduce costs by avoiding repetitive tasks, avoiding mistakes and reusing code / infrastructure
     - Speed: provisioning on multiple machines in parallel, calculations / regex in code
     - Risk: reduced through better preparation, an idempotent approach and prepared updates
  21. Challenges of IaC
     - A few new skills are needed: typical development workflows (e.g. GitFlow), some basic coding paradigms (JSON, Ruby, YAML, HCL, ...)
     - Monitoring: who is provisioning what, where and how?
     - Tracking via spreadsheets, as in legacy setups, is no longer feasible
     - Solution: e.g. Ansible Tower / AWX
  22. Reuse / reproducible results
     - Develop scripts and playbooks in the DEV environment
     - Use the same playbooks / scripts / templates for all stages (DEV, TEST, PROD), each stage with its own inventory
     - This makes deployment to the next stages reliable and easier (see the inventory sketch below)
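A minimal sketch of how this stage separation could look with Ansible inventories; the group name, host names and file paths are assumptions made up for illustration. The same playbook is pointed at a different inventory per stage, so DEV, TEST and PROD are provisioned from identical code.

```yaml
# inventories/dev/hosts.yml (hypothetical): one inventory file per stage,
# each listing its hosts under the same group name so playbooks need no changes.
all:
  children:
    datalab:
      hosts:
        datalab-dev-01.example.com:

# The TEST and PROD inventories have the same structure with their own hosts.
# The identical playbook then runs unchanged against every stage:
#   ansible-playbook -i inventories/dev/hosts.yml  provision.yml
#   ansible-playbook -i inventories/prod/hosts.yml provision.yml
```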
  23. Often seen like this: the shared TEST environment quietly turns into the Data Science environment. Typical voices from the DEV team, the Data Science team and product management:
     - "Oh, this is our Data Science environment!"
     - "Yes, I trained this model on the TEST environment first."
     - "No, we can't use TEST for the new release now, the Data Scientists are using it!"
  24. Better like this: fork a dedicated Data Lab / sandbox from the PROD or TEST environment; only the trained models and new reports flow back.
  25. Terraform (HashiCorp)
     - Open-source software tool, released in 2014
     - Code in HCL (HashiCorp Configuration Language) or JSON
     - Terraform CLI to trigger Terraform actions
     - Manages public / private cloud infrastructure
     - Extendable with providers: AWS, Azure, OCI, vSphere, ...
     - Performs CRUD operations to reach the target state (see the sketch below)
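As an illustration, a minimal Terraform configuration could look like the following. The AWS provider, region, AMI ID and instance size are assumptions picked for the sketch; the structure (a provider plus declarative resources, applied via the CLI) is what matters.

```hcl
# Minimal Terraform sketch (assumed values): declares the desired target state,
# which `terraform plan` / `terraform apply` then create, update or delete towards.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-central-1"
}

resource "aws_instance" "datalab" {
  ami           = "ami-12345678"   # placeholder image ID
  instance_type = "t3.large"       # assumed sizing
  tags = {
    Name = "datalab-sandbox"
  }
}
```

The same workflow (`terraform init`, `plan`, `apply`, and `destroy` once the lab is no longer needed) applies regardless of which provider is used.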
  26. Cloud-init
     - Customizes the OS directly after cloning from a template
     - Many public / private clouds and OS types are supported
     - Perfect for "low-level" tasks: set the default locale, set hostnames, set up SSH keys, mount shared folders
     - It can also install some software packages
     - Hand-over point to e.g. Ansible (see the sketch below)
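A small sketch of a cloud-init user-data file covering exactly these low-level steps; the hostname, key, mount source and package are placeholders invented for the example.

```yaml
#cloud-config
# Hypothetical cloud-init user-data: runs once on first boot of the cloned VM.
hostname: datalab-01                      # placeholder hostname
locale: en_US.UTF-8                       # default locale
ssh_authorized_keys:
  - ssh-ed25519 AAAA... datalab-admin     # placeholder public key
packages:
  - nfs-common                            # cloud-init can install a few packages
mounts:
  # mount a shared folder: [source, mount point, fs type, options, dump, pass]
  - ["fileserver:/export/shared", "/mnt/shared", "nfs", "defaults", "0", "0"]
runcmd:
  # hand-over point: from here on, configuration is done by e.g. Ansible
  - echo "cloud-init finished, ready for Ansible"
```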
  27. Ansible (Red Hat)
     - Open-source software, founded by AnsibleWorks in 2012
     - Very well supported on many Unix-like systems, but also on Windows
     - Strengths: provisioning, configuration management, application deployment
     - Agentless: works with temporary SSH access
     - Features: inventory-based configuration of different stages, Ansible Vault for storing sensitive data, playbooks can be written idempotently
     - Ansible Tower / upstream project AWX: REST API and web-based console for Ansible (see the playbook sketch below)
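The following is a sketch, not taken from the talk, of how such a playbook could prepare a Data Lab host. The group name, variable names and vault file are assumptions; the sensitive password would live in a file encrypted with `ansible-vault`.

```yaml
# datalab.yml (hypothetical): prepares a Data Lab host; safe to re-run.
- name: Prepare a Data Lab host
  hosts: datalab                            # assumed inventory group
  become: true
  vars_files:
    - vault/secrets.yml                     # assumed vault file (ansible-vault encrypted)
  tasks:
    - name: Create the data scientist account
      ansible.builtin.user:
        name: "{{ lab_user }}"                                          # assumed variable
        password: "{{ vault_lab_password | password_hash('sha512') }}"  # from the vault
        shell: /bin/bash

    - name: Install basic analysis tooling
      ansible.builtin.package:
        name:
          - git
          - python3
        state: present
```

Run per stage with the matching inventory, e.g. `ansible-playbook -i inventories/test/hosts.yml datalab.yml --ask-vault-pass`.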
  28. Section 4 – Demo: here comes a small sample ...
  30. Data Lab creation workflow
     - The admin chooses a template according to the requirements
     - The environment is provisioned
     - The Data Lab (analytics sandbox) is ready to use
     - Data is pulled through APIs or pushed into the Data Lab
  31. Sample workflow (on premise): the admin triggers / runs a playbook that performs the VM creation, adds DNS entries and AD or LDAP users, does some configuration, copies data into the Data Lab and notifies the Data Scientists (sketched below).
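A rough sketch of how this workflow could be expressed in a single playbook. vSphere is assumed as the virtualisation platform, and the module choices, template name, paths and addresses are all made up for illustration; the DNS and AD/LDAP steps are only indicated, since they depend entirely on the local infrastructure.

```yaml
# provision_datalab_onprem.yml (hypothetical end-to-end sketch)
- name: Create the Data Lab VM
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Clone a VM from a template (vSphere assumed)
      community.vmware.vmware_guest:
        hostname: "{{ vcenter_host }}"
        username: "{{ vcenter_user }}"
        password: "{{ vcenter_password }}"
        name: datalab-01                   # assumed VM name
        template: datalab-template         # assumed template
        state: poweredon
    # DNS entries and AD / LDAP users would be added here with the modules
    # that match the local DNS server and directory service.

- name: Configure the new host, load data and notify
  hosts: datalab                           # assumed inventory group
  become: true
  tasks:
    - name: Copy an initial data extract into the lab
      ansible.builtin.copy:
        src: /exports/datalake-extract/    # assumed export path
        dest: /data/lab/

    - name: Notify the Data Scientists
      community.general.mail:
        to: data-science-team@example.com  # placeholder address
        subject: "Your Data Lab is ready"
        body: "datalab-01 has been provisioned and the initial data is loaded."
```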
  32. Sample workflow (Oracle Cloud): the admin triggers a predefined stack that creates VM instances and networks, optionally adds some firewall rules; then some configuration is done, data is copied into the Data Lab and the Data Scientists are notified (a Terraform sketch follows below).
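In the Oracle Cloud such a predefined stack is typically a Terraform configuration (for example handed to OCI Resource Manager). The following compressed sketch shows only the compute part; all OCIDs, the region, the shape and the file names are placeholders, and the subsequent configuration, data copy and notification would again be handled by Ansible as in the on-premise flow.

```hcl
# Hypothetical OCI sketch: one Data Lab instance in an existing subnet.
terraform {
  required_providers {
    oci = {
      source = "oracle/oci"
    }
  }
}

provider "oci" {
  region = "eu-frankfurt-1"                # assumed region
}

variable "compartment_ocid" {}
variable "availability_domain" {}
variable "subnet_ocid" {}                  # networks and firewall rules could be
variable "datalab_image_ocid" {}           # declared in the same stack as well

resource "oci_core_instance" "datalab" {
  compartment_id      = var.compartment_ocid
  availability_domain = var.availability_domain
  shape               = "VM.Standard2.1"   # assumed shape
  display_name        = "datalab-01"

  create_vnic_details {
    subnet_id = var.subnet_ocid
  }

  source_details {
    source_type = "image"
    source_id   = var.datalab_image_ocid
  }

  metadata = {
    # low-level first-boot setup handed to cloud-init (see the earlier sketch)
    user_data = base64encode(file("cloud-init.yml"))
  }
}
```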
  33. Summary
     - First of all: don't use TEST as the Data Science environment
     - Use a dedicated Data Lab environment to innovate
     - IaC is the way to build and destroy a Data Lab environment: scalable and repeatable, mix multiple clouds or on-premise environments
     - Use the mix of roles to succeed: admins, developers, data scientists
     - Increase satisfaction company-wide: more flexibility for Data Scientists, matching processes for administration and development in the Data Factory, speed up the innovation lifecycle
  34. Thank you for your interest, we look forward to exchanging ideas with you!
     Fabian Hardt, Senior Consultant, Business Intelligence & Analytics
     Kirchstraße 6, 51647 Gummersbach
     Fabian.Hardt@opitz-consulting.com, +49 (0) 2261 6001-1045
     @OC_WIRE · OPITZCONSULTING · opitzconsulting · opitz-consulting-bcb8-1009116
     WWW.OPITZ-CONSULTING.COM
     © OPITZ CONSULTING 2020 – Überraschend mehr Möglichkeiten
