Data Lakes
A practical view
Marco Garcia
CTO, Founder – Cetax, TutorPro
mgarcia@cetax.com.br
https://www.linkedin.com/in/mgarciacetax/
With more than 20 years of experience in IT, 18 of them devoted exclusively to Business
Intelligence, Data Warehouse, and Big Data, Marco Garcia is certified by Kimball University
in the USA, where he took classes in person with Ralph Kimball, one of the leading
Data Warehouse gurus.
First Hortonworks Certified Instructor in LATAM
Data Architect and Instructor at Cetax Consultoria.
Introduction
Data Lake ?
What is intelligence?
The ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason.
The ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (such as tests).
Data Lake?
First mention of the term "Data Lake": October 2010
Data Warehouse vs. Data Lake
https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
Bottled water:
- Clean
- Treated
- Packaged
- Ready for consumption
Data lake:
- Raw
- Untreated
- Must be processed before it can be consumed
“Data is the new oil”
In the year 2012, the
Like oil, it needs to be refined!
DATA IS THE NEW OIL!
DATA FOR BIG DATA
DATA BY SHELF LIFE FOR BIG DATA
TOOLS FOR BIG DATA
A COMPLETE ARCHITECTURE FOR BIG DATA? Hadoop!
Hadoop
What is Apache Hadoop?
The Apache Hadoop project describes the technology as a software framework that:
- Allows for the distributed processing of large data sets across clusters of computers using simple programming models
- Is designed to scale up from single servers to thousands of machines, each offering local computation and storage
- Does not rely on hardware to deliver high availability; rather, the library itself is designed to detect and handle failures at the application layer
- Delivers a highly available service on top of a cluster of computers, each of which may be prone to failures
Source: http://hadoop.apache.org
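The "simple programming models" in the description above usually refers to MapReduce. The sketch below imitates the three phases (map, shuffle, reduce) in plain, single-process Python for a word count. It does not use Hadoop itself; the function names are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the job a framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run in parallel on the worker nodes, close to the data blocks they read; here everything runs in one process only to show the shape of the model.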
Hadoop Core = Storage + Compute
- Storage: Hadoop Distributed File System (HDFS)
- Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
Hadoop Distribution
Distinct Masters and Scale-Out Workers
[Diagram: cluster layout]
- master node 1: ZooKeeper, NameNode
- master node 2: ZooKeeper, ResourceManager
- master node 3: ZooKeeper, HiveServer2
- utility node 1: Client Gateway, Knox
- utility node 2: Client Gateway, Ambari Server
- worker nodes (12 shown): NodeManager, DataNode
What would the Data Lake look like
on Hadoop?
[Diagram: reference architecture for a Data Lake on Hadoop]
- Platform: compute & storage nodes managed by YARN, with KNOX and AMBARI
- Source data: app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams (DB, file, JMS, REST, HTTP, streaming)
- Other data sources: RDBMS and No/New SQL stores (Oracle, Hana), EDW (SAP BW)
- Step 1: Extract & Load (SQOOP, FLUME, NIFI, KAFKA, SQOOP/Hive, WebHDFS)
- Step 2: Model/Apply Metadata (HCATALOG, table metadata)
- Step 3: Transform, Aggregate & Materialize (HIVE, PIG, data processing)
- Step 4a: Publish/Exchange
- Step 4b: Explore/Visualize (query/visualization/reporting tools through HIVE Server: SAP BO, Tableau/Excel, any JDBC-compliant tool)
- Step 4c: Analyze (analytical tools: SAS, Python, R, Matlib)
- Workloads: analytical, streaming, interactive (NN, AppMaster)
- FALCON (data lifecycle): manage Steps 1-3
Steps to the Data Lake
Step 1 - Extract and Load
Step 2 - Model and Apply Metadata
Step 3 - Transform, Aggregate, and Materialize the Data
Step 4a - Publish or Send Data
Step 4b - Explore and Visualize
Step 4c - Analyze, Do Data Science
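The steps above can be sketched end to end in a few lines of plain Python. This is only an illustration of the order of operations, not of the Hadoop tools named in the deck; the sample data, the `SCHEMA` mapping, and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical source extract; in the deck this would arrive via Sqoop, Flume, NiFi, or Kafka.
RAW_SALES = "store,amount\nsp,10.5\nrj,4.0\nsp,2.5\n"

# Step 2: metadata that tells readers how to interpret the raw text.
SCHEMA: Dict[str, Callable] = {"store": str, "amount": float}


def extract_and_load(source: str) -> List[Dict[str, str]]:
    """Step 1: land the data as-is, every field still a string."""
    return list(csv.DictReader(io.StringIO(source)))


def apply_metadata(rows: List[Dict[str, str]], schema: Dict[str, Callable]) -> List[dict]:
    """Step 2: cast each field using the declared schema."""
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in rows]


def aggregate(rows: List[dict]) -> Dict[str, float]:
    """Step 3: materialize total sales per store."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


def publish(totals: Dict[str, float]) -> List[Tuple[str, float]]:
    """Step 4a: hand a sorted result to downstream tools."""
    return sorted(totals.items())


raw = extract_and_load(RAW_SALES)
typed = apply_metadata(raw, SCHEMA)
report = publish(aggregate(typed))
```

Note that the raw rows are kept untouched after Step 1; typing and aggregation happen later, which is what separates a lake from a warehouse that transforms before loading.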
How to Structure and
Create the Data Lake
Key Points
- Align the Data Lake with the organizational structure
- Create zones in the Data Lake (ingest zone, transformation zone, presentation zone)
- Data ingestion processes
- Security
- Data lineage
- Understand the needs
- Integrations will be necessary!
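One way to make the zones concrete is a fixed directory convention that every ingestion job follows. The sketch below builds such paths; the root `/datalake`, the `dt=` partition naming, and the function name are assumptions for illustration, while the three zone names come from the list above.

```python
from datetime import date
from pathlib import PurePosixPath

ZONES = ("ingest", "transformation", "presentation")  # zone names from the slide
LAKE_ROOT = PurePosixPath("/datalake")  # hypothetical root directory


def dataset_path(zone: str, function: str, dataset: str, day: date) -> PurePosixPath:
    """Build the directory for one day's data of a dataset in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return LAKE_ROOT / zone / function / dataset / f"dt={day.isoformat()}"
```

Rejecting unknown zone names in one place keeps ad hoc directories from appearing in the lake over time.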
Logical Structure of the Organization
- Align the structure by functions, not by departments or teams; organizations change, but functions almost always stay similar.
- Think of it as a long-term investment
- Always pay attention to regulations and to internal and external controls.
- Think of the Data Lake in layers
What to store? EVERYTHING!
HDFS layer: Landing zone
- Data is written into the landing zone
  - SQOOP
  - HDF
  - Flume
  - …
  - RAW format
- Security
  - Contains PII
  - Landing zone uses HDFS TDE for data protection
  - Only ETL tools access this layer
  - Access by data wranglers only
  - Data retention is limited (< 1 month)
[Diagram: RDBMS, SQOOP, NiFi, Landing]
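The limited retention of the landing zone can be expressed as a simple age check. The sketch below assumes that "< 1 month" means 30 days and that the landing date of each path is already known; it only selects candidates for removal and does not touch HDFS.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION = timedelta(days=30)  # assumed reading of "< 1 month"


def expired(landed_on: Dict[str, date], today: date) -> List[str]:
    """Return the paths whose landing date is older than the retention window."""
    return sorted(path for path, day in landed_on.items() if today - day > RETENTION)
```

For example, with files landed on 2024-01-01 and 2024-02-20, a run on 2024-03-01 would flag only the first one.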
HDFS layer: Archival layer
- Data is compressed into large files
  - Hadoop archive (har)
  - Solves the small-file problem
- Data is automatically removed
  - Retention policy managed via Falcon
- Security
  - Archive zone uses HDFS TDE for data protection
  - A limited set of users can access it
- HDFS tiering
[Diagram: Landing, Archive]
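The small-file problem arises because the NameNode tracks every file and block in memory, so many tiny files cost far more metadata than a few large ones. The sketch below plans how small files could be grouped into archives no larger than one HDFS block. It is a generic greedy packing under that assumption, not a description of how the `hadoop archive` tool lays out its output.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the default HDFS block size since Hadoop 2


def pack(files: Dict[str, int], target: int = BLOCK_SIZE) -> List[List[str]]:
    """Greedily group files, largest first, into archives of at most `target` bytes."""
    archives: List[List[str]] = []
    free: List[int] = []  # remaining capacity of each planned archive
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                archives[i].append(name)
                free[i] -= size
                break
        else:
            # No planned archive has room: start a new one for this file.
            archives.append([name])
            free.append(max(target - size, 0))
    return archives
```

A file larger than the target simply gets an archive of its own, since splitting across blocks is handled by HDFS itself.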
HDFS layer: Presentation layer
- Data moves from Landing to Speed
  - Data is cleaned as part of ETL
  - Optimized file format
  - ORC, Parquet, Avro, …
- Multiple copies of the same dataset, depending on use cases
  - RAW data stored in an optimized file format
  - Tokenised, normalised, data marts, ...
- Security
  - Sensitive data is tokenised
  - Business users access this layer
[Diagram: Landing, Archive, Presentation]
Multi-tenant environment: Development and test layer
- Third-party tools move data from landing into the dev & test zone
  - PII is encrypted using a third-party solution
  - One-way tokenisation
  - Data is consistently tokenised
  - Enables joins between different datasets
- Benefit
  - Development is done against a realistic dataset (volume & format)
  - Gives access to the data science team
[Diagram: Landing, Dev / Test / …]
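One-way, consistent tokenisation can be sketched with a keyed hash: the same input always yields the same token, so joins across datasets keep working, yet the token cannot be turned back into the original value. The deck refers to a third-party solution; the HMAC approach, the key, and the sample records below are stand-in assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; real keys belong in a key store


def tokenise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic, non-reversible token for a PII value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Two hypothetical datasets that share a PII column.
customers = [{"email": "ana@example.com", "segment": "gold"}]
orders = [{"email": "ana@example.com", "total": 250.0}]

tok_customers = {tokenise(c["email"]): c["segment"] for c in customers}
tok_orders = [{"customer": tokenise(o["email"]), "total": o["total"]} for o in orders]

# The join still works because equal inputs always map to equal tokens.
joined = [(o["customer"], tok_customers[o["customer"]], o["total"]) for o in tok_orders]
```

Using a keyed hash rather than a plain hash means someone without the key cannot rebuild the mapping by hashing guessed values.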
Multi-tenant environment: Data exploration layer
- Data
  - Accessed from the presentation layer
- Benefit
  - Gives data science teams access to a version of production data
  - Allows the data science team to acquire ad hoc external datasets
[Diagram: Landing, Dev / Test / …, Data exploration]
Multi-tenant environment: Production layer
- Third-party tools move data from landing into the dev & test zone
  - PII is encrypted using a third-party solution
  - Reversible tokenisation
  - Data is consistently tokenised
  - Enables joins between different datasets
[Diagram: Landing, Dev / Test / …, Prod, Data exploration]
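Reversible tokenisation differs from the one-way variant in that an authorised service can map a token back to the original value. A minimal in-memory sketch of that idea follows; the `TokenVault` class is hypothetical and stands in for the third-party solution the deck mentions, which would persist and protect the mapping.

```python
import secrets
from typing import Dict


class TokenVault:
    """In-memory stand-in for a tokenisation service that can reverse tokens."""

    def __init__(self) -> None:
        self._by_value: Dict[str, str] = {}
        self._by_token: Dict[str, str] = {}

    def tokenise(self, value: str) -> str:
        """Return the existing token for `value`, or mint a new random one."""
        if value not in self._by_value:
            token = secrets.token_hex(16)
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenise(self, token: str) -> str:
        """Recover the original value; raises KeyError for unknown tokens."""
        return self._by_token[token]
```

Because the token is random rather than derived from the value, reversal is only possible through the vault, so access to `detokenise` is the control point to secure.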
Best practices
- Create a catalogue of datasets in Atlas
  - Data owner
  - Source system
  - Project using it
- Keep multiple copies of the same data
  - Raw
  - Optimized
  - Tokenized
- Disaster recovery
  - Dev / Test / Data Exploration run on the DR cluster
  - Define prioritized workloads
- Create dataset structures based upon projects
  - Datasets will be reused across projects
- No write access for business users
[Slide columns: Do's, Don'ts]
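The catalogue attributes listed above (data owner, source system, project using it) can be modelled as a small record type. The sketch below is plain Python and does not use the Apache Atlas API; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """One catalogue record with the attributes named on the slide."""
    name: str
    data_owner: str
    source_system: str
    projects: List[str] = field(default_factory=list)


class Catalogue:
    """Minimal registry keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        """Add or overwrite the entry for a dataset."""
        self._entries[entry.name] = entry

    def used_by(self, project: str) -> List[str]:
        """List the datasets that a given project depends on."""
        return sorted(e.name for e in self._entries.values() if project in e.projects)
```

Recording the projects per dataset, rather than datasets per project, reflects the point that datasets are reused across projects.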
Thank you!
Visit us:
www.cetax.com.br
We are hiring!

What is DBT - The Ultimate Data Build Tool.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Data Lakes, a practical view: structuring and creation

  • 1. Data Lakes: a practical view. Marco Garcia, CTO and Founder, Cetax and TutorPro. mgarcia@cetax.com.br https://www.linkedin.com/in/mgarciacetax/
  • 2. Introduction: With more than 20 years of experience in IT, 18 of them devoted exclusively to Business Intelligence, Data Warehouse and Big Data, Marco Garcia is certified by Kimball University in the USA, where he took classes in person with Ralph Kimball, one of the leading Data Warehouse gurus. First Hortonworks Certified Instructor in LATAM. Data Architect and Instructor at Cetax Consultoria.
  • 5. What is intelligence? The ability to learn or understand or to deal with new or trying situations (reason); also, the skilled use of reason; the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (such as tests).
  • 7. Data Warehouse vs. Data Lake https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
      - Bottled water (Data Warehouse): clean, treated, packaged, ready for consumption
      - Data lake: raw, untreated, must be processed before it can be consumed
  • 8. “Data is the new oil.” In the year 2012 the ... Like oil, it needs to be refined! DATA IS THE NEW OIL!
  • 10. DATA BY VALIDITY (SHELF LIFE) FOR BIG DATA
  • 12. COMPLETE ARCHITECTURE FOR BIG DATA? Hadoop!
  • 13. What is Apache Hadoop? The Apache Hadoop project describes the technology as a software framework that:
      - Allows for the distributed processing of large data sets across clusters of computers using simple programming models
      - Is designed to scale up from single servers to thousands of machines, each offering local computation and storage
      - Does not rely on hardware to deliver high availability; rather, the library itself is designed to detect and handle failures at the application layer
      - Delivers a highly available service on top of a cluster of computers, each of which may be prone to failures
    Source: http://hadoop.apache.org
  • 14. Hadoop Core = Storage + Compute
      - Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
      - Storage: Hadoop Distributed File System (HDFS)
  • 16. Distinct Masters and Scale-Out Workers
      - master node 1: ZooKeeper, NameNode
      - master node 2: ZooKeeper, Resource Manager
      - master node 3: ZooKeeper, HiveServer2
      - utility node 1: Client Gateway, Knox
      - utility node 2: Client Gateway, Ambari Server
      - 12 worker nodes, each running: NodeManager, DataNode
  • 17. What would the Data Lake look like on Hadoop?
  • 18. Data Lake on Hadoop (architecture diagram)
      - Step 1: Extract & Load. Source data: app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams (DB, file, JMS, REST, HTTP, streaming). Load tools: SQOOP, FLUME, NIFI, KAFKA
      - Step 2: Model/Apply Metadata. HCATALOG (table metadata)
      - Step 3: Transform, Aggregate & Materialize. HIVE, PIG (data processing)
      - Manage Steps 1-3: data lifecycle with FALCON
      - Step 4a: Publish/Exchange. Load via SQOOP/Hive, WebHDFS; data sources: RDBMS, No/New SQL store (Oracle, Hana), EDW (SAP BW)
      - Step 4b: Explore/Visualize. Query/visualization/reporting tools: SAP BO, Tableau/Excel, any JDBC-compliant tool
      - Step 4c: Analyze. Analytical tools: SAS, Python, R, Matlib
      - Platform: compute & storage nodes under YARN (NN, AppMaster), with KNOX and AMBARI; workloads: analytical, streaming, interactive (HIVE Server)
  • 19. Steps to the Data Lake
      - Step 1: Extract and load
      - Step 2: Model and apply the metadata
      - Step 3: Transform, aggregate and materialize the data
      - Step 4a: Publish or send data
      - Step 4b: Explore and visualize
      - Step 4c: Analyze, do data science
  • 20. How to Structure and Create the Data Lake
  • 21. Key Points
      - Align the Data Lake with the organizational structure
      - Create areas (zones) in the Data Lake (ingest zone, transformation zone, presentation zone)
      - Data ingestion processes
      - Security
      - Data lineage
      - Understand the needs
      - Integrations will be necessary!
  • 22. Logical Structure of the Organization
      - Align the structure by functions rather than by departments or teams; organizations change, but the functions almost always stay similar.
      - Think of it as a long-term investment.
      - Always stay alert to regulations and to internal or even external controls.
      - Think of the Data Lake in layers.
  • 24. HDFS layer: Landing zone
      - Data is written into the landing zone (SQOOP, HDF, Flume, …) in RAW format
      - Security: contains PII; the landing zone uses HDFS TDE for data protection; only ETL tools access this layer; access by data wranglers only; data retention is limited (< 1 month)
      - Diagram: RDBMS to Landing via SQOOP, NiFi
  • 25. HDFS layer: Archival layer
      - Data is compressed into large files: Hadoop archive (har), which solves the small-file problem
      - Data is automatically removed: retention policy managed via Falcon
      - Security: the archive zone uses HDFS TDE for data protection; only a limited set of users can access it
      - HDFS tiering
      - Diagram: Landing, Archive
  • 26. HDFS layer: Presentation layer
      - Data moves from Landing to Speed: data is cleaned as part of ETL; optimized file format (ORC, Parquet, Avro, …)
      - Multiple copies of the same dataset depending on use cases: RAW data stored in an optimized file format; tokenised, normalised, datamarts, ...
      - Security: sensitive data is tokenised; business users access this layer
      - Diagram: Landing, Archive, Presentation
  • 27. Multi-tenant environment: Development and test layer
      - Third-party tools move data from landing into the dev & test zone: PII is encrypted using a third-party solution; one-way tokenisation; data is consistently tokenised, which enables joins between different datasets
      - Benefit: development is done against a realistic dataset (volume & format); gives access to the data scientist team
      - Diagram: Landing, Dev / Test / …
  • 28. Multi-tenant environment: Data exploration layer
      - Data is accessed from the presentation layer
      - Benefit: gives data scientist teams access to a version of production data; allows the data science team to acquire ad-hoc external datasets
      - Diagram: Landing, Dev / Test / …, Data exploration
  • 29. Multi-tenant environment: Production layer
      - Third-party tools move data from landing into the dev & test zone: PII is encrypted using a third-party solution; reversible tokenisation; data is consistently tokenised, which enables joins between different datasets
      - Diagram: Landing, Dev / Test / …, Prod, Data exploration
  • 30. Best practices: Do's and Don'ts
      - Create a catalogue of datasets in Atlas: data owner, source system, project using it
      - Keep multiple copies of the same data: raw, optimized, tokenized
      - Disaster recovery: Dev / Test / Data Exploration run on the DR cluster; define prioritized workloads
      - Create dataset structures based upon projects: datasets will be reused across projects
      - No write access to business users
  • 31. Thank you! Visit us: www.cetax.com.br We are hiring!
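
The consistent one-way tokenisation mentioned on slide 27 can be sketched as follows. This is only an illustration: it assumes a keyed hash (HMAC-SHA256) stands in for the third-party solution the slides refer to, and the key, field names and sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would fetch it from a key store.
SECRET_KEY = b"example-only-secret"


def tokenise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic token for a PII value (one-way: no decode function exists)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenise_rows(rows, pii_field):
    """Replace the PII field of each row with '<field>_token'."""
    out = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k != pii_field}
        clean[pii_field + "_token"] = tokenise(row[pii_field])
        out.append(clean)
    return out


def join_on_token(left, right, token_field):
    """Inner join of two tokenised datasets on the shared token column."""
    index = {row[token_field]: row for row in left}
    return [
        {**index[row[token_field]], **row}
        for row in right
        if row[token_field] in index
    ]


# Hypothetical sample data from two different source systems.
customers = [{"customer_id": "111.222.333-44", "segment": "retail"}]
orders = [{"customer_id": "111.222.333-44", "total": 250.0}]

tok_customers = tokenise_rows(customers, "customer_id")
tok_orders = tokenise_rows(orders, "customer_id")
joined = join_on_token(tok_customers, tok_orders, "customer_id_token")
```

Because the same input always maps to the same token, the two tokenised datasets still join, while the raw identifier never appears in the result. The reversible tokenisation listed for the production layer would need a different mechanism (for example a token vault or encryption), which this sketch does not cover.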

Editor's Notes

  1. Data may be the new oil, the new race that companies will run to multiply their profits! Collecting, processing and analyzing data correctly can be a competitive differentiator for any business. Of course, like oil, data also needs to be refined for a better result.
  2. This list is an example of possible sources, but we should expect many more. The new tools allow connection to, and capture of data from, many categories of software and even electronic equipment that supports data capture. This is, of course, in addition to the traditional data we already pull today from other systems, databases and text files.
  3. Reference - http://voltdb.com/blog/big-data/big-data-value-continuum/
  4. Too much software? Please stay calm; we will talk about this a little further on.
  5. Too much software? Please stay calm; we will talk about this a little further on.
  6. This “wordy” slide is straight from the project’s self-description and warrants a splash before we go much further… So what is Apache Hadoop? It is a scalable, fault tolerant, open source framework for the distributed storing and processing of large sets of data on commodity hardware. But what does all that mean? Well first of all it is scalable. Hadoop clusters can range from as few as one machine to literally thousands of machines. That is scalability! It is also fault tolerant. Hadoop services become fault tolerant through redundancy. For example, the Hadoop Distributed File System, called HDFS, automatically replicates data blocks to three separate machines, assuming that your cluster has at least three machines in it. Many other Hadoop services are replicated, too, in order to avoid any single points of failure. Hadoop is also open source. Hadoop development is a community effort governed under the licensing of the Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software bugs, or improving performance and scalability. Hadoop also uses distributed storage and processing. Large datasets are automatically split into smaller chunks, called blocks, and distributed across the cluster machines. Not only that, but each machine processes its local block of data. This means that processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of memory. All of this occurs on commodity hardware which reduces not only the original purchase price, but also potentially reduces support costs as well.
  7. At the most granular level, Hadoop is an engine that provides storage via HDFS and compute via YARN. The “ecosystem” tools wrap around this core.
  8. Hadoop is not a monolithic piece of software. It is a collection of architectural pillars that contain software frameworks. Most of the frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks that are part of the Hortonworks Hadoop distribution. So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed for a specific purpose. The functionality of some tools overlaps, but typically one tool is going to be better than the others at certain tasks. For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis, but Storm has more functionality and is more powerful for real-time data analysis.
  9. Here is an example cluster with three master nodes, 12 worker nodes, and two utility nodes. The cluster is running various services, like YARN and HDFS. Services can be implemented by one or more service components. The three master nodes are running service master components. The 12 worker nodes are running service worker components, sometimes called slave components. The two utility nodes are running service components that provide access, security, and management services for the cluster. This page does not illustrate all services, service master, or service worker components. More detail is provided in other lessons.
  10. Break Glass?
  11. If data needs to be reprocessed, copy it from Archive into Landing. HAR files are tracked by Atlas.
  12. ISO 27001: data and processing should be separated, which does not mean separate environments. Separate dev & test environments are used for upgrade/patch testing; they can be smaller, virtualised, ...
  13. ISO 27001: data and processing should be separated, which does not mean separate environments.
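
Note 6 describes how HDFS splits large files into blocks and replicates each block to three machines. The arithmetic behind that can be sketched as below, assuming a 128 MB block size (the HDFS default in Hadoop 2 and later; the notes do not state a value). The function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: HDFS default block size, not stated in the notes
REPLICATION = 3      # replication factor described in note 6


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Raw cluster storage consumed once every block is replicated.

    The last block only occupies the bytes it actually holds, so the total
    is simply file size times replication factor.
    """
    return file_size_mb * replication


one_gb_blocks = block_count(1024)      # 1 GB file
one_gb_storage = raw_storage_mb(1024)  # raw storage across the cluster
```

Under these assumptions a 1 GB file becomes 8 blocks and occupies 3 GB of raw storage, which is also why many small files are costly: each one takes at least one block entry regardless of its size.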