© Copyright 2016. Apps Associates LLC. 1
Big Data Overview & Hadoop for DBAs
Satyendra Pasalapudi
Associate Practice Director
Apps Associates LLC
About Me
Satyendra Kumar Pasalapudi
Associate Practice Director – Infrastructure/Cloud Practice at Apps Associates
Co-Founder & President of All India Oracle Users Group(AIOUG)
@pasalapudi
www.ora-search.com
History of Data Management Systems
Timeline (decades): 1940-50, 1950-60, 1960-70, 1970-80, 1980-90, 1990-2000, 2000-2010
Pre-computer technologies: printing press, Dewey decimal system, punched cards
Magnetic tape, “flat” (sequential) files
Magnetic disk
Indexed Sequential Access Method (ISAM)
Hierarchical model: IMS
Network model: IDMS
ADABAS
Relational model defined
System R, Oracle V2, Ingres
dBase, DB2, Informix, Sybase, SQL Server
Access, Postgres, MySQL
Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, Hana, Neo4J, Aerospike
Advantages of Cloud
Generational Change for Enterprise (IT)
• Cloud supports mission critical workloads
– 87% of Enterprises use Cloud for Mission Critical Applications
• Cloud use in the enterprise continues to grow
– Half of the Enterprises say they will use cloud for at least 75% of their workloads by 2018
• No one cloud fits all
– More than half (53%) of enterprises use two (2) to four (4) cloud providers
Source: Verizon 2016 State of the Market: Enterprise Cloud report
Cloud – Probable to Inevitable
• GE undergoing most important transformation in 140 year history
– 9000 Applications to AWS & to 4000 Applications
– 300 ERPs (two years back) to more manageable
– 34 Data Centers to 4 Data Centers
• By 2020 - US$15b of Software Revenue
• Changes
– People - Reduce Outsourcing
– Technology - Build Approach for things that matter
– 20% of Applications in Cloud as of today
– 70% of Applications by 2020 in Cloud
Source: AWS 2015 Keynote – Oct 6 2015
OOW Keynote with Mark Hurd Oct 26 2015
– Service Management
– Network Perimeter
– Risk Based Security Controls
– Self Service and Automation
– Financial Transparency
What is Cloud
The Role of Data is Changing
Until now, the questions you asked drove the data model
The new model is to collect as much data as possible – “Data-First Philosophy”
Data is the new raw material for any business, on par with capital, people, labor
Characteristics of Big Data
Big Data Challenge
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming
Data sources: ERP, CRM, RFID, Website, Network Switches, Social Media, Billing
Hybrid Cloud Framework
HR FIN
SCOM SALES
PROCUREMENT
PLANNING
DW / BI
Big data Eco System
Not Easy to Get Analytic Value at Fast Enough Pace
Tool Complexity
• Early Hadoop tools only for experts
• Existing BI tools not designed for Hadoop
• Emerging solutions lack broad capabilities
Data Uncertainty
• Not familiar and overwhelming
• Potential value not obvious
• Requires significant manipulation
80% effort typically spent on evaluating and preparing data
Overly dependent on scarce and highly skilled resources
Source: Oracle
Informatica Study May 2013
Addressed by Oracle Big Data Discovery
Key Challenges in Managing Big Data
Sample of Big Data Use Cases Today
MEDIA/ENTERTAINMENT: Viewers / advertising effectiveness; Cross Sell
COMMUNICATIONS: Location-based advertising
EDUCATION & RESEARCH: Experiment sensor analysis
RETAIL / CPG: Sentiment analysis; Hot products; Optimized Marketing
HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
LIFE SCIENCES: Clinical trials; Genomics
HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
OIL & GAS: Drilling exploration sensor analysis
FINANCIAL SERVICES: Risk & portfolio analysis; New products
AUTOMOTIVE: Auto sensors reporting location, problems
GAMES: Adjust to player behavior; In-Game Ads
LAW ENFORCEMENT & DEFENSE: Threat analysis - social media monitoring, photo analysis
TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
UTILITIES: Smart Meter analysis for network capacity,
ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Web-site optimization
What is the main difference in this data?
Volume, Velocity, Variety
These Characteristics Challenge Your Existing Architecture
Big Data Verticals
Media/Advertising: Targeted Advertising; Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommend; Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations; Risk Analysis
Security: Anti-virus; Fraud Detection; Image Recognition
Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
Sample Enterprise Big Data Architecture
Operational RDBMS (Oracle, SQL Server, …)
ERP & in-house CRM
Web Server
Web DBMS (MySQL, Mongo, Cassandra)
Hadoop
In-memory processing (Spark)
Data Warehouse RDBMS (Oracle, Teradata …)
In-memory Analytics (HANA, Exalytics …)
Analytic/BI software (SAS, Tableau)
Enterprise Data Hub / Data Lake / Data Reservoir
We Need Tools Built Specifically
for Big Data
Hadoop and its Eco System
• Scale out Easily
• Parallel Computing
• Commodity Hardware
• Solves some Problems
• Complex to Run
• Special Skills to Maintain
Cassandra
ETL for Unstructured Data
ETL for Structured Data
Hadoop Design Principles
• System shall manage and heal itself
– Automatically and transparently route around failure
– Speculatively execute redundant tasks if certain nodes are detected to be
slow
• Performance shall scale linearly
– Proportional change in capacity with resource change
• Compute should move to data
– Lower latency, lower bandwidth
• Simple core, modular and extensible
Hadoop History
• Dec 2004 – Google MapReduce paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• May 2009 – Hadoop sorts Petabyte in 17 hours
Google Software Architecture (circa 2005): Google Applications run on MapReduce and BigTable, which run on the Google File System (GFS)
Map Reduce
Start → Map (many map tasks running in parallel) → Reduce
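The Map and Reduce stages in the diagram can be illustrated with a small word count. This is only a single-process sketch of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and on a real cluster each input split would be handled by a separate map task.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Here the map calls run one after another; a cluster runs them in parallel.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data on hadoop", "hadoop stores big files"])
print(counts["hadoop"], counts["big"], counts["files"])  # 2 2 1
```

The map step needs no knowledge of other lines, which is what lets it run in parallel on whichever node holds the data.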
Hadoop Ecosystem
Client Access: Hue, Hive (SQL), Pig (PL/SQL)
Data Access: Sqoop, Flume
Data Mining: Mahout
Orchestration: Oozie
Coordination: ZooKeeper
Monitoring: Chukwa
MapReduce (Job Scheduling/Execution System; Streaming/Pipes APIs)
HBase (key-value store)
HDFS (Hadoop Distributed File System)
Java Virtual Machine
OS – Redhat, Suse, Ubuntu, Windows
Commodity Hardware
Networking
Hadoop – Simplified View
• MPP (Massively Parallel) hardware running database-like software
• “Data” is stored in parts, across multiple worker nodes
• “Work” operates in parallel, on the different parts of the table
Controller Worker Nodes
HDFS Architecture
Figure labels: Namenode; Metadata (Name, replicas..) (/home/foo/data,6. ..; Metadata ops; Block ops; Client Read; Client Write; Replication; Blocks; Datanodes on Rack1 and Rack2
Head Node Data 1 Data 2 Data 3 Data 4
MYFILE.TXT
..block1 -> block1
..block2 -> block2
..block3 -> block3
HDFS – Highly Available
Namenode and Datanodes
• Master/slave architecture
• An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
• There are a number of DataNodes, usually one per node in a cluster.
• The DataNodes manage storage attached to the nodes that they run on.
• HDFS exposes a file system namespace and allows user data to be stored in files.
• A file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
• DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
Hadoop 1 – Job & Task Trackers
Master Node - The majority of Hadoop deployments consist of several master node instances. Having more than one master node helps eliminate the risk of a single point of failure.
NameNode - These processes are charged with storing a directory tree of all files in the Hadoop Distributed File System (HDFS). They also keep track of where the file data is kept within the cluster. Client applications contact NameNodes when they need to locate a file, or add, copy or delete a file.
DataNodes - The DataNode stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode's address.
WorkerNode - Unlike master nodes, whose numbers we can count on one hand, a representative Hadoop deployment consists of dozens or hundreds of worker nodes, which provide enough processing power to analyze a few hundred terabytes all the way up to one petabyte. Each worker node includes a DataNode as well as a Task Tracker.
Map Reduce
Job Tracker / MapReduce Workload Management Layer - This process is assigned to interact with client applications. It is responsible for distributing MapReduce tasks to particular nodes within a cluster. This engine coordinates all aspects of job execution, such as scheduling and launching jobs.
Task Tracker - This is a process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a Job Tracker.
Data Replication Similar to that of ASM
• HDFS is designed to store very large files across machines in a large cluster.
• Each file is a sequence of blocks.
• All blocks in the file except the last are of the same size.
• Blocks are replicated for fault tolerance.
• Block size and replicas are configurable per file.
• The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
• BlockReport contains all the blocks on a Datanode.
Replica Placement & Rack Aware
• The placement of the replicas is critical to HDFS reliability and performance.
• Optimizing replica placement distinguishes HDFS from other distributed file systems.
• Rack-aware replica placement:
– Goal: improve reliability, availability and network bandwidth utilization
– Many racks; communication between racks goes through switches.
– Network bandwidth between machines on the same rack is greater than between machines on different racks.
– The Namenode determines the rack id for each DataNode.
• Placing every replica on a unique rack is simple but non-optimal: writes are expensive.
• With a replication factor of 3, replicas are placed: one on a node in the local rack, one on a different node in the local rack and one on a node in a different rack.
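The three-replica rule on this slide can be sketched in a few lines. This is an illustrative toy, not HDFS code: `place_replicas`, the `topology` dictionary and the node names are all hypothetical, and a real NameNode also weighs free space and load when choosing targets.

```python
from typing import Dict, List


def place_replicas(writer: str, topology: Dict[str, List[str]]) -> List[str]:
    """Pick three DataNodes for a block, following the slide's policy.

    topology maps a rack id to the DataNodes on that rack (made-up names).
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Replica 1: a node in the local rack (here, the writer's own node).
    first = writer
    # Replica 2: a different node in the same (local) rack.
    second = next(n for n in topology[local_rack] if n != first)
    # Replica 3: a node in a different rack, so losing one rack loses no data.
    remote_rack = next(r for r in topology if r != local_rack)
    third = topology[remote_rack][0]
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
print(replicas)  # ['dn1', 'dn2', 'dn3']
```

Only one of the three copies crosses the inter-rack switch, which is the trade-off between write cost and rack-failure tolerance that the slide describes.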
Replica Selection
• Replica selection for READ operation: HDFS tries to minimize the bandwidth
consumption and latency.
• If there is a replica on the Reader node then that is preferred.
• HDFS cluster may span multiple data centers: replica in the local data center
is preferred over the remote one.
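A minimal sketch of the read-side preference described above, assuming each replica is known by its node and data center. The `Replica` type, `choose_replica` and the node and data center names are illustrative assumptions, not part of HDFS.

```python
from typing import List, NamedTuple


class Replica(NamedTuple):
    node: str
    data_center: str


def choose_replica(reader_node: str, reader_dc: str, replicas: List[Replica]) -> Replica:
    """Return the replica that is closest to the reader."""
    def distance(r: Replica) -> int:
        if r.node == reader_node:
            return 0  # replica on the reader's own node
        if r.data_center == reader_dc:
            return 1  # replica in the reader's data center
        return 2      # replica in a remote data center
    return min(replicas, key=distance)


candidates = [
    Replica("dn7", "dc-east"),
    Replica("dn2", "dc-west"),
    Replica("dn1", "dc-west"),
]
print(choose_replica("dn1", "dc-west", candidates).node)  # dn1
print(choose_replica("dn9", "dc-west", candidates).node)  # dn2
```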
Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance
Understanding file structure
1 GB file
File is split into blocks
Each block is typically 64MB
Each block is stored as two files – one holding data and second for metadata, checksum
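The arithmetic behind this slide, as a small sketch: a 1 GB file with 64 MB blocks yields 16 blocks, and a file that does not divide evenly ends with a smaller last block. The helper name `block_layout` is made up for illustration.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 64 * MB):
    """Return (number of blocks, size of the last block in bytes)."""
    if file_size == 0:
        return 0, 0
    # Integer ceiling division avoids floating point rounding.
    blocks = (file_size + block_size - 1) // block_size
    last = file_size - (blocks - 1) * block_size
    return blocks, last


print(block_layout(1024 * MB))  # (16, 67108864)
print(block_layout(200 * MB))   # (4, 8388608)
```

In the second case the last block holds only 8 MB, since all blocks except the last are the same size.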
Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker
NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.
DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksum and generation stamp
• If block is half full, needs only half of the space of full block
• At start-up, connects to the NameNode and performs a handshake
• No binding to IP address or port, uses Storage ID
• Sends heartbeat to NameNode
DataNode
Storage ID:
XYZ001
Communication
Heartbeat (DataNode to NameNode), every 3 seconds: “I AM ALIVE”
• Total storage capacity
• Fraction of storage in use
• Number of data transfers currently in progress
Reply (NameNode to DataNode): instructs the DataNode to
• Replicate block to other node
• Remove local block replica
• Send immediate block report
• Shut down the node
No heartbeat for 10 minutes: the NameNode treats the DataNode as out of service
DataNodes in the figure: Storage IDs XYZ001, XYZ002, XYZ003
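The heartbeat bookkeeping in this figure can be sketched as follows. It is a toy model with an explicit clock passed in as `now`; the class and method names are hypothetical, and only the 10-minute threshold and the Storage IDs come from the slide.

```python
from typing import Dict, List

DEAD_AFTER_S = 10 * 60  # from the slide: no heartbeat for 10 minutes


class HeartbeatMonitor:
    """Toy NameNode-side record of the last heartbeat per DataNode."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, storage_id: str, now: float) -> None:
        # Record the time of the most recent "I AM ALIVE" message.
        self.last_seen[storage_id] = now

    def dead_nodes(self, now: float) -> List[str]:
        # A node counts as dead once its last heartbeat is too old.
        return sorted(
            sid for sid, seen in self.last_seen.items()
            if now - seen >= DEAD_AFTER_S
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("XYZ001", now=0.0)
monitor.heartbeat("XYZ002", now=0.0)
monitor.heartbeat("XYZ001", now=597.0)  # XYZ001 keeps reporting
print(monitor.dead_nodes(now=600.0))    # ['XYZ002']
```

Keying the table by Storage ID rather than by address matches the slide's point that a DataNode is not bound to an IP address or port.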
Coordination in a distributed system
• Coordination: An act that multiple nodes must perform together.
• Examples:
– Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
Introducing ZooKeeper
“ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers.” - ZooKeeper Wiki
ZooKeeper is much more than a distributed lock server!
What is ZooKeeper?
• An open source, high-performance coordination service for
distributed applications.
• Exposes common services in simple interface:
– naming
– configuration management
– locks & synchronization
– group services
… developers don't have to write them from scratch
• Build your own on it for specific needs.
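To make "leader election" concrete, here is an in-memory toy that imitates one common recipe: every member receives an increasing sequence number and the lowest surviving number leads. This is not the ZooKeeper client API; `ToyCoordinator` and its methods are invented for illustration, and a real deployment would rely on ZooKeeper to keep this state consistent across machines.

```python
from typing import Dict, Optional


class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._members: Dict[int, str] = {}  # sequence number -> member name

    def join(self, member: str) -> int:
        # Each member gets a monotonically increasing sequence number.
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = member
        return seq

    def leave(self, seq: int) -> None:
        # Models a member's session ending (crash or clean shutdown).
        self._members.pop(seq, None)

    def leader(self) -> Optional[str]:
        # The member holding the lowest sequence number is the leader.
        if not self._members:
            return None
        return self._members[min(self._members)]


coord = ToyCoordinator()
a = coord.join("node-a")
b = coord.join("node-b")
print(coord.leader())  # node-a
coord.leave(a)
print(coord.leader())  # node-b
```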
HDFS Distributions
Real Time BI
• Speed, agility, and intelligence are competitive advantages that nearly all
organizations seek.
• Existing Traditional Reporting Systems provide information after 24 – 36 hours.
• To support Operational Users and influence what should happen next, the data
should be available in real time to know what is happening now.
Hadoop 2.0
Timeline: 2006 (Hadoop w/ MapReduce), 2009 (MR-279: YARN), October 23, 2013 (Hadoop 2 & YARN)
Hadoop w/ MapReduce: MapReduce on HDFS (Hadoop Distributed File System) across nodes 1 ... N. Largely batch processing. Silo’d clusters, largely batch system, difficult to integrate.
Hadoop 2 & YARN based architecture: YARN: Data Operating System on HDFS (Hadoop Distributed File System) across nodes 1 ... N. Batch, Interactive, Real-Time. Enabled the Modern Data Architecture.
Hadoop 2.0
Multi Use Data Platform: Batch, Interactive, Realtime, Online, Streaming, …
HADOOP 2
Standard Query Processing: Hive
Batch: MapReduce
Interactive: Tez
Online Data Processing
Real Time Stream Processing
Others
Efficient Cluster Resource Management & Shared Services (YARN)
Redundant, Reliable Storage (HDFS)
Hadoop 2.0 with YARN
Resource Manager/Node Manager Components
Problems with this approach in Hadoop 1.0
• It limits scalability: the JobTracker runs on a single machine and performs several tasks:
1) Resource management
2) Job and task scheduling
3) Monitoring
Although many machines (DataNodes) are available, they are not used for this work, which limits scalability.
• Availability issue: in Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart.
• Distinct map slots and reduce slots
• Limitation in running non-MapReduce applications
Yarn Architecture
• Resource Manager: Arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications.
• Node Manager: A per-machine slave that runs on slave nodes and is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the Resource Manager.
• Application Master: Negotiates appropriate resource containers from the Scheduler, tracks their status and monitors progress.
• Container: Unit of allocation incorporating resource elements such as memory, CPU, disk, network etc., used to execute a specific task of the application (similar to map/reduce slots in MRv1).
Yarn - Execution Sequence
1) A client program submits the application.
2) The ResourceManager allocates a specified container to start the ApplicationMaster.
3) The ApplicationMaster, on boot-up, registers with the ResourceManager.
4) The ApplicationMaster negotiates with the ResourceManager for appropriate resource containers.
5) On successful container allocations, the ApplicationMaster contacts the NodeManager to launch the container.
6) Application code is executed within the container, and the execution status is reported back to the ApplicationMaster.
7) During execution, the client communicates directly with the ApplicationMaster or ResourceManager to get status, progress updates etc.
8) Once the application is complete, the ApplicationMaster unregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
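The sequence above can be walked through with a toy simulation. Everything here is a simplified assumption made for illustration: the `ResourceManager` class below only counts containers, the NodeManager is reduced to a log line, step 7 (client status polling) is left out, and none of the names correspond to real YARN classes or APIs.

```python
from typing import List


class ResourceManager:
    """Toy ResourceManager that hands out numbered containers."""

    def __init__(self, total_containers: int) -> None:
        self.free = total_containers
        self.registered: List[str] = []
        self._next_id = 0

    def allocate(self, count: int) -> List[int]:
        # Grant at most as many containers as are currently free.
        granted = min(count, self.free)
        self.free -= granted
        ids = list(range(self._next_id, self._next_id + granted))
        self._next_id += granted
        return ids

    def release(self, count: int) -> None:
        self.free += count


def run_application(rm: ResourceManager, name: str, tasks: List[str]) -> List[str]:
    log = [f"client submits {name}"]                      # step 1
    am_container = rm.allocate(1)                         # step 2
    log.append(f"RM starts ApplicationMaster in container {am_container[0]}")
    rm.registered.append(name)                            # step 3
    containers = rm.allocate(len(tasks))                  # step 4
    for cid, task in zip(containers, tasks):              # steps 5 and 6
        log.append(f"NodeManager runs {task} in container {cid}")
    rm.registered.remove(name)                            # step 8
    rm.release(len(containers) + len(am_container))
    log.append(f"{name} unregisters; containers released")
    return log


rm = ResourceManager(total_containers=4)
for entry in run_application(rm, "wordcount", ["map-0", "map-1", "reduce-0"]):
    print(entry)
```

Note that the ApplicationMaster itself occupies a container, so an application with three tasks needs four containers in this model.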
Operational vs. Analytical Databases
A New Technology
No Means Yes!
Use Cases
Brewer's CAP Theorem
Brewer's CAP Theorem
NoSQL Technology Spectrum
BigTable Data Model

Denormalized rows:
Name  Site             Counter
Dick  Ebay             507,018
Dick  Google           690,414
Jane  Google           716,426
Dick  Facebook         723,649
Jane  Facebook         643,261
Jane  ILoveLarry.com   856,767
Dick  MadBillFans.com  675,230

Normalized tables:
NameId  Name
1       Dick
2       Jane

SiteId  SiteName
1       Ebay
2       Google
3       Facebook
4       ILoveLarry.com
5       MadBillFans.com

NameId  SiteId  Counter
1       1       507,018
1       2       690,414
2       2       716,426
1       3       723,649
2       3       643,261
2       4       856,767
1       5       675,230

Wide rows (one row per name, one column per site):
Id  Name  Ebay     Google   Facebook  (other columns)  MadBillFans.com
1   Dick  507,018  690,414  723,649   . . .            675,230

Id  Name  Google   Facebook  (other columns)  ILoveLarry.com
2   Jane  716,426  643,261   . . .            856,767
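The step from flat rows to wide, sparse rows can be sketched with nested dictionaries. This is only an illustration of the data model using the figures from the tables; it is not how BigTable or HBase store data on disk, and `to_wide_rows` is a made-up helper name.

```python
from typing import Dict, List, Tuple

# (name, site, counter) rows as they appear in the denormalized table.
rows: List[Tuple[str, str, int]] = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]


def to_wide_rows(flat: List[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Group flat rows into one sparse, wide row per name."""
    wide: Dict[str, Dict[str, int]] = {}
    for name, site, counter in flat:
        # Each site becomes a column; absent sites simply have no entry.
        wide.setdefault(name, {})[site] = counter
    return wide


wide = to_wide_rows(rows)
print(wide["Jane"])
# {'Google': 716426, 'Facebook': 643261, 'ILoveLarry.com': 856767}
```

Jane's row has no Ebay column at all, which is the sparse, per-row column set that the wide rows in the slide show.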
Document databases
• Structured documents – XML and JSON
(JavaScript Object Notation) become more
prevalent within applications
• Web programmers start storing these in BLOBS in
MySQL
• Emergence of XML and JSON databases
Graph Database: Neo4J, Infinite Graph, FlockDB
Document, JSON based: MongoDB, CouchDB, RethinkDB
Document, XML based: MarkLogic, BerkeleyDB XML
Key Value: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak
Table Based: BigTable, Cassandra, HBase, HyperTable, Accumulo
Multiple Data Stores
Relational: Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available
Hadoop: Change the Business
• Scale-out, low cost store
• Collect any data
• Map-reduce, SQL
• Analytic applications
NoSQL: Scale the Business
• Scale-out, low cost store
• Collect key-value data
• Find data by key
• Web applications
Data Analytics Challenge
Separate silos of information to analyze
Data Analytics Challenge
Separate data access interfaces
SQL on Hadoop is Obvious
Stinger
Data Analytics Challenge
No comprehensive SQL interface across Oracle, Hadoop and NoSQL
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data
NoSQL
What Does Unified Query Mean for You?
After
Data Science
???
Anyone
Before
PhD
What Does Unified Query Mean for You?
After
Application Development
Before
A New Hadoop Processing Engine
Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
Resource Management (YARN)
Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)
Big Data SQL
SELECT w.sess_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND w.cust_id = c.customer_id;
Relevant SQL runs on BDA nodes
10’s of Gigabytes of Data
Only columns and rows needed to answer query are returned
Hadoop Cluster (Big Data SQL): WEB_LOGS
Oracle Database: CUSTOMERS
SQL Push Down in Big Data SQL
• Hadoop Scans on Unstructured Data
• WHERE Clause Evaluation
• Column Projection
• Bloom Filters for Better Join Performance
• JSON Parsing, Data Mining Model Evaluation
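The push-down ideas listed above (WHERE clause evaluation, column projection and a Bloom filter on the join key) can be sketched in plain Python. This is a conceptual toy, not Oracle Big Data SQL: the `BloomFilter` class, `storage_side_scan` and the sample rows are all assumptions made for illustration, using the table and column names from the query on the slide.

```python
import hashlib
from typing import Iterable


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3) -> None:
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: object) -> Iterable[int]:
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: object) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: object) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def storage_side_scan(web_logs, country, columns, join_filter):
    """Runs near the data: filter rows, then keep only the needed columns."""
    return [
        {col: row[col] for col in columns}               # column projection
        for row in web_logs
        if row["source_country"] == country              # WHERE clause evaluation
        and join_filter.might_contain(row["cust_id"])    # Bloom filter on join key
    ]


customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Bo"}]
web_logs = [
    {"sess_id": "s1", "cust_id": 1, "source_country": "Brazil", "url": "/a"},
    {"sess_id": "s2", "cust_id": 2, "source_country": "India", "url": "/b"},
    {"sess_id": "s3", "cust_id": 99, "source_country": "Brazil", "url": "/c"},
]

# Database side: build a Bloom filter over the join keys it holds.
join_filter = BloomFilter()
for c in customers:
    join_filter.add(c["customer_id"])

shipped = storage_side_scan(web_logs, "Brazil", ["sess_id", "cust_id"], join_filter)

# Database side: the exact join removes any Bloom-filter false positives.
names = {c["customer_id"]: c["name"] for c in customers}
result = [(r["sess_id"], names[r["cust_id"]]) for r in shipped if r["cust_id"] in names]
print(result)  # [('s1', 'Ana')]
```

Only the rows and columns that survive the storage-side scan are shipped to the database, which is the point of the "only columns and rows needed" note on the previous slide.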
Query All Data without Application Change or Data Conversion
Oracle Big Data SQL
INGEST PROCESS
VISUALIZE
ANALYZE
STORE
High Level Architecture
Fast Pace Innovation
Dec 18th 2015
http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
BDD Value Proposition
Note: company logos and images are for illustration purposes only. Not a real use case for the company.
Oracle BDD - Technical Innovation on Hadoop
Oracle Big Data Discovery Workloads
Hadoop Cluster (BDA or Commodity Hardware): BDD node, data nodes, name node
Data Processing, Workflow & Monitoring
• Profiling: catalog entry creation, data type & language detection, schema configuration
• Sampling: dgraph (index) file creation
• Transforms: >100 functions
• Enrichments: location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)
Self-Service Provisioning & Data Transfer
• Personal Data: Upload CSV and XLS to HDFS
In-Memory Discovery Indexes
• DGraph: Search, Guided Navigation, Analytics
Studio
• Web UI: Find, Explore, Transform, Discover, Share
Hadoop 2.x: Filesystem (HDFS), Workload Mgmt (YARN), Metadata (HCatalog)
Other Hadoop Workloads: MapReduce, Spark, Hive, Pig
Oracle Big Data SQL (BDA only)
Sample Enterprise Big Data Architecture
Operational RDBMS (Oracle, SQL Server, …)
ERP & in-house CRM
Web Server
Web DBMS (MySQL, Mongo, Cassandra)
Hadoop
In-memory processing (Spark)
Data Warehouse RDBMS (Oracle, Teradata …)
In-memory Analytics (HANA, Exalytics …)
Analytic/BI software (SAS, Tableau)
How to transition into a Cloud Consultant
Cloud Consultant = Core Skills (50%) + Automation (10%) + Cloud Knowledge (20%) + Tools & Integration (20%)
Thank You!
Satyendra.pasalapudi@appsassociates.com
@pasalapudi
https://community.oracle.com/groups/aioug-social-group
www.ora-search.com

More Related Content

What's hot

Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
AWR and ASH Advanced Usage with DB12c
AWR and ASH Advanced Usage with DB12cAWR and ASH Advanced Usage with DB12c
AWR and ASH Advanced Usage with DB12cKellyn Pot'Vin-Gorman
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016Łukasz Grala
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance UpdateCloudera, Inc.
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Getting Started with Azure SQL Database (Presented at Pittsburgh TechFest 2018)
Getting Started with Azure SQL Database (Presented at Pittsburgh TechFest 2018)Getting Started with Azure SQL Database (Presented at Pittsburgh TechFest 2018)
Getting Started with Azure SQL Database (Presented at Pittsburgh TechFest 2018)Chad Green
 
Concur Discovers the True Value of Data
Concur Discovers the True Value of DataConcur Discovers the True Value of Data
Concur Discovers the True Value of DataCloudera, Inc.
 
Azure SQL Database for the SQL Server DBA - Azure Bootcamp Athens 2018
Azure SQL Database for the SQL Server DBA - Azure Bootcamp Athens 2018 Azure SQL Database for the SQL Server DBA - Azure Bootcamp Athens 2018
Azure SQL Database for the SQL Server DBA - Azure Bootcamp Athens 2018 Antonios Chatzipavlis
 
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...Spark Summit
 
Azure SQL Database Introduction by Tim Radney
Azure SQL Database Introduction by Tim RadneyAzure SQL Database Introduction by Tim Radney
Azure SQL Database Introduction by Tim RadneyHasan Savran
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 

What's hot (20)

Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
AWR and ASH Advanced Usage with DB12c
AWR and ASH Advanced Usage with DB12cAWR and ASH Advanced Usage with DB12c
AWR and ASH Advanced Usage with DB12c
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Impala Performance Update
Aioug big data and hadoop

  • 1. © Copyright 2016. Apps Associates LLC. Big Data Overview & Hadoop for DBAs. Satyendra Pasalapudi, Associate Practice Director, Apps Associates LLC
  • 2. About Me: Satyendra Kumar Pasalapudi, Associate Practice Director, Infrastructure/Cloud Practice at Apps Associates; Co-Founder & President of All India Oracle Users Group (AIOUG); @pasalapudi
  • 3. www.ora-search.com
  • 4. History of Data Management Systems (timeline figure spanning 1940-50, 1950-60, 1960-70, 1970-80, 1980-90, 1990-2000, 2000-2010)
      - Pre-computer technologies: printing press, Dewey decimal system, punched cards
      - Storage and data models: magnetic tape “flat” (sequential) files, magnetic disk, Indexed Sequential Access Method (ISAM), hierarchical model, network model, relational model defined
      - Systems shown: IMS, IDMS, ADABAS, System R, Oracle V2, Ingres, dBase, DB2, Informix, Sybase, SQL Server, Access, Postgres, MySQL, Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, Hana, Neo4J, Aerospike
  • 5. @dvantages of Cloud
  • 6. Generational Change for Enterprise (IT)
      - Cloud supports mission-critical workloads: 87% of enterprises use cloud for mission-critical applications
      - Cloud use in the enterprise continues to grow: half of enterprises say they will use cloud for at least 75% of their workloads by 2018
      - No one cloud fits all: more than half (53%) of enterprises use two to four cloud providers
      - Source: Verizon 2016 State of the Market: Enterprise Cloud report
  • 7. Cloud – Probable to Inevitable
      - GE is undergoing the most important transformation in its 140-year history: 9000 applications to AWS and to 4000 applications; 300 ERPs (two years back) to a more manageable number; 34 data centers to 4 data centers
      - By 2020: US$15b of software revenue
      - Changes: People (reduce outsourcing); Technology (build approach for things that matter)
      - 20% of applications in cloud as of today; 70% of applications in cloud by 2020
      - Service Management; Network Perimeter; Risk Based Security Controls; Self Service and Automation; Financial Transparency
      - Source: AWS 2015 Keynote, Oct 6 2015; OOW Keynote with Mark Hurd, Oct 26 2015
  • 8. What is Cloud
  • 9. The Role of Data is Changing
  • 10. Until now, the questions you ask drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”.
  • 11. Data is the new raw material for any business, on par with capital, people, and labor.
  • 12. Characteristics of Big Data
  • 13. Big Data Challenge: cost-effectively manage and analyze all available data in its native form (unstructured, structured, streaming). Sources shown: ERP, CRM, RFID, website, network switches, social media, billing
  • 14. Hybrid Cloud Framework: HR, FIN, SCOM, SALES, PROCUREMENT, PLANNING, DW / BI
  • 15. Big Data Ecosystem
  • 16. Not Easy to Get Analytic Value at Fast Enough Pace
      - Tool complexity: early Hadoop tools only for experts; existing BI tools not designed for Hadoop; emerging solutions lack broad capabilities
      - Data uncertainty: not familiar and overwhelming; potential value not obvious; requires significant manipulation
      - 80% of effort typically spent on evaluating and preparing data
      - Overly dependent on scarce and highly skilled resources
      - Source: Oracle
  • 17. Key Challenges in Managing Big Data (Informatica study, May 2013). Addressed by Oracle Big Data Discovery
  • 18. Sample of Big Data Use Cases Today
      - Media / Entertainment: viewers / advertising effectiveness; cross sell
      - Communications: location-based advertising
      - Education & Research: experiment sensor analysis
      - Retail / CPG: sentiment analysis; hot products; optimized marketing
      - Health Care: patient sensors, monitoring, EHRs; quality of care
      - Life Sciences: clinical trials; genomics
      - High Technology / Industrial Mfg.: mfg quality; warranty analysis
      - Oil & Gas: drilling exploration sensor analysis
      - Financial Services: risk & portfolio analysis; new products
      - Automotive: auto sensors reporting location, problems
      - Games: adjust to player behavior; in-game ads
      - Law Enforcement & Defense: threat analysis (social media monitoring, photo analysis)
      - Travel & Transportation: sensor analysis for optimal traffic flows; customer sentiment
      - Utilities: Smart Meter analysis for network capacity,
      - On-line Services / Social Media: people & career matching; web-site optimization
      - What is the main difference in this data? Volume, Velocity, Variety. These characteristics challenge your existing architecture.
  • 19. Big Data Verticals
      - Media/Advertising: targeted advertising; image and video processing
      - Oil & Gas: seismic analysis
      - Retail: recommend; transactions analysis
      - Life Sciences: genome analysis
      - Financial Services: Monte Carlo simulations; risk analysis
      - Security: anti-virus; fraud detection; image recognition
      - Social Network/Gaming: user demographics; usage analysis; in-game metrics
  • 20. Sample Enterprise Big Data Architecture (diagram components): Operational RDBMS (Oracle, SQL Server, …); In-memory Analytics (HANA, Exalytics …); In-memory processing (Spark); Hadoop; Web DBMS (MySQL, Mongo, Cassandra); ERP & in-house CRM; Analytic/BI software (SAS, Tableau); Web Server; Data Warehouse RDBMS (Oracle, Teradata …)
  • 21. Enterprise Data Hub / Data Lake / Data Reservoir
  • 22. We Need Tools Built Specifically for Big Data
  • 23. Hadoop and Its Ecosystem
      - Scale out easily
      - Parallel computing
      - Commodity hardware
      - Solves some problems
      - Complex to run
      - Special skills to maintain
      - Also shown: Cassandra
  • 24. ETL for Unstructured Data
  • 25. ETL for Structured Data
  • 26. Hadoop Design Principles
      - System shall manage and heal itself: automatically and transparently route around failure; speculatively execute redundant tasks if certain nodes are detected to be slow
      - Performance shall scale linearly: proportional change in capacity with resource change
      - Compute should move to data: lower latency, lower bandwidth
      - Simple core, modular and extensible
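The "compute should move to data" principle can be illustrated with a small, hypothetical scheduler sketch. It is not Hadoop's actual scheduler: the function names, node names, and slot counts are assumptions made for illustration. The idea is only that a task is placed on a node that already holds a replica of its input block whenever that node has a free slot, and falls back to a remote node otherwise.

```python
from typing import Dict, List, Optional


def pick_node(replica_nodes: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Prefer a node that already stores the block; otherwise take any free node."""
    for node in replica_nodes:
        if free_slots.get(node, 0) > 0:
            return node  # data-local: no block transfer needed
    for node, slots in free_slots.items():
        if slots > 0:
            return node  # remote: the block must be read over the network
    return None  # cluster is full


def schedule(blocks: Dict[str, List[str]], free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign one task per block, consuming one slot per assignment."""
    slots = dict(free_slots)  # copy so the caller's view is not mutated
    plan: Dict[str, Optional[str]] = {}
    for block_id, replica_nodes in blocks.items():
        node = pick_node(replica_nodes, slots)
        if node is not None:
            slots[node] -= 1
        plan[block_id] = node
    return plan


if __name__ == "__main__":
    # Hypothetical block -> replica locations
    blocks = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n2", "n3"]}
    print(schedule(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

In this toy run every task lands on a node that holds its block, so no input data crosses the network.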
  • 27. Hadoop History
      - Dec 2004: Google MapReduce paper published (the GFS paper appeared in 2003)
      - July 2005: Nutch uses MapReduce
      - Feb 2006: Starts as a Lucene subproject
      - Apr 2007: Yahoo! on 1000-node cluster
      - Jan 2008: An Apache Top Level Project
      - Jul 2008: A 4000-node test cluster
      - May 2009: Hadoop sorts a petabyte in 17 hours
  • 28. Google Software Architecture (circa 2005): Google File System (GFS), MapReduce, BigTable, Google Applications
  • 30. Hadoop Ecosystem • HDFS (Hadoop Distributed File System) • HBase (key-value store) • MapReduce (Job Scheduling/Execution System) • (Streaming/Pipes APIs) • Data Access: Sqoop, Flume • Client Access: Hue • Hive (SQL) • Pig (PL/SQL) • ZooKeeper (Coordination) • Chukwa (Monitoring) • Data Mining: Mahout • Orchestration: Oozie • OS: Redhat, Suse, Ubuntu, Windows • Commodity Hardware • Java Virtual Machine • Networking
  • 31. Hadoop – Simplified View • MPP (Massively Parallel) hardware running database-like software • “Data” is stored in parts, across multiple worker nodes • “Work” operates in parallel, on the different parts of the table Controller Worker Nodes
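The "data in parts, work in parallel" idea can be sketched in a few lines. This is a single-machine illustration only: threads stand in for worker nodes, and the names `split_into_parts`, `worker_sum`, and `controller_sum` are hypothetical, not Hadoop APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(rows, num_workers):
    """Deal rows round-robin into one part per (simulated) worker node."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_sum(part):
    """The 'work' each worker runs locally, on its own part only."""
    return sum(part)


def controller_sum(rows, num_workers=4):
    """The controller fans the work out and combines the partial results."""
    parts = split_into_parts(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_sum, parts))
    return sum(partials)
```

The controller never touches individual rows; it only combines one small partial result per worker, which is what lets this pattern scale out.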
  • 32. HDFS Architecture
  • 33. HDFS Architecture (diagram) • Namenode: Metadata (Name, replicas, ...), e.g. /home/foo/data, 6, ... • Clients: Metadata ops to the Namenode; Read and Write to the Datanodes • Datanodes on Rack1 and Rack2 holding Blocks • Block ops • Replication
  • 34. HDFS – Highly Available (diagram) • Head Node • Data 1, Data 2, Data 3, Data 4 • MYFILE.TXT: block1 -> block1, block2 -> block2, block3 -> block3
  • 35. Namenode and Datanodes • Master/slave architecture • An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. • There are a number of DataNodes, usually one per node in the cluster. • The DataNodes manage storage attached to the nodes that they run on. • HDFS exposes a file system namespace and allows user data to be stored in files. • A file is split into one or more blocks, and the set of blocks is stored on DataNodes. • DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
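A toy model of the split between metadata and data may help here. It is a sketch under simplifying assumptions: `ToyNameNode` is a hypothetical class, blocks are assigned round-robin, replication is ignored, and the 64 MB block size is the figure quoted elsewhere in this deck.

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical size quoted in the deck


class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self._next_node = cycle(list(datanodes))
        self.namespace = {}  # path -> list of (block id, datanode name)

    def add_file(self, path, size_bytes):
        num_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
        self.namespace[path] = [
            (f"{path}#blk{i}", next(self._next_node)) for i in range(num_blocks)
        ]

    def locate(self, path):
        """A client asks here for block locations, then reads from DataNodes."""
        return self.namespace[path]
```

Note that no file contents pass through `ToyNameNode`; it answers "where", and the DataNodes serve the bytes.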
  • 36. Hadoop 1 – Job & Task Trackers Master Node - The majority of Hadoop deployments consist of several master node instances. Having more than one master node helps eliminate the risk of a single point of failure. NameNode - This process is charged with storing a directory tree of all files in the Hadoop Distributed File System (HDFS). It also keeps track of where the file data is kept within the cluster. Client applications contact the NameNode when they need to locate a file, or add, copy, or delete a file. DataNode - The DataNode stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode's address. Worker Node - Unlike master nodes, whose numbers we can count on one hand, a representative Hadoop deployment consists of dozens or hundreds of worker nodes, which provide enough processing power to analyze from a few hundred terabytes all the way up to one petabyte. Each worker node includes a DataNode as well as a Task Tracker.
  • 37. Map Reduce Job Tracker / MapReduce Workload Management Layer - This process is assigned to interact with client applications. It is responsible for distributing MapReduce tasks to particular nodes within a cluster. This engine coordinates all aspects of MapReduce job execution in Hadoop, such as scheduling and launching jobs. Task Tracker - This is a process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a Job Tracker.
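The Map, Shuffle, and Reduce tasks that a Task Tracker receives can be illustrated with the classic word count. This is an in-memory sketch, not the Hadoop MapReduce API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each `map_phase` call would run on the node holding that input split, and the shuffle would move data across the network between map and reduce tasks.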
  • 38. Data Replication (similar to that of ASM) • HDFS is designed to store very large files across machines in a large cluster. • Each file is a sequence of blocks. • All blocks in the file except the last are of the same size. • Blocks are replicated for fault tolerance. • Block size and replication factor are configurable per file. • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster. • A BlockReport lists all the blocks on a DataNode.
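One thing block reports make possible is spotting blocks with too few replicas. The sketch below assumes a report is simply a set of block ids per DataNode; `under_replicated` is a hypothetical helper, not an HDFS call.

```python
from collections import Counter


def under_replicated(block_reports, replication_factor=3):
    """Return {block id: number of missing replicas}.

    block_reports maps a datanode name to the set of block ids it reported.
    """
    counts = Counter(b for blocks in block_reports.values() for b in blocks)
    return {
        block: replication_factor - seen
        for block, seen in counts.items()
        if seen < replication_factor
    }
```

For example, if `b1` is reported by three DataNodes and `b2` by only two, the function returns `{"b2": 1}`, meaning one more copy of `b2` is needed.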
  • 39. Replica Placement & Rack Aware • The placement of the replicas is critical to HDFS reliability and performance. • Optimizing replica placement distinguishes HDFS from other distributed file systems. • Rack-aware replica placement: • Goal: improve reliability, availability and network bandwidth utilization • Many racks; communication between racks goes through switches. • Network bandwidth between machines on the same rack is greater than that between machines on different racks. • The Namenode determines the rack id for each DataNode. • Replicas are typically placed on unique racks • Simple but non-optimal • Writes are expensive • Replication factor is 3 • Replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
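The three-replica placement described on this slide can be sketched as follows. The function follows the slide's wording rather than any particular HDFS release, and it assumes the local rack has at least two nodes and that a second rack exists; `place_replicas` and the topology layout are hypothetical.

```python
def place_replicas(writer_node, topology):
    """Pick three target nodes for a block written from writer_node.

    topology maps a rack id to the list of node names on that rack.
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    same_rack_peer = next(n for n in topology[local_rack] if n != writer_node)
    remote_rack = next(r for r in topology if r != local_rack)
    return [writer_node, same_rack_peer, topology[remote_rack][0]]
```

Keeping two replicas on one rack limits cross-rack write traffic, while the third replica on another rack protects against the loss of a whole rack.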
  • 40. Replica Selection • Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency. • If there is a replica on the Reader node then that is preferred. • An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.
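The read-side preference order can be expressed as a small distance function. This is an illustrative sketch: representing a location as a `(node, data_center)` tuple and the name `pick_replica` are assumptions made for the example.

```python
def pick_replica(reader, replicas):
    """Choose the closest replica for a read.

    reader and every replica are (node, data_center) tuples.
    """
    def distance(replica):
        node, data_center = replica
        if node == reader[0]:
            return 0  # replica on the reader's own node
        if data_center == reader[1]:
            return 1  # replica in the reader's data center
        return 2      # replica in a remote data center

    return min(replicas, key=distance)
```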
  • 41. Hadoop Components • Hadoop is bundled with two independent components – HDFS (Hadoop Distributed File System) • Designed for scaling in terms of storage and IO bandwidth – MR framework (MapReduce) • Designed for scaling in terms of performance
  • 42. Understanding file structure • 1 GB file • File is split into blocks • Each block is typically 64 MB • Each block is stored as two files: one holding the data and a second for metadata and checksum
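The arithmetic behind the 1 GB example is worth making explicit. A minimal sketch, with `split_into_blocks` as a hypothetical helper:

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
GB = 1024 * MB

blocks = split_into_blocks(1 * GB, 64 * MB)
print(len(blocks))  # 16
```

A 1 GB file therefore occupies 16 blocks of 64 MB, while a 130 MB file would occupy two full blocks plus a final 2 MB block.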
  • 43. Hadoop Processes • Processes running on Hadoop – NameNode – DataNode – Secondary NameNode – Task Tracker – Job Tracker
  • 44. NameNode • Single point of contact • HDFS master • Holds meta information – List of files and directories – Location of blocks • Single node per cluster – Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.
  • 45. DataNode • Can execute multiple tasks concurrently • Holds actual data blocks, checksum and generation stamp • If a block is half full, it needs only half of the space of a full block • At start-up, connects to the NameNode and performs a handshake • No binding to IP address or port, uses Storage ID • Sends heartbeat to NameNode • Diagram: DataNode, Storage ID: XYZ001
  • 46. Communication • Heartbeat, every 3 seconds, from each DataNode (Storage ID: XYZ001, XYZ002, XYZ003) to the NameNode: “I AM ALIVE” – Total storage capacity – Fraction of storage in use – Number of data transfers currently in progress • Reply: the NameNode instructs the DataNode – Replicate block to other node – Remove local block replica – Send immediate block report – Shut down the node • No heartbeat for 10 minutes
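The timing rule on this slide can be sketched as a small monitor. The sketch assumes that ten minutes of silence means the NameNode treats the DataNode as dead; `HeartbeatMonitor` is a hypothetical class and timestamps are plain seconds passed in by the caller.

```python
DEAD_AFTER_SECONDS = 10 * 60  # "No heartbeat for 10 minutes", from the slide


class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode storage id."""

    def __init__(self):
        self.last_seen = {}  # storage id -> time of last heartbeat (seconds)

    def heartbeat(self, storage_id, now):
        self.last_seen[storage_id] = now

    def dead_nodes(self, now):
        """Storage ids silent for longer than the threshold (assumed dead)."""
        return sorted(
            sid for sid, seen in self.last_seen.items()
            if now - seen > DEAD_AFTER_SECONDS
        )
```

Keying on the storage id rather than an IP address matches the DataNode slide: a node that restarts with a new address is still recognised.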
  • 48. Coordination in a distributed system • Coordination: An act that multiple nodes must perform together. • Examples: – Group membership – Locking – Publisher/Subscriber – Leader Election – Synchronization • Getting node coordination correct is very hard!
  • 50. “ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers.” (Introducing ZooKeeper, ZooKeeper Wiki) • ZooKeeper is much more than a distributed lock server!
  • 51. What is ZooKeeper? • An open source, high-performance coordination service for distributed applications. • Exposes common services through a simple interface: – naming – configuration management – locks & synchronization – group services … so developers don't have to write them from scratch • Build your own on it for specific needs.
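Leader election is one of the coordination tasks commonly built on ZooKeeper: each process creates an ephemeral sequential node, and the lowest sequence number leads. The sketch below is an in-memory stand-in for that idea, not the ZooKeeper client API; `ToyElection` and its methods are hypothetical.

```python
import itertools


class ToyElection:
    """In-memory stand-in for ephemeral sequential nodes under one parent."""

    def __init__(self):
        self._sequence = itertools.count()
        self.members = {}  # sequence number -> process name

    def join(self, name):
        """Register a process; returns its sequence number."""
        seq = next(self._sequence)
        self.members[seq] = name
        return seq

    def leave(self, seq):
        """Models an ephemeral node vanishing when its session ends."""
        self.members.pop(seq, None)

    def leader(self):
        """The member with the lowest sequence number leads."""
        return self.members[min(self.members)] if self.members else None
```

When the current leader's node disappears, the next-lowest member becomes leader without any further negotiation, which is why this pattern is simple to reason about.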
  • 52. HDFS Distributions
  • 53. Real Time BI • Speed, agility, and intelligence are competitive advantages that nearly all organizations seek. • Existing Traditional Reporting Systems provide information after 24 – 36 hours. • To support Operational Users and influence what should happen next, the data should be available in real time to know what is happening now.
  • 54. Hadoop 2.0
  • 55. Hadoop timeline (diagram) • 2006, 2009: Hadoop w/ MapReduce (HDFS plus MapReduce on nodes 1…N): largely batch processing; silo’d clusters, largely batch system, difficult to integrate • MR-279: YARN • October 23, 2013: Hadoop 2 & YARN based architecture (HDFS plus YARN: Data Operating System, on nodes 1…N): Batch, Interactive, Real-Time; enabled the Modern Data Architecture
  • 56. Hadoop 2.0 Multi Use Data Platform: Batch, Interactive, Realtime, Online, Streaming, … • HADOOP 2 • Redundant, Reliable Storage (HDFS) • Efficient Cluster Resource Management & Shared Services (YARN) • Standard Query Processing: Hive • Batch: MapReduce • Online Data Processing • Interactive: Tez • Real Time Stream Processing • Others
  • 57. Hadoop 2.0 with YARN
  • 58. Resource Manager/Node Manager Components
  • 59. Problems with this approach in Hadoop 1.0 • It limits scalability: the JobTracker runs on a single machine and performs several tasks: 1) resource management, 2) job and task scheduling, and 3) monitoring. Although many machines (DataNodes) are available, they are not used for this work. • Availability issue: in Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart. • Distinct map slots and reduce slots • Limitation in running non-MapReduce applications
  • 60. Yarn Architecture • Resource Manager: arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications. • Node Manager: the per-machine slave, running on slave nodes, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the Resource Manager. • Application Master: negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors their progress. • Container: the unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1).
  • 61. Yarn - Execution Sequence 1) A client program submits the application 2) ResourceManager allocates a specified container to start the ApplicationMaster 3) ApplicationMaster, on boot-up, registers with ResourceManager 4) ApplicationMaster negotiates with ResourceManager for appropriate resource containers 5) On successful container allocations, ApplicationMaster contacts NodeManager to launch the container 6) Application code is executed within the container, and the execution status is then reported back to the ApplicationMaster 7) During execution, the client communicates directly with ApplicationMaster or ResourceManager to get status, progress updates etc. 8) Once the application is complete, ApplicationMaster unregisters with ResourceManager and shuts down, allowing its own container to be reclaimed
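The sequence above can be traced with a toy simulation. Everything here is hypothetical and in-process: `ToyResourceManager` and `run_application` are illustrative names, NodeManagers are not modelled separately, and step 7 (client status polling) is omitted.

```python
class ToyResourceManager:
    """Tracks free containers and registered ApplicationMasters."""

    def __init__(self, free_containers):
        self.free = free_containers
        self.registered = set()

    def allocate(self, wanted):
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def run_application(rm, app_id, tasks):
    log = []
    if rm.allocate(1) != 1:  # steps 1-2: container for the ApplicationMaster
        raise RuntimeError("no container available for the ApplicationMaster")
    rm.registered.add(app_id)  # step 3: register on boot-up
    log.append("registered")
    granted = rm.allocate(len(tasks))  # step 4: negotiate task containers
    results = [task() for task in tasks[:granted]]  # steps 5-6: launch and run
    log.append(f"ran {len(results)} tasks")
    rm.registered.discard(app_id)  # step 8: unregister
    rm.release(granted + 1)  # task containers plus the master's own container
    log.append("unregistered")
    return results, log
```

With three free containers and three tasks, one container goes to the ApplicationMaster, so only two tasks run in this simplified model; a real ApplicationMaster would keep requesting containers until all tasks complete.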
  • 62. Operational vs. Analytical Databases
  • 63. A New Technology
  • 65. Use Cases
  • 66. Brewer's CAP Theorem
  • 67. Brewer's CAP Theorem
  • 68. NoSQL Technology Spectrum
  • 69. BigTable Data Model
      Denormalized table:
        Name | Site | Counter
        Dick | Ebay | 507,018
        Dick | Google | 690,414
        Jane | Google | 716,426
        Dick | Facebook | 723,649
        Jane | Facebook | 643,261
        Jane | ILoveLarry.com | 856,767
        Dick | MadBillFans.com | 675,230
      Normalized tables:
        NameId | Name
        1 | Dick
        2 | Jane

        SiteId | SiteName
        1 | Ebay
        2 | Google
        3 | Facebook
        4 | ILoveLarry.com
        5 | MadBillFans.com

        NameId | SiteId | Counter
        1 | 1 | 507,018
        1 | 2 | 690,414
        2 | 2 | 716,426
        1 | 3 | 723,649
        2 | 3 | 643,261
        2 | 4 | 856,767
        1 | 5 | 675,230
      BigTable rows (one row per name, one sparse column per site):
        Id | Name | Ebay | Google | Facebook | (other columns) | MadBillFans.com
        1 | Dick | 507,018 | 690,414 | 723,649 | ... | 675,230

        Id | Name | Google | Facebook | (other columns) | ILoveLarry.com
        2 | Jane | 716,426 | 643,261 | ... | 856,767
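The move from one row per (name, site) pair to one sparse row per name can be sketched with nested dictionaries. The data comes from the slide; `to_wide_rows` is a hypothetical helper, and a real BigTable-style store adds column families and timestamps that are not modelled here.

```python
from collections import defaultdict


def to_wide_rows(records):
    """Pivot (name, site, counter) tuples into {row key: {column: value}}."""
    rows = defaultdict(dict)
    for name, site, counter in records:
        rows[name][site] = counter
    return dict(rows)


records = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]

wide = to_wide_rows(records)
```

Each row stores only the columns it actually has: Jane's row has no Ebay column at all, rather than a NULL.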
  • 70. Document databases • Structured documents – XML and JSON (JavaScript Object Notation) become more prevalent within applications • Web programmers start storing these in BLOBS in MySQL • Emergence of XML and JSON databases
  • 71. • Graph Database: Neo4J, Infinite Graph, FlockDB • Document – JSON based: MongoDB, CouchDB, RethinkDB – XML based: MarkLogic, BerkeleyDB XML • Key Value: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak • Table Based: BigTable, Cassandra, HBase, HyperTable, Accumulo
  • 72. Multiple Data Stores • Relational (Run the Business) – Scale-out and scale-up – Collect any data – SQL – Transactional and analytic applications for the enterprise – Secure and highly available • Hadoop (Change the Business) – Scale-out, low cost store – Collect any data – Map-reduce, SQL – Analytic applications • NoSQL (Scale the Business) – Scale-out, low cost store – Collect key-value data – Find data by key – Web applications
  • 73. Data Analytics Challenge: Separate silos of information to analyze
  • 74. Data Analytics Challenge: Separate data access interfaces
  • 75. SQL on Hadoop is Obvious Stinger
  • 76. Data Analytics Challenge: No comprehensive SQL interface across Oracle, Hadoop and NoSQL
  • 77. Oracle Big Data Management System: Rich, comprehensive SQL access to all enterprise data NoSQL
  • 78. What Does Unified Query Mean for You? Data Science • Before: PhD • After: Anyone
  • 79. What Does Unified Query Mean for You? Application Development • Before • After
  • 80. A New Hadoop Processing Engine • Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL • Resource Management (YARN) • Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)
  • 81. Big Data SQL • Query: SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; • Relevant SQL runs on BDA nodes • 10s of gigabytes of data • Only columns and rows needed to answer the query are returned • Diagram: Hadoop Cluster (WEB_LOGS) with Big Data SQL; Oracle Database (CUSTOMERS)
  • 82. Big Data SQL • Query: SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; • Relevant SQL runs on BDA nodes • 10s of gigabytes of data • Only columns and rows needed to answer the query are returned • Diagram: Hadoop Cluster (WEB_LOGS) with Big Data SQL; Oracle Database (CUSTOMERS) • SQL Push Down in Big Data SQL – Hadoop Scans on Unstructured Data – WHERE Clause Evaluation – Column Projection – Bloom Filters for Better Join Performance – JSON Parsing, Data Mining Model Evaluation
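WHERE-clause evaluation and column projection at the storage side can be illustrated in plain Python. This is a conceptual sketch, not Oracle Big Data SQL: `scan_with_pushdown` is a hypothetical function and the sample rows (including the customer names) are made up for the example.

```python
def scan_with_pushdown(rows, predicate, columns):
    """Filter and project where the data lives; yield only what is needed."""
    for row in rows:
        if predicate(row):
            yield {column: row[column] for column in columns}


# Made-up sample data standing in for WEB_LOGS on the Hadoop side.
web_logs = [
    {"sess_id": 1, "cust_id": 10, "source_country": "Brazil", "payload": "..."},
    {"sess_id": 2, "cust_id": 11, "source_country": "India", "payload": "..."},
    {"sess_id": 3, "cust_id": 10, "source_country": "Brazil", "payload": "..."},
]
# Made-up sample data standing in for CUSTOMERS in the database.
customers = {10: "Ana", 11: "Ravi"}  # customer_id -> name

# Only matching rows, and only the two needed columns, leave the "cluster".
returned = list(
    scan_with_pushdown(
        web_logs,
        lambda row: row["source_country"] == "Brazil",
        ["sess_id", "cust_id"],
    )
)
# The join then runs on the much smaller returned set.
result = [(row["sess_id"], customers[row["cust_id"]]) for row in returned]
```

The `payload` column and the non-matching row never cross the network, which is the point of pushing the filter and projection down to the nodes that hold the data.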
  • 83. Query All Data without Application Change or Data Conversion Oracle Big Data SQL
  • 85. Fast Pace Innovation Dec 18th 2015 http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  • 86. BDD Value Proposition Note: company logos and images are for illustration purposes only. Not a real use case for the company.
  • 87. Oracle BDD - Technical Innovation on Hadoop • Oracle Big Data Discovery Workloads – Data Processing, Workflow & Monitoring: Profiling (catalog entry creation, data type & language detection, schema configuration); Sampling (dgraph (index) file creation); Transforms (>100 functions); Enrichments (location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)) – Self-Service Provisioning & Data Transfer: Personal Data (upload CSV and XLS to HDFS) – In-Memory Discovery Indexes: DGraph (Search, Guided Navigation, Analytics) – Studio: Web UI (Find, Explore, Transform, Discover, Share) • Hadoop Cluster (BDA or Commodity Hardware): BDD node, data nodes, name node • Hadoop 2.x: Filesystem (HDFS), Workload Mgmt (YARN), Metadata (HCatalog) • Other Hadoop Workloads: MapReduce, Spark, Hive, Pig, Oracle Big Data SQL (BDA only)
  • 88. Sample Enterprise Big Data Architecture • Operational RDBMS (Oracle, SQL Server, …) • In-memory Analytics (HANA, Exalytics …) • In-memory processing (Spark) • Hadoop • Web DBMS (MySQL, Mongo, Cassandra) • ERP & in-house CRM • Analytic/BI software (SAS, Tableau) • Web Server • Data Warehouse RDBMS (Oracle, Teradata …)
  • 89. How to transition into a Cloud Consultant • Cloud Consultant = Core Skills (50%) + Automation (10%) + Cloud Knowledge (20%) + Tools & Integration (20%)
  • 92. www.ora-search.com