Cloud and Big Data
Farzad Nozarian (fnozarian@aut.ac.ir)
Amirkabir University of Technology
With the help of Dr. Amir H. Payberah (amir@sics.se)
Big Data Analytics Stack
Hadoop Big Data Analytics Stack
Spark Big Data Analytics Stack
Big Data - File systems
• Traditional file systems are not well suited to large-scale data
processing systems.
• Efficiency has a higher priority than other features, e.g., directory
service.
• The massive size of the data means it tends to be stored across multiple
machines in a distributed way.
• HDFS/GFS, Amazon S3, ...
Big Data - Database
• Relational Database Management Systems (RDBMSs) were not designed to be
distributed.
• NoSQL databases relax one or more of the ACID properties in favor of BASE
(Basically Available, Soft state, Eventual consistency).
• Different data models: key/value, column-family, graph, document.
• HBase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Voldemort, Riak,
Neo4j, ...
Big Data - Resource Management
• Different frameworks require different computing resources.
• Large organizations need the ability to share data and resources between
multiple frameworks.
• A resource manager shares the resources of a cluster between multiple
frameworks while providing resource isolation.
• Mesos, YARN, Quincy, ...
Big Data - Execution Engine
• Scalable and fault-tolerant parallel data processing on clusters of unreliable
machines.
• Data-parallel programming model for clusters of commodity machines.
• MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
Big Data - Query/Scripting Language
• Low-level programming of execution engines, e.g., MapReduce, is not easy
for end users.
• A high-level language is needed to improve the query capabilities of
execution engines.
• It translates user-defined functions to the low-level API of the execution
engines.
• Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
Big Data - Stream Processing
• Provides users with fresh, low-latency results.
• Database Management Systems (DBMS) vs. Data Stream Management Systems (DSMS)
• Storm, S4, SEEP, D-Stream, Naiad, ...
Big Data - Graph Processing
• Many problems are expressed using graphs: sparse computational
dependencies, and multiple iterations to converge.
• Data-parallel frameworks, such as MapReduce, are not ideal for these
problems: they are slow on such iterative workloads.
• Graph processing frameworks are optimized for graph-based problems.
• Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...
Big Data - Machine Learning
• Implementing and consuming machine learning techniques at scale are
difficult tasks for developers and end users.
• Some platforms address this by providing scalable machine learning
and data mining libraries.
• Mahout, MLBase, SystemML, Ricardo, Presto, ...
Big Data - Configuration and Synchronization Service
• A means to synchronize distributed applications’ access to shared resources.
• Allows distributed processes to coordinate with each other.
• Zookeeper, Chubby, ...
Hadoop Ecosystem
Hadoop Ecosystem-HDFS
• A foundational component of the Hadoop ecosystem is the Hadoop
Distributed File System (HDFS).
• HDFS is the mechanism by which a large amount of data can be distributed
over a cluster of computers; data is written once but read many times
for analytics.
• It provides the foundation for other tools, such as HBase.
Hadoop Ecosystem-MapReduce
• Hadoop’s main execution framework is MapReduce, a programming model
for distributed, parallel data processing that breaks jobs into map phases
and reduce phases.
• Developers write MapReduce jobs for Hadoop, using data stored in HDFS for
fast data access.
• Because of the way MapReduce works, Hadoop brings the processing to the
data in a parallel fashion, resulting in fast execution.
Hadoop Ecosystem-HBase
• A column-oriented NoSQL database built on top of HDFS, HBase is used
for fast read/write access to large amounts of data.
• HBase uses Zookeeper for its management to ensure that all of its
components are up and running.
Hadoop Ecosystem-Zookeeper
• Zookeeper is Hadoop’s distributed coordination service.
• Designed to run over a cluster of machines, it is a highly available service
used for the management of Hadoop operations, and many components of
Hadoop depend on it.
Hadoop Ecosystem-Oozie
• A scalable workflow system
• Oozie is integrated into the Hadoop stack, and is used to coordinate
execution of multiple MapReduce jobs.
• It is capable of managing a significant amount of complexity, basing
execution on external events that include timing and presence of required
data.
Hadoop Ecosystem-Pig
• An abstraction over the complexity of MapReduce programming
• The Pig platform includes an execution environment and a scripting language
(Pig Latin) used to analyze Hadoop data sets.
• Its compiler translates Pig Latin into sequences of MapReduce programs.
Hadoop Ecosystem-Hive
• An SQL-like, high-level language used to run queries on data stored in
Hadoop
• Hive enables developers not familiar with MapReduce to write data queries
that are translated into MapReduce jobs in Hadoop.
Hadoop Ecosystem-Sqoop
• A connectivity tool for moving data between Hadoop and relational databases
or data warehouses.
• Sqoop uses the database to describe the schema of the imported/exported
data, and MapReduce for parallel operation and fault tolerance.
Hadoop Ecosystem-Flume
• A distributed, reliable, and highly available service for efficiently collecting,
aggregating, and moving large amounts of data from individual machines to
HDFS.
• Provides streaming data flows.
• Allows data to be moved from multiple machines within an enterprise into
Hadoop.
Beyond the core components
• Whirr: a set of libraries that allows users to easily spin up Hadoop
clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• Mahout: a machine-learning and data-mining library that provides
MapReduce implementations of popular algorithms used for clustering, regression,
and statistical modeling.
• BigTop: a formal process and framework for packaging and
interoperability testing of Hadoop’s sub-projects and related components.
• Ambari: a project aimed at simplifying Hadoop management by
providing support for provisioning, managing, and monitoring Hadoop clusters.
Storing Data in Hadoop
HDFS - HBase
HDFS-Architecture
• The HDFS design is based on the design of the Google File System (GFS).
• To store a very large amount of data (terabytes or petabytes), HDFS is
designed to spread the data across a large number of machines, and to
support much larger file sizes than distributed file systems such as NFS.
• HDFS uses data replication, so that data survives the failure of individual
machines.
• To better integrate with Hadoop’s MapReduce, HDFS allows data to be read and
processed locally.
HDFS-Architecture
HDFS is implemented as a block-structured file system
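The block-structured idea can be illustrated with a small Python sketch. It is not HDFS code: the tiny block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS blocks default to 64 MB or 128 MB depending on the version, and real placement is rack-aware).

```python
BLOCK_SIZE = 128   # bytes, deliberately tiny for this sketch
REPLICATION = 3    # assumed replication factor


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    This is a simplification; it only shows that each block lives on
    several machines, not how HDFS actually chooses them.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 300)
    print([len(b) for b in blocks])
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Because each block is stored on several nodes, losing one node still leaves copies of every block it held.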
HDFS-Using HDFS Files
• User applications access the HDFS file system using an HDFS client
• Ways of accessing HDFS:
• FileSystem (FS) shell
• HDFS Java APIs
HBase-Architecture
• HBase is a distributed, versioned, column-oriented, multidimensional storage system,
designed for high performance and high availability.
• HBase is an open source implementation of Google’s BigTable architecture.
• Similar to traditional relational database management systems (RDBMSs), data in HBase is
organized in tables.
• Unlike RDBMSs, however, HBase supports a very loose schema definition, and does not
provide any joins, query language, or SQL.
• The main focus of HBase is on Create, Read, Update, and Delete (CRUD) operations on
wide sparse tables.
• HBase leverages HDFS for its persistent data storage.
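A toy model may help to picture the "versioned, multidimensional, sparse" data model described above. The sketch below is plain Python, not the HBase client API; the class name `ToyTable`, its methods, and the default of three retained versions are hypothetical choices for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Sparse map: row key -> column ("family:qualifier") -> versioned values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # Only cells that were actually written take up space (sparse table).
        self._rows = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        """Create or update a cell by adding a new timestamped version."""
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]                   # drop oldest versions

    def get(self, row, column, timestamp=None):
        """Read the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def delete(self, row, column):
        """Remove a cell with all of its versions."""
        self._rows.get(row, {}).pop(column, None)


if __name__ == "__main__":
    table = ToyTable()
    table.put("user1", "info:city", "Tehran", timestamp=1)
    table.put("user1", "info:city", "Stockholm", timestamp=2)
    print(table.get("user1", "info:city"))
    print(table.get("user1", "info:city", timestamp=1))
```

Note that there is no schema beyond the column name, and no join or query language: access is by row key and column only, which matches the CRUD focus stated above.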
Processing Data with MapReduce
MapReduce-Roadmap
• Understanding MapReduce fundamentals
• Getting to know MapReduce application execution
• Understanding MapReduce application design
MAPREDUCE-GETTING TO KNOW
• MapReduce is a framework for executing highly parallelizable and
distributable algorithms across huge data sets using a large number of
commodity computers.
• It was inspired by the map and reduce functions of functional programming
and introduced by Google in 2004.
• MapReduce was introduced to solve large-data computational problems, and
is specifically designed to run on commodity hardware.
• It is based on divide-and-conquer principles — the input data sets are split into
independent chunks, which are processed by the mappers in parallel.
MAPREDUCE-Execution Pipeline
MAPREDUCE-Runtime Coordination and Task Management
Word Count Implementation-Map Phase
Word Count Implementation-Reduce Phase
Word Count Implementation-Driver
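The three parts of a word count job (map phase, reduce phase, driver) can be sketched in a few lines of Python. This is an in-memory simulation of the programming model, not Hadoop's Java API; the function names and the single-process `run_job` driver are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(_offset, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """Reduce: sum all the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Driver: run the mapper, group values by key (shuffle), run the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):          # reducers see keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog the end")]
    print(dict(run_job(lines, map_phase, reduce_phase)))
```

In a real cluster the mapper calls run in parallel on different input splits and the grouping step moves data between machines; the sketch keeps only the data flow.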
DESIGNING MAPREDUCE IMPLEMENTATIONS
Necessary questions to reformulate the initial problem in terms of MapReduce
• How do you break up a large problem into smaller tasks? More specifically,
how do you decompose the problem so that the smaller tasks can be
executed in parallel?
• Which key/value pairs can you use as inputs/outputs of every task?
• How do you bring together all the data required for calculation? More
specifically, how do you organize processing in such a way that all the data
necessary for a calculation is in memory at the same time?
Simple Data Processing with MapReduce
Inverted Indexes Example
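As a rough illustration of an inverted index built in the MapReduce style, the following Python sketch maps each document to (word, document id) pairs and reduces each word to its sorted list of documents. The function names and the sample documents are hypothetical, and the sort-then-group step only stands in for the framework's shuffle.

```python
from itertools import groupby
from operator import itemgetter


def index_mapper(doc_id, text):
    """Map: emit (word, doc_id) once for every distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reducer(word, doc_ids):
    """Reduce: collect the sorted list of documents that contain the word."""
    yield word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, sort by key (stand-in for shuffle/sort), then reduce."""
    mapped = [pair
              for doc_id, text in documents.items()
              for pair in index_mapper(doc_id, text)]
    mapped.sort(key=itemgetter(0))
    index = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        doc_ids = [doc_id for _, doc_id in group]
        for out_word, postings in index_reducer(word, doc_ids):
            index[out_word] = postings
    return index


if __name__ == "__main__":
    docs = {
        "d1": "hadoop stores data",
        "d2": "spark processes data",
        "d3": "hadoop and spark",
    }
    print(build_inverted_index(docs))
```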
Building Joins with MapReduce
• Two “standard” implementations exist for joining data in MapReduce:
• Reduce-side join
• Map-side join
• The most common implementation of a join is a reduce-side join.
• A map-side join works very well in the case of one-to-one joins, where at most
one record from every data set has the same key.
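A reduce-side join can be sketched as follows: the mapper tags every record with the data set it came from and emits it under the join key, and the reducer combines the records of both data sets that share a key. This is a single-process Python illustration of an inner join; the names, the "left"/"right" tags, and the sample rows are assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def tag_mapper(source, record, key_field):
    """Map: emit the record under its join key, tagged with its data set."""
    yield record[key_field], (source, record)


def join_reducer(_key, tagged_records):
    """Reduce: pair every left record with every right record of this key."""
    left = [rec for src, rec in tagged_records if src == "left"]
    right = [rec for src, rec in tagged_records if src == "right"]
    for l_rec in left:
        for r_rec in right:
            yield {**l_rec, **r_rec}


def reduce_side_join(left_rows, right_rows, key_field):
    """Driver: map both inputs, group by join key, reduce each group."""
    groups = defaultdict(list)
    for source, rows in (("left", left_rows), ("right", right_rows)):
        for row in rows:
            for key, tagged in tag_mapper(source, row, key_field):
                groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined


if __name__ == "__main__":
    users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
    orders = [{"id": 1, "item": "disk"}, {"id": 1, "item": "cpu"},
              {"id": 3, "item": "ram"}]
    print(reduce_side_join(users, orders, "id"))
```

All records with the same key must reach the same reducer, which is why this variant pays for a full shuffle of both data sets.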
Road Enrichment Example
A simplified road enrichment algorithm
1. Find all links connected to a given node. For example, as shown in the figure,
node N1 has links L1, L2, L3, and L4, while node N2 has links L4, L5, and
L6.
2. Based on the number of lanes for every link at the node, calculate the road
width at the intersection.
3. Based on the road width, calculate the intersection geometry.
4. Based on the intersection geometry, move the road’s end point to tie it to
the intersection geometry.
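Step 1 above is a grouping problem, which maps naturally onto MapReduce: emit each link once per end node, then collect the links per node. The Python sketch below illustrates only that step. The node and link names N1, N2, and L1 to L6 follow the example in the text, but the far end nodes (N3 to N7) and the record layout are hypothetical, and the width and geometry calculations are left out because the slides do not define them.

```python
from collections import defaultdict

# Hypothetical link records: link key -> (start node, end node).
LINKS = {
    "L1": ("N1", "N3"),
    "L2": ("N1", "N4"),
    "L3": ("N1", "N5"),
    "L4": ("N1", "N2"),
    "L5": ("N2", "N6"),
    "L6": ("N2", "N7"),
}


def link_mapper(link_id, end_nodes):
    """Map: emit the link under each of its end nodes."""
    for node_id in end_nodes:
        yield node_id, link_id


def links_by_node(links):
    """Group the mapper output by node key, as a reducer would receive it."""
    groups = defaultdict(list)
    for link_id, end_nodes in links.items():
        for node_id, value in link_mapper(link_id, end_nodes):
            groups[node_id].append(value)
    return {node: sorted(ids) for node, ids in groups.items()}


if __name__ == "__main__":
    grouped = links_by_node(LINKS)
    print(grouped["N1"], grouped["N2"])
```

A reducer that receives all links of one node is then in a position to carry out steps 2 to 4 for that intersection.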
Algorithm assumptions
• A node is described with an object N with the key NN1 … NNm. For example, node
N1 can be described as NN1 and N2 as NN2. All nodes are stored in the nodes input
file.
• A link is described with an object L with the key LL1 … LLm. For example, link L1
can be described as LL1, L2 as LL2, and so on. All the links are stored in the links
source file.
• Also introduce an object of the type link or node (LN), which can have any key.
• Finally, it is necessary to define two more types: intersection (S) and road (R).
Phase 1 - Calculation of Intersection Geometry and Moving the Road’s End Points Job
Phase 2 - Merge Roads Job
Links Elevation Example
• This problem can be defined as follows: given a links graph and a terrain
model, convert two-dimensional (x, y) links into three-dimensional (x, y, z)
links. This process is called link elevation.
Simplified link elevation algorithm
1. Split every link into fixed-length fragments (for example, 10 meters).
2. For every piece, calculate heights (from the terrain model) for both its
start and end points.
3. Combine the pieces back into the original links.
Phase 1 - Split Links into Pieces and Elevate Each Piece Job
Phase 2 - Combine Link’s Pieces into Original Links Job
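The two phases can be sketched in Python under some simplifying assumptions: every link is a straight 2-D segment, the last piece of a link may be shorter than 10 meters, and `terrain_height` is a made-up planar terrain model standing in for real elevation data. None of the names below come from the original implementation.

```python
import math
from collections import defaultdict

PIECE_LENGTH = 10.0  # meters, as in the example above


def terrain_height(x, y):
    """Hypothetical terrain model: a tilted plane."""
    return 0.1 * x + 0.05 * y


def split_and_elevate(link_id, start, end, piece_length=PIECE_LENGTH):
    """Phase 1 (map): cut a link into pieces and elevate both ends of each piece."""
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    n = max(1, math.ceil(length / piece_length))
    bounds = [min(i * piece_length, length) for i in range(n + 1)]

    def point_at(distance):
        t = distance / length if length else 0.0
        x, y = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
        return (x, y, terrain_height(x, y))

    for i in range(n):
        # The piece index lets phase 2 restore the original order.
        yield link_id, (i, point_at(bounds[i]), point_at(bounds[i + 1]))


def combine_pieces(link_id, pieces):
    """Phase 2 (reduce): order the pieces and stitch them into one 3-D polyline."""
    ordered = sorted(pieces, key=lambda piece: piece[0])
    points = [ordered[0][1]] + [piece[2] for piece in ordered]
    return link_id, points


def elevate_links(links):
    """Driver: run phase 1, group the pieces by link key, run phase 2."""
    groups = defaultdict(list)
    for link_id, (start, end) in links.items():
        for key, piece in split_and_elevate(link_id, start, end):
            groups[key].append(piece)
    return dict(combine_pieces(key, pieces) for key, pieces in groups.items())


if __name__ == "__main__":
    print(elevate_links({"L1": ((0.0, 0.0), (25.0, 0.0))}))
```

Keying the pieces by link id is what allows the second job to bring all pieces of one link to the same reducer.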

More Related Content

What's hot

Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 

What's hot (19)

Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Hadoop
HadoopHadoop
Hadoop
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop
Hadoop Hadoop
Hadoop
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
10c introduction
10c introduction10c introduction
10c introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 

Similar to Big Data and Cloud Computing

Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsrishavkumar1402
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopArchana Gopinath
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedAnant Kumar
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptManiMaran230751
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 

Similar to Big Data and Cloud Computing (20)

Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Anju
AnjuAnju
Anju
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Analytics 3
Analytics 3Analytics 3
Analytics 3
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 

More from Farzad Nozarian

SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesFarzad Nozarian
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In RoboticsFarzad Nozarian
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineFarzad Nozarian
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelFarzad Nozarian
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab AssignmentFarzad Nozarian
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsFarzad Nozarian
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformFarzad Nozarian
 

More from Farzad Nozarian (14)

SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL Databases
 
Object Based Databases
Object Based DatabasesObject Based Databases
Object Based Databases
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In Robotics
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engine
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring Model
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
 

Recently uploaded

Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 

Recently uploaded (20)

Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 

Big Data and Cloud Computing

  • 1. Cloud and Big Data Farzad Nozarian (fnozarian@aut.ac.ir) Amirkabir University of Technology With the help of Dr. Amir H. Payberah (amir@sics.se)
  • 2.
  • 4. Hadoop Big Data Analytics Stack
  • 5. Spark Big Data Analytics Stack
  • 6. Big Data - File systems • Traditional file systems are not well designed for large-scale data processing systems. • Efficiency has a higher priority than other features, e.g., directory service. • Because of its massive size, data tends to be stored across multiple machines in a distributed way. • HDFS/GFS, Amazon S3, ...
  • 7. Big Data - Database • Relational Database Management Systems (RDBMS) were not designed to be distributed. • NoSQL databases relax one or more of the ACID properties: BASE • Different data models: key/value, column-family, graph, document. • HBase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Voldemort, Riak, Neo4j, ...
  • 8. Big Data - Resource Management • Different frameworks require different computing resources. • Large organizations need the ability to share data and resources between multiple frameworks. • Resource managers share the resources of a cluster between multiple frameworks while providing resource isolation. • Mesos, YARN, Quincy, ...
  • 9. Big Data - Execution Engine • Scalable and fault-tolerant parallel data processing on clusters of unreliable machines. • Data-parallel programming model for clusters of commodity machines. • MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
  • 10. Big Data - Query/Scripting Language • Low-level programming of execution engines, e.g., MapReduce, is not easy for end users. • A high-level language is needed to improve the query capabilities of execution engines. • It translates user-defined functions to the low-level API of the execution engines. • Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
  • 11. Big Data - Stream Processing • Providing users with fresh, low-latency results. • Database Management Systems (DBMS) vs. Data Stream Management Systems (DSMS) • Storm, S4, SEEP, D-Stream, Naiad, ...
  • 12. Big Data - Graph Processing • Many problems are expressed using graphs, with sparse computational dependencies and multiple iterations to converge. • Data-parallel frameworks, such as MapReduce, are not ideal for these problems: they are slow. • Graph processing frameworks are optimized for graph-based problems. • Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...
  • 13. Big Data - Machine Learning • Implementing and consuming machine learning techniques at scale are difficult tasks for developers and end users. • There exist platforms that address it by providing scalable machine learning and data mining libraries. • Mahout, MLBase, SystemML, Ricardo, Presto, ...
  • 14. Big Data - Configuration and Synchronization Service • A means to synchronize distributed applications' access to shared resources. • Allows distributed processes to coordinate with each other. • Zookeeper, Chubby, ...
  • 16. Hadoop Ecosystem-HDFS • A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). • HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers, and data is written once, but read many times for analytics. • It provides the foundation for other tools, such as HBase.
  • 17. Hadoop Ecosystem-MapReduce • Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases. • Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. • Because of the nature of how MapReduce works, Hadoop brings the processing to the data in a parallel fashion, resulting in fast execution.
  • 18. Hadoop Ecosystem-HBase • A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. • HBase uses Zookeeper for its management to ensure that all of its components are up and running.
  • 19. Hadoop Ecosystem-Zookeeper • Zookeeper is Hadoop’s distributed coordination service. • Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.
  • 20. Hadoop Ecosystem-Oozie • A scalable workflow system • Oozie is integrated into the Hadoop stack, and is used to coordinate execution of multiple MapReduce jobs. • It is capable of managing a significant amount of complexity, basing execution on external events that include timing and presence of required data.
  • 21. Hadoop Ecosystem-Pig • An abstraction over the complexity of MapReduce programming. • The Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. • Its compiler translates Pig Latin into sequences of MapReduce programs.
  • 22. Hadoop Ecosystem-Hive • An SQL-like, high-level language used to run queries on data stored in Hadoop • Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.
  • 23. Hadoop Ecosystem-Sqoop • A connectivity tool for moving data between Hadoop and relational databases or data warehouses. • Sqoop leverages the database to describe the schema for the imported/exported data, and MapReduce for parallel operation and fault tolerance.
  • 24. Hadoop Ecosystem-Flume • A distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. • Provides streaming data flows. • Allows data to be moved from multiple machines within an enterprise into Hadoop.
  • 25. Beyond the core components • Whirr: a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure. • Mahout: a machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression, and statistical modeling. • BigTop: a formal process and framework for packaging and interoperability testing of Hadoop’s sub-projects and related components. • Ambari: a project aimed at simplifying Hadoop management by providing support for provisioning, managing, and monitoring Hadoop clusters.
  • 26. Storing Data in Hadoop HDFS - HBase
  • 27. HDFS-Architecture • The HDFS design is based on the design of the Google File System (GFS). • To be able to store a very large amount of data (terabytes or petabytes), HDFS is designed to spread the data across a large number of machines, and to support much larger file sizes than distributed filesystems such as NFS. • HDFS uses data replication to store data reliably despite the failure of individual machines. • To better integrate with Hadoop’s MapReduce, HDFS allows data to be read and processed locally.
  • 28. HDFS-Architecture HDFS is implemented as a block-structured file system
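A rough sketch of what "block-structured" means in practice: a file is cut into fixed-size blocks, and each block is stored on several machines. The block size (128 MB), the replication factor (3), the node names, and the round-robin placement below are assumptions chosen for illustration; real HDFS makes these configurable and uses a rack-aware placement policy.

```python
# Toy model of a block-structured file system (not the HDFS implementation).
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This placement is a simplification made for the sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    for index, (offset, length) in enumerate(blocks):
        print(index, offset, length, placement[index])
```

Only the last block may be shorter than the block size, and losing any single node still leaves two copies of every block in this model.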
  • 29. HDFS-Using HDFS Files • User applications access the HDFS file system using an HDFS client. • HDFS can be accessed through: • the FileSystem (FS) shell • the HDFS Java APIs
  • 31. HBase-Architecture • HBase is a distributed, versioned, column-oriented, multidimensional storage system, designed for high performance and high availability. • HBase is an open source implementation of Google’s BigTable architecture. • Similar to traditional relational database management systems (RDBMSs), data in HBase is organized in tables. • Unlike RDBMSs, however, HBase supports a very loose schema definition, and does not provide any joins, query language, or SQL. • The main focus of HBase is on Create, Read, Update, and Delete (CRUD) operations on wide sparse tables. • HBase leverages HDFS for its persistent data storage.
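One way to picture the "versioned, column-oriented, multidimensional" data model is as a nested map from row key to column to timestamp to value. The sketch below is an in-memory toy under that assumption; the class name `ToyTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory toy of a sparse, versioned table (illustration only)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        """Create or update: store a new version of a cell."""
        self._rows[row][column][ts] = value

    def get(self, row, column, ts=None):
        """Read the newest version at or before ts (newest overall if ts is None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def delete(self, row, column):
        """Delete all versions of a cell; missing cells are ignored."""
        self._rows.get(row, {}).pop(column, None)


if __name__ == "__main__":
    table = ToyTable()
    table.put("row1", "info:name", "alice", ts=1)
    table.put("row1", "info:name", "alicia", ts=2)
    print(table.get("row1", "info:name"))        # newest version
    print(table.get("row1", "info:name", ts=1))  # version as of ts=1
```

Rows only hold the columns that were actually written, which is what makes wide tables cheap when they are sparse.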
  • 32. Processing Data with MapReduce
  • 33. MapReduce-Roadmap • Understanding MapReduce fundamentals • Getting to know MapReduce application execution • Understanding MapReduce application design
  • 34. MAPREDUCE-GETTING TO KNOW • MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge data sets using a large number of commodity computers. • It was inspired by the map and reduce functions of functional programming and introduced by Google in 2004. • MapReduce was introduced to solve large-data computational problems, and is specifically designed to run on commodity hardware. • It is based on divide-and-conquer principles: the input data sets are split into independent chunks, which are processed by the mappers in parallel.
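The divide-and-conquer flow described above can be imitated in a single process. The following is a minimal word-count sketch, assuming the input has already been split into text chunks; the function names are illustrative and this is not the Hadoop API, where the chunks would be processed on different machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def run_job(chunks):
    """Run map, shuffle, and reduce sequentially over independent chunks."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "Big cluster data data"]))
```

Because each chunk is mapped without looking at any other chunk, the map calls could run in parallel; only the shuffle needs to see all the pairs.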
  • 43. Necessary questions to reformulate the initial problem in terms of MapReduce • How do you break up a large problem into smaller tasks? More specifically, how do you decompose the problem so that the smaller tasks can be executed in parallel? • Which key/value pairs can you use as inputs/outputs of every task? • How do you bring together all the data required for calculation? More specifically, how do you organize processing in such a way that all the data necessary for the calculation is in memory at the same time?
  • 44. Simple Data Processing with MapReduce
  • 47. Building Joins with MapReduce • Two “standard” implementations exist for joining data in MapReduce: • Reduce-side join • Map-side join • The most common implementation of a join is a reduce-side join. • A map-side join works very well in the case of one-to-one joins, where at most one record from every data set has the same key.
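A reduce-side join can be sketched as follows: the mappers tag every record with the data set it came from, the shuffle brings records with the same key together, and the reducer pairs them up. This is a single-process illustration of an inner join; the tags "L" and "R", the function names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def map_tagged(records, tag):
    """Mapper: emit (join key, (source tag, value)) for each input record."""
    for key, value in records:
        yield key, (tag, value)


def reduce_join(key, tagged_values):
    """Reducer: pair every left value with every right value for one key."""
    left = [value for tag, value in tagged_values if tag == "L"]
    right = [value for tag, value in tagged_values if tag == "R"]
    return [(key, l, r) for l, r in product(left, right)]


def reduce_side_join(left_records, right_records):
    """Simulate the shuffle by grouping tagged records by key, then reduce."""
    groups = defaultdict(list)
    for key, tagged in map_tagged(left_records, "L"):
        groups[key].append(tagged)
    for key, tagged in map_tagged(right_records, "R"):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]                  # hypothetical data
    orders = [(1, "book"), (1, "pen"), (3, "cup")]    # hypothetical data
    print(reduce_side_join(users, orders))
```

Keys that appear in only one data set produce no output here, so the sketch behaves like an inner join; all records for a key must reach the same reducer, which is why the data is shuffled first.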
  • 49. A simplified road enrichment algorithm 1. Find all links connected to a given node. For example, as shown in the figure, node N1 has links L1, L2, L3, and L4, while node N2 has links L4, L5, and L6. 2. Based on the number of lanes for every link at the node, calculate the road width at the intersection. 3. Based on the road width, calculate the intersection geometry. 4. Based on the intersection geometry, move the road’s end point to tie it to the intersection geometry.
  • 50. Algorithm assumptions • A node is described with an object N with the key NN1… NNm. For example, node N1 can be described as NN1 and N2 as NN2. All nodes are stored in the nodes input file. • A link is described with an object L with the key LL1… LLm. For example, link L1 can be described as LL1, L2 as LL2, and so on. All the links are stored in the links source file. • Also introduce an object of the type link or node (LN), which can have any key. • Finally, it is necessary to define two more types: intersection (S) and road (R).
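Steps 1 and 2 of the road enrichment algorithm map naturally onto a map/reduce pair: the mapper emits each link under both of its end nodes, and the reducer sees all links that meet at one node. The sketch below assumes a link record of the form (start node, end node, lanes); the far-end nodes N3 to N7, the lane counts, the 3.5 m lane width, and the width formula are hypothetical values for illustration, and steps 3 and 4 (geometry) are left out.

```python
from collections import defaultdict

LANE_WIDTH_M = 3.5  # assumed lane width in meters (hypothetical)

# Hypothetical link records: link key -> (start node, end node, lanes).
# N1 and N2 match the example in the text; the other nodes are made up.
LINKS = {
    "L1": ("N1", "N3", 2),
    "L2": ("N1", "N4", 2),
    "L3": ("N1", "N5", 1),
    "L4": ("N1", "N2", 3),
    "L5": ("N2", "N6", 2),
    "L6": ("N2", "N7", 1),
}


def map_link(link_id, link):
    """Mapper: emit the link once for each node it touches."""
    start, end, lanes = link
    yield start, (link_id, lanes)
    yield end, (link_id, lanes)


def reduce_node(node_id, link_infos):
    """Reducer: list the links at a node and estimate the road width.

    The width formula (widest link times lane width) is an assumption.
    """
    link_ids = sorted(link_id for link_id, _ in link_infos)
    width = LANE_WIDTH_M * max(lanes for _, lanes in link_infos)
    return link_ids, width


def links_at_nodes(links):
    """Group mapper output by node key and run the reducer per node."""
    grouped = defaultdict(list)
    for link_id, link in links.items():
        for node_id, info in map_link(link_id, link):
            grouped[node_id].append(info)
    return {node: reduce_node(node, infos) for node, infos in grouped.items()}


if __name__ == "__main__":
    for node, (link_ids, width) in sorted(links_at_nodes(LINKS).items()):
        print(node, link_ids, width)
```

Using the node key as the shuffle key is what brings every link of an intersection into the same reducer call.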
  • 51. Phase 1: Calculation of Intersection Geometry and Moving the Road’s End Points Job
  • 53. Links Elevation Example • This problem can be defined as follows. Given a links graph and terrain model, convert two-dimensional (x, y) links into three-dimensional (x, y, z) links. This process is called link elevation.
  • 54. Simplified link elevation algorithm 1. Split every link into fixed-length fragments (for example, 10 meters). 2. For every piece, calculate heights (from the terrain model) for both the start and end points of the piece. 3. Combine the pieces together into the original links.
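The three steps can be sketched in one process as follows. The sketch assumes straight links given by two (x, y) end points and a terrain model exposed as a function; `terrain_height` here is a hypothetical stand-in (a plane rising with x), and the function names are illustrative rather than taken from the original jobs.

```python
import math
from collections import defaultdict


def terrain_height(x, y):
    """Hypothetical terrain model: height grows linearly with x."""
    return 0.1 * x


def split_link(link_id, start, end, piece_len=10.0):
    """Step 1: cut a straight link into pieces of at most piece_len."""
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    count = max(1, math.ceil(length / piece_len))

    def point(distance):
        t = distance / length if length else 0.0
        return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

    for i in range(count):
        a = point(min(i * piece_len, length))
        b = point(min((i + 1) * piece_len, length))
        yield link_id, (i, a, b)


def elevate(piece):
    """Step 2: attach a height to both end points of one piece."""
    index, a, b = piece
    return index, (*a, terrain_height(*a)), (*b, terrain_height(*b))


def combine(pieces):
    """Step 3: order the pieces by index and rebuild the 3D point list."""
    ordered = sorted(pieces)
    return [ordered[0][1]] + [piece[2] for piece in ordered]


def elevate_links(links, piece_len=10.0):
    """Run split and elevate per link, group by link key, then combine."""
    grouped = defaultdict(list)
    for link_id, (start, end) in links.items():
        for key, piece in split_link(link_id, start, end, piece_len):
            grouped[key].append(elevate(piece))
    return {key: combine(pieces) for key, pieces in grouped.items()}


if __name__ == "__main__":
    print(elevate_links({"L1": ((0.0, 0.0), (25.0, 0.0))}))
```

Keeping the piece index alongside the link key is what lets the second phase restore the original order after the pieces have been processed independently.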
  • 55. Phase 1: Split Links into Pieces and Elevate Each Piece Job
  • 56. Phase 2: Combine Link’s Pieces into Original Links Job