A very basic introduction to Big Data: what it is, its characteristics, and some examples of Big Data frameworks, with a Hadoop 2.0 example covering YARN, HDFS and MapReduce, plus ZooKeeper.
2. Definitions:
Webopedia[1]
Big data is used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process using
traditional database and software techniques.
Gartner [2]
Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization.
What is BIG Data?
3. Definitions:
National Institute of Standards and Technology (USA) [3]
Big Data consists of extensive datasets, primarily in the
characteristics of volume, velocity, and/or variety that require a
scalable architecture for efficient storage, manipulation, and
analysis.
Dataset characteristics that force a new architecture are:
1. the data-at-rest characteristics of Volume and Variety (i.e., data from multiple repositories, domains, or types), and
2. the data-in-motion characteristics of Velocity (i.e., rate of flow) and Variability (i.e., the change in velocity).
These characteristics are known as the ‘V’s’ of Big Data
What is Big Data
5. Big Data - 4Vs
IBM’s 4Vs of Big Data:
Volume – Data at Scale: terabytes to petabytes of data
Variety – Data in Many Forms: structured, unstructured, text, multimedia
Velocity – Data in Motion: analysis of streaming data to enable decisions within fractions of a second
Veracity – Data Uncertainty: managing the reliability and predictability of inherently imprecise data types
6. Validity: Refers to the quality of the data and its accuracy for the intended purpose for which it is collected
Volatility: The tendency for data structures to change over time. In a world of real-time data, you need to determine at what point data is no longer relevant to the current analysis
Value: The value, through insight, provided by Big Data analytics
More Vs of Big Data [10,3]
16. Existing storage and processing systems have the following limitations with respect to handling Big Data:
Way too much data to process within acceptable time limits: network bottlenecks, compute bottlenecks
Data needs to be structured before storing: months are needed to design and implement new schemas every time a new business need arises
Hard to retrieve archived data: it is not trivial to locate archive tapes and find the relevant data
Limitations of Existing Systems[15]
17. Distributed Computing: Horizontal Scaling instead of Vertical Scaling
Computations are done closer to where the data is stored
Instead of a centrally located parallel computing architecture with supercomputing capabilities (giga/teraflops), a low-capacity distributed storage/computing solution is used
Use of Low-Cost Commodity Hardware
Big Data solutions use a large number of low-cost commodity machines, organized in clusters, to carry out storage and computing tasks
Reliability, Fault Tolerance and Recovery
Individual nodes can fail at any time, so to ensure reliability, data is replicated across multiple nodes
Scaling with Demand
The solutions are scalable and allow cluster sizes to grow as required
Storage of Unstructured Data
Traditional RDBMS systems require a well-defined schema to be created before data can be stored (schema on write)
A new data storage paradigm, ‘NoSQL’, has evolved to cater to the need to store any type of data. It provides for schema on read, i.e. the schema is applied when the data is read.
No Archiving
Data is always online, so there is no archiving. Big Data solutions do not assume which queries the data will serve, so the rule is to store all data in raw form.
Characteristics of Big Data Systems
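The schema-on-write vs. schema-on-read distinction above can be sketched in a few lines of Python (a toy illustration, not a real storage engine): raw records are stored untouched, and a schema is applied only when the data is read.

```python
# Toy illustration of schema-on-read: store raw lines as-is,
# apply a schema (field names and types) only at read time.
raw_store = []

def write(line):
    # Schema-on-read: no validation or transformation at write time.
    raw_store.append(line)

def read(schema):
    # The schema is applied while reading; changing it needs no reload.
    for line in raw_store:
        fields = line.split(",")
        yield {name: cast(value)
               for (name, cast), value in zip(schema, fields)}

write("1,alice,34")
write("2,bob,29")
schema = [("id", int), ("name", str), ("age", int)]
rows = list(read(schema))
print(rows[0])  # {'id': 1, 'name': 'alice', 'age': 34}
```

Because the raw lines are never transformed, a different schema (say, treating `age` as a string) can be applied to the same stored data simply by reading again.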
19. Key Points to Note:
Comparison: Traditional vs Big Data Storage [15] – 1/2
1. Schema – Traditional Systems: schema on write (the schema must be created before data can be loaded). Big Data Storage: schema on read (data is simply stored, with no transformation).
2. Transformation – Traditional Systems: an explicit load operation transforms data into the DB's internal structure. Big Data Storage: a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
3. Storage Mechanism – Traditional Systems: a single seamless store of data, mostly on a single machine/location. Big Data Storage: distributed storage across multiple nodes/locations.
4. Distillation (organizing data for reads) – Traditional Systems: data is already distilled, as it is in a structured format. Big Data Storage: done on demand based on business needs, allowing new patterns and relationships to be identified in existing data.
20. Key Points to Note (Contd.):
Comparison: Traditional vs Big Data Storage – 2/2
5. Store Process – Traditional Systems: data is stored after preparation (for example, after the extract-transform-load and cleansing processes). Big Data Storage: in a high-velocity use case the data is prepared and analyzed for alerting, and only then stored; in a volume use case the data is often stored in the raw state in which it was produced.
6. Insights – Traditional Systems: analysis must be defined upfront and is hence rigid with respect to business needs. Big Data Storage: data can be analyzed as required, allowing data exploration and enabling the discovery of new insights that were not directly visible.
7. Action – Traditional Systems: technically feasible, but not effective due to data latency. Big Data Storage: can integrate with business decisioning systems for the next best action.
21. NoSQL database refers to a class of databases that do not use the relational model for data storage (the relational model uses tables and rows)
There are many NoSQL solutions; these are widely classified as:
1. Key-Value
2. Column-Family
3. Document
4. Graph
NoSQL (Not Only SQL) Databases
• The first three models are aggregate oriented
• An aggregate is a collection of related objects, treated as a single unit
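A minimal sketch of the key-value model, the simplest of the aggregate-oriented families above: the whole aggregate (here a Python dict) is stored and retrieved as a single unit under one key. The store and the order record are invented for illustration.

```python
# Toy key-value store: the value is an opaque aggregate,
# stored and fetched as a single unit by its key.
kv_store = {}

def put(key, aggregate):
    kv_store[key] = aggregate

def get(key):
    return kv_store.get(key)

# The aggregate is a collection of related objects treated as one unit.
put("order:1001", {"customer": "alice",
                   "items": [{"sku": "A1", "qty": 2}],
                   "total": 40.0})
order = get("order:1001")
print(order["total"])  # 40.0
```

The store never inspects the aggregate's internal structure; that opacity is what lets key-value systems shard and replicate aggregates freely across nodes.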
22. Google BigTable is a compressed, high-performance, proprietary data storage system used in Google projects. It is a column-family database. BigTable maps two arbitrary string values (row key and column key) and a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array. It is not a relational database and is better described as a sparse, distributed, multi-dimensional sorted map.
Google BigTable[16]
23. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's BigTable. It is a column-oriented DBMS that runs on top of HDFS.
Apache HBase[18]
24. Apache Cassandra is a column-family database system. It is designed as a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing a highly available service with no single point of failure.
Column-Family Database[18]
25. MongoDB stores data in the form of documents, which are JSON[21]-like field
and value pairs. Documents are analogous to structures in programming
languages that associate keys with values (e.g. dictionaries, hashes, maps,
and associative arrays). Formally, MongoDB documents
are BSON[20] documents. BSON is a binary representation of JSON with
additional type information.
Document Database [19]
MongoDB supports search by field, range queries and regular-expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions.
Any field in a MongoDB document can be indexed; secondary indices are also available.
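The field, range and regex queries described above can be mimicked in plain Python over JSON-like documents. This is a toy filter, not the real MongoDB query engine (which uses operators such as `$gte` and `$regex`); the documents and field names are invented for illustration.

```python
import re

# JSON-like documents, analogous to MongoDB's BSON documents.
docs = [
    {"name": "alice", "age": 34, "city": "Pune"},
    {"name": "bob",   "age": 29, "city": "Mumbai"},
    {"name": "carol", "age": 41, "city": "Pune"},
]

def find(docs, field=None, equals=None, min_age=None, pattern=None):
    # Field match, range query and regex search, in the spirit of
    # MongoDB filters like {"city": "Pune"} or {"age": {"$gte": 30}}.
    for d in docs:
        if equals is not None and d.get(field) != equals:
            continue
        if min_age is not None and d["age"] < min_age:
            continue
        if pattern is not None and not re.search(pattern, d["name"]):
            continue
        yield d

print([d["name"] for d in find(docs, field="city", equals="Pune")])
# ['alice', 'carol']
```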
26. Neo4j is an open-source NoSQL graph database implemented in Java and
Scala
Graph Database [30]
The property graph contains connected entities (the nodes), which can hold any number of attributes (key-value pairs).
Nodes can be tagged with labels, which, in addition to contextualizing node and relationship properties, may also serve to attach metadata (index or constraint information) to certain nodes.
Relationships provide directed, named, semantically relevant connections between two node entities.
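The property-graph model above (labeled nodes with key-value properties, directed named relationships) can be sketched as plain Python data structures. Neo4j itself is queried with Cypher, which this toy model does not attempt to reproduce; the node IDs and relationship types are invented for illustration.

```python
# Toy property graph: nodes hold labels and key-value properties,
# relationships are directed and named.
nodes = {
    1: {"labels": {"Person"},  "props": {"name": "Alice"}},
    2: {"labels": {"Person"},  "props": {"name": "Bob"}},
    3: {"labels": {"Company"}, "props": {"name": "Acme"}},
}
rels = [
    {"from": 1, "to": 3, "type": "WORKS_AT", "props": {"since": 2019}},
    {"from": 2, "to": 3, "type": "WORKS_AT", "props": {"since": 2021}},
    {"from": 1, "to": 2, "type": "KNOWS",    "props": {}},
]

def neighbours(node_id, rel_type):
    # Follow outgoing relationships of a given type.
    return [r["to"] for r in rels
            if r["from"] == node_id and r["type"] == rel_type]

# "Who works at company 3?" -- a traversal, not a table join.
employees = [n for n in nodes if 3 in neighbours(n, "WORKS_AT")]
print(employees)  # [1, 2]
```

The point of the graph model is that such traversals follow stored relationships directly instead of joining tables, which is why graph databases excel at highly connected data.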
28. Big data analytics refers to the process of collecting, organizing
and analysing large sets of data to discover patterns and other
useful information[23].
Conceptual Framework for Big Data analytics[24]:
Big Data Analytics – 1/
29. The data analytics project life cycle stages[27]:
Big Data Analytics – 2/
30. Following are the types of Big Data Analytics[27]:
Big Data Analytics – 3/
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
32. Apache Hadoop [6]
Apache Hadoop is widely used, open-source software for reliable, scalable, distributed
computing. Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage
33. Microsoft Dryad is an R&D project that provides an infrastructure allowing a programmer to use the resources of a computer cluster or a data center for running data-parallel programs
A Dryad programmer can use thousands of machines, each of them with multiple processors or cores
A Dryad job is a graph generator that can synthesize any directed acyclic graph
These graphs can even change during execution, in response to important events in the computation.
Big Data Solutions: Dryad[12]
34. LexisNexis – HPCC[22]
HPCC (High-Performance Computing Cluster), also known as DAS (Data
Analytics Supercomputer), is an open source, data-intensive computing
system platform developed by LexisNexis Risk Solutions. The HPCC platform
incorporates a software architecture implemented on commodity computing
clusters to provide high-performance, data-parallel processing for applications
utilizing big data.
Thor: batch processing engine
Roxie: high-performance query engine
35. Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program.
Apache Spark [28]
Lightning-fast cluster computing
36. The SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications
Once connected, Spark first acquires executors (processes that run computations and store data) on worker nodes. Next, it sends the application code (defined by JAR or Python files passed to SparkContext) to the executors.
Finally, the SparkContext sends tasks for the executors to run.
Apache Spark [28]
Lightning-fast cluster computing
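Spark's in-memory primitives can be approximated with a toy "RDD" that records a lineage of transformations and evaluates them lazily, keeping intermediate results in memory rather than writing them to disk between stages. This is a simplified single-process model, not the real Spark API; the class name and methods only mirror the RDD style.

```python
class ToyRDD:
    # A tiny, single-process model of an RDD: transformations are
    # recorded lazily in a lineage; data stays in memory between stages.
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.data, self.ops + (("filter", f),))

    def collect(self):
        # An action triggers evaluation of the whole lineage at once.
        out = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Recording the lineage instead of materializing each step is also what lets Spark recompute lost partitions after a failure.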
37. Quiz (Match The Following)
Big Data is generally defined by:
HPCC is an example of:
Cassandra is a:
MongoDB is a:
One key characteristic of a Big Data solution:
Characteristic of a traditional storage system:
NoSQL DB characteristic:
Options:
Schema design before storage (schema on write)
RDBMS
Column-family database
Key-value database
Graph database
Document database
3Vs – Volume, Velocity, Variety
4Vs – Volume, Velocity, Variety, Value
Big Data solution
Processing is closer to the data location
Schema on read
40. A Hadoop cluster consists of cheap commodity hardware, networked together as sets of servers in racks
Hadoop – 1/2
41. Hadoop framework allows for the distributed processing of large data sets across
clusters of computers. Can scale up from single servers to thousands of machines each
offering local computation and storage.
Designed to detect and handle failures so as to deliver a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data
sets.
Hadoop – 2/2 [6, 31]
42. HDFS is a fault-tolerant distributed file system
HDFS provides POSIX-like file system features:
File, directory and sub-directory structure
Permissions (rwx)
Access classes (owner, group, others) and a superuser
Optimized for storing large files with streaming data access (not random access)
The file system keeps checksums of data (CRC32 per 512 bytes) for corruption detection and recovery
Files are stored across multiple commodity machines in a cluster
Files are divided into uniform-sized blocks (64 MB, 128 MB or 256 MB)
Blocks are replicated across multiple machines to handle failures
Provides access to block locations (servers/racks), so computations can be done at the same locations (the servers/racks on which the data resides)
Hadoop Distributed File System – 1/9
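The way HDFS divides a file into uniform blocks and replicates each one can be sketched as a small calculation (a toy arithmetic model; real HDFS also applies rack-aware placement, covered later):

```python
def split_into_blocks(file_size, block_size):
    # A file is divided into uniform, fixed-size blocks;
    # only the last block may be smaller.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

MB = 1024 * 1024
BLOCK = 128 * MB                # a common HDFS block size
file_size = 300 * MB            # a hypothetical 300 MB file
blocks = split_into_blocks(file_size, BLOCK)
print([b // MB for b in blocks])  # [128, 128, 44]

# Each block is replicated (default factor 3) across DataNodes,
# so total raw storage is 3x the logical file size.
replication = 3
raw_bytes = file_size * replication
```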
43. HDFS – 2/9
HDFS is implemented as a master-slave architecture
The NameNode is the master; a Secondary NameNode periodically merges the NameNode's edit log into a checkpoint (it is not a hot standby)
DataNodes are the slaves
[Figure: the NameNode holds the metadata and exchanges checkpoints with the Secondary NameNode; DataNodes receive read/write commands and exchange data]
44. HDFS – 3/9
The NameNode manages the file system:
File system names (e.g. /home/foo/data/ ..) and metadata
Maps a file name to a set of blocks
Maps a block to the DataNodes where it resides
Tracks blocks within the file system and their replicas
Manages the cluster configuration
Manages the DataNodes
In case of NameNode failure, the namespace can be recovered from the Secondary NameNode's checkpoint (the Secondary NameNode does not automatically take over)
45. HDFS – 4/9
Meta Data
The HDFS namespace is a hierarchy of files and directories
The entire metadata is kept in memory
No demand paging
Consists of the list of blocks for each file, plus file attributes, e.g. access time, replication factor, etc.
Changes to HDFS are recorded in a log called the ‘transaction log’
Block placement: 3 replicas by default, configurable
One replica on the local node; the second on a remote rack; the third on a different node of the same remote rack
Additional copies are placed randomly
Clients read from the nearest replica
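The default placement policy above (first replica on the writer's node, second and third on two nodes of one remote rack) can be sketched as follows. This is a simplification of the real rack-aware policy, which also balances load and handles racks with too few nodes; the rack and node names are invented.

```python
def place_replicas(writer_node, racks):
    # racks: dict mapping rack name -> list of node names.
    # 1st replica: the node where the writer runs.
    local_rack = next(r for r, members in racks.items()
                      if writer_node in members)
    placement = [writer_node]
    # 2nd and 3rd replicas: two different nodes on one remote rack,
    # trading a little write bandwidth for rack-failure tolerance.
    remote_rack = next(r for r in racks if r != local_rack)
    placement += racks[remote_rack][:2]
    return placement

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn3', 'dn4']
```

Losing a whole rack still leaves at least one replica alive, while two of the three replica writes stay within a single rack's switch.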
46. HDFS: Data Nodes – 5/9
DataNode
A slave daemon process that reads/writes HDFS blocks from/to files in its local file system
During startup it performs a handshake with the NameNode to verify the namespace and software version (on a version mismatch, the DataNode shuts down)
Periodically sends heartbeats and block reports to the NameNode
A heartbeat carries the total storage capacity, the fraction used, ongoing data transfers, etc.; these statistics are used by the NameNode for block placement and load balancing
A block report has the block ID, timestamp and block length for each replica
Has no awareness of the HDFS file system as a whole
Performs block creation, deletion, replication, shutdown, etc. when the NameNode commands it
NameNode commands are sent as replies to heartbeat messages
Stores each HDFS block as a separate file in the underlying OS's file system
Maintains an optimal number of files per directory, creating new directories as needed
Interacts directly with clients to read/write blocks
47. Java and C++ APIs are available to access files on HDFS
The sample code illustrates writing to HDFS as a 3-step process:
HDFS Write – Sample Code: 6/9
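The original code sample is not reproduced here. The three-step pattern it illustrates (obtain a file-system handle, create an output stream, write and close) looks roughly like the following local-filesystem analogy in Python; the real HDFS client uses Java calls such as `FileSystem.get(conf)` and `fs.create(path)`.

```python
import pathlib
import tempfile

# Local-filesystem analogy of the 3-step HDFS write.
# Step 1: obtain a file-system handle (HDFS: FileSystem.get(conf)).
fs_root = pathlib.Path(tempfile.mkdtemp())

# Step 2: create an output stream for a new file (HDFS: fs.create(path)).
out = open(fs_root / "data.txt", "w")

# Step 3: write the data and close the stream; in HDFS, close() flushes
# the last block and makes the file visible to readers.
out.write("hello hdfs\n")
out.close()

print((fs_root / "data.txt").read_text())  # hello hdfs
```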
48. The figure below illustrates how a write takes place (how blocks and their replicas are updated):
HDFS Write Operations: 7/9
49. The sample code illustrates reading from HDFS as a 4-step process:
HDFS Read – Sample Code: 8/9
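As with the write sample, the original code is not reproduced; the four-step read pattern (obtain a file-system handle, open an input stream, read, close) looks roughly like this local-filesystem analogy in Python. In real HDFS the Java client calls `fs.open(path)` and then streams block data directly from the nearest DataNode replica.

```python
import pathlib
import tempfile

# Prepare a file to read, then mimic the 4-step HDFS read locally.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "data.txt").write_text("hello hdfs\n")

# Step 1: obtain a file-system handle (HDFS: FileSystem.get(conf)).
fs_root = tmp
# Step 2: open an input stream (HDFS: fs.open(path)).
stream = open(fs_root / "data.txt")
# Step 3: read the data (HDFS streams it from the nearest replica).
content = stream.read()
# Step 4: close the stream.
stream.close()

print(content)  # hello hdfs
```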
50. The figure below illustrates how a read takes place:
HDFS Read Operations: 9/9
51. YARN – Yet Another Resource Negotiator[33]
Manages compute resources across the cluster
Consists of the following components:
Resource Manager (RM)
Manages and allocates cluster compute resources
Node Manager (NM), one on each node
Manages and enforces node resource allocations
Application Master (AM)
One per application
Manages the application lifecycle and task scheduling
Container
The basic unit of allocation
Allows fine-grained resource allocation
52. YARN: Resource Manager
Resource Manager
Manages nodes – tracks heartbeats from NodeManagers
Manages containers
Handles AM requests for resources
De-allocates containers when they expire or the application completes
Manages ApplicationMasters (AMs)
Creates a container for each AM and tracks heartbeats
Manages security
Supports Kerberos
53. YARN: NodeManager
A Node Manager resides on each node
Registers with the ResourceManager (RM) and provides information on node resources
Sends periodic heartbeats and container status
Manages processes in containers
Launches AMs on request from the RM
Launches application processes on request from AMs
Monitors resource usage by containers; kills rogue processes
Provides logging services to applications
Aggregates logs for an application and saves them to HDFS
Maintains node-level security via ACLs
Container
Created by the Resource Manager upon request
Allocated a certain amount of resources (CPU, memory)
Applications run in one or more containers
Application Master (AM)
One per application
Framework/application specific
Runs in a container
Requests more containers to run application tasks
YARN: Containers and AMs
55. The client requests the RM to launch an application; the RM launches an Application Master on one NodeManager
YARN: Starting an App: 1/2
56. The Application Master (AM) requests resources from the RM; the RM allocates resources on the NodeManagers
The RM confirms the resource allocations to the AM with details (container ID and node, e.g. C1@NM1, C2@NM2); the AM then launches the application in the allocated containers
YARN: Starting an App: 2/2
A Resource Request specifies:
Resource Name (hostname, rack)
Priority (within this application)
Resources required: memory (MB), CPU (number of cores), etc.
Number of containers
A Container Launch Context specifies:
Container ID
Commands (to start the application)
Environment (configuration)
Local resources (e.g. the application binary, HDFS files)
57. MapReduce, originally a proprietary Google technology, is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, for tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
MapReduce[34]
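The programming model reduces to two user-supplied functions, map and reduce, with the framework handling the shuffle between them. A single-process word count, the canonical example, can be sketched in Python (a toy run, not tied to any particular framework):

```python
from collections import defaultdict

def map_phase(document):
    # map: emit (key, value) pairs -- here (word, 1) for each word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # shuffle: group all emitted values by key (done by the framework,
    # in parallel across the cluster, in a real deployment).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine the grouped values for one key.
    return (key, sum(values))

docs = ["big data is big", "data in motion"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 2
```

Because each map call sees one document and each reduce call sees one key, both phases parallelize trivially across machines; that independence is the core of the model.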
59. ZooKeeper exposes primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming.
Clients connect to servers to access a namespace, much like that of a standard file system, to store and retrieve coordination data: status information, configuration, location information, etc. The data is usually small, in the byte-to-kilobyte range.
Guarantees:
Sequential Consistency – updates are applied in the order in which they were sent.
Atomicity – updates either succeed or fail; there are no partial results.
Single System Image – the same view of the service regardless of the server used.
Reliability – once an update has been applied, it persists until it is overwritten.
Timeliness – the clients' view of the system is guaranteed to be up to date within a time bound.
ZooKeeper: A Distributed Coordination Service for Distributed Applications[29]
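The file-system-like namespace of small, versioned coordination records can be modelled as a toy in-memory znode tree. This sketch ignores everything that makes ZooKeeper interesting as a distributed system (replication, watches, ephemeral nodes, the ordering guarantees above); the paths and data are invented for illustration.

```python
# Toy znode namespace: hierarchical paths mapping to small data blobs,
# each carrying a version counter (every update is atomic per node).
znodes = {}

def create(path, data=b""):
    znodes[path] = {"data": data, "version": 0}

def set_data(path, data):
    node = znodes[path]
    node["data"] = data
    node["version"] += 1   # each successful update bumps the version

def get_data(path):
    node = znodes[path]
    return node["data"], node["version"]

create("/app")
create("/app/config", b"workers=4")
set_data("/app/config", b"workers=8")
print(get_data("/app/config"))  # (b'workers=8', 1)
```

Real clients (e.g. the `kazoo` library or Apache Curator) expose this same create/get/set shape, plus watches that notify clients when a znode changes.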
67. Government Operation: National Archives and Records Administration, Census
Bureau
Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web
Search, Digital Materials, Cargo shipping (as in UPS)
Defense: Sensors, Image surveillance, Situation Assessment
Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models,
Biodiversity
Deep Learning and Social Media: self-driving cars, geolocating images/cameras, Twitter, crowdsourcing, network science, NIST benchmark datasets
The Ecosystem for Research: Metadata, Collaboration, Language Translation,
Light source experiments
Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
Earth, Environmental and Polar Science: Radar Scattering in Atmosphere,
Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar
mapping, Climate simulation datasets, Atmospheric turbulence identification,
Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET
gas sensors
Energy: Smart grid
Big Data Applications
68. Real-Time Analytics: banking and finance, disaster detection and recovery, event monitoring, etc.; these applications need vast amounts of data, arriving at a very fast pace, to be processed within strict time limits
Artificial Intelligence/Business Intelligence:
An intelligent maintenance system utilizes the data collected from machinery in order to predict and prevent potential failures
IoT/M2M: these applications generate data at a very fast rate (high velocity) from a huge number of sources (high volume), and require Big Data solutions to process it and derive meaningful information.
Transreality gaming, sometimes written as trans-reality gaming, describes a
type or a mode of gameplay that combines playing a game in a virtual
environment with game-related, physical experiences in the real world and
vice versa.
Emerging Trends in Big Data
69. Cloud computing advances have helped Big Data emerge as a
mass scale solution
Leased/rented data storage and computing clusters enable even startups to have global-scale Big Data capability without major capital investment
Emerging Trends in Cloud Computing
– Complementary Technologies
70. Massively parallel processing refers to a multitude of individual processors
working in parallel to execute a particular program
The Big Data paradigm consists of the distribution of data systems across
horizontally coupled, independent resources to achieve the scalability needed for
the efficient processing of extensive datasets.
Big Data Engineering: Advanced techniques that harness independent resources
for building scalable data systems when the characteristics of the datasets
require new architectures for efficient storage, manipulation, and analysis.
NoSQL: Non-relational models, also known as NoSQL, refer to logical data models
that do not follow relational algebra for the storage and manipulation of data.
A federated database system is a type of meta-database management system (DBMS) which transparently maps multiple autonomous database systems into a single federated database.
Terms[3] – 1/
71. The data science paradigm is extraction of actionable knowledge directly from
data through a process of discovery, hypothesis, and hypothesis testing.
The data lifecycle is the set of processes that transform raw data into actionable
knowledge.
Analytics is the extraction of knowledge from information.
Data science is the construction of actionable knowledge from raw data through
the complete data lifecycle process.
A data scientist is a practitioner who has sufficient knowledge in the overlapping
regimes of business needs, domain knowledge, analytical skills, and software and
systems engineering to manage the end-to-end data processes through each
stage in the data lifecycle.
Schema-on-read is the application of a data schema through preparation steps
such as transformations, cleansing, and integration at the time the data is read
from the database.
Computational portability is the movement of the computation to the location of
the data.
Terms[3] – 2/
72. Transaction processing is a style of computing that divides work into individual, indivisible operations, called transactions.
Relational databases have traditionally supported the ACID transaction model. ACID transactions are:
Atomic – either all of the actions in a transaction are completed (i.e., the transaction is committed) or none of them are (i.e., the transaction is rolled back).
Consistent – the transaction must begin and end with the database in a consistent state and must comply with all protocols (i.e., rules) of the database.
Isolated – the transaction behaves as if it were the only operation being performed on the database.
Durable – the results of a committed transaction survive system malfunctions.
The BASE acronym is often used to describe the types of transactions typically supported by non-relational databases. A BASE system is described, in contrast to an ACID-compliant system, as:
Basically Available, Soft state, and Eventually consistent
BASE transactions allow a database to be in a temporarily inconsistent state that will eventually be resolved.
Terms[3] – 3/
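Atomicity, the ‘A’ in ACID, can be illustrated with a toy transaction that either applies all of its operations or rolls every one of them back. This is a sketch of the concept, not a real database engine; the accounts and amounts are invented for illustration.

```python
def run_transaction(db, operations):
    # Atomicity: apply all operations or none. Work on a copy and
    # commit only if every operation succeeds; any failure rolls back.
    snapshot = dict(db)
    try:
        for op in operations:
            op(snapshot)
    except Exception:
        return False            # rolled back: db is left untouched
    db.clear()
    db.update(snapshot)         # committed
    return True

accounts = {"alice": 100, "bob": 50}

def debit(who, amount):
    def op(db):
        if db[who] < amount:
            raise ValueError("insufficient funds")
        db[who] -= amount
    return op

def credit(who, amount):
    def op(db):
        db[who] += amount
    return op

# A successful transfer commits both the debit and the credit...
ok = run_transaction(accounts, [debit("alice", 30), credit("bob", 30)])
# ...while an overdraft rolls back everything, leaving balances intact.
failed = run_transaction(accounts, [debit("alice", 999), credit("bob", 999)])
print(accounts)  # {'alice': 70, 'bob': 80}
```

A BASE system would instead let the debit and credit land at different times, tolerating the temporary inconsistency described above.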
73. The CAP theorem states that a distributed system can support only two of the following three characteristics:
Consistency – the client perceives that a set of operations has occurred all at once.
Availability – every operation must terminate in an intended response.
Partition tolerance – operations will complete even if individual components are unavailable.
Terms[3] – 4/
74. 1. Webopedia: http://www.webopedia.com/TERM/B/big_data.html
2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012.
3. NIST definitions: http://bigdatawg.nist.gov/_uploadfiles/BD_Vol1-Definitions_V1Draft_Pre-release.pdf
4. Extreme Big Data: http://www.forbes.com/sites/oracle/2013/10/09/extreme-big-data-beyond-zettabytes-and-yottabytes/
5. Presto Project: https://prestodb.io/
6. Hadoop Project: http://hadoop.apache.org/
7. Xoriant Big Data Report: http://www.xoriant.com/big-data-services
8. Big Data Article: http://www.slideshare.net/Codemotion/codemotionws-bigdata-conf
9. Big Data Article at Data science central: www.datasciencecentral.com
10. Big Data Article by IBM: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
11. Big Data Article: http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
12. Dryad Project: http://research.microsoft.com/en-us/projects/Dryad/
13. Data Variety: http://www.bi-bestpractices.com/view-articles/5643
References
75. 14. Data Growth Article: http://www.businessinsider.in/Social-Networks-Like-Facebook-Are-Finally-Going-After-The-Massive-Amount-Of-Unstructured-Data-Theyre-Collecting/articleshow/31055495.cms
15. Cloudera Modern Data Operating System: http://www.slideshare.net/awadallah/introducing-apache-hadoop-the-modern-data-operating-system-stanford-ee380
16. Google BigTable: www.research.google.com/archive/bigtable-osdi06.pdf
17. Google Spanner: www.research.google.com/archive/spanner-osdi2012.pdf
18. Apache HBase: hbase.apache.org/
19. Mongo DB: www.mongodb.org/
20. BSON Specs: http://bsonspec.org/
21. JSON Specs: http://www.json.org/
22. LexisNexis HPCC: http://hpccsystems.com/
23. Definition of Big Data Analytics: http://www.webopedia.com/TERM/B/big_data_analytics.html
24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan
Kudyba, CRC Press.
25. Big Data Use Cases: http://bigdatawg.nist.gov/usecases.php
26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, PACKT publishing.
References
76. 27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics:
http://www.slideshare.net/AndersQuitzauIbm/big-data-analyticsin-energy-utilities
28. Apache Spark: https://spark.apache.org/
29. Zookeeper: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
30. Neo4j Database: http://neo4j.com/developer/graph-database/
31. Apache Hadoop: http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
32. HDFS: http://www.slideshare.net/hanborq/hadoop-hdfs-detailed-introduction
33. YARN: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex
34. MapReduce: https://cloud.google.com/appengine/docs/python/dataprocessing/
35. MapReduce@Wiki:http://en.wikipedia.org/wiki/MapReduce
36. Investments in Big Data:
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
37. Big Data Challenges: http://infographicsmania.com/big-data-challenges/
References