A very basic introduction to Big Data: what it is, its characteristics, and some examples of Big Data frameworks, with a Hadoop 2.0 example covering YARN, HDFS and MapReduce, plus ZooKeeper.
2. Definitions:
Webopedia[1]
Big data is used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process using
traditional database and software techniques.
Gartner [2]
Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization.
What is BIG Data?
3. Definitions:
National Institute of Standards and Technology (USA) [3]
Big Data consists of extensive datasets, primarily in the
characteristics of volume, velocity, and/or variety that require a
scalable architecture for efficient storage, manipulation, and
analysis.
Dataset characteristics that force a new architecture are:
1. the data-at-rest characteristics of Volume and Variety (i.e., data from multiple repositories, domains, or types), and
2. the data-in-motion characteristics of Velocity (i.e., rate of flow) and Variability (i.e., the change in velocity).
These characteristics are known as the ‘V’s’ of Big Data
What is Big Data
5. Big Data - 4Vs
IBM’s 4Vs of Big Data:
Volume – Data at Scale: terabytes to petabytes of data
Variety – Data in Many Forms: structured, unstructured, text, multimedia
Velocity – Data in Motion: analysis of streaming data to enable decisions within fractions of a second
Veracity – Data Uncertainty: managing the reliability and predictability of inherently imprecise data types
6. Validity: Refers to the quality of the data and its accuracy for the intended purpose for which it is collected
Volatility: The tendency for data structures to change over time. In a world of real-time data, you need to determine at what point data is no longer relevant to the current analysis
Value: The value, through insight, provided by Big Data analytics
More Vs of Big Data [10,3]
16. Existing storage and processing systems have the following limitations with respect to handling Big Data:
Way too much data to process within acceptable time limits: network bottlenecks, compute bottlenecks
Data needs to be structured before storing: months are needed to design and implement new schemas every time a new business need arises
Hard to retrieve archived data: it is not trivial to locate archive tapes and find the relevant data
Limitations of Existing Systems[15]
17. Distributed Computing: Horizontal Scaling instead of Vertical Scaling
Computations are done closer to where the data is stored
Instead of a centrally located parallel computing architecture with supercomputing capabilities (giga/teraflops), a low-capacity distributed storage/computing solution is used
Use of Low-Cost Commodity Hardware
Big Data solutions use a large number of low-cost commodity machines, organized in clusters, to carry out storage and computing tasks
Reliability, Fault Tolerance and Recovery
Individual nodes can fail at any time, so to ensure reliability, data is replicated across multiple nodes
Scaling with Demand
The solutions are scalable and allow cluster sizes to grow as required
Storage of Unstructured Data
Traditional RDBMS systems require a well-defined schema to be created before data can be stored (schema on write)
A new data storage paradigm, ‘NoSQL’, has evolved to cater to the need to store any type of data. It provides for schema on read, i.e. the schema is applied when the data is read.
No Archiving
Data is always online, so there is no archiving. Big Data solutions do not assume which queries the data will serve, so the rule is to store all data in raw form.
Characteristics of Big Data Systems
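The schema-on-write vs. schema-on-read distinction above can be sketched in a few lines of Python (a toy illustration, not a real storage engine): raw records are stored untouched, and a schema is applied only when the data is read.

```python
# Toy illustration of schema-on-read: store raw lines as-is,
# apply a schema (field names and types) only at read time.
raw_store = []

def write(line):
    # Schema-on-read: no validation or transformation at write time.
    raw_store.append(line)

def read(schema):
    # The schema is applied while reading; changing it needs no reload.
    for line in raw_store:
        fields = line.split(",")
        yield {name: cast(value)
               for (name, cast), value in zip(schema, fields)}

write("1,alice,34")
write("2,bob,29")
schema = [("id", int), ("name", str), ("age", int)]
rows = list(read(schema))
print(rows[0])  # {'id': 1, 'name': 'alice', 'age': 34}
```

Because the raw lines are never transformed, a different schema (say, treating `age` as a string) can be applied to the same stored data simply by reading again.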
19. Key Points to Note:
Comparison: Traditional vs Big Data Storage [15] – 1/2
1. Schema – Traditional Systems: schema on write (the schema must be created before data can be loaded). Big Data Storage: schema on read (data is simply stored, with no transformation).
2. Transformation – Traditional Systems: an explicit load operation transforms data into the DB's internal structure. Big Data Storage: a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
3. Storage Mechanism – Traditional Systems: a single seamless store of data, mostly on a single machine/location. Big Data Storage: distributed storage across multiple nodes/locations.
4. Distillation (organizing data for reads) – Traditional Systems: data is already distilled, as it is in a structured format. Big Data Storage: done on demand based on business needs, allowing new patterns and relationships to be identified in existing data.
20. Key Points to Note (Contd.):
Comparison: Traditional vs Big Data Storage – 2/2
5. Store Process – Traditional Systems: data is stored after preparation (for example, after the extract-transform-load and cleansing processes). Big Data Storage: in a high-velocity use case the data is prepared and analyzed for alerting, and only then stored; in a volume use case the data is often stored in the raw state in which it was produced.
6. Insights – Traditional Systems: analysis must be defined upfront and is hence rigid with respect to business needs. Big Data Storage: data can be analyzed as required, allowing data exploration and enabling the discovery of new insights that were not directly visible.
7. Action – Traditional Systems: technically feasible, but not effective due to data latency. Big Data Storage: can integrate with business decisioning systems for the next best action.
21. NoSQL database refers to a class of databases that do not use the relational model for data storage (the relational model uses tables and rows)
There are many NoSQL solutions; these are widely classified as:
1. Key-Value
2. Column-Family
3. Document
4. Graph
NoSQL (Not Only SQL) Databases
• The first three models are aggregate oriented
• An aggregate is a collection of related objects, treated as a single unit
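A minimal sketch of the key-value model, the simplest of the aggregate-oriented families above: the whole aggregate (here a Python dict) is stored and retrieved as a single unit under one key. The store and the order record are invented for illustration.

```python
# Toy key-value store: the value is an opaque aggregate,
# stored and fetched as a single unit by its key.
kv_store = {}

def put(key, aggregate):
    kv_store[key] = aggregate

def get(key):
    return kv_store.get(key)

# The aggregate is a collection of related objects treated as one unit.
put("order:1001", {"customer": "alice",
                   "items": [{"sku": "A1", "qty": 2}],
                   "total": 40.0})
order = get("order:1001")
print(order["total"])  # 40.0
```

The store never inspects the aggregate's internal structure; that opacity is what lets key-value systems shard and replicate aggregates freely across nodes.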
22. Google BigTable is a compressed, high-performance, proprietary data storage system used in Google projects. It is a column-family database. BigTable maps two arbitrary string values (row key and column key) and a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array. It is not a relational database and is better described as a sparse, distributed, multi-dimensional sorted map.
Google BigTable[16]
23. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's BigTable. It is a column-oriented DBMS that runs on top of HDFS.
Apache HBase[18]
24. Apache Cassandra is a column-family database system. It is designed as a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing a highly available service with no single point of failure.
Column-Family Database[18]
25. MongoDB stores data in the form of documents, which are JSON[21]-like field
and value pairs. Documents are analogous to structures in programming
languages that associate keys with values (e.g. dictionaries, hashes, maps,
and associative arrays). Formally, MongoDB documents
are BSON[20] documents. BSON is a binary representation of JSON with
additional type information.
Document Database [19]
MongoDB supports search by field, range queries and regular-expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions.
Any field in a MongoDB document can be indexed; secondary indices are also available.
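The field, range and regex queries described above can be mimicked in plain Python over JSON-like documents. This is a toy filter, not the real MongoDB query engine (which uses operators such as `$gte` and `$regex`); the documents and field names are invented for illustration.

```python
import re

# JSON-like documents, analogous to MongoDB's BSON documents.
docs = [
    {"name": "alice", "age": 34, "city": "Pune"},
    {"name": "bob",   "age": 29, "city": "Mumbai"},
    {"name": "carol", "age": 41, "city": "Pune"},
]

def find(docs, field=None, equals=None, min_age=None, pattern=None):
    # Field match, range query and regex search, in the spirit of
    # MongoDB filters like {"city": "Pune"} or {"age": {"$gte": 30}}.
    for d in docs:
        if equals is not None and d.get(field) != equals:
            continue
        if min_age is not None and d["age"] < min_age:
            continue
        if pattern is not None and not re.search(pattern, d["name"]):
            continue
        yield d

print([d["name"] for d in find(docs, field="city", equals="Pune")])
# ['alice', 'carol']
```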
26. Neo4j is an open-source NoSQL graph database implemented in Java and
Scala
Graph Database [30]
The property graph contains connected entities (the nodes), which can hold any number of attributes (key-value pairs).
Nodes can be tagged with labels, which, in addition to contextualizing node and relationship properties, may also serve to attach metadata (index or constraint information) to certain nodes.
Relationships provide directed, named, semantically relevant connections between two node entities.
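The property-graph model above (labeled nodes with key-value properties, directed named relationships) can be sketched as plain Python data structures. Neo4j itself is queried with Cypher, which this toy model does not attempt to reproduce; the node IDs and relationship types are invented for illustration.

```python
# Toy property graph: nodes hold labels and key-value properties,
# relationships are directed and named.
nodes = {
    1: {"labels": {"Person"},  "props": {"name": "Alice"}},
    2: {"labels": {"Person"},  "props": {"name": "Bob"}},
    3: {"labels": {"Company"}, "props": {"name": "Acme"}},
}
rels = [
    {"from": 1, "to": 3, "type": "WORKS_AT", "props": {"since": 2019}},
    {"from": 2, "to": 3, "type": "WORKS_AT", "props": {"since": 2021}},
    {"from": 1, "to": 2, "type": "KNOWS",    "props": {}},
]

def neighbours(node_id, rel_type):
    # Follow outgoing relationships of a given type.
    return [r["to"] for r in rels
            if r["from"] == node_id and r["type"] == rel_type]

# "Who works at company 3?" -- a traversal, not a table join.
employees = [n for n in nodes if 3 in neighbours(n, "WORKS_AT")]
print(employees)  # [1, 2]
```

The point of the graph model is that such traversals follow stored relationships directly instead of joining tables, which is why graph databases excel at highly connected data.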
28. Big data analytics refers to the process of collecting, organizing
and analysing large sets of data to discover patterns and other
useful information[23].
Conceptual Framework for Big Data analytics[24]:
Big Data Analytics – 1/
29. The data analytics project life cycle stages[27]:
Big Data Analytics – 2/
30. Following are the types of Big Data Analytics[27]:
Big Data Analytics – 3/
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
32. Apache Hadoop [6]
Apache Hadoop is widely used, open-source software for reliable, scalable, distributed
computing. Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage
33. Microsoft Dryad is an R&D project that provides an infrastructure allowing a programmer to use the resources of a computer cluster or a data center for running data-parallel programs
A Dryad programmer can use thousands of machines, each of them with multiple processors or cores
A Dryad job is a graph generator that can synthesize any directed acyclic graph
These graphs can even change during execution, in response to important events in the computation.
Big Data Solutions: Dryad[12]
34. LexisNexis – HPCC[22]
HPCC (High-Performance Computing Cluster), also known as DAS (Data
Analytics Supercomputer), is an open source, data-intensive computing
system platform developed by LexisNexis Risk Solutions. The HPCC platform
incorporates a software architecture implemented on commodity computing
clusters to provide high-performance, data-parallel processing for applications
utilizing big data.
Thor: batch processing engine
Roxie: high-performance query engine
35. Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program.
Apache Spark [28]
Lightning-fast cluster computing
36. The SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications
Once connected, Spark first acquires executors (processes that run computations and store data) on worker nodes. Next, it sends the application code (defined by JAR or Python files passed to SparkContext) to the executors.
Finally, the SparkContext sends tasks for the executors to run.
Apache Spark [28]
Lightning-fast cluster computing
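Spark's in-memory primitives can be approximated with a toy "RDD" that records a lineage of transformations and evaluates them lazily, keeping intermediate results in memory rather than writing them to disk between stages. This is a simplified single-process model, not the real Spark API; the class name and methods only mirror the RDD style.

```python
class ToyRDD:
    # A tiny, single-process model of an RDD: transformations are
    # recorded lazily in a lineage; data stays in memory between stages.
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.data, self.ops + (("filter", f),))

    def collect(self):
        # An action triggers evaluation of the whole lineage at once.
        out = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Recording the lineage instead of materializing each step is also what lets Spark recompute lost partitions after a failure.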
37. Quiz (Match The Following)
Big Data is generally defined by:
HPCC is an example of:
Cassandra is a:
MongoDB is a:
One key characteristic of a Big Data solution:
Characteristic of a traditional storage system:
NoSQL DB characteristic:
Options:
Schema design before storage (schema on write)
RDBMS
Column-family database
Key-value database
Graph database
Document database
3Vs – Volume, Velocity, Variety
4Vs – Volume, Velocity, Variety, Value
Big Data solution
Processing is closer to the data location
Schema on read
40. A Hadoop cluster consists of cheap commodity hardware, networked together as sets of servers in racks
Hadoop – 1/2
41. Hadoop framework allows for the distributed processing of large data sets across
clusters of computers. Can scale up from single servers to thousands of machines each
offering local computation and storage.
Designed to detect and handle failures so as to deliver a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data
sets.
Hadoop – 2/2 [6, 31]
42. HDFS is a fault-tolerant distributed file system
HDFS provides POSIX-like file system features:
File, directory and sub-directory structure
Permissions (rwx)
Access classes (owner, group, others) and a superuser
Optimized for storing large files with streaming data access (not random access)
The file system keeps checksums of data (CRC32 per 512 bytes) for corruption detection and recovery
Files are stored across multiple commodity machines in a cluster
Files are divided into uniform-sized blocks (64 MB, 128 MB or 256 MB)
Blocks are replicated across multiple machines to handle failures
Provides access to block locations (servers/racks), so computations can be done at the same locations (the servers/racks on which the data resides)
Hadoop Distributed File System – 1/9
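The way HDFS divides a file into uniform blocks and replicates each one can be sketched as a small calculation (a toy arithmetic model; real HDFS also applies rack-aware placement, covered later):

```python
def split_into_blocks(file_size, block_size):
    # A file is divided into uniform, fixed-size blocks;
    # only the last block may be smaller.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

MB = 1024 * 1024
BLOCK = 128 * MB                # a common HDFS block size
file_size = 300 * MB            # a hypothetical 300 MB file
blocks = split_into_blocks(file_size, BLOCK)
print([b // MB for b in blocks])  # [128, 128, 44]

# Each block is replicated (default factor 3) across DataNodes,
# so total raw storage is 3x the logical file size.
replication = 3
raw_bytes = file_size * replication
```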
43. HDFS – 2/9
HDFS is implemented as a master-slave architecture
The NameNode is the master; a Secondary NameNode periodically merges the NameNode's edit log into a checkpoint (it is not a hot standby)
DataNodes are the slaves
[Figure: the NameNode holds the metadata and exchanges checkpoints with the Secondary NameNode; DataNodes receive read/write commands and exchange data]
44. HDFS – 3/9
The NameNode manages the file system:
File system names (e.g. /home/foo/data/ ..) and metadata
Maps a file name to a set of blocks
Maps a block to the DataNodes where it resides
Tracks blocks within the file system and their replicas
Manages the cluster configuration
Manages the DataNodes
In case of NameNode failure, the namespace can be recovered from the Secondary NameNode's checkpoint (the Secondary NameNode does not automatically take over)
45. HDFS – 4/9
Meta Data
The HDFS namespace is a hierarchy of files and directories
The entire metadata is kept in memory
No demand paging
Consists of the list of blocks for each file, plus file attributes, e.g. access time, replication factor, etc.
Changes to HDFS are recorded in a log called the ‘transaction log’
Block placement: 3 replicas by default, configurable
One replica on the local node; the second on a remote rack; the third on a different node of the same remote rack
Additional copies are placed randomly
Clients read from the nearest replica
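The default placement policy above (first replica on the writer's node, second and third on two nodes of one remote rack) can be sketched as follows. This is a simplification of the real rack-aware policy, which also balances load and handles racks with too few nodes; the rack and node names are invented.

```python
def place_replicas(writer_node, racks):
    # racks: dict mapping rack name -> list of node names.
    # 1st replica: the node where the writer runs.
    local_rack = next(r for r, members in racks.items()
                      if writer_node in members)
    placement = [writer_node]
    # 2nd and 3rd replicas: two different nodes on one remote rack,
    # trading a little write bandwidth for rack-failure tolerance.
    remote_rack = next(r for r in racks if r != local_rack)
    placement += racks[remote_rack][:2]
    return placement

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn3', 'dn4']
```

Losing a whole rack still leaves at least one replica alive, while two of the three replica writes stay within a single rack's switch.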
46. HDFS: Data Nodes – 5/9
DataNode
A slave daemon process that reads/writes HDFS blocks from/to files in its local file system
During startup it performs a handshake with the NameNode to verify the namespace and software version (on a version mismatch, the DataNode shuts down)
Periodically sends heartbeats and block reports to the NameNode
A heartbeat carries the total storage capacity, the fraction used, ongoing data transfers, etc.; these statistics are used by the NameNode for block placement and load balancing
A block report has the block ID, timestamp and block length for each replica
Has no awareness of the HDFS file system as a whole
Performs block creation, deletion, replication, shutdown, etc. when the NameNode commands it
NameNode commands are sent as replies to heartbeat messages
Stores each HDFS block as a separate file in the underlying OS's file system
Maintains an optimal number of files per directory, creating new directories as needed
Interacts directly with clients to read/write blocks
47. Java and C++ APIs are available to access files on HDFS
The sample code illustrates writing to HDFS as a 3-step process:
HDFS Write – Sample Code: 6/9
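The original code sample is not reproduced here. The three-step pattern it illustrates (obtain a file-system handle, create an output stream, write and close) looks roughly like the following local-filesystem analogy in Python; the real HDFS client uses Java calls such as `FileSystem.get(conf)` and `fs.create(path)`.

```python
import pathlib
import tempfile

# Local-filesystem analogy of the 3-step HDFS write.
# Step 1: obtain a file-system handle (HDFS: FileSystem.get(conf)).
fs_root = pathlib.Path(tempfile.mkdtemp())

# Step 2: create an output stream for a new file (HDFS: fs.create(path)).
out = open(fs_root / "data.txt", "w")

# Step 3: write the data and close the stream; in HDFS, close() flushes
# the last block and makes the file visible to readers.
out.write("hello hdfs\n")
out.close()

print((fs_root / "data.txt").read_text())  # hello hdfs
```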
48. The figure below illustrates how a write takes place (how blocks and their replicas are updated):
HDFS Write Operations: 7/9
49. The sample code illustrates reading from HDFS as a 4-step process:
HDFS Read – Sample Code: 8/9
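As with the write sample, the original code is not reproduced; the four-step read pattern (obtain a file-system handle, open an input stream, read, close) looks roughly like this local-filesystem analogy in Python. In real HDFS the Java client calls `fs.open(path)` and then streams block data directly from the nearest DataNode replica.

```python
import pathlib
import tempfile

# Prepare a file to read, then mimic the 4-step HDFS read locally.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "data.txt").write_text("hello hdfs\n")

# Step 1: obtain a file-system handle (HDFS: FileSystem.get(conf)).
fs_root = tmp
# Step 2: open an input stream (HDFS: fs.open(path)).
stream = open(fs_root / "data.txt")
# Step 3: read the data (HDFS streams it from the nearest replica).
content = stream.read()
# Step 4: close the stream.
stream.close()

print(content)  # hello hdfs
```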
50. The figure below illustrates how a read takes place:
HDFS Read Operations: 9/9
51. YARN – Yet Another Resource Negotiator[33]
Manages compute resources across the cluster
Consists of the following components:
Resource Manager (RM)
Manages and allocates cluster compute resources
Node Manager (NM), one on each node
Manages and enforces node resource allocations
Application Master (AM)
One per application
Manages the application lifecycle and task scheduling
Container
The basic unit of allocation
Allows fine-grained resource allocation
52. YARN: Resource Manager
Resource Manager
Manages nodes – tracks heartbeats from NodeManagers
Manages containers
Handles AM requests for resources
De-allocates containers when they expire or the application completes
Manages ApplicationMasters (AMs)
Creates a container for each AM and tracks heartbeats
Manages security
Supports Kerberos
53. YARN: NodeManager
A Node Manager resides on each node
Registers with the ResourceManager (RM) and provides information on node resources
Sends periodic heartbeats and container status
Manages processes in containers
Launches AMs on request from the RM
Launches application processes on request from AMs
Monitors resource usage by containers; kills rogue processes
Provides logging services to applications
Aggregates logs for an application and saves them to HDFS
Maintains node-level security via ACLs
Container
Created by the Resource Manager upon request
Allocated a certain amount of resources (CPU, memory)
Applications run in one or more containers
Application Master (AM)
One per application
Framework/application specific
Runs in a container
Requests more containers to run application tasks
YARN: Containers and AMs
55. The client requests the RM to launch an application; the RM launches an Application Master on one NodeManager
YARN: Starting an App: 1/2
56. The Application Master (AM) requests resources from the RM; the RM allocates resources on the NodeManagers
The RM confirms the resource allocations to the AM with details (container ID and node, e.g. C1@NM1, C2@NM2); the AM then launches the application in the allocated containers
YARN: Starting an App: 2/2
A Resource Request specifies:
Resource Name (hostname, rack)
Priority (within this application)
Resources required: memory (MB), CPU (number of cores), etc.
Number of containers
A Container Launch Context specifies:
Container ID
Commands (to start the application)
Environment (configuration)
Local resources (e.g. the application binary, HDFS files)
57. MapReduce, originally a proprietary Google technology, is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, for tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
MapReduce[34]
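The programming model reduces to two user-supplied functions, map and reduce, with the framework handling the shuffle between them. A single-process word count, the canonical example, can be sketched in Python (a toy run, not tied to any particular framework):

```python
from collections import defaultdict

def map_phase(document):
    # map: emit (key, value) pairs -- here (word, 1) for each word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # shuffle: group all emitted values by key (done by the framework,
    # in parallel across the cluster, in a real deployment).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine the grouped values for one key.
    return (key, sum(values))

docs = ["big data is big", "data in motion"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 2
```

Because each map call sees one document and each reduce call sees one key, both phases parallelize trivially across machines; that independence is the core of the model.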
59. ZooKeeper exposes primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming.
Clients connect to servers to access a namespace, much like that of a standard file system, to store and retrieve coordination data: status information, configuration, location information, etc. The data is usually small, in the byte-to-kilobyte range.
Guarantees:
Sequential Consistency – updates are applied in the order in which they were sent.
Atomicity – updates either succeed or fail; there are no partial results.
Single System Image – the same view of the service regardless of the server used.
Reliability – once an update has been applied, it persists until it is overwritten.
Timeliness – the clients' view of the system is guaranteed to be up to date within a time bound.
ZooKeeper: A Distributed Coordination Service for Distributed Applications[29]
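The file-system-like namespace of small, versioned coordination records can be modelled as a toy in-memory znode tree. This sketch ignores everything that makes ZooKeeper interesting as a distributed system (replication, watches, ephemeral nodes, the ordering guarantees above); the paths and data are invented for illustration.

```python
# Toy znode namespace: hierarchical paths mapping to small data blobs,
# each carrying a version counter (every update is atomic per node).
znodes = {}

def create(path, data=b""):
    znodes[path] = {"data": data, "version": 0}

def set_data(path, data):
    node = znodes[path]
    node["data"] = data
    node["version"] += 1   # each successful update bumps the version

def get_data(path):
    node = znodes[path]
    return node["data"], node["version"]

create("/app")
create("/app/config", b"workers=4")
set_data("/app/config", b"workers=8")
print(get_data("/app/config"))  # (b'workers=8', 1)
```

Real clients (e.g. the `kazoo` library or Apache Curator) expose this same create/get/set shape, plus watches that notify clients when a znode changes.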
67. Government Operation: National Archives and Records Administration, Census
Bureau
Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web
Search, Digital Materials, Cargo shipping (as in UPS)
Defense: Sensors, Image surveillance, Situation Assessment
Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models,
Biodiversity
Deep Learning and Social Media: self-driving cars, geolocating images/cameras, Twitter, crowdsourcing, network science, NIST benchmark datasets
The Ecosystem for Research: Metadata, Collaboration, Language Translation,
Light source experiments
Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
Earth, Environmental and Polar Science: Radar Scattering in Atmosphere,
Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar
mapping, Climate simulation datasets, Atmospheric turbulence identification,
Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET
gas sensors
Energy: Smart grid
Big Data Applications
68. Real-Time Analytics: banking and finance, disaster detection and recovery, event monitoring, etc.; these applications need vast amounts of data, arriving at a very fast pace, to be processed within strict time limits
Artificial Intelligence/Business Intelligence:
An intelligent maintenance system utilizes the data collected from machinery in order to predict and prevent potential failures
IoT/M2M: these applications generate data at a very fast rate (high velocity) from a huge number of sources (high volume), and require Big Data solutions to process it and derive meaningful information.
Transreality gaming, sometimes written as trans-reality gaming, describes a
type or a mode of gameplay that combines playing a game in a virtual
environment with game-related, physical experiences in the real world and
vice versa.
Emerging Trends in Big Data
69. Cloud computing advances have helped Big Data emerge as a
mass scale solution
Leased/rented data storage and computing clusters enable even startups to have global-scale Big Data capability without major capital investment
Emerging Trends in Cloud Computing
– Complementary Technologies
70. Massively parallel processing refers to a multitude of individual processors
working in parallel to execute a particular program
The Big Data paradigm consists of the distribution of data systems across
horizontally coupled, independent resources to achieve the scalability needed for
the efficient processing of extensive datasets.
Big Data Engineering: Advanced techniques that harness independent resources
for building scalable data systems when the characteristics of the datasets
require new architectures for efficient storage, manipulation, and analysis.
NoSQL: Non-relational models, also known as NoSQL, refer to logical data models
that do not follow relational algebra for the storage and manipulation of data.
A federated database system is a type of meta-database management system (DBMS) which transparently maps multiple autonomous database systems into a single federated database.
Terms[3] – 1/
71. The data science paradigm is extraction of actionable knowledge directly from
data through a process of discovery, hypothesis, and hypothesis testing.
The data lifecycle is the set of processes that transform raw data into actionable
knowledge.
Analytics is the extraction of knowledge from information.
Data science is the construction of actionable knowledge from raw data through
the complete data lifecycle process.
A data scientist is a practitioner who has sufficient knowledge in the overlapping
regimes of business needs, domain knowledge, analytical skills, and software and
systems engineering to manage the end-to-end data processes through each
stage in the data lifecycle.
Schema-on-read is the application of a data schema through preparation steps
such as transformations, cleansing, and integration at the time the data is read
from the database.
Computational portability is the movement of the computation to the location of
the data.
Terms[3] – 2/
72. Transaction processing is a style of computing that divides work into individual, indivisible operations, called transactions.
Relational databases have traditionally supported the ACID transaction model. ACID transactions are:
Atomic – either all of the actions in a transaction are completed (i.e., the transaction is committed) or none of them are (i.e., the transaction is rolled back).
Consistent – the transaction must begin and end with the database in a consistent state and must comply with all protocols (i.e., rules) of the database.
Isolated – the transaction behaves as if it were the only operation being performed on the database.
Durable – the results of a committed transaction survive system malfunctions.
The BASE acronym is often used to describe the types of transactions typically supported by non-relational databases. A BASE system is described, in contrast to an ACID-compliant system, as:
Basically Available, Soft state, and Eventually consistent
BASE transactions allow a database to be in a temporarily inconsistent state that will eventually be resolved.
Terms[3] – 3/
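Atomicity, the ‘A’ in ACID, can be illustrated with a toy transaction that either applies all of its operations or rolls every one of them back. This is a sketch of the concept, not a real database engine; the accounts and amounts are invented for illustration.

```python
def run_transaction(db, operations):
    # Atomicity: apply all operations or none. Work on a copy and
    # commit only if every operation succeeds; any failure rolls back.
    snapshot = dict(db)
    try:
        for op in operations:
            op(snapshot)
    except Exception:
        return False            # rolled back: db is left untouched
    db.clear()
    db.update(snapshot)         # committed
    return True

accounts = {"alice": 100, "bob": 50}

def debit(who, amount):
    def op(db):
        if db[who] < amount:
            raise ValueError("insufficient funds")
        db[who] -= amount
    return op

def credit(who, amount):
    def op(db):
        db[who] += amount
    return op

# A successful transfer commits both the debit and the credit...
ok = run_transaction(accounts, [debit("alice", 30), credit("bob", 30)])
# ...while an overdraft rolls back everything, leaving balances intact.
failed = run_transaction(accounts, [debit("alice", 999), credit("bob", 999)])
print(accounts)  # {'alice': 70, 'bob': 80}
```

A BASE system would instead let the debit and credit land at different times, tolerating the temporary inconsistency described above.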
73. The CAP theorem states that a distributed system can support only two of the following three characteristics:
Consistency – the client perceives that a set of operations has occurred all at once.
Availability – every operation must terminate in an intended response.
Partition tolerance – operations will complete even if individual components are unavailable.
Terms[3] – 4/
74. 1. Webopedia: http://www.webopedia.com/TERM/B/big_data.html
2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012.
3. NIST definitions: http://bigdatawg.nist.gov/_uploadfiles/BD_Vol1-Definitions_V1Draft_Pre-release.pdf
4. Extreme Big Data: http://www.forbes.com/sites/oracle/2013/10/09/extreme-big-data-beyond-zettabytes-and-yottabytes/
5. Presto Project: https://prestodb.io/
6. Hadoop Project: http://hadoop.apache.org/
7. Xoriant Big Data Report: http://www.xoriant.com/big-data-services
8. Big Data Article: http://www.slideshare.net/Codemotion/codemotionws-bigdata-conf
9. Big Data Article at Data science central: www.datasciencecentral.com
10. Big Data Article by IBM: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
11. Big Data Article: http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
12. Dryad Project: http://research.microsoft.com/en-us/projects/Dryad/
13. Data Variety: http://www.bi-bestpractices.com/view-articles/5643
References
75. 14. Data Growth Article: http://www.businessinsider.in/Social-Networks-Like-Facebook-Are-Finally-Going-After-The-Massive-Amount-Of-Unstructured-Data-Theyre-Collecting/articleshow/31055495.cms
15. Cloudera Modern Data Operating System: http://www.slideshare.net/awadallah/introducing-apache-hadoop-the-modern-data-operating-system-stanford-ee380
16. Google BigTable: www.research.google.com/archive/bigtable-osdi06.pdf
17. Google Spanner: www.research.google.com/archive/spanner-osdi2012.pdf
18. Apache HBase: hbase.apache.org/
19. Mongo DB: www.mongodb.org/
20. BSON Specs: http://bsonspec.org/
21. JSON Specs: http://www.json.org/
22. LexisNexis HPCC: http://hpccsystems.com/
23. Definition of Big Data Analytics: http://www.webopedia.com/TERM/B/big_data_analytics.html
24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan
Kudyba, CRC Press.
25. Big Data Use Cases: http://bigdatawg.nist.gov/usecases.php
26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, PACKT publishing.
References
76. 27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics:
http://www.slideshare.net/AndersQuitzauIbm/big-data-analyticsin-energy-utilities
28. Apache Spark: https://spark.apache.org/
29. Zookeeper: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
30. Neo4j Database: http://neo4j.com/developer/graph-database/
31. Apache Hadoop: http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
32. HDFS: http://www.slideshare.net/hanborq/hadoop-hdfs-detailed-introduction
33. YARN: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex
34. MapReduce: https://cloud.google.com/appengine/docs/python/dataprocessing/
35. MapReduce@Wiki:http://en.wikipedia.org/wiki/MapReduce
36. Investments in Big Data:
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
37. Big Data Challenges: http://infographicsmania.com/big-data-challenges/
References