Introduction to Big Data
Vipin Batra
Definitions:
 Webopedia[1]
Big data is used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process using
traditional database and software techniques.
Gartner [2]
Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization.
What is BIG Data?
Definitions:
 National Institute of Standards and Technology (USA) [3]
 Big Data consists of extensive datasets, primarily in the
characteristics of volume, velocity, and/or variety that require a
scalable architecture for efficient storage, manipulation, and
analysis.
Data set characteristics that force a new architecture are:
1. the data-at-rest characteristics of Volume and Variety (i.e., data
from multiple repositories, domains, or types), and
2. the data-in-motion characteristics of Velocity (i.e., rate of
flow) and Variability (i.e., the change in velocity).
These characteristics are collectively known as the 'Vs' of Big Data.
What is Big Data
Big Data – 3 Vs
Big Data - 4Vs
 IBM’s 4Vs of Big Data:
Volume – Data at Scale: terabytes to petabytes of data.
Variety – Data in Many Forms: structured, unstructured, text, multimedia.
Velocity – Data in Motion: analysis of streaming data to enable decisions within fractions of a second.
Veracity – Data Uncertainty: managing the reliability and predictability of inherently imprecise data types.
 Validity: Refers to the quality of the data and its accuracy for the
intended purpose for which it is collected.
 Volatility: Tendency for data structures to change over time. In
this world of real-time data you need to determine at what point
data is no longer relevant to the current analysis.
 Value: The value derived through the insights provided by Big Data
analytics.
More Vs of Big Data [10,3]
1 KB = 1024 Bytes = 2^10 Bytes
1 MB = 1024 KB = 2^20 Bytes
1 GB = 1024 MB = 2^30 Bytes
1 TB = 1024 GB = 2^40 Bytes
1 PB = 1024 TB = 2^50 Bytes
1 EB = 1024 PB = 2^60 Bytes
1 ZB = 1024 EB = 2^70 Bytes
1 YB = 1024 ZB = 2^80 Bytes = 1,208,925,819,614,629,174,706,176 Bytes
• 2^90 Bytes = Brontobytes, Hellabytes, or Ninabytes?
• 2^100 Bytes = Geopbytes, Gegobytes, or Tenabytes?
Big Volume [4] – 1/
1 GeopByte = 1024 BB = 2^100 Bytes = 1,267,650,600,228,229,401,496,703,205,376 Bytes
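The binary byte-unit ladder above is easy to sanity-check programmatically (a quick illustrative sketch):

```python
# Sanity-check the binary byte-unit ladder: each unit is 1024x the previous.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for i, unit in enumerate(units, start=1):
    # 1 KB = 2^10 B, 1 MB = 2^20 B, ..., 1 YB = 2^80 B
    assert 1024 ** i == 2 ** (10 * i)

yottabyte = 2 ** 80      # bytes in 1 YB
geopbyte = 2 ** 100      # bytes in 1 GeopByte (unofficial unit)
print(yottabyte, geopbyte)
```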
Big Volume – 2/
Big Volume – 3/
Big Velocity [8]– 1/
Big Velocity
 Variety..
Big Variety
Big Variety[13,14]
 ~90% of all data is unstructured and it is growing faster than
structured data.
Big Variety
BIG Challenges
 Existing storage and processing systems have the following limitations with respect to
handling Big Data:
 Way too much data to process within acceptable time limits: network bottlenecks,
compute bottlenecks
 Data needs to be structured before storing: months needed to design/implement
new schemas every time a new business need arises
 Hard to retrieve archived data: not trivial to locate archive tapes and find the
relevant data
Limitations of Existing Systems[15]
 Distributed Computing: Horizontal Scaling instead of Vertical Scaling
 Computations are done closer to where data is stored
 Instead of centrally located parallel computing architecture with super-computing
capabilities (Giga/Teraflops), low capacity distributed storage/computing solution is
used
 Use of Low Cost Commodity hardware
 Big Data solutions use large number of low cost, commodity hardware, organized in
clusters to carry out storage/computing tasks
 Reliability, Fault Tolerance and Recovery
 Individual nodes can fail anytime, so to ensure reliability, data is replicated across
multiple nodes
 Scaling with Demand
 The solutions are scalable and allow cluster sizes to grow as per requirement
 Storage of unstructured Data
 Traditional RDBMS systems require well defined schema to be created, before data
can be stored (schema on write)
 A new data storage paradigm, 'NoSQL', has evolved to cater to the need to store any type
of data. It provides for schema on read, i.e., the schema is applied when the data is read.
 No Archiving
 Data is always online, so there is no archiving. Big data solutions do not assume which
queries will use the data, so the rule is to store all data in raw form.
Characteristics of Big Data Systems
Big Data Storage
 Key Points to Note:
Comparison Traditional vs Big
Data Storage [15] – 1/2
1. Schema – Traditional: Schema on Write (a schema must be created before data can be loaded). Big Data: Schema on Read (data is simply stored, with no transformation).
2. Transformation – Traditional: an explicit load operation transforms data into the DB-internal structure. Big Data: a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
3. Storage Mechanism – Traditional: a single seamless store of data, mostly on a single machine/location. Big Data: distributed storage across multiple nodes/locations.
4. Distillation (organizing data for read) – Traditional: data is already distilled into a structured format. Big Data: done on demand based on business needs, allowing new patterns and relationships to be identified in existing data.
 Key Points to Note Contd..:
Comparison Traditional vs Big
Data Storage – 2/2
5. Store Process – Traditional: data is stored after preparation (for example, after the extract-transform-load and cleansing processes). Big Data: in a high-velocity use case, the data is prepared and analyzed for alerting and only then stored; in a volume use case, the data is often stored in the raw state in which it was produced.
6. Insights – Traditional: analysis must be defined upfront and is hence rigid to the business need. Big Data: data can be analyzed as required, allowing exploration and enabling the discovery of insights that were not directly visible.
7. Action – Traditional: technically feasible, but not effective due to data latency. Big Data: can integrate with business decisioning systems for the next best action.
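The schema-on-write vs. schema-on-read contrast can be sketched in a few lines of Python (the toy "database", field names, and the integrity rule are invented for illustration only):

```python
import json

# Schema on write (traditional): validate/transform before storing.
def store_with_schema(db, record):
    # hypothetical schema: 'id' must be int, 'name' must be str
    assert isinstance(record["id"], int) and isinstance(record["name"], str)
    db.append({"id": record["id"], "name": record["name"]})

# Schema on read (Big Data): store raw, apply the schema only when reading.
def store_raw(store, raw_line):
    store.append(raw_line)  # no transformation at write time

def read_with_schema(store, field):
    # a SerDe-like step: deserialize and project the required column at read time
    return [json.loads(line).get(field) for line in store]

schema_db = []
store_with_schema(schema_db, {"id": 1, "name": "alpha"})  # validated at write time

raw_store = []
store_raw(raw_store, '{"id": 1, "name": "alpha", "extra": true}')
store_raw(raw_store, '{"id": 2, "name": "beta"}')
print(read_with_schema(raw_store, "name"))  # ['alpha', 'beta']
```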
 NoSQL database refers to a class of databases that do not use the relational
model for data storage (the relational model uses tables and rows)
 There are many NoSQL solutions, these are widely classified as:
1. Key-Value
2. Column-Family
3. Document
4. Graph
NoSQL (Not Only SQL) Databases
• First three models
are aggregate
oriented
• Aggregate is a
collection of related
objects, treated as
a single unit
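The four models can be illustrated with plain Python structures (a sketch; all keys and values are invented for illustration):

```python
# Key-value: the aggregate is opaque to the store; lookup only by key.
kv_store = {"user:42": b'{"name": "Asha", "city": "Delhi"}'}

# Column-family: rows keyed by a row key; columns grouped into families.
column_family = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Delhi"},   # family "profile"
        "activity": {"last_login": "2015-01-10"},       # family "activity"
    }
}

# Document: the aggregate is a structured, queryable document.
document = {"_id": 42, "name": "Asha", "orders": [{"sku": "A1", "qty": 2}]}

# Graph: entities plus named, directed relationships (not aggregate oriented).
graph = {
    "nodes": {1: {"label": "Person", "name": "Asha"}},
    "edges": [(1, "FRIEND_OF", 2)],
}

print(column_family["user:42"]["profile"]["name"])  # Asha
```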
 Google BigTable is a compressed, high-performance, proprietary data
storage system used in Google projects. It is a column-family database.
 BigTable maps two arbitrary string values (row key and column key) and
timestamp (hence three-dimensional mapping) into an associated arbitrary
byte array. It is not a relational database and can be better defined as a
sparse, distributed multi-dimensional sorted map
Google BigTable[16]
 Apache HBase is an open-source, distributed, versioned, non-relational
database modeled after Google's BigTable. It is a column-oriented DBMS that
runs on top of HDFS
Apache HBase[18]
 Apache Cassandra is a column-family database system. It is designed as a
distributed storage system for managing very large amounts of structured
data spread across many commodity servers, while providing a highly
available service with no single point of failure.
Column-Family Database[18]
 MongoDB stores data in the form of documents, which are JSON[21]-like field
and value pairs. Documents are analogous to structures in programming
languages that associate keys with values (e.g. dictionaries, hashes, maps,
and associative arrays). Formally, MongoDB documents
are BSON[20] documents. BSON is a binary representation of JSON with
additional type information.
Document Database [19]
 MongoDB supports search by field, range queries, and regular-expression
searches. Queries can return specific fields of documents and can also
include user-defined JavaScript functions.
 Any field in a MongoDB document can be indexed. Secondary indexes are
also available.
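The query-by-field and range-query model can be mimicked over plain dicts (a sketch of the idea, not the actual MongoDB driver API; the documents and fields are invented):

```python
# Simulate MongoDB-style field/range queries over JSON-like documents.
docs = [
    {"_id": 1, "name": "sensor-a", "temp": 21.5},
    {"_id": 2, "name": "sensor-b", "temp": 35.0},
    {"_id": 3, "name": "sensor-c", "temp": 28.2},
]

def find(collection, field, predicate):
    """Return documents whose `field` satisfies `predicate`."""
    return [d for d in collection if field in d and predicate(d[field])]

hot = find(docs, "temp", lambda t: t > 25)   # a range query
print([d["name"] for d in hot])              # ['sensor-b', 'sensor-c']
```

A real document database would answer such queries via indexes on the fields rather than a linear scan.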
 Neo4j is an open-source NoSQL graph database implemented in Java and
Scala
Graph Database [30]
 The property graph contains connected entities (the nodes) which can hold any number of
attributes (key-value-pairs).
 Nodes can be tagged with labels, which in addition to contextualizing node and relationship
properties may also serve to attach metadata—index or constraint information—to certain nodes.
 Relationships provide directed, named, semantically relevant connections between two node entities.
Big Data Analytics
 Big data analytics refers to the process of collecting, organizing
and analysing large sets of data to discover patterns and other
useful information[23].
 Conceptual Framework for Big Data analytics[24]:
Big Data Analytics – 1/
 The data analytics project life cycle stages[27]:
Big Data Analytics – 2/
 Following are types of Big Data Analytics[27]:
Big Data Analytics – 3/
Diagnostic
Analytics
Descriptive
Analytics
Predictive
Analytics
Prescriptive
Analytics
Big Data Solutions and
Frameworks
Apache Hadoop [6]
 Apache Hadoop is widely used, open-source software for reliable, scalable, distributed
computing. Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage
 Microsoft Dryad is an R&D project which provides an infrastructure that allows a
programmer to use the resources of a computer cluster or a data center for
running data-parallel programs
 A Dryad programmer can use thousands of machines, each of them with multiple
processors or cores
 A Dryad job is a graph generator which can synthesize any directed acyclic graph
 These graphs can even change during execution, in response to important events in the
computation.
Big Data Solutions: Dryad[12]
LexisNexis – HPCC[22]
 HPCC (High-Performance Computing Cluster), also known as DAS (Data
Analytics Supercomputer), is an open source, data-intensive computing
system platform developed by LexisNexis Risk Solutions. The HPCC platform
incorporates a software architecture implemented on commodity computing
clusters to provide high-performance, data-parallel processing for applications
utilizing big data.
Thor: Batch Processing Engine Roxie: High Perf. Query Engine
 Apache Spark is an open-source cluster computing framework originally
developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage
disk-based MapReduce paradigm, Spark's in-memory primitives provide
performance up to 100 times faster for certain applications
 Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object (aka the driver program).
Apache Spark [28]
Lightning-fast cluster computing
 The SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager or Mesos or YARN), which allocate
resources across applications
 Once connected, first it acquires executors (processes that run computations
and store data) on nodes. Next, it sends application code (defined by JAR or
Python files passed to SparkContext) to the executors.
 Finally, SparkContext sends tasks for the executors to run.
Apache Spark [28]
Lightning-fast cluster computing
Quiz (Match The Following)
 Big Data is generally defined by:
 HPCC is example of:
 Cassandra is a:
 MongoDB is a:
 One key characteristic of a Big Data
Solution:
 Characteristic of Traditional storage
system:
 NoSQL DB characteristic:
 Schema design before Storage
(Schema on write)
 RDBMS
 Column-family Database
 Key-Value Database
 Graph Database
 Document Database
 3Vs - Volume, Veracity, Variety
 4Vs – Volume, Velocity, Variety,
Value
 Big Data Solution
 Processing is closer to data
location
 Schema on Read
Brief History..
 A Hadoop cluster consists of a set of cheap commodity hardware machines
networked together as racks of servers
Hadoop – 1/
 The Hadoop framework allows for the distributed processing of large data sets across
clusters of computers. It can scale up from single servers to thousands of machines, each
offering local computation and storage.
 Designed to detect and handle failures so as to deliver a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
 The project includes these modules:
 Hadoop Common: The common utilities that support the other Hadoop modules.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
 Hadoop YARN: A framework for job scheduling and cluster resource management.
 Hadoop MapReduce: A YARN-based system for parallel processing of large data
sets.
Hadoop – 2/2 [6, 31]
 HDFS is a Fault Tolerant Distributed File System
 HDFS provides POSIX-like file system features:
 File, Directory and Sub-Directory Structure
 Permission (rwx)
 Access (owner, Group, Others) and Super User
 Optimized for storing large files, with streaming data access (not
random access)
 The file system keeps checksums (CRC32 per 512 bytes) of data for
corruption detection and recovery
 Files are stored across multiple commodity hardware machines in a
cluster
 Files are divided into uniform sized blocks (64MB, 128MB, 256 MB)
 Blocks are replicated across multiple machines to handle failures
 Provides access to block locations (servers/racks), so computations
can be done on same locations (same servers/racks on which data
resides)
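The block-and-checksum scheme above can be sketched as follows (the block and chunk sizes come from the slide; the functions are illustrative, not actual HDFS code):

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the common HDFS block sizes
CHUNK = 512                     # HDFS checksums every 512 bytes (CRC32)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of uniform-sized blocks a file occupies; the last may be partial."""
    return (file_size + block_size - 1) // block_size

def chunk_checksums(data, chunk=CHUNK):
    """CRC32 per 512-byte chunk, used to detect corruption on read."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

print(split_into_blocks(300 * 1024 * 1024))   # a 300 MB file -> 3 blocks
print(len(chunk_checksums(b"x" * 1024)))      # 1024 bytes -> 2 checksummed chunks
```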
Hadoop Distributed File System -1/9
HDFS – 2/9
 HDFS is implemented as a master-slave architecture
 NameNode is the master; a Secondary NameNode maintains checkpoints of its metadata
 DataNodes are slaves
[Diagram: the NameNode (master) exchanges checkpoints with the Secondary NameNode, and exchanges metadata and read/write commands with the DataNodes, which hold the data.]
HDFS – 3/
 NameNode manages the file system:
 File system names (e.g. /home/foo/data/ .. ) and metadata
 Maps a file name to a set of blocks
 Maps a block to the DataNodes where it resides
 Blocks within the file system and their replicas
 Manages cluster configuration
 Manages DataNodes
 In case of NameNode failure, the namespace can be restored from the
Secondary NameNode's checkpoints
[Diagram: NameNode maps files and metadata to blocks spread across DataNodes DN1 … DNN]
HDFS – 4/9
 Meta Data
 HDFS namespace is hierarchy of files and directories
 Entire Meta-data is in Memory
 No Demand Paging
 Consists of list of blocks for each file, file attributes e.g. access time,
replication factor etc.,
 Changes to HDFS are recorded in log called ‘Transaction Log’
 Block Placement, default 3 replicas, configurable
 One replica on local node, Second on remote rack, Third on same remote
rack
 Additional copies randomly placed
 Clients Read from nearest replica
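The default 3-replica placement policy can be sketched as a toy model (node and rack names are invented; this is not actual HDFS code):

```python
import random

def place_replicas(local_node, nodes_by_rack):
    """Sketch of HDFS default placement: replica 1 on the writer's node,
    replicas 2 and 3 on two different nodes of one remote rack."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if local_node in ns)
    remote_racks = [r for r in nodes_by_rack if r != local_rack]
    remote = random.choice(remote_racks)            # pick a remote rack
    second, third = random.sample(nodes_by_rack[remote], 2)
    return [local_node, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("n1", racks))  # e.g. ['n1', 'n4', 'n3']
```

Keeping two replicas on one remote rack bounds cross-rack write traffic while still surviving the loss of an entire rack.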
HDFS: Data Nodes – 5/9
 DataNode
 Slave daemon process that reads/writes HDFS blocks from/to files in its local
file system
 During startup performs handshake with NameNode to verify namespace,
software version of data node (if version mismatch, datanode shuts down)
 Periodically sends heartbeat, block reports to NameNode
 Heartbeat carries total storage capacity, fraction used, ongoing data transfers etc.
 These stats are used by NameNode for block placement and load balancing
 Block Report has Block ID, Timestamp, block length for each replica
 Has no awareness of HDFS file system
 Does block creation, deletion, replication, shutdown etc. when NameNode
commands
 Namenode commands are sent as replies to heartbeat messages received
 Stores each HDFS block as a separate file in the underlying OS's file system
 Maintains optimal number of files per directory, creates new directories as
needed
 Interacts directly with client to read/write blocks
 Java/C++ APIs are available to access Files on HDFS
 Sample code illustrates writing to HDFS as a 3-step process:
HDFS Write – Sample Code: 6/9
 Figure below illustrates how write takes place (how blocks and
their replicas are updated):
HDFS Write Operations: 7/9
 Sample code illustrates reading from HDFS as a 4-step process:
HDFS Read– Sample Code: 8/9
 Figure below illustrates how read takes place
HDFS Read Operations: 9/9
YARN- Yet Another Resource
Negotiator[33]
 Manages compute resources across the cluster
 Consists of the following components:
 ResourceManager (RM)
 Manages and allocates cluster compute resources
 NodeManager (NM) on each node
 Manages and enforces node resource allocations
 ApplicationMaster (AM)
 Per application
 Manages the app lifecycle and task scheduling
 Container
 Basic unit of allocation
 Allows fine-grained resource allocation
YARN: Resource Manager
 Resource Manager
 Manages Nodes – Tracks heartbeats from NodeManagers
 Manages Containers
 Handles AM request for resources
 De-allocates containers when they expire or application
completes
 Manages AM (ApplicationMasters)
 Creates a container for AMs and tracks heartbeats
 Manages Security
 Supports Kerberos
YARN: NodeManager
 Node Manager resides on each Node
 Registers with ResourceManager (RM) and provides info on
node resources
 Sends periodic heartbeats and container status
 Manages processes in containers
 Launches AMs on request from RM
 Launches application processes on request from AM
 Monitors resource usage by containers; kills rogue processes
 Provides logging services to applications
 Aggregates logs for an application and saves to HDFS
 Maintains node level security via ACLs
 Container
 Created by Resource Manager upon request
 Allocates a certain amount of resources (CPU, memory)
 Applications run in one or more containers
 Application Master (AM)
 One per application
 Framework/application specific
 Runs in a container
 Requests more containers to run application tasks
YARN: Containers and AMs
 Client requests the RM to launch an application:
 RM launches an ApplicationMaster on one NodeManager
YARN: Starting an App : 1/
 Application Master (AM) requests resources from the RM; the RM allocates
resources on NodeManagers
 The RM confirms the resource allocations to the AM with details; the AM launches the app
YARN: Starting an App: 2/
[Diagram: the AM's Resource Request specifies a resource name (hostname, rack), a priority (within this app), the resources required (memory in MB, CPU cores, etc.), and the number of containers. The RM allocates resources and returns container IDs and nodes (e.g. C1@NM1, C2@NM2). The AM then sends each NodeManager a Container Launch Context: container ID, commands (to start the app), environment (configuration), and local resources (e.g. the app binary, HDFS files).]
 MapReduce, originally proprietary Google technology, is a programming
model for processing large amounts of data in a parallel and distributed
fashion. It is useful for large, long-running jobs that cannot be handled within
the scope of a single request, tasks like:
 Analyzing application logs
 Aggregating related data from external sources
 Transforming data from one format to another
 Exporting data for external analysis
Map-Reduce[34]
[Diagram: MapReduce operation – input from DB/HDFS → Map → Shuffle → Reduce → output to DB/HDFS]
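The Map → Shuffle → Reduce pipeline can be simulated in-process with the classic word-count example (a sketch, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values belonging to the same key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the grouped values per key
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster the map and reduce phases run in parallel on many nodes, and the shuffle moves intermediate pairs across the network.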
 Zookeeper exposes primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and
naming.
 Clients connect to servers to access a namespace, much like that of a standard file
system, to store/retrieve coordination data: status information, configuration, location
information, etc. The data is usually small, in the byte-to-kilobyte range.
 Guarantees:
 Sequential Consistency - Updates will be applied in the order that they were sent.
 Atomicity - Updates either succeed or fail. No partial results.
 Single System Image – Same view of the service regardless of the server used
 Reliability - Once an update has been applied, it persists until updated.
 Timeliness - View of the system is guaranteed to be up-to-date within a time bound.
Zookeeper: A Distributed Coordination
Service for Distributed Applications[29]
Big Data – Business
Trends
BD – Landscape
Impact of Big Data on Economy
 Top 10 Big Data Challenges
Big Data Challenges
Big Data Trends
Big Data Market Forecast
Big Data: Revenues
[Bar chart: 2013 Big Data revenue ($ millions) by vendor, ranging from $1,368M for the leader down to $175M]
 Government Operation: National Archives and Records Administration, Census
Bureau
 Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web
Search, Digital Materials, Cargo shipping (as in UPS)
 Defense: Sensors, Image surveillance, Situation Assessment
 Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models,
Biodiversity
 Deep Learning and Social Media: Self-driving cars, geolocating images/cameras,
Twitter, crowd sourcing, network science, NIST benchmark datasets
 The Ecosystem for Research: Metadata, Collaboration, Language Translation,
Light source experiments
 Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
 Earth, Environmental and Polar Science: Radar Scattering in Atmosphere,
Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar
mapping, Climate simulation datasets, Atmospheric turbulence identification,
Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET
gas sensors
 Energy: Smart grid
Big Data Applications
 Real-Time Analytics: Banking and finance, disaster detection and recovery,
event monitoring, and similar applications need vast amounts of data, arriving at a
very fast pace, to be processed within strict time limits
 Artificial Intelligence/Business Intelligence:
 Intelligent Maintenance Systems: systems that utilize data collected from
machinery in order to predict and prevent potential failures
 IoT/M2M: These applications generate data at a very fast rate (high
velocity) from a huge number of sources (high volume) and require big data
solutions to process them and derive meaningful information.
 Transreality gaming, sometimes written as trans-reality gaming, describes a
type or a mode of gameplay that combines playing a game in a virtual
environment with game-related, physical experiences in the real world and
vice versa.
Emerging Trends in Big Data
 Cloud computing advances have helped Big Data emerge as a
mass scale solution
 Leased/Rented data storage, computing clusters, enable even
startups to have global scale Big Data capability, without major
capital investment
Emerging Trends in Cloud Computing
– Complementary Technologies
 Massively parallel processing refers to a multitude of individual processors
working in parallel to execute a particular program
 The Big Data paradigm consists of the distribution of data systems across
horizontally coupled, independent resources to achieve the scalability needed for
the efficient processing of extensive datasets.
 Big Data Engineering: Advanced techniques that harness independent resources
for building scalable data systems when the characteristics of the datasets
require new architectures for efficient storage, manipulation, and analysis.
 NoSQL: Non-relational models, also known as NoSQL, refer to logical data models
that do not follow relational algebra for the storage and manipulation of data.
 A federated database system is a type of meta-database management system
(DBMS) which transparently maps multiple autonomous database systems into a
single federated database.
Terms[3] – 1/
 The data science paradigm is extraction of actionable knowledge directly from
data through a process of discovery, hypothesis, and hypothesis testing.
 The data lifecycle is the set of processes that transform raw data into actionable
knowledge.
 Analytics is the extraction of knowledge from information.
 Data science is the construction of actionable knowledge from raw data through
the complete data lifecycle process.
 A data scientist is a practitioner who has sufficient knowledge in the overlapping
regimes of business needs, domain knowledge, analytical skills, and software and
systems engineering to manage the end-to-end data processes through each
stage in the data lifecycle.
 Schema-on-read is the application of a data schema through preparation steps
such as transformations, cleansing, and integration at the time the data is read
from the database.
 Computational portability is the movement of the computation to the location of
the data.
Terms[3] – 2/
 Transaction processing is a style of computing that divides work into individual,
indivisible operations, called transactions.
 Relational databases have traditionally supported the ACID transaction model.
ACID transactions are:
 Atomic Either all of the actions in a transaction are completed (i.e., transaction is
committed) or none of them are completed (i.e., transaction is rolled back).
 Consistent The transaction must begin and end with the database in a consistent state
and must comply with all protocols (i.e., rules) of the database.
 Isolated The transaction will behave as if it is the only operation being performed upon
the database.
 Durable The results of a committed transaction can survive system malfunctions.
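The atomicity and consistency properties can be illustrated with a toy in-memory transaction (the integrity rule and account data are invented for illustration; durability would additionally require persistence):

```python
# Minimal sketch of the 'A' and 'C' in ACID: apply all updates or none.
def run_transaction(db, updates):
    snapshot = dict(db)            # remember the pre-transaction state
    try:
        for key, value in updates:
            if value < 0:          # hypothetical integrity rule (consistency)
                raise ValueError("negative balance not allowed")
            db[key] = value
        return True                # all actions completed: committed
    except ValueError:
        db.clear()
        db.update(snapshot)        # rollback: none of the actions survive
        return False

accounts = {"alice": 100, "bob": 50}
ok = run_transaction(accounts, [("alice", 70), ("bob", -20)])
print(ok, accounts)  # False {'alice': 100, 'bob': 50}
```

The second update violates the rule, so the first update is rolled back as well: the transaction is all-or-nothing.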
 The BASE acronym is often used to describe the types of transactions typically
supported by non-relational databases. A BASE system is described, in contrast to
an ACID-compliant system, as:
 Basically Available, Soft state, and Eventually consistent
 BASE transactions allow a database to be in a temporarily inconsistent state that will
eventually be resolved.
Terms[3] – 3/
 CAP Theorem states that a distributed system can support only two of the
following three characteristics:
 Consistency The client perceives that a set of operations has occurred all at once.
 Availability Every operation must terminate in an intended response.
 Partition tolerance Operations will complete, even if individual components are
unavailable.
Terms[3] – 4/
1. Webopedia: http://www.webopedia.com/TERM/B/big_data.html
2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner.
Retrieved 21 June 2012
3. NIST definitions: http://bigdatawg.nist.gov/_uploadfiles/BD_Vol1-Definitions_V1Draft_Pre-
release.pdf
4. Extreme Big Data: http://www.forbes.com/sites/oracle/2013/10/09/extreme-big-data-beyond-
zettabytes-and-yottabytes/
5. Presto Project: https://prestodb.io/
6. Hadoop Project: http://hadoop.apache.org/
7. Xoriant Big Data Report: http://www.xoriant.com/big-data-services
8. Big Data Article: http://www.slideshare.net/Codemotion/codemotionws-bigdata-conf
9. Big Data Article at Data science central: www.datasciencecentral.com
10. Big Data Article by IBM: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
11. Big Data Article: http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-
data-veracity/
12. Dryad Project: http://research.microsoft.com/en-us/projects/Dryad/
13. Data Variety: http://www.bi-bestpractices.com/view-articles/5643
References
14. Data Growth Article: http://www.businessinsider.in/Social-Networks-Like-Facebook-Are-Finally-
Going-After-The-Massive-Amount-Of-Unstructured-Data-Theyre-
Collecting/articleshow/31055495.cms
15. Cloudera Modern Data Operating System: http://www.slideshare.net/awadallah/introducing-
apache-hadoop-the-modern-data-operating-system-stanford-ee380
16. Google BigTable: www.research.google.com/archive/bigtable-osdi06.pdf
17. Google Spanner: www.research.google.com/archive/spanner-osdi2012.pdf
18. Apache HBase: hbase.apache.org/
19. Mongo DB: www.mongodb.org/
20. BSON Specs: http://bsonspec.org/
21. JSON Specs: http://www.json.org/
22. LexisNexis HPCC: http://hpccsystems.com/
23. Definition of Big Data Analytics: http://www.webopedia.com/TERM/B/big_data_analytics.html
24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan
Kudyba, CRC Press.
25. Big Data Use Cases: http://bigdatawg.nist.gov/usecases.php
26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, PACKT publishing.
References
27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics:
http://www.slideshare.net/AndersQuitzauIbm/big-data-analyticsin-energy-utilities
28. Apache Spark: https://spark.apache.org/
29. Zookeeper: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
30. Neo4j Database: http://neo4j.com/developer/graph-database/
31. Apache Hadoop: http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
32. HDFS: http://www.slideshare.net/hanborq/hadoop-hdfs-detailed-introduction
33. YARN: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex
34. MapReduce: https://cloud.google.com/appengine/docs/python/dataprocessing/
35. MapReduce@Wiki:http://en.wikipedia.org/wiki/MapReduce
36. Investments in Big Data:
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
37. Big Data Challenges: http://infographicsmania.com/big-data-challenges/
References
THANKS!!
Analysis and evaluation of riak kv cluster environment using basho benchStevenChike
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With HadoopUmair Shafique
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxPriyadarshini648418
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 
module4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfmodule4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfSumanthReddy540432
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 

Similar to Introduction to Big Data (20)

Analysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho benchAnalysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho bench
 
Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQL
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Traditional data word
Traditional data wordTraditional data word
Traditional data word
 
Big Data
Big DataBig Data
Big Data
 
Analysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho benchAnalysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho bench
 
Big Data SE vs. SE for Big Data
Big Data SE vs. SE for Big DataBig Data SE vs. SE for Big Data
Big Data SE vs. SE for Big Data
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 
module4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfmodule4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdf
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 

Recently uploaded

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Recently uploaded (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Introduction to Big Data

  • 1. Introduction to Big Data Vipin Batra
  • 2. Definitions:  Webopedia[1] Big data is used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques. Gartner [2] Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. What is BIG Data?
  • 3. Definitions:  National Institute of Standards and Technology (USA) [3]  Big Data consists of extensive datasets, primarily characterized by volume, velocity, and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis. Dataset characteristics that force a new architecture are: 1. the data-at-rest characteristics of Volume and Variety (i.e., data from multiple repositories, domains, or types), and 2. the data-in-motion characteristics of Velocity (i.e., rate of flow) and Variability (i.e., the change in velocity). These characteristics are known as the ‘V’s’ of Big Data. What is Big Data
  • 5. Big Data - 4Vs  IBM’s 4Vs of Big Data: Volume Variety Velocity Veracity Data at Scale Terabytes to petabytes of data Data in Many Forms Structured, unstructured, text, multimedia Data in Motion Analysis of streaming data to enable decisions within fractions of a second. Data Uncertainty Managing the reliability and predictability of inherently imprecise data types.
  • 6.  Validity: Refers to the quality of data and its accuracy for the intended purpose for which it is collected.  Volatility: The tendency for data structures to change over time. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis  Value: The value delivered through the insights provided by Big Data analytics More Vs of Big Data [10,3]
  • 7. 1024 Bytes = 2^10 Bytes (1 KB), 1024 KB = 2^20 Bytes (1 MB), 1024 MB = 2^30 Bytes (1 GB), 1024 GB = 2^40 Bytes (1 TB), 1024 TB = 2^50 Bytes (1 PB), 1024 PB = 2^60 Bytes (1 EB), 1024 EB = 2^70 Bytes (1 ZB), 1 YB = 1024 ZB = 2^80 Bytes = 1,208,925,819,614,629,174,706,176 Bytes • 2^90 Bytes = Brontobytes, Hellabytes, or Ninabytes? • 2^100 Bytes = Geopbytes, Gegobytes, or Tenabytes? Big Volume [4] – 1/ 1 GeopByte = 1024 BB = 2^100 Bytes = 1,267,650,600,228,229,401,496,703,205,376 Bytes
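The ladder of binary units above can be sanity-checked in a few lines of Python (the `UNITS` list and `unit_bytes` helper are illustrative names, not part of the deck):

```python
# Binary storage units as listed on the slide: each step is 1024x larger.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def unit_bytes(unit: str) -> int:
    """Size of a binary unit in bytes: KB = 2**10, MB = 2**20, ..."""
    return 2 ** (10 * (UNITS.index(unit) + 1))

assert unit_bytes("KB") == 1024                       # 2**10
assert unit_bytes("MB") == 1024 * unit_bytes("KB")    # each unit is 1024x the previous
assert unit_bytes("YB") == 1208925819614629174706176  # 2**80, as on the slide
```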
  • 13. Big Variety[13,14]  ~90% of all data is unstructured and it is growing faster than structured data.
  • 16.  Existing storage and processing systems have the following limitations with respect to handling Big Data:  Way too much data to process within acceptable time limits: network bottlenecks, compute bottlenecks  Data needs to be structured before storing: months are needed to design/implement new schemas every time a new business need arises  Hard to retrieve archived data: it is not trivial to locate archive tapes and find the relevant data Limitations of Existing Systems[15]
  • 17.  Distributed Computing: Horizontal scaling instead of vertical scaling
 Computations are done closer to where the data is stored
 Instead of a centrally located parallel-computing architecture with super-computing capabilities (giga-/teraflops), a low-capacity distributed storage/computing solution is used
 Use of Low-Cost Commodity Hardware
 Big Data solutions use a large number of low-cost, commodity hardware machines, organized in clusters, to carry out storage/computing tasks
 Reliability, Fault Tolerance and Recovery
 Individual nodes can fail at any time, so to ensure reliability, data is replicated across multiple nodes
 Scaling with Demand
 The solutions are scalable and allow cluster sizes to grow as per requirement
 Storage of Unstructured Data
 Traditional RDBMS systems require a well-defined schema to be created before data can be stored (schema on write)
 A new data storage paradigm, ‘NoSQL’, has evolved to cater to the need to store any type of data. It provides for schema on read, i.e. the schema is applied when data is read
 No Archiving
 Data is always online, so there is no archiving. Big Data solutions do not assume which queries the data will serve, so the rule is to store all data in raw form
Characteristics of Big Data Systems
  • 19.  Key Points to Note: Comparison Traditional vs Big Data Storage [15] – 1/2
# | Parameter | Traditional Systems | Big Data Storage
1 | Schema | Schema on Write: a schema must be created before data can be loaded | Schema on Read: data is simply stored, with no transformation
2 | Transformation | An explicit load operation transforms data into the DB-internal structure | A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns
3 | Storage Mechanism | A single seamless store of data, mostly on one machine/location | Distributed storage across multiple nodes/locations
4 | Distillation (organizing data for reads) | Data is already distilled, i.e. in a structured format | Done on demand based on business needs, allowing new patterns and relationships to be identified in existing data
  • 20.  Key Points to Note (contd.): Comparison Traditional vs Big Data Storage – 2/2
# | Parameter | Traditional Systems | Big Data Storage
5 | Store Process | Data is stored after preparation (for example, after the extract-transform-load and cleansing processes) | 1. In a high-velocity use case, the data is prepared and analyzed for alerting, and only then is it stored. 2. In a volume use case, the data is often stored in the raw state in which it was produced.
6 | Insights | Analysis needs to be defined upfront and is hence rigid with respect to the business need | Ability to analyze data as required; allows for data exploration and so enables the discovery of new insights that were not directly visible
7 | Action | Technically feasible, but not effective due to data latency | Ability to integrate with business decisioning systems for the next best action
  • 21.  NoSQL database refers to the class of databases that do not use the relational model for data storage (the relational model uses tables and rows)  There are many NoSQL solutions; these are widely classified as: 1. Key-Value 2. Column-Family 3. Document 4. Graph NoSQL (Not Only SQL) Databases • The first three models are aggregate oriented • An aggregate is a collection of related objects, treated as a single unit
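A minimal sketch of the aggregate-oriented idea, using a plain Python dict as a toy key-value store (the `put`/`get` helpers and the "order" aggregate are illustrative, not any vendor's API): the whole aggregate is read and written as one unit under a single key.

```python
# Toy key-value store: one key maps to one whole aggregate.
store = {}

def put(key, aggregate):
    store[key] = aggregate

def get(key):
    return store.get(key)

# An "order" aggregate: the order and its line items travel together.
put("order:1001", {"customer": "alice",
                   "items": [{"sku": "A1", "qty": 2}]})
assert get("order:1001")["items"][0]["qty"] == 2
assert get("order:9999") is None  # absent keys simply return nothing
```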
  • 22.  Google BigTable is a compressed, high-performance, proprietary data storage system used in Google projects. It is a column-family database.  BigTable maps two arbitrary string values (a row key and a column key) and a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array. It is not a relational database and is better described as a sparse, distributed, multi-dimensional sorted map Google BigTable[16]
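The three-dimensional map described above can be sketched in a few lines of Python, with a dict keyed by (row, column, timestamp) standing in for BigTable's storage (the `put`/`latest` helpers and row/column names are illustrative only):

```python
# Sparse three-dimensional map: (row key, column key, timestamp) -> value.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def latest(row, column):
    """Return the most recent version of a (row, column) cell, if any."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# Two timestamped versions of the same cell; reads see the newest one.
put("com.example/index", "contents:html", 1, "<html>v1</html>")
put("com.example/index", "contents:html", 2, "<html>v2</html>")
assert latest("com.example/index", "contents:html") == "<html>v2</html>"
```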
  • 23.  Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's BigTable. It is a column-oriented DBMS that runs on top of HDFS Apache HBase[18]
  • 24.  Apache Cassandra is a column-family database system. It is designed as a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing a highly available service with no single point of failure. Column-Family Database[18]
  • 25.  MongoDB stores data in the form of documents, which are JSON[21]-like field-and-value pairs. Documents are analogous to structures in programming languages that associate keys with values (e.g. dictionaries, hashes, maps, and associative arrays). Formally, MongoDB documents are BSON[20] documents. BSON is a binary representation of JSON with additional type information. Document Database [19]  MongoDB supports search by field, range queries, and regular-expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions.  Any field in a MongoDB document can be indexed. Secondary indices are also available.
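A rough sketch of a document store and a search-by-field query, using plain Python dicts to stand in for MongoDB's JSON-like documents (the `find` helper and sample data are illustrative, not MongoDB's actual query API):

```python
# A "collection" of documents: schemaless dicts with arbitrary fields.
collection = [
    {"_id": 1, "name": "alice", "age": 34},
    {"_id": 2, "name": "bob", "age": 27},
]

def find(docs, field, predicate):
    """Return documents whose `field` satisfies `predicate` (a range query,
    a regex match, etc. can all be expressed as predicates)."""
    return [doc for doc in docs if predicate(doc.get(field))]

# A simple range query: age > 30.
over_30 = find(collection, "age", lambda v: v is not None and v > 30)
assert [d["name"] for d in over_30] == ["alice"]
```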
  • 26.  Neo4j is an open-source NoSQL graph database implemented in Java and Scala Graph Database [30]  The property graph contains connected entities (the nodes), which can hold any number of attributes (key-value pairs).  Nodes can be tagged with labels, which, in addition to contextualizing node and relationship properties, may also serve to attach metadata (index or constraint information) to certain nodes.  Relationships provide directed, named, semantically relevant connections between two node entities.
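The property-graph model above can be sketched with simple Python structures (the node IDs, labels, and the `neighbours` helper are illustrative, not Neo4j's API): nodes carry labels and properties, and relationships are directed, named edges.

```python
# Nodes: labels plus key-value properties.
nodes = {
    "n1": {"labels": ["Person"], "props": {"name": "Ada"}},
    "n2": {"labels": ["Person"], "props": {"name": "Grace"}},
}
# Relationships: directed, named connections between node entities.
relationships = [("n1", "KNOWS", "n2")]

def neighbours(node_id, rel_type):
    """Follow outgoing relationships of a given type from a node."""
    return [dst for src, rel, dst in relationships
            if src == node_id and rel == rel_type]

assert neighbours("n1", "KNOWS") == ["n2"]   # Ada -> KNOWS -> Grace
assert neighbours("n2", "KNOWS") == []       # the edge is directed
```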
  • 28.  Big data analytics refers to the process of collecting, organizing and analysing large sets of data to discover patterns and other useful information[23].  Conceptual Framework for Big Data analytics[24]: Big Data Analytics – 1/
  • 29.  The data analytics project life cycle stages[27]: Big Data Analytics – 2/
  • 30.  Following are types of Big Data Analytics[27]: Big Data Analytics – 3/ Diagnostic Analytics Descriptive Analytics Predictive Analytics Prescriptive Analytics
  • 31. Big Data Solutions and Frameworks
  • 32. Apache Hadoop [6]  Apache Hadoop is widely used, open-source software for reliable, scalable, distributed computing. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage
  • 33.  Microsoft Dryad is an R&D project which provides an infrastructure that allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs  A Dryad programmer can use thousands of machines, each of them with multiple processors or cores  A Dryad job is a graph generator which can synthesize any directed acyclic graph  These graphs can even change during execution, in response to important events in the computation. Big Data Solutions: Dryad[12]
  • 34. LexisNexis – HPCC[22]  HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. Thor: Batch Processing Engine Roxie: High Perf. Query Engine
  • 35.  Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications  Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object (the driver program). Apache Spark [28] Lightning-fast cluster computing
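The two-stage map/reduce paradigm referred to above can be illustrated in pure Python, without a cluster (function names are illustrative; real MapReduce/Spark distribute these phases across machines): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then sum each group.
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "big clusters"]))
assert counts == {"big": 2, "data": 1, "clusters": 1}
```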
  • 36.  The SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications  Once connected, it first acquires executors (processes that run computations and store data) on nodes. Next, it sends application code (defined by JAR or Python files passed to SparkContext) to the executors.  Finally, SparkContext sends tasks for the executors to run. Apache Spark [28] Lightning-fast cluster computing
  • 37. Quiz (Match The Following)  Big Data is generally defined by:  HPCC is example of:  Cassandra is a:  MongoDB is a:  One of key Characteristic of Big Data Solution:  Characteristic of Traditional storage system:  NoSQL DB characteristic:  Schema design before Storage (Schema on write)  RDBMS  Column-family Database  Key-Value Database  Graph Database  Document Database  3Vs - Volume, Veracity, Variety  4Vs – Volume, Velocity, Variety, Value  Big Data Solution  Processing is closer to data location  Schema on Read
  • 38.
  • 40.  A Hadoop cluster consists of a set of cheap commodity hardware machines, networked together as racks of servers Hadoop – 1/
  • 41.  Hadoop framework allows for the distributed processing of large data sets across clusters of computers. Can scale up from single servers to thousands of machines each offering local computation and storage.  Designed to detect and handle failures so as to deliver a highly-available service on top of a cluster of computers, each of which may be prone to failures.  The project includes these modules:  Hadoop Common: The common utilities that support the other Hadoop modules.  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.  Hadoop YARN: A framework for job scheduling and cluster resource management.  Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. Hadoop – 2/2 [6, 31]
  • 42.  HDFS is a fault-tolerant distributed file system
 HDFS provides familiar POSIX-style file system features:
 File, directory and sub-directory structure
 Permissions (rwx)
 Access (owner, group, others) and a super user
 Optimized for storing large files with streaming data access (not random access)
 The file system keeps checksums (CRC32 per 512 bytes) of data for corruption detection and recovery
 Files are stored across multiple commodity hardware machines in a cluster
 Files are divided into uniform-sized blocks (64 MB, 128 MB, 256 MB)
 Blocks are replicated across multiple machines to handle failures
 Provides access to block locations (servers/racks), so computations can be done at the same locations (the same servers/racks on which the data resides)
Hadoop Distributed File System – 1/9
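The block-splitting and replication ideas above can be sketched in a few lines of Python (the block size, node names, and round-robin placement are toy simplifications; real HDFS uses 64/128/256 MB blocks and rack-aware placement):

```python
BLOCK_SIZE = 4     # toy value; HDFS defaults are 64/128/256 MB
REPLICATION = 3    # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file into uniform-sized blocks (last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes, round-robin style."""
    return [[nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)]

blocks = split_into_blocks(b"0123456789")  # 10 bytes -> blocks of 4, 4, 2
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
assert blocks == [b"0123", b"4567", b"89"]
assert all(len(replicas) == 3 for replicas in placement)
```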
  • 43. HDFS – 2/9  HDFS is implemented as a master-slave architecture  The NameNode is the master; a Secondary NameNode maintains checkpoints of its metadata  DataNodes are slaves, serving read/write commands and data (Figure: NameNode with Secondary NameNode checkpoints; DataNodes exchanging metadata, read/write commands and data)
  • 44. HDFS – 3/  The NameNode manages the file system:  File system names (e.g. /home/foo/data/ ..) and metadata  Maps a file name to a set of blocks  Maps a block to the DataNodes where it resides  Blocks within the file system and their replicas  Manages cluster configuration  Manages the DataNodes  In case of NameNode failure, the namespace can be restored from the Secondary NameNode's checkpoints (the Secondary NameNode is a checkpointing helper, not an automatic hot standby) (Figure: NameNode metadata mapping files to DataNodes DN1..DNN)
  • 45. HDFS – 4/9  Metadata  The HDFS namespace is a hierarchy of files and directories  The entire metadata is held in memory  No demand paging  Consists of the list of blocks for each file and file attributes, e.g. access time, replication factor, etc.  Changes to HDFS are recorded in a log called the ‘Transaction Log’  Block placement: 3 replicas by default, configurable  One replica on the local node, the second on a remote rack, the third on the same remote rack  Additional copies are randomly placed  Clients read from the nearest replica
  • 46. HDFS: Data Nodes – 5/9  DataNode
 A slave daemon process that reads/writes HDFS blocks from/to files in its local file system
 During startup, performs a handshake with the NameNode to verify the namespace and the DataNode's software version (on a version mismatch, the DataNode shuts down)
 Periodically sends heartbeats and block reports to the NameNode
 A heartbeat carries the total storage capacity, fraction used, ongoing data transfers, etc.
 These stats are used by the NameNode for block placement and load balancing
 A block report has the block ID, timestamp, and block length for each replica
 Has no awareness of the HDFS file system
 Does block creation, deletion, replication, shutdown, etc. when the NameNode commands it
 NameNode commands are sent as replies to received heartbeat messages
 Stores each HDFS block in a separate file of the underlying OS's file system
 Maintains an optimal number of files per directory, creating new directories as needed
 Interacts directly with clients to read/write blocks
  • 47.  Java/C++ APIs are available to access files on HDFS  Sample code illustrates writing to HDFS as a 3-step process: HDFS Write – Sample Code: 6/9
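The original sample code did not survive extraction; a minimal Python sketch of the same three-step pattern (get a file system handle, write through an output stream, close), using a hypothetical in-memory `FileSystem` stub in place of Hadoop's real Java `FileSystem` client:

```python
import io

class OutputStream(io.BytesIO):
    """Output stream that 'flushes' its bytes into the stand-in
    file system when closed, like the last block reaching the DataNodes."""
    def __init__(self, fs, path):
        super().__init__()
        self.fs, self.path = fs, path
    def close(self):
        self.fs.files[self.path] = self.getvalue()
        super().close()

class FileSystem:
    """Stand-in for an HDFS client; 'files' maps paths to bytes so the
    write pattern can be shown without a live cluster."""
    def __init__(self):
        self.files = {}
    def create(self, path):
        return OutputStream(self, path)

# Step 1: obtain a handle to the (stand-in) file system
fs = FileSystem()
# Step 2: create the file, getting an output stream, and write to it
out = fs.create("/home/foo/data/sample.txt")
out.write(b"hello hdfs")
# Step 3: close the stream
out.close()
print(fs.files["/home/foo/data/sample.txt"])  # b'hello hdfs'
```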
  • 48.  Figure below illustrates how write takes place (how blocks and their replicas are updated): HDFS Write Operations: 7/9
  • 49.  Sample code illustrates reading from HDFS as a 4-step process: HDFS Read – Sample Code: 8/9
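As with the write sample, the original code did not survive extraction; a sketch of the four-step read pattern (get a file system handle, open an input stream, read, close) against a hypothetical in-memory stub rather than a live cluster:

```python
import io

class FileSystem:
    """Stand-in HDFS client: 'files' maps paths to bytes so the
    four-step read pattern can be shown without a live cluster."""
    def __init__(self, files):
        self.files = files
    def open(self, path):
        # the real client would ask the NameNode for block locations
        # and stream each block from the nearest DataNode replica
        return io.BytesIO(self.files[path])

# Step 1: obtain a handle to the (stand-in) file system
fs = FileSystem({"/home/foo/data/sample.txt": b"hello hdfs"})
# Step 2: open the file, getting an input stream
stream = fs.open("/home/foo/data/sample.txt")
# Step 3: read the data from the stream
data = stream.read()
# Step 4: close the stream
stream.close()
print(data)  # b'hello hdfs'
```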
  • 50.  Figure below illustrates how read takes place HDFS Read Operations: 9/9
  • 51. YARN – Yet Another Resource Negotiator[33]  Manages compute resources across the cluster  Consists of the following components:  Resource Manager (RM)  Manages and allocates cluster compute resources  Node Manager (NM), one on each node  Manages and enforces node resource allocations  Application Master (AM)  One per application  Manages the app lifecycle and task scheduling  Container  Basic unit of allocation  Allows fine-grained resource allocations
  • 52. YARN: Resource Manager  Resource Manager  Manages nodes – tracks heartbeats from NodeManagers  Manages containers  Handles AM requests for resources  De-allocates containers when they expire or the application completes  Manages AMs (ApplicationMasters)  Creates a container for each AM and tracks heartbeats  Manages security  Supports Kerberos
  • 53. YARN: NodeManager  A Node Manager resides on each node  Registers with the ResourceManager (RM) and provides info on node resources  Sends periodic heartbeats and container status  Manages processes in containers  Launches AMs on request from the RM  Launches application processes on request from an AM  Monitors resource usage by containers; kills rogue processes  Provides logging services to applications  Aggregates logs for an application and saves them to HDFS  Maintains node-level security via ACLs
  • 54.  Container  Created by the Resource Manager upon request  Allocates a certain amount of resources (CPU, memory)  Applications run in one or more containers  Application Master (AM)  One per application  Framework/application specific  Runs in a container  Requests more containers to run application tasks YARN: Containers and AMs
  • 55.  Client requests the RM to launch an application  The RM launches an Application Master on one NodeManager YARN: Starting an App: 1/2
  • 56.  The Application Master (AM) requests resources from the RM; the RM allocates resources on NodeManagers  The RM confirms the allocations to the AM with details; the AM launches the app on the allocated containers YARN: Starting an App: 2/2 [Figure] Resource Request: • Resource Name (hostname, rack#) • Priority (within this app) • Resources required: memory (MB), CPU (# of cores), etc. • Number of containers Allocation reply: Container ID and node (e.g. C1@NM1, C2@NM2) Container Launch Context: • Container ID • Commands (to start MyApp) • Environment (configuration) • Local Resources (e.g. MyApp binary, HDFS files)
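The request and launch-context structures in the figure can be sketched as plain records; the field names below mirror the slide's callouts and are illustrative, not the actual YARN protobuf definitions:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceRequest:
    """What an AM asks the RM for (hypothetical field names)."""
    resource_name: str      # hostname or rack# where containers are wanted
    priority: int           # priority within this application
    memory_mb: int          # memory per container, in MB
    vcores: int             # CPU cores per container
    num_containers: int     # how many such containers

@dataclass
class ContainerLaunchContext:
    """What the AM hands a NodeManager to start the app in a container."""
    container_id: str                                   # e.g. C1@NM1
    commands: list                                      # commands to start MyApp
    environment: dict = field(default_factory=dict)     # configuration
    local_resources: list = field(default_factory=list) # e.g. app binary, HDFS files

req = ResourceRequest("rack1", priority=1, memory_mb=1024, vcores=2,
                      num_containers=2)
ctx = ContainerLaunchContext("C1@NM1", commands=["./MyApp"],
                             environment={"APP_MODE": "cluster"},
                             local_resources=["hdfs:///apps/MyApp.jar"])
print(req.num_containers, ctx.container_id)  # 2 C1@NM1
```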
  • 57.  MapReduce, originally a proprietary Google technology, is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, for tasks like:  Analyzing application logs  Aggregating related data from external sources  Transforming data from one format to another  Exporting data for external analysis Map-Reduce[34]
  • 58. MapReduce Operation [Figure: Input from DB/HDFS → Map → Shuffle → Reduce → Output to DB/HDFS]
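The map → shuffle → reduce pipeline in the figure can be sketched with the canonical word-count example in plain Python; a real job would run many map and reduce tasks in parallel across the cluster, with the framework performing the shuffle:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data in motion"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'in': 1, 'motion': 1}
```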
  • 59.  Zookeeper exposes primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.  Clients connect to servers to access name space which is much like that of a standard file system to store/retrieve co-ordination data - status information, configuration, location information, etc., data is usually small, in the byte to kilobyte range.  Guarantees:  Sequential Consistency - Updates will be applied in the order that they were sent.  Atomicity - Updates either succeed or fail. No partial results.  Single System Image – Same view of the service regardless of the server used  Reliability - Once an update has been applied, it persists until updated.  Timeliness - View of the system is guaranteed to be up-to-date within a time bound. Zookeeper: A Distributed Coordination Service for Distributed Applications[29]
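The file-system-like namespace and the atomicity guarantee above can be sketched with a small in-memory store; `ZNodeStore` and its methods are illustrative stand-ins, not the real ZooKeeper client API (which additionally offers watches, ephemeral nodes, and sequential nodes):

```python
class ZNodeStore:
    """In-memory sketch of ZooKeeper's namespace: small data blobs
    stored at slash-separated paths, each with a version number so
    updates are atomic (they succeed whole or fail with no partial
    result, and stale writers are rejected)."""
    def __init__(self):
        self.nodes = {}   # path -> (data, version)

    def create(self, path, data=b""):
        if path in self.nodes:
            raise FileExistsError(path)
        self.nodes[path] = (data, 0)

    def get(self, path):
        return self.nodes[path]   # (data, version)

    def set(self, path, data, expected_version):
        _, version = self.nodes[path]
        if version != expected_version:   # stale update: fail, no partial write
            raise ValueError("version mismatch")
        self.nodes[path] = (data, version + 1)

zk = ZNodeStore()
zk.create("/app/config", b"workers=4")
data, ver = zk.get("/app/config")
zk.set("/app/config", b"workers=8", expected_version=ver)
print(zk.get("/app/config"))  # (b'workers=8', 1)
```

The version check is what lets clients build higher-level coordination (locks, leader election) without lost updates.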
  • 60. Big Data – Business Trends
  • 62. Impact of Big Data on Economy
  • 63.  Top 10 Big Data Challenges Big Data Challenges
  • 65. Big Data Market Forecast
  • 66. Big Data: Revenues [Bar chart: 2013 Big Data revenue in $ millions, values ranging from 1,368 down to 175]
  • 67.  Government Operation: National Archives and Records Administration, Census Bureau  Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)  Defense: Sensors, Image surveillance, Situation Assessment  Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity  Deep Learning and Social Media: Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets  The Ecosystem for Research: Metadata, Collaboration, Language Translation, Light source experiments  Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan  Earth, Environmental and Polar Science: Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors  Energy: Smart grid Big Data Applications
  • 68.  Real-Time Analytics: Banking and finance, disaster detection and recovery, event monitoring, etc.; these applications need vast amounts of data, arriving at a very fast pace, to be processed within strict time limits  Artificial Intelligence/Business Intelligence:  Intelligent Maintenance Systems: systems that utilize data collected from machinery in order to predict and prevent potential failures  IoT/M2M: These applications generate data at a very fast rate (high velocity) from a huge number of sources (high volume) and require big data solutions to process them and derive meaningful information.  Transreality gaming, sometimes written as trans-reality gaming, describes a type or mode of gameplay that combines playing a game in a virtual environment with game-related, physical experiences in the real world and vice versa. Emerging Trends in Big Data
  • 69.  Advances in cloud computing have helped Big Data emerge as a mass-scale solution  Leased/rented data storage and computing clusters enable even startups to build global-scale Big Data capability without major capital investment Emerging Trends in Cloud Computing – Complementary Technologies
  • 70.  Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program  The Big Data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.  Big Data Engineering: Advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.  NoSQL: Non-relational models, also known as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.  A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. Terms[3] – 1/4
  • 71.  The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.  The data lifecycle is the set of processes that transform raw data into actionable knowledge.  Analytics is the extraction of knowledge from information.  Data science is the construction of actionable knowledge from raw data through the complete data lifecycle process.  A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data lifecycle.  Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.  Computational portability is the movement of the computation to the location of the data. Terms[3] – 2/4
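Schema-on-read can be shown in a few lines: raw records are stored untyped, and the schema (field names and types, chosen here purely for illustration) is applied only at read time, in contrast to schema-on-write systems that validate on ingest:

```python
# Raw records land in storage as-is, with no schema enforced at write time.
raw_rows = ["1,alice,2021-03-01", "2,bob,2021-03-02"]

# An illustrative schema: (field name, type cast) pairs applied on read.
schema = [("id", int), ("name", str), ("joined", str)]

def read_with_schema(rows, schema):
    """Schema-on-read: split, name, and type-cast fields only when
    the data is read, so the same raw bytes can serve many schemas."""
    for row in rows:
        values = row.split(",")
        yield {name: cast(v) for (name, cast), v in zip(schema, values)}

records = list(read_with_schema(raw_rows, schema))
print(records[0])  # {'id': 1, 'name': 'alice', 'joined': '2021-03-01'}
```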
  • 72.  Transaction processing is a style of computing that divides work into individual, indivisible operations, called transactions.  Relational databases have traditionally supported the ACID transaction model. ACID transactions are:  Atomic Either all of the actions in a transaction are completed (i.e., the transaction is committed) or none of them are completed (i.e., the transaction is rolled back).  Consistent The transaction must begin and end with the database in a consistent state and must comply with all protocols (i.e., rules) of the database.  Isolated The transaction will behave as if it is the only operation being performed upon the database.  Durable The results of a committed transaction can survive system malfunctions.  The BASE acronym is often used to describe the types of transactions typically supported by nonrelational databases. A BASE system is described in contrast to ACID-compliant systems as:  Basically Available, Soft state, and Eventually Consistent  BASE transactions allow a database to be in a temporarily inconsistent state that will eventually be resolved. Terms[3] – 3/4
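The "A" in ACID, all-or-nothing commit, can be sketched with a toy key-value store; `TinyDB` and its `transact` method are illustrative, not any real database API:

```python
import copy

class TinyDB:
    """Sketch of atomicity: a transaction's writes are applied to a
    private copy of the data and become visible all at once on commit,
    or are discarded entirely if any operation fails."""
    def __init__(self):
        self.data = {}

    def transact(self, operations):
        working = copy.deepcopy(self.data)   # work on a private copy
        try:
            for op in operations:
                op(working)                   # each op mutates the copy
        except Exception:
            return False                      # roll back: nothing applied
        self.data = working                   # commit: everything applied
        return True

db = TinyDB()
ok = db.transact([lambda d: d.update(a=1), lambda d: d.update(b=2)])
print(ok, db.data)  # True {'a': 1, 'b': 2}

def failing(d):
    d["c"] = 3
    raise RuntimeError("constraint violated")

ok = db.transact([failing])
print(ok, db.data)  # False {'a': 1, 'b': 2} -- the partial write to 'c' is discarded
```

A BASE system would instead accept the write immediately and reconcile replicas later, trading this strictness for availability.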
  • 73.  CAP Theorem states that a distributed system can support only two of the following three characteristics:  Consistency The client perceives that a set of operations has occurred all at once.  Availability Every operation must terminate in an intended response.  Partition tolerance Operations will complete, even if individual components are unavailable. Terms[3] – 4/4
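The trade-off bites only when the network partitions: a two-replica store must then give up either availability (refuse writes so replicas never diverge) or consistency (accept writes that only reachable replicas see). A hedged sketch, with `CPStore`/`APStore` as illustrative names:

```python
class Replica:
    def __init__(self):
        self.value = None

class CPStore:
    """CP choice: during a partition, writes are refused (sacrificing
    availability) so the two replicas can never diverge."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            raise ConnectionError("unavailable during partition")
        for r in self.replicas:
            r.value = value

class APStore:
    """AP choice: writes are always accepted, but during a partition
    only reachable replicas see them (sacrificing consistency until
    the replicas are eventually reconciled)."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, value):
        reachable = self.replicas[:1] if self.partitioned else self.replicas
        for r in reachable:
            r.value = value   # the cut-off replica is left stale
```

This is the same trade-off the BASE model on the previous slide makes explicit: AP systems accept temporary inconsistency in exchange for staying available.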
  • 74. 1. Webopedia: http://www.webopedia.com/TERM/B/big_data.html 2. Gartner Big Data Article: Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012 3. NIST definitions: http://bigdatawg.nist.gov/_uploadfiles/BD_Vol1-Definitions_V1Draft_Pre-release.pdf 4. Extreme Big Data: http://www.forbes.com/sites/oracle/2013/10/09/extreme-big-data-beyond-zettabytes-and-yottabytes/ 5. Presto Project: https://prestodb.io/ 6. Hadoop Project: http://hadoop.apache.org/ 7. Xoriant Big Data Report: http://www.xoriant.com/big-data-services 8. Big Data Article: http://www.slideshare.net/Codemotion/codemotionws-bigdata-conf 9. Big Data Article at Data Science Central: www.datasciencecentral.com 10. Big Data Article by IBM: http://www.ibmbigdatahub.com/infographic/four-vs-big-data 11. Big Data Article: http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/ 12. Dryad Project: http://research.microsoft.com/en-us/projects/Dryad/ 13. Data Variety: http://www.bi-bestpractices.com/view-articles/5643 References
  • 75. 14. Data Growth Article: http://www.businessinsider.in/Social-Networks-Like-Facebook-Are-Finally-Going-After-The-Massive-Amount-Of-Unstructured-Data-Theyre-Collecting/articleshow/31055495.cms 15. Cloudera Modern Data Operating System: http://www.slideshare.net/awadallah/introducing-apache-hadoop-the-modern-data-operating-system-stanford-ee380 16. Google BigTable: www.research.google.com/archive/bigtable-osdi06.pdf 17. Google Spanner: www.research.google.com/archive/spanner-osdi2012.pdf 18. Apache HBase: hbase.apache.org/ 19. Mongo DB: www.mongodb.org/ 20. BSON Specs: http://bsonspec.org/ 21. JSON Specs: http://www.json.org/ 22. LexisNexis HPCC: http://hpccsystems.com/ 23. Definition of Big Data Analytics: http://www.webopedia.com/TERM/B/big_data_analytics.html 24. Big Data, Mining, and Analytics: Components of Strategic Decision Making, Mar 2014, Stephan Kudyba, CRC Press. 25. Big Data Use Cases: http://bigdatawg.nist.gov/usecases.php 26. Big Data Analytics with R and Hadoop, Vignesh Prajapati, PACKT Publishing. References
  • 76. 27. IBM Article: Transforming Energy and Utilities through Big Data & Analytics: http://www.slideshare.net/AndersQuitzauIbm/big-data-analyticsin-energy-utilities 28. Apache Spark: https://spark.apache.org/ 29. Zookeeper: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html 30. Neo4j Database: http://neo4j.com/developer/graph-database/ 31. Apache Hadoop: http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F 32. HDFS: http://www.slideshare.net/hanborq/hadoop-hdfs-detailed-introduction 33. YARN: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex 34. MapReduce: https://cloud.google.com/appengine/docs/python/dataprocessing/ 35. MapReduce@Wiki:http://en.wikipedia.org/wiki/MapReduce 36. Investments in Big Data: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017 37. Big Data Challenges: http://infographicsmania.com/big-data-challenges/ References