The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and is projected to reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data, using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc.
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are now seeing increased pressure to modify and improve this basic blueprint because:
A) this approach creates silos of data, making it difficult to share the data or get a holistic view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult, if not impossible. This limits flexibility and insight.
Finally, the NEW types of data emerging as we digitize the world around us, such as clickstream and machine sensor data, are growing at exponential rates. We are all becoming data-driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
When you distill down all the “new” types of big data that are being managed by Hadoop, they generally fall into six categories, represented as columns on the left side of this slide:
- sentiment & web,
- clickstream,
- machine & sensor,
- geographic data,
- server logs and
- general unstructured content, the stuff we find in docs and PDFs throughout our organization.
Within various verticals, best-practice architectures have emerged to surface the value of Hadoop and HDP. Some representative examples appear here:
Advertisers target ads to their best customer segment and also analyze point-of-sale data to determine the effectiveness of campaigns.
Banks detect fraud and money laundering while also improving customer service.
Hospitals respond to patients in real time and then analyze historical data to reduce readmission rates.
Manufacturers control quality on the production line and then diagnose product defects in the aggregate.
Oil companies predict and repair equipment proactively and also analyze equipment durability under varied circumstances.
Telecoms allocate bandwidth in real time, and later discover unforeseen patterns after analyzing billions of historical call records.
Retailers make sure the shelves are stocked today and also plan their product mix for next year.
**** WHAT IS YOUR USE CASE???
YARN enables the modern data architecture: it turns Hadoop into a truly multi-purpose data platform, with batch, interactive and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster in which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop:
- It provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point for all data processing engines, both from the open source community and from the commercial vendor ecosystem
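To make the “single cluster, many workloads” idea concrete: one common way to carve a YARN cluster into batch, interactive and real-time tenants is the CapacityScheduler. The property names below are standard CapacityScheduler settings, but the queue names and percentages are illustrative assumptions, not values from this deck — a minimal capacity-scheduler.xml sketch:

```xml
<configuration>
  <!-- Three illustrative queues sharing one YARN cluster
       (queue names "batch", "interactive", "realtime" are hypothetical) -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>batch,interactive,realtime</value>
  </property>
  <!-- Guaranteed share of cluster resources per queue, in percent -->
  <property>
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.realtime.capacity</name>
    <value>20</value>
  </property>
</configuration>
```

Each engine then submits its applications to the appropriate queue, and YARN arbitrates the shared compute and storage underneath.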
Hadoop has evolved over the years to provide not only linearly scalable compute and storage, but also the explicit functions needed to make it a complete data platform. New projects spun up around Hadoop to meet the complex requirements of the modern enterprise.
A good way to look at the evolution of Hadoop is through this picture.
- When Hadoop began it was simply a data management layer (HDFS) and a single data access engine (MapReduce). Over the past several years the range of components in the Hadoop ecosystem has exploded:
- Data Access - The emergence of multiple access engines spanning SQL, NoSQL, Scripting, Streaming and more. YARN ensures that they all can be part of Hadoop seamlessly.
- Security - To address the key requirements of authorization, access, audit/accounting and data protection
- Operations - Tools to manage the platform
- Governance and integration - Tools to load and manage data according to policy
These are all core requirements of any data platform, and over time the Hadoop community has expanded to include all of these capabilities. Why five categories? Because each addresses the requirements of a different persona that engages with a data platform:
Developers (Data Access)
Administrators (Security, Operations)
Data Architects (Governance and Integration)
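To make the original access engine concrete: the MapReduce model that Hadoop began with can be sketched in plain Python. This is a toy illustration of the map, shuffle and reduce phases on a word-count problem — not the Hadoop Java API, and the function names are my own.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # "the" appears once in each of the 3 documents -> 3
```

In Hadoop the map and reduce steps run in parallel across the cluster and the shuffle moves data between nodes, but the dataflow is exactly this.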