Greg Chase, Director, Product Marketing, presents Big Data 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
This is an architecture of a Business Data Lake. It is centered around Hadoop-based storage. It includes tools and components for ingesting data from different kinds of data sources, processing data for analytics and insights, and for supporting applications that utilize data, implement insights, and contribute data back to the data lake. In this presentation, we will look at the various components of a business data lake architecture, and show how to use it to maximize the value of your company’s data.
Let’s first look at why Hadoop and HDFS make a lot of sense for a data lake.
Hadoop’s underlying storage layer, the Hadoop Distributed File System, or HDFS, is a distributed file system that supports arbitrarily large clusters. This means your data storage can theoretically grow as large as you need. You simply add more nodes to the cluster as you need more space.
HDFS is schema-less, which means it can support files of any type and format. This is great for storing unstructured or semi-structured data, as well as non-relational data formats such as binary streams from sensors, image data, and machine logs. It’s also just fine for storing structured, relational, tabular data.
When your data storage can take any kind of data from any kind of source, allowing this data to be loaded and stored can be a challenge. This is why a wide selection of tools for ingest is needed to implement a data lake.
Batch loading can be achieved with a variety of tools, depending on the data sources involved.
Sqoop, for example, is great for handling large data batch loading, and can even pull data from legacy databases.
On the other hand, if your bulk loading operation needs additional processing, such as transforming data from one format to another, creating metadata, or running analytics during the load, then another open source tool, Spring XD, provides the scale and flexibility to handle your specific needs.
Microbatch loading, in other words smaller but recurring batch loads such as data change deltas or event-triggered updates, is handled well by Flume.
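To make the microbatch idea concrete, here is a minimal sketch in plain Python. It is not Flume’s API; Flume is set up through configured sources, channels, and sinks. The function name `micro_batches` and the sample records are hypothetical and only illustrate grouping a stream of change deltas into small, recurring loads.

```python
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream of records into small, fixed-size batches."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand off one small load
            batch = []
    if batch:                    # flush the final, partial batch
        yield batch


# Hypothetical change deltas arriving from a source system.
deltas = [f"row-{i}" for i in range(7)]
batches = list(micro_batches(deltas, batch_size=3))
```

Each yielded batch stands for one small load into the lake; the last batch may be shorter than the rest.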
Storing high-velocity data in Hadoop is a different challenge altogether, considering that your source could arrive at any volume as well as any speed. If ensuring you store all the data is paramount, you need tools that can capture and queue data at any scale or volume until the Hadoop cluster is able to store it.
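The capture-and-queue pattern can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class name `IngestBuffer` is hypothetical, the `stored` list stands in for the Hadoop cluster, and a real deployment would use a distributed, fault-tolerant queue rather than an in-process one.

```python
from collections import deque
from typing import Deque, List


class IngestBuffer:
    """Hold incoming events until the storage layer can accept them."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self.stored: List[str] = []   # stand-in for data written to Hadoop

    def capture(self, event: str) -> None:
        """Accept an event immediately, regardless of storage availability."""
        self._pending.append(event)

    def drain(self, capacity: int) -> int:
        """Move up to `capacity` events into storage, in arrival order."""
        moved = 0
        while self._pending and moved < capacity:
            self.stored.append(self._pending.popleft())
            moved += 1
        return moved

    def backlog(self) -> int:
        """Number of events still waiting to be stored."""
        return len(self._pending)


buffer = IngestBuffer()
for i in range(5):
    buffer.capture(f"event-{i}")      # fast producer
first_pass = buffer.drain(capacity=2)  # storage accepts only two for now
```

Nothing is dropped while storage is slow; the backlog simply grows until a later drain catches up.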
A data lake based on Pivotal Big Data Suite has two tools built for these use cases. In fact they can work together:
Spring XD can scale to handle data streaming in real time, and provides the same processing and analysis capabilities it offers for batch loads.
Pivotal GemFire XD can work with Spring XD to provide advanced database operations, such as searching for duplicates within a window of time, and allows you to ensure consistency of data in writes. Since it’s a SQL-based database, it’s also great for helping convert or add structure to ingested data.
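As an illustration of what searching for duplicates in a window of time means, here is a small Python sketch. It does not use GemFire XD, where this would more likely be expressed as a SQL query; the function name, the event layout, and the five-second window are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event key)


def drop_duplicates_in_window(events: Iterable[Event], window_seconds: float) -> List[Event]:
    """Keep an event unless the same key was seen within `window_seconds` before it.

    Assumes events arrive in timestamp order.
    """
    last_seen: Dict[str, float] = {}
    kept: List[Event] = []
    for timestamp, key in events:
        previous = last_seen.get(key)
        if previous is None or timestamp - previous > window_seconds:
            kept.append((timestamp, key))
        last_seen[key] = timestamp   # every sighting refreshes the window
    return kept


# Hypothetical stream: key "a" repeats 2 seconds after its first sighting.
stream = [(0.0, "a"), (1.0, "b"), (2.0, "a"), (10.0, "a"), (10.5, "b")]
unique = drop_duplicates_in_window(stream, window_seconds=5.0)
```

Here the repeat of "a" at 2 seconds is treated as a duplicate, while the later events fall outside the window and are kept.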
Once you have the ability to store and load data into your data lake, the next step is deriving business value by processing the data, gaining insights, and taking action on them.
It’s great that one can get any kind of data into an HDFS data store. However, to be able to conduct advanced analytics on it, you often need to make it accessible to analysis tools that expect structured data.
This kind of processing may involve direct transformation of file types, or it might simply mean analyzing the file and creating metadata about it.
This can be done on ingest with some of the tools described, or after the data has been stored in Hadoop.
Examples might be transforming binary image formats into RDBMS tables to enable large-scale image processing, or even simple ETL processes on web logs so that they can later be turned into fact tables.
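A simple ETL step on web logs might look like the following Python sketch. It assumes the logs use the Common Log Format; the sample lines, the field names, and the choice to store a missing size as 0 are illustrative assumptions, not part of any particular tool.

```python
import re
from typing import Dict, List, Optional

# Pattern for Common Log Format lines (an assumption about the input).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Turn one raw log line into a structured row, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row: Dict[str, object] = dict(match.groupdict())
    row["status"] = int(match.group("status"))
    size = match.group("size")
    row["size"] = 0 if size == "-" else int(size)   # "-" means no body was sent
    return row


raw_lines = [
    '10.0.0.1 - - [10/Oct/2014:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2014:13:55:40 -0400] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
parsed = [parse_line(line) for line in raw_lines]
rows: List[Dict[str, object]] = [row for row in parsed if row is not None]
```

Each resulting row has typed columns (host, time, method, path, status, size) that could later feed a fact table; lines that do not match are skipped rather than guessed at.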
Once you have structure applied to your data, it’s possible to leverage SQL-based tools to do fast processing on your data for advanced analytics and data science. Only HAWQ provides full analytic SQL support on Hadoop with massively parallel processing. This allows you to enjoy very high performance when leveraging advanced analytics functions in MADlib, as well as when using analytics applications such as SAS.
With structure applied to your data, and the ability to deploy advanced analytics, you can now start doing some very powerful investigation, all of it supported by Hadoop.
By discovering relationships between otherwise seemingly unrelated data sets, it’s possible to discover correlations and potential causation, and create multi-dimensional analytical models that yield more precise predictive analytics.
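As a small illustration of relating two otherwise separate data sets, the Python sketch below joins them on a shared key and computes a Pearson correlation. The data sets, their names, and their values are made up for the example, and a correlation on its own says nothing about causation.

```python
from math import sqrt
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Two hypothetical data sets from unrelated sources, keyed by day.
temperature_by_day = {"2014-10-01": 18.0, "2014-10-02": 22.0,
                      "2014-10-03": 26.0, "2014-10-04": 30.0}
sales_by_day = {"2014-10-02": 140.0, "2014-10-03": 180.0,
                "2014-10-04": 220.0, "2014-10-05": 90.0}

# Join on the days present in both data sets, then correlate.
shared_days = sorted(temperature_by_day.keys() & sales_by_day.keys())
r = pearson([temperature_by_day[d] for d in shared_days],
            [sales_by_day[d] for d in shared_days])
```

A coefficient near 1 or -1 flags a relationship worth investigating further; in a data lake the same join would run over far larger, messier data sets.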
Since HDFS allows you to store as much data as you want at very low cost, it’s possible to store larger, detailed data sets such as time series feeds and application logs. In traditional data warehousing, ETL processes aggregate and summarize this information to facilitate reporting, and lose detail in the process. By saving the detail, it’s possible to run machine learning algorithms on the data to help build more accurate predictive analytics.
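The cost of summarizing away detail can be shown with a tiny Python sketch. The sensor readings and the tolerance are hypothetical; the point is only that a single aggregated value can hide an event that the detailed records reveal.

```python
from statistics import mean, median
from typing import List

# Hypothetical time series feed: one reading is a clear spike.
readings = [20.1, 20.3, 19.9, 20.0, 35.0, 20.2]

# What a summarizing ETL step would keep: a single average.
summary = mean(readings)


def outliers(values: List[float], tolerance: float) -> List[float]:
    """Return readings that differ from the median by more than `tolerance`."""
    center = median(values)
    return [value for value in values if abs(value - center) > tolerance]


# What the detailed data still allows: finding the individual spike.
spikes = outliers(readings, tolerance=5.0)
```

The average alone looks like a slightly warm period, while the retained detail shows that one reading was far out of line, which is exactly the kind of signal a predictive model can learn from.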
Distributed in-memory databases such as Pivotal GemFire XD make it possible to deploy real-time, data-driven automation at scale. This means you can deploy applications that respond to and process incoming streaming data, such as Internet of Things applications, or that support large-scale mobile and web applications. You want to create intelligent user experiences, and provide smart automation and processing in the backend. You also want to be able to capture and store detailed logging of all interactions for further analysis.
Deploying automation at scale, capturing and storing all data, and analyzing it to discover insights and algorithms together form an ongoing process of continuous improvement and innovation.
Using the full capabilities of a data lake, from storing massive data sets to achieving continuous innovation, allows your company to maximize the business value it generates from its data.