6. What is a Data Lake?
A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.
[Slide diagram: data sources (sensors, email, transactions, machine logs, geolocation, media) feed the data lake, which in turn supplies downstream consumers (BI tools, IDW, data marts, analysis, apps, other).]
7. Value from Data Lakes
• New insights from unknown or under-appreciated data
• New forms of analytics
• Expanded corporate memory retention
• Data integration optimization
9. Data Manufacturing: Logical View of Workloads
DATA R&D
• Goal: analytic agility, flexibility
• Exploratory tools, algorithms, skills
• Finding new high value questions
• Light governance, no SLAs
• Data scientists, data miners
DATA LAKE
• Goal: original raw data at low cost
• Refinery feeds data R&D, data products
• Medium governance, SLAs
• Low business value density
• Programmers and data scientists
DATA PRODUCTS
• Goal: consumable analytic results
• Integrated, cleansed, + metadata
• High governance, SLAs, cost
• High business value density
• Shared by many users, roles, skills
10. Data Lake Architecture
[Slide diagram: sources (sensors, email, social, telemetry, mobile, tabular data, machine logs) flow through acquisition, preparation, and access layers (message queues, feeds, cleansing, streams, aggregations, search, experiments, governance) built on distributed storage, with security, metadata/lineage, and administration spanning the stack. Analytic tools and apps (math and stats, data mining, business intelligence, applications, languages, marketing) serve users including marketing executives, operational systems, frontline workers, customers, partners, engineers, data scientists, and business analysts.]
Attunity, Hortonworks and Think Big sponsored research on Data Lake adoption and maturity. Working with Database Trends and Applications (DBTA), we had Radiant Advisors and Unisphere Research survey 385 IT practitioners and stakeholders at organizations within a variety of industries. Today, we’ll be discussing the results of the survey and talking about what we’re seeing from our customers who are using Data Lakes today.
The survey respondents were highly technical. Approximately 60% were IT and database administrators, while the remainder held CXO or similar executive leadership roles. Less than 5% of respondents were from academia.
The company size of respondents was made up of 30% large enterprise (over 20,000 employees worldwide), 48% mid-size (fewer than 20,000 but more than 250 employees), and 22% small business (fewer than 250 employees).
There was a broad spectrum of industry verticals, with finance and software companies the most highly represented at 13% and 11%. Other well-represented industries included government, education, and manufacturing. Over 80% of respondents were from North America.
At a high level, the research showed that:
The Data Lake is increasingly recognized within a data strategy
Clear early use cases exist for the Data Lake
Governance and security are still top of mind as challenges and success factors for the Data Lake
This chart shows that about 51% of respondents are familiar with the term “Data Lake” and 20% of respondents are actively involved with it.
Drawing upon many sources of definitions as well as on-site experience with customers, Teradata came to this definition. We locked 14 of our top experts in a room, including our CTO, VP of development, and the president of Think Big, to debate the definition. We came to agreement sooner than we expected.
Notice it does NOT say Hadoop but it does say low cost technologies which includes hardware.
Scalability is a crucial SLA. It also means your quad core Intel server with 10 terabytes is not a data lake. Call it a puddle.
Raw data is the key to the data lake’s goals. We want to keep the first version of a file in its native format. Yes, we will do light transformations. But if we have the original file, we can always repeat those transformations. If you only have derivative 5 in a series, you cannot reproduce derivatives 2, 3, and 4. Note that data extracted from ERP, CRM, and SCM systems is also raw data. Don’t try eating this stuff, cook it first.
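The “keep the original, replay the refinements” idea can be sketched in a few lines. This is an illustrative toy, not anything from the survey: each derivative is a pure function of the one before it, so as long as the raw record survives, any intermediate can be rebuilt on demand.

```python
# Toy illustration: derivatives as pure functions of the raw record.
# The record and the transforms are hypothetical examples.

RAW = "  2024-01-15 ORDER 1042 usd 19.99  "

def derive1(text):
    """Light cleanup: trim surrounding whitespace."""
    return text.strip()

def derive2(text):
    """Normalize case."""
    return text.upper()

def derive3(text):
    """Tokenize into fields."""
    return text.split()

# Only derivative 3 may be in active use, but because RAW was kept,
# derivatives 1 and 2 can always be reproduced.
d1 = derive1(RAW)
d2 = derive2(d1)
d3 = derive3(d2)
```

If derivative 5 were the only thing stored, none of the earlier stages could be recovered; storing RAW makes the whole chain reproducible.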
Note the emphasis on downstream services. This clarifies a huge role for the data lake.
The Data Lake is technology neutral; it is not tied to any particular product.
A data lake holds raw data and initial light refinements. Think of this as ETL; these are the first stages of refinement.
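A minimal sketch of what “light refinement” means here, using only the standard library. The field names and records are made-up examples; the point is that the refined copy sits alongside the untouched raw text, and the refinement does nothing beyond trimming and restructuring (no integration, no business rules).

```python
# Light refinement sketch: keep the raw text, add a cleansed view beside it.
# Records and field names are hypothetical.
import csv
import io

raw_csv = "id, amount \n1, 10.5\n2,  3.0\n"

def light_refine(raw_text):
    """First-stage ETL: strip whitespace and restructure rows as dicts.
    Deliberately leaves values as strings; heavier typing and cleansing
    belong to the data-products stage."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (c.strip() for c in row))) for row in rows[1:]]

# The lake keeps both: the original bytes and the lightly refined view.
lake = {"raw": raw_csv, "refined": light_refine(raw_csv)}
```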
The data products often get their data from the data lake. The data is further refined to a final state in the data products system. Data products are what the business user consumes with their applications or BI tools. Data R&D is research: looking for the questions that should become regular business tasks. We tend to equate this with data scientists and Aster, but it’s not limited to these. Like everything else in this diagram, data R&D is an abstract concept, not a technology.
Yes, there is some overlap where each of these systems can do the work of the other. For example, analytics can be applied in all 3 systems. But best fit engineering drives most implementations to the system with the most capability for a specific task.
Think of it like a manufacturing line: the data lake receives the raw ore from dirt in the ground and refines it into steel ingots. The data products division refines it further into consumer goods such as pans, lamps, and automobiles. The R&D division is constantly looking at the raw dirt and ingots for traces of new ores or clues about what can be done with these lightly refined materials.
Each of these major subsystems has different service levels for availability, performance, data quality, and data freshness. The SLAs differ for a data product versus a data lake versus data R&D. For example, a data scientist cares less about high performance and data quality than about agility; they don’t want data models or ETL. Data products need high availability, high performance, and high data quality.
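One way to make the differing service levels concrete is a per-workload SLA table. The tiers below are illustrative assumptions matching the slide’s contrast (light/medium/high governance), not figures from the survey.

```python
# Illustrative per-workload SLA table; values are assumptions, not survey data.
SLAS = {
    "data_rd": {
        "availability": "best effort",
        "quality": "raw acceptable",
        "governance": "light",
    },
    "data_lake": {
        "availability": "scheduled",
        "quality": "lightly refined",
        "governance": "medium",
    },
    "data_products": {
        "availability": "24x7",
        "quality": "cleansed + metadata",
        "governance": "high",
    },
}

def governance(workload):
    """Look up the governance tier promised for a workload."""
    return SLAS[workload]["governance"]
```

Writing the tiers down like this, even informally, is what turns “different SLAs” from a slogan into something each subsystem’s owners can be held to.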
The raw data lake holds all data (including non-record-oriented data) forever (there is some hype in those words). But it does mean we need an extremely low cost of storage and access. If we can reduce costs 10X, we can store 10X more data. A subset of the raw data is transformed by data scientists into data products. More often, data is refined and promoted to the data products area.
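The 10X claim is simple budget arithmetic, sketched here with made-up numbers: at a fixed budget, retainable volume scales inversely with cost per terabyte.

```python
# Back-of-envelope for the cost/retention trade-off; figures are illustrative.
def retainable_tb(budget_usd, cost_per_tb_usd):
    """Terabytes you can keep at a fixed annual budget."""
    return budget_usd / cost_per_tb_usd

before = retainable_tb(100_000, 1_000)  # hypothetical legacy cost per TB
after = retainable_tb(100_000, 100)     # hypothetical low-cost-storage price
```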
High business value density means that every record stored has value to some sector of the user population. Low density means there is a lot of noise in the data that must be sifted and discarded to find the valuable data. In this logical abstraction, we must separate the tools and people from the logical workload.
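“Value density” can be read literally as a ratio: valuable records over total records. In this sketch the classifier is a trivial stand-in (in practice, deciding what counts as valuable is exactly the analytic work the slide describes); the log lines are invented.

```python
# Value density as a ratio; the classifier and records are hypothetical.
def value_density(records, is_valuable):
    """Fraction of records some user population actually cares about."""
    hits = sum(1 for record in records if is_valuable(record))
    return hits / len(records)

logs = [
    "ERROR checkout failed",
    "DEBUG heartbeat",
    "DEBUG heartbeat",
    "ERROR payment declined",
    "DEBUG heartbeat",
]

# Machine logs: lots of noise, a few valuable signals -> low density.
density = value_density(logs, lambda record: record.startswith("ERROR"))
```

A data-products table would score near 1.0 on this measure; raw machine logs score far lower, which is why they belong in the cheap tier.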
This illustrates the fundamental processing in iconic form.
The research showed that 20% of respondents have an initiative underway and 35% have an approved budget for a Data Lake initiative.
A large majority of respondents in the survey – 70% – are using Hadoop for data discovery, data science, and Big Data projects.
ITAMAR: Survey results show that as adoption of the Data Lake continues, respondents are most concerned about governance, metadata management issues, and security. Additional areas of continued concern include availability of skills and data ingest.
It’s interesting to note that when it comes to governance, 62% of respondents said that governance was a “must have” from the beginning, while 31% said that governance can be added incrementally.
When it comes to security, 42% said that a Data Lake strategy cannot be started without a security framework in place. Another 45% said that the security framework must be consistent with, or even more robust than, other IT database security policies.
With that – about half of the respondents also highlighted the lack of skills as well as data ingest as key challenges.
The survey showed us that companies are using Data Lakes today, but that there are still obstacles to achieving their goals.
We invite you to get your own copy of the research.