Navigating User Data Management and Discovery in Hadoop

NAVIGATING THE WORLD OF
USER DATA MANAGEMENT AND DATA
DISCOVERY
SMITI SHARMA, VIRTUSTREAM - EMC

About Me
2
Principal Engineer & Lead,
Big Data Cloud, Virtustream
 Oracle – Principal Engineer & PM team
 EMC – Big Data Lead
 Pivotal – Global CTO
 Virtustream EMC
Areas of expertise
 Architecting, developing and managing Mission Critical
Transactional and Analytics Platforms
 RDBMS & NoSQL – Hadoop Platforms
 Product Management and Development
 ODPi RT member
Sessions at Hadoop Summit 2016
 Wed: 11:30am Room 211: User Data Management in HD
 Wed: 4:10pm Room 212: Building PAAS for Smart cities
Smiti.sharma@virtustream.com
@smiti_sharma

About Virtustream
 Enterprise-Class Cloud Solutions
− Cloud IAAS, PAAS
− Cloud Software Solutions
− Cloud Managed Services
− Cloud Professional Services
 Developer of xStream Cloud Management Platform SW
 Inventor of the µVMTM (MicroVM) Cloud Technology
 Industry leading Cloud offers in areas of
− SAP Landscape, HANA
− Storage
− Big Data
 Close partnerships with SAP, EMC, VMWare
 Service Provider to 2,000 Global Workloads
Global Footprint

 Data Management Overview
 Project Background & Context
 High level Architecture and data flow
 Solution criteria
 Evaluation criteria of Data Management Tools
 Differentiation factors
 Proposed Solution
 Conclusions
Agenda
4

User Data Landscape
Master Data Management
Metadata
- Business
- Technical
- Operational
Reference DataTransactional data
6
Reference DataTransactional data
Metadata
- Business
- Technical
- Operational

Driving Factors for Data Management
• IT custodian of business data
• Data Characteristics
– Business Value
• Analytical vs. transactional systems
– Volume and Volatility
– Complexity of Data type and formats
– Adaptive feedback from IT to Business
– Reusability factor – across different teams
– De-duplication factor
7

MDM is an organizational approach to
managing data as a corporate asset
8
What is Master Data Management
Application framework and collection of
tools implemented with the business
goal of providing, managing & using
data intelligently and productively

Multi-Domain & Organizational MDM
Metadata Data
Reference
Data
Transactional
Data
9
Domain 1: Product
Master Data Metadata Data
Reference
Data
Transactional
Data
Domain 3: Logistics
Master Data
Metadata Data
Reference
Data
Transactional
Data
Domain 2: Supply Chain
Master Data

Universal Ideology..somewhat..
10

11
Data Management for Hadoop: Why, At what stage
7%
4%
9%
Today
In three years
Growth
Indicator
Fig 2: Source TDWI
Fig 1: Source TDWI
Primary Strategy to
improve Quality of
Data managed in Hadoop

• Uni-directional movement of data
• Static and limited identification patterns
• Focused mainly on Transactional systems – data type/Hadoop OSS
integration limited
• Non-adaptive solutions to rapidly changing “schema”
• Limiting performance
12
Traditional MDM Challenges

Building “Data Management
Layer”
for Hadoop Data Lake
13

Project(s) and Context
• Project initiated at two Large Retailers
• Goal to extend the analytical Data Lake
– As of Late 2015 Data Lake built only for Analytics
– Pulls data from Transactional, ERP, POS systems
– Implemented using ODPi (Pivotal/Hortonworks) Distribution and Greenplum for MPPDB
• Next Generation Data Lake
– Current ETL system reaching performance and scale limits  Move ETL in Hadoop
– Move BIDW and Transactional reporting to Hadoop
– Increase users on this system – Security and Quality constraints
– In-store SKU count ~ 500 Million ; Online SKU count ~ 5 Million
• Complex Master Data Management around existing systems
– For Hadoop – the EIM integration didn’t exist and/or processes were not in place
– Little to no interest from EIM data integration team
14

Key Problem Statement (at least 1 of them!)
Evaluate and Prototype the
Data Management Strategy and Product(s)
to enhance and enrich the
“Next Generation Data Lake”
15

High Level Logical Data Architecture
Metadata repository, Policy
management and Business
Rules Engine
Enterprise Security
Framework
(AD/LDAP)
Query/Accessand
VisualizationLayer
• API to access data sources
• Interfaces with Metadata
Repository to define data query
path
• Potentially Custom Portals for
User queries as well as standard
tools
Access
DataSources
I
n
g
e
s
t
Data Sources
(Raw)/Aggregated*
Inventory
data
Logistics
Product/
Vendor Data
Data Fabric/“Landing Zone”
Processing Framework
In-
Memory
Process-
ing
Object
store
HDFS
MPP DB RDBMS
NoSQL/
NewSQL
Data Ingest to Persistence or
memory layer
Federated query
Ingest to Metadata
management Layer
Cross reference for Rules,
policies and Metadata
LEGEND
16
Metadata Management
Ingestion and Indexing
Data Management

Solution Requirements
• Inherent Data processing requirements
• Incoming data from sources e.g. Kafka, Storm, Sqoop, Spark
• Be able to manage complex data types e.g. Video files from POS
• Data placement based on priority and sensitivity – memory or disk
• Handling both Synchronous and Async (In-band and out-of-band)
• Integration with existing EIM tools
• Performance requirements
• Increasing ingest volume of data and expanding sources
• Varied Data Type support and considerations
17

File format type Embedded
Metadata
Compression&
Splitable
HQL/SQL interface
viability
Popularity in current
and new landscape
Support for
Schema
evolution
18
CSV/Text No No^ Hive/Hawq Most common Limited
Avro Yes Yes Hive Increasing footprint Yes
JSON Yes No^ Hive/ MongoDB Increasing footprint Yes
RC Files Limited Not as well Hive (RW) Yes No
ORC Files No Yes Hive (RW) Impala ( R) Yes No
Sequence Files (binary
format)
No Yes Hive (RW) Impala ( R) None today Limited
Parquet Yes Yes Yes – Hive and impala Increasing footprint Limited
• Read/Write performance
• Source, Application and development effort and support
• Hierarchical model
File Format Considerations

File format type Embedded
Metadata
Compression&
Splitable
HQL/SQL interface
viability
Popularity in current
and new landscape
Support for
Schema
evolution
19
CSV/Text No No^ Hive/Hawq Most common Limited
Avro Yes Yes Hive Increasing footprint Yes
JSON Yes No^ Hive/ MongoDB Increasing footprint Yes
RC Files Limited Not as well Hive (RW) Yes No
ORC Files No Yes Hive (RW) Impala ( R) Yes No
Sequence Files (binary
format)
No Yes Hive (RW) Impala ( R) None today Limited
Parquet Yes Yes Yes – Hive and impala Increasing footprint Limited
• Read/Write performance
• Source, Application and development effort and support
• Hierarchical model
File Format Considerations

Key Evaluation and Selection
Criteria
20

Initial Challenge
• Too many tools to choose from
• Each claimed to be Metadata management tool
• Each claimed security and integration features
• Resistance from the EIM team when initially involved
• Translating Data Management Ideology to tasks of
evaluation
21

Project Approach
• Build a list of KPI to evaluate tools
• Working with EIM team (best practices advise & SME engagement),
business and IT team support Data lake project
• Vendor Identification – List of 5
• Implementation
• Minimized scope of project
• Decided to tackle integration with legacy EIM at a later date
• After Evaluation, focused on implementing no-more than 2 Data
management tools for Next-Gen Data Fabric Platform
22

• Define Business Metadata (Is reference data available within tool or outside)
• Automation and flexibility in crawling the HDFS and understand the various format
– Range of File formats supported
– Reading each file to extract metadata
– Both for data persisted already and incoming new files in real-time
– Cross reference with lookup or repository for pre-existing classes and profiles
– Maturity of attaching context or facet to the atomic data
– Ability to retrieve descriptive and Structural Metadata even with no Metadata within the content
• Storing the profiled data – actual data and metadata in a repository
• Custom Tagging as well as recognizing Metadata information
• Translation and integration with industry certification and models
23
Metadata Curation and Management (1/2)
Data Profiling

• Ability to classify data – based on user defined categories
– Search/Crawl and identification "Facet Finder” and efficiency of internal repository
– Presence of Data Models if any
– Features around custom Metadata and Tagging
• Once classified - ability for Metadata information to be indexed, and searchable thru
API or Web Interfaces
– Efficiency of Search and indexing
– Richness of Integration with NLPTK
• Data Re-mediation
• Data Archiving and policy implementation
• Notification: Configurable triggers – based on user-defined criteria
24
Metadata Curation and Management (2/2)
Data Classification

Lineage and Versioning
• Be able to identify the origin of data – i.e. from
– Transactional systems, Dump files, Another HDFS file, Repository etc.
– Level of depth of data origination and lineage
• Ability of the solution to sense and preserve Metadata Versions around a given entity
during Capture process and post
• Ability to support Deduplication with the Entity’s metadata
– On the fly without impacting the performance
25

Integration
• Ability to integrate its meta store with enterprise MDM /EIM systems
– Maturity of Metadata Entity Readers (Input/Output) Artifacts from Metastore
– Bi-directional API for other tool integration to identify lineage
– Bi-directional API for other tool integration for SIEM threat assessment and detection
– While maintaining user and security context
• Integration with the various tools of Ingestion, Transformation & Consumption
– Spark, Storm, Kafka, Informatica, Data Stage etc.
• Integration with security tools – LDAP, ACLs, encryption
• Rules and Policy engine
26

Performance, Accuracy and Ease of Use
• Sample visualization of Metadata with Native Reporting tools & others
• Ability to process compressed and encrypted files
• Level of Error and exception handling built in during all processes
• Impact on performance from
– Crawling, scanning and profiling
– Classification & transformation
• Enable notifications of data availability - how customizable are they?
• Self-service discovery portal leveraging curated artifacts
27

Some of the notable Vendors evaluated
• Attivio
• Global ID
• Waterline Data
• Zaloni
• Adaptive Inc.
28
At the time of this study, Falcon and Ranger were new. Little analysis on these products was done

Vendor Evaluation Scoreboard (Template)
29

Vendor Evaluation Summary Results

31
Metadata
curation and
management
Lineage and
versioning
Integration Performance,
Accuracy and
Ease of use
Attivio
Global ID
Waterline Data
Zaloni
Global ID
Attivio
Waterline Data
Zaloni
Zaloni
Attivio
Global ID
Waterline Data
Attivio
Zaloni
Waterline Data
Global ID
• All tools had satisfactory features overall with emphasis in 1 or 2 areas.
• Your choice of tools needs to align with Business and User Requirements
• Waterline: Automated data discovery, self- service
• Attivio: Data Curation – Discovery, Search, flexibility of tagging, performant and scalable
• Global ID: Efficient in Mapping logical models, overlapping data identification and pattern matching
• Zaloni: Had notable interface for Data mapping and flow, integration with external tools
Evaluation Summary CAVEAT: Based on
criterion driven by
customer needs. You
eval and updates from
vendor will affect
results

Rules Engine
Enterprise Security
Framework
(AD/LDAP)
Query/Accessand
VisualizationLayer
path
tools
Access
DataSources
I
n
g
e
s
t
Data Sources
(Raw)/Aggregated*
Inventory
data
Logistics
Product/
Vendor Data
In-
Memory
Process-
ing
Object
store
HDFS
MPP DB RDBMS
NoSQL/
NewSQL
32
Metadata Management
Ingestion and Indexing
Data Management
memory layer
Federated query
Ingest to Metadata
management Layer
LEGEND

Rules Engine
Enterprise Security
Framework
(AD/LDAP)
CustomPortal/other
evaluations(TBD)
path
tools
Access
DataSources
Flume/Kafka/SpringXD
Data Sources
(Raw)/Aggregated*
Inventory
data
Logistics
Product/
Vendor Data
Apache
Spark/
GemFir
e
Object
store
HDFS
MPP DB RDBMS
NoSQL/
NewSQL
33
Metadata Management
Attivio
Global ID
memory layer
Federated query
Ingest to Metadata
management Layer
LEGEND

 Market
− Metadata Mgmt tools in market are still evolving for Data Lake architectures
− Ever growing and Rich Partner ecosystem
− Hadoop does not offer a sufficient policy engine or action framework
 Customer
− Choice of tool is IT and business driven. Sponsorship important !
− To drive adoption – ease of use and intuitive product a must
− Balancing Multi-vendor and functionality: Limit number of tools to 3
− Recommendation to use Information management Professional Services with selected tool (s)
Key Takeaways

PROCESS
 Evaluation of the tools
− Reviews and demo of the tools versus a full-fledged POC
− Build an adaptive matrix of KPI measurements, customized to your organization - Unless quantified
evaluation would be very subjective
 Beware of the Trap- Analysis - Paralysis
− Multiple business units drive this decision
− Functionality scope - workflows, ETL processes and integration or pure-play data management
− Integration with existing EIM tools was delayed as a priority: Huge part of the success
 Investment/Cost: Existing tools, Level of Effort and implementation
Key Takeaways

References
• References to the following documents were made
– TDWI- Hadoop for enterprise
– MDM institute
• Acknowledgements from the following authors and additional work
– EMC IT Team
– Customer’s IT team for Prototyping along with EMC Field resources
37

Navigating User Data Management and Discovery in Hadoop

Navigating User Data Management and Discovery in Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Navigating User Data Management and Discovery in Hadoop

Similar to Navigating User Data Management and Discovery in Hadoop (20)

More from DataWorks Summit/Hadoop Summit

More from DataWorks Summit/Hadoop Summit (20)

Recently uploaded

Recently uploaded (20)

Navigating User Data Management and Discovery in Hadoop