Best Practices to Deploy a Governed Data Lake
1. © 2015 MapR Technologies 1
Deploying a Governed Data Lake
2. © 2015 MapR Technologies 2
Welcome
• Event will be recorded
• Ask your questions in the Q&A Panel in the lower right-hand
corner of your screen
• Tweet us @mapr during the event
3. © 2015 MapR Technologies 3
Key Points
• The data lake is becoming a “real-time” shared service to provide
data to the business to support data science and big data
analytics needs
• As the data lake becomes a trusted source of data to drive big
data analytics, security and data governance have to be
addressed
• Security and data governance policies need to be implemented
in a way that still enables self-service and quick time to value,
rather than creating 3-6 month delays
4. © 2015 MapR Technologies 4
Deliver Data Discovery Agility with a Governed “Data Layer”
• Adhere to security, compliance and data governance policies
• Catalog data assets at scale, with secure provisioning to the business
• Find and understand best-suited and most trusted data
5. © 2015 MapR Technologies 5
The danger of the data lake becoming a flea market
Botond Horvath / Shutterstock.com
• Big Data Architect: Can’t create and maintain an inventory fast enough
• Data Engineer/Data Scientist/Business Analyst: Can’t explore everything to find the best item
• CDO/Data Steward: Can’t tell what’s what and what can be trusted
6. © 2015 MapR Technologies 6
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
7. © 2015 MapR Technologies 7
Governed data lake is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
8. © 2015 MapR Technologies 8
The Governed Data Lake on Apache Hadoop
Sources
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Sensors
• Blogs, tweets, link data
Analytics
• Search
• Schema-less data exploration
• BI, reporting
• Ad-hoc integrated analytics
Operational Apps
• Recommendation
• Fraud Detection
• Logistics
MapR Data Platform: MapR-DB, MapR-FS
Distribution including Apache Hadoop
Data Inventory: Find, understand and govern
9. © 2015 MapR Technologies 9
The Governed Data Lake
Define, Ingest, Inventory, Explore, Provision, Wrangle/Model/Visualize
• Critical data elements
• Sensitive data elements
• Security and data
governance policies
• Load
• Profile
• Automatic tagging
• Discover metadata
and generate tags
• Discover data lineage
• Manage tags
• Browse/search
inventory
• Inspect data quality
• Tag and annotate
• Bookmark
• Copy
• Authorized view
Governed data lake as a shared service
Data Governance Data Discovery Agility
Data protection, authentication, authorization, auditing
Can you achieve both?
10. © 2015 MapR Technologies 10
Find, understand and govern data in Hadoop
11. © 2015 MapR Technologies 11
Waterline Data is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
14. © 2015 MapR Technologies 14
Provision
Future: Generate Drill Views
16. © 2015 MapR Technologies 16
The Governed Data Lake on Apache Hadoop with MapR
Sources
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Sensors
• Blogs, tweets, link data
Analytics
• Search
• Schema-less data exploration
• BI, reporting
• Ad-hoc integrated analytics
Operational Apps
• Recommendation
• Fraud Detection
• Logistics
MapR Data Platform: MapR-DB, MapR-FS
Distribution including Apache Hadoop
Data Inventory: Find, understand and govern
17. © 2015 MapR Technologies 17
Separate Distinct Data Sets via MapR Volumes
Volumes dramatically simplify management:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
Example volume layout:
/projects
    /tahoe
    /yosemite
/user
    /msmith
    /bjohnson
18. © 2015 MapR Technologies 18
MapR Trust Model (Product Security)
1. Flexible Authentication
• Wire-level authentication for all services in the cluster
• NSA-level cryptographic algorithms
• Integration with LDAP, Active Directory and other third-party directory services
• Kerberos or username/password authentication
2. Granular Authorization
• Access Control Expressions
• Protect files, tables, column families, columns, and management objects
• Extend to role-based access control (RBAC) with custom role functions
• Drill Views
3. Robust Auditing
• All events recorded immediately in JSON log files
• Includes data access and administrative actions
• Ad-hoc queries and custom reports on audit logs via SQL and standard BI tools
4. Ubiquitous Data Protection
• Encryption for Data in Motion
    • Within a Cluster
    • Between Clusters
    • Between Client and Cluster
• Encryption for Data at Rest
    • LUKS
    • Self-Encrypting Disk
    • Partners
19. © 2015 MapR Technologies 19
MapR Comprehensive Auditing
Serving Security Analysts…
Monitoring
Incident Response
• Who touched customer records outside of
business hours?
• What actions did users take in the days
before leaving the company?
• What operations were performed without
following change control?
• Are users accessing sensitive files from
protected/secured source IPs?
• Why do my reports look different, despite
sourcing from same underlying data?
Security
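The first question above (who touched customer records outside of business hours) can be sketched in Python. This is a minimal illustration, not MapR tooling: the records are hypothetical, the `READ` operation name and the `/customers/` path are made up for the example, the timestamp is simplified to a plain ISO-8601 string, and business hours are assumed to be 09:00 to 17:00.

```python
from datetime import datetime

# Hypothetical audit records. Field names follow the JSON audit sample in this
# deck; values, paths and the "READ" operation name are assumptions.
RECORDS = [
    {"timestamp": "2015-06-01T05:24:58", "user": "root", "operation": "GETATTR", "srcPath": "/customers/a.csv"},
    {"timestamp": "2015-06-01T14:02:10", "user": "jane", "operation": "READ", "srcPath": "/customers/a.csv"},
    {"timestamp": "2015-06-01T22:40:00", "user": "jack", "operation": "READ", "srcPath": "/customers/b.csv"},
    {"timestamp": "2015-06-01T23:10:00", "user": "jack", "operation": "READ", "srcPath": "/logs/web.log"},
]


def off_hours_access(records, path_prefix, start_hour=9, end_hour=17):
    """Return (user, srcPath, timestamp) for accesses under path_prefix
    whose hour falls outside [start_hour, end_hour)."""
    hits = []
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        in_business_hours = start_hour <= ts.hour < end_hour
        if rec["srcPath"].startswith(path_prefix) and not in_business_hours:
            hits.append((rec["user"], rec["srcPath"], rec["timestamp"]))
    return hits


if __name__ == "__main__":
    for hit in off_hours_access(RECORDS, "/customers/"):
        print(hit)
```

The same filter could be expressed as a SQL query over the audit logs, which is the route the deck itself describes.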
20. © 2015 MapR Technologies 20
MapR Comprehensive Auditing (cont.)
…And Data Scientists Too
• Which data is used most frequently? Implication: High Value; Share More Broadly
• Which data is least commonly used? Implication: Low Value; Candidate for Purge
• Which data should be used more? Implication: Underutilized; Increase Awareness
• What administrative actions are most commonly performed? Implication: Candidate for automation
Predictive Analytics
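A rough Python sketch of the usage questions above: count accesses per path from audit records and rank them, including cataloged paths that never appear in the log. The records, file names and the `usage_report` helper are hypothetical; only the `srcPath` field name comes from the audit sample in this deck.

```python
from collections import Counter

# Hypothetical audit records (only the fields needed here).
AUDIT_RECORDS = [
    {"user": "jane", "srcPath": "/projects/tahoe/sales.csv"},
    {"user": "jack", "srcPath": "/projects/tahoe/sales.csv"},
    {"user": "jane", "srcPath": "/projects/tahoe/sales.csv"},
    {"user": "john", "srcPath": "/projects/yosemite/clicks.json"},
]

# Hypothetical inventory of known data sets.
CATALOG = [
    "/projects/tahoe/sales.csv",
    "/projects/yosemite/clicks.json",
    "/projects/yosemite/archive_2012.csv",
]


def usage_report(records, catalog):
    """Return (path, access_count) pairs, most used first.

    Paths in the catalog that never show up in the audit records get a
    count of 0, so they surface at the bottom as purge candidates.
    """
    counts = Counter(rec["srcPath"] for rec in records)
    usage = {path: counts.get(path, 0) for path in catalog}
    return sorted(usage.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for path, count in usage_report(AUDIT_RECORDS, CATALOG):
        print(count, path)
```

The top of the ranking corresponds to "high value; share more broadly" and the bottom to "candidate for purge".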
21. © 2015 MapR Technologies 21
MapR Audits – Key Features
Data Access
• Files
• MapR-DB Tables
Cluster Operations
• Administrative Operations
• Maprcli commands
Authentication Requests
Secure
High Performance
Flexible
• Retention Period
• Maxsize
• Coalesce Interval
JSON Format
{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR","user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x","srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files","volumeId":"mktg_files","status":"0"}
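A minimal Python sketch of reading one such record, assuming each audit log line is a single JSON object shaped like the sample on this slide. The helper name `parse_audit_line` and the regular expression are illustrative and not part of any MapR tooling, and the timestamp handling assumes the `{$date=...}` string form shown here.

```python
import json
import re
from datetime import datetime, timezone

# The sample record from this slide, joined onto one line.
SAMPLE = (
    '{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR",'
    '"user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x",'
    '"srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files",'
    '"volumeId":"mktg_files","status":"0"}'
)

# Pulls the ISO timestamp out of the "{$date=...Z}" wrapper (assumed form).
DATE_RE = re.compile(r"\$date=([0-9T:.\-]+)Z")


def parse_audit_line(line):
    """Parse one audit line and convert its timestamp to a UTC datetime."""
    record = json.loads(line)
    match = DATE_RE.search(record["timestamp"])
    if match is None:
        raise ValueError("unrecognized timestamp: %r" % record["timestamp"])
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    record["timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return record


if __name__ == "__main__":
    rec = parse_audit_line(SAMPLE)
    print(rec["user"], rec["operation"], rec["srcPath"], rec["timestamp"].isoformat())
```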
22. © 2015 MapR Technologies 22
Access Control that Scales
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Diagram: users query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
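Drill views are defined in SQL. The Python below only mimics their effect, to show what row and column level restriction means in practice. The rows mirror the sample card data this deck uses for ownership chaining; the function names and the per-state view are hypothetical.

```python
# Hypothetical rows standing in for a raw file.
RAW_ROWS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def v_scientist(rows):
    """Column-level masking: keep the first four digits, overwrite the rest."""
    return [{**row, "card": row["card"][:4] + "-1111-1111-1111"} for row in rows]


def v_analyst(rows):
    """Column-level restriction: built on the masked view, drops the card column."""
    return [{k: v for k, v in row.items() if k != "card"} for row in v_scientist(rows)]


def v_state(rows, state):
    """Row-level restriction: expose only the rows for one state."""
    return [row for row in v_analyst(rows) if row["state"] == state]


if __name__ == "__main__":
    print(v_scientist(RAW_ROWS))
    print(v_analyst(RAW_ROWS))
    print(v_state(RAW_ROWS, "CO"))
```

Because each view is layered on the one beneath it, a user granted only the narrowest view never sees the raw values.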
23. © 2015 MapR Technologies 23
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv)
Name   City       State   Credit Card #
Dave   San Jose   CA      1374-7914-3865-4817
John   Boulder    CO      1374-9735-1794-9711

Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Name   City       State   Credit Card #
Dave   San Jose   CA      1374-1111-1111-1111
John   Boulder    CO      1374-1111-1111-1111

Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name   City       State
Dave   San Jose   CA
John   Boulder    CO
Jack queries the view V_Analyst. Access path and ownership chaining (V_Analyst -> V_Scientist -> raw file):
• Does Jack have access to V_Analyst? -> YES
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (impersonation hop 2)
• Does John have permissions on raw file? -> YES
• Who is the owner of raw file? -> John
• Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
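The access walk on this slide can be simulated in a few lines of Python. This is a sketch of the idea only, not Drill's implementation: the `Resource` class, the `CATALOG` dictionary and the `resolve` function are hypothetical, and the hop limit is passed as a plain argument, not read from any real configuration option.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Resource:
    name: str
    owner: str
    readers: Set[str]              # users with read access, besides the owner
    source: Optional[str] = None   # what a view is built on; None for a raw file


# The three objects from the slide.
CATALOG = {
    "V_Analyst": Resource("V_Analyst", "Jane", {"Jack"}, "V_Scientist"),
    "V_Scientist": Resource("V_Scientist", "John", {"Jane"}, "/raw/cards.csv"),
    "/raw/cards.csv": Resource("/raw/cards.csv", "John", set(), None),
}


def resolve(user, name, catalog, max_hops):
    """Walk from a view down to the raw file.

    Returns (path, hops): path lists (resource, user whose permission was
    checked) and hops counts identity switches to a view owner.
    Raises PermissionError on a failed check or when max_hops is exceeded.
    """
    path, hops, acting, current = [], 0, user, name
    while current is not None:
        res = catalog[current]
        if acting != res.owner and acting not in res.readers:
            raise PermissionError("%s cannot read %s" % (acting, res.name))
        path.append((res.name, acting))
        if res.source is not None:
            if acting != res.owner:
                # Continuing as the view owner is one impersonation hop.
                hops += 1
                if hops > max_hops:
                    raise PermissionError("ownership chain exceeds %d hops" % max_hops)
            acting = res.owner
        current = res.source
    return path, hops


if __name__ == "__main__":
    print(resolve("Jack", "V_Analyst", CATALOG, max_hops=2))
```

With a limit of 2 the query succeeds in two hops, matching the slide. Lowering the limit to 1 makes the same query fail at V_Scientist, which is why the configurable chain length matters.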
24. © 2015 MapR Technologies 24
Find, Understand and Govern Data in Hadoop
At Scale and in Real-Time
• CDO/Data Steward: Discover and protect sensitive data, audit and authorize access to the data lake, discover data lineage, and provide data stewardship
• Big Data Architect: Automate cataloging of data assets at scale, with secure provisioning to business users
• Data Engineer/Data Scientist/Business Analyst: Find and understand best-suited and most trusted data without having to explore every file manually
25. © 2015 MapR Technologies 25
Learn More
www.waterlinedata.com
• Watch the solution video
• Read analyst papers
• Download the free Waterline Data / MapR sandbox
• Request a demo
• Download and evaluate the product
www.mapr.com
• Get free On-Demand Training for Hadoop
• Download the free Waterline Data / MapR sandbox