The European General Data Protection Regulation (GDPR) will come into effect in May 2018 and it will impact all organizations that store or process personal data of EU citizens. The European Commission is exporting European data protection principles to the rest of the world while widening the definition of personal data and enforcing privacy by design. These changes will not only have an impact on the organizations but also on the software which is used for data processing. How does it affect the Hadoop ecosystem?
Distributed data processing at scale is one of Hadoop’s core features and we will explore how the GDPR could potentially affect it. We will also take a look at the technical aspects of the rights of data subjects and see if and how we can address those, with a particular focus on open-source technologies.
This talk will give you an overview of the key themes of the GDPR including the rights of the data subject and will investigate the technical implications for data processing within the Hadoop ecosystem.
2. 2
• GDPR Overview
• Rights of the data subject
• Challenges within Hadoop ecosystem
• Technical considerations
Agenda
3. 3
• Complex and detailed topic
• This is NOT legal advice
• There are many opinions and interpretations of GDPR
• This talk does not cover all aspects of GDPR
• Process matters, documentation is your friend
Disclaimer
Take it with a grain of salt
4. 4
“Regulation (EU) 2016/679 of the European Parliament [...] on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)”
• Establishes data protection as a fundamental right
• Creates unified data protection law for all EU member states
• Enables EU citizens to be in control of their personal data
General Data Protection Regulation
GDP what?
- Official title of the GDPR, http://eur-lex.europa.eu/eli/reg/2016/679/oj
5. 5
• Applies if the data controller or processor (organization) or the data subject (person) is based in the EU
• Applies to organizations based outside the European Union if they process or monitor personal data of individuals in the EU
• Employees might be EU citizens as well
General Data Protection Regulation
Who is affected?
6. 6
• Officially published on May 4th 2016
• Applicable from May 25th 2018 across the EU (including UK)
• “Regulation” instead of “Directive” → no need for national implementing legislation, directly applicable in all EU member states
• To be evaluated and reviewed by May 25th 2020
General Data Protection Regulation
When does it happen?
7. 7
• Better data protection and portability for consumers
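• Fines for non-compliance will be
– up to €10M or 2% of worldwide annual turnover, whichever is higher, for less severe violations
– up to €20M or 4% of worldwide annual turnover, whichever is higher, for more severe violations
• Any individual has the right to lodge a complaint about any organisation with a supervisory authority (Art. 77)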
General Data Protection Regulation
Why should I care?
8. 8
Privacy by design
Better data protection, you said?
• Privacy by design and by default, essential data protection
• Breach notification within 72 hours
• Data minimization and access limitation
• Data Protection Officer (DPO) and Data Protection Impact Assessments (DPIAs)
• Active, specific and unambiguous consent
“the controller shall [...] implement appropriate technical and organisational measures [...] in an effective manner [...] in order to meet the requirements of this Regulation and protect the rights of data subjects.” - Article 25, GDPR
10. 10
Personal data (examples)
It all depends on context
• Location or web surfing data
• Video surveillance and images
• Personal interests or behavioural patterns
• A child's drawing depicting its family
• Publication of x-ray plates together with the patient's first name
• Damage caused by graffiti in public transportation
• X1234 drinks a glass of wine more than 3 times a week, drives a Bentley and has a Windows 10 phone
11. 11
Source: Facebook
• Right of access and data portability
– free of charge
– structured, commonly used and machine readable
• Right to erasure
– “without undue delay”
• Right to object, to restrict, to rectify, ...
Data citizen rights
Rights of the data subject
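As a minimal sketch of what a "structured, commonly used and machine readable" export could look like: the snippet below assumes records are plain dictionaries keyed by a `userId` field. The function name `export_subject_data` and the second sample record are illustrative assumptions, not part of the talk.

```python
import json


def export_subject_data(records, user_id):
    """Collect all records of one data subject and serialize them as JSON."""
    matching = [r for r in records if r.get("userId") == user_id]
    # sort_keys keeps the output stable, indent keeps it human readable
    return json.dumps({"userId": user_id, "records": matching},
                      sort_keys=True, indent=2)


# Sample data; the second record is made up for illustration only.
records = [
    {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"},
    {"userId": 456, "firstName": "Alex", "dateOfBirth": "1990-05-17"},
]

export = export_subject_data(records, 123)
print(export)
```

In a real cluster the hard part is not the serialization but finding every copy of the subject's data across ingestion paths and derived datasets.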
14. 14
Data processing on Hadoop
Bird’s eye view
• Various data sources and ingestion tools
• Diverse input formats, structured & unstructured
• Diverse processing tools
• Liberal data access, local data science
• Write-append and immutable data structures
• Redundant data
Ingest → Process → Access
16. 16
Challenges by example
Ingest table from RDBMS
[Figure: daily import (e.g. via sqoop) of the record below; a copy of it exists in each daily snapshot (today, -1 day, -2 days). Diagram labels: "Smaller Data", "Big Data".]
“userId”: 123
“firstName”: “Janosch”
“dateOfBirth”: “1984-01-01”
17. 17
Problems & Solution approaches
Problems
• Right to be forgotten
• Access limitation
• Bound to consent
• ...
Solution approaches
• Anonymization
• Hashing
• Encryption
• ...
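The hashing approach can be sketched with a keyed hash (HMAC) from the Python standard library. This is an illustration under assumptions: the `pseudonymize` helper and the `SECRET` value are hypothetical, and a real deployment would load the secret from a dedicated secret store. Note that keyed hashing yields pseudonymized, not anonymized, data, since whoever holds the secret can recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a personal identifier."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret for the demo; never hard-code one in practice.
SECRET = b"demo-secret-keep-outside-the-cluster"

record = {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"}

# Replace the direct identifier before handing the record to further processing.
masked = dict(record, firstName=pseudonymize(record["firstName"], SECRET))
print(masked)
```

The same input always maps to the same digest, so joins and counts on the masked column still work, while a plain unkeyed hash of a first name would be easy to reverse by guessing.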
18. 18
Challenges by example
Encrypt, a.k.a. Lost Key Pattern
[Figure: daily import (e.g. via sqoop) with snapshots for today, -1 day and -2 days, and a key labelled 123. The record appears in plain text and in encrypted form.]
Plain record:
“userId”: 123
“firstName”: “Janosch”
“dateOfBirth”: “1984-01-01”
Encrypted record:
“userId”: 123
“firstName”: “54DCF13E4...”
“dateOfBirth”: “D3DFBCE...”
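The Lost Key Pattern can be sketched as follows: personal fields are encrypted with a per-user key before they reach the immutable snapshots, and "forgetting" a user means deleting that key. Everything here is an assumption for illustration: `KeyStore` is an in-memory stand-in for a key management service, and the SHAKE-256 keystream is a toy cipher used only to keep the example free of third-party dependencies. A real system would use a vetted authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets


class KeyStore:
    """In-memory stand-in for a key management service (hypothetical)."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create a random 256-bit key the first time a user is seen.
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        # Deleting the key is the whole "erasure" step of the pattern.
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    # Toy keystream for illustration only, NOT a vetted cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt(key, plaintext):
    data = plaintext.encode("utf-8")
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key, blob):
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


def read_first_name(snapshot, store):
    key = store.get(snapshot["userId"])
    if key is None:
        return None  # key was deleted, the field is unreadable
    return decrypt(key, snapshot["firstName"])


store = KeyStore()
snapshots = {}
for day in ("today", "-1 day", "-2 days"):
    # Each daily import writes another immutable, encrypted copy.
    snapshots[day] = {"userId": 123,
                      "firstName": encrypt(store.key_for(123), "Janosch")}

before = read_first_name(snapshots["-2 days"], store)
store.forget(123)
after = read_first_name(snapshots["-2 days"], store)
print(before, after)
```

The snapshots themselves are never rewritten, which is what makes the pattern attractive for write-append storage. Whether key deletion alone satisfies the right to erasure is a question of legal interpretation.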
19. 19
Challenges by example
Deletion in log based systems
[Figure: an edge device with deviceId 123 pushes data to a Kafka topic, for example:]
“deviceId”: 123
“lat”: 52.510781
“lon”: 13.371735
[The Kafka topic (offsets 2 to 6) holds five keyed messages: 123: B, 123: C, 123: D, 123: ∅ and 456: A. The consumer reads B, C, D, ∅.]
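The tombstone message (∅) on this slide corresponds to how deletion works in a log-compacted Kafka topic: a record with a null value marks its key for removal, and compaction later keeps only the latest record per key. The simulation below is a simplified model in plain Python, not Kafka client code. The order of the messages in `log` is an assumption, and real brokers retain tombstones for a configurable period before dropping them.

```python
def compact(log):
    """Simulate log compaction on a list of (key, value) records.

    Keeps only the latest record per key and drops keys whose latest
    value is a tombstone (None).
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]


# Keyed messages from the slide; None stands for the tombstone.
log = [(123, "B"), (123, "C"), (456, "A"), (123, "D"), (123, None)]

compacted = compact(log)
print(compacted)  # [(456, 'A')]
```

Compaction only cleans up the topic itself. A consumer that has already read B, C and D holds its own copies, so the deletion request has to be propagated downstream as well.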
20. 20
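Challenges by example
Encrypt on write
[Figure: an edge device with deviceId 123 pushes data to a Kafka topic, for example:]
“deviceId”: 123
“lat”: 52.510781
“lon”: 13.371735
[The Kafka topic (offsets 1 to 5) holds five encrypted keyed messages: 123: D4, 123: Z3, 123: 6H, 123: N7 and 456: T3. The consumer reads A, B, C, D. A further entry for 123 is marked "?".]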
21. 21
Vendor recommendations
Distributions to the rescue!
• Hortonworks - "GDPR: The Good, Bad and Ugly", Jun 20 2017
• Cloudera - "Simplify your response to GDPR", Aug 24 2017
• GDPR compliance via partner solutions
• Only partial answers
Source: Cloudera Inc.
23. 23
Data privacy and open source
Pragmatic considerations
• Secured cluster
• Raw data in encryption zones with very limited access
• Anonymize for further processing wherever possible
• Proper retention policies, batch delete requests and perform regular clean-ups
• Integrate with Atlas and Ranger → tagging, filtering and masking
• Custom solutions for glue and missing pieces
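The retention and batch-delete points can be sketched as two small helpers. This is a simplified model under assumptions: snapshots are identified by their partition date, records are dictionaries with a `userId` field, and both function names are hypothetical. Actual deletion of files or partitions would be done by whatever storage tooling the cluster uses.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, today, retention_days):
    """Return the snapshot dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


def apply_erasure_batch(records, user_ids_to_erase):
    """Drop all records of the users collected in one batch of requests."""
    erase = set(user_ids_to_erase)
    return [r for r in records if r["userId"] not in erase]


# Five daily snapshots ending on 2018-05-25, kept for two days.
partitions = [date(2018, 5, 25) - timedelta(days=n) for n in range(5)]
to_delete = expired_partitions(partitions, today=date(2018, 5, 25),
                               retention_days=2)

records = [{"userId": 123}, {"userId": 456}, {"userId": 123}]
remaining = apply_erasure_batch(records, [123])
print(to_delete, remaining)
```

Batching erasure requests means each immutable dataset is rewritten once per clean-up run rather than once per request, while the short retention window limits how many old copies need rewriting at all.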
24. 24
Summary
• No comprehensive open-source solution available
• Proprietary services target specific problem domains, integration is still necessary
• It will take some time until the legal dust has settled
• Idea: Avro (logical types) + Vault (or similar) + Ranger + Atlas?
The road ahead
26. 26
Hadoop Security Primer
In just one slide
• Authentication - Kerberos
• Authorization - Ranger, Sentry, ACLs
• Auditing / Monitoring - Ranger, Navigator, ...
• Encryption of data in motion - KMS, Navigator, ...
• Encryption of data at rest - Encryption zones, SEDs, ...
• Hadoop Security (Ben Spivey, Joey Echeverria)
• Hadoop and Kerberos: The Madness beyond the Gate
27. 27
Personal data
According to GDPR
“any information relating to an identified or identifiable natural person (‘data subject’);
An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”
- Article 4, GDPR