​GDPR and Hadoop
​The elephant in the room
​Janosch Woschitz
​2017-09-27
2
• GDPR Overview
• Rights of the data subject
• Challenges within Hadoop ecosystem
• Technical considerations
Agenda
3
• Complex and detailed topic
• This is NOT legal advice
• A lot of opinions and interpretations about
GDPR
• Talk is not covering all aspects of GDPR
• Process matters, documentation is your
friend
Disclaimer
Take it with a grain of salt
4
“Regulation (EU) 2016/679 of the European Parliament [...] on the protection of natural persons with
regard to the processing of personal data and on the free movement of such data, and repealing
Directive 95/46/EC (General Data Protection Regulation)”
• Establishes data protection as a fundamental right
• Creates unified data protection law for all EU member states
• Enables EU citizens to be in control of their personal data
General Data Protection Regulation
GDP what?
- Official title of the GDPR, http://eur-lex.europa.eu/eli/reg/2016/679/oj
5
• Applies if the data controller or processor (organization) or the data
subject (person) is based in the EU
• Applies to organizations based outside the European Union if they
process or monitor personal data of people in the EU (Art. 3 refers to
data subjects in the Union, not to citizenship)
• Employees might be EU citizens as well
General Data Protection Regulation
Who is affected?
6
• Officially published on May 4th 2016
• Applicable from May 25th 2018 across the EU (including UK)
• “Regulation” instead of “Directive” → no need for national
implementing legislation, directly applicable to all EU countries
• To be evaluated and reviewed by May 25th 2020
General Data Protection Regulation
When does it happen?
7
• Better data protection and portability for consumers
• Fines for non-compliance will be
– up to €10M or 2% of worldwide annual turnover, whichever is higher, for minor violations
– up to €20M or 4% of worldwide annual turnover, whichever is higher, for major violations
• Any individual has the right to raise a complaint against any
organisation (Art. 77)
General Data Protection Regulation
Why should I care?
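The two fine tiers can be read as a simple upper bound: under Art. 83(4) and (5) the cap for an undertaking is the higher of the fixed amount and the percentage of total worldwide annual turnover. The sketch below only illustrates that arithmetic; the function name and the turnover figures are made up for the example, and actual fines are set case by case by the supervisory authority.

```python
def max_fine_eur(annual_turnover_eur: int, major: bool) -> float:
    """Upper bound of a GDPR fine for one tier (illustrative only)."""
    # Tier values taken from the slide: 10M / 2% (minor), 20M / 4% (major).
    fixed_cap, percent = (20_000_000, 4) if major else (10_000_000, 2)
    # The cap is whichever of the two amounts is higher.
    return max(fixed_cap, annual_turnover_eur * percent / 100)


# Hypothetical companies: 2 billion and 100 million EUR annual turnover.
print(max_fine_eur(2_000_000_000, major=True))
print(max_fine_eur(100_000_000, major=False))
```

For the large company the percentage dominates; for the smaller one the fixed amount does.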
8
Privacy by design
Better data protection, you said?
• Privacy by design and by default, essential data protection
• Breach notification within 72 hours
• Data minimization and access limitation
• Data Protection Officer (DPO) and Data Protection Impact Assessments
(DPIAs)
• Active, specific and unambiguous consent
“the controller shall [...] implement appropriate technical and organisational measures [...] in an
effective manner [...] in order to meet the requirements of this Regulation and protect the rights of
data subjects.” - Article 25, GDPR
9
Personal data?
https://pixabay.com/en/family-drawing-children-cat-paper-879432/
10
Personal data (examples)
It all depends on context
• Location or web surfing data
• Video surveillance and images
• Personal interests or behavioural patterns
• A child's drawing depicting its family
• Publication of x-ray plates together with the patient's first name
• Damage caused by graffiti in public transportation
• X1234 drinks a glass of wine more than 3 times a week, drives a
Bentley and has a Windows 10 phone
11
Source: Facebook
• Right of access and data portability
– free of charge
– structured, commonly used and machine readable
• Right to erasure
– “without undue delay”
• Right to object, to restrict, to rectify, ...
Data citizen rights
Rights of the data subject
GDPR and Hadoop
13
Hadoop ecosystem & beyond
The known Hadoopverse (excerpt)
and much more ...
14
Data processing on Hadoop
Bird’s eye view
• Various data sources and ingestion tools
• Diverse input formats, structured & unstructured
• Diverse processing tools
• Liberal data access, local data science
• Write-append and immutable data structures
• Redundant data
Ingest → Process → Access
15
Challenges by example
• Customer data from RDBMS to HDFS
• Streaming device location data to Kafka
16
Challenges by example
Ingest table from RDBMS
[Diagram: the record “userId”: 123, “firstName”: “Janosch”, “dateOfBirth”: “1984-01-01” is copied by a daily import (e.g. via sqoop), leaving one copy per snapshot: today, -1 day, -2 days. Labels: Smaller Data, Big Data.]
17
Problems & Solution approaches
• Right to be forgotten
• Access limitation
• Bound to consent
• ...
• Anonymization
• Hashing
• Encryption
• ...
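As a rough illustration of the "Hashing" approach, the sketch below replaces the user identifier with a keyed hash (HMAC-SHA256) and drops or generalizes the other fields. The field names come from the example record used in this deck; the key value and the `pseudonymize` helper are assumptions made for the example. Note that a keyed hash is generally regarded as pseudonymization rather than anonymization, since whoever holds the key can recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed hash of the value (hypothetical helper)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"}
key = b"demo-key-not-for-production"  # assumption: would come from a secret store

masked = {
    # Same input and key always give the same token, so joins still work.
    "userId": pseudonymize(str(record["userId"]), key),
    # Generalize instead of copying the full date of birth.
    "birthYear": record["dateOfBirth"][:4],
    # firstName is dropped entirely (data minimization).
}
print(masked)
```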
18
Challenges by example
Encrypt, a.k.a. Lost Key Pattern
[Diagram: daily import (e.g. via sqoop) with snapshots today, -1 day, -2 days. Records appear in plaintext (“userId”: 123, “firstName”: “Janosch”, “dateOfBirth”: “1984-01-01”) and with encrypted fields (“userId”: 123, “firstName”: “54DCF13E4...”, “dateOfBirth”: “D3DFBCE...”); a key labeled 123 is shown.]
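One way to read the Lost Key Pattern: personal fields are encrypted with a per-user key before they land in immutable snapshots, and an erasure request is served by deleting that key instead of rewriting every snapshot. The sketch below is a toy model under stated assumptions: `KeyStore` is a hypothetical in-memory stand-in for an external key service, and the cipher is a simple HMAC-based keystream used only to keep the example dependency-free. A real system would use a vetted authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets


class KeyStore:
    """Hypothetical in-memory stand-in for an external key service."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create the per-user key on first use.
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        # "Losing" the key makes every stored ciphertext unreadable.
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key, blob):
    nonce, data = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode("utf-8")


store = KeyStore()
snapshots = []
for day in ("-2 days", "-1 day", "today"):
    k = store.key_for(123)
    snapshots.append({
        "day": day,
        "userId": 123,
        "firstName": encrypt(k, "Janosch"),
        "dateOfBirth": encrypt(k, "1984-01-01"),
    })

# While the key exists, any snapshot can be read back.
name_before = decrypt(store.get(123), snapshots[0]["firstName"])

# Erasure request: drop the key, leave the immutable snapshots untouched.
store.forget(123)
key_after = store.get(123)
print(name_before, key_after)
```

The snapshots keep the plain `userId` here, as on the slide, so the remaining data may still need to be assessed separately.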
19
Challenges by example
Deletion in log based systems
[Diagram: an edge device (deviceId: 123) pushes location data (“deviceId”: 123, “lat”: 52.510781, “lon”: 13.371735) to a Kafka topic. The topic, shown at offsets 2 to 6, holds keyed messages (456, A), (123, B), (123, C), (123, D) and (123, ∅). The consumer reads B, C, D, ∅.]
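The ∅ message in the diagram can be read as a tombstone: a record with a key and a null value. On a Kafka topic configured with `cleanup.policy=compact`, compaction eventually keeps only the latest record per key and removes keys whose latest record is a tombstone. The sketch below is a simplified model of that behavior in plain Python, not the broker's actual algorithm; the `compact` function and the in-memory log are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Message = Tuple[int, Optional[str]]  # (key, value); value None is a tombstone


def compact(log: List[Message]) -> List[Message]:
    """Keep the latest value per key and drop keys that end in a tombstone."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


# Messages from the diagram: device 456 sends A; device 123 sends B, C, D,
# followed by a tombstone for key 123.
log = [(456, "A"), (123, "B"), (123, "C"), (123, "D"), (123, None)]
compacted = compact(log)
print(compacted)
```

Two caveats apply in practice: compaction runs asynchronously, and a consumer that read the topic before compaction has already seen B, C and D, which matches the consumer box in the diagram. Deleting from the log therefore does not remove copies held downstream.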
20
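Challenges by example
Encrypt on write
[Diagram: an edge device (deviceId: 123) pushes location data (“deviceId”: 123, “lat”: 52.510781, “lon”: 13.371735) to a Kafka topic. The topic, shown at offsets 1 to 5, holds encrypted payloads: (123, D4), (123, Z3), (456, T3), (123, 6H), (123, N7). The consumer reads A, B, C, D. The key for 123 is marked "?".]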
21
Vendor recommendations
Distributions to the rescue!
• Hortonworks - "GDPR: The Good, Bad and Ugly", Jun 20 2017
• Cloudera - "Simplify your response to GDPR", Aug 24 2017
• GDPR compliance via partner solutions
• Only partial answers
Source: Cloudera Inc.
22
GDPR recommendations simplified
Kudu
Sentry
Navigator
Data Science
Workbench
HDFS / ...
Ranger
Atlas
Zeppelin
+ lots of partner solutions
23
Data privacy and open source
Pragmatic considerations
• Secured cluster
• Raw data in encryption zones with very limited access
• Anonymize for further processing wherever possible
• Proper retention policies, batch delete requests and perform regular
clean-ups
• Integrate with Atlas and Ranger → tagging, filtering and masking
• Custom solutions for glue and missing pieces
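A retention policy on date-partitioned data often reduces to "find every partition older than the cutoff and delete it in one batch". The sketch below shows only the selection step; the function name, the partition dates and the two-day retention period are assumptions for illustration, and the actual deletion (for example of HDFS directories) is left out.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, today, retention_days):
    """Return the partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


today = date(2017, 9, 27)
# Hypothetical daily snapshots: today, -1 day, ..., -4 days.
partitions = [today - timedelta(days=n) for n in range(5)]

to_delete = expired_partitions(partitions, today, retention_days=2)
print(to_delete)
```

Running such a job on a schedule also gives a natural place to apply batched delete requests, so that individual erasure requests do not each trigger a rewrite.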
24
Summary
• No comprehensive open-source solution available
• Proprietary services target specific problem domains, integration still
necessary
• It will take some time until the legal dust has settled
• Idea: Avro (logical types) + Vault (or similar) + Ranger + Atlas?
The road ahead
25
© 2017 Teradata
26
Hadoop Security Primer
In just one slide
• Authentication - Kerberos
• Authorization - Ranger, Sentry, ACLs
• Auditing / Monitoring - Ranger, Navigator, ...
• Encryption of data in motion - KMS, Navigator, ...
• Encryption of data at rest - Encryption zones, SEDs, ...
• Hadoop Security (Ben Spivey, Joey Echeverria)
• Hadoop and Kerberos: The Madness beyond the Gate
27
Personal data
According to GDPR
“any information relating to an identified or identifiable natural person (‘data
subject’);
An identifiable natural person is one who can be identified, directly or indirectly,
in particular by reference to an identifier such as a name, an identification
number, location data, an online identifier or to one or more factors specific to
the physical, physiological, genetic, mental, economic, cultural or social identity
of that natural person.”
- Article 4, GDPR
