SlideShare a Scribd company logo
1 of 19
How-to implement a GDPR
solution in a Cloudera
architecture
Cloudera + StreamSets Data Collector
Introduction
Since the implementation of GDPR regulation, all data processors across the world have
been struggling to be GDPR compliant and also deal with the new reality in Big Data, that
data is constantly drifting and mutating.
In this presentation, the approach will be:
● Cloudera architecture
● No additional financial cost
● Masking & Encrypting
Architecture
This architecture enables the following:
● Discover Data - ETL
● Data Governance - Policies
● Protect - Anonymization
Pre-Assumptions
1. Cluster hostname: cm1.localdomain
2. StreamSets Data Collector Linux user: sdc Kerberos Principal: sdc/cm1.localdomain/@DOMAIN.COM
3. Cluster Cloudera Manager: Cloudera Manager 5.12.2
4. StreamSets Data Collector: Installed
5. Cluster Authentication Pre-Installed: Kerberos
a. Kerberos Realm DOMAIN.COM
6. StreamSets Data Collector version: 3.5.0
7. StreamSets Data Collector Authentication: Kerberos
Base of Knowledge
With GDPR regulation it’s difficult to find the right balance
between protecting data and utilizing it, but with the solution
on this presentation you will not only do the ETL on your data
lake as you will also protect all the sensitive data.
Knowing this, you must know your data, the purpose of it, as
also the architecture of your System Environment and
therefore choose one or more ways to handle it.
JDBC Connections
The JDBC connections on StreamSets Origins require additional drivers.
For Hive and Impala JDBC Origin connection you need the Cloudera drivers (HiveJDBC41.jar, ImpalaJDBC41.jar), as for a Oracle connection the
Ojdbc8.jar.
Base of Knowledge
JDBC Conn Security URL
Hive
Metadata
None jdbc:hive2://server:10000/db_name
Hive
Metadata
Kerberos
(static)
jdbc:hive2://server:10000/db_name;principal=hive/server@REALM.COM
Hive
(Origin)
Kerberos
(user)
jdbc:impala://server:10000;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=hive
Impala
(Origin)
None jdbc:impala://server:21050;authMech=0
Impala
(Origin)
Kerberos jdbc:impala://server:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=impala
Oracle
(Origin)
Static jdbc:oracle:thin:@server:listerner_port:db
Use Case
Ingest and Mask Data from Local File System to Hive Metadata.
Data Description
Sample Data
Synopsis
● Dataset title.basics.tsv from IMDB dataset https://www.imdb.com/interfaces/
● Read Tabular Data File from Local File System,
● Mask the field primaryTitle with a regular expression
● Create a Hive External Table and Write to HDFS
Read data from File System
Mask Field
Hive Metadata
Hadoop File System
Hive Metastore
Pipeline Finisher
Data Result
Sample Data
Synopsis
● Regular expression: (.*)(.{4})
● Irreversible
Use Case
Ingest and Encrypt Data from Local File System to Hive Metadata.
Encrypt Field
Data Result
Sample Data
Synopsis
● Regular expression: (.*)(.{4})
● Reversible
Thanks
Big Data Engineer
Tiago Simões

More Related Content

What's hot

Cloudera cluster setup and configuration
Cloudera cluster setup and configurationCloudera cluster setup and configuration
Cloudera cluster setup and configurationSudheer Kondla
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingCloudera, Inc.
 
IT Infrastructure Through The Public Network Challenges And Solutions
IT Infrastructure Through The Public Network   Challenges And SolutionsIT Infrastructure Through The Public Network   Challenges And Solutions
IT Infrastructure Through The Public Network Challenges And SolutionsMartin Jackson
 
ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)YoungHeon (Roy) Kim
 
Mysql Fun
Mysql FunMysql Fun
Mysql FunSHC
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersSematext Group, Inc.
 
MySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKMySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKYoungHeon (Roy) Kim
 
Connect Amazon EC2 Linux Instance
Connect Amazon EC2 Linux InstanceConnect Amazon EC2 Linux Instance
Connect Amazon EC2 Linux InstanceVCP Muthukrishna
 
How Helm, The Package Manager For Kubernetes, Works
How Helm, The Package Manager For Kubernetes, WorksHow Helm, The Package Manager For Kubernetes, Works
How Helm, The Package Manager For Kubernetes, WorksMatthew Farina
 
Squid for Load-Balancing & Cache-Proxy ~ A techXpress Guide
Squid for Load-Balancing & Cache-Proxy ~ A techXpress GuideSquid for Load-Balancing & Cache-Proxy ~ A techXpress Guide
Squid for Load-Balancing & Cache-Proxy ~ A techXpress GuideAbhishek Kumar
 
Multinode kubernetes-cluster
Multinode kubernetes-clusterMultinode kubernetes-cluster
Multinode kubernetes-clusterRam Nath
 
Guide - Migrating from Heroku to AWS using CloudFormation
Guide - Migrating from Heroku to AWS using CloudFormationGuide - Migrating from Heroku to AWS using CloudFormation
Guide - Migrating from Heroku to AWS using CloudFormationRob Linton
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2benjaminwootton
 
Clouldera Implementation Guide for Production Deployments
Clouldera Implementation Guide for Production DeploymentsClouldera Implementation Guide for Production Deployments
Clouldera Implementation Guide for Production DeploymentsAhmed Mekawy
 
How To Install and Configure AWS CLI on RHEL 7
How To Install and Configure AWS CLI on RHEL 7How To Install and Configure AWS CLI on RHEL 7
How To Install and Configure AWS CLI on RHEL 7VCP Muthukrishna
 
OpenStack in 10 minutes with Devstack
OpenStack in 10 minutes with DevstackOpenStack in 10 minutes with Devstack
OpenStack in 10 minutes with DevstackSean Dague
 
Zookeeper In Action
Zookeeper In ActionZookeeper In Action
Zookeeper In Actionjuvenxu
 
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax
 

What's hot (20)

Cloudera cluster setup and configuration
Cloudera cluster setup and configurationCloudera cluster setup and configuration
Cloudera cluster setup and configuration
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
IT Infrastructure Through The Public Network Challenges And Solutions
IT Infrastructure Through The Public Network   Challenges And SolutionsIT Infrastructure Through The Public Network   Challenges And Solutions
IT Infrastructure Through The Public Network Challenges And Solutions
 
ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)
 
Build Automation 101
Build Automation 101Build Automation 101
Build Automation 101
 
Mysql Fun
Mysql FunMysql Fun
Mysql Fun
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
MySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKMySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELK
 
Connect Amazon EC2 Linux Instance
Connect Amazon EC2 Linux InstanceConnect Amazon EC2 Linux Instance
Connect Amazon EC2 Linux Instance
 
How Helm, The Package Manager For Kubernetes, Works
How Helm, The Package Manager For Kubernetes, WorksHow Helm, The Package Manager For Kubernetes, Works
How Helm, The Package Manager For Kubernetes, Works
 
Squid for Load-Balancing & Cache-Proxy ~ A techXpress Guide
Squid for Load-Balancing & Cache-Proxy ~ A techXpress GuideSquid for Load-Balancing & Cache-Proxy ~ A techXpress Guide
Squid for Load-Balancing & Cache-Proxy ~ A techXpress Guide
 
Multinode kubernetes-cluster
Multinode kubernetes-clusterMultinode kubernetes-cluster
Multinode kubernetes-cluster
 
Guide - Migrating from Heroku to AWS using CloudFormation
Guide - Migrating from Heroku to AWS using CloudFormationGuide - Migrating from Heroku to AWS using CloudFormation
Guide - Migrating from Heroku to AWS using CloudFormation
 
How to Run Solr on Docker and Why
How to Run Solr on Docker and WhyHow to Run Solr on Docker and Why
How to Run Solr on Docker and Why
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2
 
Clouldera Implementation Guide for Production Deployments
Clouldera Implementation Guide for Production DeploymentsClouldera Implementation Guide for Production Deployments
Clouldera Implementation Guide for Production Deployments
 
How To Install and Configure AWS CLI on RHEL 7
How To Install and Configure AWS CLI on RHEL 7How To Install and Configure AWS CLI on RHEL 7
How To Install and Configure AWS CLI on RHEL 7
 
OpenStack in 10 minutes with Devstack
OpenStack in 10 minutes with DevstackOpenStack in 10 minutes with Devstack
OpenStack in 10 minutes with Devstack
 
Zookeeper In Action
Zookeeper In ActionZookeeper In Action
Zookeeper In Action
 
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
 

Similar to Implement a GDPR-compliant data solution using Cloudera and StreamSets

Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Big Data Spain
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionTimothy Spann
 
Data Access Technologies
Data Access TechnologiesData Access Technologies
Data Access TechnologiesDimara Hakim
 
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...NomanKhalid56
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureSinanPetrusToma
 
Introduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerIntroduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerZohar Elkayam
 
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesOracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesLudovico Caldara
 
State of The Dolphin - May 2021
State of The Dolphin - May 2021State of The Dolphin - May 2021
State of The Dolphin - May 2021Frederic Descamps
 
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfseo18
 
Database Mirror for the exceptional DBA – David Izahk
Database Mirror for the exceptional DBA – David IzahkDatabase Mirror for the exceptional DBA – David Izahk
Database Mirror for the exceptional DBA – David Izahksqlserver.co.il
 
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...Michael Noel
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosEdureka!
 
Introduction to firebidSQL 3.x
Introduction to firebidSQL 3.xIntroduction to firebidSQL 3.x
Introduction to firebidSQL 3.xFabio Codebue
 
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...DataStax
 
From Java 17 to 21- A Showcase of JDK Security Enhancements
From Java 17 to 21- A Showcase of JDK Security EnhancementsFrom Java 17 to 21- A Showcase of JDK Security Enhancements
From Java 17 to 21- A Showcase of JDK Security EnhancementsAna-Maria Mihalceanu
 
ADF Mobile: Implementing Data Caching and Synching
ADF Mobile: Implementing Data Caching and SynchingADF Mobile: Implementing Data Caching and Synching
ADF Mobile: Implementing Data Caching and SynchingSteven Davelaar
 
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...MongoDB
 

Similar to Implement a GDPR-compliant data solution using Cloudera and StreamSets (20)

Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC Solution
 
Data Access Technologies
Data Access TechnologiesData Access Technologies
Data Access Technologies
 
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...
5675212318661411677_TRN4034_How_to_Migrate_to_Oracle_Autonomous_Database_Clou...
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
 
JDBC.ppt
JDBC.pptJDBC.ppt
JDBC.ppt
 
Introduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerIntroduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard Broker
 
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesOracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
 
State of The Dolphin - May 2021
State of The Dolphin - May 2021State of The Dolphin - May 2021
State of The Dolphin - May 2021
 
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
 
Database Mirror for the exceptional DBA – David Izahk
Database Mirror for the exceptional DBA – David IzahkDatabase Mirror for the exceptional DBA – David Izahk
Database Mirror for the exceptional DBA – David Izahk
 
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...
TechEd Africa 2011 - OFC308: SharePoint Security in an Insecure World: Unders...
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With Kerberos
 
Introduction to firebidSQL 3.x
Introduction to firebidSQL 3.xIntroduction to firebidSQL 3.x
Introduction to firebidSQL 3.x
 
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
 
From Java 17 to 21- A Showcase of JDK Security Enhancements
From Java 17 to 21- A Showcase of JDK Security EnhancementsFrom Java 17 to 21- A Showcase of JDK Security Enhancements
From Java 17 to 21- A Showcase of JDK Security Enhancements
 
Oracle Database 12c : Multitenant
Oracle Database 12c : MultitenantOracle Database 12c : Multitenant
Oracle Database 12c : Multitenant
 
ADF Mobile: Implementing Data Caching and Synching
ADF Mobile: Implementing Data Caching and SynchingADF Mobile: Implementing Data Caching and Synching
ADF Mobile: Implementing Data Caching and Synching
 
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...
Securing MongoDB to Serve an AWS-Based, Multi-Tenant, Security-Fanatic SaaS A...
 

Recently uploaded

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Recently uploaded (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Implement a GDPR-compliant data solution using Cloudera and StreamSets

  • 1. How-to implement a GDPR solution in a Cloudera architecture Cloudera + StreamSets Data Collector
  • 2. Introduction Since the implementation of GDPR regulation, all data processors across the world have been struggling to be GDPR compliant and also deal with the new reality in Big Data, that data is constantly drifting and mutating. In this presentation, the approach will be: ● Cloudera architecture ● No additional financial cost ● Masking & Encrypting
  • 3. Architecture This architecture enables the following: ● Discover Data - ETL ● Data Governance - Policies ● Protect - Anonymization
  • 4. Pre-Assumptions 1. Cluster hostname: cm1.localdomain 2. StreamSets Data Collector Linux user: sdc Kerberos Principal: sdc/cm1.localdomain/@DOMAIN.COM 3. Cluster Cloudera Manager: Cloudera Manager 5.12.2 4. StreamSets Data Collector: Installed 5. Cluster Authentication Pre-Installed: Kerberos a. Kerberos Realm DOMAIN.COM 6. StreamSets Data Collector version: 3.5.0 7. StreamSets Data Collector Authentication: Kerberos
  • 5. Base of Knowledge With GDPR regulation it’s difficult to find the right balance between protecting data and utilizing it, but with the solution on this presentation you will not only do the ETL on your data lake as you will also protect all the sensitive data. Knowing this, you must know your data, the purpose of it, as also the architecture of your System Environment and therefore choose one or more ways to handle it.
  • 6. JDBC Connections The JDBC connections on StreamSets Origins require additional drivers. For Hive and Impala JDBC Origin connection you need the Cloudera drivers (HiveJDBC41.jar, ImpalaJDBC41.jar), as for a Oracle connection the Ojdbc8.jar. Base of Knowledge JDBC Conn Security URL Hive Metadata None jdbc:hive2://server:10000/db_name Hive Metadata Kerberos (static) jdbc:hive2://server:10000/db_name;principal=hive/server@REALM.COM Hive (Origin) Kerberos (user) jdbc:impala://server:10000;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=hive Impala (Origin) None jdbc:impala://server:21050;authMech=0 Impala (Origin) Kerberos jdbc:impala://server:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=impala Oracle (Origin) Static jdbc:oracle:thin:@server:listerner_port:db
  • 7. Use Case Ingest and Mask Data from Local File System to Hive Metadata.
  • 8. Data Description Sample Data Synopsis ● Dataset title.basics.tsv from IMDB dataset https://www.imdb.com/interfaces/ ● Read Tabular Data File from Local File System, ● Mask the field primaryTitle with a regular expression ● Create a Hive External Table and Write to HDFS
  • 9. Read data from File System
  • 15. Data Result Sample Data Synopsis ● Regular expression: (.*)(.{4}) ● Irreversible
  • 16. Use Case Ingest and Encrypt Data from Local File System to Hive Metadata.
  • 18. Data Result Sample Data Synopsis ● Regular expression: (.*)(.{4}) ● Reversible