Knowing me, knowing you, knowing your disease

Knowing me, knowing you, knowing your disease:
A new paradigm in healthcare privacy-preserving data sharing and
big data analytics
Omiros Metaxas
ATHENA Research Center & University of Athens

Research Areas
Database and Information Systems
Human-Computer Interaction
Scientific Systems
Personalization & Social Networks
Electronic Infrastructures
Applications
• Query Optimization
• Cloud Query Processing
• Heterogeneous Systems
• Data mining / analytics
• Data curation
• Database User Interfaces
• Complex Data Visualization
• Scientific Experiment Management
• Scientific Databases
• Workflow Management
• Distributed Systems
• Cultural Heritage
• Life Sciences
• Physical Sciences
• User Modeling
• User Profiling
• Adaptivity
• Digital Libraries
• Data Repositories
• Interoperability
• Open Access Policies
• Cloud Data Services

BioMed
Oceans
Space & Earth
Culture Environment
OA Policies
Data Proc

From big data to new medical practice
1. Big Data Management
• Manage heterogeneous, federated biomedical data sources & models
• Data provenance & on-line transformation (ETL)
• “Sanitization” (Anonymisation)
• Semi-automatic data profiling & curation
• Decentralization: Use Blockchain to manage access to sensitive data
2. Big Data Analytics & Model-Based Analysis
• Address high dimensionality & heterogeneity
• Scaling through distributed processing
• Twofold Similarity Analysis (patients like mine & patients like me)
• KDD, statistical simulation & DSS based on BIG - routine - DATA
• Biomedical & Imaging Model-Based Analysis
• Privacy preserving algorithms & mechanisms
3. Clinicians, Researchers & Patients Support
• Scientific workflow support
• Collaboration, data sharing and 2nd opinion support
• Personalized, unified access to internal & external well-organized data, information, models & knowledge
• DSS Tools & Applications for every role
4. Medical Practice Reengineering
• Ethics & Privacy
• Transform daily routine data into useful information & knowledge
• Promote Model-Guided Personalized Medicine utilizing similarity analysis, simulation models and DSS tools
• Tools & Models validation based on clinicians’ feedback
5. Create & Support an Ecosystem
• Organize communities (clinicians, researchers & patients)
• Save, organize and diffuse information & knowledge
• Promote health (self care, awareness: patients like me, similarity)
• Market Place for everything and everyone (data, models, services and applications)
Big Data
• Volume (high)
• Velocity (high)
• Variety (great)
• Veracity (lack of)
• Value (hard to extract)
Big Data Analytics
• Capture (multi source)
• Aggregate (distributed storage)
• Process (distributed processing)
Privacy by Design & Privacy by Default
• Privacy preserving data publication & sharing
• Privacy preserving complex data flow execution
• Secure Data Access
• Privacy & Security data profiling

Quality Assurance, Quality of Service, Compliance & Dissemination
Privacy by design middleware Layer
ATHENA, GNUBILA [WP5]: Privacy preserving distributed data processing
HOW
WHERE
Private Data Sources
Federated Data Management & Data Harmonisation Layer
WHAT
Application Layer (WEB & Apps)
SIEMENS, ATHENA, HES-SO [WP2, WP8]: Data Exploration, Analytics & Cohort Builder based on advanced Similarity & Semantic Search
HWC, DigiMe [WP3]: Personal Data Account (PDA) & Dynamic consent management
WHY
GNUBILA, ATHENA, HWC [WP6, WP3]: Blockchain Integration & Smart contracts management
HES-SO, ATHENA [WP4]: Semantic Modeling and data integration
HES-SO, GNUBILA [WP4]: Persistent Identifiers Cataloguing (PID)
API (for SaaS applications): ATHENA, GNUBILA [WP5, WP6]
Hospitals: Electronic Medical Records
Personal Data Subjects: social media accounts, clinical data repositories, personal drives, wearable devices
LYN [WP11]: Coordination & Management
LYN [WP10]: Dissemination and Exploitation
CNR [WP9]: Penetration & Re-Identification Challenge
NCTM [WP2]: Regulatory and Compliance Study
HES-SO [WP1]: Requirements Analysis
LYN [WP7]: Platform-driven Assessment
DigiMe, HWC [WP3]: Personal Data acquisition and management
ATHENA [WP5]: Data Profiling & curation (quality, privacy & analysis)


Data Collection and Management
• Data collection / origin
◦ Pseudonymised (de-identified) clinical (routine) data
◦ Personal data including machine-generated data from Internet of Things (IoT)
◦ Derived data related to the usage and the processing of the data
• Data storage & preservation
◦ Federated data management for clinical data
  ▪ ETL, pre-processing and pseudonymisation flow
◦ DIGI.me Personal Data Account (PDA) application
  ▪ retrieves personal data to an encrypted local library, which users can then add to a personal cloud
• Data Modelling, Harmonisation, Cataloguing and Integration
◦ Global dynamic Subjective-Objective-Assessment-Plan (SOAP) model
◦ Use biomedical taxonomies and ontologies such as LOINC, SNOMED CT, ICD-10-CM, CPT, MeSH
◦ Persistent Identifiers (PIDs)
• Secure data access, sharing and processing in line with GDPR legislation

Hospitals
OPBG - Vatican
UCL/GOSH – London
DH – Berlin
IGG – Genova
KU - Leuven
CHUV – Lausanne
…


Data access & Privacy preservation
• Security / privacy breaches:
◦ avoid a single point of failure (e.g., a data warehouse or a trusted third party, TTP): decentralize data (transactions, patient data) and control using federation and blockchain
◦ offer multiple levels of privacy preservation
• Ownership: users should control their data and be able to join or leave easily
• Transparency: users should be able to audit the usage of their data
• Privacy is important

MDPSeC CDP
Blockchain integration @ MHMD: Blockchain as an access-control manager
[Figure labels: Patient, PI, PIDs, Digital Object Architecture (DOA), Smart Contract, New cohort request, consent, (Anonymous) Medical Data]
(1) PI initiates a data access request
(2) Consent request (anonymous), or re-identification & consent request
(3a,b) Sharing of (anonymous) EHRs: privacy preserving data publishing
(3c) Execute a privacy preserving computation (bio-medical model): privacy preserving distributed complex data flow execution
Transaction
• Actors (WHO): data controllers, data processors, data subjects
• Data (WHAT)
• Functions (WHY)
• Methods (HOW)
• Output (WHAT)
• a decentralized personal data management platform focused on privacy
• combine blockchain and off-blockchain storage
• users own, control and monitor their data and data usage
• utilize blockchain & smart contracts as an automated access-control manager
• does not require trust in a third party
• pointers to de-identified data → suitable for random queries
• support full data processing through PPDM

Blockchain integration: Smart Contract
• WHO: subjects & controllers; processors & requesters
• WHAT & WHY (Data, Functions, Output) and HOW:
◦ Personal data access: EHR data
◦ Publishing (external parties): Anonymization & Watermarking
◦ Mining (within MHMD): PPDM (MPC, DP, Encryption) on pseudonymized data; output: models, predictions
◦ DMP & (privacy) profiling
• Blockchain & Smart contracts: control & trace data usage
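A minimal sketch of the access-control idea, assuming an in-memory hash-chained ledger instead of a real blockchain or smart-contract platform (no consensus, no network). The class `ConsentLedger`, its methods and the subject/requester names are hypothetical and only illustrate how consent transactions can be recorded, audited and used to answer "may this requester use this data for this purpose?".

```python
import hashlib
import json


class ConsentLedger:
    """Toy append-only ledger of consent transactions (illustrative only)."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _hash(body):
        # Deterministic hash of the block body (keys sorted for stability).
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, subject, requester, purpose, granted):
        """Record a grant (True) or revocation (False) of consent."""
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {"subject": subject, "requester": requester,
                "purpose": purpose, "granted": granted, "prev": prev}
        self.blocks.append({**body, "hash": self._hash(body)})

    def is_valid(self):
        """Audit: every block must link to its predecessor and match its hash."""
        prev = self.GENESIS
        for block in self.blocks:
            body = {k: v for k, v in block.items() if k != "hash"}
            if block["prev"] != prev or block["hash"] != self._hash(body):
                return False
            prev = block["hash"]
        return True

    def may_access(self, subject, requester, purpose):
        """The latest matching transaction decides; default is no access."""
        decision = False
        for block in self.blocks:
            key = (block["subject"], block["requester"], block["purpose"])
            if key == (subject, requester, purpose):
                decision = block["granted"]
        return decision


ledger = ConsentLedger()
ledger.append("patient-1", "research-team", "cohort-study", granted=True)
print(ledger.may_access("patient-1", "research-team", "cohort-study"))
```

Only consent decisions live on the ledger here; the data itself would stay off-chain and be referenced through pointers, as the bullets above describe.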

Encryption and privacy preserving policies
Three main use cases:
• Personal Data Access
◦ Patient accessing his/her EHR
• Data publishing
◦ Research vs. other purposes
◦ Anonymization requirements
◦ Watermarking
• Privacy Preserving Data Mining (within platform)
◦ Move data: authorized applications get and process the data (e.g., MDP / Cardioproof)
◦ Move computation to data: secure multiparty computation (SMC, DP) on federated data / distrustful parties (MHMD, HBP)
◦ Other encryption techniques (homomorphic)

• Static data publishing: “Sanitization” (Anonymization)
• Secure multi-party computation: only overall aggregated data are transferred between nodes
• Interactive anonymization: Differential Privacy & Crowd-Blending privacy
• Encryption: Fully/Partially Homomorphic Encryption (FHE/PHE)
• Decentralization: use Blockchain to protect personal data

• Privacy & Sensitivity Data Profiling:
◦ Define privacy profiles per data type & usage scenario
• Trade-offs among efficiency, accuracy & privacy
• Define a formal methodology to describe the “privacy budget” in terms of expected accuracy
• Automate privacy preserving method selection based on the privacy & sensitivity profile and the efficiency / accuracy trade-offs
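One way to make the privacy/accuracy trade-off concrete is the Laplace mechanism of differential privacy, sketched below for a counting query. This is an illustration under stated assumptions, not the platform's method: the function names are hypothetical, and the query is assumed to have sensitivity 1 (adding or removing one patient changes the count by at most 1). A smaller privacy budget epsilon means more noise; the expected absolute error is 1/epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Noisy count of matching records; sensitivity of a count is 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def expected_abs_error(epsilon):
    """Expected |noise| of the Laplace mechanism for a count: 1 / epsilon."""
    return 1.0 / epsilon


rng = random.Random(0)
ages = [34, 37, 31, 62, 45]          # toy data, not from any real dataset
noisy = dp_count(ages, lambda a: a < 40, epsilon=1.0, rng=rng)
print(f"noisy count: {noisy:.2f}, expected |error|: {expected_abs_error(1.0)}")
```

Expressing the budget as an expected error (for example "counts accurate to about ±2" for epsilon = 0.5) is one possible reading of describing the privacy budget in terms of expected accuracy.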

Secure Data publishing
• Different dangers
◦ Identity leakage
◦ Attribute leakage
◦ Participation leakage
• Different transformations
◦ Generalization
◦ Suppression
◦ Perturbation
◦ Partitioning
◦ Noise addition
• “Sanitization” (Anonymisation): hiding individual information (ensuring k-anonymity) while preserving aggregated (sufficient) statistics
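A small sketch of generalization plus suppression to reach k-anonymity. The quasi-identifiers (`age`, `zip`), the generalization widths and the toy records are assumptions made for illustration; real tools choose generalization hierarchies far more carefully.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Generalization: replace an exact age by a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Generalization: keep a prefix of the postal code, mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize(records, k):
    """Generalize quasi-identifiers, then suppress groups smaller than k."""
    generalized = [
        {"age": generalize_age(r["age"]),
         "zip": generalize_zip(r["zip"]),
         "diagnosis": r["diagnosis"]}
        for r in records
    ]
    sizes = Counter((g["age"], g["zip"]) for g in generalized)
    return [g for g in generalized if sizes[(g["age"], g["zip"])] >= k]


def is_k_anonymous(records, k):
    """Every combination of quasi-identifier values occurs at least k times."""
    sizes = Counter((r["age"], r["zip"]) for r in records)
    return all(n >= k for n in sizes.values())


records = [  # toy records, not real patient data
    {"age": 34, "zip": "11521", "diagnosis": "JIA"},
    {"age": 37, "zip": "11528", "diagnosis": "CHD"},
    {"age": 31, "zip": "11523", "diagnosis": "JIA"},
    {"age": 62, "zip": "10431", "diagnosis": "CHD"},
]
published = anonymize(records, k=2)
print(len(published), is_k_anonymous(published, k=2))
```

The single 62-year-old record is suppressed because its group has fewer than k members, while diagnosis counts within the remaining group are preserved.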

Secure Data publishing
• Amnesia anonymization tool
◦ It offers several versions of k-anonymity
◦ It allows the user to select and customize possible solutions
◦ It offers graphical tools that allow the user to analyze the anonymized dataset
◦ It is scalable and uses all available CPU cores in the anonymization process
• Watermarking techniques

Privacy Preserving Data Mining
• The setting: data is horizontally distributed at different sites on a Private Data Network (PDN) of mutually distrustful parties
• The aim: compute the data mining algorithm on the data so that nothing but the output is learned
◦ Use secure computation: SMPC, encryption, DP, etc.
◦ Assume semi-honest adversaries, i.e., parties that follow the protocol
  ▪ Makes sense where the participating parties really trust each other (e.g., hospitals)
• Training (learning) vs. reasoning: different requirements and privacy related issues
◦ training: needs access to patient records
◦ reasoning: needs only the model and new data subjects, but…
  ▪ Inference from the results: one can break privacy using well specified queries and analyzing the results
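As an illustration of "nothing but the output is learned", here is a sketch of a secure sum using additive secret sharing among semi-honest parties. It is one standard SMPC building block, chosen here as an example rather than taken from the slides; the modulus and function names are assumptions, and all parties are simulated in one process.

```python
import secrets

PRIME = 2**61 - 1  # public modulus; must exceed any possible total


def share(value, n_parties):
    """Split value into n random shares that add up to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Each party shares its value; only sums of shares are ever revealed."""
    n = len(private_values)
    # Row i holds the shares produced by party i; column j goes to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j publishes only the sum of the shares it received.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    return sum(partial_sums) % PRIME


hospital_counts = [120, 75, 310]  # toy per-site patient counts
print(secure_sum(hospital_counts))
```

Any single share is uniformly random, so a party that follows the protocol learns nothing about another site's count beyond what the final total implies.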

Distributed Privacy Preserving Data Mining: EXAREME
• Distributed elastic execution
• Iterative dataflow execution: support ML algorithms
• Powerful data programming paradigm: SQL with User Defined Functions
• Privacy-aware query processing

Dataflow Execution Example
Query federation: decompose the query into local and global parts
• Example query: “count, avg, std” per m-name over tables (id, m-name, m-value) held at sites 1 … N
• Local part, L: “Σx, Σx², N”: each site runs the local queries and returns partial aggregated results (m-name, Σx, Σx², N)
• Global part, G: “N, avg, std”: the global queries combine the partial results into (m-name, N, avg, std)
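The local/global decomposition can be sketched in a few lines. The function names are hypothetical and the population standard deviation is assumed (the slide does not say which variant is used); the point is that only Σx, Σx² and N leave each site, never individual rows.

```python
import math
from collections import defaultdict


def local_query(rows):
    """Runs at one site. rows: (id, m_name, m_value) -> {m_name: (Σx, Σx², N)}."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for _id, m_name, m_value in rows:
        a = acc[m_name]
        a[0] += m_value
        a[1] += m_value * m_value
        a[2] += 1
    return {name: tuple(a) for name, a in acc.items()}


def global_query(partials):
    """Merges partial aggregates into {m_name: (N, avg, std)}."""
    merged = defaultdict(lambda: [0.0, 0.0, 0])
    for partial in partials:
        for name, (sx, sx2, n) in partial.items():
            m = merged[name]
            m[0] += sx
            m[1] += sx2
            m[2] += n
    result = {}
    for name, (sx, sx2, n) in merged.items():
        avg = sx / n
        variance = max(sx2 / n - avg * avg, 0.0)  # population variance (assumed)
        result[name] = (n, avg, math.sqrt(variance))
    return result


site_1 = [(1, "bsa", 1.0), (2, "bsa", 2.0)]  # toy measurements
site_2 = [(3, "bsa", 3.0), (4, "bsa", 4.0)]
print(global_query([local_query(site_1), local_query(site_2)]))
```

Because Σx, Σx² and N are all additive, the global step gives the same result as computing avg and std over the pooled rows.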


Data Cleaning, Exploration & Analytics
• Data curation & profiling, knowledge discovery and statistical simulation framework
◦ Process driven by bottom-up evidence AND top-down models/knowledge
◦ Data profiling, cleaning & exploration: statistical analysis, advanced visualization, rule based cleaning
◦ Data mining, pattern discovery and similarity analysis: well established ML
◦ Statistical simulation: dependency analysis/reasoning based on Bayesian Nets
Data
Query System
Action
Results
Analytics System
Curation System
XYZ System
• Abstraction
• Analytics (data mining, machine learning, discovery)
• Cleaning and curation
• Homogenization and integration
• Querying and searching
• Transformation
• Visualization
• Zooming
• …

Data analytics flow to Precision Medicine
[Figure: BOTTOM-UP and TOP-DOWN flow]
• Data Management & Harmonisation: transformed & validated data
• Data Curation & Exploration: cleaning, profiling & pre-processing; biomarker based personalized acquisition
• Data Analysis & Modelling: knowledge discovery & model training; disease signatures & patient groups; variable dependencies & prediction models
• Precision Medicine Support: reasoning, simulation & DSS; for a particular patient with unknown / missing data, predict the value of the missing variable; individualized diagnosis, prognosis & treatment plan
• Domain knowledge & assumptions; clinical workflows
[Inset: final DAG (based on MCMC&DP, threshold=0.5) over the variables Age, ParCHD, Procedures, ExIntoler, Cyanosis, CPBP, CPArrhy, CPConcl, CPTermRsn, BSA, TPVRegurg, TriRegurg, RVD, RedRV, PSMotion, RestrPatt, AVBlock, SupravArrhy, VentricArrhy; scale from 0 to 1]


Data Profiling and Curation
• Data profiling:
◦ assesses the actual content, structure and quality of the data
◦ reveals its characteristics, strengths and weaknesses
• Types:
◦ Structural: schema, type (e.g., numeric or text), format (e.g., mm/dd/yyyy)
◦ Statistical: distribution, missing values, tails
◦ Logical: rules, constraints
◦ Identity: deduplication / resolution, reference table matching
◦ Security / privacy data profiling: assessing relevance, sensitivity, risk for the individual and practical value
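A minimal sketch of structural and statistical profiling for a single column, assuming values arrive as raw strings with `None` or an empty string marking a missing value. The function name and the returned fields are illustrative choices, not the interface of any tool named in this deck.

```python
import statistics


def profile_column(values):
    """Return basic structural and statistical facts about one column."""
    present = [v for v in values if v not in (None, "")]
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except ValueError:
            pass  # not a number: the column will be typed as text
    is_numeric = bool(present) and len(numeric) == len(present)
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "type": "numeric" if is_numeric else "text",
    }
    if is_numeric:
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.fmean(numeric)
    return profile


print(profile_column(["1.5", "2.5", "", None, "2.5"]))
```

Logical and identity profiling (rules, deduplication) would be added as further checks on top of such a per-column summary.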

Data Profiling and Curation
DCV: semi-automatic tool
• data profiling
• data cleaning, validation & transformation
• privacy preserving data analysis
• interactive and efficient web-based interface
• workflow support (rerun experiments, reproduce results)

User-defined Cleaning Rules
Click on red piece of
pie to see violations


Data Mining & KDD
• Disease signatures: latent factors characterizing disease
◦ Patterns over the most relevant disease variables, e.g., biomarkers
◦ Several approaches (probabilistic latent factor analysis, well established ML algorithms)
• Predictive analysis: patient classification or regression for categorization and outcome analysis
• Descriptive analysis: clustering algorithms & probabilistic (mixed) membership models
• Similarity analysis: patients “like” me or mine (patient/clinician role)

Classification (model training & pattern discovery)


Classification (categorization)
• Neurological and Neuromuscular Disease (NND) use case: automatic classification of 7 joint movement patterns based on kinematic data
◦ Training: on specific extracted features or raw gait analysis waveforms (time series)
◦ Cross-validation: stratified 10-fold
◦ Method: Random Forests, kNN
◦ Results: model prediction accuracies > 85%
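A self-contained sketch of stratified k-fold cross-validation with a kNN classifier, written with the standard library only. The two-class toy data and the class labels are invented for illustration and are unrelated to the gait-analysis dataset; the helper names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def knn_predict(train_x, train_y, x, n_neighbors=3):
    """Majority vote among the n nearest training points (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_validate(xs, ys, k=10, n_neighbors=3):
    """Mean accuracy over k stratified folds."""
    correct = 0
    for fold in stratified_folds(ys, k):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
                       for i in fold)
    return correct / len(xs)


# Toy, well separated two-class data (two features per sample).
xs = [(0.1 * i, 0.0) for i in range(20)] + [(10.0 + 0.1 * i, 0.0) for i in range(20)]
ys = ["pattern_A"] * 20 + ["pattern_B"] * 20
accuracy = cross_validate(xs, ys, k=10, n_neighbors=3)
print(f"stratified 10-fold accuracy: {accuracy:.2f}")
```

Stratification keeps the class proportions of each fold close to those of the full dataset, which matters when some movement patterns are rare.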

Classification (outcome prediction)
Aim: To predict early disease outcome in JIA using baseline variables
Analysis: Random Forest algorithm on three datasets
• Clinical (acc = 0.6)
• Clinical with Luminex (acc = 0.57)
• Clinical with microbiota (acc = 0.52)
Conclusion: Difficulty identifying patients who remained active

Probabilistic Modeling for statistical simulation
Modelling
Dependency Analysis
Inference

Probabilistic Modeling for statistical simulation
Finding the most important dependencies and independencies: e.g., disDur, neutro and pga are almost uncorrelated and excluded
Qualitative dependency analysis: learning the structure (DAG)
Quantitative analysis: learning the model parameters (conditional probabilities)
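The quantitative step can be illustrated for one node of a DAG whose structure is already fixed: estimate P(child | parents) from counts, with optional additive smoothing. The variable names (`therapy`, `outcome`), the toy rows and the `alpha` parameter are assumptions for illustration and do not reproduce the study's data or results.

```python
from collections import Counter


def learn_cpt(rows, child, parents, alpha=1.0):
    """Estimate P(child | parents) from rows (dicts) with additive smoothing."""
    child_values = sorted({r[child] for r in rows})
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    parent_totals = Counter(tuple(r[p] for p in parents) for r in rows)
    cpt = {}
    for config, total in parent_totals.items():
        denom = total + alpha * len(child_values)
        cpt[config] = {v: (joint[(config, v)] + alpha) / denom
                       for v in child_values}
    return cpt


rows = [  # invented toy observations
    {"therapy": "MTX", "outcome": "active"},
    {"therapy": "MTX", "outcome": "active"},
    {"therapy": "MTX", "outcome": "active"},
    {"therapy": "MTX", "outcome": "inactive"},
    {"therapy": "none", "outcome": "active"},
    {"therapy": "none", "outcome": "inactive"},
]
cpt = learn_cpt(rows, child="outcome", parents=["therapy"], alpha=0.0)
print(cpt[("MTX",)])
```

A "what if" query on a single parent configuration is then a lookup in the table; inference across several nodes of the network needs a proper propagation algorithm on top of such tables.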

Sensitivity Analysis on Outcome

tmj active
very small sample
- Bad prognosis
- Aggressive treatment

What if.. A new patient with 2 act. knee
joints & symmetry

What if.. Therapy = MTX
-same percentage
-worse prognosis

What about domain knowledge??

BIO-KNOWLEDGE ASSOCIATION MAP
• Input: full text PubMed papers and metadata including MeSH
• NLP & Named Entity Recognition with semantic bio terms (PDBCodes, chem2Bio2RDF, LODD): annotate publications with bioterms (genes, pdbCodes, etc.)
• Multi View Topic Modelling: identify multi-modal topics & quantify associations
• Generate specific association maps for different types of entities (e.g., genes, MeSH, proteins, drugs)
Example topics (Text: pdbCode)
• cancer tumor growth breast lines apoptosis tumors prostate kinase: 1m17 2ity 1qcf
• binding dna brca brct cancer res mutations domain results: 1tsr 1ycs 2ac0 1gzh
• nef hiv vpr ssb virus felv hck replication ssdna: 1eyg 2nef 1jmc 1m8l 1efn 1izn

Mining scientific literature: WHY
• Analyze large collections of documents and metadata to:
◦ identify active areas of research: discover hidden themes (topics)
◦ understand what is actually produced: project the output to the reduced topic space (calculate topic distributions per document or other entity, e.g., gene or protein)
◦ create association maps (interaction networks) among different entities (e.g., genes, drugs, diseases, proteins)
  ▪ promote target identification: “pathway expansion” for non-‘druggable’ targets, multi-target drugs, drug repositioning (indication expansion)
◦ identify emerging research areas, e.g., target identification or the understanding of disease mechanisms: create new therapeutic opportunities
◦ assess coverage, identify gaps or new therapeutic opportunities: compare funded research, patents

What is involved…
1. ENRICH & PRE-PROCESS
• Extract features and annotate (enrich) content using NLP, Named Entity Recognition & Semantic Annotation
• Tokenize, remove stop words
• Refine stop words for specific domain
2. FIND TOPICS
• Identify topics: distribution over words & “side” information
• Automatic topic curation & entitling
• Assign topics to publications
• Evaluate & categorize topics
• Assess topic labels
3. CALCULATE TRENDS & SIMILARITIES
• Calculate topic proportions & trends of objects based on their publications
• Calculate similarity among different entities based on various metrics
• Analyze & validate the results
4. VISUALIZE
• Create WEB interactive visualization with data driven graphs, charts and layouts
• Design optimal views
• Validate modeling results
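Step 3 can be sketched as follows, assuming the topic model has already produced a topic distribution per publication. An entity (for example a gene or a drug) is profiled by averaging the distributions of its publications, and two entities are compared with cosine similarity, which is only one of the possible metrics. All identifiers and numbers below are made up.

```python
import math


def entity_topic_profile(doc_topics, entity_docs):
    """Average the topic distributions of the documents linked to an entity."""
    n_topics = len(next(iter(doc_topics.values())))
    profile = [0.0] * n_topics
    for doc_id in entity_docs:
        for t, weight in enumerate(doc_topics[doc_id]):
            profile[t] += weight
    return [w / len(entity_docs) for w in profile]


def cosine(a, b):
    """Cosine similarity of two topic vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical topic distributions for three publications over two topics.
doc_topics = {"p1": [0.8, 0.2], "p2": [0.6, 0.4], "p3": [0.1, 0.9]}
gene_a = entity_topic_profile(doc_topics, ["p1", "p2"])
gene_b = entity_topic_profile(doc_topics, ["p3"])
print(f"similarity: {cosine(gene_a, gene_b):.3f}")
```

Thresholding such pairwise similarities gives the weighted links of an association map between entities.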

Methodology
• Probabilistic Multi-View Topic Modeling of Text-Augmented Heterogeneous Information Networks
◦ interconnected (linked) entities which are characterized by TEXT and related side information & links (e.g., taxonomies, venues, projects / research areas, citations, authors)
◦ side information:
  ▪ structured or unstructured attributes and metadata
  ▪ links / relations: e.g., authorship network, citation network
  ▪ incomplete, noisy or not related to textual attributes

Multi-View Topic Modeling
Expert: What is this Topic about??
Diagnostics and treatment development: Gene therapy & genetic vectors
• Text: gene, cells, expression, vector, aav, vectors, dna, therapy, figure, cell, target, gfp, targeting, delivery, diseases
• Phrases: gene therapy, gene transfer, aav vectors, lentiviral vectors
• Grants: PERSIST: Persisting Transgenesis; AAVEYE: Gene therapy for inherited severe photoreceptor diseases
• MeSH Descriptors: Genetic Vectors, Lentivirus, Genetic Therapy, Dependovirus, Green Fluorescent Proteins
• Journals: Molecular Therapy
• Research Areas: Biotechnology, generic tools and medical technologies for human health
Multi-View Topic Modeling
Infectious diseases: HIV and NEF protein
• Text: hiv, cells, cell, nef, viral, virus, bst, gag, infected, drug, vpu, gfp, assembly, surface, cellular
• Phrases: gfp cells, hela cells, infected cells, plasma membrane
• Grants: HIV ACE: Targeting assembly of infectious HIV particles; INEF: Inhibiting Nef: a novel drug target for HIV-host interactions
• MeSH Descriptors: HIV-1; Antigens, CD; Cell Membrane; Membrane Glycoproteins
• Journals: PLoS Pathogens
• Research Areas: HEALTH-2007-2 [Translating research for human health]
• PDB codes: 2NEF, 1M8L, 1EFN

Similarity & Graph clustering
Topics & allocations Modelling
• LINKS represent topic based similarity
• NODES may represent drugs, PDBCodes, genes or MeSH terms
• Size: ~ # of publications
• Categories may represent Anatomical Therapeutic Chemical (ATC) class, Biological Process, MeSH hierarchy, etc.

e-Infrastructures & data repositories
Use Domain Knowledge to
• Enhance Patient Similarity Analysis
• Promote Decision Support
Clinical data clouds
• Analyze clinical data to validate findings
