Knowing me, knowing you, knowing your disease: A new paradigm in healthcare privacy-preserving data sharing and big data analytics. Speaker: Omiros Metaxas, Senior Researcher at ATHENA RIC & University of Athens
1. Knowing me, knowing you, knowing your disease:
A new paradigm in healthcare privacy-preserving data sharing and
big data analytics
Omiros Metaxas
ATHENA Research Center & University of Athens
2. Research Areas
Database and Information Systems
• Query Optimization
• Cloud Query Processing
• Heterogeneous Systems
• Data mining / analytics
• Data curation
Human-Computer Interaction
• Database User Interfaces
• Complex Data Visualization
Scientific Systems
• Scientific Experiment Management
• Scientific Databases
• Workflow Management
• Distributed Systems
Applications
• Cultural Heritage
• Life Sciences
• Physical Sciences
Personalization & Social Networks
• User Modeling
• User Profiling
• Adaptivity
Electronic Infrastructures
• Digital Libraries
• Data Repositories
• Interoperability
• Open Access Policies
• Cloud Data Services
4. From big data to new medical practice
1. Big Data Management
• Manage heterogeneous, federated biomedical data sources & models
• Data provenance & on-line transformation (ETL)
• “Sanitization” (Anonymisation)
• Semi-automatic data profiling & curation
• Decentralization: Use Blockchain to manage access to sensitive Data
2. Big Data Analytics & Model-Based Analysis
• Address High Dimensionality & heterogeneity
• Scaling through Distributed processing
• Twofold Similarity Analysis (patients like mine & patients like me)
• KDD, statistical simulation & DSS based on BIG - routine - DATA
• Biomedical & Imaging Model-Based Analysis
• Privacy preserving algorithms & mechanisms
3. Clinicians, Researchers & Patients Support
• Scientific workflow support
• Collaboration, data sharing and 2nd opinion support
• Personalized, Unified Access to internal & external well-organized data, information, models & knowledge
• DSS Tools & Applications for every role
4. Medical Practice Reengineering
• Ethics & Privacy
• Transform daily routine’s data into useful information & knowledge
• Promote Model-Guided Personalized Medicine utilizing similarity analysis, simulation models and DSS tools
• Tools & Models validation based on clinicians’ feedback
5. Create & Support an Ecosystem
• Organize communities (clinicians, researchers & patients)
• Save, organize and diffuse information & knowledge
• Promote health (self care, awareness: patients like me, similarity)
• Market Place for everything and everyone (data, models, services and applications)
Big Data
• Volume (high)
• Velocity (high)
• Variety (great)
• Veracity (lack of)
• Value (hard to extract)
Big Data Analytics
• Capture (multi source)
• Aggregate (distributed storage)
• Process (distributed processing)
Privacy by Design & Privacy by Default
• Privacy preserving data publication & sharing
• Privacy preserving complex data flow execution
• Secure Data Access
• Privacy & Security data profiling
5. Quality Assurance, Quality of Service, Compliance & Dissemination
• HES-SO [WP1]: Requirements Analysis
• NCTM [WP2]: Regulatory and Compliance Study
• LYN [WP7]: Platform-driven Assessment
• CNR [WP9]: Penetration & Re-Identification Challenge
• LYN [WP10]: Dissemination and Exploitation
• LYN [WP11]: Coordination & Management
Application Layer (WEB & Apps): WHY
• SIEMENS, ATHENA, HES-SO [WP2, WP8]: Data Exploration, Analytics & Cohort Builder based on advanced Similarity & Semantic Search
• HWC, DigiMe [WP3]: Personal Data Account (PDA) & Dynamic consent management
Privacy by design middleware Layer: HOW
• ATHENA, GNUBILA [WP5]: Privacy preserving distributed data processing
• GNUBILA, ATHENA, HWC [WP6, WP3]: Blockchain Integration & Smart contracts management
• DigiMe, HWC [WP3]: Personal Data acquisition and management
• ATHENA, GNUBILA [WP5, WP6]: API (for SaaS applications)
Federated Data Management & Data Harmonisation Layer: WHAT
• HES-SO, ATHENA [WP4]: Semantic Modeling and data integration
• HES-SO, GNUBILA [WP4]: Persistent Identifiers Cataloguing (PID)
Private Data Sources: WHERE
• Hospitals: Electronic Medical Records
• Personal Data Subjects: social media accounts, clinical data repositories, personal drives, wearable devices
ATHENA [WP5]: Data Profiling & curation (quality, privacy & analysis)
7. Data Collection and Management
Data collection / origin
◦ Pseudonymised (de-identified) clinical (routine) data
◦ Personal data including machine-generated data from Internet of Things (IoT)
◦ Derived data related to the usage and the processing of the data
Data storage & preservation
◦ Federated data management for clinical data: ETL, pre-processing and pseudonymisation flow
◦ DIGI.me Personal Data Account (PDA) application: retrieves personal data to an encrypted local library, which users can then add to a personal cloud
Data Modelling, Harmonisation, Cataloguing and Integration
◦ Global dynamic Subjective-Objective-Assessment-Plan (SOAP) model
◦ Use biomedical taxonomies and ontologies such as LOINC, SNOMED CT, ICD-10-CM, CPT, MeSH
◦ Persistent Identifiers (PIDs)
Secure data access, sharing and processing in line with GDPR legislation
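A minimal sketch of the pseudonymisation step in such an ETL flow, assuming a keyed hash (HMAC-SHA256) held by the data controller. The field names, the key and the sample record are hypothetical; the slides do not specify the mechanism actually used.

```python
import hashlib
import hmac

def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for patient_id using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def etl_record(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace the patient id with a pseudonym."""
    direct_identifiers = {"name", "address", "phone"}  # hypothetical field names
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["patient_id"] = pseudonymise(str(record["patient_id"]), secret_key)
    return cleaned

key = b"site-local-secret"  # assumption: known only to the data controller
row = {"patient_id": "H-001", "name": "Jane Doe", "age": 9, "diagnosis": "JIA"}  # synthetic
out = etl_record(row, key)
```

The same identifier always maps to the same pseudonym under one key, so records can still be linked within a site without exposing the original id.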
10. Data access & Privacy preservation: Privacy is important
Security / privacy breaches:
◦ avoid a single point of failure (e.g., data warehouse, TTP): decentralize data (transactions, patient data) and control using federation and blockchain
◦ offer multiple levels of privacy preservation
Ownership: Users should control their data, easily join or leave
Transparency: Users should be able to audit the usage of their data
11. Blockchain as an access-control manager
[Diagram: Blockchain integration @ MHMD. Labels: MDPSeC CDP, Patient, PI, PIDs, Digital Object Architecture (DOA), Smart Contract, Transaction, consent, (Anonymous) Medical Data]
(1) Initiates a Data Access request (new cohort request)
(2) Re-identification & consent request
(3a,b) Sharing of (Anonymous) EHRs: privacy preserving data publishing
(3c) Execute a privacy preserving computation (bio-medical model): privacy preserving distributed complex data flow execution
Actors (WHO): data subjects, data controllers, data processors
Data (WHAT), Functions (WHY), Methods (HOW), Output (WHAT)
a decentralized personal data management platform focused on privacy
combine blockchain and off-blockchain storage
users own, control and monitor their data and data usage
utilize blockchain & smart contracts as an automated access-control manager
does not require trust in a third party
pointers to de-identified data suitable for random queries
support full data processing through PPDM
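A toy sketch of the access-control idea: consent decisions are appended to a hash-chained log and every data request is checked against it. It is not a blockchain or a smart contract implementation; the class, method names and identifiers are hypothetical and only illustrate "control & trace data usage".

```python
import hashlib
import json

class ConsentLedger:
    """Append-only, hash-chained log of consent transactions (toy model)."""

    def __init__(self):
        self.blocks = []

    def append(self, subject_pid, requester, purpose, granted):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        body = {"pid": subject_pid, "requester": requester,
                "purpose": purpose, "granted": granted, "prev": prev_hash}
        body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        block = dict(body, hash=body_hash)
        self.blocks.append(block)
        return block

    def is_allowed(self, subject_pid, requester, purpose):
        """The latest matching transaction wins; no record means no access."""
        for block in reversed(self.blocks):
            if (block["pid"], block["requester"], block["purpose"]) == (subject_pid, requester, purpose):
                return block["granted"]
        return False

    def verify(self):
        """Recompute every hash and check the chain links (audit step)."""
        prev = "0" * 64
        for block in self.blocks:
            body = {k: v for k, v in block.items() if k != "hash"}
            expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if block["prev"] != prev or block["hash"] != expected:
                return False
            prev = block["hash"]
        return True

ledger = ConsentLedger()
ledger.append("pid-42", "researcher-A", "cohort-study", True)   # subject grants consent
allowed_before = ledger.is_allowed("pid-42", "researcher-A", "cohort-study")
ledger.append("pid-42", "researcher-A", "cohort-study", False)  # subject revokes consent
allowed_after = ledger.is_allowed("pid-42", "researcher-A", "cohort-study")
```

Only pseudonymous identifiers and decisions sit in the log; the data itself stays off-chain, as the slide describes.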
12. Blockchain integration: Smart Contract
WHO: subjects & controllers, processors & requesters
WHAT & WHY
◦ Data: EHR data, personal data
◦ Functions: Personal data access; Publishing (external parties); Mining (within MHMD)
◦ Output: Models, Predictions
HOW
◦ DMP & (privacy) profiling
◦ PPDM: MPC, DP, Encryption (on pseudonymized data)
◦ Publishing: Anonymization & Watermarking
◦ Blockchain & Smart contracts (control & trace data usage)
13. Encryption and privacy preserving policies
Three main use cases:
Personal Data Access
◦ Patient accessing his/her EHR
Data publishing
◦ Research vs. other purposes
◦ Anonymization requirements
◦ Watermarking
Privacy Preserving Data Mining (within platform)
◦ Move data (authorized applications get and process the data i.e., MDP / Cardioproof)
◦ Move computation to data: secure multiparty computation (SMC, DP) on federated data / distrustful parties (MHMD, HBP)
◦ Other encryption techniques (homomorphic)
14. Encryption and privacy preserving policies
static data publishing: “Sanitization” (Anonymization)
secure multi-party computation: only overall aggregated data are transferred between nodes
interactive anonymization: Differential Privacy & Crowd-Blending privacy
encryption: Fully/Partially Homomorphic Encryption (FHE/PHE)
decentralization: Use Blockchain to Protect Personal Data
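As an illustration of interactive anonymization, here is a minimal sketch of the Laplace mechanism for a differentially private count. The noise is sampled as the difference of two exponential variates, which follows a Laplace distribution. The data, predicate and epsilon are synthetic assumptions, not values from the project.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two Exp(1/scale) variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count; a counting query has sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(7)
ages = [9, 11, 12, 14, 15, 17, 8, 13]          # synthetic patient ages
noisy = dp_count(ages, lambda a: a >= 12, epsilon=1.0, rng=rng)
```

A smaller epsilon gives stronger privacy and a noisier answer, which is the efficiency / accuracy / privacy trade-off the following slide refers to.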
15. Encryption and privacy preserving policies
Privacy & Sensitivity Data Profiling:
◦ Define privacy profiles per data type & usage scenario
Trade-offs among efficiency, accuracy & privacy
Define a formal methodology to describe “privacy budget” in terms of expected accuracy
Automate privacy preserving method selection based on privacy & sensitivity profile and efficiency / accuracy trade-offs
16. Secure Data publishing
Different dangers
◦ Identity leakage
◦ Attribute leakage
◦ Participation leakage
Different transformations
◦ Generalization
◦ Suppression
◦ Perturbation
◦ Partitioning
◦ Noise addition
“Sanitization” (Anonymisation): hiding individual information (ensuring k-anonymity) while preserving aggregated (sufficient) statistics
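A small sketch of generalization followed by a k-anonymity check. The quasi-identifiers (age, zip), the generalization rules and the records are synthetic assumptions for illustration; this is not how the Amnesia tool is implemented.

```python
from collections import Counter

def generalise_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def is_k_anonymous(rows, quasi_identifiers, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())

raw = [  # synthetic records
    {"age": 11, "zip": "10431", "diagnosis": "JIA"},
    {"age": 14, "zip": "10437", "diagnosis": "CHD"},
    {"age": 12, "zip": "10432", "diagnosis": "JIA"},
    {"age": 17, "zip": "10439", "diagnosis": "NND"},
]
generalised = [
    {"age": generalise_age(r["age"]), "zip": r["zip"][:3] + "**", "diagnosis": r["diagnosis"]}
    for r in raw
]
raw_ok = is_k_anonymous(raw, ["age", "zip"], k=2)
generalised_ok = is_k_anonymous(generalised, ["age", "zip"], k=2)
```

Counts per diagnosis are unchanged by the transformation, which is the sense in which aggregated statistics are preserved.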
17. Secure Data publishing
Amnesia anonymization tool
◦ It offers several versions of k-anonymity
◦ It allows the user to select and customize possible solutions
◦ It offers graphical tools that allow the user to analyze the anonymized dataset
◦ It is scalable and uses all available CPU cores in the anonymization process
Watermarking techniques
18. Privacy Preserving Data Mining
The setting: Data is horizontally distributed at different sites on a Private Data Network (PDN) of mutually distrustful parties
The aim: Compute the data mining algorithm on the data so that nothing but the output is learned
◦ Use secure computation: SMPC, encryption, DP, etc.
◦ Assume semi-honest adversaries that follow the protocol
Makes sense where the participating parties really trust each other (e.g., hospitals)
Training (learning) vs Reasoning: different requirements and privacy related issues
◦ training: needs access to patient records
◦ reasoning: needs only the model and new data subjects but…
Inference from the results: One can break privacy using well specified queries and analyzing the results
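One standard building block for this setting is a secure sum based on additive secret sharing, sketched here under the semi-honest assumption. All parties are simulated in one process; the modulus and the hospital counts are illustrative assumptions.

```python
import secrets

PRIME = 2**61 - 1  # field modulus; assumption: the true sum stays below it

def make_shares(value: int, n_parties: int) -> list:
    """Split value into n random shares that add up to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values: list) -> int:
    n = len(private_values)
    # Party i splits its private value; share j is sent to party j.
    distributed = [make_shares(v, n) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]
    return sum(partial_sums) % PRIME

hospital_counts = [120, 75, 33]  # synthetic per-site counts
total = secure_sum(hospital_counts)
```

No single share or partial sum reveals a site's value, yet the published partial sums add up to the exact total.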
19. Distributed Privacy Preserving Data Mining: EXAREME
Distributed elastic execution
Iterative dataflow execution: Support ML algorithms
Powerful data programming paradigm: SQL with User Defined Functions
Privacy-aware query processing
20. Dataflow Execution Example
Query federation: decompose the query into local and global parts
Query: “count, avg, std”
L (local part): “Σx, Σx², N”
G (global part): “N, avg, std”
Sites 1 … N each hold a table (id, m-name, m-value)
◦ Run local queries at each site: partial aggregated results (m-name, Σx, Σx², N)
◦ Run global queries over the partial results: final result (m-name, N, avg, std)
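The local/global decomposition can be sketched in a few lines: each site returns only (Σx, Σx², N), and the coordinator derives N, avg and std from the merged sums. This is an illustration in Python rather than EXAREME's SQL with UDFs; it assumes the population standard deviation, and the measurement values are synthetic.

```python
import math

def local_query(values):
    """Runs at each site: return (Σx, Σx², N) for one measurement."""
    return (sum(values), sum(v * v for v in values), len(values))

def global_query(partials):
    """Runs at the coordinator: merge partial aggregates into N, avg, std."""
    sum_x = sum(p[0] for p in partials)
    sum_x2 = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    avg = sum_x / n
    variance = max(sum_x2 / n - avg * avg, 0.0)  # population variance, clamped at 0
    return n, avg, math.sqrt(variance)

site_1 = [4.0, 5.0, 6.0]   # synthetic m-values held at site 1
site_2 = [7.0, 8.0]        # synthetic m-values held at site 2
n, avg, std = global_query([local_query(site_1), local_query(site_2)])
```

Individual rows never leave a site; only three numbers per measurement are transferred.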
22. Data Cleaning, Exploration & Analytics
Data curation & profiling, knowledge discovery and statistical simulation framework
◦ Process driven by bottom-up evidence AND top-down models/knowledge
◦ Data profiling, cleaning & exploration: Statistical analysis, advanced visualization, rule based cleaning
◦ Data Mining, pattern discovery and similarity analysis: Well established ML
◦ Statistical simulation: Dependency analysis/reasoning based on Bayesian Nets
23. Data Cleaning, Exploration & Analytics
Data
Query System
Action
Results
Analytics System
Curation System
XYZ System
Abstraction
Analytics (data mining, machine learning, discovery)
Cleaning and curation
Homogenization and integration
Querying and searching
Transformation
Visualization
Zooming
…
25. Data analytics flow to P. Medicine
BOTTOM-UP
• Data Management & Harmonisation: Biomarker based personalized acquisition
• Data Curation & Exploration: Cleaning, profiling & pre-processing; Transformed & validated data
• Data Analysis & Modelling: Knowledge Discovery & Model training; Disease signatures & patient groups; Variables dependencies & prediction models
• Precision Medicine Support: Reasoning, Simulation & DSS; Individualized diagnosis, prognosis & treatment plan
• For a particular patient with unknown / missing data: predict value of missing variable
TOP-DOWN
• Domain knowledge & assumptions
• Clinical workflows
[Figure: Final DAG (based on MCMC&DP, threshold=0.5) over the variables Age, ParCHD, Procedures, ExIntoler, Cyanosis, CPBP, CPArrhy, CPConcl, CPTermRsn, BSA, TPVRegurg, TriRegurg, RVD, RedRV, PSMotion, RestrPatt, AVBlock, SupravArrhy, VentricArrhy; scale from 0 to 1]
27. Data Profiling and Curation
Data profiling:
◦ assesses the actual content, structure and quality of the data
◦ reveals its characteristics, strengths and weaknesses
Types:
◦ Structural: Schema, Type (e.g., numeric or text), Format (e.g., mm/dd/yyyy)
◦ Statistical: distribution, missing values, tails
◦ Logical: rules, constraints
◦ Identity: deduplication / resolution, ref. table matching
◦ Security / privacy Data Profiling: assessing relevance, sensitivity, risk for the individual and practical value
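A minimal sketch of structural and statistical profiling for a single column, assuming values arrive as strings or None. The function name, the report keys and the sample values are hypothetical; the DCV tool itself is not shown in the slides.

```python
import re
import statistics

DATE_MMDDYYYY = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def profile_column(values):
    """Report missing values, inferred type/format and basic statistics."""
    present = [v for v in values if v not in (None, "")]
    report = {"count": len(values), "missing": len(values) - len(present)}
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            break  # first non-numeric value: the column is not numeric
    if present and len(numeric) == len(present):
        report["type"] = "numeric"
        report["mean"] = statistics.mean(numeric)
        report["min"], report["max"] = min(numeric), max(numeric)
    elif present and all(isinstance(v, str) and DATE_MMDDYYYY.match(v) for v in present):
        report["type"] = "date (mm/dd/yyyy)"
    else:
        report["type"] = "text"
    report["distinct"] = len(set(present))
    return report

scores = profile_column(["1", "2", None, "4.5"])          # synthetic values
visits = profile_column(["01/02/2020", "12/31/2019"])    # synthetic values
```

Logical and identity profiling (rules, deduplication) would build on the same per-column reports.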
28. Data Profiling and Curation
DCV: semi-automatic tool
data profiling
data cleaning, validation & transformation
privacy preserving data analysis
interactive and efficient web-based interface
workflow support (rerun experiments, reproduce results)
33. Data Mining & KDD
Disease signatures: latent factors characterizing disease
◦ Patterns over the most relevant disease variables, e.g., biomarkers
◦ Several approaches (probabilistic latent factor analysis, well established ML algorithms)
Predictive analysis: Patient classification or regression for categorization and outcome analysis
Descriptive analysis: clustering algorithms & probabilistic (mixed) membership models
Similarity Analysis: patients “like” me or mine (patient/clinician role)
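A simple sketch of similarity analysis ("patients like me"): features are standardised and the nearest patients are found by Euclidean distance. The feature vectors are synthetic, and the choice of distance is an assumption; the slides do not state which similarity measure is used.

```python
import math

def standardise(rows):
    """Z-score each column so that no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def most_similar(rows, query_index: int, k: int = 2):
    """Return the indices of the k patients closest to the query patient."""
    z = standardise(rows)
    q = z[query_index]
    dists = [(math.dist(q, other), i) for i, other in enumerate(z) if i != query_index]
    return [i for _, i in sorted(dists)[:k]]

# Synthetic patients: [age, body surface area, severity score]
patients = [[10, 1.1, 3.0], [11, 1.2, 3.5], [16, 1.7, 9.0], [17, 1.8, 8.5]]
like_first = most_similar(patients, 0, k=1)
like_third = most_similar(patients, 2, k=1)
```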
36. Classification (categorization)
NEUROLOGICAL AND NEUROMUSCULAR DISEASE (NND) Use-case: Automatic classification of 7 Joint Movement Patterns based on kinematic data.
Training: on specific extracted features or raw gait analysis waveforms (time series)
Cross-Validation: Stratified 10-fold
Method: Random Forests, kNN
Results: Model prediction accuracies > 85%
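Stratified 10-fold cross-validation keeps the class proportions roughly equal in every fold. The fold assignment can be sketched as below; the labels are synthetic (7 classes with 20 samples each), and the classifier training itself is left out.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds: int = 10, seed: int = 0):
    """Assign sample indices to folds so each fold mirrors the class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

labels = [pattern for pattern in range(7) for _ in range(20)]  # synthetic labels
folds = stratified_folds(labels, n_folds=10)
```

Each fold then serves once as the test set while the other nine are used for training.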
37. Classification (outcome prediction)
Aim: To predict early disease outcome in JIA using baseline variables
Analysis: Random Forest algorithm on three datasets
◦ Clinical (acc=0.6)
◦ Clinical with Luminex (acc=0.57)
◦ Clinical with microbiota (acc=0.52)
Conclusion: Difficulty identifying patients who remained active
40. Probabilistic Modeling for statistical simulation
Qualitative dependency analysis: Learning the structure (DAG)
Quantitative analysis: Learning model parameters (Cond. Prob.)
Finding most important dependencies and independencies: e.g., disDur, neutro, pga are almost uncorrelated and excluded
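Once the structure (DAG) is fixed, the quantitative step reduces to estimating one conditional probability table per node. A minimal maximum-likelihood sketch for discrete variables follows; the variable names X and Y and the records are synthetic, and structure learning (MCMC) is not covered.

```python
from collections import Counter, defaultdict

def learn_cpt(records, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from discrete records."""
    joint = Counter()
    parent_totals = Counter()
    for r in records:
        parent_values = tuple(r[p] for p in parents)
        joint[(parent_values, r[child])] += 1
        parent_totals[parent_values] += 1
    cpt = defaultdict(dict)
    for (parent_values, child_value), count in joint.items():
        cpt[parent_values][child_value] = count / parent_totals[parent_values]
    return dict(cpt)

# Synthetic data for a hypothetical edge X -> Y
records = (
    [{"X": "yes", "Y": "yes"}] * 3
    + [{"X": "yes", "Y": "no"}]
    + [{"X": "no", "Y": "no"}] * 4
)
cpt_y = learn_cpt(records, child="Y", parents=["X"])
```

With the tables in place, the value of a missing variable for a particular patient can be predicted from its parents' observed values.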
48. What about domain knowledge??
[Figure: data analytics flow to P. Medicine, with Domain knowledge & assumptions as the top-down input]
49. BIO-KNOWLEDGE ASSOCIATION MAP
Full text PubMed papers and metadata including MESH
NLP & Named Entity Recognition: Annotate publications with bioterms (genes, pdbCodes etc)
◦ Semantic bio terms (PDBCodes, chem2Bio2RDF, LODD)
Multi View Topic Modelling: Identify multi modal topics & quantify associations
Generate specific association maps for different types of entities (e.g., genes, MESH, proteins, drugs)
Example topics (Text | pdbCode):
◦ cancer tumor growth breast lines apoptosis tumors prostate kinase | 1m17 2ity 1qcf
◦ binding dna brca brct cancer res mutations domain results | 1tsr 1ycs 2ac0 1gzh
◦ nef hiv vpr ssb virus felv hck replication ssdna | 1eyg 2nef 1jmc 1m8l 1efn 1izn
50. Mining scientific literature: WHY
Analyze large collections of documents and metadata to:
identify active areas of research: discover hidden themes (topics)
understand what is actually produced: project the output to the reduced topic space (calculate topic distributions per document or other entity, e.g., gene or protein)
create association maps (interaction networks) among different entities (e.g., genes, drugs, diseases, proteins)
• promote target identification: “Pathway expansion” for non-‘druggable’ targets, multi-target drugs, drug repositioning (indication expansion)
identify emerging research areas, e.g., target identification, or the understanding of disease mechanisms: create new therapeutic opportunities
assess coverage, identify gaps or new therapeutic opportunities: compare funded research, patents
51. What is involved…
1 ENRICH & PRE-PROCESS
◦ Extract features and annotate (enrich) content using NLP, Named Entity Recognition & Semantic Annotation
◦ Tokenize, remove stop words
◦ Refine stop words for specific domain
2 FIND TOPICS
◦ Identify topics: distribution over words & “side” information
◦ Automatic topic curation & entitling
◦ Assign topics to publications
◦ Evaluate & categorize topics
◦ Assess topic labels
3 CALCULATE TRENDS & SIMILARITIES
◦ Calculate topic proportions & trends of objects based on their publications
◦ Calculate similarity among different entities based on various metrics
◦ Analyze & Validate the results
4 VISUALIZE
◦ Create WEB interactive visualization with data driven graphs, charts and layouts
◦ Design optimal views
◦ Validate modeling results
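Step 3 can be sketched as follows: an entity's topic profile is the mean of the topic distributions of its publications, and entities are compared with cosine similarity. The entity names and topic proportions are synthetic, and cosine is only one of the "various metrics" the slide mentions.

```python
import math

def mean_topic_proportions(doc_topics):
    """Average the per-document topic distributions of one entity."""
    n_topics = len(doc_topics[0])
    return [sum(d[t] for d in doc_topics) / len(doc_topics) for t in range(n_topics)]

def cosine(u, v):
    """Cosine similarity between two topic profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Synthetic topic distributions (3 topics) of the publications mentioning each entity
entity_docs = {
    "gene_A": [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    "drug_B": [[0.7, 0.2, 0.1]],
    "gene_C": [[0.1, 0.1, 0.8]],
}
profiles = {name: mean_topic_proportions(docs) for name, docs in entity_docs.items()}
sim_ab = cosine(profiles["gene_A"], profiles["drug_B"])
sim_ac = cosine(profiles["gene_A"], profiles["gene_C"])
```

Pairwise similarities of this kind would supply the links of a topic-based association map.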
52. Methodology
Probabilistic Multi-View Topic Modeling of Text-Augmented Heterogeneous Information Networks
interconnected (linked) entities which are characterized by TEXT and related side information & links (e.g., taxonomies, venues, projects / research areas, citations, authors)
side-information:
structured or unstructured attributes and metadata
links / relations: e.g., authorship network, citation network
Incomplete, noisy or not related to textual attributes
53. Multi-View Topic Modeling
Expert: What is this Topic about?? Diagnostics and treatment development: Gene therapy & genetic vectors
Text: gene, cells, expression, vector, aav, vectors, dna, therapy, figure, cell, target, gfp, targeting, delivery, diseases
Phrases: gene therapy, gene transfer, aav vectors, lentiviral vectors
Grants: PERSIST: Persisting Transgenesis; AAVEYE: GENE THERAPY FOR INHERITED SEVERE PHOTORECEPTOR DISEASES
MESH Descriptors: Genetic Vectors; Lentivirus; Genetic Therapy; Dependovirus; Green Fluorescent Proteins
Journals: Molecular therapy
Research Areas: Biotechnology, generic tools and medical technologies for human health
54. Multi-View Topic Modeling
Infectious diseases: HIV and NEF protein
Text: hiv, cells, cell, nef, viral, virus, bst, gag, infected, drug, vpu, gfp, assembly, surface, cellular
Phrases: gfp cells, hela cells, infected cells, plasma membrane
Grants: HIV ACE: Targeting assembly of infectious HIV particles; INEF: Inhibiting Nef: a novel drug target for HIV-host interactions
MESH Descriptors: HIV-1; Antigens, CD; Cell Membrane; Membrane Glycoproteins
Journals: plos pathogens
Research Areas: HEALTH-2007-2 [Translating research for human health]
PDB codes: 2NEF, 1M8L, 1EFN
55. Similarity & Graph clustering
Topics & allocations Modelling
NODES may represent drugs, PDBCodes, genes or MeSH terms
◦ Size: ~ # of publications
LINKS represent topic based similarity
Categories may represent Anatomical Therapeutic Chemical (ATC) class, Biological Process, MeSH hierarchy etc
56. e-Infrastructures & data repositories
Use Domain Knowledge to
• Enhance Patient Similarity Analysis
• Promote Decision Support
Clinical data clouds
• Analyze clinical data to
validate findings
Editor's Notes
Probably this should be analyzed on the other section
analyze the content, structure, and relationships within data to uncover patterns and rules, inconsistencies, anomalies, and redundancies and automate curation process using a variety of advanced data cleaning methods
a histogram of variable JADAS-71 (shows outliers with high values)
a plot of JADAS-71 (Juvenile arthritis disease activity score, on 71 joints) against CHAQ-score (Childhood Health Assessment Questionnaire), both indications of disease severity (outliers in red).
a line graph between weight and height (showing their obvious correlation).
The graphs are interactive.
DCV Data Cleaning Rule on JIA . Discrepancy found between the two variables that represent the outcome after 6 months.
There is one violation (see red above) in the mapping between columns Outcome[29] and Outcome dichotomised[30] – we want to utilise column 30 for now.
The correct mapping between these variables is:
clinical inactive disease —> 1
persistent activity —> 0
disease flare —> 0
Showing discretisation of Microbiota variables in quartiles.
Biomarker: “a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention”
Thus, biomarkers refer to single measurements able to improve differential diagnosis, track disease progression and measure treatment efficiency,
whereas disease signatures involve multiple (multi or single modal) measurements that form a specific pattern