Protein-Protein Interaction using SVM based kernel, Jaccard Coefficient and Gene Ontology

SVM based approach to assess the reliability of protein-protein interactions

Meher Preethi Boorgula, Ronak Shah, Neerja Katiyar

Motivation:

Protein interactions play a key role in many cellular processes.
Distortion of protein interfaces may lead to the development of many diseases.
Reliable protein-protein interactions (PPIs) that are conserved among different species and involved in diseases would be very helpful for researchers.

Problem Statement:

Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins, so it is important that the PPI data is reliable.
Thus, we try to predict the reliability of PPIs with respect to a disease-causing bacterium.

Objective:

To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.
To classify the interactions as reliable or not reliable.

Introduction:

Protein-protein interactions can be identified with the help of high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).
The main disadvantage of these existing techniques is the number of false positives in the data obtained.
So, assessing the reliability of PPIs is necessary.

Methodology:

Preparation of data sets
Extract the attributes
Create & test model using SVM Light
Evaluate the performance of the model
Analyze the reliability of PPI data sets

Datasets:

Raw interaction data was obtained from Y2H experiments performed at the J. Craig Venter Institute.
This data was then organized into train and test sets containing equal numbers of positive and negative examples.
Positive – High Confidence data
Negative – Low Confidence data
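A minimal sketch of how a balanced train/test split could be assembled, assuming the high- and low-confidence interactions are available as lists of (bait, prey) pairs. The function name `balanced_split`, the protein identifiers, and the half/half split are illustrative assumptions, not the procedure actually used for these slides.

```python
import random


def balanced_split(positives, negatives, seed=0):
    """Downsample the larger class, then split each class in half.

    Returns (train, test), each a list of ((bait, prey), label) with
    label +1 for high-confidence and -1 for low-confidence pairs.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n)  # random subset without replacement
    neg = rng.sample(negatives, n)
    half = n // 2
    train = [(p, +1) for p in pos[:half]] + [(q, -1) for q in neg[:half]]
    test = [(p, +1) for p in pos[half:]] + [(q, -1) for q in neg[half:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical identifiers, for illustration only.
high_conf = [(f"bait{i}", f"prey{i}") for i in range(4)]
low_conf = [(f"bait{i}", f"prey{i + 100}") for i in range(10)]
train, test = balanced_split(high_conf, low_conf)
print(len(train), len(test))
```

The labels +1 and -1 are chosen because that is the target encoding SVM Light expects for classification.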

Dataset (Contd…)

All Interactions = 2993
High Confidence = 721
Common Interactions = 66
Total (excluding common) = 3648
Train & Test datasets were made by taking 1824 interactions.

Extracting Attributes:

Attributes chosen include:
- Sequence based:
  i. Occurrence of 5-mers in the sequence data
  ii. Hydrophobicity
- Non-sequence based:
  i. Jaccard coefficient
  ii. GO Annotation

Hydrophobicity:

Protein interaction depends on the nature of the active/binding site.
A hydrophobicity profile was used to extract this feature.
Average hydropathy was calculated for a sequence based on the hydrophobicity of each amino acid residue.
This was obtained using the tool “Protein GRAVY”.
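The average hydropathy (GRAVY) of a sequence is the mean of the per-residue hydropathy values. The sketch below uses the Kyte-Doolittle scale, on which GRAVY is conventionally defined; the function name `gravy` and the example sequence are illustrative, and how the bait and prey values were combined into a pair feature is not stated in the slides.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy over all standard residues in the sequence.

    Non-standard characters (e.g. X or gaps) are skipped, which is an
    assumption of this sketch.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


# Hypothetical peptide; positive values indicate a more hydrophobic sequence.
print(round(gravy("MKTAYIAKQR"), 3))
```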

Jaccard coefficient:

In a PPI network, the neighbors of interacting proteins also tend to interact.
Jaccard coefficient:
|N(u) ∩ N(v)| / |N(u) ∪ N(v)|
where u, v are the interacting proteins
N(X) = set of neighbors of protein X in the PPI network
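As a rough illustration, the Jaccard coefficient of a protein pair is the size of the intersection of their neighbor sets divided by the size of the union. The sketch assumes the network is given as a list of undirected (protein, protein) edges; the names `build_neighbors` and `jaccard` and the toy network are hypothetical. Whether u and v themselves are excluded from each other's neighbor sets is a design choice the slides do not specify; here they are left in.

```python
from collections import defaultdict


def build_neighbors(interactions):
    """Map each protein to the set of proteins it interacts with."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def jaccard(neighbors, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 when both sets are empty."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Toy network: N(A) = {B, C}, N(B) = {A, C, D}.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
network = build_neighbors(edges)
print(jaccard(network, "A", "B"))  # shared {C} over union {A, B, C, D}
```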

GO Annotations:

Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.
This was captured by extracting identical GO IDs for the interacting proteins.
An interacting protein pair with at least one common GO ID was considered reliable.
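A small sketch of the shared-GO-ID check, assuming each protein's annotations are available as a collection of GO ID strings. The protein names and their annotation assignments are made up for illustration, and `go_feature` is a hypothetical helper name.

```python
def shared_go_ids(go_a, go_b):
    """GO IDs annotated to both proteins."""
    return set(go_a) & set(go_b)


def go_feature(annotations, bait, prey):
    """1 if the pair shares at least one GO ID, else 0.

    Proteins with no annotation get an empty set, so a pair involving
    an unannotated protein always scores 0.
    """
    common = shared_go_ids(annotations.get(bait, ()), annotations.get(prey, ()))
    return 1 if common else 0


# Hypothetical annotations, for illustration only.
annotations = {
    "proteinA": {"GO:0005737", "GO:0006412"},
    "proteinB": {"GO:0006412", "GO:0005524"},
    "proteinC": {"GO:0005524"},
}
print(go_feature(annotations, "proteinA", "proteinB"))
print(go_feature(annotations, "proteinA", "proteinC"))
```

Returning 0 for unannotated proteins means a sparsely annotated genome will produce mostly zeros for this feature.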

Occurrence of 5-mers

The spectrum kernel models a sequence in the space of all k-mers (here, 5-mers).
All possible 5-mers in the protein sequences were obtained for the data.
The number of times each 5-mer appears in the sequence data was extracted for both bait and prey proteins.
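The 5-mer counting step can be sketched as a sliding window over each sequence, with the spectrum kernel of two sequences being the dot product of their k-mer count vectors. The function names and example sequences are illustrative; how bait and prey counts were combined into a single pair representation is not specified in the slides.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=5):
    """Dot product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers missing from counts_b.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Hypothetical bait and prey fragments sharing the 5-mer "KTAYI".
bait = "MKTAYIAK"
prey = "GGKTAYIG"
print(kmer_counts(bait))
print(spectrum_kernel(bait, prey))
```

Sequences shorter than k simply yield an empty count vector and a kernel value of 0.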

Creating & Testing Model:

SVM Light was used to create classification models based on linear & sigmoid kernels.
Test data was applied to the model in order to classify it.
The performance of the model was evaluated based on accuracy, precision and recall values.
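SVM Light reads one example per line in the form `<target> <index>:<value> ...`, with 1-based feature indices in increasing order. The helper below is a hedged sketch of writing such lines; the function name `to_svmlight_line` and the feature numbering (1 = hydrophobicity, 2 = Jaccard coefficient) are assumptions made for illustration.

```python
def to_svmlight_line(label, features):
    """Format one example for SVM Light.

    label: +1 (reliable) or -1 (not reliable).
    features: dict mapping 1-based feature index to a numeric value.
    Zero-valued features are omitted, since the format is sparse.
    """
    parts = [f"{label:+d}"]
    for index in sorted(features):  # indices must be in increasing order
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Assumed numbering: 1 = hydrophobicity, 2 = Jaccard coefficient.
print(to_svmlight_line(+1, {1: -0.4, 2: 0.25}))
print(to_svmlight_line(-1, {2: 0.0, 1: 1.2}))
```

With files written this way, training and classification would typically look like `svm_learn -t 0 train.dat model` for a linear kernel (`-t 3` for the sigmoid kernel) followed by `svm_classify test.dat model predictions`; the file names here are placeholders.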

Experiments Performed:

1) Model generated using the attribute Hydrophobicity.
2) Model generated using the attribute JC (Jaccard coefficient).
3) Model generated using both of these attributes.
4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).

Results for Linear Kernel:

                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)   -       -       -       -
Recall (%)      0.0     0.0     0.0     0.0
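One plausible reading of the table above is that the linear models labelled nearly every test interaction as negative: recall is then 0, precision is undefined because nothing is predicted positive, and accuracy simply reflects the share of negatives in the test set. The sketch below illustrates this with made-up labels; `evaluate` is a hypothetical helper, not part of SVM Light.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, precision, recall) in percent.

    Precision or recall is None when its denominator is zero.
    Labels are +1 (reliable) and -1 (not reliable).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


# Made-up test set: 2 positives, 8 negatives, everything predicted negative.
truth = [1] * 2 + [-1] * 8
predictions = [-1] * 10
print(evaluate(truth, predictions))
```

On an imbalanced test set, a high accuracy alone therefore says little about whether the model has learned to recognize reliable interactions.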

Results for Sigmoid Kernel:

                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    -       -       79.88   57.26
Precision (%)   -       -       0.0     57.80
Recall (%)      -       -       0.0     45.79

Observation:

The results obtained were not reliable, as the model was built using only two attributes.
Two attributes alone would not be sufficient to discriminate between the positive & negative examples.
Also, it was observed that the positive examples had no significant effect on the model while it was being created.

To Be done:

Extract the attribute “occurrence of 5-mers” for the protein pairs and perform all the experiments.
Obtain data from the IntAct database to increase the number of positive examples and to reduce the number of false positives in the data, since it comes from Y2H experiments.
Compare the performance with the existing model based on “Logistic Regression”.

Problems:

The major problem in extracting attributes that depend on annotation was that Treponema is not fully annotated.
The interaction data for Treponema is also not reliable.

Future Work:

We would like to apply this model to Streptococcus pneumoniae.
Using PSSM scores obtained by running PSI-BLAST would be helpful.
Analyze the biological relevance of our predictions and then experimentally test the interactions predicted to be reliable by the model.

References:

Dr. Peter Uetz et al. (J. Craig Venter Institute)
Asa Ben-Hur & William Stafford Noble, Kernel methods for predicting protein–protein interactions
SVM Light: http://svmlight.joachims.org/
Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html
PIR: http://pir.georgetown.edu/
