Protein-protein interactions discovered by existing high-throughput techniques contain a very high number of false positives. Here we present an SVM-based approach that builds a model from sequence-based and non-sequence-based information about the interacting proteins. This model is used to assess the reliability of given protein-protein interactions. It was run on interaction data for a pathogenic bacterium, Treponema pallidum (which causes syphilis in humans), obtained from yeast two-hybrid experiments. Various kernels were used to build the model; of these, the sigmoid kernel performed best when all features were combined, with an area under the receiver operating characteristic (ROC) curve of 0.53.
Protein-Protein Interaction using SVM based kernel, Jaccard Coefficient and Gene Ontology
1. SVM based approach to assess the reliability of protein-protein interactions
Meher Preethi Boorgula, Ronak Shah, Neerja Katiyar
George Mason University
2. Motivation:
Protein interactions play a key role in many cellular processes.
Distortion of protein interfaces may lead to the development of many diseases.
Reliable protein-protein interactions (PPIs) that are conserved among different species and involved in diseases would be very helpful for researchers.
3. Problem Statement:
Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins, so it is important that the PPI data are reliable.
Thus, we try to predict the reliability of PPIs for a disease-causing bacterium.
4. Objective:
To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.
To classify the interactions as reliable or not reliable.
5. Introduction:
Protein-protein interactions can be identified with high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).
The main disadvantage of these existing techniques is the number of false positives in the data obtained.
So, assessing the reliability of PPIs is necessary.
6. Methodology:
Prepare the data sets
Extract the attributes
Create & test the model using SVM Light
Evaluate the performance of the model
Analyze the reliability of PPI data sets
7. Datasets:
Raw interaction data were obtained from Y2H experiments performed at the J. Craig Venter Institute.
The data were then organized into training and test sets containing equal numbers of positive and negative examples.
Positive – High Confidence data
Negative – Low Confidence data
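One way to build training and test sets with equal numbers of positive and negative examples is sketched below. The slides do not say how the balancing was done, so the random downsampling of the larger class, the 50/50 train/test split, and the function name `balanced_split` are all assumptions made for illustration.

```python
import random


def balanced_split(positives, negatives, seed=0):
    """Return (train, test) lists of (interaction, label) with equal class sizes.

    Assumption: the larger class is randomly downsampled and each class is
    split in half. Labels are +1 (high confidence) and -1 (low confidence).
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n)
    neg = rng.sample(negatives, n)
    half = n // 2
    train = [(x, +1) for x in pos[:half]] + [(x, -1) for x in neg[:half]]
    test = [(x, +1) for x in pos[half:]] + [(x, -1) for x in neg[half:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Made-up interaction identifiers, for illustration only.
high_conf = [("bait%d" % i, "prey%d" % i) for i in range(10)]
low_conf = [("bait%d" % i, "prey%d" % (i + 100)) for i in range(30)]
train, test = balanced_split(high_conf, low_conf)
```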
8. Dataset (Contd…)
All Interactions = 2993
High Confidence = 721
Common Interactions = 66
Total (excluding common) = 3648
Train & Test datasets were made by taking 1824 interactions.
9. Extracting Attributes:
Attributes chosen include:
- Sequence based:
i. Occurrence of 5-mers in the sequence data
ii. Hydrophobicity
- Non-sequence based:
i. Jaccard coefficient
ii. GO Annotation
10. Hydrophobicity:
Protein interaction depends on the nature of the active/binding site.
A hydrophobicity profile was used to capture this feature.
Average hydropathy was calculated for a sequence based on the hydrophobicity of each amino acid residue.
This was obtained using the tool “Protein GRAVY”.
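The average hydropathy (GRAVY) of a sequence is commonly computed as the mean of per-residue values on the Kyte-Doolittle scale. The sketch below follows that definition; it is not the code of the tool used in this work, and skipping non-standard residue letters is an assumption.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over the residues of a protein sequence.

    Assumption: letters outside the 20 standard amino acids are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)
```

Positive values indicate a more hydrophobic sequence overall and negative values a more hydrophilic one.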
11. Jaccard coefficient:
In a PPI network, the neighbors of interacting proteins also tend to interact.
Jaccard coefficient:
JC(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
where u, v are the interacting proteins
N(X) = set of neighbors of protein X in the PPI network
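The Jaccard coefficient of two proteins is the size of the intersection of their neighbor sets divided by the size of the union. A minimal sketch, assuming the network is given as a list of undirected (u, v) pairs; the function names and the toy protein names are hypothetical.

```python
from collections import defaultdict


def build_neighbors(interactions):
    """Map each protein to the set of its neighbors in an undirected network."""
    neighbors = defaultdict(set)
    for u, v in interactions:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def jaccard(neighbors, u, v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|, or 0.0 if both neighbor sets are empty."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Toy network: A and B interact and share one neighbor, C.
toy = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
nbrs = build_neighbors(toy)
jc_ab = jaccard(nbrs, "A", "B")  # N(A)={B,C}, N(B)={A,C,D} -> 1/4
```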
12. GO Annotations:
Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.
This was captured by extracting identical GO IDs for the interacting proteins.
Interacting proteins with at least one common GO ID were considered reliable.
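The shared-GO-ID check can be sketched as a binary feature. The annotation dictionary, the protein names, and the 0/1 encoding are assumptions for illustration; the slides do not specify how annotations were stored or encoded.

```python
def shares_go_id(go_annotations, protein_a, protein_b):
    """Return 1 if the two proteins have at least one GO ID in common, else 0.

    Proteins missing from the annotation table are treated as having no GO IDs
    (an assumption; relevant because the organism is not fully annotated).
    """
    ids_a = set(go_annotations.get(protein_a, ()))
    ids_b = set(go_annotations.get(protein_b, ()))
    return 1 if ids_a & ids_b else 0


# Hypothetical annotation table with made-up protein names.
annotations = {
    "protA": ["GO:0005737", "GO:0006412"],
    "protB": ["GO:0005737"],
    "protC": ["GO:0003677"],
}
```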
13. Occurrence of 5-mers
The spectrum kernel models a sequence in the space of all k-mers (here, 5-mers).
All possible 5-mers in the protein sequences were obtained for the data.
The number of times each 5-mer appears in the sequence data for both bait and prey proteins was extracted.
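Counting 5-mers amounts to sliding a window of length 5 along each sequence. The sketch below counts overlapping k-mers and computes a spectrum-kernel value as the dot product of two count vectors; the function names are hypothetical, and how bait and prey counts were combined into one feature vector is not stated in the slides.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a sequence (empty if shorter than k)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=5):
    """Dot product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())
```

A sequence of length n contributes n - 4 overlapping 5-mers, so short proteins yield sparse count vectors.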
14. Creating & Testing Model:
SVM Light was used to create classification models based on linear & sigmoid kernels.
Test data were applied to the model in order to classify them.
The performance of the model was evaluated based on accuracy, precision and recall values.
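The three measures can be computed from the true and predicted labels as sketched below. The +1/-1 label convention and the choice to return None for an undefined ratio are assumptions. Note that a model that predicts every example as negative has zero recall and an undefined precision, while its accuracy equals the fraction of negative examples.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, precision, recall) in percent.

    Precision is None when nothing is predicted positive; recall is None
    when there are no positive examples.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


# A classifier that labels everything negative on a 1:4 positive:negative set.
all_negative = evaluate([1, -1, -1, -1, -1], [-1, -1, -1, -1, -1])
```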
15. Experiments Performed:
1) Model generated using the attribute Hydrophobicity.
2) Model generated using the attribute Jaccard coefficient (JC).
3) Model generated using both of these attributes.
4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).
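Each experiment feeds a different subset of attributes to SVM Light, whose input files hold one example per line in the form `<target> <index>:<value> ...` with feature indices in increasing order. The sketch below writes such lines; the feature numbering (1 and 2 for bait and prey hydrophobicity, 3 for JC) and the helper names are assumptions, not the layout used in this work.

```python
def svmlight_line(label, features):
    """Format one example: label is +1 or -1, features maps index -> value."""
    parts = [
        f"{index}:{value:g}"
        for index, value in sorted(features.items())
        if value != 0  # zero-valued features can be omitted in the sparse format
    ]
    return " ".join([f"{label:+d}"] + parts)


def select_features(features, keep):
    """Keep only the feature indices used in a given experiment."""
    return {i: v for i, v in features.items() if i in keep}


# Hypothetical example: 1 = bait GRAVY, 2 = prey GRAVY, 3 = Jaccard coefficient.
example = {1: -0.4, 2: 0.25, 3: 0.1}
exp1_line = svmlight_line(+1, select_features(example, {1, 2}))  # hydrophobicity only
exp3_line = svmlight_line(+1, example)                           # both attributes
```

With files in this format, training would typically be run as `svm_learn -t 0 train.dat model` for a linear kernel or with `-t 3` for the sigmoid kernel, followed by `svm_classify test.dat model predictions`; the file names here are placeholders.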
16. Results for Linear Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)   -       -       -       -
Recall (%)      0.0     0.0     0.0     0.0
17. Results for Sigmoid Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    -       -       79.88   57.26
Precision (%)   -       -       0.0     57.80
Recall (%)      -       -       0.0     45.79
18. Observation:
The results obtained were not reliable, as the model was built using only two attributes.
Two attributes alone are not sufficient to discriminate between the positive & negative examples.
Also, it was observed that the positive examples had no significant effect on the model when it was created.
19. To Be done:
Extract the attribute “occurrence of 5-mers” for the protein pairs and perform all the experiments.
Obtain data from the IntAct database to increase the number of positive examples and to reduce the effect of false positives in the data, since they come from Y2H experiments.
Compare the performance with the existing model based on “Logistic Regression”.
20. Problems:
The major problem in extracting attributes that depend on annotation was that Treponema is not fully annotated.
The interaction data for Treponema are also not reliable.
21. Future Work:
We would like to apply this model to Streptococcus pneumoniae.
Using PSSM scores obtained by running PSI-BLAST would be helpful.
Analyze the biological relevance of our predictions and then experimentally test the interactions predicted to be reliable by the model.
22. References:
Dr. Peter Uetz et al. (J. Craig Venter Institute)
Asa Ben-Hur & William Stafford Noble, Kernel methods for predicting protein–protein interactions
SVM Light: http://svmlight.joachims.org/
Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html
PIR: http://pir.georgetown.edu/