Bioinformatics dogma asserts that all-atom representations, capable of encoding details such as disulfide bridging and post-translationally modified amino acids, are too unwieldy to be of practical use. In this presentation, we show how recent advances in computer power, software algorithms and storage technology require us to question this precept. We show how InChI, InChI keys and canonical SMILES can be generated for the largest known proteins, and even for nucleic acid sequences as large as viral and prokaryotic genomes. Indeed, unique identifiers derived from all-atom nucleic acid representations can capture epigenetic methylation information and circular DNA, feats that are impossible with the one-letter codes used by bioinformaticians. These unique identifiers allow mature antibodies to be linked to the unique identifiers of the plasmids used to express them. Finally, we discuss the possibility of polymer-specific implementations/optimizations of standard InChI, by showing how InChIs and InChI keys may be generated efficiently for specific classes of polymer with over a million atoms.
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers
1. Generating canonical identifiers
for (glycoproteins and other
chemically modified) biopolymers
Roger Sayle, John May & Noel O’Boyle
NextMove Software, Cambridge, UK
250th ACS National Meeting, Boston, MA. Sunday 16th August 2015
2. Motivation
• Non-standard peptides, post-translationally modified
proteins and drug-antibody conjugates are becoming
increasingly relevant to the life sciences.
• Registration of biologics, beyond the FASTA sequence,
is considered desirable but technically challenging.
• In this talk, I discuss complementary approaches to
biologics registration; one based upon expressive all-
atom representations, another on tracking deltas to a
reference database of protein sequences.
3. Real world small scale example
• Many research reagents contain “hybrid molecules”
– Innovagen SP-5125 lauroyl-apelin-13
• dodecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5126 myristoyl-apelin-13
• tetradecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5124 palmitoyl-apelin-13
• hexadecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5127 stearoyl-apelin-13
• octadecanoyl-QRPRLSHKGPMPF
4. The cutting edge of biosimilarity
• The high prevalence of potentially life-threatening
hypersensitivity reactions to the antibody cetuximab
(Erbitux) in some US states has been traced to its
glycosylation [containing a Gal(a1-3)Gal epitope].
Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for
galactose-alpha-1,3-galactose”, New England Journal of Medicine,
Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
• Similarly, Human Erythropoietin (EPO) alpha, beta,
delta and omega share the same primary sequence,
but differ in their glycosylation patterns.
5. Monomer dictionaries don’t scale
• Systems based upon monomer dictionaries (such as
HELM and PDB) are notoriously difficult to maintain.
• The limited number of monomers in proteinogenic
peptides and natural nucleic acid sequences leads to
a false sense of security; that monomers are finite.
• In practice, the number of monomers, post-
translational and chemical modifications is infinite.
• Even more difficult than standardizing monomer
definitions via a central repository, like PDB, is
allowing local custom definitions.
9. The current situation
• Pistoia HELM can’t [yet] handle/canonicalize
glycans and oligosaccharides.
– It can’t uniquely canonicalize Fmoc-Ala-OH
(between Pistoia and ChEMBL monomer sets).
• IUPAC InChI can’t [yet] officially handle more
than 1024 atoms.
• Folks working on glycoproteins are screwed…
(or use expensive commercial software)
10. Constructive suggestion…
• Ideally, a chemical identifier should be independent
of the input representation or file format.
• Equivalence between small molecules, peptides and
proteins is best determined by a single identifier,
preferably the existing standard InChI.
• This is possible as increases in computer power and
storage mean that cheminformatics toolkits can
handle huge biopolymers on modern hardware.
11. Three recent experiments
1. Is it possible to generate standard InChI for
extremely large molecules (polymers)?
2. How well do all-atom canonicalization
algorithms scale and can they be improved?
3. Are there alternative canonical identifiers
that can be useful in bioinformatics and
precision medicine?
13. Special classes of molecules
• Alkanes
– InChI=1S/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3
– InChI=1S/C8H18/c1-3-5-7-8-6-4-2/h3-8H2,1-2H3
– InChI=1S/C10H22/c1-3-5-7-9-10-8-6-4-2/h3-10H2,1-2H3
– 1 million carbons, InChI is 6,889,942 bytes (~6.9Mbytes)
– 1 billion carbons, InChI is 9,888,888,954 bytes (~9.9Gbytes)
• Polyalanine
– InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
– 1 thousand L-alanines, InChI is 42,965 bytes (~43Kbytes)
– 1 million L-alanines, InChI is 66,888,995 bytes (~66.9Mbytes)
• Theoretically one could write an efficient fasta2inchi
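The regularity of the alkane InChIs above suggests how a class-specific generator could work. The following is a minimal sketch, assuming the atom-numbering pattern visible in the three examples (odd numbers ascending, then even numbers descending) continues for longer chains; the function name alkane_inchi is hypothetical and this is not the InChI library's own algorithm.

```python
def alkane_inchi(n: int) -> str:
    """Build the standard InChI string of the unbranched alkane with n carbons.

    Assumption: the canonical numbering follows the pattern seen in the
    hexane, octane and decane examples. Only n >= 4 is covered here.
    """
    if n < 4:
        raise ValueError("this sketch only covers chains of 4 or more carbons")
    odds = range(1, n + 1, 2)                          # 1, 3, 5, ...
    evens = range(n if n % 2 == 0 else n - 1, 0, -2)   # ..., 6, 4, 2
    chain = "-".join(str(i) for i in (*odds, *evens))  # connection layer
    # Atoms 1 and 2 are the terminal CH3 groups; atoms 3..n are CH2.
    return f"InChI=1S/C{n}H{2 * n + 2}/c{chain}/h3-{n}H2,1-2H3"


if __name__ == "__main__":
    for carbons in (6, 8, 10):
        print(alkane_inchi(carbons))
```

Because the string is written directly from the chain length, no graph canonicalization is needed, and both run time and output size grow roughly linearly with the number of atoms.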
14. Algorithm scaling to 100AA
15. Algorithm scaling to 1000AA
16. Algorithm scaling to 5000AA
17. Algorithm scaling to maximum
18. Spread of algorithm run-times
19. Peptide names (ChEMBL)
• The following names are machine generated
• [15-L-arginine]nociceptin CHEMBL526333
• [2-4-chloro-L-phenylalanine]neuropeptide S [human] CHEMBL441576
• [1-L-threonine]cyclosporin A CHEMBL2370014
• [6-L-tryptophan]sermorelin free acid CHEMBL440438
• angiotensin II (3-8) CHEMBL261120
• nociceptin amide CHEMBL389521
• acetyl-alpha-MSH (4-10) amide CHEMBL410411
• [2-L-cysteine,13-L-cysteine]neurotensin disulfide CHEMBL3278512
• myristoyl-[1-L-lysine,4-L-tryptophan]tetrapandin 2 amide CHEMBL3288219
• [2-(4RS)-thiazolidine-4-carboxylic acid,4-L-proline]endomorphin-2 CHEMBL126611
• [22-L-serine]kalata B1 CHEMBL1801140
20. Scaling-up protein variant naming
• The algorithm described for naming peptides can
also be applied to naming arbitrary protein variants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
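Naming a variant as a set of deltas to a reference can be illustrated on these 11 peptides. This is a brute-force sketch only: it assumes equal-length sequences and point substitutions, writes each delta in the conventional reference-residue, position, new-residue form, and breaks ties alphabetically by name (an arbitrary choice). The function names are hypothetical, and the talk's own implementation is not shown in the slides.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin",   "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin",     "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin",      "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin",     "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}


def substitutions(reference, query):
    """Return (reference residue, 1-based position, query residue) per mismatch."""
    return [(r, i, q)
            for i, (r, q) in enumerate(zip(reference, query), start=1)
            if r != q]


def name_variant(query, database=PEPTIDES):
    """Name `query` as the closest database entry plus its substitutions."""
    if query in database:                      # exact match: plain name
        return database[query]
    candidates = [(seq, name) for seq, name in database.items()
                  if len(seq) == len(query)]
    if not candidates:
        raise ValueError("no reference sequence of matching length")
    # Fewest substitutions wins; ties are broken by name (assumption).
    seq, name = min(candidates,
                    key=lambda c: (len(substitutions(c[0], query)), c[1]))
    deltas = ",".join(f"{r}{i}{q}" for r, i, q in substitutions(seq, query))
    return f"[{deltas}]{name}"


if __name__ == "__main__":
    print(name_variant("CYFQNCPRA"))   # one substitution from argipressin
    print(name_variant("CAIQNCPLA"))   # two substitutions from oxytocin
```

A linear scan like this costs one comparison per database entry, which is why a shared data structure becomes attractive once the reference set grows to whole protein databases.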
21. DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
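One common way to obtain such a graph is to build a prefix tree and then merge states whose remaining suffixes are identical. The sketch below does exactly that for the 11 peptides; it is a textbook construction, not necessarily the one used in the talk, so its state count depends on the counting convention and need not match the figure quoted above. All function names are hypothetical.

```python
PEPTIDES = [
    "CFFQNCPRG", "CFVRNCPTG", "CFWTSCPIG", "CYFQNCPRG", "CYFQNCPKG",
    "CYFRNCPIG", "CYIQNCPLG", "CYIQNCPPG", "CYIQNCPRG", "CYIQSCPIG",
    "CYISNCPIG",
]


def new_state():
    return {"final": False, "next": {}}


def build_trie(words):
    """Prefix tree: sequences share states only along common prefixes."""
    root = new_state()
    for word in words:
        state = root
        for residue in word:
            state = state["next"].setdefault(residue, new_state())
        state["final"] = True
    return root


def minimize(state, registry):
    """Merge states bottom-up when they accept the same set of suffixes."""
    children = {r: minimize(c, registry) for r, c in state["next"].items()}
    # Merged children are kept alive in `registry`, so id() is a stable key.
    signature = (state["final"],
                 tuple(sorted((r, id(c)) for r, c in children.items())))
    if signature not in registry:
        registry[signature] = {"final": state["final"], "next": children}
    return registry[signature]


def count_states(root):
    """Count distinct states reachable from root (root included)."""
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state["next"].values())
    return len(seen)


def accepts(root, word):
    """Follow one transition per residue; accept only at a final state."""
    state = root
    for residue in word:
        state = state["next"].get(residue)
        if state is None:
            return False
    return state["final"]


if __name__ == "__main__":
    trie = build_trie(PEPTIDES)
    dag = minimize(trie, {})
    print(count_states(trie), "trie states;", count_states(dag), "DAG states")
```

Lookup cost depends only on the length of the query sequence, not on the number of sequences stored, which is what makes this representation attractive for large reference databases.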
22. Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which together contain
over 192M amino acids, require 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
23. Application to precision medicine
• A more realistic example is the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277, which can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
24. Conclusions
• “InChI for large molecules” can be achieved,
and remain compatible with small molecule
InChI identifiers, through the evolution of ever
better canonicalization algorithms.
• Journal reviewers who claim that the run-time
of canonicalization algorithms is a non-issue,
and not an area ripe for improvement are…
very mistaken.