Generating canonical identifiers
for (glycoproteins and other
chemically modified) biopolymers
Roger Sayle, John May & Noel O’Boyle
NextMove Software, Cambridge, UK
250th ACS National Meeting, Boston, MA. Sunday 16th August 2015
Motivation
• Non-standard peptides, post-translationally modified
proteins and drug-antibody conjugates are becoming
increasingly relevant to the life sciences.
• Registration of biologics, beyond the FASTA sequence,
is considered desirable but technically challenging.
• In this talk, I discuss complementary approaches to
biologics registration: one based upon expressive
all-atom representations, another on tracking deltas to a
reference database of protein sequences.
Real-world small-scale example
• Many research reagents contain “hybrid molecules”
– Innovagen SP-5125 lauroyl-apelin-13
• dodecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5126 myristoyl-apelin-13
• tetradecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5124 palmitoyl-apelin-13
• hexadecanoyl-QRPRLSHKGPMPF
– Innovagen SP-5127 stearoyl-apelin-13
• octadecanoyl-QRPRLSHKGPMPF
The cutting edge of biosimilarity
• The high prevalence of potentially life-threatening
hypersensitivity reactions to the antibody cetuximab
(Erbitux) in some US states has been traced to its
glycosylation [containing a Gal(a1-3)Gal epitope].
Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for
galactose-alpha-1,3-galactose”, New England Journal of Medicine,
Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
• Similarly, Human Erythropoietin (EPO) alpha, beta,
delta and omega share the same primary sequence,
but differ in their glycosylation patterns.
Monomer dictionaries don’t scale
• Systems based upon monomer dictionaries (such as
HELM and PDB) are notoriously difficult to maintain.
• The limited number of monomers in proteinogenic
peptides and natural nucleic acid sequences leads to
a false sense of security: that monomers are finite.
• In practice, the number of monomers, post-
translational and chemical modifications is infinite.
• Even more difficult than standardizing monomer
definitions via a central repository, like PDB, is
allowing local custom definitions.
48 hexopyranoses
264 deoxy-hexopyranoses
9540 substituted hexopyranoses (4 most common substituents)
The current situation
• Pistoia HELM can’t [yet] handle/canonicalize
glycans and oligosaccharides.
– It can’t uniquely canonicalize Fmoc-Ala-OH
(between Pistoia and ChEMBL monomer sets).
• IUPAC InChI can’t [yet] officially handle more
than 1024 atoms.
• Folks working on glycoproteins are screwed…
(or use expensive commercial software)
Constructive suggestion…
• Ideally, a chemical identifier should be independent
of the input representation or file format.
• Equivalence between small molecules, peptides and
proteins is best determined by a single identifier,
preferably the existing standard InChI.
• This is possible as increases in computer power and
storage mean that cheminformatics toolkits can
handle huge biopolymers on modern hardware.
Three recent experiments
1. Is it possible to generate standard InChI for
extremely large molecules (polymers)?
2. How well do all-atom canonicalization
algorithms scale and can they be improved?
3. Are there alternative canonical identifiers
that can be useful in bioinformatics and
precision medicine?
Previous InChI key record (2014)
• Sequence Identifier: UTP10_KLULA
• Sequence Length: 1774 amino acids
• Molecule size: 28509 atoms
• InChI Length: 119699 characters
• InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N
• InChI Canonicalization Time: 73.2s
• Canonical SMILES Length: 35408 chars
• OEChem SMILES Canonicalization Time: 0.4s
Special classes of molecules
• Alkanes
– InChI=1S/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3
– InChI=1S/C8H18/c1-3-5-7-8-6-4-2/h3-8H2,1-2H3
– InChI=1S/C10H22/c1-3-5-7-9-10-8-6-4-2/h3-10H2,1-2H3
– 1 million carbons, InChI is 6,888,942 bytes (~6.9Mbytes)
– 1 billion carbons, InChI is 9,888,888,954 bytes (~9.9Gbytes)
• Polyalanine
– InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
– 1 thousand L-alanines, InChI is 42,965 bytes (~43Kbytes)
– 1 million L-alanines, InChI is 66,888,995 bytes (~66.9Mbytes)
• Theoretically one could write an efficient fasta2inchi
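The alkane examples above follow a regular pattern: odd atom numbers ascending, then even numbers descending. The sketch below writes that pattern out directly, in the spirit of the fasta2inchi idea, without running any canonicalization. The function names are hypothetical, and the pattern is only asserted here for unbranched alkanes with four or more carbons. The closed-form length for one billion carbons comes out one character below the slide's figure, which would be consistent with the slide counting a trailing newline (an assumption).

```python
def alkane_inchi(n: int) -> str:
    """Standard InChI of the unbranched alkane with n carbons (sketch, n >= 4)."""
    if n < 4:
        raise ValueError("this sketch only covers n >= 4")
    odds = range(1, n + 1, 2)            # 1, 3, 5, ...
    evens = range(n - (n % 2), 0, -2)    # ..., 6, 4, 2
    chain = "-".join(str(i) for i in (*odds, *evens))
    return f"InChI=1S/C{n}H{2 * n + 2}/c{chain}/h3-{n}H2,1-2H3"


def digits_up_to(n: int) -> int:
    """Total number of decimal digits needed to write 1, 2, ..., n."""
    total, width, start = 0, 1, 1
    while start <= n:
        end = min(n, start * 10 - 1)
        total += (end - start + 1) * width
        start *= 10
        width += 1
    return total


def alkane_inchi_length(n: int) -> int:
    """Length of alkane_inchi(n), computed without building the string."""
    head = len(f"InChI=1S/C{n}H{2 * n + 2}/c")
    tail = len(f"/h3-{n}H2,1-2H3")
    # Every atom number 1..n appears once in the chain, joined by n - 1 dashes.
    return head + digits_up_to(n) + (n - 1) + tail


print(alkane_inchi(6))
print(alkane_inchi_length(10**9))
```

Because the length is computed arithmetically, the billion-carbon case can be checked without allocating a multi-gigabyte string.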
Algorithm scaling (plots for 100AA, 1000AA, 5000AA and maximum sequence length)
Spread of algorithm run-times (plot)
Peptide names (ChEMBL)
• The following names are machine generated
• [15-L-arginine]nociceptin CHEMBL526333
• [2-4-chloro-L-phenylalanine]neuropeptide S [human] CHEMBL441576
• [1-L-threonine]cyclosporin A CHEMBL2370014
• [6-L-tryptophan]sermorelin free acid CHEMBL440438
• angiotensin II (3-8) CHEMBL261120
• nociceptin amide CHEMBL389521
• acetyl-alpha-MSH (4-10) amide CHEMBL410411
• [2-L-cysteine,13-L-cysteine]neurotensin disulfide CHEMBL3278512
• myristoyl-[1-L-lysine,4-L-tryptophan]tetrapandin 2 amide CHEMBL3288219
• [2-(4RS)-thiazolidine-4-carboxylic acid,4-L-proline]endomorphin-2 CHEMBL126611
• [22-L-serine]kalata B1 CHEMBL1801140
Scaling-up protein variant naming
• The algorithm described for naming peptides can
also be applied to naming arbitrary protein variants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
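One way to picture such a graph is to build a prefix tree over the sequences and then merge states that accept identical sets of suffixes. The sketch below does that with the 11 peptides listed earlier. It is an illustration under stated assumptions, not the authors' data structure: the class and function names are hypothetical, peptide names are not stored, and the state count it prints need not equal the slide's 38, which depends on how states and names are counted.

```python
class Node:
    """One state: outgoing edges keyed by residue, plus an end-of-sequence flag."""
    __slots__ = ("children", "final")

    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(words):
    """Prefix tree: sequences sharing a prefix share the states along it."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(child) for child in node.children.values())


def minimize(node, registry):
    """Merge states with identical suffix sets; registry maps signature -> state."""
    for ch, child in list(node.children.items()):
        node.children[ch] = minimize(child, registry)
    signature = (node.final,
                 tuple(sorted((ch, id(child)) for ch, child in node.children.items())))
    return registry.setdefault(signature, node)


def contains(root, word):
    """Exact-match lookup by walking one edge per residue."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.final


PEPTIDES = ["CFFQNCPRG", "CFVRNCPTG", "CFWTSCPIG", "CYFQNCPRG", "CYFQNCPKG",
            "CYFRNCPIG", "CYIQNCPLG", "CYIQNCPPG", "CYIQNCPRG", "CYIQSCPIG",
            "CYISNCPIG"]

trie = build_trie(PEPTIDES)
trie_states = count_trie_states(trie)
registry = {}
dag = minimize(trie, registry)   # rewires the trie in place
dag_states = len(registry)
print(trie_states, dag_states)
```

Lookup cost depends on the query length rather than on the number of stored sequences, which is the property that makes this kind of structure attractive for large sequence sets.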
Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contains over
192M amino acids, require 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
Application to precision medicine
• A more realistic example is the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277 can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
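The variant names above read as a list of substitutions (reference residue, 1-based position, new residue) in front of the identifier of the closest reference sequence. The sketch below produces names in that style for equal-length sequences. It is a minimal illustration, assuming substitutions only and using a linear scan as a stand-in for the graph search described in the talk. The function names are hypothetical, and the small peptide table from the earlier slide serves as the reference database, with peptide names in place of accessions.

```python
def name_variant(query: str, reference: str, accession: str) -> str:
    """Name query as a list of substitutions relative to reference."""
    if len(query) != len(reference):
        raise ValueError("this sketch only handles substitutions")
    subs = [f"{ref}{pos}{alt}"
            for pos, (ref, alt) in enumerate(zip(reference, query), start=1)
            if ref != alt]
    return f"[{','.join(subs)}]{accession}" if subs else accession


def name_against_db(query: str, db: dict) -> str:
    """Pick the same-length reference with the fewest mismatches (linear scan)."""
    candidates = [(sum(a != b for a, b in zip(seq, query)), name, seq)
                  for seq, name in db.items() if len(seq) == len(query)]
    if not candidates:
        raise ValueError("no reference of matching length")
    _, name, seq = min(candidates)   # ties broken alphabetically by name
    return name_variant(query, seq, name)


REFERENCES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin",   "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin",     "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin",      "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin",     "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}

print(name_variant("CYFQNCPKG", "CYFQNCPRG", "argipressin"))  # [R8K]argipressin
print(name_against_db("CYISNCPIA", REFERENCES))               # [G9A]isotocin
```

A real registration system would also need insertions, deletions and a policy for ties between equally close references; this sketch simply breaks ties by name.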
Conclusions
• “InChI for large molecules” can be achieved,
and remain compatible with small molecule
InChI identifiers, through the evolution of ever
better canonicalization algorithms.
• Journal reviewers who claim that the run-time
of canonicalization algorithms is a non-issue,
and not an area ripe for improvement are…
very mistaken.
Acknowledgements
• Greg Landrum, Novartis, Basel, Switzerland.
• Nadine Schneider, Novartis, Basel, Switzerland.
• Evan Bolton, NCBI PubChem project, Bethesda, MD.
• Joann Prescott-Roy, Novartis, Cambridge, MA, USA.
• Daniel Lowe, NextMove Software, Cambridge, UK.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sight
 

Recently uploaded

Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 

Recently uploaded (20)

Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 

CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

  • 1. Generating canonical identifiers for (glycoproteins and other chemically modified) biopolymers
      Roger Sayle, John May & Noel O’Boyle
      NextMove Software, Cambridge, UK
      250th ACS National Meeting, Boston, MA. Sunday 16th August 2015
  • 2. Motivation
      • Non-standard peptides, post-translationally modified proteins and drug-antibody conjugates are becoming increasingly relevant to the life sciences.
      • Registration of biologics, beyond the FASTA sequence, is considered desirable but technically challenging.
      • In this talk, I discuss complementary approaches to biologics registration: one based upon expressive all-atom representations, another on tracking deltas to a reference database of protein sequences.
  • 3. Real-world small-scale example
      • Many research reagents contain “hybrid molecules”
          – Innovagen SP-5125 lauroyl-apelin-13
              • dodecanoyl-QRPRLSHKGPMPF
          – Innovagen SP-5126 myristoyl-apelin-13
              • tetradecanoyl-QRPRLSHKGPMPF
          – Innovagen SP-5124 palmitoyl-apelin-13
              • hexadecanoyl-QRPRLSHKGPMPF
          – Innovagen SP-5127 stearoyl-apelin-13
              • octadecanoyl-QRPRLSHKGPMPF
  • 4. The cutting edge of biosimilarity
      • The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation [containing a Gal(a1-3)Gal epitope].
        Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
      • Similarly, human erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence, but differ in their glycosylation patterns.
  • 5. Monomer dictionaries don’t scale
      • Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain.
      • The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences leads to a false sense of security: the belief that the set of monomers is finite.
      • In practice, the number of monomers, post-translational and chemical modifications is infinite.
      • Even more difficult than standardizing monomer definitions via a central repository, like PDB, is allowing local custom definitions.
  • 6. 48 hexopyranoses
  • 8. 9540 substituted hexopyranoses (4 most common substituents)
  • 9. The current situation
      • Pistoia HELM can’t [yet] handle/canonicalize glycans and oligosaccharides.
          – It can’t uniquely canonicalize Fmoc-Ala-OH (between Pistoia and ChEMBL monomer sets).
      • IUPAC InChI can’t [yet] officially handle more than 1024 atoms.
      • Folks working on glycoproteins are screwed… (or use expensive commercial software)
  • 10. Constructive suggestion…
      • Ideally, a chemical identifier should be independent of the input representation or file format.
      • Equivalence between small molecules, peptides and proteins is best determined by a single identifier, preferably the existing standard InChI.
      • This is possible because increases in computer power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.
  • 11. Three recent experiments
      1. Is it possible to generate standard InChI for extremely large molecules (polymers)?
      2. How well do all-atom canonicalization algorithms scale and can they be improved?
      3. Are there alternative canonical identifiers that can be useful in bioinformatics and precision medicine?
  • 12. Previous InChIKey record (2014)
      • Sequence Identifier: UTP10_KLULA
      • Sequence Length: 1774 amino acids
      • Molecule size: 28509 atoms
      • InChI Length: 119699 characters
      • InChIKey: PHBRSEQMAKHFGD-ZBXWIJJNSA-N
      • InChI Canonicalization Time: 73.2s
      • Canonical SMILES Length: 35408 chars
      • OEChem SMILES Canonicalization Time: 0.4s
  • 13. Special classes of molecules
      • Alkanes
          – InChI=1S/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3
          – InChI=1S/C8H18/c1-3-5-7-8-6-4-2/h3-8H2,1-2H3
          – InChI=1S/C10H22/c1-3-5-7-9-10-8-6-4-2/h3-10H2,1-2H3
          – 1 million carbons, InChI is 6,889,942 bytes (~6.9Mbytes)
          – 1 billion carbons, InChI is 9,888,888,954 bytes (~9.9Gbytes)
      • Polyalanine
          – InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
          – 1 thousand L-alanines, InChI is 42,965 bytes (~43Kbytes)
          – 1 million L-alanines, InChI is 66,888,995 bytes (~66.9Mbytes)
      • Theoretically one could write an efficient fasta2inchi
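The regularity of the alkane strings above suggests they can be written down directly, without running a canonicalizer. The following is a minimal sketch of that idea, assuming the pattern visible in the three examples (odd atom numbers ascending, then even numbers descending) holds for every unbranched alkane with four or more carbons. The function names `alkane_inchi` and `alkane_inchi_length` are hypothetical and are not part of any InChI library.

```python
def alkane_inchi(n: int) -> str:
    """Write the standard InChI of the unbranched alkane CnH(2n+2), n >= 4,
    by following the numbering pattern of the slide's examples."""
    if n < 4:
        raise ValueError("this sketch only covers chains of four or more carbons")
    odds = range(1, n + 1, 2)                          # 1, 3, 5, ...
    evens = range(n if n % 2 == 0 else n - 1, 0, -2)   # ..., 6, 4, 2
    chain = "-".join(str(i) for i in (*odds, *evens))
    return f"InChI=1S/C{n}H{2 * n + 2}/c{chain}/h3-{n}H2,1-2H3"


def alkane_inchi_length(n: int) -> int:
    """Length in characters of alkane_inchi(n), computed without building it."""
    digits = 0
    width, low = 1, 1
    while low <= n:                                    # sum the digits of 1..n
        high = min(n, low * 10 - 1)
        digits += width * (high - low + 1)
        width, low = width + 1, low * 10
    fixed = (len("InChI=1S/C") + len(str(n)) + len("H") + len(str(2 * n + 2))
             + len("/c") + len("/h3-") + len(str(n)) + len("H2,1-2H3"))
    return fixed + digits + (n - 1)                    # n - 1 hyphens in the chain


if __name__ == "__main__":
    for carbons in (6, 8, 10):
        print(alkane_inchi(carbons))
    print(alkane_inchi_length(10**9))
```

For 10^9 carbons this count gives 9,888,888,953 characters, one fewer than the byte count quoted on the slide, which would be consistent with a trailing newline.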
  • 14. Algorithm scaling to 100AA
  • 15. Algorithm scaling to 1000AA
  • 16. Algorithm scaling to 5000AA
  • 17. Algorithm scaling to maximum
  • 18. Spread of algorithm run-times
  • 19. Peptide names (ChEMBL)
      • The following names are machine generated
      • [15-L-arginine]nociceptin CHEMBL526333
      • [2-4-chloro-L-phenylalanine]neuropeptide S [human] CHEMBL441576
      • [1-L-threonine]cyclosporin A CHEMBL2370014
      • [6-L-tryptophan]sermorelin free acid CHEMBL440438
      • angiotensin II (3-8) CHEMBL261120
      • nociceptin amide CHEMBL389521
      • acetyl-alpha-MSH (4-10) amide CHEMBL410411
      • [2-L-cysteine,13-L-cysteine]neurotensin disulfide CHEMBL3278512
      • myristoyl-[1-L-lysine,4-L-tryptophan]tetrapandin 2 amide CHEMBL3288219
      • [2-(4RS)-thiazolidine-4-carboxylic acid,4-L-proline]endomorphin-2 CHEMBL126611
      • [22-L-serine]kalata B1 CHEMBL1801140
  • 20. Scaling up protein variant naming
      • The algorithm described for naming peptides can also be applied to naming arbitrary protein variants.
      • Consider a database of the following 11 peptides:
          – CFFQNCPRG phenylpressin
          – CFVRNCPTG annetocin
          – CFWTSCPIG octopressin
          – CYFQNCPRG argipressin
          – CYFQNCPKG lypressin
          – CYFRNCPIG cephalotocin
          – CYIQNCPLG oxytocin
          – CYIQNCPPG prol-oxytocin
          – CYIQNCPRG vasotocin
          – CYIQSCPIG seritocin
          – CYISNCPIG isotocin
  • 21. DAG representation of sequences
      • These 11 peptides may be efficiently represented and searched as a “directed acyclic graph” [38 vs. 99 states]
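One common way to obtain such a graph is to build a prefix tree and then merge nodes that accept the same set of suffixes. The sketch below illustrates that general technique on the 11 peptides. It is an assumption that this resembles the data structure used in the talk, and the class and function names (`Node`, `build_trie`, `minimize`, `count_nodes`, `accepts`) are hypothetical. The counts it reports need not match the 38 states quoted on the slide, since the slide does not say how states are counted.

```python
class Node:
    """One state of the automaton: outgoing edges plus an end-of-sequence flag."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}
        self.final = False


def build_trie(words):
    """Prefix tree: sequences that share a prefix share a path from the root."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge nodes with identical suffix sets, bottom-up, turning the tree into a DAG."""
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    # Children are already canonical, so their identities describe the suffix set.
    key = (node.final, tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    """Number of distinct states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


def accepts(root, word):
    """True if the word is one of the stored sequences."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


PEPTIDES = [
    "CFFQNCPRG", "CFVRNCPTG", "CFWTSCPIG", "CYFQNCPRG", "CYFQNCPKG", "CYFRNCPIG",
    "CYIQNCPLG", "CYIQNCPPG", "CYIQNCPRG", "CYIQSCPIG", "CYISNCPIG",
]

trie = build_trie(PEPTIDES)
trie_states = count_nodes(trie)     # count before merging, since minimize edits in place
dag = minimize(trie, {})
dag_states = count_nodes(dag)
print(trie_states, dag_states)
```

Sharing suffixes as well as prefixes is what lets the graph stay small as more closely related sequences are added.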
  • 22. Entirety of UniProt/Swiss-Prot
      • Using this representation, all 540546 protein sequences in uniprot_sprot, which contains over 192M amino acids, require 142M states (1.4Gb).
      • This data structure allows close analogues to be identified much faster than using NCBI blastp.
      • For example, all 540546 sequences can be queried against this database (i.e. all-against-all) in ~9m30s on a single core on a laptop.
      • The sequence from PDB 1CRN (crambin 46AA) is canonically named as [L25I]P01542 in 0.002s.
  • 23. Application to precision medicine
      • A more realistic example is the sequence of the gene “spastic paraplegia 4” with six mutations from OMIM:604277, which can be canonically named as [I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0.
      • Run-time for this query is 0.2s.
      • By comparison, blastp 2.2.29+ takes about 6s.
          – With default arguments, NCBI blastp run time is 7s.
          – Only 6s with -num_descriptions 1 -num_alignments 1.
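Names of the form [L25I]P01542 can be read as a list of substitutions relative to a reference entry. The sketch below shows one way such a name could be assembled, assuming the usual convention of reference residue, 1-based position, then query residue, and handling substitutions only. The function names `name_variant` and `closest_reference` are hypothetical, and the brute-force search is a stand-in for illustration, not the graph search described in the talk.

```python
def name_variant(query: str, reference: str, accession: str) -> str:
    """Name a query sequence as substitutions against one reference sequence."""
    if len(query) != len(reference):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    changes = [
        f"{ref}{pos}{new}"
        for pos, (ref, new) in enumerate(zip(reference, query), start=1)
        if ref != new
    ]
    return f"[{','.join(changes)}]{accession}" if changes else accession


def closest_reference(query: str, database: dict) -> str:
    """Brute-force choice of the same-length reference with the fewest mismatches."""
    candidates = [name for name, seq in database.items() if len(seq) == len(query)]
    if not candidates:
        raise ValueError("no reference of matching length")
    return min(candidates, key=lambda name: sum(a != b for a, b in zip(database[name], query)))


# Toy reference set taken from the peptide list on the earlier slide.
REFERENCES = {"oxytocin": "CYIQNCPLG", "argipressin": "CYFQNCPRG"}

for query in ("CYIQNCPRG", "CYISNCPIG", "CYFQNCPKG"):
    accession = closest_reference(query, REFERENCES)
    print(name_variant(query, REFERENCES[accession], accession))
```

Insertions and deletions would need an alignment step, which this sketch deliberately leaves out.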
  • 24. Conclusions
      • “InChI for large molecules” can be achieved, and remain compatible with small molecule InChI identifiers, through the evolution of ever better canonicalization algorithms.
      • Journal reviewers who claim that the run-time of canonicalization algorithms is a non-issue, and not an area ripe for improvement, are… very mistaken.
  • 25. Acknowledgements
      • Greg Landrum, Novartis, Basel, Switzerland.
      • Nadine Schneider, Novartis, Basel, Switzerland.
      • Evan Bolton, NCBI PubChem project, Bethesda, MD.
      • Joann Prescott-Roy, Novartis, Cambridge, MA, USA.
      • Daniel Lowe, NextMove Software, Cambridge, UK.