So I have an SD File … What do I do next?
Rajarshi Guha & Noel O’Boyle
NCATS & NextMove Software
ACS National Meeting, Boston 2015
  
What do you want to do?
  
What is the core issue?
• What you see on a screen isn’t necessarily what you get in a file
• Need to be aware of how certain chemical concepts are handled in software
  
	
  
Tasks to be considered
• Searching for structures
• Managing inventory
• Linking / merging structure data to other data
• Predicting properties or analysis of bioactivity data
  
Which file format for data storage?
● The answer to this question is never XYZ or PDB
    ○ Don’t use a file format that throws away parts of your chemical structure (connectivity, bond orders or formal charges)
    ○ Software has to guess the missing information
● And probably not InChI
    ○ Without the ‘AuxInfo’, the chemical structure obtained from an InChI is not necessarily the same as the original (e.g. amides to imidic acids)
● SMILES and MOL are your go-to formats
● Widely supported (i.e. portable), can recreate the original structure
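An SD file is a series of MOL connection tables, each optionally followed by data items and closed by a `$$$$` line. The following is a minimal sketch of splitting such a file into records using only the Python standard library; the function names and the sample record are illustrative assumptions, and a cheminformatics toolkit's own reader is the safer choice for real work.

```python
from typing import Dict, Iterator, List, Tuple


def _split_record(lines: List[str]) -> Tuple[str, Dict[str, str]]:
    # The connection table ends at "M  END"; data items follow it.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    molblock = "\n".join(lines[: end + 1])
    fields: Dict[str, str] = {}
    name = None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as "> <ID>": the field name sits between < and >
            name = line[line.find("<") + 1: line.rfind(">")]
            fields[name] = ""
        elif name is not None and line.strip():
            fields[name] = (fields[name] + "\n" + line).strip()
        elif not line.strip():
            name = None  # a blank line closes the current data item
    return molblock, fields


def read_sdf_records(text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (molblock, data_fields) for each record in SD file text."""
    lines: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            yield _split_record(lines)
            lines = []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):  # tolerate a missing final $$$$
        yield _split_record(lines)


# Hand-written sample record (ethanol) used only for illustration.
SAMPLE = """ethanol
  sketch

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
> <ID>
CPD-1

> <NAME>
ethanol

$$$$
"""

records = list(read_sdf_records(SAMPLE))
```

Keeping the MOL block as an untouched string means nothing about the structure is thrown away before a toolkit gets to parse it.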
  
The question of identity
● A file format is not the same as an identifier
    ○ The same molecule can be represented in different ways, even in the same format
● A “canonical” representation is required
    ○ To check identity, find or avoid duplicates, find overlap of two databases or check that a structure remains unchanged (e.g. after some transformation)
● Only InChI (and IUPAC names) are canonical by definition, but canonical versions of other formats can be generated
Ethanol can be represented in SMILES format as CCO or OCC (among others)
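Once every record carries a canonical identifier, duplicate detection and database overlap reduce to dictionary and set operations. The sketch below assumes each record is a (compound ID, standard InChI) pair that was generated elsewhere; the IDs and the helper names `find_duplicates` and `overlap` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (compound ID, canonical identifier such as a standard InChI)
Record = Tuple[str, str]


def find_duplicates(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group compound IDs that share the same canonical identifier."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for compound_id, key in records:
        groups[key].append(compound_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def overlap(db_a: Iterable[Record], db_b: Iterable[Record]) -> Set[str]:
    """Canonical identifiers present in both collections."""
    return {key for _, key in db_a} & {key for _, key in db_b}


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

# Made-up vendor lists for illustration only.
vendor_a = [("A-1", ETHANOL), ("A-2", METHANOL), ("A-3", ETHANOL)]
vendor_b = [("B-7", ETHANOL)]

duplicates = find_duplicates(vendor_a)   # A-1 and A-3 are the same compound
shared = overlap(vendor_a, vendor_b)     # ethanol appears in both lists
```

The comparison is only as good as the identifier: strings produced by different canonicalization schemes must not be mixed in the same key column.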
Canonical SMILES
● Atom order is the same whatever the input
● BUT, every toolkit has its own canonicalization algorithm (which may change over time)
    ○ Consistent within the toolkit, not necessarily outside
● Don’t assume that a given SMILES is in a canonical form
    ○ If necessary, canonicalize them yourself
  
Ethanol as CCO, OCC, C(O)C all converted to CCO (by Toolkit#1)
Ethanol as CCO, OCC, C(O)C all converted to OCC (by Toolkit#2)
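The two toolkit results above can be mimicked with toy lookup tables to show why canonical SMILES should only be compared when both came from the same toolkit. This is a sketch under stated assumptions: `TOOLKIT_1` and `TOOLKIT_2` are hard-coded stand-ins for the ethanol example, not real canonicalizers, and `same_molecule` is a hypothetical helper.

```python
# Toy stand-ins for two toolkits' canonicalizers, hard-coded to reproduce the
# ethanol example. A real toolkit derives the output from the molecular graph.
TOOLKIT_1 = {"CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO"}
TOOLKIT_2 = {"CCO": "OCC", "OCC": "OCC", "C(O)C": "OCC"}


def same_molecule(smiles_a, smiles_b, canonicalize):
    """Compare two SMILES after canonicalizing BOTH with the same function."""
    return canonicalize(smiles_a) == canonicalize(smiles_b)


inputs = ["CCO", "OCC", "C(O)C"]

# Within one toolkit every input collapses to a single string.
within_1 = {TOOLKIT_1[s] for s in inputs}
within_2 = {TOOLKIT_2[s] for s in inputs}

# Mixing outputs of the two toolkits makes identical molecules look different.
mixed_match = TOOLKIT_1["CCO"] == TOOLKIT_2["CCO"]

# Re-canonicalizing both sides with one toolkit restores the match.
consistent_match = same_molecule("OCC", "C(O)C", TOOLKIT_1.get)
```

In practice this argues for storing the toolkit name and version next to any canonical SMILES column, and for re-canonicalizing imported SMILES rather than trusting them.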
Depictions vs computers
● Are your structures drawn for humans or computers?
    ○ There are 2D depictions of stereochemistry that are instantly interpretable by a human but which are commonly misinterpreted by software
● Chirality of (a) is opposite to (c)
    ○ But what is the chirality of (b)?
● Possibilities:
    ○ Undefined (according to InChI, if close to 180°)
    ○ Same as (a) or (c) depending on which side of 180°
  
Rings with ‘implicit’ 3D
[Figure panels: You drew / You meant / You may get]
  
Tetrahedral stereo gotchas
● R/S in IUPAC names, @/@@ in SMILES, 1/2 in MOL files, +/- in InChIs
● None of these directly correspond to another
    ○ SMILES and MOL files describe stereo in terms of atom order, but differ in where implicit hydrogens are located
    ○ InChI and IUPAC names both use a complex algorithm to determine the symbol
● Only two of these formats may always be used to compare two structures:
    ○ R/S and /m layer (InChI)
    ○ Also @/@@, but only if canonical
  
Illuminating the black box
● Important to know what operations are being done implicitly and what needs to be done explicitly
    ○ Are the error rates acceptable?
● Parse structure
    ○ Read list of atoms and bonds (incl. charges and isotopes)
    ○ [Mol, Mol2, Smi] Apply valence model
● Perceive aromaticity (or preserve from input)
● Perceive stereochemistry (or preserve from input)
● Optional: recognize atom / bond types, partial charges, generate coordinates
  
c1ccccc1C(=O)Cl
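The "apply valence model" step is where a reader fills in hydrogens that the file never listed. Here is a minimal sketch, assuming the default valences of the OpenSMILES organic subset for neutral atoms; toolkits may apply different models, and `implicit_hydrogens` together with the hand-tabulated bond orders for benzoyl chloride are illustrative rather than any toolkit's API.

```python
# Default valences of neutral "organic subset" atoms (OpenSMILES convention).
# Which valence model a given toolkit applies is an assumption to verify.
DEFAULT_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element: str, bond_order_sum: int) -> int:
    """Hydrogens needed to reach the lowest default valence >= bond_order_sum."""
    for valence in DEFAULT_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # already above the highest default valence: add nothing


# Benzoyl chloride, c1ccccc1C(=O)Cl, written out in Kekulé form:
# (element, sum of explicit bond orders) for each heavy atom.
benzoyl_chloride = [
    ("C", 3), ("C", 3), ("C", 3), ("C", 3), ("C", 3),  # ring CH atoms
    ("C", 4),   # ring carbon bearing the acyl group
    ("C", 4),   # carbonyl carbon
    ("O", 2),   # carbonyl oxygen
    ("Cl", 1),  # chlorine
]

h_counts = [implicit_hydrogens(el, bo) for el, bo in benzoyl_chloride]
total_h = sum(h_counts)  # consistent with the formula C7H5ClO
```

The same bond-order sum can yield different hydrogen counts under a different valence table, which is one way two programs end up disagreeing about the same file.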
Aromaticity
● Cheminformatics aromaticity is not quite the same as chemical aromaticity
    ○ Mainly a convenience for handling the fact that the single/double bonds in Kekulé systems may be set differently
● Usually a good idea to export structures in Kekulé form
    ○ More portable - tools may reject some SMILES in aromatic form if they cannot kekulize them
    ○ Allows tools to apply their own aromaticity model
    ○ Faster if detection of aromaticity can be avoided
  
2D or 3D?
  
No Geometry
No Geometry
2D Geometry
3D Geometry
CN1C2=C(C(C3=CC=CC=C3)=NCC1=O)C=C(Cl)C=C2
Going from 2D to 3D
● Key point - easy to get a 3D structure, but is it the 3D structure you want (or need)?
    ○ Do you need a single ‘reasonable’ structure or a large number of conformations?
● Many tools to generate an acceptable 3D structure from a 2D format
    ○ Usually a low energy conformation obtained via molecular mechanics
● Conformer generators
    ○ Important to think about appropriate energy and/or RMSD cutoffs
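To make the energy and RMSD cutoffs concrete, the sketch below prunes a conformer list with a greedy filter. It assumes the conformers are already superimposed and share one atom ordering (no alignment is performed here), and the function names, units and toy coordinates are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]
Conformer = Tuple[float, Coords]  # (energy, coordinates)


def rmsd(a: Coords, b: Coords) -> float:
    """Root-mean-square deviation between matched atoms (no superposition)."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


def prune_conformers(conformers: List[Conformer],
                     energy_window: float,
                     rmsd_cutoff: float) -> List[Conformer]:
    """Keep conformers within energy_window of the minimum that differ from
    every conformer already kept by at least rmsd_cutoff."""
    ranked = sorted(conformers, key=lambda c: c[0])
    if not ranked:
        return []
    lowest = ranked[0][0]
    kept: List[Conformer] = []
    for energy, coords in ranked:
        if energy - lowest > energy_window:
            break  # everything after this is even higher in energy
        if all(rmsd(coords, k[1]) >= rmsd_cutoff for k in kept):
            kept.append((energy, coords))
    return kept


# Toy two-atom "conformers" with made-up energies, for illustration only.
conformers = [
    (0.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (0.5, [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]),   # near-duplicate of the first
    (1.0, [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]),   # genuinely different
    (10.0, [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]),  # outside the energy window
]
kept = prune_conformers(conformers, energy_window=3.0, rmsd_cutoff=0.5)
```

Tightening the RMSD cutoff keeps more near-duplicates, while widening the energy window admits higher-energy geometries; which trade-off is right depends on whether one structure or an ensemble is needed.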
  
Moving from files to a database
● If you’re going beyond 100s of molecules consider using a chemically-aware database
    ○ Instant JChem
    ○ MolEditor
● Not too difficult to roll your own using Open Source but requires programming skills
● Don’t use Excel (even with ChemDraw)
    ○ Missing data is not handled consistently
    ○ Can mangle identifiers (parse them as dates)
    ○ Complicates workflows
    ○ Formatting can hinder efficient data analyses
    ○ Difficult to have multiple users
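As a rough idea of what "rolling your own" storage layer might look like, here is a sketch using Python's built-in `sqlite3` module. It covers only the bookkeeping side (text identifiers that are never reinterpreted, explicit NULLs for missing data, a uniqueness constraint on a canonical identifier); structure searching would still need a cheminformatics toolkit or database cartridge. The table layout, column names and the `register` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE compound (
        compound_id TEXT PRIMARY KEY,     -- stored as text, never re-parsed
        inchi       TEXT NOT NULL UNIQUE, -- canonical identifier blocks duplicates
        smiles      TEXT NOT NULL,
        logp        REAL                  -- NULL means "not measured"
    )
    """
)


def register(conn, compound_id, inchi, smiles, logp=None):
    """Insert a compound; return False if the ID or structure already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO compound VALUES (?, ?, ?, ?)",
                (compound_id, inchi, smiles, logp),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# "MAR-1" is the kind of identifier a spreadsheet may silently turn into a date.
first = register(conn, "MAR-1", ETHANOL, "CCO")
second = register(conn, "1-2", ETHANOL, "OCC")   # same structure, new ID

rows = conn.execute("SELECT compound_id, logp FROM compound").fetchall()
```

Because the duplicate is rejected by the database rather than by convention, the check still holds when several users load data at once.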
  
Verifying data quality
● This is all good if it’s your own compounds
● What about structures from someone else?
    ○ Need to check (& try to fix) nonsensical chemistry
● Check for
    ○ invalid valences, nonsense stereo, fragments
    ○ weird/invalid atoms, multiple radical centers
● Consider http://cvsp.chemspider.com/
  
Karapetyan et al, J. Cheminf, 2015
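One of the checks listed above, spotting fragments, can be roughed out on SMILES strings with the standard library alone. This is a crude sketch that counts dot-separated parts; the helper names and example records are hypothetical, and a toolkit's fragment perception (or a validation service such as the one cited) is the more reliable route.

```python
def count_components(smiles: str) -> int:
    """Count dot-separated parts of a SMILES string.

    Caveat: ring-closure digits can bond atoms across a dot (C1.C1 is ethane),
    so this over-counts in that rare case.
    """
    return len([part for part in smiles.split(".") if part])


def flag_multicomponent(records):
    """Return IDs of records whose SMILES has more than one component."""
    return [cid for cid, smi in records if count_components(smi) > 1]


# Made-up records for illustration only.
records = [
    ("CPD-1", "CCO"),                # ethanol
    ("CPD-2", "CC(=O)[O-].[Na+]"),   # sodium acetate: a salt, two components
    ("CPD-3", "c1ccccc1C(=O)Cl"),    # benzoyl chloride
]
flagged = flag_multicomponent(records)
```

Flagged records are candidates for review, not automatic removal: whether a counter-ion or solvent should be stripped depends on what the data will be used for.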
Structures are good. Are they useful?
● At this point you likely have a set of correct (valid) structures
    ○ Are the structures useful for your purpose?
● A collection may have compounds with problematic structures
    ○ Reactive groups, fluorophores, ADMET liabilities, …
● Consider rules & filters such as REOS, PAINS, Lilly MedChem Rules
    ○ Implemented in commercial & OSS tools
    ○ Don’t use them blindly!
● Normalisation?
    ○ E.g. -N(=O)=O or -[N+]([O-])=O (or doesn’t matter?)
  
What are you really looking for?
● Similarity searches are a common task
● What you get depends on
    ○ How the structure was entered
    ○ Normalization of structures
● But also on what you’re looking for
    ○ Connectivity
    ○ Atom & bond type
    ○ Shape or pharmacophore features …
● May be surprised by false negatives
    ○ Test your query on structures it should find
(Figure label: may not find)
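How much the answer depends on "what you're looking for" can be seen with a small similarity calculation. The sketch below uses the Tanimoto coefficient on feature sets, which is a common choice but an assumption here since the slides do not name a measure; the hand-written feature sets for phenol and aniline are purely illustrative and not the output of any fingerprinting method.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features divided by all features."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


# Hand-made features that look only at connectivity (atom types ignored):
# both molecules are a six-membered ring with one substituent.
phenol_connectivity = {"ring6", "*:*", "*-*"}
aniline_connectivity = {"ring6", "*:*", "*-*"}

# Hand-made features that also encode atom and bond types.
phenol_typed = {"ring6", "c:c", "c-O"}
aniline_typed = {"ring6", "c:c", "c-N"}

sim_connectivity = tanimoto(phenol_connectivity, aniline_connectivity)
sim_typed = tanimoto(phenol_typed, aniline_typed)
```

With a fixed similarity threshold, the same pair is a hit under one feature definition and a miss under the other, which is why testing a query against structures it should find is worthwhile.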
  
Because we love statistics & M/L
Alexander et al (2015)
Cherkasov et al (2014)
Huang & Fan (2013)
Chirico & Gramatica (2011)
Tropsha (2010)
Jain & Nicholls (2008)
Nicholls (2008)
Hawkins (2004)
Cronin & Schultz (2003)

• Look at your data, plot your data
• Read up on statistics
• Linear models are a good start
• Most of this is not about cheminformatics
• But the notion of chemical space plays a key role in this area
  
Summary
Do
1. Choose appropriate file formats
2. Check data quality
3. Get involved in the cheminformatics community
4. Trust but verify

Don’t
1. Treat chemical software as a black box
2. Assume geometry
3. Use M/L blindly
4. Did we mention Excel already?
  
	
  
Acknowledgements
● John May (NextMove Software)
● Adam Yasgar, Madhu Lal-Nag (NCATS)
  
	
  

LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 

So I have an SD File … What do I do next?

  • 1. So I have an SD File … What do I do next?
      Rajarshi Guha & Noel O’Boyle
      NCATS & NextMove Software
      ACS National Meeting, Boston 2015
  • 2. What do you want to do?
      What is the core issue?
      ● What you see on a screen isn’t necessarily what you get in a file
      ● Need to be aware of how certain chemical concepts are handled in software
      Tasks to be considered
      ● Searching for structures
      ● Managing inventory
      ● Linking / merging structure data to other data
      ● Predicting properties or analysis of bioactivity data
  • 3. Which file format for data storage?
      ● The answer to this question is never XYZ or PDB
          ○ Don’t use a file format that throws away parts of your chemical structure (connectivity, bond orders or formal charges)
          ○ Software has to guess the missing information
      ● And probably not InChI
          ○ Without the ‘AuxInfo’, the chemical structure obtained from an InChI is not necessarily the same as the original (e.g. amides to imidic acids)
      ● SMILES and MOL are your go-to formats
          ○ Widely supported (i.e. portable), can recreate the original structure
  • 4. The question of identity
      ● A file format is not the same as an identifier
          ○ The same molecule can be represented in different ways, even in the same format
      ● A “canonical” representation is required
          ○ To check identity, find or avoid duplicates, find overlap of two databases or check that a structure remains unchanged (e.g. after some transformation)
      ● Only InChI (and IUPAC names) are canonical by definition, but canonical versions of other formats can be generated
      Figure: Ethanol can be represented in SMILES format as CCO or OCC (among others)
  • 5. Canonical SMILES
      ● Atom order is the same whatever the input
      ● BUT, every toolkit has its own canonicalization algorithm (which may change over time)
          ○ Consistent within the toolkit, not necessarily outside
      ● Don’t assume that a given SMILES is in a canonical form
          ○ If necessary, canonicalize them yourself
      Figure: Ethanol as CCO, OCC, C(O)C all converted to CCO (by Toolkit#1)
      Figure: Ethanol as CCO, OCC, C(O)C all converted to OCC (by Toolkit#2)
  • 6. Depictions vs computers
      ● Are your structures drawn for humans or computers?
          ○ There are 2D depictions of stereochemistry that are instantly interpretable by a human but which are commonly misinterpreted by software
      ● Chirality of (a) is opposite to (c)
          ○ But what is the chirality of (b)?
      ● Possibilities:
          ○ Undefined (according to InChI, if close to 180°)
          ○ Same as (a) or (c) depending on which side of 180°
  • 7. Rings with ‘implicit’ 3D
      Figure panels: You drew; You meant; You may get
  • 8. Tetrahedral stereo gotchas
      ● R/S in IUPAC names, @/@@ in SMILES, 1/2 in MOL files, +/- in InChIs
      ● None of these directly correspond to another
          ○ SMILES and Mol files describe stereo in terms of atom order, but differ in where implicit hydrogens are located
          ○ InChI and IUPAC names both use a complex algorithm to determine the symbol
      ● Only two of these formats may always be used to compare two structures:
          ○ R/S and /m layer (InChI)
          ○ Also @/@@, but only if canonical
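The point that SMILES describes stereo in terms of atom order can be made concrete with permutation parity: listing the same four neighbours in a different order flips @ to @@ whenever the reordering is an odd permutation. The following is a minimal sketch of that rule only, not any toolkit's implementation. The function names and the neighbour labels are hypothetical, and implicit hydrogens are assumed to be written out as explicit entries in the neighbour list.

```python
def permutation_parity(original, reordered):
    """Return 0 if `reordered` is an even permutation of `original`, else 1."""
    index = {atom: i for i, atom in enumerate(original)}
    perm = [index[atom] for atom in reordered]
    # The parity of the inversion count equals the parity of the permutation.
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


def relabel(tag, original, reordered):
    """Return the @/@@ tag that describes the same centre after the
    neighbours are listed in a new order (hypothetical helper)."""
    if tag not in ("@", "@@"):
        raise ValueError("tag must be '@' or '@@'")
    if len(set(original)) != len(original) or set(original) != set(reordered):
        raise ValueError("neighbour lists must contain the same distinct labels")
    if permutation_parity(original, reordered) == 0:
        return tag  # even permutation: same symbol
    return "@@" if tag == "@" else "@"  # odd permutation: symbol flips


# Hypothetical neighbour labels for one stereocentre.
neighbours = ["F", "Cl", "Br", "I"]
swapped = relabel("@", neighbours, ["Cl", "F", "Br", "I"])   # one swap (odd)
rotated = relabel("@", neighbours, ["Cl", "Br", "F", "I"])   # 3-cycle (even)
print(swapped, rotated)
```

This is why two @/@@ symbols can only be compared directly when both SMILES use the same (canonical) atom order.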
  • 9. Illuminating the black box
      ● Important to know what operations are being done implicitly and what needs to be done explicitly
          ○ Are the error rates acceptable?
      ● Parse structure
          ○ Read list of atoms and bonds (incl. charges and isotopes)
          ○ [Mol, Mol2, Smi] Apply valence model
      ● Perceive aromaticity (or preserve from input)
      ● Perceive stereochemistry (or preserve from input)
      ● Optional: recognize atom / bond types, partial charges, generate coordinates
      Example: c1ccccc1C(=O)Cl
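To show what the "parse structure" step involves, here is a minimal sketch that reads the atom and bond lists from a V2000 MOL block using the format's fixed-width columns. It is an illustration under stated assumptions, not a replacement for a toolkit reader: the function name and the hand-written ethanol record are hypothetical, `M  CHG` / `M  ISO` property lines are ignored, and no valence model, aromaticity or stereo perception is applied.

```python
# Atom-block charge codes in V2000 files (code 4 means doublet radical).
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}

# Hand-written heavy-atom record for ethanol (hypothetical example data).
MOLBLOCK = """ethanol
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.5000    0.0000    0.0000 C   0  0
    2.2500    1.3000    0.0000 O   0  0
  1  2  1  0
  2  3  1  0
M  END
"""


def parse_molblock(text):
    """Return (atoms, bonds) from a V2000 MOL block.

    atoms: list of dicts with 'symbol' and 'charge'
    bonds: list of (begin_atom, end_atom, bond_type), 1-based atom indices
    """
    lines = text.splitlines()
    counts = lines[3]  # three header lines precede the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        symbol = line[31:34].strip()
        code = int(line[36:39].strip() or 0)
        atoms.append({"symbol": symbol, "charge": CHARGE_CODES.get(code, 0)})

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


atoms, bonds = parse_molblock(MOLBLOCK)
print([a["symbol"] for a in atoms], bonds)
```

Note that the hydrogens of ethanol appear nowhere in the parsed lists. Filling them in is the job of the valence model, which is one of the implicit operations the slide warns about.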
  • 10. Aromaticity
      ● Cheminformatics aromaticity is not quite the same as chemical aromaticity
          ○ Mainly a convenience for handling the fact that the single/double bonds in Kekulé systems may be set differently
      ● Usually a good idea to export structures in Kekulé form
          ○ More portable - tools may reject some SMILES in aromatic form if they cannot kekulize them
          ○ Allows tools to apply their own aromaticity model
          ○ Faster if detection of aromaticity can be avoided
  • 11. 2D or 3D?
      Figure panels: No Geometry; No Geometry; 2D Geometry; 3D Geometry
      Example SMILES: CN1C2=C(C(C3=CC=CC=C3)=NCC1=O)C=C(Cl)C=C2
  • 12. Going from 2D to 3D
      ● Key point - easy to get a 3D structure, but is it the 3D structure you want (or need)?
          ○ Do you need a single ‘reasonable’ structure or a large number of conformations?
      ● Many tools to generate an acceptable 3D structure from a 2D format
          ○ Usually a low energy conformation obtained via molecular mechanics
      ● Conformer generators
          ○ Important to think about appropriate energy and/or RMSD cutoffs
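To illustrate what an RMSD cutoff does in a conformer set, the sketch below keeps a conformer only if it differs from every conformer already kept by at least the cutoff. It assumes that all conformers list the same atoms in the same order and that they are already superimposed; real conformer generators align structures (and often handle symmetry) before computing RMSD. The function names and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate
    lists, with no superposition step (an assumption of this sketch)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def prune_conformers(conformers, cutoff):
    """Greedy pruning: keep a conformer only if it is at least `cutoff`
    away from every conformer kept so far."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) >= cutoff for other in kept):
            kept.append(conf)
    return kept


# Three hypothetical two-atom "conformers".
conf1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf2 = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]   # nearly identical to conf1
conf3 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # clearly different

kept = prune_conformers([conf1, conf2, conf3], cutoff=0.5)
print(len(kept))
```

A tight cutoff keeps many near-duplicates, while a loose one may discard conformations that matter for the task, which is why the choice deserves thought.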
  • 13. Moving from files to a database
      ● If you’re going beyond 100s of molecules consider using a chemically-aware database
          ○ Instant JChem
          ○ MolEditor
      ● Not too difficult to roll your own using Open Source but requires programming skills
      ● Don’t use Excel (even with ChemDraw)
          ○ Missing data is not handled consistently
          ○ Can mangle identifiers (parse them as dates)
          ○ Complicates workflows
          ○ Formatting can hinder efficient data analyses
          ○ Difficult to have multiple users
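Two of the Excel problems listed above, mangled identifiers and inconsistent missing data, can be avoided even with a very small home-made store. The sketch below uses Python's built-in `sqlite3` module. It is not a chemically-aware database (no structure search, no canonicalization), and the table layout, identifiers and activity values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    """
    CREATE TABLE compounds (
        compound_id TEXT PRIMARY KEY,  -- stays text, never parsed as a date
        smiles      TEXT NOT NULL,
        ic50_um     REAL               -- NULL means "not measured"
    )
    """
)

# Hypothetical records; the identifiers look like dates on purpose.
rows = [
    ("MAR-1", "CCO", 12.5),
    ("SEP-2", "c1ccccc1C(=O)Cl", None),
]
conn.executemany("INSERT INTO compounds VALUES (?, ?, ?)", rows)

# Identifiers come back exactly as they were stored.
ids = [r[0] for r in conn.execute("SELECT compound_id FROM compounds ORDER BY compound_id")]

# Missing data is explicit and can be queried consistently.
missing = conn.execute(
    "SELECT compound_id FROM compounds WHERE ic50_um IS NULL"
).fetchall()

# The primary key rejects accidental duplicate identifiers.
try:
    conn.execute("INSERT INTO compounds VALUES (?, ?, ?)", ("MAR-1", "OCC", 3.0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(ids, missing, duplicate_rejected)
```

Duplicate detection here works on identifiers only. Detecting duplicate structures would need a canonical representation stored alongside each record.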
  • 14. Verifying data quality
      ● This is all good if it’s your own compounds
      ● What about structures from someone else?
          ○ Need to check (& try to fix) nonsensical chemistry
      ● Check for
          ○ invalid valences, nonsense stereo, fragments
          ○ weird/invalid atoms, multiple radical centers
      ● Consider http://cvsp.chemspider.com/
      Karapetyan et al., J. Cheminf., 2015
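As an illustration of one check from the list above, the sketch below flags neutral atoms whose summed bond orders exceed a maximum valence, and atoms whose symbol is not recognised. It is a deliberately small example and not how CVSP or any toolkit works: the valence table is a hypothetical subset, charged atoms are skipped, and bond orders are assumed to be Kekulé values (1, 2, 3) rather than aromatic bond types.

```python
# Hypothetical, deliberately incomplete table of maximum neutral valences.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def find_valence_problems(atoms, bonds):
    """atoms: list of (symbol, formal_charge)
    bonds: list of (begin_atom, end_atom, order) with 1-based atom indices
    Returns a list of (atom_index, symbol, message) for suspicious atoms."""
    totals = [0] * len(atoms)
    for begin, end, order in bonds:
        totals[begin - 1] += order
        totals[end - 1] += order

    problems = []
    for idx, ((symbol, charge), total) in enumerate(zip(atoms, totals), start=1):
        if symbol not in MAX_VALENCE:
            problems.append((idx, symbol, "unknown element"))
        elif charge == 0 and total > MAX_VALENCE[symbol]:
            problems.append(
                (idx, symbol, f"valence {total} exceeds {MAX_VALENCE[symbol]}")
            )
    return problems


# Ethanol heavy atoms: nothing to report.
ethanol = find_valence_problems(
    [("C", 0), ("C", 0), ("O", 0)], [(1, 2, 1), (2, 3, 1)]
)

# A carbon with five single bonds, plus an unrecognised atom symbol.
bad = find_valence_problems(
    [("C", 0), ("H", 0), ("H", 0), ("H", 0), ("H", 0), ("H", 0), ("Xx", 0)],
    [(1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 1), (1, 6, 1)],
)
print(ethanol, bad)
```

A check like this only finds candidates for review. Whether to fix, flag or discard a structure remains a judgement call.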
  • 15. Structures are good. Are they useful?
      ● At this point you likely have a set of correct (valid) structures
          ○ Are the structures useful for your purpose?
      ● A collection may have compounds with problematic structures
          ○ Reactive groups, fluorophores, ADMET liabilities, …
      ● Consider rules & filters such as REOS, PAINS, Lilly MedChem Rules
          ○ Implemented in commercial & OSS tools
          ○ Don’t use them blindly!
      ● Normalisation?
          ○ E.g. -N(=O)=O or -[N+]([O-])=O (or doesn’t matter?)
  • 16. What are you really looking for?
      ● Similarity searches are a common task
      ● What you get depends on
          ○ How the structure was entered
          ○ Normalization of structures
      ● But also on what you’re looking for
          ○ Connectivity
          ○ Atom & bond type
          ○ Shape or pharmacophore features …
      ● May be surprised by false negatives
          ○ Test your query on structures it should find but may not find
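The dependence of similarity on how a structure was entered can be sketched with the Tanimoto coefficient, a common similarity measure that the slides do not name and that is used here purely as an assumption. Fingerprints are modelled as plain sets of feature strings, and the features themselves are hypothetical, hand-written stand-ins for what a toolkit would generate.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features divided by all distinct features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


# The same nitro compound entered in two ways (hypothetical bond-level features).
pentavalent_form = {"c:c", "c-N", "N=O"}
charge_separated_form = {"c:c", "c-[N+]", "[N+]=O", "[N+]-[O-]"}

without_normalisation = tanimoto(pentavalent_form, charge_separated_form)

# After normalising both inputs to one convention, the features agree.
with_normalisation = tanimoto(pentavalent_form, pentavalent_form)

print(without_normalisation, with_normalisation)
```

In this toy case the unnormalised pair shares only one of six features, so a search with a typical similarity threshold would miss the compound: a false negative caused by representation, not chemistry.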
  • 17. Because we love statistics & M/L
      ● Look at your data, plot your data
      ● Read up on statistics
      ● Linear models are a good start
      ● Most of this is not about cheminformatics
      ● But the notion of chemical space plays a key role in this area
      Further reading: Alexander et al (2015); Cherkasov et al (2014); Huang & Fan (2013); Chirico & Grammatica (2011); Tropsha (2010); Jain & Nicholls (2008); Nicholls (2008); Hawkins (2004); Cronin & Schultz (2003)
  • 18. Summary
      Do
      1. Choose appropriate file formats
      2. Check data quality
      3. Get involved in the cheminformatics community
      4. Trust but verify
      Don’t
      1. Treat chemical software as a black box
      2. Assume geometry
      3. Use M/L blindly
      4. Did we mention Excel already?
  • 19. Acknowledgements
      ● John May (NextMove Software)
      ● Adam Yasgar, Madhu Lal-Nag (NCATS)