SMILES is a formal language for describing chemical structures. SMILES stands for Simplified Molecular Input Line Entry System. The language allows one to specify a chemical structure, or a fragment of a structure, using a keyboard-oriented notation. Within Pathway Tools, the SMILES language is used to input a chemical substructure for use in a substructure search. That is, all chemical compounds within the Pathway Tools knowledge base that contain the substructure entered by the user are returned by the substructure search.
SMILES Examples
The following table lists the SMILES strings that describe several chemical compounds.
Compound    SMILES notation
--------    ---------------
formate     C(=O)O
pyruvate    CC(=O)C(=O)O
malate      OC(=O)CC(O)C(O)=O
fumarate    OC(=O)C=CC(=O)O
Case is significant in the SMILES language (lowercase indicates atoms in aromatic rings).
The SMILES Notation
Here is a quick summary of the SMILES language*. Additional examples are provided below.
1> Atoms are represented by atomic symbols. Atoms not in the organic subset (B, C, N, O, P, S, F, Cl, Br, I), charged atoms, or organic atoms with an unusual valence must be specified within square brackets. Hydrogens need not be explicitly stated. Aromatic atoms may be indicated with lower-case letters.
Examples:
    C        methane (CH4)
    [OH-]    hydroxyl anion
    [Au]     elemental gold
2> Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively. Single bonds and aromatic bonds (between atoms written in lower case) need not be stated explicitly.
Examples:
    CC       ethane (CH3CH3)
    C-C      also ethane
    C=C      ethylene
    C#N      hydrogen cyanide (HCN)
3> Branches are specified by enclosure in parentheses.
Example:
    CC(C)C   isobutane
4> Cyclic structures are represented by breaking one bond in the cycle and designating the ring closure with a number next to the two atoms involved in the ring closure bond.
Examples:
    c1ccccc1    benzene
    C1CCCCC1    cyclohexane
5> Disconnected compounds, including ionic bonds, are written as individual structures separated by a period.
Example:
    [NH4+].[OH-]    ammonium hydroxide
6> Isomeric SMILES structures are treated as their generic SMILES counterparts, as CompoundKB does not currently utilize stereochemical or isotopic information. This parser accepts isomeric SMILES and simply screens out the unused information. The code is flagged at the appropriate locations for future changes in case CompoundKB eventually uses isomeric information. The relevant procedures are smiles-bond, smiles-bracketed, and smiles-stereochemistry.
7> Wildcards (an extension of SMILES). The generic wildcard atom * may be used to symbolize an unknown atom of any element. The generic wildcard bond ~ may be used in place of any bond.
Examples:
    C*C
    c1cc*cc1
    C=CC~C
    C~X~C
The wildcard syntax is a subset of SMARTS, an extension of SMILES for specifying structural patterns. See: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
* For a more detailed description, see "SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules," Weininger, D., J. Chem. Inf. Comput. Sci. 1988, 28, pp. 31-36. The summary here borrows heavily from this article.
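Rule 6 above says that isomeric SMILES are accepted and the unused stereochemical and isotopic information is screened out. The Python sketch below shows one way such screening could work on the input string. It is not the smiles-bond, smiles-bracketed, or smiles-stereochemistry code, and the function name strip_isomeric is hypothetical. It assumes the usual isomeric SMILES marks, which this summary does not describe: / and \ for double-bond direction, @ and @@ for chirality inside brackets, and a leading mass number for isotopes inside brackets.

```python
import re

# Matches one bracketed atom and captures its contents.
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")


def strip_isomeric(smiles):
    """Return a generic SMILES string with isotope, chirality, and
    double-bond direction marks removed (illustrative sketch only)."""
    def clean(match):
        inner = match.group(1)
        inner = re.sub(r"^\d+", "", inner)   # leading isotope mass, e.g. 13 in [13CH4]
        inner = inner.replace("@", "")       # chirality marks @ and @@
        return "[" + inner + "]"

    generic = BRACKET_RE.sub(clean, smiles)
    # Bond-direction marks appear outside brackets; dropping them leaves
    # the implicit single bond.
    return generic.replace("/", "").replace("\\", "")
```

The brackets are kept after cleaning, so an explicit hydrogen count such as the H in [C@@H] still applies: strip_isomeric("N[C@@H](C)C(=O)O") gives "N[CH](C)C(=O)O", and strip_isomeric("F/C=C/F") gives "FC=CF".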
The following grammar describes the SMILES language more precisely:
EBNF SMILES Grammar
{aromatic atom} = 'b | 'c | 'n | 'o | 'p | 's
{organic atom} = 'B | 'C | 'N | 'O | 'P | 'S |
'F | 'Cl | 'Br | 'I | {aromatic atom}
{atom} =