Background

SPINUS (Structure-based Predictions In NUclear magnetic resonance Spectroscopy) is an ongoing project for the development of structure-based tools for fast prediction of NMR spectra. SPINUS - WEB currently accepts molecular structures via a Java molecular editor and estimates 1H NMR chemical shifts. The predictions are obtained from ensembles of previously trained feed-forward neural networks and are then corrected with data from an additional memory.


Chemical shift prediction overview

Summary

Every time a molecule is submitted to SPINUS - WEB, a set of geometric, physicochemical and topological properties is used as descriptors for every proton bonded to a carbon atom. Geometric descriptors are based on the 3D geometry generated by CORINA, physicochemical descriptors are obtained from properties calculated by PETRA, and other descriptors are derived from the connectivity table.

Ensembles of previously trained Feed-Forward Neural Networks (FFNNs) then take the proton descriptors as input and give the chemical shift as output.

Descriptors were selected by stepwise removal of correlated descriptors. This gave more compact models that predicted as well as, or even better than, the models using all possible descriptors. The optimal number of descriptors (and the corresponding set) was chosen according to the predictions obtained for a cross-validation set.
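The idea of stepwise removal of correlated descriptors can be pictured with the minimal sketch below. It is not the SPINUS procedure: the fixed correlation threshold, the rule for which descriptor of a pair to drop, and all names (`pearson`, `remove_correlated`, `d1` to `d3`) are assumptions for illustration. In SPINUS the final set was chosen from the cross-validation predictions rather than from a fixed threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(vx * vy)


def remove_correlated(columns, threshold=0.9):
    """Drop one descriptor per step until no pair exceeds the threshold.

    columns maps a descriptor name to its values (one value per proton).
    At each step the most strongly correlated pair is located and the
    second descriptor of that pair is removed (an arbitrary choice made
    for this sketch).
    """
    kept = dict(columns)
    while True:
        names = list(kept)
        worst, worst_pair = threshold, None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                r = abs(pearson(kept[a], kept[b]))
                if r > worst:
                    worst, worst_pair = r, (a, b)
        if worst_pair is None:
            return list(kept)
        del kept[worst_pair[1]]


# Hypothetical descriptor values for four protons.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.1],    # nearly proportional to d1
    "d3": [1.0, -1.0, 1.0, -1.0],
}
selected = remove_correlated(descriptors)
```

Here `d2` is removed because it is almost perfectly correlated with `d1`, while `d3` is kept.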

Before descriptors are calculated, SPINUS - WEB classifies each proton into one of four classes: a) aromatic, if it is bonded to an aromatic system; b) non-aromatic pi, if it is bonded to a non-aromatic pi system; c) rigid aliphatic, if a non-rotatable bond is identified in the second sphere of bonds centered on the proton; and d) non-rigid aliphatic, if it is not included in the previous classes. A bond is defined as non-rotatable if it belongs to a ring, to a pi system or to an amide functional group. Each class of protons uses its own specific descriptors and its own previously trained ensemble of FFNNs. The prediction of the chemical shift for one proton is obtained as the average of the outputs from the ensemble (the uncorrected prediction).
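The four-class rule can be written as a short decision function. This is only a sketch: the boolean flags are hypothetical inputs that would have to come from a structure-perception step not shown here, and checking the classes in the order a) to d) is an assumption based on the wording of class d).

```python
def is_non_rotatable(in_ring, in_pi_system, in_amide):
    """A bond is non-rotatable if it belongs to a ring, a pi system or an amide group."""
    return in_ring or in_pi_system or in_amide


def classify_proton(on_aromatic, on_nonaromatic_pi, rigid_bond_in_second_sphere):
    """Assign a proton bonded to carbon to one of the four classes.

    The flags are assumed to be precomputed:
      on_aromatic                  proton is bonded to an aromatic system
      on_nonaromatic_pi            proton is bonded to a non-aromatic pi system
      rigid_bond_in_second_sphere  a non-rotatable bond lies in the second
                                   sphere of bonds centered on the proton
    """
    if on_aromatic:
        return "aromatic"
    if on_nonaromatic_pi:
        return "non-aromatic pi"
    if rigid_bond_in_second_sphere:
        return "rigid aliphatic"
    return "non-rigid aliphatic"


# A proton whose second-sphere bond belongs to a ring is a rigid aliphatic proton.
ring_bond = is_non_rotatable(in_ring=True, in_pi_system=False, in_amide=False)
example_class = classify_proton(False, False, ring_bond)
```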

The predictions are finally corrected using data from the additional memory. The k most similar protons in the memory are found, and an average deviation between the uncorrected predictions for these k protons and their experimental chemical shifts is calculated. This is the correction to be added to the uncorrected prediction of the query proton.

Description of FFNNs
Feed-Forward Neural Networks (FFNNs), also called Backpropagation (BPG) neural networks, are multidimensional functions with many parameters (weights), which are iteratively adjusted during the training process. An FFNN reacts to a set of stimuli (input) with a result (output). The output depends on the weights, each weight being associated with a connection between two neurons of different layers. During the training, the weights are adjusted so that the network gives the desired output for a given input. It is usually said that during the training the network learns from the examples of the training set. Once trained, a network is able to give outputs for new stimuli. In SPINUS - WEB, the proton descriptors are the stimuli and the chemical shift is the output.
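To make the input, weights and output concrete, here is a minimal sketch of the forward pass of a network with one hidden layer. The sigmoid activation, the linear output neuron and the weight values are assumptions chosen for illustration; the text above does not state the architecture details used in SPINUS, and training (the adjustment of the weights) is not shown.

```python
from math import exp


def sigmoid(x):
    """Logistic activation function (assumed for this sketch)."""
    return 1.0 / (1.0 + exp(-x))


def ffnn_output(descriptors, hidden_weights, hidden_biases,
                output_weights, output_bias):
    """Compute the output of a one-hidden-layer feed-forward network.

    hidden_weights holds one list of weights per hidden neuron, with one
    weight per input descriptor. The output neuron is linear.
    """
    hidden = [
        sigmoid(sum(w * d for w, d in zip(weights, descriptors)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network: two descriptors, two hidden neurons.
# With all hidden weights at zero, each hidden neuron outputs sigmoid(0) = 0.5.
shift = ffnn_output(
    descriptors=[0.3, -1.2],
    hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, 4.0],
    output_bias=1.0,
)
```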


Development of the models
A data set of 1003 experimental chemical shifts taken from 151 molecular structures was used for training the neural networks and for optimizing the models (cross-validation). The development of the models included a) deciding when to stop the training to avoid overfitting; b) selecting descriptors by stepwise removal of correlations; and c) choosing the optimum number of hidden neurons.

Ensembles of FF neural nets
In order to obtain more robust and stable predictions, 50 FFNNs were trained for each class of protons, with the same architecture (only the initial partition of objects between the training and cross-validation sets was different for each net). The prediction is then obtained as the average of the 50 outputs. The series of the 50 outputs is called the output profile.
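The relation between the ensemble, the output profile and the averaged prediction can be sketched as follows. The networks are stood in for by plain Python callables with made-up outputs; only the averaging step reflects the description above.

```python
def ensemble_predict(networks, descriptors):
    """Return the averaged prediction and the output profile of an ensemble.

    networks is a list of callables, each mapping a descriptor vector to a
    chemical shift. The output profile is the list of individual outputs,
    in the same order as the networks.
    """
    profile = [net(descriptors) for net in networks]
    prediction = sum(profile) / len(profile)
    return prediction, profile


# Three hypothetical "networks" instead of 50, each returning a fixed value.
toy_networks = [lambda d: 7.2, lambda d: 7.3, lambda d: 7.4]
prediction, profile = ensemble_predict(toy_networks, descriptors=[0.3, -1.2])
```

Keeping the profile alongside the average matters because the profile, not only the mean, is what the memory-based correction compares between protons.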

Correction of predictions using Associative Neural Networks (ASNN)
The knowledge contained in an additional memory of experimental chemical shifts (and the corresponding molecular structures) was bound to the ensembles of FFNNs to improve their predictions. No retraining of the nets is required. When a query proton is submitted to the system, the ensemble of FFNNs gives a prediction (the uncorrected prediction). The additional memory is then searched for the k protons most similar to the query. Similarity is measured in the output space, i.e. the k most similar protons are the cases in the memory whose output profile is most similar to that of the query proton. The term output profile is used because a number of outputs are obtained for the same proton, one from each FFNN of the ensemble. The errors of the ensemble predictions for the k most similar protons are averaged, and this average is the correction term added to the prediction of the query proton. This combination of FFNNs and an additional memory (ASNN) was introduced by Igor Tetko (I. V. Tetko. Neural Network Studies. 4. Introduction to Associative Neural Networks. J. Chem. Inf. Comput. Sci. 2002, 42, 717-728). The ASNN program for associative neural networks is available from VCCLAB.
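The correction step can be sketched as a nearest-neighbour search over output profiles. Two points are assumptions, since the text above only speaks of the "most similar output profile" and of "errors": similarity is measured here by Euclidean distance, and the error of a memory proton is taken as experimental shift minus ensemble prediction, so that adding the average error moves the prediction toward experiment. The function name, the value of k and the memory contents are hypothetical.

```python
from math import dist


def asnn_correct(query_profile, memory, k=3):
    """Correct an ensemble prediction using the k most similar memory protons.

    query_profile: outputs of the ensemble for the query proton.
    memory: list of (output_profile, experimental_shift) pairs.
    Returns the corrected chemical shift.
    """
    uncorrected = sum(query_profile) / len(query_profile)
    # Nearest neighbours in output space (Euclidean distance is assumed).
    neighbours = sorted(memory, key=lambda item: dist(item[0], query_profile))[:k]
    if not neighbours:
        return uncorrected  # empty memory: nothing to correct with
    # Error of each neighbour: experimental shift minus its ensemble average.
    errors = [shift - sum(profile) / len(profile) for profile, shift in neighbours]
    correction = sum(errors) / len(errors)
    return uncorrected + correction


# Hypothetical memory with two-network output profiles.
memory = [
    ([7.0, 7.2], 7.3),
    ([7.1, 7.3], 7.4),
    ([1.0, 1.2], 0.9),
]
corrected = asnn_correct([7.0, 7.2], memory, k=2)
```

In this toy case the ensemble underestimates both neighbours by 0.2 ppm, so the uncorrected value of 7.1 ppm is shifted to about 7.3 ppm. New experimental data can be added to `memory` without retraining any network.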

ASNN block schema

Coupling constants are estimated by a modified ASNN procedure, from the most similar pairs of coupled protons found in a second memory of experimental coupling constants. This search is performed in the output space of the FFNN ensemble previously trained for chemical shifts.

Prediction of coupling constants schema

Quality of predictions
For a data set of 100 independent structures representing a wide variety of structural features, SPINUS gave the following average errors: 0.13 ppm for the aliphatic class, 0.18 ppm for the aromatic class, 0.19 ppm for the non-aromatic pi class, and 0.30 ppm for the rigid aliphatic class. A global average error of 0.19 ppm was obtained for the 952 predictions.

Web implementation
The system currently implemented in this web service uses an additional memory of 8,500 experimental chemical shifts. Although this memory is not exactly the same as that used for the studies described in the reference papers of SPINUS (see below), the methods are exactly the same.

References
Most of the proton descriptors are explained in J. Aires-de-Sousa, M. Hemmer, J. Gasteiger, "Prediction of 1H NMR Chemical Shifts Using Neural Networks", Analytical Chemistry, 2002, 74(1), 80-90. In that work they were used for the prediction of 1H NMR chemical shifts by counterpropagation neural networks. The development of the FFNNs and the selection of descriptors are explained in Y. Binev, J. Aires-de-Sousa, "Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks", J. Chem. Inf. Comput. Sci., 2004, 44(3), 940-945. The use of an additional memory is described in Y. Binev, M. Corvo, J. Aires-de-Sousa, "The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks", J. Chem. Inf. Comput. Sci., 2004, 44(3), 946-949.

Credits
Method development: Johann Gasteiger, João Aires de Sousa, Markus C. Hemmer, and Yuri Binev.
Collaboration on testing: Marta Corvo.
Web interface: João Aires de Sousa and Yuri Binev.


Last updated 14 December 2006