SPINUS (Structure-based Predictions In NUclear magnetic resonance Spectroscopy) is an ongoing project for the development of structure-based tools for fast prediction of NMR spectra. SPINUS - WEB currently accepts molecular structures via a Java molecular editor, and estimates 1H NMR chemical shifts and coupling constants. The predictions are obtained from ensembles of previously trained feed-forward neural networks, and corrected with data from an additional memory.
Every time a molecule is submitted to SPINUS - WEB, a set of geometric, physicochemical and topological properties is used as descriptors for every proton bonded to a carbon atom. Geometric descriptors are based on the 3D geometry generated by CORINA, physicochemical descriptors are obtained from properties calculated by PETRA, and other descriptors are derived from the connectivity table.
Ensembles of Feed-Forward Neural Networks (FFNN) are then used, which were previously trained to give the chemical shift as the output when the proton descriptors are submitted as input.
Selection of descriptors was performed by stepwise removal of correlated descriptors. More compact models were thus obtained, yielding the same or even better predictions than the models using all possible descriptors. The optimal number (and the corresponding set) of descriptors was chosen according to the predictions obtained for a cross-validation set.

Before descriptors are calculated, SPINUS - WEB classifies each proton into one of four possible classes: a) aromatic, if it is bonded to an aromatic system; b) non-aromatic pi, if it is bonded to a non-aromatic pi system; c) rigid aliphatic, if a non-rotatable bond is identified in the second sphere of bonds centered on the proton; and d) non-rigid aliphatic, if not included in the previous classes. A bond is defined as non-rotatable if it belongs to a ring, to a pi system or to an amide functional group. Each class of protons uses its own specific descriptors and its own ensemble of neural networks: for each class, an ensemble of FFNNs was trained. The prediction of the chemical shift for one proton is obtained as the average of the outputs from the ensemble (the uncorrected prediction).
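The classification rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the SPINUS code: the `Bond` structure and the function names are assumptions, and a real implementation would derive the flags from the connectivity table.

```python
from dataclasses import dataclass

@dataclass
class Bond:
    # Hypothetical minimal bond record; flags would come from the connectivity table.
    in_ring: bool = False
    in_pi_system: bool = False
    in_amide: bool = False

def non_rotatable(bond: Bond) -> bool:
    # Per the text: bonds in a ring, in a pi system, or in an amide group are non-rotatable.
    return bond.in_ring or bond.in_pi_system or bond.in_amide

def classify_proton(carbon_is_aromatic: bool,
                    carbon_in_pi_system: bool,
                    second_sphere_bonds: list) -> str:
    """Assign a C-H proton to one of the four SPINUS classes."""
    if carbon_is_aromatic:
        return "aromatic"
    if carbon_in_pi_system:
        return "non-aromatic pi"
    # A non-rotatable bond in the second sphere of bonds makes the proton rigid aliphatic.
    if any(non_rotatable(b) for b in second_sphere_bonds):
        return "rigid aliphatic"
    return "non-rigid aliphatic"
```

Each of the four return values then selects a class-specific descriptor set and ensemble of networks.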
The predictions are finally corrected using data from the additional memory. The k most similar protons in the memory are found, and the average deviation between the uncorrected predictions for these k protons and their experimental chemical shifts is calculated. This average is the correction added to the uncorrected prediction of the query proton.
Description of FFNNs
Feed-Forward Neural Networks (FFNNs), also called backpropagation (BPG) neural networks, are multidimensional functions with many parameters (weights) that are iteratively adjusted during the training process. A FFNN reacts to a set of stimuli (the input) with a result (the output). The output depends on the weights, each weight being associated with a connection between two neurons of different layers. During training, the weights are adjusted so that the network gives the desired output for a given input; it is usually said that the network learns from the examples of the training set. Once trained, a network is able to give outputs for new stimuli. In SPINUS - WEB, the proton descriptors are the stimuli and the chemical shift is the output.
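A single forward pass of such a network can be sketched as follows. This is a generic one-hidden-layer FFNN with sigmoid hidden units, not the SPINUS architecture; the function name and weight layout are assumptions for illustration.

```python
import math

def forward(inputs, hidden_weights, output_weights):
    """One forward pass of a one-hidden-layer FFNN.

    hidden_weights: one weight list per hidden neuron, last element = bias.
    output_weights: weights for the single output neuron, last element = bias.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Each hidden neuron combines all inputs through its connection weights.
    hidden = [sigmoid(sum(w * x for w, x in zip(wv, inputs)) + wv[-1])
              for wv in hidden_weights]
    # The output neuron combines the hidden activations; in SPINUS this value
    # would correspond to the predicted chemical shift (after output scaling).
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_weights[-1]
```

Training (backpropagation) iteratively adjusts `hidden_weights` and `output_weights` to minimize the error between this output and the experimental chemical shifts of the training set.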
Development of the models
A data set of 18,122 chemical shifts was used, corresponding to 5,230 protons bonded to aromatic systems, 872 protons bonded to π non-aromatic systems, 5,320 protons bonded to non-rigid aliphatic substructures, and 6,700 protons bonded to rigid aliphatic substructures. These include data manually collected in our lab from the literature and proprietary data made available by Molecular Networks GmbH (Erlangen, Germany). They were retrieved from spectra measured mostly in CDCl3, but also in DMSO-d6 and, in a very few cases, in D2O, pyridine-d5, CD2Cl2, CD3OD, and benzene-d6. A data set of 618 1H-1H coupling constants was manually retrieved from the literature. The development of the model included a) selection of training sets, cross-validation sets, and additional memory for ASNN; b) decision on when to stop the training to avoid overfitting; c) selection of descriptors by analysis of impact on FFNN output; d) choice of the optimum number of hidden neurons.
Ensembles of FF neural nets
In order to obtain more robust and stable predictions, 75 FFNNs with the same architecture were trained for each class of protons (only the initial partition of objects between the training and cross-validation sets differed from net to net). The prediction is then obtained as the average of the 75 outputs. The series of 75 outputs is called the output profile.
Correction of predictions using Associative Neural Networks (ASNN)
The knowledge contained in an additional memory of experimental chemical shifts (and the corresponding molecular structures) was coupled to the ensembles of FFNNs to improve their predictions; no retraining of the nets is required. When a query proton is submitted to the system, a prediction is first obtained from the ensemble of FFNNs - the uncorrected prediction. The additional memory is then searched for the k protons most similar to the query in the output space, i.e. the cases in the memory with the output profiles most similar to that of the query proton (the term output profile is used because several outputs are obtained for the same proton, one from each FFNN of the ensemble). The errors of the ensemble predictions for these k protons are averaged, and this average is the correction term added to the prediction of the query proton. This combination of FFNNs and an additional memory (ASNN) was introduced by Igor Tetko (I. V. Tetko, "Neural Network Studies. 4. Introduction to Associative Neural Networks", J. Chem. Inf. Comput. Sci. 2002, 42, 717-728). The ASNN program for associative neural networks is available from VCCLAB.
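The correction step can be sketched as follows. This is a simplification for illustration: similarity in the output space is measured here by Euclidean distance between output profiles, whereas Tetko's ASNN uses a rank-correlation measure, and the data layout is assumed.

```python
def asnn_correct(query_profile, query_prediction, memory, k=3):
    """ASNN-style correction of an ensemble prediction (sketch).

    `memory` holds one tuple per memory proton:
    (output_profile, uncorrected_prediction, experimental_shift).
    """
    def distance(p, q):
        # Euclidean distance between two output profiles (simplified similarity).
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # The k cases in the memory with the most similar output profiles.
    neighbours = sorted(memory, key=lambda m: distance(m[0], query_profile))[:k]
    # Average deviation between experiment and uncorrected prediction
    # for those k protons = the correction term.
    correction = sum(exp - pred for _, pred, exp in neighbours) / len(neighbours)
    return query_prediction + correction
```

Because the correction only reads the memory, new experimental data can improve the predictions without retraining any network.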
1H-1H Coupling constants are estimated by a modified ASNN procedure, from the most similar pairs of coupled protons found in a second memory of experimental coupling constants. This search is performed in the output space of the FFNNs ensemble previously trained for chemical shifts.
Quality of predictions
For a data set of 100 independent structures representing a wide variety of structural features, SPINUS gave the following average errors for chemical shifts: 0.16 ppm for the non-rigid aliphatic class, 0.23 ppm for the aromatic class, 0.35 ppm for the non-aromatic pi class, and 0.29 ppm for the rigid aliphatic class. A global average error of 0.23 ppm was obtained for the 952 predictions. A global error of ca. 0.6 Hz was observed for 1H-1H coupling constants.
The current version is based on the models described in Y. Binev, M.M. Marques, J. Aires-de-Sousa, "Prediction of 1H NMR coupling constants with associative neural networks trained for chemical shifts", J. Chem. Inf. Model. 2007, 47(6), 2089-2097.
In J. Aires-de-Sousa, M. Hemmer, J. Gasteiger, "Prediction of 1H NMR Chemical Shifts Using Neural Networks", Analytical Chemistry, 2002, 74(1), 80-90, most of the proton descriptors are explained. In that work they were used for the prediction of 1H NMR chemical shifts by counterpropagation neural networks. In Y. Binev, J. Aires-de-Sousa, "Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks", J. Chem. Inf. Comput. Sci., 2004, 44(3), 940-945, the development of the FFNNs and the selection of descriptors are explained, and in Y. Binev, M. Corvo, J. Aires-de-Sousa, "The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks", J. Chem. Inf. Comput. Sci., 2004, 44(3), 946-949, the use of an additional memory is described. Prediction of coupling constants and full-spectrum simulation is described in Y. Binev, M.M. Marques, J. Aires-de-Sousa, "Prediction of 1H NMR coupling constants with associative neural networks trained for chemical shifts", J. Chem. Inf. Model. 2007, 47(6), 2089-2097.
Method development: Johann Gasteiger, João Aires de Sousa, Markus C. Hemmer, and Yuri Binev.
Collaboration on testing: Marta Corvo.
Web interface: João Aires de Sousa, and Yuri Binev.