We are involved in the development of methods and software in chemoinformatics. Current main projects are: a) machine learning from big datasets of DFT-calculated molecular properties, b) automatic learning of chemical reactivity and properties from the molecular structure, c) simulation of NMR spectra, d) representation of molecular chirality, e) applications of neural networks in Chemometrics.

Ultra fast prediction of DFT bond and atom properties

Automatic learning of chemical reactivity and metabolism

We have developed MOLecular Maps of Atom-level Properties (MOLMAPs) to represent the diversity of chemical bonds existing in a molecule. Chemical reactivity, being related to the ability for bond breaking and bond making, is primarily determined by the properties of the bonds available in a molecule. To use the physicochemical properties of individual bonds to describe an entire molecule while keeping a fixed-length molecular representation, all the bonds of a molecule are mapped onto a fixed-size 2D self-organizing map. The pattern of activated neurons is a map of reactivity features of that molecule (MOLMAP) – a fingerprint of the bonds available in that structure. The MOLMAP descriptors can be used directly for data mining or QSAR studies related to chemical reactivity, in situations involving different types of reaction sites in a single data set, more than one reaction site in a single structure, or unknown reaction sites. The application of MOLMAPs was demonstrated with a QSAR study of phenolic antioxidants (S. Gupta, S. Mathew, P. M. Abreu, J. Aires-de-Sousa, "QSAR analysis of phenolic antioxidants using MOLMAP descriptors of local properties", Bioorg. Med. Chem. 2006, 14 (4), 1199-1206), and with the prediction of mutagenicity (Q.-Y. Zhang, J. Aires-de-Sousa, "Random Forest Prediction of Mutagenicity from Empirical Physicochemical Descriptors", J. Chem. Inf. Model. 2007, 47(1), 1-8).
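The mapping of bonds onto a fixed-size map can be illustrated with a minimal sketch. It assumes an already trained self-organizing map stored as a grid of weight vectors, and it simplifies the activation pattern to a count of bonds per winning neuron; the toy weights, the bond property vectors and the function names are hypothetical and are not taken from the published implementation.

```python
from math import dist


def best_matching_unit(som, vector):
    """Return (row, col) of the neuron whose weights are closest to vector."""
    best, best_d = None, float("inf")
    for r, row in enumerate(som):
        for c, weights in enumerate(row):
            d = dist(weights, vector)
            if d < best_d:
                best, best_d = (r, c), d
    return best


def molmap(som, bond_vectors):
    """Count how many bonds activate each neuron; the result has one
    entry per neuron, so its length does not depend on the molecule."""
    n_cols = len(som[0])
    counts = [0] * (len(som) * n_cols)
    for vector in bond_vectors:
        r, c = best_matching_unit(som, vector)
        counts[r * n_cols + c] += 1
    return counts


# Hypothetical 2x2 map trained on bonds described by two properties.
som = [[(0.0, 0.0), (0.0, 1.0)],
       [(1.0, 0.0), (1.0, 1.0)]]

# Two hypothetical molecules with different numbers of bonds.
mol_a = [(0.1, 0.1), (0.0, 0.9), (0.1, 0.2)]
mol_b = [(0.9, 0.9), (0.1, 0.0), (0.8, 0.1), (1.0, 1.0), (0.2, 0.9)]

print(molmap(som, mol_a))  # [2, 1, 0, 0]
print(molmap(som, mol_b))  # [1, 1, 1, 2]
```

Both molecules yield a descriptor of the same length (four values here) even though they contain different numbers of bonds, which is the property that makes such a descriptor usable as input for QSAR or data mining methods.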

In the course of our studies with MOLMAPs, we recognized that the difference between the MOLMAPs of the products of a reaction and the MOLMAPs of the reactants of the same reaction could be interpreted as a MOLMAP of the reaction - a descriptor of the reaction. One interesting aspect of such an approach is that it allows chemical reactions to be represented without assignment of reaction centers. The application of MOLMAPs to reaction classification was first demonstrated with a data set of photochemical cycloadditions (Q.-Y. Zhang, J. Aires-de-Sousa, "Structure-based classification of chemical reactions without assignment of reaction centers", J. Chem. Inf. Model. 2005, 45(6), 1775-1783).
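As a minimal sketch of this idea, a reaction descriptor can be computed as an element-wise difference of molecular descriptors. The sketch assumes that each molecule is already encoded as a fixed-length list of numbers, and that the descriptors of several reactants (or several products) are summed before subtraction; that summation and the function names are assumptions made for illustration.

```python
def sum_maps(maps):
    """Element-wise sum of equally long molecular descriptors."""
    return [sum(values) for values in zip(*maps)]


def reaction_molmap(reactant_maps, product_maps):
    """Descriptor of a reaction: products minus reactants, element-wise."""
    reactants = sum_maps(reactant_maps)
    products = sum_maps(product_maps)
    return [p - r for p, r in zip(products, reactants)]


# Hypothetical descriptors for a reaction A + B -> C.
reactant_maps = [[2, 1, 0, 0], [0, 0, 1, 0]]
product_maps = [[1, 1, 0, 2]]

print(reaction_molmap(reactant_maps, product_maps))  # [-1, 0, -1, 2]
```

Positions where bonds are unchanged cancel out, so the nonzero entries reflect the bonds lost (negative) and formed (positive), without any explicit labeling of the reaction center.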

A very successful application of MOLMAPs to reaction classification involved a genome-scale classification of metabolic reactions from the KEGG database, and the investigation of its relationship with the traditional EC number system. A communication with preliminary results was published in D. A. R. S. Latino, J. Aires-de-Sousa, "Genome-Scale Classification of Metabolic Reactions: A Chemoinformatics Approach", Angew. Chem. Int. Ed. 2006, 45 (13), 2066-2069. Full details were published in D. A. R. S. Latino, Q.-Y. Zhang, J. Aires-de-Sousa, "Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps", Bioinformatics 2008, 24(19), 2236-2244 and D. A. R. S. Latino, J. Aires-de-Sousa, "Assignment of EC Numbers to Enzymatic Reactions with MOLMAP Reaction Descriptors and Random Forests", J. Chem. Inf. Model. 2009.

MOLMAPs were also used for automatic extraction of knowledge on reactivity patterns from databases of organic reactions: G. Carrera, S. Gupta, J. Aires-de-Sousa, "Machine learning of chemical reactivity from databases of organic reactions", J. Comput. Aided Mol. Des. 2009, 23, 419–429.

The MOLMAP method inspired independent research groups, which adapted it to new applications. Todeschini and co-workers (University of Milano-Bicocca) applied the idea in Chemometrics as a method for the classification of multiway analytical data – Ballabio, D.; Consonni, V.; Todeschini, R. Classification of Multiway Analytical Data Based on MOLMAP Approach. Anal. Chim. Acta 2007, 605(2), 134-146. The same group implemented the method in a collection of MATLAB modules called “MOLMAP multiway toolbox”, available at

In a different direction, Hemmateenejad and co-workers used the MOLMAP strategy in QSAR studies to integrate quantum molecular descriptors calculated for several regions of a molecule – a) Hemmateenejad, B.; Mehdipour, A. R.; Popelier, P. L. A. Quantum Topological QSAR Models Based on the MOLMAP Approach. Chem. Biol. Drug Des. 2008, 72(6), 551-563. b) Hemmateenejad, B.; Mehdipour, A. R.; Miri, R.; Shamsipur, M. Application of MOLMAP Approach for QSAR Modeling of Various Biological Activities Using Substituent Electronic Descriptors. J. Comp. Chem. 2009, 30(13), 2001-2009. These authors reported generally superior results with the MOLMAP approach compared to the simple unfolding of quantum descriptors into one vector.

Simulation of NMR spectra

The simulation of NMR spectra has gained much interest in connection with combinatorial synthesis and high-throughput screening (HTS). The large number of compounds prepared in parallel syntheses needs to be analyzed, and the structures of the products need to be verified. NMR plays an increasingly important role in this endeavor, and the simulation of spectra for comparison with experimental spectra is of high interest. The prediction of NMR spectra is also important for automatic structure elucidation.

We developed a strategy for the fast estimation of NMR chemical shifts of CHn protons, based on automatic knowledge acquisition from a data set of examples (J. Aires-de-Sousa, M. Hemmer, J. Gasteiger, Analytical Chemistry, 2002, 74(1), 80-90). The system was designed so that 3D effects can be learned. The relationship between protons in defined molecular structures and the corresponding 1H NMR chemical shifts was established by counterpropagation neural networks, which used descriptors for hydrogen atoms in organic structures as input, and the chemical shift of the corresponding proton as output.

In order to ensure robustness and generality, various types of descriptors were used, namely topological and empirical physicochemical descriptors. Geometric descriptors were added in some situations to account for stereochemistry and 3D effects.

Genetic algorithms performed the selection of descriptors. The best models yielded very good predictions for an independent prediction set of 259 cases (mean absolute error for whole set = 0.25 ppm, mean absolute error for 90% of cases = 0.19 ppm) and for application cases consisting of four natural products recently described. Some stereochemical effects could be correctly predicted.
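The two error figures quoted above can be illustrated with a short sketch. It assumes that the "90% of cases" figure refers to the 90% of protons with the smallest absolute errors; that reading, the function names and the toy shift values are assumptions made for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over all cases."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)


def mae_best_fraction(predicted, observed, fraction=0.9):
    """Mean absolute error over the given fraction of cases
    that have the smallest absolute errors."""
    errors = sorted(abs(p - o) for p, o in zip(predicted, observed))
    keep = max(1, round(len(errors) * fraction))
    return sum(errors[:keep]) / keep


# Hypothetical chemical shifts in ppm; the last prediction is an outlier.
observed = [1.2, 2.1, 3.5, 7.3, 0.9, 4.1, 2.5, 5.6, 1.8, 3.3]
predicted = [1.3, 2.2, 3.6, 7.4, 1.0, 4.2, 2.6, 5.7, 1.9, 4.4]

print(round(mean_absolute_error(predicted, observed), 2))
print(round(mae_best_fraction(predicted, observed), 2))
```

Reporting both values shows how much of the overall error comes from a small number of poorly predicted protons.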

The method was further improved with feed-forward neural networks, and with additional data incorporated into associative neural networks. A mean average error of 0.19 ppm could be achieved for an independent test set of 952 protons. The new results are available in the papers Y. Binev, J. Aires-de-Sousa, "Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks", J. Chem. Inf. Comp. Sci., 2004, 44(3), 940-945, and Y. Binev, M. Corvo, J. Aires-de-Sousa, "The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks", J. Chem. Inf. Comp. Sci., 2004, 44(3), 946-949.

Later, the associative neural networks trained for the prediction of chemical shifts were applied to the estimation of coupling constants. The new feature enabled the prediction of full spectra: Y. Binev, M.M. Marques, J. Aires-de-Sousa, "Prediction of 1H NMR coupling constants with associative neural networks trained for chemical shifts", J. Chem. Inf. Model. 2007, 47(6), 2089-2097.

The models were implemented in the SPINUS program for the prediction of 1H NMR spectra from the structure. A web interface is available. SPINUS is the provider of NMR predictions for the web service.

Representation of molecular chirality

Molecular chirality is of profound importance in many areas of chemistry. The biological and chemical properties exhibited by opposite enantiomers of chiral compounds are frequently different. This subtle geometrical fact has profound practical consequences in biology, environmental sciences and pharmacology. Particularly, the fact that two enantiomers often have different biological activity makes chirality one of the most important factors in drug safety evaluation today. The chemical and pharmaceutical industries are being forced for safety reasons to commercialize single enantiomers (enantiopure compounds), and drug companies use chirality as a tool for drug life-cycle management, and to redevelop racemic mixtures as single enantiomers (racemic switch).

The development of enantiomerically pure drugs and agrochemicals has become imperative, which demands improved enantioselective methods in organic synthesis, analytical chemistry, separation techniques, and property prediction. Computer applications to predict chiral properties starting from the molecular structure require an adequate representation of molecular chirality.

Particularly in molecular diversity studies and quantitative structure-activity relationships (QSAR) that are influenced by chiral properties, molecular representations incorporating information about chirality are crucial.

Starting with my post-doc with Gasteiger's group, we have developed a chirality code that represents the chirality generated by chiral carbon atoms and is independent of conformation. This code is a molecular transform that represents chirality using a spectrum-like, fixed-length code; it includes information about the geometry of chiral centers, the properties of the atoms in their neighborhoods, and bond lengths, and it distinguishes between enantiomers. Additionally, it was demonstrated that such a code can be successfully applied to the prediction of enantiomeric selectivity in chemical reactions using artificial neural networks (J. Aires-de-Sousa, J. Gasteiger, J. Chem. Inf. Comp. Sci., 2001, 41 (2), 369-375). This code has the advantage of describing chirality without being influenced by the conformation. However, it is restricted to applications in which the chirality arises from a chiral carbon (or at least a chiral atom). For example, it cannot be applied to axially chiral compounds in which a locked conformation generates chirality.

Later we proposed a second chirality code that characterizes the chirality of a 3D structure considered as a rigid set of points (atoms) with properties (atomic properties) and connected by bonds. The code includes information about the molecular geometry (including 3D interatomic distances), connectivity, atomic properties, and can distinguish between enantiomers. It depends on the conformation and has the form of radial distribution functions as used in X-ray structure determination. It was shown that the conformation-dependent chirality code (CDCC) can be correlated by means of Kohonen neural networks with the elution order of enantiomers in chiral chromatographic separations (J. Aires-de-Sousa, J. Gasteiger, J. Molec. Graphics and Model., 2002, 20 (5), 373-388).

Chirality codes were applied to the automatic assignment of absolute configuration from 1D NMR data (Q.-Y. Zhang, G. Carrera, M. J. S. Gomes, J. Aires-de-Sousa, "Automatic assignment of absolute configuration from 1D NMR data", J. Org. Chem. 2005, 70(6), 2120-2130). Chirality codes were also used in the representation of metabolic reactions catalysed by racemases and epimerases of E.C. subclass 5.1. (D. A. R. S. Latino, Q.-Y. Zhang, J. Aires-de-Sousa, "Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps", Bioinformatics 2008, 24(19), 2236-2244).

Physicochemical atomic stereodescriptors (PAS) were proposed in 2006 to represent the chirality of an atomic chiral center on the basis of empirical physicochemical properties of its ligands – the ligands are ranked according to a specific property, and the chiral center takes an “S/R-like” descriptor relative to that property. The procedure is performed for a series of properties, yielding a chirality profile: Q.-Y. Zhang, J. Aires-de-Sousa, "Physicochemical Stereodescriptors of Atomic Chiral Centers", J. Chem. Inf. Model. 2006, 46(6), 2278-2287.
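The ranking procedure can be sketched as follows. The sketch assumes that the four ligands are given as 3D coordinates together with property values, and that the "S/R-like" sense is taken from the sign of a signed volume computed with the three highest-ranked ligands viewed with the lowest-ranked ligand pointing away, by analogy with the usual R/S convention. The property names, the data layout and the handling of ties are hypothetical and are not taken from the published method.

```python
def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pas_descriptor(ligands, prop):
    """Rank four ligands by one property and return 'R-like' or 'S-like'.
    Returns None when ties make the ranking undefined (an assumption)."""
    if len({lig[prop] for lig in ligands}) < 4:
        return None
    a, b, c, d = sorted(ligands, key=lambda lig: lig[prop], reverse=True)
    volume = _dot(_sub(a["xyz"], d["xyz"]),
                  _cross(_sub(b["xyz"], d["xyz"]), _sub(c["xyz"], d["xyz"])))
    return "R-like" if volume < 0 else "S-like"


def chirality_profile(ligands, props):
    """One descriptor per property, for a single chiral center."""
    return {prop: pas_descriptor(ligands, prop) for prop in props}


# Hypothetical chiral center at the origin with four ligands.
ligands = [
    {"xyz": (0.0, 1.0, 0.33), "charge": 4, "size": 3},
    {"xyz": (0.87, -0.5, 0.33), "charge": 3, "size": 4},
    {"xyz": (-0.87, -0.5, 0.33), "charge": 2, "size": 2},
    {"xyz": (0.0, 0.0, -1.0), "charge": 1, "size": 1},
]

print(chirality_profile(ligands, ["charge", "size"]))
# {'charge': 'R-like', 'size': 'S-like'}
```

The same center receives opposite descriptors for the two properties because the two rankings order the first two ligands differently, which is how a series of properties yields a profile rather than a single label.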

Applications of neural networks in Chemometrics

In the analysis of an environmental disaster caused by spillage of crude oil, limiting the possible sources to a few geographical origins can help to identify the polluting vessel from a group of potential candidates. In a collaboration with Instituto Hidrográfico (Lisbon, Portugal), SOMs were trained to predict the geographical origin of crude oil samples from analytical parameters obtained by GC-MS. These parameters had been designed at Instituto Hidrográfico in order to be robust to environmental conditions (“weathering”). In the M.Sc. thesis of Ana M. Fonseca, data sets with 188 and 374 samples from 20 geographical origins were used. It was possible to correctly classify 60-70% of independent test sets, and to obtain robust estimations of the prediction confidence: A. M. Fonseca, J. L. Biscaya, J. Aires-de-Sousa, A. M. Lobo, "Geographical classification of crude oils by Kohonen self-organizing maps", Anal. Chim. Acta 2006, 556 (2), 374-382.

Later, the impact of weathering on the classification capabilities of SOMs trained to predict the geographical origin of crude oils was studied. Samples were weathered in simulated environmental conditions. We have shown that it is possible to train SOMs with non-weathered samples, which can be successfully applied to weathered samples using as descriptors ratios of contents of specific compounds: C. Borges, M. P. Gomez-Carracedo, J. M. Andrade, M. F. Duarte, J. L. Biscaya, J. Aires-de-Sousa, "Geographical classification of weathered crude oil samples with unsupervised self-organizing maps and a consensus criterion", Chemom. Intell. Lab. Syst. 2010, 101(1), 43-55.


Former projects

Representation of DNA sequences with virtual potentials - the SEQREP code

Automatic analysis of biological sequences is a hot research topic today. Efficient sequencing techniques are producing huge amounts of biological data that must be processed by computers. The representation of sequences is a crucial step, transforming the data into 'computer-friendly' numbers. We proposed representing individual positions in DNA sequences by virtual potentials generated by other bases of the same sequence. This is a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can be used as a representation of the entire sequence (SEQREP code). It is a flexible code with a length independent of the sequence size; it does not require previous alignment, and it is convenient for processing by neural networks or statistical techniques.

To evaluate its biological significance, the SEQREP code was used for training Kohonen self-organizing maps (SOMs) in two applications: (a) detection of Alu sequences, and (b) classification of sequences encoding for HIV-1 envelope glycoprotein (env) into subtypes A-G. It was demonstrated that SOMs clustered sequences belonging to different classes into distinct regions. For independent test sets, very high rates of correct predictions were obtained (97% in the first application, 91% in the second). Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment.

The method and applications are published in Bioinformatics 2003, 19(1), 30-36. Software for representing sequences by SEQREP code, and for training Kohonen SOMs, is made freely available.

Ph.D. Research

During my Ph.D. research in Prabhakar and Lobo's group, I worked in synthetic organic chemistry. Before I started, a new reaction yielding N-aryl aziridines had been discovered in the group. The method uses hydroxamic acids and olefins as starting material. My task was to shift the method into an enantioselective route to aziridines. Among other approaches, chiral phase-transfer catalysis by quaternary salts of cinchonine in a heterogeneous medium (organic phase / aqueous base) involving non-chiral reagents allowed the isolation of aziridines with e.e. up to 62%. The various factors that influence the reaction were studied, namely the structure of olefin, hydroxamic acid and catalyst, base, solvent and temperature.

This work is published in J. Aires-de-Sousa, A.M. Lobo, S. Prabhakar, "A New Enantioselective Synthesis of N-arylaziridines by Phase-Transfer Catalysts", Tetrahedron Letters, 1996, 37, 3183-3186 and J. Aires-de-Sousa, S. Prabhakar, A. M. Lobo, A. M. Rosa, M. J. S. Gomes, M. C. Corvo, D. J. Williams, and A. J. P. White, "Asymmetric Synthesis of N-Aryl Aziridines", Tetrahedron: Asymmetry 2002, 12 (24), 3349-3365.

Verification of wine origin using neural networks

In 1995-1996 I tested back-propagation neural networks for the prediction of wine origin. The method was based on chemical analysis of wines (amino acids or anthocyanins). The neural networks were able to find relationships between chemical parameters and wine origin in two different situations.

More information is available in J. Aires-de-Sousa, "Verifying Wine Origin: A Neural Network Approach", American Journal of Enology and Viticulture, 1996, 47 (4), 410-414.