Technical Description

Literature Review

Molecules are typical examples of unstructured data, for which tasks such as searching, sorting, analyzing and extracting knowledge are challenging. A molecule can have arbitrary size, structure and composition; moreover, there is no single, unambiguous way of encoding and comparing molecules. Several computational tools have been developed over the years to address this issue. Two fundamental observations explain the large number of methods developed to compare molecules: similarity depends on context [bender2004], and any representation of a molecular structure implies information loss. The concept of molecular similarity provides an important approach to searching databases, predicting properties of compounds, designing structures with a predefined set of properties and conducting structure-based drug design studies [willett2005, eckert2007]. These studies rest on the “neighborhood” premise, which states that similar molecules usually have similar activities and properties [bender2004].

Defining similarity between molecules involves comparing chemical structures, which requires both representing the molecules and quantifying the similarity between the representations. Various methods to define structural similarity between molecules are available in the literature [nikolova2003, bender2004, TeixFalc2013]. The most popular approaches to representing the structures under comparison fall into three broad categories: approaches based on structural descriptors (two- and three-dimensional), on molecular fragments, and on graph matching (descriptor-independent methods). A descriptor positions each abstract molecular representation in a descriptor space. Molecules can then be compared on the assumption that the distance between their representations reflects their similarity in that specific descriptor space [bender2004]. Molecular similarity is a nonlinear problem for which no single set of descriptors or similarity measure correlates with every context in which comparisons may be performed [todeschini2009, bender2009].
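
The idea that closeness in a descriptor space stands in for molecular similarity can be illustrated with a minimal sketch. It assumes that each molecule is encoded as a set of fragment identifiers (the fragment labels below are hypothetical and chosen only for illustration) and uses the Tanimoto coefficient, a widely used fingerprint similarity measure. It is not the NAMS algorithm.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fragment sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)


# Hypothetical fragment fingerprints, for illustration only.
ethanol = {"C-C", "C-O", "O-H"}
propanol = {"C-C", "C-O", "O-H", "C-C-C"}
ethane = {"C-C"}

print(tanimoto(ethanol, propanol))  # 3 shared fragments out of 4 -> 0.75
print(tanimoto(ethanol, ethane))    # 1 shared fragment out of 3
```

A different fragment vocabulary would change both values, which is one concrete sense in which similarity depends on the chosen representation.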

A commonly used approach to predicting chemical, physical and/or biological properties of chemical compounds relies on the structure of the molecule, applying data mining methods through quantitative structure-property/activity relationships (QSPR/QSAR) [katritzky2000, katritzky2002, doucet2011]. The three major difficulties in the development of QSPR/QSAR models are (1) quantifying the inherently abstract molecular structure, (2) determining which structural features most influence the given property (representation problem) [liu2004, gonzalez2008, teixeira2013] and (3) establishing and validating the functional relationships that most accurately describe the relationship between structural descriptors and the property/activity data (mapping problem) [tropsha2007, puzyn2009, dearden2009, tropsha2010]. Furthermore, it is acknowledged that no single model can provide reliable predictions for all possible compounds [tetko2009]. Classical QSPR/QSAR approaches have several shortcomings: (1) the predictive power of the model depends heavily on the selection of predictor variables and on the presence of correlation between them, (2) the predictive capacity of the model is limited by the molecular diversity and distribution of the molecules in the training set [oprea2001], and (3) the models need to be retrained every time compounds are added or removed. Nevertheless, [TeixFalc2014], using a kriging-based approach over NAMS, questioned most of these assumptions and showed that it is possible to produce inference models that require no relearning, provide an individual error estimate for each prediction and yield reasonable estimates even for widely different molecules. In addition, the initial conclusions of [Martin200] and [Nikolova2003], that compounds similar to known active molecules are themselves active far less frequently than one might expect, have been challenged by NAMS [TeixFalc2013, TeixFalc2014].

A molecule can also be represented, using graph theory, as a labeled graph whose vertices correspond to atoms and whose edges correspond to covalent bonds. Representing molecules as graphs has two advantages: graphs are intuitive, since they are close to our understanding of a molecule, and they rest on a solid mathematical background, with several existing techniques for comparing labeled graphs [ehrlich2011]. However, this representation raises an important issue: identical graphs do not necessarily represent identical structures, and vice versa [ehrlich2011].
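
A minimal sketch of the labeled-graph representation makes the last point concrete. It assumes a simple heavy-atom encoding (atom labels plus bonds with an integer bond order); the `MolGraph` type and `make_graph` helper are hypothetical names introduced only for this illustration. Because the connectivity graph carries no geometry, cis- and trans-2-butene yield the same graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MolGraph:
    atoms: tuple      # atom labels, indexed by position
    bonds: frozenset  # {(i, j, order)} with i < j


def make_graph(atoms, bonds):
    """Build a labeled graph; bond endpoints are stored in sorted order."""
    norm = frozenset((min(i, j), max(i, j), order) for i, j, order in bonds)
    return MolGraph(tuple(atoms), norm)


# Heavy-atom graph of 2-butene: C0-C1=C2-C3.
# The two stereoisomers differ only in 3D arrangement around the double bond.
cis_2_butene = make_graph(["C", "C", "C", "C"],
                          [(0, 1, 1), (1, 2, 2), (2, 3, 1)])
trans_2_butene = make_graph(["C", "C", "C", "C"],
                            [(2, 1, 2), (1, 0, 1), (3, 2, 1)])

print(cis_2_butene == trans_2_butene)  # True: the graphs are identical
```

Distinguishing the two isomers requires information beyond the plain graph, such as stereochemical labels on atoms or bonds.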

The goal of finding common substructures for property inference has been pursued by several authors [kawabata2011, rahman2009, batista2006]. Despite the NP-completeness of the problem [garey1979], many approximate heuristics have been proposed to overcome this complexity [ehrlich2011]. The approach followed by [TeixFalc2013] differs in that it makes no assumptions about any structural components of a molecule and is able to consider characteristics that graph theory does not directly address, namely chirality and cis-trans isomerism. A reliable structural matching algorithm is only part of the solution, as the detected similarity must then be used for prediction. This has been accomplished by coupling NAMS with kriging, a metric-space-based inference method [TeixFalc2014]. Kriging models have been used previously in chemoinformatics: [fang2004, hawe2010, sun2011] showed that kriging models were able to outperform other methods in the development of predictive models of pharmacological properties. However, all of these studies made explicit use of chemical descriptors chosen arbitrarily according to the nature of the problem.
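
The way kriging turns pairwise similarities into a prediction with its own error estimate can be sketched as follows. This is a simplified illustration, not the method of [TeixFalc2014]: it assumes simple kriging with a known mean, uses the similarity matrix directly as a covariance with unit self-similarity, and assumes that matrix is positive definite. All numeric values are invented for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def simple_kriging(K, k, y, mean):
    """K: similarities among known molecules; k: similarities of the query
    to the known molecules; y: known property values; mean: assumed mean."""
    w = solve(K, k)  # kriging weights
    estimate = mean + sum(wi * (yi - mean) for wi, yi in zip(w, y))
    variance = 1.0 - sum(wi * ki for wi, ki in zip(w, k))  # unit self-similarity
    return estimate, variance


K = [[1.0, 0.5], [0.5, 1.0]]  # two known molecules (illustrative values)
y = [2.0, 4.0]                # their measured property

# A query identical to the first known molecule is reproduced exactly.
print(simple_kriging(K, [1.0, 0.5], y, mean=3.0))  # (2.0, 0.0)
# A less similar query is pulled toward the mean and gets a larger variance.
print(simple_kriging(K, [0.8, 0.4], y, mean=3.0))
```

Adding a known molecule only adds a row and column to `K`; nothing has to be refitted, which is the sense in which such a model requires no relearning.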

Plan and Methods

The central goal of this proposal is to improve the molecular similarity algorithm NAMS [TeixFalc2013] by providing topological enhancements to its main atom-matching algorithm. The enhanced tool is intended, first, to give pharmacologists and molecular biologists a direct understanding of the relevant components in a family of active compounds and, second, to serve as a standalone tool for the screening phases of drug-development programs. NAMS was designed to compare full molecules, thereby touching on the graph isomorphism problem and providing a highly reliable polynomial-time solution. However, for a molecule to function as a drug, often only specific parts of its structure (not necessarily contiguous) are required. It is therefore envisaged that NAMS can be modified to allow the weighting of specific compositional and structural elements deemed essential for understanding a molecule's pharmacological properties. This proposal aims to extend NAMS so that the algorithm can differentiate between parts of a molecule, allowing the discovery of the most relevant parts as well as their differential amplification for molecular property prediction and drug inference. Unlike other methods, NAMS does not depend on any type of chemical descriptor and can conceivably be used in any chemical property prediction problem. The current inference engine [TeixFalc2014] is based on kriging over the global similarity provided by NAMS. Although it delivers results on a par with the best state-of-the-art QSAR algorithms while using virtually no information other than the molecular structure, we believe that a topological differentiation mechanism will be key to extending NAMS to even more diverse molecules that share pharmacological characteristics.

The current version of the inference engine requires no learning, as it is a kriging algorithm that takes advantage of the molecular metric space. The new version is further expected to assess directly the relevant topological characteristics of the most active known molecules and to use this knowledge to retrieve, from large molecular databases, the compounds most likely to have the desired characteristics.

The research team for the current project brings together computer scientists, biochemists, pharmacologists and molecular biologists with world-leading expertise in the fields of cheminformatics, molecular modeling, blood-brain barrier permeability and cystic fibrosis.

This work will be divided into three major tasks, each critical for the advancement of the project. First, a thorough evaluation of the best existing QSAR methodologies is needed, comparing them with the current inference engine based on NAMS and kriging [TeixFalc2014]. This comparison is intended to be as exhaustive as possible, testing each existing method on a set of benchmarks created from data collected from publicly available databases. Second, NAMS is to be adapted to include differential topological features when assessing chemical structural similarity, and this adaptation is then to be optimized and tested within a novel framework centered on Bayesian learning. The purpose is the empirical identification of the molecular topological characteristics that are most relevant for each specific benchmark problem. Finally, the topologically enhanced model is to be put to the test on two distinct problems, for which there are world-leading experts within the team. The first problem is the development of new lead compounds for enhancing trafficking of F508del-CFTR to the plasma membrane, which is at the center of current drug development for cystic fibrosis. The second problem is to determine whether a new molecule has the potential to cross the blood-brain barrier, an issue that is critical in most drug-development programs for drugs that target the central nervous system.

Methods

The first task of the project will create a set of benchmarks for fitting a variety of QSAR models from the literature. The techniques typically used range from standard linear models to sophisticated ensemble and hybrid methodologies [tropsha2010]. Most models involve two essential phases: first, finding the best possible descriptors for a given problem [Tropsha2007], and second, testing and validating models built with a variety of methods, ranging from classic neural networks and support vector machines to hybrid elastic nets and probabilistic graphical models [Koeller2009]. The purpose of this phase is not to find the best possible model but to assess the capabilities and limitations of current QSAR models and to obtain a global view of the sources of prediction error and the nature of the problems. This proposal will focus on a strict Bayesian approach for including localized structural similarity in probabilistic models. The Bayesian emphasis will be critical for realistic results: the majority of molecules do not show pharmacological activity, and neglecting the prior probabilities is one of the key reasons why many in silico screening studies produce unreliable and non-reproducible results [MartFalc2012]. For a stringent, unbiased comparison of models and assessment of the new approach, a thorough validation process will be followed: one independent validation set will be created for each benchmark set and will not be used in any phase of the training or cross-validation procedures.
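
The effect of neglecting prior probabilities can be shown with a short worked example based on Bayes' rule. The prior, sensitivity and specificity below are illustrative assumptions, not figures from any of the cited studies.

```python
def posterior_active(prior, sensitivity, specificity):
    """P(active | predicted active) by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# Assumed values: 1% of screened molecules are truly active, and the
# classifier has 90% sensitivity and 90% specificity.
print(posterior_active(prior=0.01, sensitivity=0.9, specificity=0.9))
# About 0.083: roughly 11 of every 12 predicted actives are false positives.

# The same classifier evaluated on a balanced benchmark looks far better.
print(posterior_active(prior=0.5, sensitivity=0.9, specificity=0.9))
# About 0.9
```

A model validated only on balanced data can therefore appear reliable while most of its hits in a realistic screening library would be inactive.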

The central part of this proposal is the second task, whose purpose is twofold: (A) to incorporate the topological differentiation mechanisms into NAMS and (B) to define a new inference engine that can use this information to predict the pharmacological potential of any molecule. The purpose is not to solve the specificities of the individual benchmark problems, but rather to establish, algorithmically and statistically, how to identify which structural characteristics are fundamental for each binding problem. Accomplishing this objective requires rigorous statistical handling within a Bayesian framework to assess adequately the relevance of each factor. The structural characteristics of a molecule are not necessarily describable in human terms; they are topological constructs, to be discovered using a Markov chain Monte Carlo approach over NAMS so that the most relevant parts of the molecule become prominent and can be used through a dynamic weighting layer over the atoms of each lead molecule. Second, the modeling phase will be centered on fitting a Markov network in which the similarities to key instances within the chemical metric space condition the posterior probability that any unknown instance has biological activity. Each network will be derived in an unsupervised way directly from the existing data. Using a data-driven probabilistic graphical model for each problem will not by itself ensure that the Bayesian requirements are met, as it is anticipated that the priors will be difficult to assess, and their determination will be one of the key challenges in the forthcoming work. This phase will require a large computational effort, to be accomplished with the existing hardware and the new servers to be acquired.
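
How a Markov chain Monte Carlo search could make the relevant atoms of a lead molecule prominent can be sketched with a toy Metropolis sampler over per-atom weights. Everything specific here is an assumption made for illustration: the uniform prior on [0, 1], the Gaussian random-walk proposal, and above all `toy_log_posterior`, which stands in for a real score of how well a weighted NAMS similarity explains the activity data. The proposal does not fix any of these choices.

```python
import math
import random


def metropolis_weights(log_post, n_atoms, n_steps=5000, step=0.1, seed=0):
    """Random-walk Metropolis over atom weights in [0, 1].
    Returns the average weight of each atom along the chain
    (no burn-in is discarded in this sketch)."""
    rng = random.Random(seed)
    w = [0.5] * n_atoms
    current = log_post(w)
    totals = [0.0] * n_atoms
    for _ in range(n_steps):
        i = rng.randrange(n_atoms)          # perturb one atom weight
        proposal = w[:]
        proposal[i] += rng.gauss(0.0, step)
        if 0.0 <= proposal[i] <= 1.0:       # outside the prior support: reject
            candidate = log_post(proposal)
            accept_prob = math.exp(min(0.0, candidate - current))
            if rng.random() < accept_prob:
                w, current = proposal, candidate
        for j in range(n_atoms):
            totals[j] += w[j]
    return [t / n_steps for t in totals]


# Toy target (hypothetical): atom 0 matters, atoms 1 and 2 do not.
TARGET = [0.9, 0.1, 0.1]


def toy_log_posterior(w):
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET)) / (2 * 0.05 ** 2)


mean_weights = metropolis_weights(toy_log_posterior, n_atoms=3)
print(mean_weights)  # atom 0 ends up with a clearly larger average weight
```

The averaged weights play the role of the dynamic weighting layer described above: atoms whose weights stay high across the chain are the ones the data mark as relevant.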

The third task will involve testing the new topologically enhanced similarity metric and the respective inference engine on two current problems in pharmacology. The first is the retrieval of molecules with the potential to rescue a genetically malformed protein (F508del-CFTR) to the plasma membrane. The defective trafficking of this protein is known to be the central factor causing cystic fibrosis. The second problem involves discovering factors that may induce molecular penetration through the blood-brain barrier (BBB). This is still an open problem, and in silico models have so far had limited success [MartFalc2012]. The results from the new model will be complemented with a virtual screening similarity approach [Lucas2012]. Efforts will focus on the virtual screening of large libraries of commercially available compounds (e.g., ZINC, NCI). The NCI database contains around 400,000 compounds, of which only 250,000 are available for download. The ZINC database now has around 35 million compounds; however, we will not screen the entire database but only the “In-Stock Drug-like” subset (~10 million compounds).

Once the lead compounds for both problems have been defined, they will be put to the test in vitro. The rescue of F508del-CFTR traffic in cells treated with lead compounds will be assessed with the F508del-CFTR traffic assay we have established for automated fluorescence microscopy [Almaça2011, Farinha2013]. BBB penetration, in turn, will be assessed on primary cultures of human brain microvascular endothelial cells derived from microvessels isolated from temporal tissue removed during operative treatment of epilepsy. Monolayers of human brain microvascular endothelial cells show characteristically high transendothelial electric resistance and have proven useful in multiple functional studies for in vitro modeling of the human blood-brain barrier [Bernas2010].