1. Benchmark creation and model testing.
Molecular information databases have grown both in the number of compounds available and in the quantity and quality of information for each molecule. Repositories such as ChemSpider, ChEMBL and ZINC allow easy consultation and retrieval of data on molecules with specified biological characteristics, as described and published in the literature. The first task of this project involves four main subtasks. The first is the deployment of the computational infrastructure, namely a hardware platform for computation and for the storage of structural information on molecules. Secondly, several well-defined benchmark data sets of pharmacological data will be created from the available data. Thirdly, NAMS will be adapted for topological searching, using the created benchmarks as a model base. Fourthly, results will be compared with other state-of-the-art QSAR methodologies. This process is to be carried out iteratively, with the resulting models being continuously improved.
Subtask 1.1. Deployment of the data processing infrastructure for chemical data
Along with the acquisition of the computational platform and its deployment within the available computational framework, all database-building tasks are expected to be performed in this subtask. Local copies of the central chemical repositories (ZINC, PubChem and ChEMBL) are to be integrated into a common database, giving easy and unencumbered access to the repositories and centralizing all the relevant information in a single repository.
Subtask 1.2. Benchmark dataset creation
The creation of reliable datasets for model testing is important enough to deserve its own subtask. Its purpose is to identify in the literature known and reliable test problems in the field of drug discovery, covering different types of molecules. This will involve curating and classifying the problem types. Some benchmarks should be small and with little variability, while others are expected to be large and very diverse, so as to adequately balance and test all the models. The conclusion of the benchmarks will represent a fundamental milestone within the project, as only then will it be possible to evaluate the models. For all the datasets, all chemical descriptors are to be precomputed with the modeling software. Furthermore, as detailed above, model appraisal and validation will be stringent. Each benchmark will be tested using standard n-fold cross validation; however, a separate set, drawn from the main data and never to be tested or used during the model selection phase (subtask 1.3), must be created beforehand. This set will be used for assessing the best models after the selection procedure has been performed.
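The validation scheme described above can be sketched as follows. This is a minimal illustration only: the function name `split_benchmark`, the 20% hold-out fraction, the five folds and the fixed random seed are assumptions made for the example, not values fixed by this work plan.

```python
import random

def split_benchmark(n_items, ivs_fraction=0.2, n_folds=5, seed=0):
    """Split molecule indices into an independent validation set and CV folds.

    Returns (ivs_indices, folds). The IVS indices are set aside first and
    never appear in any fold; folds is a list of (train, test) index lists
    built only from the remaining data.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    # Hold out the independent validation set before any model selection.
    n_ivs = max(1, round(n_items * ivs_fraction))
    ivs_indices = sorted(indices[:n_ivs])
    selection_pool = indices[n_ivs:]

    # Build n-fold cross-validation splits on the remaining molecules.
    folds = []
    for k in range(n_folds):
        test = sorted(selection_pool[k::n_folds])
        test_set = set(test)
        train = sorted(i for i in selection_pool if i not in test_set)
        folds.append((train, test))
    return ivs_indices, folds

ivs, folds = split_benchmark(n_items=100)
```

Fixing the seed keeps the split reproducible, so every model is selected on the same folds and finally assessed on the same held-out molecules.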
Subtask 1.3. Implementation of state of the art QSAR methodologies
Two essential modeling issues must be addressed: a) the determination of the optimal subset of descriptors for each benchmark, and b) the selection of the model framework best suited to each specific problem. For result comparison, a dedicated database will be created, as the results must be compared globally, looking not at each model’s adequacy for solving one specific problem but at its actual strengths and shortcomings for each specific problem set. The current NAMS implementation, which uses kriging for inference, is to be tested as well against all the other models.
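One possible shape for such a results database is sketched below using SQLite. The table layout, column names, benchmark and model labels, and error values are all hypothetical placeholders chosen for illustration; they are not results or design decisions of the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        benchmark   TEXT NOT NULL,  -- benchmark data set identifier
        model       TEXT NOT NULL,  -- modeling framework
        descriptors TEXT NOT NULL,  -- descriptor subset used
        cv_error    REAL NOT NULL   -- cross-validation error (lower is better)
    )
""")

# Placeholder rows, for illustration only (not real results).
rows = [
    ("bench_small", "model_a", "subset_1", 0.40),
    ("bench_small", "model_b", "subset_2", 0.30),
    ("bench_large", "model_a", "subset_1", 0.50),
    ("bench_large", "model_b", "subset_2", 0.70),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

def best_per_benchmark(conn):
    """Return the lowest-error (benchmark, model, descriptors, error) rows."""
    query = """
        SELECT r.benchmark, r.model, r.descriptors, r.cv_error
        FROM results AS r
        JOIN (SELECT benchmark, MIN(cv_error) AS best
              FROM results GROUP BY benchmark) AS b
          ON r.benchmark = b.benchmark AND r.cv_error = b.best
        ORDER BY r.benchmark
    """
    return conn.execute(query).fetchall()

def mean_error_per_model(conn):
    """Return (model, mean error over all benchmarks), best model first."""
    query = """
        SELECT model, AVG(cv_error) AS mean_error
        FROM results GROUP BY model ORDER BY mean_error
    """
    return conn.execute(query).fetchall()

best = best_per_benchmark(conn)
overall = mean_error_per_model(conn)
```

Keeping one row per benchmark, model and descriptor subset allows both views to be queried from the same table: the best model for a single problem, and each model's behaviour across the whole problem set.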
Subtask 1.4. Empirical in silico model validation
The best models for each benchmark from the previous subtask will be subjected to a final test using the independent validation sets (IVS) created in subtask 1.2. At this phase, a journal paper is expected to be produced for a high-profile publication, detailing the results of this effort, which is a thorough evaluation of a large variety of QSAR models on a set of benchmark problems. The benchmark datasets will be made public, providing the scientific community with a set of problems with which to test, evaluate and develop other models.
2. Model building for molecular pharmacol
There are two different aspects to this requirement. The first is to identify the structural characteristics that make a molecule active or inactive for a specific biological target. The second is how to use the discovered characteristics for inference. The second task of the project is therefore essentially algorithmic and will use the benchmarks defined in subtask 1.2 for its completion. This is the core of the project and will be critical to its ultimate success. The subtasks identified reflect these goals, and the success of the task depends on the joint success of both objectives.
Subtask 2.1. Methods for the identification of structural characteristics of pharmacological activity
Identifying the structural elements will require a significant computational effort, estimated at several thousand CPU hours. It will involve stochastic simulations over the benchmark sets for each individual problem as well as across benchmark sets, where the devised models will be tested with datasets for which they were not conceived. To accomplish this task it will be necessary to develop and include a “weighting layer” over the basic atom-matching component of NAMS, which will be further modified to efficiently test different positional-weighting schemas within a stochastic simulation framework.
Subtask 2.2. Including topological characteristics for inferring pharmacological activity in the NAMS metric space
Identifying the fundamental structural characteristics within a specific pharmacological problem solves only part of the problem. The information derived must then be used as a source of knowledge for inference over each specific problem. It will therefore be necessary to develop and test new inferential methods capable of including such knowledge. NAMS has been used as a global “graph-isomorphism”-like algorithm to assess global molecular similarity. Kriging has further been used for inference over the NAMS metric space. It is an open question how the topological differentiating characteristics inferred for each problem type will impact the kriging inference engine. Therefore the next subtask is the inclusion of the molecular topological knowledge units (MTKUs) within the inference engine and their use for better modeling. This will involve the development of an inference framework structured over NAMS that is able to include the diverse MTKUs for inference.
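To make the idea of kriging over a metric space concrete, the sketch below predicts an activity from a matrix of pairwise molecular distances. It is an illustration under stated assumptions, not the NAMS implementation: it assumes simple kriging with the mean taken as the sample mean, an exponential covariance model with a hypothetical `length_scale` parameter, and distances already derived from the similarity scores. The kriging variant and covariance model actually used are not specified here.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def covariance(distance, length_scale=1.0):
    """Exponential covariance model (an assumption of this sketch)."""
    return math.exp(-distance / length_scale)

def simple_kriging(train_dist, query_dist, activities, length_scale=1.0):
    """Predict the activity of a query molecule.

    train_dist: pairwise distance matrix between training molecules.
    query_dist: distances from the query molecule to each training molecule.
    activities: known activities of the training molecules.
    """
    mean = sum(activities) / len(activities)
    k_matrix = [[covariance(d, length_scale) for d in row] for row in train_dist]
    k_vector = [covariance(d, length_scale) for d in query_dist]
    weights = solve_linear(k_matrix, k_vector)
    return mean + sum(w * (y - mean) for w, y in zip(weights, activities))

# Three hypothetical molecules and their pairwise distances.
train_dist = [[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]]
activities = [5.0, 6.0, 8.0]
prediction = simple_kriging(train_dist, [1.0, 0.0, 2.0], activities)
```

Because no nugget term is included, the predictor interpolates exactly: a query at zero distance from a training molecule returns that molecule's activity, while a query far from all training molecules falls back to the mean. In this framing, a weighting scheme such as the MTKUs would act by changing the distances that enter the covariance.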
Subtask 2.3. Validating the molecular topological knowledge units within NAMS.
The final subtask will involve learning and making inference for all the benchmark data sets and comparing the results with those obtained in subtask 1.4 using the independent validation sets. New models will thus be inferred using the developed framework and the training sets created for each benchmark. This will be the most critical task in the whole project, since failing to produce better results may imply that it is useless to proceed to the most onerous phase, task 3. Therefore, if the preceding efforts are unable to consistently outperform the other state-of-the-art methods, the goals of the project should be toned down. If, on the other hand, the new methods produce better results, as is expected, then it makes sense to proceed with confidence to the laboratory work.
This task will produce the major model for the project, and it is expected that at least three journal papers will be written and published.
3. Model validation
On reaching this phase, the developed methodology should be able to suggest new lead compounds for drug development programs. This will be tested on two well-defined problems for which the project team has the laboratory know-how to perform the tests and evaluate the predictions, namely the prediction of molecular blood-brain barrier penetration and F508del-CFTR rescuing.
Subtask 3.1. Data selection, curation and model fitting.
The procedure for data retrieval and selection will require a thorough curation process, verifying all the relevant data in the original literature and in patent databases. For each research problem, a separate data set will be assembled and a model built, which will be the basis for a subsequent molecule retrieval over the main chemical databases, where the most promising structures will be selected. This initial set is intended to be large and inclusive, so as not to miss any possible lead candidates.
Subtask 3.2. Virtual Screening
Results from NAMS will be compared with those of a virtual screening (VS) similarity approach [Lucas2012], which will be performed on the NCI and ZINC databases to search for compounds with the potential either to cross the blood-brain barrier or to rescue F508del-CFTR. If there is any clue about the target, molecular docking will be performed using GOLD 5.2.0 with the GoldScore scoring function and the number of GA runs set to 500. Standard default settings will be used: number of islands = 5, population size = 100, number of operations = 100 000, niche size = 2 and selection pressure = 1.1. Finally, the GOLD poses for ca. 1000 compounds will be displayed (PyMOL/VMD) and visually inspected for the hydrophobic and hydrophilic interactions between the ligands and the active-site residues of the enzyme. Compounds predicted to be metabolized will be excluded from further refinement. Once selected, the compounds will be purchased and assayed.
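A similarity-based screening step of this kind can be illustrated as follows. The details of the cited approach are not given here, so this sketch assumes a common choice: Tanimoto similarity between fingerprints represented as sets of "on" bit indices. The compound names, fingerprints and the 0.5 threshold are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as bit-index sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_by_similarity(query, library, threshold=0.0):
    """Rank library compounds by similarity to the query, most similar first.

    library maps a compound name to its fingerprint; compounds scoring
    below the threshold are dropped.
    """
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical query (a known active) and a tiny hypothetical library.
query_fp = {1, 2, 3}
library = {
    "cmpd_1": {1, 2, 3, 4},
    "cmpd_2": {1, 2},
    "cmpd_3": {7, 8},
}
candidates = rank_by_similarity(query_fp, library, threshold=0.5)
```

The ranked list from such a step can then be set against the NAMS ranking for the same query, before any docking is attempted.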
Subtask 3.3. In vitro testing of CFTR rescuing
The purpose is to assess the ability of lead compounds to rescue F508del-CFTR traffic to the plasma membrane (PM) in a CF cell line, using an automated fluorescence microscopy assay established in our lab facilities [Almaça2011]. F508del-CFTR traffic will be assessed in a Cystic Fibrosis Bronchial Epithelial (CFBE) cell line developed at our lab. By using fluorescence microscopy, the total CFTR amount and the amount of PM-located CFTR can be determined. To determine CFTR traffic, cells will be cultured following standard procedures and seeded onto microscopy-grade 96-well plates containing the lead compounds at different concentrations, including DMSO controls. F508del-CFTR expression will be triggered over time with doxycycline. Extracellular Flag tags will later be labelled and the cells imaged on an automated fluorescence microscope (Leica DMI 6000B). As a positive control, F508del-CFTR traffic will be rescued with the corrector VX-809.
The amount of CFTR at the PM and the CFTR traffic efficiency will be determined for each cell. Compounds significantly enhancing F508del-CFTR traffic will be considered hits.
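The per-cell quantification can be sketched as below. The sketch assumes that traffic efficiency is the ratio of PM-located CFTR signal to total CFTR signal in each cell, and it reports only a fold change relative to the DMSO control; the function names and intensity values are hypothetical, and the statistical test used to call an enhancement significant is not specified here and is left out.

```python
from statistics import mean

def traffic_efficiency(pm_signal, total_signal):
    """Fraction of a cell's CFTR signal located at the plasma membrane."""
    return pm_signal / total_signal if total_signal > 0 else 0.0

def well_efficiency(cells):
    """Mean traffic efficiency over the (pm_signal, total_signal) cells of a well."""
    return mean(traffic_efficiency(pm, total) for pm, total in cells)

def fold_change_vs_control(compound_cells, dmso_cells):
    """Traffic efficiency of a compound well relative to the DMSO control well."""
    return well_efficiency(compound_cells) / well_efficiency(dmso_cells)

# Hypothetical per-cell intensities as (PM signal, total signal) pairs.
dmso_cells = [(10.0, 100.0), (20.0, 100.0)]
compound_cells = [(30.0, 100.0), (60.0, 200.0)]
fold_change = fold_change_vs_control(compound_cells, dmso_cells)
```

Normalizing by the total signal in each cell keeps differences in expression level between cells from being mistaken for differences in traffic.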
Subtask 3.4. In vitro testing of blood-brain barrier penetration
Primary cultures of human brain microvascular endothelial cells (HBMVECs) will be generated from microvessels isolated from temporal tissue removed during operative treatment of epilepsy. The tissue is to be fragmented and size-filtered using polyester meshes. The resulting microvessel fragments are placed onto type I collagen-coated flasks to allow HBMVECs to migrate and proliferate. The overall process takes less than 3 h and does not require specialized equipment or enzymatic processes. Monolayers of human brain microvascular endothelial cells show characteristically high transendothelial electric resistance and have proven useful in multiple functional studies for in vitro modeling of the human blood-brain barrier.
After this task, several journal papers are to be published and the hit molecules are to be patented.