Multiple target-based pharmacophore design from active site structures
Introduction
Target-based drug discovery is one of the major approaches for producing target-specific medicines. However, the ‘one drug one target’ approach is not so effective for complex diseases such as pathogen-borne diseases, cancer, metabolic disorders, inflammation,diabetes and central nervous system disorders [1–3]. The central dogma of the drug design process has now shifted from single targets to multiple targets. There has been considerable interest in multi-targeted drugs for the treatment of complex diseases that are linked to drug resistance and also for the repurposing of existing drugs [4,5]. Due to an increasing unresponsiveness to most monotherapeutic regimens, in the form of resistance, combination therapy has given an important edge in combating various complex cellular/metabolic dis- eases. Though this methodology has come into the picture in the last 10 to 15 years, many of the already known therapeutics are in fact multi-targeted in nature. These therapies are identified by serendipity, phenotypic screening and traditional medicine [5]. Antimalarial drugs such as artemisinin, quinine and chloroquine are major examples of this class.
The combination of different approaches is now in use to achieve polypharmacology. Drug combination therapy (also known as the ‘drug cocktail’) is composed of the use of two or more drugs having different therapeutic mechanisms [6]. The use of a combination of drugs is very well known, and a recommended therapy against malarial disease. Artemisinin- based combination therapy is currently used to handle widespread resistance in Plasmodium falciparum, and also prevent recrudescence due to artemisinin monotherapy [7]. Another approach is the single-pill or multi-component drug approach [8], with the co-formulation of two or more therapeutic agents in a single tablet. This formulation can simplify patient treatment by reducing dosing regimens. One additional approach in use utilizes the common skeleton of the compound. A skeleton key approach [5] is hypothesized, based on a multi- functional strategy, i.e. a single compound with a single active agent can modulate multiple targets of the same or different functions. All these approaches are proposed to treat complex diseases and, simultaneously, delay drug resistance.The haemoglobin degradation pathway [9] is proven to be vital for the survival of the Plasmodium parasite inside the erythrocyte. This parasite is one of the major killers in tropical countries. This pathway occurs in the acidic food vacuole (pH ~5), and several enzymes have been shown to be involved in haemoglobin degradation, from proteins to amino acids [10]. The acid protease class of proteins, mainly plasmepsin I, plasmepsin II and plasmepsin IV, are involved in the degradation of haemoglobin.
Plasmepsin II is a well-validated target; inhibiting this protein disrupts haemoglobin degradation and thereby kills the parasite and, consequently, the disease [10].Genome sequencing of Plasmodium falciparum has revealed at least 10 genes encoding aspartic protease proteins (Plm I, II, IV–X and the closely related HAP). Due to the presence of Plm I, II, IV and HAP in the acidic vacuole, these proteins have been extensively studied [11]. The precise roles of other proteases are not so clear [10]. The aspartic protease Plm I, II and IV share a high level of sequence homology (~60–70%). These proteins also share a high binding site region identity compared to Plm II (Plm I 84%, Plm IV 68%). Plasmepsin II also has ~35% sequence identity with the close human ortholog cathepsin D and pepsin [10]. Cathepsin D is a lysosomal aspartic protease and has a high similarity in the binding region with plasmepsin II. A close binding region similarity poses the problem of low selectivity with the plasmepsin-based inhibitor and cathepsin D, which is used as a marker for cross-in- hibition for a newly developed inhibitor against the plasmepsin class of proteins [10].Earlier, our group has developed a novel de novo method called CliquePharm for tar- get-based drug design [12, 13]. CliquePharm can utilize different probe-based grid points for generating ensembles of pharmacophores with different combinations and types offeatures. This method has provided various advantages over other known techniques such as:In the present study we have pursued exhaustive mapping of the active sites of acid proteases, including two species, i.e. Plasmodium and the human host, with our clique finding approach and different combinations of functional features, thus generating combinatorial pharmacophores to identify specificity towards acid proteases.
The present study provides a novel way of enumerating all possible interaction sites with small and large (MW) chemical inhibitors with different specificity. The study also focused on the identification of selective pharmacophores for Plasmodium aspartic proteases, but not for binding with human coun- terparts, which will provide a selective interaction framework for the design of new non-pep- tide inhibitors. The generality of this new tool allows the design of ensembles of specific and selective pharmacophores for any class of proteins with a known active site structure, hence helping the exploration of new inhibitors for protein targets identified by genomic and systems biology explorations [14–16].A molecular interaction field (MIF) is considered as a continuous interaction energy surface, generated by placing different chemical probes at grid points generated in the available space of the active sites of a protein. Depending upon the nature of the active site, different combination of probes can be used. Here, three types of probes: hydroxyl (OH), carbonyl (O) and amide (N) probes are used to map all of the complementary interactions within the defined binding region of the aspartic proteases. Six aspartic protease proteins (four from plasmodium and two from human) are used for MIF calculation. Plasmepsin I, plasmepsin II and plasmepsin IV from Plasmodium falciparum and one plasmepsin from Plasmodium vivax are used to represent the Plasmodium protease, while cathepsin D and pepsin are used from human. As the MIF energy for each probe varied from least interaction (positive MIF energy score) to most favourable interaction (negative MIF energy score), pharmacophore ensem- bles are calculated by using negative energy values alone.
To overcome the redundancy in the interaction energy for each probe, the approach discussed in a previous study [12] was used to reduce the grid points without losing favourable interaction points.The reduced MIF interaction maps for all six proteins are used as inputs for CliquePharm. From Table 1 it can be seen that, even after grid reduction, the number of favourable energy grid points for each probe is quite large. Some additional topological and domain-specificconstraints [12] are used to overcome the problem of time complexity during frequent clique identification [13].CliquePharm uses the MIF points as vertices, and the distances between two vertices are used as edges to identify the common functional map in the form of frequent cliques of various geometrical types (triangular, quadrilateral, pentagonal, etc.). These cliques are then combinatorially arranged to find pharmacophore ensembles of different sizes and chemical types. Since all MIF-graphs are complete graphs with all connected vertices, finding the total subgraphs (clique) in G will be O(Nk). To reduce the search time and space, topological and biological constraints are used while finding biologically important cliques. In this study, we have selected a minimum of one and a maximum of three probes of each type (N, O and OH) for generating a pharmacophore model. As reported earlier [12], a minimum of 3 Å and a maximum of 20 Å inter-feature distance is used to avoid steric contacts and to encompass the whole binding region of the protein. For every pharmacophore with a combination of any types of features (four, five or six points) the average energy of the model is calculated. Further, the average energy score can be used for ranking the pharmacophore model. To study the ensembles of pharmacophores, −4.0 kcal/mol energy threshold is used, which is nearly equivalent to the hydrogen-bond energy [17]; higher than this limit a pharmacophore is not useful even for replacing water from the active site.
From the chemical point of view, stereochemistry has an important role in determining the activity of compounds, so phar- macophore models with four or more features are analysed.For both (specificity and selectivity pharmacophore modelling), clique space has been exhaustively searched to find all possible variant types. To explore the number of cliques, we have used different upper bound values ranging from 5 × 103 to 5 × 106. In addition to this, we have also tested the importance of P. vivax plasmepsin (with P. falciparum plasmepsin)[12] during pharmacophore ensemble generation. Due to an exhaustive search, a large number of pharmacophore models of various chemical types and size (three-, four-, five-,six-featured, etc.) are generated. Root mean-square deviation (RMSD)-based k-means [18] clustering of such ensembles has been done to find representatives of each cluster with the lowest favourable average energy and the lowest distance from the centroid of the ASP Aspartic Acid catalytic dyad. Clusters having a favourable average energy value higher than−4 kcal/mol are also not included in this study.Complementary to receptor-based pharmacophore, ligand-based pharmacophore model- ling is the conventional approach for finding pharmacophore features using bio-active ana- logues [19]. To compare the ligand- and receptor-based design of pharmacophores, from known plasmepsin II protease inhibitors, a set of chemicals (74 inhibitors), non-peptides tested against plasmepsin II [20–23] are carefully selected based on the specified criteria mentioned in the literature [24,25]. All inhibitors have an in vitro bioactivity value (Ki value reported in nM) and a nearly four-fold activity range (0.56 nM to 4100 nM). The Discovery Studio (DS) software program [26] version 4.5 is used for compound preparation and all pharmacophore hypotheses generation. The HypoGen module of DS is well known for phar- macophore features based quantitative model generation and activity prediction.
All of the compounds are prepared at biological pH and BEST (conformer generation protocol), which uses the Poling algorithm [27] to generate the bio-active conformation space for each compound. Based on the structural properties and the corresponding inhib- itory value, 21 structurally diverse compounds [24,28] are used as the training set to build the pharmacophore hypotheses while the remaining 53 compounds are used as the test set to evaluate the model. Training set compounds with pre-generated conformations have been used as input for HypoGen-based pharmacophore modelling. Default parameters are used with an uncertainty value of 2.0. In addition to hydrogen-bond acceptor (HBA, equiv- alent to O in the receptor-based program) and hydrogen-bond donor (HBD, equivalent to N and OH), hydrophobic (HY) and hydrophobic_aromatic (HY_Aro) features are also available in ligand-based pharmacophore generation and are used here. Finally, 10 top-ranked hypoth- eses are selected for comparison with receptor-based specificity pharmacophore models.To evaluate the performance of the pharmacophore model, four different types of ligand dataset have been prepared. ChEMBL, May 2012 (https://www.ebi.ac.uk/chembl/) [29, 30], is one of the available databases of bioactive compounds, where all inhibitors with known Ki or IC50 values are included. The first dataset extracted is for the inhibitors for plasmepsin II alone. The second dataset includes cathepsin D and pepsin. The third dataset is for protease inhibitors (other than plasmepsin, cathepsin D and pepsin), which include aspartic, serine and cysteine proteases. The fourth dataset, with known Ki and IC50 for non-protease inhibitors such as kinase, phosphodiesterase, phosphates, reductases and cytochrome inhibitors, is used as the negative validation set. After removal of the redundant and compounds with undetermined activity, the total number of compounds is further reduced by considering a threshold of Ki ≤ 50 μM.
The final dataset includes 236 inhibitors for plasmepsin II, 844 for human acid proteases (cathepsin D and pepsin), 500 chemicals tested against non-proteases and 2332 compounds having binding information against aspartic protease. To remove theimbalance in the number of inhibitors between plasmepsin II and other datasets, the Z:1[31] data splitting algorithm is used.For specificity pharmacophore ensemble validation, inhibitors of plasmepsin II and human aspartic protease (cathepsin D and pepsin) and other protease inhibitors are used as the positive set while the rest of the non-protease inhibitors are used as the negative control. To test the selectivity pharmacophores, exclusive inhibitors of plasmepsin II are used as the positive dataset, while all of the other compound datasets (human aspartic protease (cathep- sin D, pepsin), protease and non-protease) are used as the negative datasets.All four compound datasets are prepared in DS and conformation of each compound is generated using the FAST module of DS. To ensure sufficient conformational coverage, 500 conformations of each compound are used with a filter criterion of 20 kcal/mol energy. Pharmacophore-based screening of the compounds is done using the Pharmer [32], and a hit is only considered if all features with corresponding inter-feature distances are satisfied. The hits are collected separately for each compound dataset and different hit statistics are calculated. For the validation of specificity of the pharmacophore models, hits extracted from the plasmepsins, human and other proteases are considered to be true positive (TP) and hits from the non-protease are considered to be false positive (FP).
Selective pharma- cophores are the relevant classes of pharmacophores that hit only the plasmepsin inhibitors. So, plasmepsin inhibitors are considered to be active compounds, while the three other dataset compounds are considered to be inactive. Selectivity of the pharmacophores is calculated using a Guner–Henry matrix, which provides the quantitative goodness of hit (GH) score [24]. This score takes care of both the yield percentages of actives used in the dataset and the percentage ratio of actives in the hits list. The GH score varies from 0 (null model) to 1 (ideal model).The Guner–Henry score (GH score) is given bywhere A is the number of actives in the dataset, D is the number of compounds in the dataset, Ht is the number of hits retrieved, and Ha is the number of actives in the hit list (true positives).Result and discussionExhaustive search and clique number estimationIdentification of all possible cliques is a computationally extensive problem. Table 2 suggests the possible number of cliques using three, four, five and six node points at a time. Cliquenumbers are calculated using plasmepsin II grid points as standard, and a minimum of one and a maximum of three probes are considered during clique formation. These clique num- bers clearly suggest that finding all combinations comes at a high computational cost. However, these numbers can be limited by introducing domain knowledge of the protein class used. The probe energy threshold, maximum and minimum inter-feature distance threshold, clique average energy threshold and distance from the catalytically important aspartic acid residue threshold are used to remove the biologically irrelevant cliques [33].
Before analyzing the cliques for further analysis, convergence analysis is performed. To explore and define the different upper bound number for (three-point) cliques, three node cliques are generated for both five and six target proteins. Figure 1S (in the Supplementary material) shows that the number of cliques initially increased exponentially up to 5 × 105 but saturated after 1 × 106 nodes. It is also interesting to see that both five and six target set-based common cliques grow in nearly the same way. Additionally, out of all possible cliques, only ~5% cliques are found to be functionally relevant for specificity pharmacophore identification. Hereafter, cliques identified by 2 × 106 nodes are analysed for four, five and six point (using different combinations of chemical probes) pharmacophore models (if any).For specificity pharmacophore ensemble generation, 2 × 106 cliques have been explored for four-, five- and six-featured pharmacophores. Table 3 shows all generated models having a diverse number of features and chemical types. For four-featured pharmacophores, only three variants are possible (functional), All of these variants are identified by the program. In the case of five- and six-featured pharmacophores, only five and three variant types are identified, respectively. In all pharmacophore ensembles, a hydroxyl probe seemed to be a good anchor point for specific inhibitor design. Hydroxyl and amide probes contribute more to expanding from four- to five-featured than the carbonyl probe (Figure 1).
In total, 652,989 ensembles of four-featured pharmacophores are identified, out of which only 2.18% survived to five-featured.The largest, i.e. six-featured pharmacophore models (20 functional), are clustered based on the RMSD values of their co-ordinates, and unique 11 representatives have been identified having the lowest favourable average energy and lowest centroid distance from the hydroxyl probe. Table 4 shows the selected representatives (only 10, as one has a higher interaction energy (>−4 kcal/mol)) of two chemical variants. One variant has two amides, one carbonyl and three hydroxyls, while the other variant has three amides, one carbonyl and two hydrox- yls. These six-featured pharmacophore variants can provide the possibility of identifying different interaction points for better inhibitor design. These variants have nearly the same favourable energy (−5.078 to −4.147 kcal/mol) but different centroid distances from the ASP dyad (Asp 34 and Asp 214 of plasmepsin II). The RMSD value of an ensemble of six-featured pharmacophores has a range from 2.8Å to 7.3Å (Table 2S). The analysis showed that phar- macophore models generated bind to a different region of the aspartic protease active site. It also emphasizes the potential power of CliquePharm for exploring the whole active site with unique pharmacophores, and provides the flexibility of interaction of chemicals with binding sites during inhibitor design (Figures 2 and 3).During specificity pharmacophore ensemble analysis, we are interested in identifying those pharmacophore models that are preferentially expanded from four-featured pharma- cophore models to six-featured pharmacophores.
These extended pharmacophore models can provide potential hotspots for fragment-based drug design and lead optimization.We have also identified unique pharmacophore types (RMSD ≥ 1.5 Ǻ) for four- and five-fea- tured pharmacophores that it is not possible to extend at all (Figures 2 and 3).All 10 selected six-featured specificity pharmacophore models are used for screening the four inhibitor datasets. Hits are only considered and collected if all of the pharmacophore features and corresponding inter-feature distances are satisfied by any compound. Table 5 tabulates all of the hits per dataset for all 10 pharmacophore models, and the experimental known activity range has been provided in braces. As mentioned in the Methods section, true positive hits belong to the human, Plasmodium and protease datasets while non-pro- tease dataset hits are considered to be false positives. None of the models hit any compounds belonging to the inactive set. The specificity of the pharmacophores listed is thus established. Model 6.1.1 has the lowest favourable average energy and 16 compounds are selected as hits, while model 6.1.4 has nearly the same favourable energy but a lower ASP dyad centroid distance than model 6.11, and produces a maximum of 37 hits (Figure 4). Out of 10 pharma- cophore models, only eight hit the human dataset compounds, five hit the malarial dataset, six hit the malarial and human dataset, and seven hit the protease dataset compounds, while there were no chemical hits in the non-protease dataset. In the protease dataset, the max- imum numbers of hits was observed for the HIV protease inhibitor (another aspartic pro- tease).
Within the human dataset, two compounds are consistently hit by 7 out of 10 pharmacophore models, while in the case of Plasmodium, only one compound is commonly hit by three pharmacophore models (Table S4).Selective clique identification for the Plasmodium class of aspartic protease is achieved, as mentioned in the Methods section. The target-based drug design process is very much restricted for those proteins, which have close human orthologs. Information about selective interaction can have a large impact on selective inhibitor design for the target of choice. We utilized MIF grid point information for selective clique identification using CliquePharm. The objective of the selective clique is the identification of those cliques that belong exclu- sively to the Plasmodium plasmepsins (I, II, IV and plasmepsin vivax) but not to human (cathepsin D and pepsin). For selective clique identification, four plasmepsin grid points are used as a positive control, while cathepsin D and pepsin grid points are used as a neg- ative control. Using CliquePharm, 106 possible five-featured pharmacophores have been explored and a total of 2630 pharmacophore ensembles have been identified that satisfyall thresholds. Table 6 shows eight different models that are selected based on the favour- able average energy of interactions; for the two top-ranked pharmacophores for each chemical type.As for the specificity models, selectivity pharmacophore models are also analysed for extension (Figure 5) and unique pharmacophore identification (Figure 6) using the same RMSD threshold criteria. It is observed that the maximum chemical type obtained is five-fea- tured; six-featured expansion is not available within the exhaustive node search accom- plished in the present study.
Similar to specificity pharmacophores, selectivity pharmacophores also have a combination of two nitro, two carbonyl and two hydroxyl interactions, or even, three possible interactions for hydroxyl.Hit analysis has shown that model 5.4.1 had the lowest average energy of interaction and was appropriately distant with the centroid of the catalytic dyad, but had almost no hits picked up (Table 7), hence it was not possible to have a selective pharmacophore; whereas models 5.2.1 and 5.2.2 were the nearest to the lowest energy and not far away from the catalytic dyad, picked up selective hits only from plasmepsins and not from human (one against HIV protease), hence it is suggested that these will be good to use when searching chemical databases.It is interesting to note that the model 5.3.2 pharmacophore has many hits against Plasmodium (Tables 7 and 8) and against human and other proteases as well, i.e. it is non-se- lective. A comparison between model 5.3.2 (N1O2OH2) and model 5.2.1 (N1O1OH3) shows (Figure 7(a)) that additional carbonyl features instead of hydroxyl features are a distinct cause of human-associated inhibitor hits. When searched for in the specificity list it is observed that there is a chemically equivalent unique five-featured (N1O2OH2, non- expandable to six-featured) pharmacophore (specificity) with RMSD 5.3Å (Figure 7(b)). This explains how selectivity is lost in some of the five-featured pharmacophores, suggesting that maximizing feature type when designing pharmacophores will yield better selectivity and specificity separately. Such a facility is not possible in the ligand-based design of pharmacophores.We also observe that the same compounds emerge as hits for many five-featured phar- macophores (Table S6), which shows the consistency of the interaction of features depicting the pharmacophore with active sites (Figure 8).To investigate the advantages of our approach we have also compared ligand-based phar- macophores with active site-based pharmacophores, as done in the present study. Twenty- one structurally different inhibitors selected are tested against plasmepsin II, and used to produce a set of 10 pharmacophore hypotheses as mentioned in the Methods section. The hypotheses generated have different statistical parameters such as regression coefficient, RMSE and hypothesis cost, which can assist in the selection of the best hypothesis for further analysis.
A selected hypothesis should have a high correlation coefficient (r), low RMS and a low hypothesis cost. It is also reported in the literature that the cost difference (Δcost = null cost – total cost) should also be more than 60 bits to reject the chance correlation prob- ability in the selected model.Out of the 10 best scored hypotheses, six are four-featured while the remaining four are five-featured. In all hypotheses, two HBA features are consistently present while the rest of the features are combinations of hydrophobic and hydrophobic aromatic features. In all 10 scored hypotheses (Table 9), the cost difference (Δcost) and correlation coefficient (r) range from 107.79 to 55.22 and 0.965 to 0.722, respectively (Figure 9). As Hypo-01 has the highest correlation coefficient (0.965), the highest cost difference (107.79) and the lowest RMS value (0.971), this hypothesis is selected as the best hypothesis for further analysis. Hypothesis Hypo-01 has four pharmacophore features, namely two HBAs, one hydrophobic and one hydrophobic (aromatic). A regression plot for experimental and estimated activities by the selected pharmacophore hypothesis (Hypo-01) generated by HypoGen for the training com- pounds (21) and test set compounds (47) are shown in Figure 9. HypoGen model Hypo-01 predicted the activity of the test set compounds very well except for six compounds (CHEMBL105459, CHEMBL114278, CHEMBL324737, CHEMBL325240, CHEMBL33989 andCHEMBL371182), whose predicted activity is more than 10-fold that of the experimental activity, i.e., over predictive model. (Table 7S). Also, dataset includes the lowest inactive as~4 μM that is not really inactive, which makes the discriminatory power of the model low.
To see the power of discrimination (active and inactive) of Hypo-01, the features are fitted onto the most active and least active compounds from the dataset. Figure 10 shows the pharmacophore features mapping onto the most active (CHEMBL113208, Ki = 0.56 nM) and the least active (CHEMBL177262, Ki = 4100 nM) compounds of the training set. The most active compound (CHEMBL113208) fits all four features with a FitValue of 10.62 while theleast active compound (CHEMBL177262) tested against plasmepsin II fits only three features with a FitValue of 8.04, and one hydrophobic (aromatic) feature is not available with it.It is observed that ligand-based pharmacophores includes features like hydrophobic/ aromatic interactions, whereas in the present active site-based study, we have not includedthe dry probe, which is in progress. However, the active site-based specificity pharmacoph- ores are interacting more through both the HBA and HBD, which are not present in a ligand- based study. One of the main limitations is that in ligand preparation one does not include all of the biological functional properties (like pH) at the active site, which has been included (implicitly) in an active site-based study.
Conclusions
An active site interaction map has been exhaustively searched using cliques generated from a grid map of four plasmepsins and two human acid proteases in the case of specificity pharmacophores. Six-featured pharmacophores are generated from active sites and vali- dated using known ChEMBL inhibitors, and found to be highly specific when searching for compounds within a large chemical database. A comparison with ligand-based pharma- cophores also showed the advantages of active site-based pharmacophores. For generating selective pharmacophores all four plasmepsins are used, while two human acid proteases are excluded, resulting in five-featured ensembles, some of which are shown to be highly selective when getting hits from a known database, but the presence of carbonyl interactions made it similar to five-featured specificity pharmacophores found earlier. The program devel- oped is automated to provide fragment-based chemical design for both specificity and selectivity for a class of proteins, as four- and five-feature-based pharmacophores can be extended to the next level. Interaction mapping without alignment allows the method to compare any classes of active sites in the context of interaction energy from complementary chemical Cathepsin Inhibitor 1 features.