- Research
- Open access
- Published:
Integrating pharmacogenomics and cheminformatics with diverse disease phenotypes for cell type-guided drug discovery
Genome Medicine volume 17, Article number: 7 (2025)
Abstract
Background
Large-scale pharmacogenomic resources, such as the Connectivity Map (CMap), have greatly assisted computational drug discovery. However, despite their widespread use, CMap-based methods have thus far been agnostic to the biological activity of drugs as well as to the genomic effects of drugs in multiple disease contexts. Here, we present a network-based statistical approach, Pathopticon, that uses CMap to build cell type-specific gene-drug perturbation networks and integrates these networks with cheminformatic data and diverse disease phenotypes to prioritize drugs in a cell type-dependent manner.
Methods
We build cell type-specific gene-drug perturbation networks from CMap data using a statistical procedure we call Quantile-based Instance Z-score Consensus (QUIZ-C). Using these networks and a large-scale disease-gene network consisting of 569 disease signatures from the Enrichr database, we calculate Pathophenotypic Congruity Scores (PACOS) between input gene signatures and drug perturbation signatures and combine these scores with cheminformatic data from ChEMBL to prioritize drugs. We benchmark our approach by calculating area under the receiver operating characteristic curves (AUROC) for 73 gene sets from the Molecular Signatures Database (MSigDB) using target gene expression profiles from the Comparative Toxicogenomics Database (CTD). We validate the drugs predicted in our proofs-of-concept using real-time polymerase chain reaction (qPCR) experiments.
Results
Cell type-specific gene-drug perturbation networks built using QUIZ-C are topologically distinct, reflecting the biological uniqueness of the cell lines in CMap, and are enriched in known drug targets. Pathopticon demonstrates a better prediction performance than solely cheminformatic measures as well as state-of-the-art network and deep learning-based methods. Top predictions made by Pathopticon have high chemical structural diversity, suggesting their potential for building compound libraries. In proof-of-concept applications on vascular diseases, we demonstrate that Pathopticon helps guide in vitro experiments by identifying pathways that are potentially regulated by the predicted therapeutic candidates.
Conclusions
Our network-based analytical framework integrating pharmacogenomics and cheminformatics (available at https://github.com/r-duh/Pathopticon) provides a feasible blueprint for a cell type-specific drug discovery and repositioning platform with broad implications for the efficiency and success of drug development.
Background
Systems pharmacology has helped transcend the prevailing “one drug-one target” paradigm in drug discovery by analyzing complex networks of interactions between diseases, drugs, and targets to achieve an organism-wide view of drug action [1,2,3]. Network pharmacological methods often use global drug-disease [4, 5] and drug-target [6, 7] networks as the basis upon which to predict new drug-target interactions and explore drug combinations. Large reference databases of molecular responses to systematic chemical, genetic, and disease perturbations, popularized by the Connectivity Map (CMap) [8], have proven to be an effective resource when used in conjunction with network-based drug discovery methods. In brief, CMap works by matching the transcriptional “signature” of an input set of genes (e.g., differentially expressed genes in a disease) with that of a “perturbagen” (e.g., small molecule drugs), identifying, after a series of statistical steps, the perturbational signatures that best mimic or reverse the input signature [9]. By allowing the rapid generation of testable hypotheses in a cell type- and input context-dependent manner, CMap-based network pharmacological approaches have helped expedite all aspects of drug discovery from drug repositioning [10,11,12,13] to classifying mechanisms of action of uncharacterized small molecules [14, 15] to identifying drug combinations [16] and side effects [17]. Cost-effective transcriptome profiling assays have further accelerated these efforts, scaling up the size of these resources to hundreds of thousands of gene expression signatures spanning tens of thousands of small molecules and dozens of cell lines, as in the case of the Library of Integrated Network-Based Cellular Signatures (LINCS)-CMap [18]. More than a decade after its inception, CMap is now established as a computational drug discovery framework, with many applications and methods [9, 19, 20] built upon its foundation.
Despite their widespread use and clear utility in drug discovery, CMap-based network pharmacology studies have had numerous limitations that can be summarized into three main themes: the first limitation reflects the challenges involving CMap data and the consequent underutilization of CMap as a resource for methods development. CMap data can be considered as a tensor (high-dimensional vector) [21] in five dimensions, consisting of genes, perturbagens, cell lines, time points, and doses, typically with varying numbers of experiments in each dimension. This heterogeneity and multidimensionality of the CMap data present challenges in reliably defining cell type-specific gene-perturbagen networks. In addition, despite its superior coverage, and partly due to its younger age, the more recent LINCS-CMap dataset is still underutilized compared to the extended version of the initial CMap release (CMap Build 2), which remains the prevalent CMap resource. As a result, few efforts exist that generate cell type-specific gene-perturbagen networks from LINCS-CMap data at a large scale as a resource to facilitate network-based pharmacological approaches. The second limitation stems from the focus of CMap-based methods on a single input disease context. CMap-based approaches have typically focused on directly matching disease-associated gene signatures with genome-wide transcriptional responses to perturbations [9], ignoring the larger network of gene-perturbagen interactions within multiple disease contexts. While other interactome-based approaches have recently explored this aspect in a non-cell type-specific way [22], given that diseases and drugs with similar molecular and clinical characteristics are likely to share phenotypes [23], CMap-based approaches are still potentially missing opportunities to identify cell type-specific polypharmacological effects and off-target disease pathways. The third limitation is CMap’s predominant use as a stand-alone resource. It has been suggested that integrating pharmacogenomic data such as CMap with cheminformatic data (e.g., bioactivity and chemical structure data) may be beneficial for drug discovery, but such efforts have so far been confined to quantifying the correlations of transcriptomic response with bioactivity and chemical structure [24]. It remains to be investigated as to whether or not integrating the two sources of information directly benefits drug prioritization. While cheminformatic tools are very useful in their own right in drug discovery given their provision of structure–activity relationships, the cell type-specific extension of such tools using large-scale gene profiling experiments such as CMap has been noted as an unexplored future direction [25]. Two recent such examples, although not utilizing CMap data, use deep learning to integrate orthogonal biomolecular data sources including gene expression and chemical features [26] and combine physicochemical structural features with network-based drug target identification for drug effect prediction [27].
To tackle the above challenges, we have developed a computational framework, Pathopticon (Fig. 1). As the first component of Pathopticon, we propose a gene-centric approach, named QUantile-based Instance Z-score Consensus (QUIZ-C). QUIZ-C uses LINCS-CMap data to build cell type-specific gene-perturbagen networks for 70 cell lines, identifying the consistent and statistically significant relationships between genes and perturbagens. The second component of Pathopticon, the PAthophenotypic COngruity Score (PACOS), measures the agreement between input and perturbagen signatures within a global network of diverse disease phenotypes. After combining PACOS with pharmacological activity data, Pathopticon prioritizes, for each cell line, the perturbagens whose transcriptomic response best aligns with that of the input signature; it surpasses current methods in terms of predictive performance and offers mechanistic clues into the shared intermediate phenotypes and key pathways targeted by the predicted drugs. As such, Pathopticon offers the potential to be a feasible blueprint for a cell type-specific drug discovery and repositioning platform that integrates pharmacogenomics and cheminformatics. The Pathopticon algorithm and app are available at https://github.com/r-duh/Pathopticon.
Methods
Quantile-based Instance Z-score Consensus (QUIZ-C)
We collected the Level 4 plate-normalized expression values (referred to as ZS (https://clue.io/connectopedia/data_levels) or ZSPC values [18]) of all perturbagen instances for each gene. We then calculated a gene-centric z-score for each of these perturbagen instances, denoted as \({pZS}_{g,i}^{c}\), by comparing the Level 4 expression value of gene \(g\) when perturbed by instance \(i\) in cell line \(c\), \({ZS}_{g,i}^{c}\), against the ZS values of all perturbagen instances for the same gene and cell type,
where \(\langle {ZS}_{g}^{c}\rangle\) and \({\sigma }_{{ZS}_{g}^{c}}\) are the mean and standard deviation of ZS scores over all perturbagen instances for the given gene \(g\) and cell type \(c\), respectively. Thus, \({pZS}_{g,i}^{c}\) ensures that we capture the differential expression of perturbagen instance \(i\) against the background of the entire population of perturbagen instances in the given cell type \(c\), in addition to the 384-well plate population. After calculating the \({pZS}_{g,i}^{c}\) values for all perturbagen instances, we aggregated these values at the perturbagen (i.e., “pert_iname”) level to obtain the set of gene-centric z-scores for each perturbagen such that
where \({I}_{p}^{c}\) is the set of all instances for perturbagen \(p\) in cell type \(c\).
Once the gene-specific z-scores were calculated for all possible cell type, gene, and perturbagen \(\left(c, g, i\right)\) triplets and aggregated into perturbagen-level groups, we sought to identify perturbagen-gene pairs in which the perturbagen affects the expression of the target gene (i) significantly and (ii) consistently. The first condition for significance of effect is satisfied if \({pZS}_{g,p}^{c}\le -\widehat{z},\) for \({pZS}_{g,p}^{c}< 0\) (decrease in expression) or \({pZS}_{g,p}^{c}\ge \widehat{z},\) for \({pZS}_{g,p}^{c}> 0\) (increase in expression) where \(\widehat{z}\) is the z-score at the chosen significance threshold. The second condition for consistency of effect, or consensus, is satisfied if a certain portion of \({pZS}_{g,p}^{c}\) values meet the significance criterion based on their distribution. However, due to the variation in the number of perturbagen instances across different cell lines (Additional File 1: Fig. S1), as well as the heterogeneity of expression values within and across perturbagens due to varying doses and time points, it is not feasible to use a fixed number of perturbagen instances as the consensus criterion. We, therefore, introduced a flexible consensus threshold that can accommodate the individual \({pZS}_{g,p}^{c}\) distribution of any given cell type, gene and perturbagen triplet (Fig. 2), which we call the Quantile-based Instance Z-score Consensus (QUIZ-C) approach. Depending on whether perturbagen \(p\) induces a decrease or increase in the expression of gene \(g\) in cell type \(c\), we designate the 1st or the 3rd quartile (25th or 75th %ile) of the \({pZS}_{g,p}^{c}\) distribution, respectively, as the consensus threshold. If the number of significant instances (i.e., above a certain z-score threshold \(\widehat{z}\), taken to be 1.96 here) for the given perturbagen is greater than the number of instances below or above this threshold, then the consensus is satisfied (Fig. 2C), and the perturbagen is deemed to affect the expression of the gene significantly and consistently in the given cell type. A positive or negative edge \({e}_{g,p}^{c}\) with a weight \({w}_{g,p}^{c}\) equal to + 1 or − 1, respectively, is then established between the perturbagen-gene pair in the cell type of interest such that
Overview of the QUIZ-C algorithm. A For every gene \(g\), LINCS-CMap Level 4 data contains information on the expression value of all perturbagen instances \(i\) tested in cell line \(c\), denoted by \({ZS}_{g,i}^{c}\). B We calculate the gene-centric perturbagen z-score for each instance, denoted by \({pZS}_{g,i}^{c}\), by comparing \({ZS}_{g,i}^{c}\) to the mean \(\langle {ZS}_{g}^{c}\rangle\) and standard deviation \({\sigma }_{{ZS}_{g}^{c}}\) of all instances for gene \(g\) and cell line \(c\). C For each gene and cell line, we select perturbagens that pass a consensus threshold based on the \({pZS}_{g,p}^{c}\) distributions of all of their instances. Q1 and Q3 indicate the first and third quartiles of \({pZS}_{g,p}^{c}\). D We build cell type-specific gene-perturbagen networks by repeating this procedure for all genes and perturbagens in each cell line
The positive and negative edges \({e}_{g,p}^{c}\) satisfying the above conditions were then collected into edge lists to be used in the building of the cell type-specific drug-gene networks (Fig. 2D). Due to the large scale of the LINCS-CMap Level 4 data (> 1.3 M perturbagen instances in over 70 cell lines), we used the multiprocess package in Python (https://pypi.org/project/multiprocess/) to perform the identification of edges in parallel for each cell line (code available at https://github.com/r-duh/Pathopticon/blob/main/generate_QUIZC_networks.py). Our sensitivity analysis showed that the resulting QUIZ-C networks were largely robust to the choice of quantiles used under various regimes and model choices (Additional File 1: Supplementary Methods), and different quantiles yielded similar network topologies, as quantified by the edge overlap (Additional File 1: Fig. S2).
Calculating the PAthophenotypic COngruity Score (PACOS)
We obtained a comprehensive list of disease signatures curated from the Gene Expression Omnibus (GEO) through the Enrichr web server [28], namely the “Disease_Perturbations_from_GEO_up” and “Disease_Perturbations_from_GEO_down” gene set libraries (https://maayanlab.cloud/Enrichr/#libraries, accessed September 2020). These gene sets contain genes induced (“up”) and repressed (“down”) in 569 human disease gene sets, representing a broad array of diseases. Overall, these 569 gene sets comprise 21,787 genes, covering the majority of the protein-encoding genome.
To measure the degree of local agreement in terms of the direction of effect between a perturbation signature and a disease signature in a given cell line, we define the signature congruity score (SCS) as
where pcup, pcdown, dup, and ddown are the sets of up and down genes connected to perturbagen p in the QUIZ-C/MODZ/CD network for cell type c and the sets of up- and down-regulated genes for disease d, respectively.
We similarly define the \(SCS\) between an input gene signature and a disease signature as
where gup, gdown, dup, and ddown are the sets of up- and down-regulated genes for input gene set g and disease d, respectively.
To differentiate between the case where \(SCS\) is zero due to the congruent (same direction) and incongruent (opposite direction) effects canceling each other out, and the case where \(SCS\) is zero simply due to zero overlap between the input/perturbagen signature and the disease signature, we calculate \({SCS}_{p,d}^{c}\) and \({SCS}_{g,d}\) only if the perturbagen signature and disease signature (or input gene signature and disease signature) have at least one common gene. In the case of no overlap between the input/perturbagen signature and the disease signature, we assign a null value (NaN in Python) to \(SCS\).
\(SCS\) values are, thus, bound to be in the range [− 1, 1] where − 1 and 1 represent complete incongruity and congruity of the given perturbagen signature/input gene set with the given disease signature, respectively. Complete congruity and incongruity are hard to achieve since disease signatures have on average a few hundred up- and down-regulated genes whereas the number of genes affected by a perturbagen or the input gene sets can be much smaller (Fig. 2C).
Expanding the signature congruity concept to all 569 disease signatures in our dataset, we generate pathophenotypic congruity profiles (PCP) for the given perturbagen p or input gene set g, which are vectors of length \({n}_{d}=569\) (including null values)
and
For each cell type-specific CMap-based gene-perturbagen network, we precompute \({PCP}_{p}^{c}\) values for all perturbagens within that cell type. This computation is performed only once per network and stored as a binary file (see https://github.com/r-duh/Pathopticon/ for file location) to speed up downstream candidate prioritization and benchmarking steps and reduce computational burden.
Finally, we define the pathophenotypic congruity score (PACOS) between input gene set g and perturbation p in cell line c as the Spearman correlation between \({PCP}_{p}^{c}\) and \({PCP}_{g}\)
where \({r}_{{PCP}_{g}}\) and \({r}_{{PCP}_{p}^{c}}\) denote the corresponding rank vectors of \({PCP}_{g}\) and \({PCP}_{p}^{c}\), \(cov({{r}_{{PCP}_{g}},r}_{{PCP}_{p}^{c}})\) denotes the covariance of the rank vectors, and \({\sigma }_{{r}_{{PCP}_{g}}}\) and \({\sigma }_{{r}_{{PCP}_{p}^{c}}}\) denote the standard deviations of the rank vectors. Hence, \({PACOS}_{g, p}^{c}\) is a measure of how well a perturbagen’s global effect across a wide range of pathophenotypic signatures enhances or represses that of an input gene signature. To ensure that the Spearman rank correlation yields reliable results, we calculate \({PACOS}_{g, p}^{c}\) only if the number of diseases that are common between the input gene signature and the perturbagen-cell type pair is above a threshold, which is set to 10 by default.
Pathopticon: an analytical framework for drug prioritization that integrates CMap-based gene-perturbagen networks with PACOS and tool scores
In the Pathopticon framework (Fig. 1), the \({PCP}_{g}\) vector of a given input gene signature g is compared across all cell types and perturbagens to \({PCP}_{p}^{c}\), resulting in \({PACOS}_{g, p}^{c}\) values for all perturbagen-cell type pairs. These \({PACOS}_{g, p}^{c}\) values are then integrated with tool scores of perturbagen p to generate a ranked list of perturbagen-cell type pairs, the top ranks of which offer potential viable therapeutic candidates with high selectivity that target the input gene signature. We scale the tool scores between − 1 and 1 to be compatible with PACOS scores and then impute the missing values using the median value across all perturbations. We then collapse all drug-target pairs for a given drug to a single value by selecting the highest selectivity/tool score for that drug (see Additional File 1: Supplementary Methods, and Additional File 2).
Since we hypothesize that incorporating binding selectivity information will increase candidate drug prediction performance for a given input gene signature g, we determine our final prioritization rankings based on a weighted average of PACOS and tool scores
However, using tool score by itself inherently limits the number of drugs that can be highly prioritized since not all drugs in the CMap-based networks have bioactivity information available on ChEMBL (see Additional File 1: Supplementary Methods). Therefore, the drugs that do not have this information are pushed down the ranks if tool scores are weighed heavily, effectively constraining the number of drugs that can be ranked highly. To counter this bias, we give more weight to PACOS when combining PACOS and tool scores with the primary objective of expanding the pool of therapeutic candidates that can be top-ranked. In particular, to address the tradeoff between expanding the pool of drugs to prioritize and getting a high prediction performance, we take a heuristic approach in determining how to weigh each component in the final combined score. For different values of the ratio of PACOS to tool score, r, we calculate the median AUROC over all benchmark signatures and simultaneously record the number of unique drugs highly ranked in the top 50 ranks across all benchmarks to represent the diversity of the perturbagen space explored in the prioritization. Varying r over a range of values from 0 to 6, we choose a value that retains a high value of unique drugs in the top ranks while not compromising a high AUROC. Based on this heuristic test, we set a combined score ratio of r = 2 where the number of unique drugs prioritized modestly plateaus while the corresponding AUROC is not substantially different from the maximum AUROC achieved for both enhance and repress modes (Additional File 1: Fig. S3A-B).
Benchmarking and cell type-specific performance assessment of Pathopticon
As the input signatures for our benchmark, we used 73 gene sets from the Molecular Signatures Database (MSigDB) [29], all annotated as “UP” and “DOWN,” which have 100 or more “true positives” (i.e., drugs significantly targeting the gene set with a false discovery rate (FDR) less than 0.05) in the Comparative Toxicogenomics Database (CTD) [30] (Additional File 1: Table S1, Additional File 1: Supplementary Methods). Since the number of perturbagens in each cell type is highly heterogeneous (Additional File 1: Fig. S4B), we devised, and applied on each benchmark signature, the following cell type-specific performance assessment: We first ranked the \({PACOS}_{g, p}^{c}\) (or \({\text{combined score}}_{g,p}^{c}\), depending on which model was chosen) values separately within each cell line c. We then mapped the true positives (see Additional File 1: Supplementary Methods for details) corresponding to the input gene signature on this list of perturbagens ranked by \({PACOS}_{g, p}^{c}\) to determine the true positive rate (TPR) and false positive rate (FPR) values. We used the TPR and FPR values to plot the receiver operating characteristic (ROC) and finally measured the area under the ROC curves (AUROC). This procedure resulted in AUROC values for each cell type given the input gene signature. Next, to assess whether these AUROC values are statistically significantly higher than random expectation, we generated AUROC curves that correspond to permuted rankings for N = 400 random permutations. Finally, we calculated empirical p-values such that
where \({PACOS}_{g, p(random)}^{c}\) is the score of the randomized instance. This procedure allows a nested prioritization in which each cell line is ranked based on its corresponding \({p}_{emp}^{c}\) value and each perturbagen within that cell line is ranked based on its \({PACOS}_{g, p}^{c}\) value.
Since our method is intended as a tool to recommend cell type and compound combinations given an input gene signature, when comparing AUROCs for our benchmark, we used the AUROCs of the best-performing cell lines, i.e., the maximum AUROC value across all cell lines. Following this approach, we calculated the AUROC for each input signature for (i) PACOS ranking in both repress and enhance mode, (ii) combined (PACOS + Tool) ranking in both repress and enhance mode, and (iii) tool score ranking only. As an additional, network-agnostic negative control, we calculated AUROCs by randomizing the drug rankings 400 times, for which we used the best AUROC value among the random instances in the benchmark. Overall, we had 14 and 15 sets of AUROCs across different prioritization and network building methods for performance comparison in our intrinsic and extrinsic benchmarks, respectively. We note that tool scores also take into account the best tool score for a drug-target pair (Additional File 1: Supplementary Methods), implying that the corresponding AUROC value for the tool score rankings reflect the best possible AUROC. Thus, further consistency and fair comparison in the benchmark was attained by ensuring the comparison of the best results from each method.
Running Pathopticon for our proof-of-concept studies
For our first proof-of-concept, we used differentially expressed genes (p < 0.01) for subclinical atherosclerosis (SA) obtained from [31] as “up” and “down” gene signatures and ran Pathopticon with the following parameters: Model (PACOS + Tool combined); Effect: Repress; r value: 2.0; Number of randomizations: 400.
For our second proof-of-concept, we used time-course proteomic measurements, described in detail in [32], to identify two gene signatures for vein graft disease, corresponding to early- and late-stage disease. As a proxy of early (late) disease, we used proteins measured on day 1 and day 3 (day 7 and day 28). We took the union of all proteins with false discovery rate (FDR) < 0.05 and fold-change |FC|> 2, at days 1 and 3 and days 7 and 28, respectively, for early and late signatures. We then used these as “up” (FC > 2) and “down” (FC < 0.5) genes and ran Pathopticon with the following parameters: Model (PACOS + Tool combined); Effect: Repress; r value: 2.0; Number of randomizations: 400.
In vitro validation methods
An in vitro model of proinflammatory macrophages relevant to cardiovascular inflammation was used to test drug predictions. U937 cells were purchased from ATCC and cultured using RPMI + 10% FBS. Primary human macrophages were derived from peripheral blood mononuclear cells sourced from buffy coats of healthy adults as previously described [32]. After 10 days of culture, cells were differentiated into macrophages. Both types of cells were exposed to 10 ng/mL of lipopolysaccharide, LPS (PeproTech), and co-incubated with either 0.1% (in culture media) of 1 × phosphate buffered solution, PBS (ThermoFisher), containing 4 ppm of dimethylsulfoxide (DMSO) or one of the identified drugs. RNA extraction (GE illustra RNAspin Mini, Cat# 25–0500-70) followed by gene expression studies using real-time qPCR (Standard BioTools Inc., Fluidigm BioMark HD) with TaqMan Gene Expression Assay probes, FAM (ThermoFisher) were employed. A sample set of unstimulated cells (PBS only, no LPS) was included to represent the baseline non-activated state. GAPDH and RPLP0 were used as endogenous controls to normalize gene expression data. Two-tailed Mann–Whitney U tests were employed to assess the differences between treatments and controls. Pathway enrichment analysis was done by performing two-tailed Fisher’s exact tests on pathway data from the ConsensusPath Database [33] and adjusting the p-values for multiple testing using the Benjamini–Hochberg procedure.
Overview of the public datasets used in the study
We used the Level 4 LINCS-CMap data [18] to build cell type-specific gene-perturbagen networks. To build the global disease-gene network, we downloaded disease gene sets from the Enrichr web server [28]. To build the MODZ-based gene-perturbation networks, we used LINCS Canvas data retrieved from the Harmonizome database [34]. We also used the Harmonizome database to obtain gene set libraries for differentially expressed genes for kinase perturbations, differentially expressed genes for transcription factor perturbations, and GTEx tissue-specific gene expression. To build CD-based gene-perturbation networks, we used L1000CDS2 data obtained through the L1000CDS2 web API [35]. We used the structural reducibility algorithm as described in [36]. The Cellosaurus database [37] and the CLASTR tool [38] were used to calculate single tandem repeat (STR) similarity. We used the Cell Line Ontology (CLO) [39] to calculate the ontological similarity of cell lines. Gene and protein identifier conversions were made using the HUGO Gene Nomenclature Committee website [40]. We used bioactivity data from the ChEMBL database [41] to calculate selectivity and tool scores. Target gene expression data from the Comparative Toxicogenomics Database [30] was used in our benchmarks to identify the true positive compounds for each benchmark gene set. We used data from the Drug Repurposing Hub [42] and the Drug-Gene Interaction Database (DGIdb) [43] to quantify the enrichment of known drug targets in CMap-based gene-perturbagen networks. Multiscale Interactome [22] and DeepDRK [26] data were used to compare our approach to these interactome- and deep learning-based approaches. RNA-seq data from patients with subclinical atherosclerosis [31] and proteomic data from a preclinical study of vein graft disease in mice [32] were used as input to Pathopticon for our first and second proof-of-concept, respectively. Please refer to the corresponding sections in the “Methods” section and Additional File 1: Supplementary Methods for details on these data and how they were used.
Results
A gene-centric approach to building cell type-specific gene-perturbation networks
The recent large-scale CMap powered by the L1000 assay (referred to as LINCS-CMap or CMap 2.0) [18] contains gene expression measurements across the transcriptome in response to thousands of small molecule perturbations (“perturbagens”). These measurements were taken in a wide array of cell lines and over varying numbers of experiments (“instances”) encompassing different doses, time points, and replicate numbers. As a feature of this multidimensionality and heterogeneity, the LINCS-CMap data offer unique opportunities and challenges in building global, cell type-specific gene-perturbagen networks for computational drug repositioning and discovery. However, efforts aimed at building these types of networks, while addressing the potential data biases adequately, remain limited. Notably, two recent approaches that aim to identify differential gene-perturbagen pairs, namely the Moderated Z-score (MODZ) [18] and the Characteristic Direction (CD) measure [35, 44], have sought to form a weighted consensus among experimental replicates in LINCS-CMap. As such, they can be regarded as “perturbagen-centric” approaches that can be extended to build cell type-specific gene-perturbagen networks.
To address the challenges in reliably defining cell type-specific gene-perturbation networks while exploiting the full extent of the next generation CMap, we used LINCS-CMap Level 4 data. Level 4 data retain the information on individual measurements (i.e., instances), which may be for different doses, time points, or technical replicates, without collapsing them to a single value per perturbagen (Fig. 2A; see Additional File 1: Supplementary Note 1). We devised a gene-centric network-building strategy, which we call QUantile-based Instance Z-score Consensus (QUIZ-C), that identifies the perturbagens that (i) differentially and (ii) consistently modulate the expression of each gene (Methods). Briefly, for each gene, we assigned a “perturbagen z-score,” \(pZS\), to each perturbagen instance against the background of all perturbagen instances on that gene (Fig. 2B). The fraction of genes that might have few or no significant perturbagens affecting them, a condition that might bias the \(pZS\) values, was between 0.05% and 1.5% across all cell lines (Additional File 1: Tables S2-3, Additional File 1: Fig. S4; see Additional File 1: Supplementary Note 2). The majority of perturbagens in LINCS-CMap have multiple experimental instances (Additional File 1: Fig. S1), which permits variability between the \(pZS\) values of a perturbagen, primarily due to experimental replicates and, to a limited extent, due to different doses and time points (Additional File 1: Fig. S5; see Additional File 1: Supplementary Note 1). To prevent spurious gene-perturbagen associations and capture high-confidence gene-perturbagen pairs, we required a sufficient number of differential instances of a perturbagen to view it as significantly affecting a gene. To determine this number, we defined a flexible consensus threshold based on the \(pZS\) distributions (Fig. 2C). Perturbagens that passed this flexible consensus threshold were considered differentially affecting the expression of the gene in question (Fig. 2C, D). Repeating this procedure for all genes and perturbagens in every cell line and aggregating all the differential gene-perturbagen links (Fig. 2D), we built global cell type-specific gene-perturbagen networks (Additional File 1: Fig. S6, Additional File 1: Tables S4-5; see Additional File 1: Supplementary Note 3 on the topological properties of these networks). To enable a direct comparison with our cell type-specific gene-perturbagen networks (henceforth, referred to as QUIZ-C networks) in our downstream analyses, we also built cell type-specific gene-perturbagen networks using the MODZ and CD approaches (Additional File 1: Supplementary Methods).
QUIZ-C networks reflect the biological diversity of LINCS-CMap cell lines
QUIZ-C networks include a large proportion of the small-molecule perturbations tested on the CMap cell lines, with 41 out of 70 cell type-specific networks having over 80% of the perturbagens tested in each respective cell line (Fig. 3A). We observed that the largest hub perturbagens (i.e., perturbagens modulating the expression of a large number of genes) are generally cell type-specific and are not shared by many cell lines, whereas perturbagens with fewer targets can be present in a large number of cell lines (Fig. 3B). By contrast, genes targeted by many drugs were observed in a large number of cell lines, while genes targeted by fewer drugs were observed in fewer cell lines (Pearson’s r = 0.62) (Fig. 3C). These findings on hub drugs and targets were recapitulated in CD and MODZ networks (Additional File 1: Figs. S7A-D), indicating that this feature is independent of the specific network-building approach and, rather, reflects the underlying biological differences between distinct cell lines and their transcriptional responses to small-molecule perturbations. Since the L1000 assay relies on the 978 landmark genes to infer the expression of the remaining genes, we tested the enrichment of landmark and inferred genes in our networks and compared their in-degree distributions (Additional File 1: Fig. S8; see Additional File 1: Supplementary Note 3). In 13 cell lines, landmark genes were statistically overrepresented (Additional File 1: Table S6). In terms of individual gene-perturbagen interactions, we found little overlap between pairs of cell types (Additional File 1: Fig. S9; see Additional File 1: Supplementary Note 3). In terms of the direction of effect (i.e., up- or down-regulation of the gene by the perturbagen) of the overlapping edges across cell lines, we observed a high level of agreement between cell lines in QUIZ-C networks, which was supported by MODZ networks and, to some degree, by CD networks (Additional File 1: Fig. S9F). This observation suggests that gene-perturbagen pairs that are conserved across cell lines tend to have the same directional effect.
Properties of QUIZ-C networks. A The proportion of LINCS-CMap drugs tested in each cell line that was in the QUIZ-C network of that cell line. B The number of cell type-specific QUIZ-C networks a given drug appears in, as a function of the mean normalized out-degree of each drug across all QUIZ-C networks. Every circle represents a drug, and circle size is proportional to the standard deviation of the normalized out-degree. C The number of cell type-specific QUIZ-C networks a given gene appears in, as a function of the mean normalized in-degree of each gene across all QUIZ-C networks. Every circle represents a gene, and the circle size is proportional to the standard deviation of the normalized in-degree. D Reducibility of QUIZ-C networks. Left panel shows the relative entropy quality function \(q(\cdot)\) of each hierarchical clustering branch, with the maximum value of \(q(\cdot)\) corresponding to the optimal configuration of the aggregation of layers. Right panel shows the hierarchical clustering dendrogram and the optimal clustering threshold. Cell lines belonging to the same cluster are in the same color, and clusters consisting of only one cell line are shown in light grey
To supplement our findings on the structural uniqueness of QUIZ-C networks, we used a recent method that leverages information theoretic measures to quantify the degree of information redundancy (or “reducibility”) in multilayer networks (Additional File 1: Supplementary Methods) [36]. QUIZ-C, MODZ, and CD networks can each be represented as a multilayer network in which every layer is a cell type-specific network. Within this framework, QUIZ-C networks were largely non-redundant with a reducibility metric \(\chi\) of 0.32 (Fig. 3D), whereas MODZ and CD were highly reducible with reducibility values of 0.93 and 0.73, respectively (Additional File 1: Fig. S10). This finding suggests that, while most cell lines in MODZ and CD can be grouped together in terms of their topological similarity, the same cannot be said about QUIZ-C networks. Given the genetic and ontological diversity of the cell lines involved, which we have verified by quantifying their single tandem repeat (STR) profile similarity (Additional File 1: Fig. S11) and Cell Line Ontology (CLO) similarity (Additional File 1: Fig. S12) (Additional File 1: Supplementary Methods), this finding suggests that QUIZ-C networks remain sensitive to and thereby reflect differences among cell lines, a property that has important implications in disease-specific (rather than endophenotype-specific) drug development.
Overall, these results suggest that QUIZ-C networks have sufficient representation of drugs tested in CMap, are distinct in terms of their strongest perturbagens, capture non-overlapping sets of perturbagen-gene pairs, and contain non-redundant information between cell types. These findings support the cell type-specificity of the QUIZ-C networks and suggest that they can serve as a good substrate on which to carry out in silico drug prediction.
Drug repurposing and discovery potential of QUIZ-C networks
Using the Drug Repurposing Hub (DRH) [42], we inquired about the mechanism of action and clinical phase of drugs in LINCS-CMap-derived networks (Additional File 1: Supplementary Methods). More than 50% of drugs in 50 out of 70 QUIZ-C networks had a known mechanism of action, and these percentages were similar in MODZ and CD networks (Additional File 1: Fig. S13A). Drugs with a known mechanism of action were over-represented in QUIZ-C networks, with the exception of the nine “core” cell lines used in the Touchstone dataset [18] (Fig. 4A), whereas drugs with a known mechanism of action were under-represented in the MODZ and CD networks (Additional File 1: Figs. S13B-C). These results indicate that the “core” cell lines can be mined for novel therapeutic agents, whereas the other cell lines may be used to identify drugs with repurposing potential. In terms of clinical development, we found a balanced representation of experimental drugs (drugs in the preclinical phase), investigational drugs (drugs in clinical development phases 1 through 3), and approved (launched) drugs (Fig. 4B). This distribution was also observed in MODZ and CD networks (Additional File 1: Fig. S14), suggesting that CMap-based gene-perturbation networks offer a combination of drug repurposing and novel drug discovery opportunities with drugs represented across all stages of drug development.
A The odds ratio of enrichment of drugs in each QUIZ-C network that have MoA information in the Drug Repurposing Hub. Error bars indicate 95% confidence intervals. Blue text indicates significantly under-represented, red text indicates significantly over-represented, and black text indicates neither over- nor under-represented. B Clinical development phase breakdown of the drugs in each QUIZ-C network. C The topological drug specificity of QUIZ-C. MODZ and CD networks. D The odds ratio of enrichment of known drug-target interactions in each QUIZ-C network. Error bars indicate 95% confidence intervals and cell lines in red indicate a significant enrichment
Finally, combining the number of cell lines and out-degree, we defined “topological drug specificity” as \(1/({nc}_{p}* {\langle {k}_{out,p}^{c}\rangle }_{c})\) where \({nc}_{p}\) is the number of cell lines in which perturbagen p is present and \({\langle {k}_{out,p}^{c}\rangle }_{c}\) is the mean normalized out-degree of the perturbagen over cell lines. Hence, drugs that are exclusive to a few cell types and have few target genes have high topological specificity. QUIZ-C networks had significantly higher topological drug specificity compared to both MODZ and CD networks (two-sided Mann–Whitney U test p-value < 10−12 for both) (Fig. 4C).
QUIZ-C networks have a higher enrichment of drug targets than other CMap-derived networks
We hypothesized that the degree of enrichment of known drug-target interactions would be an additional indicator of the utility of these networks for predicting novel and repurposed drugs. We used known drug-target interactions from the literature to determine the enrichment of known interactions among the gene-perturbagen edges in CMap-derived networks. To maximize coverage of known drug-target interactions, we combined the DRH and the Drug-Gene Interaction Database (DGIdb) [43] (Additional File 1: Supplementary Methods). The addition of DGIdb data increased overall coverage (Additional File 1: Fig. S15A). Forty-four percent (31/70) of QUIZ-C cell type-specific networks had a significant enrichment (two-sided Fisher’s exact p-value < 0.05) of known drug-target interactions (Fig. 4D) compared to 24% (17/70) and 18% (13/70) for MODZ and CD networks, respectively (Additional File 1: Fig. S15B-C). In terms of the difference of log odds-ratios, 73% and 72% of cell lines had a higher enrichment of literature-evidenced drug-target interactions in QUIZ-C compared to MODZ and CD, respectively (Additional File 1: Fig. S15D-E). Together, these results suggest the potential utility of QUIZ-C networks in drug discovery and repurposing.
Using pathophenotypic profiles to determine the congruity between input signatures and perturbagens
The discovery and repositioning of potentially therapeutic drugs using CMap methods have largely focused on the direct comparison of input gene signatures (which can be associated with a disease, treatment, or any other biological context) with perturbagen-induced gene signatures [9]. As such, this approach provides a local view of correlation between the two gene sets that is limited to the biological context of the input gene signature and is, therefore, agnostic to the larger network of gene-perturbagen interactions in other disease contexts. Gene-perturbagen networks derived from LINCS-CMap data, on the other hand, enable us to investigate the global effect of perturbagens on genes that are implicated in disease etiologies other than those of the input genes. Here, we exploited the fact that QUIZ-C and other CMap-derived networks can be used to study simultaneously the transcriptional response of multiple sets of disease-associated genes to a perturbagen in a cell type-dependent manner. In particular, we hypothesized that using a comprehensive network of human disease signatures (pathophenotypes) as a common basis for similarity and dissimilarity between perturbagens and input signatures will facilitate and provide additional insights into the prioritization of novel and repositioned drugs.
Here, we describe the PAthophenotypic COngruity Score (PACOS), which uses CMap-derived gene-perturbagen networks to predict small molecule and cell line combinations that can enhance or repress a given input signature. PACOS is summarized in Fig. 5 and described in detail in the “Methods” section. In brief, we define the Signature Congruity Score (SCS) to quantify the agreement between two gene signatures in terms of the direction of effect (i.e., an increase or decrease in expression). As an enrichment metric that is akin to the odds ratio or the chi-square statistic (Additional File 1: Fig. S16), SCS enables us to assign a scalar value in the range [− 1, 1] to any perturbagen signature (i.e., gene-perturbagen edges of a given perturbagen in QUIZ-C or other CMap networks)-disease signature or input signature-disease signature pair (Fig. 5A). Extending the SCS calculation to all diseases in a large-scale, multi-sample disease-gene network consisting of 569 disease signatures (Methods), we define Pathophenotypic Congruity Profiles (PCPs), which are vectors composed of SCS values (Fig. 5B). We then calculate PCPs for each perturbagen in each cell type-specific CMap-derived network as well as for the input signature. Finally, we calculate PACOS, which quantifies the similarity/dissimilarity between the PCP values of any input signature and cell line-perturbagen pair (Fig. 5B). A positive PACOS indicates an “enhancing” perturbagen whose transcriptional effect is congruent with that of the input signature, whereas a negative PACOS indicates a “repressing” perturbagen whose transcriptional effect is incongruent with that of the input signature (Fig. 5B). A sensitivity analysis demonstrated that potential correlations between the intermediary disease signatures do not affect PACOS distributions (Additional File 1: Figs. S17-18, Additional File 1: Supplementary Note 4). Finally, given an input signature, we rank PACOS values across all perturbagens in each cell line, for which we calculate the area under the receiver operating characteristic curve (AUROC) using positives obtained from the Toxicogenomics Database (CTD) [30] via a directional enrichment approach (Fig. 5C, see the “Methods” section). These AUROC values are compared with those of permuted rankings to determine an empirical p-value for each cell line. This procedure enables us to create a ranking of both cell lines and perturbagens whereby the cell lines with statistically significant AUROC values can be queried for the most enhancing/repressing perturbagens (Fig. 5C).
Overview of the PACOS algorithm. Blue squares indicate perturbagens, black hexagons indicate intermediary disease phenotypes, orange diamonds indicate input signatures, and pink circles indicate genes. A The signature congruity score, \({SCS}_{p,d}^{c}\), between each disease \(d\) and each perturbagen \(p\) in cell line \(c\) is calculated. The same procedure is repeated for the input signature \(g\) and each disease \(d\), resulting in \({SCS}_{g,d}\). B Aggregating \({SCS}_{p,d}^{c}\) and \({SCS}_{g,d}\) values for all intermediary diseases, we form the \({PCP}_{p}^{c}\) and \({PCP}_{g}\) vectors, respectively. The Spearman correlation between \({PCP}_{p}^{c}\) and \({PCP}_{g}\) yields the PACOS value for each perturbagen for the given input signature in the given cell line. C Given the PACOS ranking of perturbagens in each cell line, receiver operating characteristic (ROC) curves are generated for each cell line. This procedure is repeated for randomized ranks to yield an empirical p-value for each cell line, which can in turn be used to rank cell lines and then perturbagens within each cell line. D Workflow summarizing the ranking procedure Pathopticon uses for cell lines and perturbagens within each cell line
Benchmarking Pathopticon
To assess the prediction performance of our framework under various scenarios, we designed a benchmarking strategy that uses 73 gene signatures from the literature as input and drugs targeting each gene signature from CTD [30] (Fig. 5C, Additional File 1: Table S1; see the “Methods” section). We performed drug prioritization using the Pathopticon framework: we calculated the AUROC values for all benchmark signatures and cell lines, resulting in 73 × 70 AUROC values, and identified the cell lines with statistically significant AUROCs. Across all benchmark signatures, we observed a balanced representation of significant cell lines, pointing towards the lack of bias towards certain cell lines in our prioritization. We found that 68% and 71% of cell lines were significant (empirical p-value < 0.05) in at least one benchmark signature in both repressing and enhancing modes, respectively (Additional File 1: Fig. S19). Moreover, repressing and enhancing modes were complementary to each other with non-overlapping sets of significant cell lines.
In our benchmark, we were particularly interested in testing (i) how using QUIZ-C as the underlying gene-perturbagen network compares to using other CMap-derived networks; (ii) how PACOS compares with purely cheminformatic, yet input- and cell type-agnostic, measures; (iii) how PACOS compares with recent CMap-based prioritization methods such as L1000CDS2 [35] as well as other similar approaches such as DeepDRK [26] and Multiscale Interactome [22]; and (iv) how each of the above comparisons fare in terms of chemical diversity of the top predicted drugs. Below we present the results of these tests.
QUIZ-C networks surpass other CMap-derived networks in predictive performance within the Pathopticon framework
To compare the predictive capability of PACOS across CMap-derived networks, we focused on the best-performing cell line (i.e., the cell line with the highest AUROC value among significant cell lines) for each benchmark input signature. PACOS-QUIZ-C showed, on average, higher maximum AUROCs compared to both PACOS-MODZ and PACOS-CD in both “enhancing” and “repressing” mode; the difference was statistically significant (two-sided Wilcoxon signed-rank test p-value < 0.05) in all four comparisons (Fig. 6A). Overall, these data suggest that the underlying gene-perturbagen network derived from CMap has a non-trivial function in drug prioritization performance and that the cell type-specificity of QUIZ-C networks may play a role in improving prioritization performance in comparison to other CMap-based networks.
Benchmark results. A Boxplots of the maximum area under the ROC curves across all benchmark input gene sets for different LINCS-CMap networks and methods. Each dot corresponds to a benchmark input gene set. B Boxplots of the number of positives (known drug targets) in the top 50 predictions for QUIZ-C against L1000CDS2. C Boxplots of the maximum area under the ROC curve values for QUIZ-C against DeepDRK and the Multiscale Interactome
Integrating pharmacogenomic and cheminformatic data improves cell type-specific drug prediction performance
Despite being frequently used as reliable predictors of potential novel or repurposed small molecules [45], ligand-based measures such as binding selectivity and tool score [46] derived from bioactivity data [41] largely focus on single protein targets. As such, these measures prioritize compounds by taking as input single targets rather than multiple targets [25]—an approach that is at odds with recent bioinformatic findings that drugs have, on average, 32 targets in the proteome [47]. In this sense, these cheminformatic measures are agnostic to disease context in the form of multiple disease-associated genes. They nevertheless provide useful information on chemical binding properties of a given drug, a feature that is crucial in drug development. In our benchmarks, tool score indeed performs better than random expectation (two-sided Wilcoxon signed-rank test p-value < 0.05) and similar to PACOS (Fig. 6A). Given the drawbacks of solely relying on pharmacogenomic data in drug discovery [48, 49], we hypothesized that cheminformatic and pharmacogenomic information are complementary and that their integration will be more beneficial in drug discovery than either approach alone. To test this hypothesis, we used a heuristic measure that optimally combines PACOS with tool score (Methods, Additional File 1: Fig. S3A-B). Supporting our hypothesis, the integration of PACOS and tool score significantly surpassed PACOS alone in all three cases in both enhancing and repressing mode (two-sided Wilcoxon signed-rank test p-value < 10−12 in all cases), with the PACOS-QUIZ-C-Tool measure outperforming both the PACOS-MODZ-Tool and the PACOS-CD-Tool measures in both modes (Fig. 6A). Moreover, these observations remained valid when we used other intermediary gene sets such as kinase perturbation signatures, transcription factor perturbation signatures, and tissue-specific gene expression signatures in lieu of disease phenotypes, implying that the choice of intermediary gene sets did not have a significant effect on prediction performance (Additional File 1: Fig. S3C). The integration with cheminformatic data also increased the number of significant cell lines identified by Pathopticon (Additional File 1: Fig. S20) compared to LINCS-CMap data alone (Additional File 1: Fig. S19).
As an additional external benchmark, we compared PACOS to a state-of-the-art method that also uses LINCS-CMap data to prioritize drugs, namely L1000CDS2 [35], as well as two other highly relevant approaches, the Multiscale Interactome [22] and DeepDRK [26]. Comparison with L1000CDS2 is important since L1000CDS2 is another tool that makes full use of LINCS-CMap data and simultaneously ranks drug candidates along with the relevant cell lines. Although they do not incorporate CMap data directly, a comparison with DeepDRK and the Multiscale Interactome is important since DeepDRK is also an integrative tool that integrates gene expression and cheminformatic data and Multiscale Interactome is also a tool that incorporates multiple pathways to interpret disease mechanisms. In terms of the number of known targets captured in the top 50 candidates, L1000CDS2 performed significantly better (two-sided Wilcoxon signed-rank test p-value < 0.05) than PACOS-QUIZ-C, and PACOS-QUIZ-C-Tool significantly surpassed L1000CDS2 for both enhancing and repressing modes (Fig. 6B, Additional File 1: Table S7). Similarly, PACOS-QUIZ-C and PACOS-QUIZ-C-Tool significantly surpassed (two-sided Wilcoxon signed-rank test p-value < 0.05) both DeepDRK and the Multiscale Interactome in all but one comparison (Fig. 6C, Additional File 1: Table S7).
Together, these results demonstrate the predictive advantage of integrating cheminformatic data with LINCS-CMap data for cell type-specific drug prioritization, and suggest that QUIZ-C networks best complement cheminformatic data compared to the other CMap-based networks.
High chemical diversity of the PACOS-prioritized drugs
While the top-prioritized compounds can be selected as potential candidates for downstream validation studies, a larger set of highly ranked drug candidates can also be thought of as a screening set of compounds. In this case, it is desirable that the set of drugs chosen for screening are structurally diverse rather than similar to ensure the exploration of a larger portion of the chemical space [25, 45, 50]. We, therefore, sought to quantify the structural diversity of the top-ranked drugs for each benchmark signature. Overall, the chemical structure similarity, measured by the Tanimoto coefficient (Additional File 1: Supplementary Methods), was low for all methods with a median value between 0.2 and 0.4 (Additional File 1: Figs. S21-24), in agreement with a recent report that suggested that setting a Tanimoto structural similarity threshold of 0.2 was an effective proxy for chemical diversity [25]. QUIZ-C had significantly lower (two-sided Mann–Whitney U p-value ≤ 0.05) Tanimoto similarity (therefore higher structural diversity) in the top-50 predicted drugs than MODZ and CD for more than half of benchmark signatures, with or without tool scores (Additional File 1: Table S8, Additional File 1: Figs. S21-22). Compared to L1000CDS2, QUIZ-C had a higher number of benchmark signatures with significantly lower Tanimoto similarity for three out of four comparisons (Additional File 1: Table S8, Additional File 1: Figs. S23-24). These results suggest that the top drugs predicted by Pathopticon hold the potential for being used as cell type-aware compound screening libraries focused on a specific input signature.
Pathopticon identifies potential therapeutic candidates targeting inflammation-related pathways involved in vascular diseases
To investigate the utility of Pathopticon to guide in vitro validation studies by using disease-associated omic data, we carried out two proof-of-concept studies on vascular diseases. First, we considered a hypothesis-driven approach where researchers identify a priori the cell types to be studied based on the pathophysiology of interest. Since Pathopticon was built on gene expression data, we prioritized the use of transcriptomic data as our input data type. Using RNA-seq data from patients with subclinical atherosclerosis [31], a clinically silent disease that is highly prevalent in middle-aged individuals [51], we first identified a gene signature for subclinical atherosclerosis that is associated with circulating immune cells, which was used in Pathopticon to identify compounds capable of regulating this gene signature (Methods). We then examined the cell types that were found to be significant by Pathopticon (Fig. 7A). To match the identified subclinical atherosclerosis gene signature, we searched cell types that would match the types of immune cells used in the original study [31]. We chose U937 monocytes since infiltrating monocytes differentiate into macrophages within atherosclerotic plaques and since the inflammatory status of monocyte-derived macrophages is a well-known determinant of atherosclerotic disease progression [52]. The top compounds identified by Pathopticon to regulate the subclinical atherosclerosis disease signature in a repressing manner are shown in Fig. 7B. Using a combination of three criteria, (i) commercial availability of compounds, (ii) previously known cytotoxic profile of compounds in monocytes, and (iii) scalability of validation experiments, we chose six of these compounds for experimental validation. Of the compounds selected, simvastatin, a widely prescribed drug for patients at risk for atherosclerotic disease, has well-known anti-inflammatory pleiotropic properties [53]. Similarly, cyclosporin A, a commonly used immunosuppressive agent, potentially regulates atherosclerosis risk [54]. AG-1478 is a chemotherapeutic agent with potential protective effects in atherosclerotic mouse models [55]. However, little remains known about the contribution of Neratinib, GW-408533, and TG-101348 within the context of inflammation associated with atherosclerosis. The intermediary disease signatures that overlapped with the subclinical atherosclerosis signature for each of these top candidates were highly enriched in pathways related to IL-17 signaling, innate immunity, and toll-like receptor (TLR) signaling (Fig. 7C, Additional File 1: Fig. S25). We then built a network consisting of subclinical atherosclerosis genes, intermediary pathophenotypes and their associated genes, and the six identified drugs and their targets in the QUIZ-C networks for U937 cells (Fig. 7D, Additional File 1: Fig. S26; see the “Methods” section). Pathway genes for the identified pathways had higher eigenvector and closeness centrality in this network compared to the other genes in the network (Fig. 7E, Additional File 1: Fig. S27), indicating that these pathways may be the principal mediating axes of the regulation of subclinical atherosclerosis by the six identified drugs (Methods). Based on these findings, we designed an in vitro validation experiment to test the downstream effects of the six drugs on some of the prototypical markers of these pathways. Lipopolysaccharide (LPS) is a well-known stimulant of TLR signaling within monocytes and macrophages. We, therefore, stimulated U937 cells with LPS and measured expression of genes selected from TLR signaling pathways. We identified that five out of six compounds were able to successfully reduce LPS induced IL-1β gene expression (Fig. 7F). On the other hand, none of the compounds reduced LPS-induced STAT1 gene expression (Fig. 7F). Interestingly, simvastatin reduced the expression of both NF-kB and the gene expression of its downstream target genes S100A8 and S100A9, which were the genes with the highest centrality (Additional File 3).
A Ranking of cell lines based on their AUROC values. Dark red represents significant cell lines with empirical p-value < 0.05. B Top ten predictions based on PACOS + Tool combined score (blue circles). PACOS scores only (red circles) and tool scores only (brown triangles) are also shown. C The most highly enriched pathways identified by Pathopticon for each of the six drugs chosen for in vitro experiments. Darker colors indicate higher enrichment (i.e., lower two-tailed Fisher’s exact test p-values) and blank cells indicate not significant (p-value > 0.05). D Combined network showing the input disease (subclinical atherosclerosis) (pink nodes), the six identified drug candidates (teal nodes), the intermediary disease phenotypes (orange nodes), and the genes targeted by each drug/intermediary disease phenotype (purple nodes). Red and blue edges indicate an increase and decrease in gene expression, respectively. E Boxplots showing the closeness centrality distributions of the three identified pathways. ** indicates two-tailed Mann–Whitney U test p-value < 0.01 and *** indicates two-tailed Mann–Whitney U test p-value < 0.001. F q(PCR) results of the in vitro identified drugs for the selected panel of key pathway genes
Second, we tested the use of Pathopticon in an extended scenario, namely, to guide studies on animal models of disease that use data from diverse omic modalities. In diseases of vascular remodeling, insights into accumulated differences in protein content (such as the extracellular matrix) can be obtained using mass spectrometry-based proteomics approaches. We used proteomic data from our previous preclinical study of vein graft disease in mice [32] as the input (Methods) to identify candidate drugs that can reduce inflammatory responses of macrophages, a central immune-regulatory cell which determines disease progression. Thirteen out of the top 20 predicted drugs (top 10 from early- and late-stage disease combined) had literature evidence of clinical investigation in the context of vascular medicine), albeit not in the context of vein graft failure (Additional File 1: Table S9; see Additional File 1: Supplementary Note 5). For downstream in vitro validation, we focused on a subset of these 13 candidate compounds that were predicted for adipose stromal cells (ASCs) (Fig. 8A). ASCs are a mixed population of cells, including proliferative cells (e.g., adipose progenitor cells), inflammatory cells (e.g., macrophages), and vascular cells (e.g., endothelial cells), important in a variety of vascular diseases (see Additional File 1: Supplementary Note 6). The diseases that had a high signature overlap with vein graft disease for each of these top candidates were enriched in inflammation-related pathways such as the innate immune system, inflammatory response pathway and endogenous TLR signaling (Fig. 8B, Additional Files 4–5). We, therefore, chose human primary macrophages, an innate immune cell type, which reflect the immune cell fraction central to immune regulation within the vein graft, for our validation experiments. To investigate endogenous TLR signaling, we stimulated these primary macrophages with LPS and measured gene expression associated with NF-kB signaling downstream of TLR activation via real-time polymerase chain reaction (qPCR) (Methods). Four of the eight compounds reduced the expression of the NF-kB subunit, RelB (Fig. 8C, Additional File 1: Figs. S28-29). Of these, sunitinib also displayed decreased expression of genes downstream of NF-kB activation such as CCL2, CCL7, and S100A10. On the other hand, rosiglitazone displayed decreased expression of inflammatory genes CCL2, CCL5, CCL7, CCL8, and GBP1 but caused no changes in RelA or RelB. These data suggest that compounds identified using input proteomics data are capable of reducing gene expression of TLR signaling and its effector genes in human primary macrophages. We also observed pathways associated with oxidative damage for some of our identified compounds (Fig. 8B). PPARα, a known regulator of processes associated with oxidative damage response [56] and vein graft lesion development, was modulated using two of these compounds, GW-7647 and SB-203580 (Fig. 8C).
In vitro validation. A The eight drug candidates repressing the vein graft disease signature that were selected for in vitro validation. The signature congruity score, \({SCS}_{p,d}^{c}\), between each disease \(d\) and each perturbagen \(p\) in cell line \(c\), is shown on the y-axis and the signature congruity score for the input signature \(g\) and each disease \(d\), \({SCS}_{g,d}\), is shown in the x-axis. Each circle represents an intermediary disease phenotype. The lines indicate the estimated regression model and the shaded regions indicate the 95% confidence intervals. B The pathway enrichment of the top five disease phenotypes with the highest signature overlap with the vein graft disease signature (i.e., the leftmost and rightmost diseases on the x-axis). Darker colors indicate a higher enrichment in terms of the -log(p) value and empty cells indicate instances where the enrichment was not significant (p > 0.05). C Summary of the results of the q(PCR) experiments. Z-scores are calculated separately for each column (i.e., each gene) and centered around the reference state (i.e., the undifferentiated macrophage M(-)). Black framed cells indicate statistically significant changes (p < 0.05, two-sided Mann–Whitney U test) with respect to M(LPS) for the candidate drugs and with respect to the baseline (i.e., M(-)) for M(LPS)
Discussion
While cell lines have proven to be versatile instruments that allow for the study of multiple disease contexts [57], they are also often highly specific in terms of their transcriptional and phenotypic response to perturbations by drugs [58]. In line with this observation, our benchmarks showed the advantage of QUIZ-C over other CMap-based gene-perturbagen networks in terms of reflecting the unique transcriptional response of each cell line to perturbations.
Our signature congruity score (SCS), which is the base scoring component underlying PACOS, is not novel by itself. It is akin to established measures of association, such as the odds ratio or the chi-square statistic, as well as methods previously introduced such as XSum, XCos [59], or, more recently, EMUDRA [60], in that it is a simple metric quantifying the agreement between an input gene set and a perturbation signature [61, 62]. Characteristics of SCS that can arguably make it advantageous to these similar scores are that its values are easily interpreted as they range between − 1 and 1 and that it does not require the fold-change values of the input signature, which may not always be available to the researcher. What makes PACOS unique as a prioritization method, rather, is that candidate drug predictions are made based on the agreement of the transcriptional signatures of the drug and input signature on a large ensemble of diseases, as opposed to predictions directly based on the similarity of input and drug signatures. This feature becomes highly informative as it introduces an additional layer of information about which disease signatures are similarly affected by the perturbagen and the input signature, thereby giving mechanistic clues about specific pathways and intermediary pathophenotypes (endophenotypes) [63] targeted by the predicted candidates. A related limitation is that, while Pathopticon is designed to provide clues into which genes and pathways are potentially affected for a given disease context, it does not provide a ranking that predicts the extent to which gene expression is reduced in in vitro experiments.
Despite having an ample amount of compound bioactivity information on single targets in humans, only 12% of this portion of ChEMBL contained information about cell lines at the time these data were retrieved. We hypothesized that a way to maximize the utility of ChEMBL in the context of cell type-specific drug discovery is to integrate it with resources that have complete cellular context, such as LINCS-CMap. In favor of this hypothesis, our benchmarks showed that these resources combined results in a prediction performance that surpasses those of each resource alone.
Our study has several limitations due to its reliance on large public databases, the first one of which concerns LINCS-CMap data itself. Although Pathopticon is a generalizable computational model whose primary goal, like other CMap-based approaches, is to expedite drug discovery and repurposing, we acknowledge the recent concerns about the reproducibility within and between different CMap versions [64]. We, therefore, recommend general caution in interpreting the results and encourage biological follow-up studies for the identified drug candidates. In addition to the original LINCS-CMap study’s efforts at a diligent curation of doses and time points wherever possible [65], we demonstrate that the differential contribution of time points and doses to our framework is minimal. However, we note that if candidates have been identified using consensus methods such as ours, a careful scrutiny of the time points and doses of the prioritized drugs in the CMap data should accompany subsequent in vitro studies in the rare event that an administered dose in the LINCS dataset is non-viable. A second limitation pertains to the ChEMBL data, which, being an enormous database in which the level of annotation of compounds varies widely, is not without data quality issues and possible erroneous entries, such as negative nanomolar values [66]. Diligent curation and data cleaning efforts are, therefore, warranted in studies utilizing this resource. Yet, another limitation is that of the landmark (detected) and inferred genes in LINCS-CMap: the degree distributions of our networks were skewed towards landmark genes, which may be due to non-landmark genes having less consistent patterns in replicates.
While LINCS-CMap’s cell line coverage is remarkable for a high-throughput genomic resource, it is still a limited subset of cell types available to researchers, which may lead to differences with the cell types on which omics data were generated. Should this discordance prevail, one can consider leveraging heterologous cell types that may colocalize in the same organ with the cell type of primary interest. For example, distinct cell types have been shown to act in concert in vascular diseases via heterologous cell–cell interactions mediated by metabolic exchange [67]. Therefore, compounds that target “biochemically adjacent” cell types can be used in lieu of the available cell type. Alternatively, single-cell RNA-seq can help shed light on similar cell subpopulations, leading to the identification of “transcriptionally homologous” cell types to the LINCS-CMap cell lines. Finally, since the cell types available in LINCS-CMap are a mixture of immortalized cell lines, primary cells, and end-differentiated cells, it is important to be cognizant of the challenges in interpreting the perturbagen signatures in these distinct and often vastly different genomic contexts.
The mechanisms of action of the majority of the top-ranked drugs predicted by Pathopticon in our use case were supported by previous investigations in the context of vascular disease. By contrast, three predictions appeared to be relevant but effective in the opposite direction, and four predictions were approved for other indications not directly relevant to vascular disease (see Additional File 1: Supplementary Note 5 for an extended discussion and more references). While we gave priority to a subset of the drugs with literature evidence to validate in vitro, this latter set of drugs may have repurposing potential and warrants a closer look at their mechanisms of action and tissue-specific targets. Furthermore, we observed additional pathways of interest in our analysis in both atherosclerotic and vein graft disease datasets. We focused on TLR signaling by stimulating cells with LPS, a ligand for TLR4. Future studies could take complementary approaches to utilize appropriate stimuli (such as interferon γ or complement 5a) to investigate pathways associated with interferon signaling or complement activation respectively. Of note, regulation of NF-κB activation by our compounds in both validation approaches is also observed downstream of IL-17 signaling, which was identified as a significantly enriched pathway. We also observed pathways associated with scavenger receptors, a key functional pathway of lipid handling by macrophages. Future studies can evaluate the capacity of these compounds to regulate unique signaling pathways relevant to macrophage functionality within the context of vascular diseases. In our validation studies, not all measured genes were found to be modulated by the identified compounds. However, this could be due to the activation of inflammatory pathways by our chosen stimuli and the timeframe of regulation of pro-inflammatory, anti-inflammatory, and pro-resolving processes. This warrants appropriate selection of stimuli, dose, and time-point to investigate the ability of selected compounds in modulating cellular responses. Even within TLR signaling, we observed discrepancies in the expression of transcription factors and downstream effector genes. This could be attributed to the complex nature of regulation of these processes. Future studies could perform time-course studies and supplement gene expression findings with protein expression (e.g., secretion of CCL2), protein localization (e.g., nuclear translocation of NF-κB), and protein activation (e.g., phosphorylation of STAT1) to investigate further the regulatory capacity of these compounds. In both of our validation approaches, we focused on inflammation as a driver for vascular diseases. This selection showcases the applicability of our tool to investigating specific subsets (e.g., inflammation) of multifactorial diseases. Pathopticon can be readily used to test different processes involved in disease progression with careful curation of input data sets and cell types of interest. In our validation, we separately used transcriptomic and proteomic datasets from human and murine models respectively, as our input data.
Conclusions
In this study, we approach cell type-guided computational drug discovery as a two-stage prioritization task that involves the simultaneous ranking of cell lines and perturbagens. Based on our broad overview of the LINCS-CMap data, we propose a statistical consensus approach, QUIZ-C, to generate cell type-specific gene-perturbagen networks. We design a drug discovery framework, Pathopticon, that integrates these gene-perturbagen networks with diverse disease phenotypes and compound bioactivity data. Through intrinsic and extrinsic benchmarks, we demonstrate that Pathopticon surpasses relevant methods in terms of its prediction performance and the chemical diversity of its predicted candidates. We provide our framework as a user-friendly app through which users can explore each cell type-specific network, make predictions based on their omics experiments, and analyze the results.
Methodological extensions to our work might include the use of different intermediary gene sets such as transcription factors or kinase signatures, or the increased integration of bioactivity and cell type-specific multi-omic data, especially as the latter become more widely available with multi-institutional efforts such as the Trans-Omics for Precision Medicine (TOPMed) program. Different omic modalities can then be incorporated in our pathophenotypic congruity scoring schema, which would allow for the query of therapeutic agents that invoke similar signatures in terms of multiple omic types. The current advances in single-cell resolution sequencing methods and resources will undoubtedly help refine further cell-type guided drug discovery in the near future.
Data availability
The publicly available datasets used in this study are accessible via the references and links listed below:
LINCS-CMap data: [18]https://clue.io/data/ .
Enrichr data: [28] https://maayanlab.cloud/Enrichr/#libraries.
Harmonizome data: [34] https://maayanlab.cloud/Harmonizome/.
L1000CDS2 data: [35] https://maayanlab.cloud/public/L1000CDS_download/.
Structural reducibility algorithm: [36] https://github.com/KatolaZ/multired.
Cellosaurus database: [37] https://www.cellosaurus.org/.
CLASTR tool: [38] https://www.cellosaurus.org/str-search/.
Cell Line Ontology (CLO) data: [39] https://bioportal.bioontology.org/ontologies/CLO.
HUGO Gene Nomenclature data: [40] https://www.genenames.org/.
ChEMBL database: [41] https://www.ebi.ac.uk/chembl/.
Comparative Toxicogenomics Database (CTD) data: [30] https://ctdbase.org/downloads/.
Drug Repurposing Hub: [42] https://www.broadinstitute.org/drug-repurposing-hub.
Drug-Gene Interaction Database (DGIdb): [43] https://www.dgidb.org/.
Multiscale Interactome data: [22] https://snap.stanford.edu/multiscale-interactome/.
DeepDRK data: [26] https://github.com/wangyc82/DeepDRK.
RNA-seq data from patients with subclinical atherosclerosis: [31].
Proteomic data from our previous preclinical study of vein graft disease in mice: [32].
The datasets generated during the current study are available through [68] https://github.com/r-duh/Pathopticon.
References
Hopkins AL. Network pharmacology. Nat Biotechnol. 2007;25(10):1110–1.
Berger SI, Iyengar R. Network analyses in systems pharmacology. Bioinformatics. 2009;25(19):2466–72.
Zhao S, Iyengar R. Systems pharmacology: network analysis to identify multiscale mechanisms of drug action. Annu Rev Pharmacol Toxicol. 2012;52:505–21.
Hu G, Agarwal P. Human disease-drug network based on genomic expression profiles. Jordan IK, editor. PLoS One. 2009;4(8):e6536.
Guney E, Menche J, Vidal M, Barábasi AL. Network-based in silico drug efficacy screening. Nat Commun. 2016;7: 10331.
Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, et al. Prediction of drug-target interactions and drug repositioning via network-based inference. Altman RB, editor. PLoS Comput Biol. 2012;8(5):e1002503.
Cheng F, Kovács IA, Barabási AL. Network-based prediction of drug combinations. Nat Commun. 2019;10(1):1197.
Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science (1979). 2006;313(5795):1929–35.
Musa A, Ghoraie LS, Zhang SD, Glazko G, Yli-Harja O, Dehmer M, et al. A review of connectivity map and computational approaches in pharmacogenomics. Brief Bioinform. 2018;19(3):506–23.
Huang L, Garrett Injac S, Cui K, Braun F, Lin Q, Du Y, et al. Systems biology–based drug repositioning identifies digoxin as a potential therapy for groups 3 and 4 medulloblastoma. Sci Transl Med. 2018;10(464):eaat0150.
Dudley JT, Sirota M, Shenoy M, Pai RK, Roedder S, Chiang AP, et al. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci Transl Med. 2011;3(96):96ra76.
Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med. 2011;3(96):96ra77-96ra77.
Iskar M, Zeller G, Blattmann P, Campillos M, Kuhn M, Kaminska KH, et al. Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding. Mol Syst Biol. 2013;9(1):662.
Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A. 2010;107(33):14621–6.
El-Hachem N, Gendoo DMA, Ghoraie LS, Safikhani Z, Smirnov P, Chung C, et al. Integrative cancer pharmacogenomics to infer large-scale drug taxonomy. Cancer Res. 2017;77(11):3057–69.
Parkkinen JA, Kaski S. Probabilistic drug connectivity mapping. BMC Bioinformatics. 2014;15(1):113.
Lee S, Lee KH, Song M, Lee D. Building the process-drug–side effect network to discover the relationship between biological Processes and side effects. BMC Bioinformatics. 2011;12(S2):S2.
Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171(6):1437-1452.e17.
Qu XA, Rajpal DK. Applications of Connectivity Map in drug discovery and development. Drug Discov Today. 2012;17(23–24):1289–98.
Keenan AB, Wojciechowicz ML, Wang Z, Jagodnik KM, Jenkins SL, Lachmann A, et al. Connectivity mapping: methods and applications. Annu Rev Biomed Data Sci. 2019;2(1):69–92.
Iwata M, Yuan L, Zhao Q, Tabei Y, Berenger F, Sawada R, Akiyoshi S, Hamano M, Yamanishi Y. Predicting drug-induced transcriptome responses of a wide range of human cell lines by a novel tensor-train decomposition algorithm. Bioinformatics. 2019;35(14):i191–9.
Ruiz C, Zitnik M, Leskovec J. Identification of disease treatment mechanisms through the multiscale interactome. Nat Commun. 2021;12(1):1–15 Available from: https://www.nature.com/articles/s41467-021-21770-8. Cited 2023 Oct 30.
Vogt I, Prinz J, Campillos M. Molecularly and clinically related drugs and diseases are enriched in phenotypically similar drug-disease pairs. Genome Med. 2014;6(7):1–17.
Chen B, Greenside P, Paik H, Sirota M, Hadley D, Butte AJ. Relating chemical structure to cellular response: an integrative analysis of gene expression, bioactivity, and structural data across 11,000 compounds. CPT Pharmacometrics Syst Pharmacol. 2015;4(10):576–84.
Moret N, Clark NA, Hafner M, Wang Y, Lounkine E, Medvedovic M, et al. Cheminformatics tools for analyzing and designing optimized small-molecule collections and libraries. Cell Chem Biol. 2019;26(5):765-777.e3.
Wang Y, Yang Y, Chen S, Wang J. DeepDRK: a deep learning framework for drug repurposing through kernel-based multi-omics integration. Brief Bioinform. 2021;22(5):1–10. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/bib/bbab048. Cited 2023 Oct 30.
Späth J, Wang RS, Humphrey M, Baumbach J, Loscalzo J. Machine learning–based integration of network features and chemical structure of compounds for SARS-CoV-2 drug effect analysis. CPT Pharmacometrics Syst Pharmacol. 2024;13(2):257.
Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90-7 Available from: http://genome.ucsc.edu/ENCODE. Cited 2020 Dec 13.
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015;1(6):417–25.
Davis AP, Grondin CJ, Johnson RJ, Sciaky D, Wiegers J, Wiegers TC, et al. Comparative Toxicogenomics Database (CTD): update 2021. Nucleic Acids Res. 2021;49(D1):D1138-43. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gkaa891. Cited 2023 Nov 26.
Sánchez-Cabo F, Fuster V, Silla-Castro JC, González G, Lorenzo-Vivas E, Alvarez R, et al. Subclinical atherosclerosis and accelerated epigenetic age mediated by inflammation: a multi-omics study. Eur Heart J. 2023;44(29):2698–709. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/eurheartj/ehad361. Cited 2024 May 29.
Decano JL, Singh SA, Gasparotto Bueno C, Ho Lee L, Halu A, Chelvanambi S, et al. Systems approach to discovery of therapeutic targets for vein graft disease: PPARα pivotally regulates metabolism, activation, and heterogeneity of macrophages and lesion development. Circulation. 2021;143(25):2454–70 Available from: https://www.ahajournals.org/doi/abs/10.1161/CIRCULATIONAHA.119.043724. Cited 2022 Oct 30.
Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Res. 2010;39(suppl_1):D712–7.
Rouillard AD, Gundersen GW, Fernandez NF, Wang Z, Monteiro CD, McDermott MG, et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database (Oxford). 2016;2016. Available from: https://academic.oup.com/database/article/doi/10.1093/database/baw100/2630482. Cited 2020 Oct 30.
Duan Q, Reid SP, Clark NR, Wang Z, Fernandez NF, Rouillard AD, et al. L1000CDS2: LINCS L1000 characteristic direction signatures search engine. NPJ Syst Biol Appl. 2016;2(1):16015 Available from: http://www.nature.com/articles/npjsba201615. Cited 2020 Oct 30.
De Domenico M, Nicosia V, Arenas A, Latora V. Structural reducibility of multilayer networks. Nat Commun. 2015;6(1):6864 Available from: http://www.nature.com/articles/ncomms7864. Cited 2020 Dec 16.
Bairoch A. The cellosaurus, a cell-line knowledge resource. J Biomol Tech. 2018;29(2):25–38 Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC5945021/?report=abstract. Cited 2020 Dec 19.
Robin T, Capes-Davis A, Bairoch A. CLASTR: The Cellosaurus STR similarity search tool - a precious help for cell line authentication. Int J Cancer. 2020;146(5):1299–306 Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/ijc.32639. Cited 2020 Dec 19.
Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, et al. CLO: The cell line ontology. J Biomed Semantics. 2014;5(1):37 Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC4387853/?report=abstract. Cited 2020 Dec 19.
Seal RL, Braschi B, Gray K, Jones TEM, Tweedie S, Haim-Vilmovsky L, et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023;51(D1):D1003-9. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gkac888. Cited 2025 Jan 6.
Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Félix E, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47(D1):D930-40 Available from: https://academic.oup.com/nar/article/47/D1/D930/5162468. Cited 2021 Jul 31.
Corsello SM, Bittker JA, Liu Z, Gould J, McCarren P, Hirschman JE, et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nature Med. 2017;23:405–8 Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5568558/. Nature Publishing Group. Cited 2020 Dec 18.
Cotto KC, Wagner AH, Feng YY, Kiwala S, Coffman AC, Spies G, et al. DGIdb 3.0: a redesign and expansion of the drug-gene interaction database. Nucleic Acids Res. 2018;46(D1):D1068-73 Available from: https://academic.oup.com/nar/article/46/D1/D1068/4634012. Cited 2020 Dec 18.
Clark NR, Hu KS, Feldmann AS, Kou Y, Chen EY, Duan Q, et al. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC Bioinformatics. 2014;15(1):79.
March-Vila E, Pinzi L, Sturm N, Tinivella A, Engkvist O, Chen H, et al. On the integration of in silico drug design methods for drug repurposing. Front Pharmacol. 2017;0(MAY):298.
Wang Y, Cornett A, King FJ, Mao Y, Nigsch F, Paris CG, et al. Evidence-based and quantitative prioritization of tool compounds in phenotypic drug discovery. Cell Chem Biol. 2016;23(7):862–74.
Chartier M, Morency LP, Zylber MI, Najmanovich RJ. Large-scale detection of drug off-targets: hypotheses for drug repurposing and understanding side-effects. BMC Pharmacol Toxicol. 2017;18(1):1–16.
Haibe-Kains B, El-Hachem N, Birkbak NJ, Jin AC, Beck AH, Aerts HJWL, et al. Inconsistency in large pharmacogenomic studies. Nature. 2013;504:7480 2013;504(7480):389–93.
Niepel M, Hafner M, Mills CE, Subramanian K, Williams EH, Chung M, et al. A multi-center study on the reproducibility of drug-response assays in mammalian cell lines. Cell Syst. 2019;9(1):35-48.e5.
Hu Y, Bajorath J. Compound promiscuity: what can we learn from current data? Drug Discov Today. 2013;18(13–14):644–50.
Fernández-Friera L, Fuster V, López-Melgar B, Oliva B, Sánchez-González J, Macías A, et al. Vascular inflammation in subclinical atherosclerosis detected by hybrid PET/MRI. J Am Coll Cardiol. 2019;73(12):1371–82.
Libby P. Inflammation during the life cycle of the atherosclerotic plaque. Cardiovasc Res. 2021;117(13):2525–36.
Tahara N, Kai H, Ishibashi M, Nakaura H, Kaida H, Baba K, et al. Simvastatin attenuates plaque inflammation: evaluation by fluorodeoxyglucose positron emission tomography. J Am Coll Cardiol. 2006;48(9):1825–31.
Voloshyna I, Teboul I, Kasselman LJ, Salama M, Carsons SE, DeLeon J, et al. Macrophage lipid accumulation in the presence of immunosuppressive drugs mycophenolate mofetil and cyclosporin A. Inflamm Res. 2019;68(9):787–99.
Zeboudj L, Maître M, Guyonnet L, Laurans L, Joffre J, Lemarie J, et al. Selective EGF-receptor inhibition in CD4+ T cells induces anergy and limits atherosclerosis. J Am Coll Cardiol. 2018;71(2):160–72.
Kim T, Yang Q. Peroxisome-proliferator-activated receptors regulate redox signaling in the cardiovascular system. World J Cardiol. 2013;5(6):164.
Poletto E, Baldo G. Creating cell lines for mimicking diseases. Prog Mol Biol Transl Sci. 2021;181:59–87.
Niepel M, Hafner M, Duan Q, Wang Z, Paull EO, Chung M, et al. Common and cell-type specific responses to anti-cancer drugs revealed by high throughput transcript profiling. Nat Commun. 2017;8(1):1–11.
Cheng J, Yang L. Comparing gene expression similarity metrics for connectivity map. In 2013 IEEE International Conference on Bioinformatics and Biomedicine. 2013:(pp. 165-170). IEEE.
Zhou X, Wang M, Katsyv I, Irie H, Zhang B. EMUDRA: Ensemble of Multiple Drug Repositioning Approaches to improve prediction accuracy. Bioinformatics. 2018;34(18):3151.
Cheng J, Yang L, Kumar V, Agarwal P. Systematic evaluation of connectivity map for disease indications. Genome Med. 2014;6(12):540.
Samart K, Tuyishime P, Krishnan A, Ravi J. Reconciling multiple connectivity scores for drug repurposing. Briefings in Bioinformatics. 2021:22(6):p.bbab161.
Ghiassian SD, Menche J, Chasman DI, Giulianini F, Wang R, Ricchiuto P, et al. Endophenotype network models: common core of complex diseases. Sci Rep. 2016;6:27414.
Lim N, Pavlidis P. Evaluation of connectivity map shows limited reproducibility in drug repositioning. Scientific Reports. 2021;11(1):1–14.
Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. bioRxiv. 2017:136168.
Papadatos G, Gaulton A, Hersey A, Overington JP. Activity, assay and target data curation and quality in the ChEMBL database. J Comput Aided Mol Des. 2015;29(9):885.
Marcus AJ, Broekman MJ, Drosopoulos JHF, Islam N, Pinsky DJ, Sesti C, et al. Heterologous cell–cell interactions: thromboregulation, cerebroprotection and cardioprotection by CD39 (NTPDase-1). J Thromb Haemost. 2003;1(12):2497–509.
Halu A, Chelvanambi S, Decano JL, Matamalas J, Whelan M, Asano T, Kalicharran N, Singh SA, Loscalzo J, Aikawa M. https://github.com/r-duh/Pathopticon.
Funding
Research reported in this publication was supported by the National Institutes of Health under award numbers K25HL150336 and R01HL171141 (A.H.); R01HL155107, R01HL155096, R01HL166137, U01HG007690, and U54HL119145 (J.L.); and R01HL126901 and R01HL149302 (M.A.) and by the American Heart Association under award numbers AHA 957729 and AHA 24MERIT1185447 (J.L.) and EU HorizonHealth 2024 grant 101057619 (J.L.).
Author information
Authors and Affiliations
Contributions
Conceptualization: AH; data curation: AH, JTM, TA, NK; formal analysis: AH; funding acquisition: AH, JL, MA; investigation: AH, SC, JLD, methodology: AH, SC, JLD; resources: SC, JLD, MW, SAS; software: AH; supervision: AH, MA; validation: AH, SC, JLD, MW; visualization: AH; writing—original draft: AH; writing—review and editing: AH, SC, JLD, JTM, SAS, JL, MA. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Research reported in this publication was supported by Kowa Company, Ltd, Nagoya, Japan, under award number BWH A11014 (M.A.). Kowa was not involved in the study design, data acquisition, data interpretation, or the preparation of the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13073_2025_1431_MOESM1_ESM.pdf
Additional file 1. Supplementary Notes, Supplementary Methods, Supplementary Figs. S1-S29, and Supplementary Tables S1-S9.
13073_2025_1431_MOESM4_ESM.xlsx
Additional file 4. The pathway enrichment (nominal p -values) of the top five disease phenotypes with the highest signature overlap with the vein graft disease signature (i.e., the leftmost and rightmost diseases on the x-axis).
13073_2025_1431_MOESM5_ESM.xlsx
Additional file 5. The pathway enrichment (FDR values) of the top five disease phenotypes with the highest signature overlap with the vein graft disease signature (i.e., the leftmost and rightmost diseases on the x-axis).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Halu, A., Chelvanambi, S., Decano, J.L. et al. Integrating pharmacogenomics and cheminformatics with diverse disease phenotypes for cell type-guided drug discovery. Genome Med 17, 7 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13073-025-01431-x
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13073-025-01431-x