Game-theoretic link relevance indexing on genome-wide expression dataset identifies ... - Nature .

R Coding



Biomarkers and targeted therapeutics were introduced for the early detection and clinical management of all cancer types, including CRC7,44,45. Therapeutics and cure of CRC include targeted medication in early diagnosis and chemotherapy and surgical resection in severe cases of metastases to other organs and tissues followed by medications6,8,9,10,11. Yet, recurrence of CRC in the presence of poor diagnostic measures was reportedly found to cause additional risk and reduce the life expectance of the people2,7,9,10,44,46 which can be attributed to widespread occurrences and recurrence of CRC, thus, making it one of the most dreaded diseases in the world1,2,3,5,6,46. Significant challenges were evident in successfully implementing specific biomarkers as a tool for cancer diapeutics7,47,48. Furthermore, despite several advances in cancer diapeutics, CRC continues to remain an unabated disease eventually leading to the death of the patient1,2,5,6,49. Therefore, a retrospection on our present knowledge on the factors with a prime etiological role in CRC is a must for mitigating the occurrence of CRC through targeted diapeutics.

Conventional algorithms for discovering genes of importance/biomarkers responsible for a physiological condition such as cancer rely on differentially expressed genes considered to be critical factors in the progression or manifestation of cancer condition. The conventional method identifies and prioritizes genes based on the degree of difference in expression values in cases (cancer) compared to control (normal healthy). These differentially expressed genes exhibit a high fold difference between gene expression values in cancer cases compared to normal conditions. In other words, the conventional methods convey that the genes with a greater degree of difference in expression level in disease samples than normal samples are more important than genes with a lesser degree of difference18. These methods possess many immediate challenges as the method dictates that the genes that exhibit greater fold difference in gene expression values are considered of greater importance, which may not be valid18.

Genes that initiate the cancer progression might have less fold difference in expression values than downstream effector genes, which often exhibit higher fold difference in expression15,16,17,18. Many passenger genes may exhibit greater fold difference though their contribution in the manifestation of the cancer is incidental15,16,17. On the other hand, driver genes, which are the causal factor, with comparatively lower fold difference in gene expression contribute more in the progression and advancement of the cancerous cell15,16. Also, these methods ignore the contribution of each gene in the overall gene network of cancer/case. Reassessment of diagnostic and prognostic markers for breast cancer and other cancer type were reported previously18,50. The investigators questioned the conventional approach and revealed that designating an etiological role in complex human disease conditions simply to the higher expressed genes may not be the correct methodology18.

Game theory (GT) has unlocked newer frontiers in solving various bioinformatics and computational biology challenges, from evolutionary genetics and virulence evolution modelling to high-throughput genomics data and biological networks19,19,20,21,51,52,53,54,55,56. Coalition GT on large-scale biological networks bestows estimation of the power of each gene governing biological pathways of interest and the associated etiological role in complex human health conditions19,20. GT application in quantitative evaluation of prominence of genes, by considering their relationships with others, in initiation and progression of disease condition contributed immensely in understanding the behaviors of salient genes in manifesting disease54,55. Cooperative Game theoretical approaches such as Shapley value and Banzhaf value provided valuable insight into gene expression data analysis by screening the dataset for the most relevant genes involved in the condition of interest52,53. Previously, the GT approach exhibited its ability at par with classical centrality indices in evaluating each gene by its relevance. It also emphasized the function of genes as nodes present in the periphery of a co-expression network in modulating the complex biological pathways56.

We adopted a Game Theoretic approach in this model, especially the approach of Network games. An improved GT method, LRI, was recently proposed to identify salient genes involved in cancer or other metabolic syndromes21. The LRI in this model is brought from the concept of Shapley value of cooperative game theory in networks which can be used as a relevant approach for the classification of genes21. A substantial attribute of LRI model of game theory is that it provides an innovative property-driven classification of the use of Shapley value as an index to validate and contextualize genes57,58. In microarray games, Shapley value was used to quantitatively evaluate the underlying genes involved in disease manifestation and characterise their role in gene-regulatory pathways54,56,59. LRI, on the other hand, utilizes a co-operative framework to analyze the microarray data of gene co-expression networks where genes and their connecting links play a significant role in determining the overall structure. It emphasizes that when we consider such a co-expression network, LRI can substitute Shapley value. This is because LRI focuses on the linking abilities of the genes as a suitable candidate to demonstrate the significance of the genes and is based on the position value (a link based allocation rule)21, while Shapley value is based on the Myerson value, which is a player based allocation rule54,55. LRI establishes that any network game can precisely describe the gene interactions as it considers the cooperation among genes and how the genes are connected to a network providing a comprehensive description of the genetic markers and their combined effects21.

E-MTAB-6698 is a large (meta-) dataset that comprises gene expression of colorectal tissue samples data with relevant clinical history and conditions of 1566 persons from both the biological gender. The whole-genome gene expression microarray data built on the GPL570 platform includes 121 colon samples from normal persons (as control), 1393 samples from the diseased part of the CRC patient, 37 samples from Adenoma patients, and 15 samples from patients suffering from Inflammatory bowel diseases. The overall dataset already proved its effectiveness by offering a very high degree of classification accuracy (0.997) to test the RNAseq dataset during training and modelling the disease condition. It functions as an unmatched cohort data for investigating and determining salient genes crucial for CRC development22,23.

Herein, we applied this game-theoretic LRI scoring21 on the high-throughput CRC transcriptomics dataset to identify salient genes in CRC. Contrary to the conventional approach, the Game-Theoretic Link relevance Index identifies a gene’s importance by considering genes’ contribution to overall disease manifestation.

We obtained 126 genes with a positive LRI score (LRI > 0) (refer to Supplementary File Table S1) which we referred as salient genes in the article. These 126 salient genes consist of 117 protein-coding genes, 8 non-coding RNA and 1 uncharacterized gene. Of these genes, four (4) were mapped on the X chromosome and the rest one hundred and twenty-two (122) on autosomal chromosomes. None of these 126 genes was mapped on the Y chromosome. Of these 126, the top 15 genes with the highest LRI score (Table 1) consist of one ncRNA and 14 protein-coding genes. MIR143HG and AMOTL1 scored the highest LRI score (0.01604), followed by ACTG2 (0.01587).

Table 1 LRI score of top 15 salient genes with their types and full name from nomenclature authority.

The distribution of the LRI value of 126 salient genes was analyzed to understand the nature of the data. Other genes with zero (0) LRI scores were not considered here for distribution study. A violin plot (Fig. 1A) depicts distributions of LRI values of 126 genes using density curves where the width of the curve specifies the approximate frequency of data points in that region. Quantile–quantile (Q–Q) plot (Fig. 1B) exhibits LRI data points falling in the middle of the plot and curve off in the extremities, indicating that the behaviors of the LRI data points are suggestive of high extreme values than would be expected if the data were normally distributed. The density-cum-histogram plot (Fig. 1C) describes the distribution of the LRI values of 126 genes against the count of genes (or proportion in secondary axis). All the exploratory data distribution analysis suggests that the distribution of LRI values is not normal and that disproportionate extremes of LRI values are assigned to those 126 genes.

Figure 1
figure 1

Distribution of LRI values of the salient genes. The figure exhibits the distribution of 126 salient genes. Violin plot (A) with jittered points data points and mean value, Q–Q plot (B) and histogram combined density plot (C) to show distribution LRI values of these 126 salient genes. (D) Comparison of the 124 salient genes (out of 126 Gene IDs, two Gene IDs did not map to Entrez ID) with 171 (after removing duplicates from the total 180 genes) unique cancer biomarkers from CellMarker database exhibits overlap of only two genes and 122 genes exhibits uniqueness. (E) Comparison of the 124 salient genes with 24 DEGs of CRC exhibits overlap of only two genes.

CellMarker is an enormous curated database of biomarkers, especially at the single-cell level containing more than 22,000 cell markers of different cell types, including cancer cells30. To assess the uniqueness and novelty of the result in the present investigation, we extracted information of all the known biomarkers from the CellMarker database. Information of all the markers genes for cell types including Cancer cell, Cancer stem cell, Cancer stem-like cell, Tumor endothelial cell, and Tumor-propagating cell from the CellMarker database was retrieved. All the cell type individually provided information of 180 genes, which were further reduced to 171 after removing duplicates. We mapped 126 salient genes to Entrez ID for maintaining uniformity. One hundred twenty-four genes mapped to their corresponding Entrez ID except for two Genes IDs viz, LOC100129461 and LOC400965. A comparative analysis (Fig. 1D) revealed that two genes, viz, ITGA5, and MCAM exhibited overlapped with the known biomarkers, suggesting that the information about these two genes is already present in the existing curated knowledge base cancer biomarkers. Except for these two, however, all the salient genes demonstrated no overlap with cancer biomarkers suggesting the resulting salient genes information exhibits novelty.

The conventional method of identifying genes associated with disease relies on the assumption that the greater the difference between the expression of the gene under the normal sample and the disease sample, the greater the chance that the gene is responsible for disease occurrence. The conventional method identifies the genes associated with CRC disease by isolating genes differentially expressed in the CRC sample compared to normal. The DEGs of CRC (Table 2) were compared to salient genes to ascertain the novelty in the LRI approach. The analysis (Fig. 1E) revealed that only two salient genes viz, MT1M (LRI value 0.007937), and SI (LRI value 0.007937) exhibited overlapped with the DEGs suggesting that the genes identified using the LRI approach is very much different from the conventional approach. These non-overlapping salient genes that do not overlap with the known biomarkers or with DEGs (Fig. 1D,E) present an opportunity to assess their role as diapeutics biomarkers further.

Table 2 DEGs (Differentially Expressed Genes) in the CRC dataset. The genes that exhibited adjusted p-value of ≤ 0.05 and Log2 Fold Change (LFC) ≥ 2 (two) were considered as DEG in the CRC dataset. DEGs that are also present in the list of salient genes are denoted by (*).

These non-overlapping genes present an opportunity to assess them further for their role as diapeutics biomarkers.

The 126 salient genes were found to be associated with eleven Biological Process terms falling in seven GO groups (Fig. 2; Supplementary File Table S2), exhibiting overrepresentation. The enriched BP terms include ryanodine-sensitive calcium-release channel activity (GO:0005219), wound healing, spreading of cells (GO:0044319), muscle tissue morphogenesis (GO:0060415), platelet aggregation (GO:0070527), mesenchyme morphogenesis (GO:0072132), regulation of muscle contraction (GO:0006937), regulation of cardiac muscle contraction (GO:0055117), positive regulation of blood circulation (GO:1903524), positive regulation of muscle contraction (GO:0045933), negative regulation of vascular smooth muscle cell proliferation (GO:1904706) and regulation of smooth muscle contraction (GO:0006940). Ryanodine-sensitive calcium-release channel activity results in several skeletal myopathies due to dysregulation of intracellular Ca2+ and several muscle myopathies60. Wound healing and spreading of cells process is marked by collective migration of epithelial cells in the form of coherent sheets to heal wounds61,62. During cancer, one of the most prevalent phenomena is muscle dysfunction, where patients, irrespective of tumor stage and nutritional state, are subjected to compromised muscular function63. There have been several evidence that mitochondrial dysfunction can be induced by chemotherapy, which in turn contributes to muscle atrophy64,65,66,67,68. The biological process of tumor–induced platelet aggregation has several mechanisms involved which vary from tumor cell to the other and are generally activated by the generation of tumor cell-induced thrombin69.

Figure 2
figure 2

Enrichment of the major Biological Processes (BP) associated with the 126 salient genes. (A) Represents the network of various sub-ontologies, and associated genes, (B) describes percentage terms per group for various BP that are significantly enriched in pie chart and (C) number of genes in each term with significance sign. Node size is inversely proportioned to the p-value, i.e., the lower the value, the bigger the node size and color represent a different group of terms. *Significant at p ≤ 0.05, and **Significant at p ≤ 0.001.

Furthermore, AHNAK, CASQ2, CCL2, CHRM3, FXYD6, PLN, and PRKCE genes are associated with the GO term for regulation of ion transmembrane transporter activity (GO:0032412) exhibited overrepresentation of the MF (Fig. 3A; Supplementary File Table S3). These genes may affect the progression of CRC as a consequence of perturbation of the critical process that modulates the activity of an ion transporter. This GO term demonstrated overrepresentation in a list of Cytosine—phosphate—Guanine (CpG) sites that exhibited a steady depolarization change70.

Figure 3
figure 3

Enrichment of the major Molecular Function (MF) (A) and Cellular component (CC) associated with the 126 salient genes. (A) Describes percentage terms per group for various MF that are significantly enriched, and (B) shows various sub-ontologies of CC and associated genes. Node size is inversely proportional to the p-value, i.e., the smaller the value more considerable the node size and colour represents a different group of terms.

Overrepresented CC include Sarcoplasmic reticulum membrane (GO:0033017) with three associated genes (CASQ2, PLN, and RYR3), myofibril (GO:0030016) with nine (ACTC1, AHNAK, CASQ2, FLNA, LMOD1, MYL9, RYR3, TPM1, and VCL), sarcomere (GO:0030017) with seven (ACTC1, CASQ2, FLNA, LMOD1, MYL9, RYR3, TPM1), and I band (GO:0031674) with five associated genes (ACTC1, CASQ2, FLNA, MYL9, RYR3) (Fig. 3B; Supplementary File Table S4). CASQ2 and RYR3 are common genes between the Sarcoplasmic reticulum membrane (GO:0033017) and myofibril (GO:0030016) ontology terms. The Sarcoplasmic reticulum membrane has been associated with inherited dysfunctions and deficiencies like cardiac arrhythmias71. The enzymes involved in Sarco/endoplasmic reticulum calcium transport ATPases play a crucial role in loss or reduction of colon carcinomas and apoptosis72. Different myofibrils have been found to be associated with either oncogenic or tumor suppressor roles in different cancers like lung cancer, breast cancer, prostate and CRC73.

No enrichment with respect to chromosome location was observed for the 126 salient genes. These salient genes were observed to be distributed through the genome and not tandemly in a particular affected region. This suggests that the aberrant gene expression of the genes is not a consequence of the activation of a particular region in a chromosome.

Thirty eight (38) KEGG pathway terms, viz., wound healing, spreading of cells (GO:0044319), acylglycerol catabolic process (GO:0046464), regulation of hair follicle development (GO:0051797), platelet aggregation (GO:0070527), mesenchyme morphogenesis (GO:0072132), positive regulation of cellular carbohydrate metabolic process (GO:0010676), ATP transmembrane transporter activity (GO:0005347), anion:anion antiporter activity (GO:0015301), positive regulation of blood circulation (GO:1903524), positive regulation of muscle contraction (GO:0045933), regulation of muscle contraction (GO:0006937), smooth muscle cell migration (GO:0014909), muscle filament sliding (GO:0030049), regulation of smooth muscle cell migration (GO:0014910), positive regulation of smooth muscle contraction (GO:0045987), negative regulation of vascular smooth muscle cell proliferation (GO:1904706), regulation of smooth muscle contraction (GO:0006940), sarcomere organization (GO:0045214), regulation of cardiac muscle contraction (GO:0055117), negative regulation of vascular associated smooth muscle cell migration (GO:1904753), Dilated cardiomyopathy (DCM) (KEGG:05414), ryanodine-sensitive calcium-release channel activity (GO:0005219), release of sequestered calcium ion into cytosol by sarcoplasmic reticulum (GO:0014808), regulation of cardiac muscle contraction by regulation of the release of sequestered calcium ion (GO:0010881), relaxation of cardiac muscle (GO:0055119), regulation of muscle contraction (GO:0006937), muscle filament sliding (GO:0030049), negative regulation of neurotransmitter uptake (GO:0051581), myofibril assembly (GO:0030239), muscle tissue morphogenesis (GO:0060415), cellular response to caffeine (GO:0071313), cardiac ventricle morphogenesis (GO:0003208), negative regulation of cation transmembrane transport (GO:1904063), sarcomere organization (GO:0045214), regulation of cardiac muscle contraction (GO:0055117), regulation of cardiac muscle cell contraction (GO:0086004), negative regulation of vascular associated smooth muscle cell migration (GO:1904753), and negative regulation of calcium ion transmembrane transporter activity (GO:1901020) exhibited significantly high enrichment (Fig. 4; Supplementary File Table S5). Acylglycerol catabolic process has been used as a biomarker for the diagnosis and/or prognosis of CRC, and the enzymes of acylglycerols are involved in CRC tumor growth survival and metastasis74. Changes in the cellular carbohydrate metabolic process may precede the acquisition of driver mutations, ultimately leading to colonocyte transformation. These changes may not be uniform b