1Chifeng Center for Disease Control and Prevention, Department of Chronic Disease, No. 70A Yulong Street, Chifeng, 024000 Inner Mongolia Autonomous Region, China
Shufeng Bi, MD, Chifeng Center for Disease Control and Prevention, Department of Chronic Disease, No. 70A Yulong Street, Chifeng, 024000 Inner Mongolia Autonomous Region, China
Jia Yu, Yang Yu, Huixia Yu, Yanfeng Jia, Lianjie Xu, Shufeng Bi. (2025). Development and Validation of a classifier for discriminating Esophageal Carcinoma patients in early stage from advanced stage. International Journal of Genetics and Genomic Science. 4(1). DOI:10.58489/2836-2306/008
© 2025 Yang Yu, this is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
esophageal cancer, advanced stages, diagnosis, miRNA, classifier, computer algorithm.
Purpose: This study aimed to construct a model, consisting of miRNAs and individual risk factors, that accurately distinguishes early-stage esophageal cancer (EC) patients from advanced-stage patients.
Methods: Differentially expressed miRNAs (DE-miRNAs) between EC cancerous and para-cancerous tissues were identified. The prognostic value of these miRNAs was investigated with Kaplan-Meier (KM) survival curves. The prognostic miRNAs and personal factors were selected by least absolute shrinkage and selection operator (LASSO) logistic regression to build the classifier. A support vector machine (SVM) algorithm was used to verify and evaluate the classifier’s accuracy and reliability. Furthermore, a nomogram was established from those factors to predict the stage status of EC and was assessed using Harrell’s concordance index (C-index), the area under the time-dependent receiver operating characteristic curve (ROC-AUC), decision curve analysis (DCA), and calibration curves.
Results: Among 23 DE-miRNAs between the cancerous (n = 111) and para-cancerous (n = 9) samples, 18 were prognosis-related. LASSO identified nine factors: hsa-let-7g-5p, hsa-miR-452-5p, hsa-miR-92b-5p, hsa-miR-139-3p, gender, M status, N status, T status, and postoperative radiation. The SVM results showed that the classifier built from these 9 factors could separate early-stage from advanced-stage EC patients with high accuracy and favorable reliability. These 9 factors were entered into the nomogram, which showed favorable discrimination, clinical usefulness, and calibration. hsa-let-7g-5p was the most significant factor for predicting the probability of early stage, followed by N status, T status, M status, and hsa-miR-139-3p.
Conclusion: We established and validated a model that discriminates early-stage from late-stage EC patients for diagnostic purposes.
Esophageal cancer (EC), with 604,000 new cases and 544,000 deaths, ranked seventh in incidence and sixth among causes of cancer death in 2020 [1]. The 5-year survival rate of EC patients remains unfavorable, ranging from 15% to 20% [1-4]. EC patients are frequently diagnosed at an advanced stage, when the histologic, molecular, and etiological heterogeneity of the disease makes it inherently resistant to systemic therapy. However, survival improves and mortality decreases significantly when EC is diagnosed at an early stage [5, 6].
MicroRNAs (miRNAs) are single-stranded, non-coding RNA molecules [7, 8]. In eukaryotes, miRNAs play an essential role in various biological processes. In addition, their abnormal expression intensifies tumor malignancy by affecting tumor differentiation, proliferation, metastasis, and apoptosis [9-12]. Many efforts have been made to explore the relationship between miRNA expression and the prognosis of EC, for example for miR-10b-3p, miR-30a-3p/5p, and miR-515-3p [13-15], as well as the related biological functions and genetic pathways, which provide evidence that miRNAs are useful targets for treating and diagnosing EC. However, most of these miRNAs were studied individually, and the findings were not reproduced by other researchers. This may be why they have not yet been applied to large-scale population screening for cancer. In the present study, we combined a group of miRNAs with demographic and clinical data to separate EC patients into two stages: early versus advanced. Based on the EC patient data downloaded from The Cancer Genome Atlas (TCGA) database, the prognosis-related miRNAs and personal factors were selected by LASSO regression to build the classifier, and the SVM algorithm was applied to verify the classification performance of the model. Furthermore, a nomogram, which shows the contribution of every factor entered into it, was established to predict the probability that a patient is in the early stage. Finally, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed on the predicted target genes of these miRNAs to further explore the molecular mechanisms relevant to EC detection.
Data Source and Preparation
Differentially expressed (DE) miRNA analysis was performed on the miRNA expression data of cancerous and para-cancerous tissues downloaded from The Cancer Genome Atlas (TCGA) data portal (https://portal.gdc.cancer.gov/) using the “limma” package in R 4.1.2. The prognostic value of the DE-miRNAs was then analyzed by the Kaplan-Meier (KM) survival method. Subsequently, LASSO regression was performed with the “glmnet” package in R 4.1.2 to select variables among those miRNAs and the corresponding personal information (https://tcga.xenahubs.net/download). An SVM model, fitted with the “e1071” package, was applied to verify and evaluate how well those factors separate early-stage from advanced-stage patients. The performance of this classifier was evaluated by C-index, ROC curve, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and kappa. Finally, the miRNA signatures and personal factors were entered into a nomogram, which shows the contribution of every factor, to predict the probability of early-stage EC. The accuracy, clinical usefulness, and calibration of the model were assessed using the C-index, ROC curve, DCA, and calibration curve.
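As a rough illustration of the Kaplan-Meier step, the sketch below computes the survival curve S(t) as the running product of (1 - deaths / patients at risk) over the observed event times. It is written in Python rather than the R environment used in the study, covers only the estimator for a single group (not the comparison between high- and low-expression groups), and the follow-up data and the function name `kaplan_meier` are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] for the KM estimator.

    times  -- follow-up time of each patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Deaths observed exactly at time t (censored patients do not count).
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical follow-up data (months); not taken from the study.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
```

Censored patients stay in the at-risk count until their last follow-up time but never trigger a drop in the curve, which is what distinguishes this estimator from a simple proportion of survivors.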
Statistical Analysis
The statistical analysis was performed by R software (version 4.1.2). p < 0.05 was regarded as statistically significant.
Identification of DE and prognosis related miRNAs
Using the screening criteria p < 0.05 and |log2(FC)| > 1, 23 DE-miRNAs (7 upregulated and 16 downregulated) between the EC cancerous group and the control group were identified among a total of 2001 miRNAs. The heatmap and volcano plot of the DE-miRNAs are shown in Figure 1A, B. Eighteen miRNAs were prognosis-related for EC patients (p < 0.05; Figure 1C).
Fig 1: Identification of DE and prognosis related miRNAs in Esophageal cancer.
(A) The heatmap of 24 potential DEs.
(B) The volcano plot of differentially expressed miRNAs (DEs) in esophageal cancer from the TCGA dataset. The red dots and blue dots represent upregulated DEs and downregulated DEs with significance (adjusted p-value < 0.05 and |log2 (FC)| > 1), respectively. The grey dots are those DEs without significance.
(C) Forest plot of K-M analysis for prognostic related miRNAs. The maroon squares on the line represent the hazard ratio (HR) and the line represents the 95% confidence interval (CI).
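The screening rule used to identify DE-miRNAs (p < 0.05 and |log2(FC)| > 1) can be sketched as a simple filter over per-miRNA test results. The example below is in Python, not the “limma” workflow used in the study; the miRNA names, fold changes, and p-values are made up, and `select_de_mirnas` is a hypothetical helper.

```python
def select_de_mirnas(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Split test results into up- and downregulated DE-miRNAs.

    results -- iterable of (name, log2_fold_change, p_value) tuples
    """
    up, down = [], []
    for name, log2_fc, p_value in results:
        # Both criteria must hold: significance and effect size.
        if p_value < p_cutoff and abs(log2_fc) > lfc_cutoff:
            (up if log2_fc > 0 else down).append(name)
    return up, down


# Hypothetical results; names and numbers are placeholders.
results = [
    ("miR-A", 2.3, 0.001),   # significant, upregulated
    ("miR-B", -1.7, 0.010),  # significant, downregulated
    ("miR-C", 0.4, 0.002),   # significant, but fold change too small
    ("miR-D", -2.1, 0.200),  # large fold change, but not significant
]
up, down = select_de_mirnas(results)
```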
Construction of classifier by LASSO method
LASSO regression was applied to select the most effective variables from the 18 miRNAs and the demographic and clinical data to construct a discriminator for stage status. hsa-let-7g-5p, hsa-miR-452-5p, hsa-miR-92b-5p, hsa-miR-139-3p, gender, M status, N status, T status, and postoperative radiation were selected by LASSO regression (Figure 2A, B). The AUC of the LASSO algorithm was 0.970 (Figure 2C).
Fig 2:
(A) The vertical dashed lines are drawn at the best log(lambda) value.
(B) LASSO coefficient values.
(C) ROC curves to evaluate the discrimination of EC patients between early and advanced stages by LASSO regression.
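To illustrate why LASSO acts as a variable selector, the following sketch fits an L1-penalized logistic regression by proximal gradient descent: after each gradient step, every coefficient is shrunk toward zero, and coefficients that fall inside the threshold become exactly zero. This is a Python toy, not the “glmnet” procedure used in the study; the data set, the penalty `lam=0.05`, and all function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(z, thr):
    """Shrink z toward zero by thr; values within [-thr, thr] become exactly 0."""
    if z > thr:
        return z - thr
    if z < -thr:
        return z + thr
    return 0.0


def lasso_logistic(X, y, lam, step=1.0, n_iter=500):
    """L1-penalized logistic regression by proximal gradient descent.

    Minimizes mean log-loss + lam * sum(|w_j|); the intercept b is not
    penalized. The step size of 1.0 suits this toy data (features are +/-1);
    other data may need a smaller step.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        grad_b = sum(resid) / n
        # Gradient step followed by the L1 shrinkage step.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


# Toy data: column 0 is associated with the label, column 1 is pure noise.
X = [[-1, 1], [-1, -1], [-1, 1], [-1, -1],
     [1, 1], [1, -1], [1, 1], [1, -1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]
w, b = lasso_logistic(X, y, lam=0.05)
```

In this toy run the informative column keeps a positive coefficient while the noise column is dropped (its coefficient is exactly zero), which mirrors how the penalty retains only part of the candidate variables.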
Verification and evaluation of the classifier by SVM algorithm
As shown in Fig 3A, the classifier consisting of these 9 factors clearly discriminated early-stage from late-stage patients. Moreover, the classifier reached an AUC of 0.981 (Fig 3B), and the sensitivity, specificity, PPV, NPV, accuracy, and kappa were 0.983, 0.979, 0.983, 0.978, 0.981, and 0.962, respectively. These results indicate that the model based on these factors can classify early-stage and advanced-stage EC patients accurately and reliably.
Fig 3:
(A) SVM classification plot. Red “#” represents EC patients in the advanced stage and Black “#” represents EC patients in the early stage.
(B) ROC curves to evaluate the discrimination of EC patients between early and advanced stages by the SVM algorithm.
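The evaluation indices reported for the classifier (sensitivity, specificity, PPV, NPV, accuracy, and kappa) all derive from a 2x2 confusion matrix. The Python sketch below shows the standard formulas. The counts are hypothetical and do not reproduce the study's data, and treating "early stage" as the positive class is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard indices from a 2x2 confusion matrix.

    tp/fn -- positive cases predicted positive/negative
    tn/fp -- negative cases predicted negative/positive
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        # Cohen's kappa: accuracy corrected for chance agreement.
        "kappa": (accuracy - p_chance) / (1.0 - p_chance),
    }


# Hypothetical counts, chosen only to give round numbers.
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
```

Kappa is lower than accuracy here because it discounts the agreement that a classifier guessing from the class proportions would reach by chance.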
Establishment and validation of the nomogram
The 9 factors were finally entered into the present nomogram. Among these factors, hsa-let-7g-5p was the most significant one for predicting the probability of patients in early stage, followed by N status, T status, M status, and hsa-miR-139-3p. All the results are shown in Fig 4A.
The discrimination of the nomogram was evaluated using the C-index and ROC-AUC. In the training and validation cohorts, the C-index was 0.960 and 0.991, respectively, and the ROC-AUC was 0.970 and 1.000 (Table 1).
Table 1: The validation of the nomogram by ROC-AUC and C-index
Index | Training cohort: Estimate | Training cohort: 95% CI | Validation cohort: Estimate | Validation cohort: 95% CI
ROC-AUC | 0.970 | 0.957-0.983 | 1.000 | 0.981-1.019
C-index | 0.960 | 0.942-0.978 | 0.991 | 0.971-1.011
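As a reminder of what a discrimination index of this kind measures, the sketch below computes ROC-AUC as a concordance probability: the fraction of (early-stage, advanced-stage) patient pairs in which the early-stage patient receives the higher predicted probability, with ties counted as one half. It is a Python illustration with made-up predicted probabilities; `pairwise_auc` is a hypothetical helper, and the study's own estimates and confidence intervals were obtained in R.

```python
def pairwise_auc(scores_pos, scores_neg):
    """AUC as the share of concordant (positive, negative) pairs."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0   # concordant pair
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of early stage.
early = [0.9, 0.8, 0.4]      # patients truly in the early stage
advanced = [0.7, 0.3, 0.2]   # patients truly in the advanced stage
auc = pairwise_auc(early, advanced)
```

Eight of the nine possible pairs are ordered correctly in this example, so the value is 8/9; a value of 1.000 would mean that every early-stage patient is ranked above every advanced-stage patient.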
The calibration curves showed that the predicted probabilities agreed closely with the observed ones in both the training and validation groups (Fig 4B, C).
The clinical validity of the nomogram was evaluated by DCA. As shown in Fig 4D, E, the nomogram achieved favorable net clinical benefit across a wide range of threshold probabilities in both the training cohort and the validation cohort.
Fig 4:
(A) Nomogram to predict the probability of early stage for EC patients. After values are assigned for the 9 factors, the total score is calculated, and the corresponding probability of early stage is read from the total points. miRNAs: 1=high expression, 0=low expression; post-radiation: 0=no radiation therapy, 1=radiation after surgery; gender: 1=male, 2=female; T stage: 1-4=T1-T4; N stage: 0-3=N0-N3; M stage: 0-1=M0-M1. As an example, a female patient with upregulated miR-let-7g-5p, miR-92b-5p, and miR-139-3p and downregulated miR-452-5p, at stage T1N0M0, who received postoperative radiation scores 300 points, and her probability of being in the early stage is over 99%.
(B) The calibration curve in the training cohort and (C) in the validation cohort. The red solid line shows the agreement between the actual and predicted probabilities without bias correction, the green dotted line shows the agreement with bias correction, and the blue solid line shows the ideal agreement.
(D) The decision curve in the training cohort and (E) in the validation cohort. The x-axis represents the threshold probability and the benefit ratio, and the y-axis represents the standardized net benefit. The horizontal black line represents the extreme assumption that no patient is in the early stage, and the gray line represents the opposite extreme that all patients are in the early stage.
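The quantity plotted in a decision curve can be sketched as follows. At a threshold probability pt, the net benefit of a model is TP/n - (FP/n) * pt / (1 - pt), and it is compared with the two reference strategies of classifying nobody (net benefit 0) or everybody as positive. The Python example uses made-up predictions; the figure reports a standardized net benefit, which is commonly obtained by dividing by the prevalence, but that convention is an assumption here and is not applied in the code.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of calling a patient positive when prob >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: every patient is called positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical labels (1 = early stage) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1, 0.1]
nb_model = net_benefit(probs, labels, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both reference strategies, as the model does in this toy example.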
GO and KEGG Analysis
Target genes were predicted with miRDB, miRWalk, and TargetScan, and the genes common to all three tools were obtained by intersection using the web tool at http://bioinformatics.psb.ugent.be/webtools/Venn/ (Table 2).
Table 2: Target gene analysis of miRNAs in prediction model
miRNA | Target genes
hsa-let-7g-5p | KLHL13, AEN, SLC10A7, HIC2, MAP3K3, SCYL3, COL1A1, KCNC1, NAT8L, CCNY
hsa-miR-92b-5p | CNTN2, ZDHHC22, KLRG1, ZDHHC3, TMEM150C, RHPN2, IMPDH1, CSMD1, VCL, ATP10B
hsa-miR-139-3p | F8A1, MMP16, CD177, KRTAP5, GRIN3A, ARHGDIB, COX16, SDC3, SLC44A4, STAG2
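Taking the common target genes from the three prediction tools amounts to a set intersection, which the Venn web tool performs graphically. A minimal Python sketch is shown below; the gene lists are placeholders and do not correspond to actual miRDB, miRWalk, or TargetScan output, and `common_targets` is a hypothetical helper.

```python
def common_targets(*predictions):
    """Return the genes predicted by every tool, in sorted order."""
    gene_sets = [set(genes) for genes in predictions]
    return sorted(set.intersection(*gene_sets))


# Placeholder predictions for one miRNA from three tools.
mirdb = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
mirwalk = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]
targetscan = ["GENE_C", "GENE_D", "GENE_F"]
common = common_targets(mirdb, mirwalk, targetscan)
```

Requiring agreement among all three tools trades sensitivity for specificity: fewer candidate genes survive, but each is supported by independent prediction methods.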
According to the KEGG analysis, the top three pathways for the target genes were hypertrophic cardiomyopathy, the Hippo signaling pathway, and Th17 cell differentiation (Figure 5A). In the GO analysis, the target genes were involved in calcium ion transmembrane transport, trans-differentiation, positive regulation of the JNK cascade, and positive regulation of JUN kinase activity (Figure 5B).
Fig 5:
(A) Biological process enrichment analysis of candidate target genes.
(B) KEGG pathway analysis was performed for the candidate target genes
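The text does not state which tool or test was used for the GO and KEGG analyses. A common approach to this kind of enrichment analysis is over-representation testing with the hypergeometric distribution, which asks how likely it is to see at least the observed number of target genes in a pathway by chance. The Python sketch below illustrates that approach under this assumption; the gene counts are made up and `enrichment_p_value` is a hypothetical helper.

```python
from math import comb


def enrichment_p_value(background, in_pathway, targets, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    background -- number of genes in the background set
    in_pathway -- background genes annotated to the pathway
    targets    -- number of target genes tested
    overlap    -- target genes that fall in the pathway
    """
    total = comb(background, targets)
    upper = min(in_pathway, targets)
    tail = sum(
        comb(in_pathway, k) * comb(background - in_pathway, targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20 background genes, 5 in the pathway,
# 4 target genes of which 2 are in the pathway.
p_value = enrichment_p_value(background=20, in_pathway=5, targets=4, overlap=2)
```

In practice the p-values of all tested pathways are usually adjusted for multiple comparisons before the top pathways are reported.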
Although early diagnosis of EC is clinically crucial, it is technically difficult, because patients often show no symptoms until the disease is advanced. The low sensitivity and specificity of traditional detection biomarkers (e.g., CEA and P53-Ab) may lead to missed and delayed diagnoses [16, 17]. Recent studies have shown that miRNAs are stable and resistant to degradation under various storage conditions [18-21], which makes them good candidates as biomarkers for diagnosis and treatment [22-25].
A previous study used miRNAs as diagnostic markers for patients with esophageal squamous cell carcinoma (ESCC) [26]. That study had a large sample size and favorable evaluation indices. A greedy algorithm was used to establish the diagnostic model, and its diagnostic ability was then evaluated by the receiver operating characteristic curve. However, no other algorithm was used to verify and evaluate the whole model, and no personal factors other than miRNAs, such as clinical and demographic factors, were considered in the model.
In the present study, the miRNA expression data of EC tumor samples and para-cancerous samples from TCGA were used to identify DE-miRNAs and the prognostic ones among them. Of the 23 DE-miRNAs, 18 had prognostic value according to the KM method. LASSO regression, which reduces multicollinearity, then selected 4 miRNAs and 5 personal factors for the classifier. The SVM algorithm showed that the model can separate early-stage from advanced-stage patients with favorable accuracy and reliability, as evidenced by the C-index, ROC-AUC, sensitivity, specificity, PPV, and NPV. Moreover, a nomogram, which not only evaluates the whole model but also shows the contribution of every factor, was established and validated. The validation indices show that the nomogram had good discrimination, clinical usefulness, and calibration. hsa-let-7g-5p was the most significant factor for predicting the probability that a patient is in the early stage, followed by N status, T status, M status, and hsa-miR-139-3p.
hsa-let-7g-5p plays an essential role in Alzheimer’s disease and cognitive impairment [27-31]. Notably, hsa-let-7g-5p has been reported to trigger chronic inflammation in elderly patients with diabetes mellitus, leading to myopathy and chronic adipose inflammation [32-34]. hsa-miR-139-3p is down-regulated in cervical cancer and HPV-16-positive head and neck cancer [35], and head and neck cancer patients with high expression of hsa-miR-139-3p also have improved overall survival (OS). Up-regulated hsa-miR-139-3p can significantly inhibit the migration, invasion, proliferation, and growth of xenografts and cervical cancer HeLa cells and may serve as a new intervention target for cervical cancer treatment [36]. It is also a key miRNA in renal cell carcinoma, breast carcinoma, ovarian carcinoma, and hepatocellular carcinoma [37-42]. Several studies suggest that hsa-miR-92b-5p may serve as a marker for cardiovascular diseases and heart failure [43-45].
N stage, T stage, M stage, gender, and postoperative radiation were also found to affect the predicted probabilities. The probability of early stage decreased as the N, T, and M stages increased. Gender was the only demographic factor that entered the nomogram. Male gender acted as a negative factor for the probability of early stage, which means that, compared with female patients, male EC patients were more likely to have a poorer outcome. The unfavorable survival of male patients might be largely attributable to unhealthy behaviors such as smoking and alcohol abuse. Patients who received radiation after surgery had a better chance of remaining free of disease progression than those who received no radiation therapy at all. As Zhang Y, Fu C, and colleagues pointed out, adjuvant radiotherapy following surgery prevents recurrence in early-stage EC patients and achieves an excellent response rate and favorable survival in advanced-stage EC patients [46, 47].
To explore the functional roles of the 4 miRNA signatures in EC, we predicted the target genes of the miRNAs and analyzed the signaling pathways and biological processes in which these target genes are involved. The results showed that the 4 miRNAs might be functionally related to the Hippo signaling pathway, Th17 cell differentiation, and several cellular biological responses to RLC. Several studies have verified that abnormal activation of the Hippo signaling pathway contributes to esophageal tumorigenesis [48-51]. Th17 cells may play a positive role in anti-EC immunity. They are characterized as IL-17-producing CD4+ T cells that also produce IL-21, IL-22, and IL-26 [52-55]; these cytokines are involved in the regulation of inflammation, cell growth, metabolism, and a variety of other physiological functions that play an important role in EC progression [56-59], which might explain the role of Th17 cells in antitumor immunity. The RLC, which is associated with miRNAs, displays a cell-independent capacity to process precursor microRNAs (pre-miRNAs) into mature miRNAs [60]. Nevertheless, functional cell studies are needed to confirm the findings of our study.
Our study has several limitations.
An external validation cohort of patients with EC was not available in the present study. Although personal factors such as demographic and clinical data were analyzed, other known risk factors for EC, e.g., smoking and drinking habits, were not available in the data used in the present study. Therefore, multicenter clinical validation is needed to evaluate the external utility of this model.
In summary, our study established and validated a model consisting of a group of miRNA signatures and personal factors to classify the early stage and the late stage of EC patients. Several biological processes and signaling pathways are unveiled to show that these miRNAs may participate in various molecular mechanisms of EC.
There was no funding from the public, commercial, or not-for-profit sectors for the present study.
Acknowledgments
The authors would like to express their gratitude to Haili Jin for making the language more fluent and to Yumao Cai for assistance with data analysis and comments that improved the manuscript substantially.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
The datasets generated and/or analyzed during the current study are available in the TCGA repository, https://www.cancer.gov/aboutnci/organization/ccg/research/structural-genomics/tcga.