Transcriptomics and the “curse of dimensionality”: Monte Carlo simulations of ml-models as a tool for analyzing multidimensional data in tasks of searching markers of biological processes

G. J. Osmak; M. V. Pisklova

doi:10.31857/S0026898425010117

Authors: Osmak G.J.¹,², Pisklova M.V.¹,²
Affiliations:
1. Chazov National Medical Research Center for Cardiology
2. Pirogov Russian National Research Medical University
Issue: Vol 59, No 1 (2025)
Pages: 154-161
Section: BIOINFORMATICS
URL: https://innoscience.ru/0026-8984/article/view/682236
DOI: https://doi.org/10.31857/S0026898425010117
EDN: https://elibrary.ru/HCCMTU
ID: 682236

Abstract

High-throughput transcriptomic methods measure a vast number of factors of potential value to researchers. At the same time, they raise the “curse of dimensionality” problem, which places greater demands on data processing and analysis methods. In this study, we propose a new algorithm that combines Monte Carlo methods and machine learning. The algorithm reduces the feature space by highlighting the genes most likely to be associated with the disease under investigation. Our approach not only generates a set of “interesting” genes but also assigns each gene a weight that indicates its “importance”. This weight can be used in subsequent statistical analysis, visualization, and interpretation of results. We demonstrated the algorithm's performance on open transcriptomic data from patients with hypertrophic cardiomyopathy (HCM; GSE36961 and GSE1145). The analysis identified the genes MYH6, FCN3, RASD1, and SERPINA3, which agrees well with the available literature.
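
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of weighting genes by Monte Carlo simulation of classifiers, not the authors' implementation. It assumes that each simulation draws a random gene subset, trains a toy classifier on a random half of the samples, and keeps the model only if its held-out ROC-AUC passes a cutoff; a gene's weight is then the fraction of kept models that include it. The function names, the difference-of-means classifier, the cutoff of 0.85, and the synthetic data are all illustrative assumptions.

```python
import random


def roc_auc(scores, labels):
    """Rank-based ROC-AUC: chance that a random case outscores a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(train_X, train_y, test_X, genes):
    """Toy linear classifier: project samples onto the difference of class means."""
    pos_rows = [x for x, y in zip(train_X, train_y) if y == 1]
    neg_rows = [x for x, y in zip(train_X, train_y) if y == 0]
    direction = {
        g: sum(r[g] for r in pos_rows) / len(pos_rows)
        - sum(r[g] for r in neg_rows) / len(neg_rows)
        for g in genes
    }
    return [sum(direction[g] * x[g] for g in genes) for x in test_X]


def monte_carlo_gene_weights(X, y, n_iter=500, subset_size=3,
                             auc_cutoff=0.85, seed=0):
    """Weight of a gene = share of accepted models that used it (an assumption)."""
    rng = random.Random(seed)
    n_samples, n_genes = len(X), len(X[0])
    counts = [0] * n_genes
    accepted = 0
    for _ in range(n_iter):
        genes = rng.sample(range(n_genes), subset_size)  # random feature subset
        idx = list(range(n_samples))
        rng.shuffle(idx)                                 # random train/test split
        train, test = idx[: n_samples // 2], idx[n_samples // 2:]
        train_y = [y[i] for i in train]
        test_y = [y[i] for i in test]
        if len(set(train_y)) < 2 or len(set(test_y)) < 2:
            continue  # both classes are needed on each side of the split
        scores = centroid_scores([X[i] for i in train], train_y,
                                 [X[i] for i in test], genes)
        if roc_auc(scores, test_y) >= auc_cutoff:
            accepted += 1
            for g in genes:
                counts[g] += 1
    return [c / accepted if accepted else 0.0 for c in counts]


# Synthetic data (not from the article): 20 cases, 20 controls, 20 genes,
# with only gene 0 shifted upward in cases.
data_rng = random.Random(1)
y = [1] * 20 + [0] * 20
X = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(40)]
for i in range(20):
    X[i][0] += 3.0

weights = monte_carlo_gene_weights(X, y)
print("gene with the largest weight:", weights.index(max(weights)))
```

On this synthetic data the single informative gene should dominate the weights, because random subsets that contain it pass the cutoff far more often than subsets of pure noise genes. A real analysis would swap in a proper classifier and validation scheme.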

Keywords

transcriptomics, machine learning, Monte Carlo, hypertrophic cardiomyopathy, biomarkers

About the authors

G. J. Osmak

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Author for correspondence.
Email: german.osmak@gmail.com
Russian Federation, Moscow; Moscow

M. V. Pisklova

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Email: german.osmak@gmail.com
Russian Federation, Moscow; Moscow

Supplementary files

Fig. 1. Research scheme.

Fig. 2. Results of Monte Carlo simulations of classifier training. (a) Convergence of the algorithm in terms of the size of the set of most significant genes; red ticks along the x-axis mark the iterations at which the composition of this set changed. (b) Number of selected genes as a function of algorithm iteration (green line); weights of genes included in more than half of the models (red line); iterations at which the set of most significant genes changed (red vertical ticks along the x-axis). (c) Histogram of ROC-AUC values for the ML classifiers across 3000 Monte Carlo simulations. (d) Histogram of the estimated weights of genes included in at least one model.
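
The convergence tracking described in this caption can be illustrated with a short sketch. It assumes that the "most significant" set consists of genes included in more than half of the models seen so far, and that a change point is any iteration at which that set differs from the previous one. The function name `track_top_set` and the gene labels `g1` to `g5` are hypothetical; this is not the authors' code.

```python
from collections import Counter


def track_top_set(models):
    """models: one list of gene names per model, in iteration order.

    Returns the final top set, the iterations at which the top set changed,
    and the size of the top set after every iteration.
    """
    counts = Counter()
    top_set = frozenset()
    change_points = []
    sizes = []
    for n, genes in enumerate(models, start=1):
        counts.update(set(genes))  # count each gene once per model
        current = frozenset(g for g, c in counts.items() if c / n > 0.5)
        if current != top_set:
            change_points.append(n)  # composition of the top set changed
            top_set = current
        sizes.append(len(top_set))
    return top_set, change_points, sizes


# Toy sequence of models (made-up gene labels).
models = [
    ["g1", "g2"], ["g1", "g3"], ["g2", "g4"], ["g1", "g4"],
    ["g1", "g4"], ["g1", "g4"], ["g1", "g4"], ["g1", "g5"],
]
top_set, change_points, sizes = track_top_set(models)
print(sorted(top_set), change_points, sizes)
```

In this toy run the set changes at each of the first five iterations and then stays fixed, which is the kind of stabilization a convergence plot would show.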

Fig. 3. Testing the association of the selected genes on the independent GSE1145 dataset. (a) Volcano plot comparing gene expression; dot size reflects WeightML. (b) Summary table of statistics; only results significant by p-value are shown. p-valMW: p-value from the Mann–Whitney test; FDRBH: Benjamini–Hochberg correction for multiple comparisons; FDRwBH: weighted Benjamini–Hochberg correction for multiple comparisons; WeightML: gene weight reflecting its importance for the classification models, based on the Monte Carlo simulations; log2FC: base-2 logarithm of the ratio of means.
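
As a hedged illustration of the statistics named in this caption, the sketch below computes Benjamini–Hochberg adjusted p-values, a weighted variant, and log2FC. The weighted variant follows the common p-value weighting scheme: rescale the weights to mean 1, divide each p-value by its weight, then apply the ordinary procedure. Whether the article uses exactly this scheme, and how WeightML is turned into weights, is an assumption here, and the numbers are made-up toy values.

```python
import math


def bh_adjust(pvals):
    """Benjamini–Hochberg adjusted p-values (step-up, made monotone)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def weighted_bh_adjust(pvals, weights):
    """Weighted BH: rescale weights to mean 1, then run BH on p / w."""
    mean_w = sum(weights) / len(weights)
    scaled = [w / mean_w for w in weights]
    # A hypothesis with zero weight can never be rejected.
    q = [min(1.0, p / w) if w > 0 else 1.0 for p, w in zip(pvals, scaled)]
    return bh_adjust(q)


def log2_fold_change(case_values, control_values):
    """Base-2 logarithm of the ratio of group means (positive values assumed)."""
    mean_case = sum(case_values) / len(case_values)
    mean_control = sum(control_values) / len(control_values)
    return math.log2(mean_case / mean_control)


# Toy inputs (not results from the article).
pvals = [0.01, 0.04, 0.03, 0.20]
gene_weights = [0.9, 0.1, 0.6, 0.0]  # stand-ins for WeightML

fdr_bh = bh_adjust(pvals)
fdr_wbh = weighted_bh_adjust(pvals, gene_weights)
lfc = log2_fold_change([8.0, 8.0], [2.0, 2.0])
print(fdr_bh, fdr_wbh, lfc)
```

Genes with large weights get smaller adjusted values than under the unweighted correction, while low-weight genes are penalized. That is the intended effect of feeding a prior importance measure into the multiple-testing step.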