Transcriptomics and the “curse of dimensionality”: Monte Carlo simulations of ML models as a tool for analyzing multidimensional data in tasks of searching for markers of biological processes

Abstract

High-throughput transcriptomic methods enable the assessment of a vast number of factors at once, which is valuable for researchers. At the same time, they give rise to the “curse of dimensionality”, which places increasing demands on data processing and analysis methods. In this study, we propose a new algorithm that combines Monte Carlo methods and machine learning. The algorithm reduces the feature space by highlighting the genes most likely to be associated with the disease under investigation. Our approach not only generates a set of “interesting” genes but also assigns each gene a weight that indicates its “importance”. This weight can be used in subsequent statistical analysis, visualization, and interpretation of results. The performance of the algorithm was demonstrated on open transcriptomic data from patients with hypertrophic cardiomyopathy (HCM; GSE36961 and GSE1145). The analysis identified the genes MYH6, FCN3, RASD1, and SERPINA3, which is in good agreement with the available literature.
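The general idea of combining Monte Carlo resampling with classifiers to weight genes can be illustrated with a minimal sketch. The abstract does not specify the models, the sampling scheme, or the weight formula, so everything below is an assumption made for illustration: a nearest-centroid classifier stands in for the ML models, each simulation draws a random gene subset and a random stratified train/test split, and a gene's weight is taken to be the share of simulations containing it whose ROC-AUC reaches a threshold. The function names, the threshold of 0.8, and the synthetic data are all hypothetical.

```python
import random
from statistics import mean


def roc_auc(scores, labels):
    """Rank-based ROC-AUC: share of (case, control) pairs ordered correctly."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def centroid_scores(X, y, train, test, genes):
    """Nearest-centroid score on the chosen genes; higher means more case-like."""
    def centroid(label):
        rows = [X[i] for i in train if y[i] == label]
        return [mean(row[g] for row in rows) for g in genes]

    c0, c1 = centroid(0), centroid(1)

    def sq_dist(row, c):
        return sum((row[g] - v) ** 2 for g, v in zip(genes, c))

    return [sq_dist(X[i], c0) - sq_dist(X[i], c1) for i in test]


def monte_carlo_gene_weights(X, y, n_iter=300, subset_size=3,
                             auc_threshold=0.8, seed=0):
    """Assumed weighting rule: weight = accepted simulations / simulations drawn."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    drawn = [0] * n_genes
    accepted = [0] * n_genes
    aucs = []
    for _ in range(n_iter):
        genes = rng.sample(range(n_genes), subset_size)  # random gene subset
        train, test = [], []
        for label in (0, 1):  # stratified 2:1 train/test split
            members = [i for i, lab in enumerate(y) if lab == label]
            rng.shuffle(members)
            cut = (2 * len(members)) // 3
            train += members[:cut]
            test += members[cut:]
        scores = centroid_scores(X, y, train, test, genes)
        auc = roc_auc(scores, [y[i] for i in test])
        aucs.append(auc)
        for g in genes:
            drawn[g] += 1
            accepted[g] += auc >= auc_threshold  # count only well-performing models
    weights = [a / d if d else 0.0 for a, d in zip(accepted, drawn)]
    return weights, aucs


# Synthetic demo (not real expression data): 20 controls, 20 cases, 12 genes;
# only genes 0 and 1 are shifted upward in cases.
data_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(3.0 if (label == 1 and g < 2) else 0.0, 1.0)
      for g in range(12)] for label in y]
weights, aucs = monte_carlo_gene_weights(X, y)
ranked = sorted(range(12), key=lambda g: weights[g], reverse=True)
print("genes ranked by weight:", ranked)
```

In this toy setting the two shifted genes should receive the highest weights, because nearly every simulation that includes one of them classifies well. Uninformative genes still get a nonzero weight whenever they happen to be drawn alongside an informative one, which is one reason such weights are better treated as a prior for later testing than as a final result.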

About the authors

G. J. Osmak

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Author for correspondence.
Email: german.osmak@gmail.com
Russian Federation, Moscow; Moscow

M. V. Pisklova

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Email: german.osmak@gmail.com
Russian Federation, Moscow; Moscow

Supplementary files

Fig. 1. Research scheme.

Fig. 2. Results of Monte Carlo simulations for training classifiers. (a) Convergence of the algorithm in terms of the size of the set of most significant genes; red dashes along the abscissa mark the moments when the composition of this set changed. (b) Growth in the number of selected genes as a function of the algorithm iteration (green line); weights of genes included in more than half of the models (red line); iterations at which the set of most significant genes changed (red vertical dashes along the abscissa). (c) Histogram of the distribution of ROC-AUC for the ML classifiers across 3000 Monte Carlo simulations. (d) Histogram of the distribution of the estimated weights of genes included in at least one model.

Fig. 3. Testing hypotheses about the association of the selected genes on the independent GSE1145 dataset. (a) Volcano plot comparing gene expression; the size of each dot reflects the gene's WeightML. (b) Summary table of statistics; only results significant by p-value are shown. p-valMW: p-value from the Mann–Whitney test; FDRBH: Benjamini–Hochberg correction for multiple comparisons; FDRwBH: weighted Benjamini–Hochberg correction for multiple comparisons; WeightML: gene weight reflecting its significance for the classification models, based on the results of the Monte Carlo simulations; log2FC: base-2 logarithm of the ratio of means.
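The quantities named in this caption can be sketched in a few lines. The version of the weighted Benjamini–Hochberg correction below follows the common p-value weighting scheme, in which weights are rescaled to average 1, each p-value is divided by its weight, and the ordinary Benjamini–Hochberg step-up adjustment is applied to the result. Whether the study uses exactly this variant is an assumption, and the function names and the example numbers are hypothetical.

```python
import math
from statistics import mean


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up, capped at 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def weighted_bh_adjust(pvalues, weights):
    """Weighted BH: rescale weights to mean 1, divide p-values by them, apply BH."""
    m = len(pvalues)
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    norm = [w * m / total for w in weights]
    # A zero weight means the hypothesis can never be rejected: treat p/w as 1.
    scaled = [min(p / w, 1.0) if w > 0 else 1.0 for p, w in zip(pvalues, norm)]
    return bh_adjust(scaled)


def log2_fold_change(cases, controls):
    """log2 of the ratio of group means (assumes positive, non-log expression values)."""
    return math.log2(mean(cases) / mean(controls))


# Made-up example: four genes, with the first gene favored by its weight.
pvals = [0.01, 0.02, 0.03, 0.5]
gene_weights = [3.0, 1.0, 0.0, 0.0]
print("BH:         ", bh_adjust(pvals))
print("weighted BH:", weighted_bh_adjust(pvals, gene_weights))
```

With equal weights the weighted procedure reduces to the ordinary one. Unequal weights shift statistical power toward the genes favored in advance, at the cost of power for the down-weighted ones.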

Copyright (c) 2025 Russian Academy of Sciences