Research Article
Open access
Published on 15 November 2024
Wang, X. (2024). Towards thermophilic protein stability prediction: A comprehensive study of machine learning approaches. Theoretical and Natural Science, 59, 164-173.

Towards thermophilic protein stability prediction: A comprehensive study of machine learning approaches

Xin Wang *,1
  • 1 School of Computing, Australian National University, 108 North Road, Acton, ACT 2601, Australia

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2753-8818/59/20241392

Abstract

Thermophilic proteins are of particular interest in biology because of their enhanced thermal stability, and a range of machine learning approaches have been applied to estimate protein thermal stability. This study categorizes previous research into classification and regression tasks and explores how different data representations, including tabular, sequence, and graph data, affect evaluation performance. Pipelines for prediction using each representation are described in detail. Current challenges, such as insufficient and imbalanced datasets, are discussed together with potential solutions, such as transfer learning and re-sampling methods. The study also covers model interpretability, reviewing various approaches to obtaining model explanations and highlighting that some explanations are inconsistent. This comprehensive overview provides insights into existing methodologies and suggests potential research directions and improvements.
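
As a rough illustration of two ideas named in the abstract, the sketch below turns protein sequences into a tabular representation (amino acid composition) and then balances an imbalanced label set. It is not taken from the paper: the function names, the toy sequences, and the labels are all hypothetical, and plain random oversampling is used as a simple stand-in for re-sampling methods such as SMOTE.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            out_x.append(row)
            out_y.append(label)
    return out_x, out_y


# Toy, made-up sequences (assumption); 1 = thermophilic, 0 = non-thermophilic.
sequences = ["MKKLEERVE", "MAGSTLLNQ", "MSTNQGASL", "MTSQNNGAL"]
labels = [1, 0, 0, 0]

X = [aa_composition(s) for s in sequences]   # one 20-dimensional row per protein
X_bal, y_bal = random_oversample(X, labels)  # classes now have equal counts
print(Counter(y_bal))
```

Each protein becomes a fixed-length numeric row regardless of sequence length, which is what makes tabular models applicable. In practice, oversampling should be applied only to the training split so that duplicated rows do not leak into evaluation data.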

Keywords

Protein Thermal Stability, Machine Learning, Thermophilic Proteins, Protein Sequence

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 4th International Conference on Biological Engineering and Medical Science

Conference website: https://2024.icbiomed.org/
ISBN: 978-1-83558-721-8 (Print) / 978-1-83558-722-5 (Online)
Conference date: 25 October 2024
Editor: Alan Wang
Series: Theoretical and Natural Science
Volume number: Vol. 59
ISSN: 2753-8818 (Print) / 2753-8826 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.