Research Article
Open access
Published on 14 August 2024
Liu, L. (2024). CSIP: Contrastive learning with Graph Neural Network enables chemical structure to image mapping. Applied and Computational Engineering, 86, 282-294.

CSIP: Contrastive learning with Graph Neural Network enables chemical structure to image mapping

Leo Liu *,1
  • 1 Georgetown Preparatory School, Maryland, USA

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2755-2721/86/20241606

Abstract

The development of advanced AI has been transforming the biomedical field. The emergence of multi-modal biomedical data such as imaging, sequencing, and other omics data has further enabled the training of AI models for complex analytical tasks. However, such data analysis remains challenging due to the difficulty of integrating information from different modalities. In this paper, we aim to address this challenge by proposing a methodology to integrate image and molecular structure data. We present the Contrastive Structure-to-Image Pretraining (CSIP) framework, which leverages a self-supervised Graph Neural Network (GNN) to encode molecules and images into a joint feature embedding space. This direct mapping between the two modalities enables a wide variety of applications, including image profiling and clustering of molecules based on their effects on cell morphology. Image profiles generated by CSIP achieved an average AUC of 0.708 on various biological activity prediction tasks, rivaling state-of-the-art approaches and outperforming some fully supervised methodologies. Further, CSIP improved the accuracy of image-molecule matching 29-fold over the random baseline after being trained on a small dataset, demonstrating its data efficiency. The code to reproduce our results can be found at https://github.com/LeoL18/CSIP.
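The abstract describes CSIP's objective only at a high level: a GNN molecule encoder and an image encoder are trained so that matched molecule-image pairs land close together in a joint embedding space. The following is a minimal numpy sketch of the general CLIP-style symmetric contrastive (InfoNCE) loss that this family of methods uses; the function name, temperature value, and batch layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def contrastive_loss(mol_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    mol_emb, img_emb: (batch, dim) arrays; row i of each is a matched
    molecule-image pair, so the correct class sits on the diagonal.
    """
    # L2-normalize so the dot product becomes cosine similarity
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = mol @ img.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # index of each row's true match

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the molecule->image and image->molecule retrieval directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each matched pair together while pushing the other in-batch pairs apart, which is what enables the image-molecule matching and retrieval applications the abstract mentions.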

Keywords

Image-based Profiling, Multi-modal Data, Graph Neural Network, Contrastive learning



Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 6th International Conference on Computing and Data Science

Conference website: https://www.confcds.org/
ISBN: 978-1-83558-583-2 (Print) / 978-1-83558-584-9 (Online)
Conference date: 12 September 2024
Editors: Alan Wang, Roman Bauer
Series: Applied and Computational Engineering
Volume number: Vol. 86
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the work published in this series (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).