Research Article
Open access
Published on 10 January 2025
Xu, Y. (2025). Deep regularization techniques for improving robustness in noisy record linkage task. Advances in Engineering Innovation, 15, 9-13.

Deep regularization techniques for improving robustness in noisy record linkage task

Yichen Xu *,1
  • 1 Australian National University

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2977-3903/2025.20435

Abstract

Record linkage is essential in data integration, healthcare analysis, fraud detection, and other applications that require matching entries across datasets. Real-world data, however, is usually noisy (missing values, typos, inconsistent formatting), and this noise substantially degrades the performance of deterministic and probabilistic linkage approaches. In this paper, we combine a deep learning model with regularization techniques (dropout, weight decay, and early stopping) to improve robustness in noisy record linkage. We evaluate the approaches on open datasets that simulate real-world scenarios at different noise levels, using data augmentation to inject synthetic noise that mimics realistic input errors. The results show that regularization improves the model's performance in noisy settings, with up to 20% better accuracy and recall than unregularized models. Dropout in particular tended to generalise better by limiting overfitting to noise. These results indicate the potential of deep learning with regularization for record linkage in noisy environments, and suggest future work on additional techniques such as adversarial training and batch normalization.
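The noise-based data augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract names the noise types (missing values, typos, inconsistent formatting) but not the exact noise model, so the function names `add_typo` and `corrupt_record`, the `noise_level` parameter, the equal weighting of noise types, and the assumption that all field values are strings are hypothetical choices made here for illustration only.

```python
import random


def add_typo(value: str, rng: random.Random) -> str:
    """Apply one random character-level edit: swap, delete, or substitute.

    Hypothetical helper; the article does not specify its typo model.
    """
    if len(value) < 2:
        return value  # too short to edit meaningfully
    i = rng.randrange(len(value) - 1)
    op = rng.choice(("swap", "delete", "substitute"))
    if op == "swap":
        # Transpose the characters at positions i and i + 1.
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if op == "delete":
        # Drop the character at position i.
        return value[:i] + value[i + 1:]
    # Replace the character at position i with a random lowercase letter.
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]


def corrupt_record(record: dict, noise_level: float, rng: random.Random) -> dict:
    """Return a noisy copy of a record whose values are all strings.

    Each field is corrupted independently with probability `noise_level`
    (an assumed parameterisation, not taken from the article).
    """
    noisy = {}
    for field, value in record.items():
        if rng.random() >= noise_level:
            noisy[field] = value  # field left clean
            continue
        kind = rng.choice(("missing", "typo", "format"))
        if kind == "missing":
            noisy[field] = ""  # simulate a lost value
        elif kind == "typo":
            noisy[field] = add_typo(value, rng)
        else:
            # Simulate inconsistent formatting: upper case, spaces removed.
            noisy[field] = value.upper().replace(" ", "")
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = {"name": "john smith", "city": "canberra", "dob": "1990-01-31"}
    for level in (0.1, 0.3, 0.5):
        print(level, corrupt_record(clean, level, rng))
```

Pairing each clean record with one or more corrupted copies would give matching pairs at a controlled noise level, which is one plausible way to build the noisy evaluation scenarios the abstract describes.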

Keywords

Record Linkage, Deep Learning, Regularization, Reliability, Noisy Data

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Journal: Advances in Engineering Innovation

Volume number: Vol. 15
ISSN: 2977-3903 (Print) / 2977-3911 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).