A method based on multi-standard active learning to recognize entities in electronic medical record

Qiao Pan; Chen Huang; Dehua Chen

doi:10.3934/mbe.2021054

Mathematical Biosciences and Engineering

2021, Volume 18, Issue 2: 1000-1021. doi: 10.3934/mbe.2021054

Research article

School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Received: 29 September 2020 Accepted: 21 December 2020 Published: 05 January 2021

Deep neural networks (DNNs) have achieved good results in Named Entity Recognition (NER), but most DNN methods depend on large amounts of annotated data. Electronic medical records (EMRs) are text data from a specialized professional field, and annotating them requires experts with strong medical knowledge and considerable labeling time. To address the specialized medical domain, large data volume, and annotation difficulty of EMRs, we propose a new method based on multi-standard active learning to recognize entities in EMRs. Our approach uses three criteria to choose the active learning strategy: the amount of labeled data, the cost of sentence annotation, and the balance of data sampling. We put forward an uncertainty calculation and a measurement rule for sentence annotation that are better suited to neural network models for NER. We also use incremental training to speed up the iterative training during active learning. Finally, a named entity recognition experiment on breast clinical EMRs shows that the method achieves the same NER accuracy provided the samples are of the same quality. Compared with the traditional supervised learning approach of randomly selecting data to label, the proposed method reduces the amount of data that needs to be labeled by 66.67%. In addition, we propose an improved TF-IDF method based on Word2Vec that vectorizes the text while taking word frequency into account.
Keywords: electronic medical records, multi-standard active learning, uncertainty, labeled costs, strategy choice
Citation: Qiao Pan, Chen Huang, Dehua Chen. A method based on multi-standard active learning to recognize entities in electronic medical record[J]. Mathematical Biosciences and Engineering, 2021, 18(2): 1000-1021. doi: 10.3934/mbe.2021054
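
The abstract describes choosing sentences for annotation by model uncertainty and annotation cost. As a rough illustration only, the sketch below ranks unlabeled sentences by a generic mean least-confidence score and picks them greedily under a token budget. The function names, the use of sentence length as a stand-in for annotation cost, and the toy probabilities are all assumptions made for this example; the paper's own uncertainty formula, cost rule, and strategy-selection criteria are not given in the abstract and are not reproduced here.

```python
from typing import List, Sequence

# One sentence = a list of per-token label probability distributions,
# e.g. the softmax output of an NER tagger (hypothetical input format).
Sentence = Sequence[Sequence[float]]


def sentence_uncertainty(token_probs: Sentence) -> float:
    """Mean least-confidence: average of (1 - max label probability) per token."""
    if not token_probs:
        return 0.0
    return sum(1.0 - max(p) for p in token_probs) / len(token_probs)


def select_for_annotation(pool: List[Sentence], token_budget: int) -> List[int]:
    """Return pool indices chosen greedily by descending uncertainty.

    Annotation cost is approximated by sentence length in tokens
    (an assumption for this sketch). Sentences that would exceed the
    remaining budget are skipped.
    """
    ranked = sorted(
        range(len(pool)),
        key=lambda i: sentence_uncertainty(pool[i]),
        reverse=True,
    )
    chosen: List[int] = []
    spent = 0
    for i in ranked:
        cost = len(pool[i])
        if cost == 0 or spent + cost > token_budget:
            continue
        chosen.append(i)
        spent += cost
    return chosen


# Toy pool with two labels per token.
pool = [
    [[0.9, 0.1], [0.8, 0.2]],                # confident, 2 tokens
    [[0.5, 0.5], [0.6, 0.4], [0.55, 0.45]],  # uncertain, 3 tokens
    [[0.7, 0.3]],                            # moderately uncertain, 1 token
]
selected = select_for_annotation(pool, token_budget=4)
print(selected)  # [1, 2]
```

In an active learning loop, the selected sentences would be sent to annotators, added to the labeled set, and the model retrained (incrementally, in the setting the abstract describes) before the next round of scoring.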

© 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)