About the med-review sagteam corpus

Article 1

Sboev A.G., Rylkov G.V., Rybka R.B., Gryaznov A.V., Sboeva S.G. Data-Driven Model for Identifying Related Pharmaceutically-Significant Entities in Clinical Texts

Abstract: To date, a large amount of useful medical data on the undesirable effects of pharmaceuticals has accumulated in electronic health records and Internet users' feedback. Analyzing such data to find correlations among medicines, their administration, the adverse effects they cause, and other entities of pharmaceutical significance is undoubtedly relevant, but also laborious and in need of automation. The goal of this work is to create a method for automatically establishing relations among entities. The method is based on a token-encoding component built on the encoder architecture of the Transformer (Bio+Discharge Summary BERT), a Bidirectional Long Short-Term Memory (BiLSTM) layer, and an attention mechanism.

Status: accepted for publication in AIP Conference Proceedings


Article 2

Sboev A.G., Sboeva S.G., Gryaznov A.V., Evteeva A.V., Rybka R.B., Silin M.S. A neural network algorithm for extracting pharmacological information from Russian-language Internet reviews on drugs

Abstract: The paper presents a neural network algorithm for analyzing online user reviews of drugs. The algorithm was validated on specially prepared and annotated corpora. It is built on a neural network model that combines convolutional and recurrent layers, context-dependent vector representations of words, conditional random fields, and additional word features obtained from different dictionaries. The proposed model showed accuracy comparable to state-of-the-art results for this task on corpora in other languages.

Publication details: Journal of Physics: Conference Series, Volume 1686, No. 012037, DOI 10.1088/1742-6596/1686/1/012037, URL: https://iopscience.iop.org/article/10.1088/1742-6596/1686/1/012037


Article 3

Sboev A.G., Selivanov A.A., Rylkov G.V. and Rybka R.B. On the accuracy of different neural language model approaches to ADE extraction in natural language corpora

Abstract: The problem of extracting mentions of adverse events and reactions from text is especially relevant nowadays due to the rapid emergence of datasets that include such events and to progress in text analysis tools. This paper compares existing methods for the automated extraction of adverse events from natural language texts. The methods considered are based on neural network language models pre-trained on different sets of unlabeled data. Experiments were performed on the n2c2-2018 and CADEC corpora, using metrics introduced in the CoNLL competition. Models of this type solve the task efficiently, provided a sufficient number of labeled training samples.

Status: accepted for publication in the journal Procedia Computer Science


Article 4

AG Sboev, SG Sboeva, IA Moloshnikov, AV Gryaznov, RB Rybka, AV Naumov, AA Selivanov, GV Rylkov and VA Ilyin. An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets

Abstract: We present a full-size Russian compound NER-labeled corpus of Internet user reviews, along with an evaluation of the accuracy levels reached on this corpus by a set of advanced deep learning neural networks used to extract pharmacologically meaningful entities from Russian texts. The corpus annotation includes mentions of the following entities: Medication (33005 mentions), Adverse Drug Reaction (1778), Disease (17403), and Note (4490). Two of them, Medication and Disease, comprise a set of attributes. To select the most effective neural models for further adaptation to Russian-language texts, a numerical analysis was performed on the CADEC and N2C2 corpora. The selected neural network models were adapted to Russian-language texts to estimate the current accuracy baseline for the problem on Russian texts. A special multilabel model based on a language model and a set of features was developed to suit the labeling of the presented corpus. The influence of different model modifications is analyzed: word vector representations, different types of pre-training of Russian language models, text normalization styles, and other preprocessing steps. The sufficient size of our corpus makes it possible to study the effects of corpus labeling particularities and of balancing entities in the corpus. As a result, the state of the art for the pharmacological entity extraction problem for Russian is established on a full-size labeled corpus, reaching 63.1% on the F1-exact metric for ADR recognition, which is comparable to the accuracy level for this task in other languages.

Status: planned for publication in the journal Artificial Intelligence in Medicine
