Corpus annotation

The goal of annotation was to obtain a corpus of reviews in which named entities reflecting pharmacotherapeutic treatment are labelled and medication characteristics are annotated semantically. With this in mind, the objects of annotation were attributes of drugs, diseases (including their symptoms), and undesirable reactions to those drugs. The annotators were to label mentions of these three entities with the attributes defined below.


The Medication entity includes everything related to mentions of drugs and drug manufacturers. When selecting a mention of this entity, an annotator had to specify one of the attributes listed below:

Drugname: Marks a mention of a drug. For example, in the sentence “Препарат Aventis “Трентал” для улучшения мозгового кровообращения” (The Aventis “Trental” drug to improve cerebral circulation), the word “Трентал” (Trental), without quotation marks, is marked as Drugname.
DrugBrand: A drug name is also marked as DrugBrand if it is a registered trademark. For example, in the sentence “Противовирусный и иммунотропный препарат Экофарм “Протефлазид”” (The Ecopharm “Proteflazid” antiviral and immunotropic drug), the word “Протефлазид” (Proteflazid) is marked as DrugBrand.
Drugform: Dosage form of the drug (ointment, tablets, drops, etc.). For example, in the sentence “Эти таблетки не плохие, если начать принимать с первых признаков застуды” (These pills are not bad if you start taking them at the first signs of a cold), the word “таблетки” (pills) is marked as Drugform.
Drugclass: Type of drug (sedative, antiviral agent, sleeping pill, etc.). For example, in the sentence “Противовирусный и иммунотропный препарат Экофарм “Протефлазид”” (The Ecopharm “Proteflazid” antiviral and immunotropic drug), two mentions are marked as Drugclass: “Противовирусный” (Antiviral) and “иммунотропный” (immunotropic).
MedMaker: The drug manufacturer. This attribute has two values: Domestic and Foreign. For example, in the sentence “Седативный препарат Материа медика “Тенотен”” (The Materia Medica “Tenoten” sedative), the word combination “Материа медика” (Materia Medica) is marked as MedMaker/Domestic.
MedFrom: An attribute of a Medication entity that takes one of two values, Domestic or Foreign, characterizing the manufacturer of the drug. For example, in the sentence “Седативные таблетки Фармстандарт “Афобазол”” (The Pharmstandard “Afobazol” sedative pills), the drug name “Афобазол” (Afobazol) has its MedFrom attribute equal to Domestic.
Frequency: The frequency of drug use. For example, in the sentence “Неудобство было в том, что его приходилось наносить 2 раза в день” (The inconvenience was that it had to be applied two times a day), the phrase “2 раза в день” (two times a day) is marked as Frequency.
Dosage: The drug dosage (including units of measurement, if specified). For example, in the sentence “Ректальные суппозитории “Виферон” 15000 МЕ – эффекта ноль” (Rectal suppositories “Viferon” 15000 IU have zero effect), the mention “15000 МЕ” (15000 IU) is marked as Dosage.
Duration: The duration of use. For example, in the sentence “Время использования: 6 лет” (Time of use: 6 years), “6 лет” (6 years) is marked as Duration.
Route: Application method (how to use the drug). For example, in the sentence “удобно то, что можно готовить раствор небольшими порциями” (it is convenient that one can prepare the solution in small portions), the mention “можно готовить раствор небольшими порциями” (one can prepare the solution in small portions) is marked as Route.
SourceInfodrug: The source of information about the drug. For example, in the sentence “Этот спрей мне посоветовали в аптеке в его состав входят такие составляющие вещества как мята” (This spray was recommended to me at a pharmacy, it includes such ingredients as mint), the word combination “посоветовали в аптеке” (recommended to me at a pharmacy) is marked as SourceInfodrug.


This entity is associated with diseases or symptoms. It indicates the reason for taking a medicine, the name of the disease, and any improvement or worsening of the patient's condition after taking the drug. The attributes of this entity are specified below:

Diseasename: The name of a disease. If the author of a review names the disease for which they take a medicine, the name is annotated as a mention of the attribute Diseasename. For example, in the sentence “у меня вчера была диарея” (I had diarrhea yesterday), the word “диарея” (diarrhea) is marked as Diseasename. If there are two or more mentions of diseases in one sentence, they are annotated separately. In the sentence “Обычно весной у меня сезон аллергии на пыльцу и депрессия” (In spring I usually have a season of pollen allergy, and depression), both “аллергия” (allergy) and “депрессия” (depression) are independently marked as Diseasename.
Indication: Indications for use (symptoms). In the sentence “У меня постоянный стресс на работе” (I am under constant stress at work), the word “стресс” (stress) is annotated as Indication. Likewise, in the sentence “Я принимаю витамин С для профилактики гриппа и простуды” (I take vitamin C to prevent flu and cold), the phrase “для профилактики” (to prevent) is annotated as Indication. As another example, in the sentence “У меня температура 39.5” (I have a temperature of 39.5), the words “температура 39.5” (temperature of 39.5) are marked as Indication.
BNE-Pos: Positive dynamics after or during taking the drug. In the sentence “препарат Тонзилгон Н действительно помогает при ангине” (the Tonsilgon N drug really helps with a sore throat), the word “помогает” (helps) is marked as BNE-Pos.
ADE-Neg: Negative dynamics after the start of, or after some period of, using the drug. For example, in the sentence “Я очень нервничаю, купила пачку “персен”, в капсулах, он не помог, а по моему наоборот всё усугубил, начала сильнее плакать и расстраиваться” (I am very nervous, I bought a pack of “persen”, in capsules, it did not help, but in my opinion, on the contrary, it aggravated everything, I started crying and getting upset more), the words “по моему наоборот всё усугубил, начала сильнее плакать и расстраиваться” (in my opinion, on the contrary, it aggravated everything, I started crying and getting upset more) are marked as ADE-Neg.
NegatedADE: Indicates that the drug does not work after taking the course. For example, in the sentence “…боль в горле притупляют, но не лечат, временный эффект, хотя цена великовата для 18-ти таблеток” (…they dull the sore throat, but do not cure it, a temporary effect, although the price is rather high for 18 pills), the words “не лечат, временный эффект” (do not cure it, a temporary effect) are marked as NegatedADE.
Worse: Deterioration after taking a course of the drug. For example, in the sentence “Распыляла его в нос течении четырех дней, результата на меня не какого не оказал, слизистая еще больше раздражалось” (I sprayed it into my nose for four days, it did not have any effect on me, the mucosa got even more irritated), the words “слизистая еще больше раздражалось” (the mucosa got even more irritated) are marked as Worse.


The ADR entity marks adverse drug reactions mentioned in the text. For example, one post said: “После недели приема Кортексина у ребенка начались судороги” (After a week of taking Cortexin, the child began having cramps). In this sentence, the word “судороги” (cramps) is labeled as an ADR entity.


The Note entity is used when the author makes recommendations, gives tips, and so on, but does not explicitly state whether the drug helps or not. This includes phrases like “I do not advise”. For instance, the phrase “Нет поддержки для иммунной системы” (No support for the immune system) is annotated as a Note.
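For readers who want to check annotations programmatically, the scheme described above can be written down as a small data structure. The following is only an illustrative sketch, not the tooling used for the corpus: the attribute names are copied from the descriptions above, while the entity key "Disease", the `Mention` record, and the `is_valid` helper are hypothetical names introduced here. It also assumes that MedMaker and MedFrom are the only attributes that carry a value.

```python
from dataclasses import dataclass
from typing import Optional

# Attribute names are copied from the descriptions above. The key "Disease"
# is an assumed label for the disease-related entity.
SCHEME = {
    "Medication": {
        "Drugname", "DrugBrand", "Drugform", "Drugclass", "MedMaker",
        "MedFrom", "Frequency", "Dosage", "Duration", "Route",
        "SourceInfodrug",
    },
    "Disease": {
        "Diseasename", "Indication", "BNE-Pos", "ADE-Neg", "NegatedADE",
        "Worse",
    },
    "ADR": set(),   # described without attributes
    "Note": set(),  # described without attributes
}

# Assumption: only MedMaker and MedFrom take a value (Domestic or Foreign).
VALUED_ATTRIBUTES = {"MedMaker", "MedFrom"}
ORIGIN_VALUES = {"Domestic", "Foreign"}


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one labelled span of text."""
    text: str
    entity: str
    attribute: Optional[str] = None
    value: Optional[str] = None


def is_valid(m: Mention) -> bool:
    """Check a mention against the scheme sketched above."""
    if m.entity not in SCHEME:
        return False
    allowed = SCHEME[m.entity]
    if not allowed:
        # ADR and Note mentions carry neither an attribute nor a value.
        return m.attribute is None and m.value is None
    if m.attribute not in allowed:
        return False
    if m.attribute in VALUED_ATTRIBUTES:
        return m.value in ORIGIN_VALUES
    return m.value is None
```

Under these assumptions, `Mention("Материа медика", "Medication", "MedMaker", "Domestic")` passes the check, while a Drugform attribute attached to the disease-related entity is rejected.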

Fig. 1. Annotation example


The typical situations that had to be handled during the annotation are the following:

Moreover, authors often gave subjective opinions instead of explicitly reporting outcomes. We labeled these as mentions of the entity “Note”. Examples include “strange meds”, “not impressed”, “it is not clear whether it worked or not”, and “ambiguous effect” (example (d) in Fig. 1).


To resolve possible ambiguity in terms, we normalized the labeled mentions by matching them to information from external official classifiers and registers. The external sources for Russian are described below:

Among the international systems for standardizing concepts, the largest and most complete metathesaurus is UMLS, which combines most of the databases of medical concepts and observations, including MESH (and MESHRUS), ATC, ICD-10, SNOMED CT, LOINC and others. Every unique concept in the UMLS has an identification code, the Concept Unique Identifier (CUI), which can be used to retrieve information about the concept from all the databases. However, within UMLS only the MESHRUS database contains Russian-language terms and can be used to associate words from our texts with CUI codes.

Normalization based on categories from the ATC and ICD-10 classifiers

Normalization was carried out manually by the annotators, using a procedure with the following steps: automatic grouping of mentions (standardization), manual verification of the mention groups, and matching of the mention groups to terms from the ATC and ICD-10. Automatic grouping is based on the similarity between two mentions computed by the Ratcliff/Obershelp algorithm, which searches the two strings for matching substrings. During the analysis, every new mention is added to one of the existing groups in the set G if the mean similarity between the mention and all the items of that group exceeds 0.8 (a value determined empirically); otherwise a new group is created. G is empty at the start, so the first mention creates a new group of size 1. Each group is named after its most frequent mention. Next, the annotators manually check and refine the resulting set, merging groups into larger ones or renaming them.
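The automatic grouping step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build the corpus. It relies on `difflib.SequenceMatcher` from the standard library, whose matching algorithm is a close relative of Ratcliff/Obershelp rather than an exact reimplementation. The function names are hypothetical, and because the text does not say what happens when several groups exceed the threshold, the sketch assumes the group with the highest mean similarity is chosen.

```python
from collections import Counter
from difflib import SequenceMatcher

THRESHOLD = 0.8  # the empirically chosen value reported in the text


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching substrings of a and b."""
    return SequenceMatcher(None, a, b).ratio()


def group_mentions(mentions, threshold=THRESHOLD):
    """Assign each mention to a group, processing mentions in order."""
    groups = []  # the set G, empty at the start
    for mention in mentions:
        best_group, best_score = None, threshold
        for group in groups:
            # Mean similarity between the mention and all items of the group.
            mean = sum(similarity(mention, m) for m in group) / len(group)
            # Assumption: among groups above the threshold, take the best one.
            if mean > best_score:
                best_group, best_score = group, mean
        if best_group is None:
            groups.append([mention])  # no group is close enough
        else:
            best_group.append(mention)
    return groups


def group_name(group):
    """Name a group after its most frequent mention."""
    return Counter(group).most_common(1)[0][0]
```

For instance, `group_mentions(["aspirin", "aspirin", "aspirine", "ibuprofen"])` yields two groups: the three aspirin spellings end up together, and the group is named "aspirin". The result depends on the order in which mentions arrive, which is one reason the manual verification step that follows is needed.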

After that, the group names for the attributes “Diseasename”, “Drugname” and “Drugclass” are manually matched with ICD-10 and ATC terms to assign term codes from the classifiers. As a result, 141 unique ICD-10 codes were matched to the 1,333 mentions of the attribute “Diseasename”; 171 unique ATC codes were matched to the 2,360 mentions of the attribute “Drugname”; and 26 unique ATC codes were matched to the 1,092 mentions of “Drugclass”. Some drug classes mentioned in the corpus (such as homeopathy) did not have a corresponding ATC code and were aggregated according to their anatomical and therapeutic classification in the SRD.