The corpus and its subcorpora

Detailed information about the resulting corpora is presented in the table below, which covers several subcorpora of the general corpus (Version II from 2021 (missing reference)):

| Entity type (tag) | Version I: annotations | Version I: reviews | Version II: annotations | Version II: reviews | Average entity length (words) | Balanced subcorpus of Version II: annotations | Balanced subcorpus of Version II: reviews | Subcorpus of 500 texts: annotations | Subcorpus of 500 texts: reviews |
|---|---|---|---|---|---|---|---|---|---|
| Medication | 17875 | 1659 | 33005 | 2799 | -- | 13748 | 1250 | 5967 | 500 |
| Drugname | 4745 | 1655 | 8239 | 2793 | 1.2 | 3503 | 1247 | 1489 | 498 |
| Drugform | 3303 | 1266 | 5997 | 2194 | 1 | 2423 | 960 | 1041 | 387 |
| MedMaker | 954 | 816 | 1720 | 1451 | 1.4 | 750 | 629 | 273 | 228 |
| SourceInfodrug | 1267 | 878 | 2579 | 1579 | 1.7 | 1110 | 683 | 460 | 285 |
| Drugclass | 1786 | 1005 | 3113 | 1684 | 1 | 1317 | 747 | 577 | 313 |
| DrugBrand | 2584 | 1038 | 4656 | 1804 | 1.1 | 2021 | 812 | 873 | 335 |
| Route | 1470 | 817 | 3609 | 1733 | 2.2 | 1440 | 739 | 683 | 317 |
| Duration | 895 | 701 | 1515 | 1194 | 2 | 565 | 463 | 256 | 192 |
| Dosage | 506 | 387 | 960 | 706 | 2.5 | 407 | 313 | 202 | 143 |
| Frequency | 365 | 303 | 617 | 517 | 3.9 | 212 | 187 | 113 | 87 |
| Disease | 9222 | 1603 | 17403 | 2713 | -- | 6307 | 1180 | 2819 | 478 |
| Diseasename | 2215 | 917 | 4042 | 1628 | 1.2 | 1462 | 657 | 738 | 296 |
| Indication | 2310 | 955 | 4627 | 1783 | 1.7 | 1518 | 670 | 720 | 297 |
| BNE-Pos | 2967 | 1021 | 5620 | 1764 | 2.7 | 1990 | 676 | 809 | 289 |
| NegatedADE | 1532 | 641 | 2804 | 1104 | 3.2 | 1195 | 496 | 481 | 201 |
| Worse* | 83 | 51 | 224 | 134 | 4.6 | 99 | 61 | 52 | 35 |
| ADE-Neg* | 115 | 68 | 86 | 54 | 4 | 43 | 28 | 19 | 12 |
| ADR | 843 | 339 | 1778 | 625 | 2.4 | 1752 | 610 | 709 | 177 |
| Note | 2319 | 1004 | 4490 | 1861 | -- | 2273 | 905 | 902 | 359 |

Corpus statistics

The corpus contains consumer posts about drugs, which are mentioned 8 236 times and relate to 226 ATC codes. The most popular 20% of the ATC codes (by the number of reviews with corresponding Drugname mentions) comprise 45 codes, whose mentions appear in 2 614 reviews (93% of all reviews). Among them, 20 ATC codes were each reviewed in more than 50 posts (2 511 posts in total).
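The coverage figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this section; taking the 2 799 Version II reviews from the Medication row of the table as "all reviews" is an assumption, and the variable names are illustrative.

```python
# Numbers quoted in the text and in the table.
total_atc_codes = 226      # ATC codes covered by the corpus
reviews_top_codes = 2614   # reviews mentioning one of the most popular codes
total_reviews = 2799       # assumption: Version II reviews, Medication row

# The most popular 20% of the codes, rounded down to a whole number of codes.
n_top_codes = total_atc_codes // 5

# Share of all reviews that mention at least one of those codes.
coverage_percent = 100 * reviews_top_codes / total_reviews

print(n_top_codes, round(coverage_percent))
```

Under these assumptions the result agrees with the 45 codes and 93% reported in the text.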

Reviews about domestic and foreign drugs account for 44.9% and 39.7% of all reviews, respectively. The remaining documents (15.4%) contain mentions of multiple drugs, both domestic and foreign, or mentions of drugs whose origin the annotators could not determine. The domestic drugs include “Anaferon” (144 reviews), “Viferon” (140), “Ingavirin” (99) and “Glycine” (98). Examples of foreign drugs mentioned are “Aflubin” (93), “Amizon” (55), “Antigrippin” (51) and “Immunal” (42).

Regarding diseases, the most frequent ICD-10 top-level categories are “X - Diseases of the respiratory system” (1122 reviews); “I - Certain infectious and parasitic diseases” (300 reviews); “V - Mental and behavioural disorders” (170 reviews); and “XIX - Injury, poisoning and certain other consequences of external causes” (82 reviews). The top 5 low-level ICD-10 codes by number of reviews are presented in Fig. 1.

Figure 1. Top 5 low-level disease categories from the ICD-10.

An analysis of consumers’ motivation to acquire and use drugs (the “sourceInfoDrug” attribute) showed that review authors mainly mention using drugs on professional recommendation: 989 reviews refer to doctors’ prescriptions, 262 to recommendations from pharmaceutical specialists and 252 to doctors’ recommendations. Some reviews report using drugs recommended by relatives (207 reviews), advertisements (97) or the internet (15). The heatmap in Fig. 2 shows the percentages of reviews in which popular drugs co-occurred with different sources (the annotators manually merged the sources into 5 groups).

Figure 2. Heatmap of the percentages of reviews citing each source of information for the 20 most popular drugs.

The heatmap shows that most recommendations come from professionals, for example for Isoprinosine (used on medical prescription in 65.85% of cases), Aflubin (44.09%), Anaferon (47.30%) and others. However, for drugs such as Immunal (11.9%) or Valeriana (9.18%), the rate of use on the advice of patients’ acquaintances is close to or higher than the rate for doctors’ recommendations. Amizon (12.73%) and Kagocel (11.27%) have the highest percentages among all drugs for mass media (advertisement, internet and other) as the source.
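Percentages of this kind can be computed as in the following minimal sketch. It assumes each review is available as a record listing the drugs and the source groups it mentions, and that the denominator is the number of all reviews mentioning the drug; the record layout, the function name `source_percentages` and the toy data are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict

def source_percentages(reviews):
    """Percentage of each drug's reviews that mention each source group."""
    reviews_per_drug = defaultdict(int)
    cooccurrence = defaultdict(int)
    for review in reviews:
        # Use sets so a review is counted once per drug and per source.
        for drug in set(review["drugs"]):
            reviews_per_drug[drug] += 1
            for source in set(review["sources"]):
                cooccurrence[(drug, source)] += 1
    return {
        (drug, source): 100.0 * count / reviews_per_drug[drug]
        for (drug, source), count in cooccurrence.items()
    }

# Hypothetical toy reviews, not corpus data.
toy_reviews = [
    {"drugs": ["DrugA"], "sources": ["doctor"]},
    {"drugs": ["DrugA"], "sources": ["doctor"]},
    {"drugs": ["DrugA"], "sources": ["relatives"]},
    {"drugs": ["DrugA"], "sources": []},
]
percentages = source_percentages(toy_reviews)
print(percentages[("DrugA", "doctor")], percentages[("DrugA", "relatives")])
```

Because one review can mention several sources or none, the percentages for a single drug need not sum to 100.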

The distribution of tonality (positive or negative) across the sources of information is presented in Fig. 3. A source is marked as “positive” if positive dynamics appeared after the use of the drug (i.e. the review includes the “BNE-Pos” attribute). “Negative” tonality is assigned if negative dynamics or a deterioration in health took place, or if the drug had no effect (i.e. “Worse”, “ADE-Neg” or “NegatedADE” mentions appear). The diagram shows that drugs prescribed by a doctor are more often mentioned as having a positive effect, while reviews of drugs used on the basis of an advertisement often report a deterioration in health.
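The labelling rule described above can be sketched as follows. This is an illustrative reading of the rule, not the authors' implementation: the record layout and function names are hypothetical, and reviews containing both positive and negative tags are reported separately as "mixed" so that their treatment stays an explicit choice.

```python
from collections import Counter

# Tag names as spelled in the table above.
POSITIVE_TAGS = {"BNE-Pos"}
NEGATIVE_TAGS = {"Worse", "ADE-Neg", "NegatedADE"}

def review_tonality(tags):
    """Return 'positive', 'negative', 'mixed', or None for a review's tags."""
    found = set(tags)
    has_positive = bool(found & POSITIVE_TAGS)
    has_negative = bool(found & NEGATIVE_TAGS)
    if has_positive and has_negative:
        return "mixed"  # assumption: kept apart rather than counted twice
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return None  # no effect-related tags in this review

def tonality_by_source(reviews):
    """Count reviews per (source, tonality) pair."""
    counts = Counter()
    for review in reviews:
        label = review_tonality(review["tags"])
        if label is None:
            continue
        for source in set(review["sources"]):
            counts[(source, label)] += 1
    return counts

# Hypothetical toy reviews, not corpus data.
toy_reviews = [
    {"sources": ["doctor"], "tags": ["BNE-Pos", "Drugname"]},
    {"sources": ["doctor"], "tags": ["BNE-Pos", "Worse"]},
    {"sources": ["advertisement"], "tags": ["NegatedADE"]},
    {"sources": ["relatives"], "tags": ["Drugname"]},
]
counts = tonality_by_source(toy_reviews)
print(counts[("doctor", "positive")], counts[("advertisement", "negative")])
```

Dividing each count by the number of labelled reviews for the same source gives per-source shares that can be compared across sources.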

Figure 3. Tonality, relative to the source of recommendations.

The diagrams in Fig. 4 show the shares of reviews in which popular drugs (top 20) were mentioned along with labeled effects. The following drugs have the largest shares of reviews with ADR: the immunomodulator “Isoprinosine” (48.8% of reviews mentioning this drug contain ADR mentions), the antiviral “Amixin” (40.0%), the tranquilizer “Aphobazolum” (37.7%), the antiviral “Amizon” (36.4%) and the antiviral “Rimantadine” (36.3%).

Figure 4. Distributions of the effect labels reported by reviewers after using drugs.

Users mention that some drugs cause negative dynamics after starting the drug or after some period of use (ADE-Neg). Examples of such drugs are “Anaferon” (3.5% of reviews mentioning this drug report ADE-Neg effects), “Viferon” (2.1%), “Glycine” (4.1%) and “Ergoferon” (3.6%).

According to the reviews, some drugs cause a deterioration in health after the course has been taken (the “Worse” label): the immunomodulator “Isoprinosine” (12.2%), the antiviral “Ingavirin” (10.1%), “Ergoferon” (9.1%) and others.