Corpora and their subcorpora

Detailed information about the corpora is presented in the table below. It covers the corpus versions together with several subcorpora drawn from the general corpus (Version II from 2021 (missing reference)):

| Entity type (tag) | Version I: annotations | Version I: reviews | Version II: annotations | Version II: reviews | Version II: average entity length (words) | Balanced subcorpus of Version II: annotations | Balanced subcorpus of Version II: reviews | Subcorpus of 500 texts: annotations | Subcorpus of 500 texts: reviews | Version III: annotations | Version III: reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Medication | 17875 | 1659 | 33005 | 2799 | -- | 13748 | 1250 | 5967 | 500 | 48075 | 3821 |
| Drugname | 4745 | 1655 | 8239 | 2793 | 1.2 | 3503 | 1247 | 1489 | 498 | 11812 | 3815 |
| Drugform | 3303 | 1266 | 5997 | 2194 | 1 | 2423 | 960 | 1041 | 387 | 8736 | 3061 |
| MedMaker | 954 | 816 | 1720 | 1451 | 1.4 | 750 | 629 | 273 | 228 | 2514 | 2097 |
| SourceInfodrug | 1267 | 878 | 2579 | 1579 | 1.7 | 1110 | 683 | 460 | 285 | 3762 | 2278 |
| Drugclass | 1786 | 1005 | 3113 | 1684 | 1 | 1317 | 747 | 577 | 313 | 4543 | 2268 |
| DrugBrand | 2584 | 1038 | 4656 | 1804 | 1.1 | 2021 | 812 | 873 | 335 | 6786 | 2540 |
| Route | 1470 | 817 | 3609 | 1733 | 2.2 | 1440 | 739 | 683 | 317 | 5641 | 2547 |
| Duration | 895 | 701 | 1515 | 1194 | 2 | 565 | 463 | 256 | 192 | 1866 | 1470 |
| Dosage | 506 | 387 | 960 | 706 | 2.5 | 407 | 313 | 202 | 143 | 1566 | 1060 |
| Frequency | 365 | 303 | 617 | 517 | 3.9 | 212 | 187 | 113 | 87 | 849 | 724 |
| Disease | 9222 | 1603 | 17403 | 2713 | -- | 6307 | 1180 | 2819 | 478 | 23854 | 3716 |
| Diseasename | 2215 | 917 | 4042 | 1628 | 1.2 | 1462 | 657 | 738 | 296 | 4934 | 2096 |
| Indication | 2310 | 955 | 4627 | 1783 | 1.7 | 1518 | 670 | 720 | 297 | 7456 | 2631 |
| BNE-Pos | 2967 | 1021 | 5620 | 1764 | 2.7 | 1990 | 676 | 809 | 289 | 7475 | 2477 |
| NegatedADE | 1532 | 641 | 2804 | 1104 | 3.2 | 1195 | 496 | 481 | 201 | 3600 | 1523 |
| Worse* | 83 | 51 | 224 | 134 | 4.6 | 99 | 61 | 52 | 35 | 302 | 190 |
| ADE-Neg* | 115 | 68 | 86 | 54 | 4 | 43 | 28 | 19 | 12 | 87 | 55 |
| ADR | 843 | 339 | 1778 | 625 | 2.4 | 1752 | 610 | 709 | 177 | 5050 | 1605 |
| Note | 2319 | 1004 | 4490 | 1861 | -- | 2273 | 905 | 902 | 359 | 6931 | 2798 |

Corpus statistics (Version III)

The corpus contains consumer posts about drugs, which are mentioned 11812 times and relate to 604 ATC codes. The most popular 20% of the ATC codes (ranked by the number of reviews with corresponding Drugname mentions) comprise 120 codes, whose mentions appear in 3295 reviews (86% of all reviews). Among them, 22 ATC codes were each reviewed in more than 50 posts (2351 posts in total).
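The coverage figures above follow from a simple ranking computation. The sketch below is a minimal Python illustration, assuming a hypothetical mapping from review IDs to the set of ATC codes mentioned in each review; the function name, the data layout, and the rounding down of the 20% cut-off are assumptions, not the authors' documented procedure.

```python
from collections import Counter


def top_code_coverage(review_codes, share=0.2):
    """Rank ATC codes by number of reviews and measure coverage of the top share.

    review_codes: dict mapping a review id to the set of ATC codes
    mentioned in that review (hypothetical layout, for illustration only).
    """
    # Count each code once per review in which it is mentioned.
    counts = Counter(code for codes in review_codes.values() for code in set(codes))
    # Assumption: the cut-off is rounded down, e.g. int(604 * 0.2) == 120.
    n_top = int(len(counts) * share)
    top = {code for code, _ in counts.most_common(n_top)}
    # A review is covered if it mentions at least one of the top codes.
    covered = sum(1 for codes in review_codes.values() if top & set(codes))
    return top, covered, covered / len(review_codes)


# Toy data, not taken from the corpus.
toy = {1: {"A"}, 2: {"A", "B"}, 3: {"A"}, 4: {"C"}, 5: {"D", "E"}}
top, covered, ratio = top_code_coverage(toy)
```

With the numbers reported in the text, the same arithmetic gives 120 codes out of 604 and 3295 / 3821 reviews, which is about 86%.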

Reviews about domestic and foreign drugs account for 40.59% and 45.25% of all reviews, respectively. The remaining documents (14.16%) contain mentions of multiple drugs, both domestic and foreign, or mentions of drugs whose origin the annotators could not determine. The domestic drugs include “Anaferon” (145 reviews), “Viferon” (140), “Ingavirin” (102) and “Glycine” (101). Examples of foreign drugs mentioned are “Aflubin” (93), “Amizon” (55), “Antigrippin” (65) and “Immunal” (42).

Regarding diseases, the most frequent ICD-10 top-level categories are “X - Diseases of the respiratory system” (1221 reviews); “I - Certain infectious and parasitic diseases” (356 reviews); “V - Mental and behavioural disorders” (227 reviews); and “XIX - Injury, poisoning and certain other consequences of external causes” (137 reviews). The top 5 low-level ICD-10 codes by number of reviews are presented in Fig. 1.

Figure 1. Top 5 low-level disease categories from the ICD-10.

An analysis of consumers’ motivation to acquire and use drugs (the “SourceInfodrug” attribute) showed that review authors mainly mention using drugs on professional recommendation: 1473 reviews refer to doctor prescriptions, 341 to recommendations from pharmacists, and 334 to doctor recommendations. Some reviews report using drugs recommended by relatives (290 reviews), advertisements (114) or the internet (43). The heatmap in Fig. 2 shows the percentages of reviews in which popular drugs co-occur with different sources (the sources were manually merged into 5 groups by annotators).

Figure 2. Heatmap of the percentages of reviews citing each source of information for the 20 most popular drugs.
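The values behind such a heatmap are per-drug percentages of reviews that mention each source group. The following Python sketch shows one way to compute them, assuming a hypothetical list of (drug, source groups) pairs, one per review. Whether the denominator is all reviews of a drug or only those that name a source is not stated in the text; the sketch assumes all reviews of the drug.

```python
from collections import defaultdict


def source_percentages(reviews):
    """Return {drug: {source_group: percent of that drug's reviews}}.

    reviews: iterable of (drug, sources) pairs, one per review, where
    sources is a collection of source-group names (hypothetical layout).
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for drug, sources in reviews:
        totals[drug] += 1
        # Count each source group at most once per review.
        for source in set(sources):
            hits[drug][source] += 1
    return {
        drug: {s: 100.0 * n / totals[drug] for s, n in hits[drug].items()}
        for drug in totals
    }


# Toy data, not taken from the corpus.
toy_reviews = [
    ("X", {"doctor"}),
    ("X", {"doctor", "relatives"}),
    ("X", set()),
    ("X", {"advertisement"}),
    ("Y", {"doctor"}),
]
percentages = source_percentages(toy_reviews)
```

Each inner dictionary corresponds to one row of the heatmap, with one cell per source group.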

It can be seen that most recommendations come from professionals, for example for Isoprinosine (used on medical prescription in 65.85% of cases), Aflubin (44.09%), Anaferon (42.15%) and others. However, for drugs such as Oxolinum (11.39%) or Aphobazolum (10.00%), the rate of usage on the advice of patients’ acquaintances is close to or higher than that for doctors’ recommendations. Amizon (13.46%) and Kagocel (9.72%) have the highest percentages for mass media (advertisement, internet and other) as the source compared with other drugs.

The distribution of tonality (positive or negative) across the sources of information is presented in Fig. 3. A source is marked as “positive” if positive dynamics appeared after the use of the drug (i.e. the review includes the “BNE-Pos” attribute). It is marked as “negative” if negative dynamics or a deterioration in health took place, or if the drug had no effect (i.e. “Worse”, “ADE-Neg” or “NegatedADE” mentions appear). Reviews with both effects were not taken into account. It follows from the diagram that drugs recommended by doctors or pharmacists are more often mentioned as having a positive effect, while using drugs on the basis of an advertisement often leads to a deterioration in health.

Figure 3. Tonality, relative to the source of recommendations.
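The tonality rule described above can be written as a small labelling function. This is a minimal Python sketch, assuming each review is represented by the set of effect labels it contains; the function name and the use of None for excluded reviews are illustrative choices, not part of the corpus tooling.

```python
POSITIVE_LABELS = {"BNE-Pos"}
NEGATIVE_LABELS = {"Worse", "ADE-Neg", "NegatedADE"}


def review_tonality(labels):
    """Return "positive", "negative", or None for a review's effect labels.

    None marks reviews left out of the distribution: those with both
    positive and negative effects, and (an assumption of this sketch)
    those with no effect labels at all.
    """
    labels = set(labels)
    has_positive = bool(labels & POSITIVE_LABELS)
    has_negative = bool(labels & NEGATIVE_LABELS)
    if has_positive and has_negative:
        return None  # reviews with both effects are not taken into account
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return None
```

Counting the resulting labels per source group then gives a distribution of the kind shown in the figure.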

The diagrams in Fig. 4 show the shares of reviews in which popular drugs (top 20) are mentioned along with labeled effects. The following drugs have the largest shares of reviews with ADR mentions: the immunomodulator “Isoprinosine” (46.34% of reviews of this drug contain mentions of ADR), the antiviral “Amixin” (40.0%), the tranquilizer “Aphobazolum” (40.0%), the antiviral “Amizon” (36.53%), and the antiviral “Rimantadine” (33.96%).

Figure 4. Distributions of effect labels reported by reviewers after using drugs.

Users report that some drugs cause negative dynamics at the start of use or after some period of use (ADE-Neg). Examples of such drugs are “Anaferon” (2.7% of reviews of this drug mention ADE-Neg effects), “Viferon” (2.1%), “Glycine” (4.0%) and “Ergoferon” (3.6%).

According to the reviews, some drugs cause a deterioration in health after the course is completed (the “Worse” label): the immunomodulator “Isoprinosine” (12.2%), the antiviral “Ingavirin” (11.8%), “Ergoferon” (7.3%) and others.