Dataset of the earliest appearances of scholarly bibliographic references on Wikipedia articles


Accuracy assessment

The first-appearance data in this study were constructed using the methods described above. We assessed the precision of the proposed methods by having the first author manually check each diff24 between the candidate first-appearance revision and its previous revision. In particular, we took random samples of 50 records for each research field, 1,100 records in total, from the dataset and judged whether each of them was the first appearance. In these judgments, we confirmed the changes between the two revisions based on bibliographic information, including author names, journal names, years of publication, volume and issue numbers, pages, and URIs of individual scholarly references extracted from Crossref metadata. Figure 3 shows examples of correct and incorrect candidates for the first appearance, compared with their previous revisions.
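The per-field sampling described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the record layout, the `field` key, the function name `sample_per_field`, and the fixed seed are all assumptions, and the toy data simply use 22 fields (the number implied by 1,100 records at 50 per field).

```python
import random
from collections import defaultdict


def sample_per_field(records, per_field=50, seed=0):
    """Draw up to `per_field` records at random from each research field."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    by_field = defaultdict(list)
    for record in records:
        by_field[record["field"]].append(record)
    samples = {}
    for field, group in by_field.items():
        k = min(per_field, len(group))  # a small field contributes all it has
        samples[field] = rng.sample(group, k)
    return samples


# Toy data (hypothetical): 22 fields with 200 candidate records each.
records = [
    {"field": f"field_{i:02d}", "rev_id": i * 1000 + j}
    for i in range(22)
    for j in range(200)
]
samples = sample_per_field(records)
total = sum(len(group) for group in samples.values())
print(len(samples), total)  # 22 1100
```

Sampling the same number of records from every field keeps the per-field precision values comparable even when the fields differ greatly in size.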

Figure 3

Examples of candidate first-appearance revisions of scholarly references and their previous revisions. Below each case number is the first author's judgment on the candidate first-appearance revision. The box colored light blue is the candidate revision for the first appearance of the scholarly reference, and the box colored pink is its previous revision. Text highlighted in yellow shows the difference from the previous revision, and text in red is the part that fulfilled one of the conditions described in step 2-2 of the section on the construction of the first-appearance dataset.

For cases 1 and 2 in Fig. 3, the page and the scholarly article are the same. Case 1 fulfills conditions (2) and (3) of step 2-2 above, and case 2 fulfills condition (1). We judged case 1 to be a correct first appearance because the corresponding scholarly reference does not exist in the previous revision. On the other hand, we judged case 2 to be an incorrect first appearance because only the DOI name had been added to an existing reference in the candidate revision.

For case 3 in Fig. 3, the scholarly reference added in the candidate revision is similar to the target article. We judged this to be an incorrect first appearance because the titles “Atlantic Hurricane Season of 1981” and “Atlantic Hurricane Season of 1974” are different. Likewise, when no corresponding reference exists in the previous revision, we judged the candidate to be a correct first appearance.

Based on the results, we calculated the precision for each research field using the following formula:

$$Precision=\frac{\text{total number of correct first appearances}}{\text{total number of samples}}\times 100$$

For example, when 45 of the 50 samples in a research field were judged to be true first appearances, the precision for that field was 90.0%.
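The calculation above is simple enough to express in a few lines. The following is only a sketch of the formula; the function name `precision` is ours, not the authors'.

```python
def precision(correct, samples):
    """Percentage of sampled candidates judged to be true first appearances."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return 100 * correct / samples


# The worked example from the text: 45 correct out of 50 samples.
print(f"{precision(45, 50):.1f}%")  # 90.0%
```

With 50 samples per field, each misjudged candidate lowers the precision by two percentage points, so a value of 98.0% corresponds to 49 correct candidates out of 50.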

Table 2 lists the precision for each research field. The highest precision was 98.0% (in clinical medicine, environment/ecology, and psychiatry/psychology). On the other hand, the precision in chemistry and physics was relatively low (86.0% and 84.0%, respectively). The reason lies in the citation conventions of these fields: scholarly references that consist of information other than the article title and identifiers (e.g., author name, journal name, volume, issue, or pages) are commonly used. For example, a citation format such as “Macromolecules, 2007, 40 (7), pp 2371–2379” is used in these fields. Such errors are unavoidable with the methodology of this study; additional factors such as journal names and years of publication should be considered in future work to handle these cases.

Table 2 Precision of identifying first appearances in each research field, based on sample data.

Experiment on the number of leading words of the article title

Figure 4 shows the precision for each number of leading words of the article title used in step 2-2 of the methods section. We compared combinations of the full title with its first 1 to 10 words; the best precision, 84.6%, was obtained when the first five words were used. Therefore, we used the first five words of the article title in this study.
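To make the role of the word count concrete, the sketch below matches a title prefix against revision text. It is an illustration under our own assumptions, not the matching conditions of step 2-2: the helper names `title_prefix` and `mentions_title`, the case-insensitive comparison, and the whitespace normalization are all hypothetical.

```python
def title_prefix(title, n_words=5):
    """Return the first `n_words` whitespace-separated words of a title."""
    return " ".join(title.split()[:n_words])


def mentions_title(revision_text, title, n_words=5):
    """Check whether the text contains the first `n_words` words of the title.

    Case and runs of whitespace are ignored (an assumption of this sketch).
    """
    text = " ".join(revision_text.lower().split())
    prefix = title_prefix(title, n_words).lower()
    return prefix in text


target = "Atlantic Hurricane Season of 1981"
revision = "See also: Atlantic Hurricane Season of 1974, Monthly Weather Review."

print(mentions_title(revision, target, n_words=4))  # True: prefix is too short
print(mentions_title(revision, target, n_words=5))  # False: the year differs
```

A short prefix matches similar but different titles, while a long prefix can miss references whose titles are abbreviated or slightly reworded, which is why the number of words has to be chosen empirically.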

Figure 4

Precision for the combination of the full title with the first 1-10 words of the article title.

Comparative analysis

In this section, we compare our dataset with the similar earlier dataset by Halfaker et al.7 (hereinafter, the “mwcites dataset”). The mwcites dataset contains the first appearances of identifiers such as DOI, arXiv, PubMed (PMID and PMCID), and ISBN on 298 language versions of Wikipedia as of March 1, 2018. It contains the page identifiers and page titles of Wikipedia articles, the revision identifiers and timestamps of each edit, and the types and values of the identifiers. Our dataset contains the first appearances of scholarly references on English Wikipedia as of March 1, 2017, and the corresponding DOI names. As shown in Table 1, our dataset covers bibliographic metadata and research fields as well as the page IDs and page titles of Wikipedia articles, revision IDs, editor information, and timestamps of each revision.

To compare the two datasets under the same conditions, we extracted the English Wikipedia records from the mwcites dataset whose identifier type was DOI and whose timestamps were no later than March 1, 2017. In total, 1,020,508 DOI names (721,836 unique) on 229,090 pages were extracted.
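The extraction step above amounts to a filter on identifier type and timestamp. The sketch below assumes a tab-separated layout with the columns `page_id`, `page_title`, `rev_id`, `timestamp`, `type`, and `id`, ISO 8601 timestamps in UTC, and a cutoff at the start of March 1, 2017; these details, the function name, and the toy rows are assumptions for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed cutoff: the start of March 1, 2017 (UTC).
CUTOFF = datetime(2017, 3, 1, tzinfo=timezone.utc)


def filter_doi_records(rows, cutoff=CUTOFF):
    """Keep rows whose identifier type is DOI and whose timestamp <= cutoff."""
    kept = []
    for row in rows:
        if row["type"].lower() != "doi":
            continue
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        if ts.replace(tzinfo=timezone.utc) <= cutoff:
            kept.append(row)
    return kept


# Toy rows (invented values, not taken from the real dataset).
SAMPLE_TSV = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1\tExample A\t100\t2016-05-04T12:00:00Z\tdoi\t10.1000/example.a\n"
    "1\tExample A\t101\t2016-06-01T08:30:00Z\tpmid\t12345678\n"
    "2\tExample B\t200\t2017-06-15T00:00:00Z\tdoi\t10.1000/example.b\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept = filter_doi_records(rows)
print([row["rev_id"] for row in kept])  # ['100']
```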

Table 3 shows the results of the DOI name overlap analysis between the two datasets. Based on the set difference, 159,952 DOI names are included in the mwcites dataset alone. Of these, 137,375 were Crossref DOIs and 20,767 were invalid DOI names. Of the Crossref DOIs, 10,458 satisfy the conditions of step 1-2, namely that they refer to individual scholarly articles with identifiable research fields. On the other hand, 49,235 Crossref DOIs fulfilling these conditions are included in our dataset alone.

Table 3 Results of the analysis of the overlap by DOI names between the two datasets.
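An overlap analysis of this kind can be sketched with set operations. The function names and the toy DOI names below are hypothetical, and lowercasing is our own normalization choice, based on DOI names being case-insensitive; the source does not say how DOI names were normalized.

```python
def normalize_doi(doi):
    """Trim whitespace and lowercase a DOI name so equal names compare equal."""
    return doi.strip().lower()


def overlap_report(dois_a, dois_b):
    """Summarize the overlap between two collections of DOI names."""
    a = {normalize_doi(d) for d in dois_a}
    b = {normalize_doi(d) for d in dois_b}
    common = a & b
    return {
        "common": len(common),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "share_of_a": 100 * len(common) / len(a) if a else 0.0,
        "share_of_b": 100 * len(common) / len(b) if b else 0.0,
    }


# Toy DOI names (placeholders, not real records).
mwcites_dois = ["10.1000/A1", "10.1000/a2", "10.1000/a3", "10.1000/x9"]
our_dois = ["10.1000/a1", "10.1000/a2", "10.1000/a3", "10.1000/b1", "10.1000/b2"]

report = overlap_report(mwcites_dois, our_dois)
print(report["common"], report["only_a"], report["only_b"])  # 3 1 2
print(report["share_of_a"], report["share_of_b"])  # 75.0 60.0
```

The two shares differ because they use different denominators, which is also why the text reports one overlap rate per dataset.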

For the 10,458 Crossref DOIs above, we took a random sample of 50 sets of DOI name, page ID, and revision ID. The first author manually checked the differences between each sampled revision and its previous revision and categorized the cases as follows: (1) in 28 cases, the DOI was written not as a hyperlink but simply as text; (2) in 19 cases, the reference was written not with a DOI link (e.g., https://doi.org/10.1525/jps.2011.XL.2.43) but with a hyperlink to the publisher's content; (3) in 2 cases, the text was commented out and not displayed in the article; and (4) in 1 case, a Wikipedia template was used but the DOI link was not displayed because of a typo.

These results show that most of the Crossref DOIs included only in the mwcites dataset were outside the focus of this study: the 10,458 Crossref DOIs refer to individual scholarly articles with identifiable research fields, but judging from the sample they were not written as DOI links. Conversely, 49,235 Crossref DOIs fulfilling the above conditions were included in our dataset alone. We interpret these discrepancies as a difference in scope. Although the two datasets define their target scope somewhat differently, they share DOI names at high rates: 77.84% of the mwcites dataset and 91.90% of our dataset.

Table 4 shows the results of the overlap analysis by pairs of DOI name and page identifier between the two datasets. Based on the intersection, 814,326 pairs are common, representing 79.90% and 88.29% of the mwcites dataset and our dataset, respectively. Table 5 shows the results of comparing the timestamps of these common pairs. The timestamps in the two datasets were the same in 415,272 cases (51.0%). Among the others, the timestamps in our dataset were earlier than those in the mwcites dataset in 399,008 cases (49.0%), and the reverse was true in 46 cases (0.01%). For the 399,054 cases in which the timestamps in the two datasets differed, we calculated the time offset in days. The mean was 723.2, the median 1.5, and the standard deviation 811.0. Given the precision of the method proposed in this study, these gaps in timestamps indicate that the proposed method improves on previous work in identifying the correct first appearances of scholarly references.

Table 4 Results of the analysis of the overlap by the pairs of DOI names and page identifiers between the two datasets.
Table 5 Results of the timestamp comparison between the two datasets.
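The timestamp comparison summarized in Table 5 can be sketched as follows. The input layout (a mapping from a DOI name and page ID pair to an ISO 8601 timestamp), the function names, and the use of an absolute offset in days are assumptions; the source does not state whether the offset was signed.

```python
from datetime import datetime
from statistics import mean, median


def parse_ts(value):
    """Parse a timestamp such as 2012-05-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def compare_first_appearances(ours, theirs):
    """Classify common (DOI, page ID) pairs by which dataset is earlier."""
    counts = {"same": 0, "ours_earlier": 0, "theirs_earlier": 0}
    offsets_days = []
    for key in sorted(ours.keys() & theirs.keys()):
        t_ours = parse_ts(ours[key])
        t_theirs = parse_ts(theirs[key])
        if t_ours == t_theirs:
            counts["same"] += 1
            continue
        label = "ours_earlier" if t_ours < t_theirs else "theirs_earlier"
        counts[label] += 1
        # Absolute gap between the two timestamps, expressed in days.
        offsets_days.append(abs((t_theirs - t_ours).total_seconds()) / 86400)
    return counts, offsets_days


# Toy data (placeholder DOI names and page IDs).
ours = {
    ("10.1000/a", 1): "2010-01-01T00:00:00Z",
    ("10.1000/b", 2): "2012-05-01T00:00:00Z",
    ("10.1000/c", 3): "2015-07-02T12:00:00Z",
    ("10.1000/d", 4): "2016-01-01T00:00:00Z",
}
theirs = {
    ("10.1000/a", 1): "2010-01-01T00:00:00Z",
    ("10.1000/b", 2): "2012-05-03T00:00:00Z",
    ("10.1000/c", 3): "2015-07-02T00:00:00Z",
    ("10.1000/e", 5): "2016-02-01T00:00:00Z",
}
counts, offsets_days = compare_first_appearances(ours, theirs)
print(counts)  # {'same': 1, 'ours_earlier': 1, 'theirs_earlier': 1}
print(mean(offsets_days), median(offsets_days))  # 1.25 1.25
```

Pairs that appear in only one dataset are ignored here, since the comparison is defined only for the common pairs.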

Finally, we summarize the advantages of each dataset. The mwcites dataset covers many language versions of Wikipedia and several identifiers other than DOI names. It is suitable for those who analyze the different identifiers on Wikipedia at a large scale or compare them between language versions. On the other hand, it is less suited to analyzing who added the original references to a page and when. Our dataset focuses on individual scholarly articles associated with ESI categories that are referenced on English Wikipedia, so it is useful for comparisons across research fields. It is also suitable for analyzing who added the original references to a page and when.

Basic statistics

Table 6 shows the basic statistics of the dataset. In total, we identified the first appearances of 923,894 scholarly references (611,119 unique DOIs) on 180,795 unique pages. These references were added by 74,456 users, 63 bots, and 37,748 IP editors. In terms of research fields, “Clinical Medicine”, “Molecular Biology and Genetics”, and “Multidisciplinary” are the top three by total number of DOIs, each exceeding 100,000.

Table 6 Basic statistics of the dataset.
