ArXiv is best known as a large collection of preprint research articles, with over one million articles currently in its database.
However, ArXiv recently used that database to study plagiarism or, more specifically, text overlap, to see how common overlapping text is and what kind of response it gets.
To do this, it looked at all articles deposited between 1991 and 2012, a total of over 750,000 articles, and used an internal text matching tool to determine which papers had significant overlap with other papers in the archive.
The results are fairly limited. The study looked for matches only within the ArXiv system, didn’t flag papers with common authors, on the assumption that the overlap was self-reuse, and ignored papers that cited the earlier work, even if they failed to indicate they were quoting from it. ArXiv says it realizes these rules are much more lenient than those of any publication, but that they were necessary to reduce false positives.
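To make those exclusion rules concrete, here is a minimal sketch in Python. It is not ArXiv's internal tool, which the study does not describe in this detail. The `Paper` class, the seven-word sequences and the `min_shared` threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; the field names are illustrative only.
    paper_id: str
    authors: frozenset
    text: str
    cited_ids: frozenset = field(default_factory=frozenset)


def shingles(text, n=7):
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_flagged(later, earlier, min_shared=20, n=7):
    """Apply the two exclusions described above, then test for shared text."""
    if later.authors & earlier.authors:
        return False  # common author: assumed to be self-reuse
    if earlier.paper_id in later.cited_ids:
        return False  # earlier work is cited: ignored, even if unquoted
    shared = shingles(later.text, n) & shingles(earlier.text, n)
    # Assumed threshold: the study's real cutoff is not given here.
    return len(shared) >= min_shared
```

Under these assumptions, a paper is flagged only when it shares a long run of text with an earlier paper, has no author in common with it and does not cite it, which is why the rules are more lenient than a journal's would be.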
But despite those obvious limitations, the resulting paper contained some interesting statistics. For instance, 1 in 16 authors have copied significant portions from their previously published work and 1 out of every 1,000 submitting authors copied a paragraph or more of text from other sources without citing. A significantly higher percentage copied smaller amounts.
Since ArXiv also tracks the home country of its submitters, ScienceInsider was able to create a map highlighting which countries had the highest and lowest rates of text overlap.
In the U.S., there was a relatively low percentage of authors with flagged articles, less than 5%. However, India and China doubled that with more than 10% of authors having been flagged at least once. Other countries saw an even higher percentage of authors being flagged, with Bulgaria having some 20% of its authors flagged.
The study ignored countries that had fewer than 100 authors to avoid sampling errors.
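The per-country figures come down to a simple calculation: the share of a country's authors who were flagged at least once, with small countries left out. A rough sketch in Python follows. The function name and the input format are hypothetical, and only the 100-author cutoff comes from the study.

```python
from collections import defaultdict


def flagged_rate_by_country(authors, min_authors=100):
    """Return the share of flagged authors per country.

    `authors` is an iterable of (country, was_flagged) pairs, one per author.
    Countries with fewer than `min_authors` authors are left out.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for country, was_flagged in authors:
        totals[country] += 1
        if was_flagged:
            flagged[country] += 1
    return {
        country: flagged[country] / total
        for country, total in totals.items()
        if total >= min_authors
    }
```

With so few authors, a single flagged paper would swing a small country's percentage dramatically, which is the sampling problem the cutoff avoids.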
But while the study found a great deal of geographic variation in the percentage of authors who were flagged, it did find one consistency: the result of the copying. Regardless of country of origin, works that contained a large amount of copied text were rarely cited, meaning that their impact was lessened, in part because of their copying.
However, it’s important to remember that, while the study provides a great apples-to-apples comparison between articles and nations represented in the ArXiv database, it is less useful at exploring the overall problem of plagiarism in research.
Not only are the standards of the study far less restrictive than would be found at any reputable journal, but the matching was limited to other papers in the same database. Papers that only copied from non-ArXiv sources were given a clean bill of health. As a result, it’s likely that many more papers contain text reuse issues.
Still, the insight of this study is both rare and valuable. Not only does it provide clues as to where plagiarism issues may be more prevalent, but it also offers a reminder that, when it comes to creating a high-impact paper, there are no shortcuts.
The views of this blog represent my own and not the views of iThenticate.