New analyses of the hundreds of thousands of technical manuscripts
submitted to arXiv, the repository of digital preprint articles, are
offering some intriguing insights into the consequences—and geography—of
scientific plagiarism. It appears that copying text from other papers
is more common in some nations than others, but the outcome is generally
the same for authors who copy extensively: Their papers don’t get cited
much.
Since its founding in 1991, arXiv has become the world's largest
venue for sharing findings in physics, math, and other mathematical
fields. It publishes hundreds of papers daily and is fast approaching
its millionth submission. Anyone can send in a paper, and submissions
don’t get full peer review. However, the papers do go through a
quality-control process. The final check is a computer program that
compares the paper's text with the text of every other paper already
published on arXiv. The goal is to flag papers that have a high
likelihood of having plagiarized published work.
"Text overlap" is the technical term, and sometimes it turns out to
be innocent. For example, a review article might quote generously from a
paper the author cites, or the author might recycle and slightly update
sentences from their own previous work. The arXiv plagiarism detector
gives such papers a pass. "It's a fairly sophisticated machine learning
logistic classifier," says arXiv founder Paul Ginsparg, a physicist at
Cornell University. "It has special ways of detecting block quotes,
italicized text, text in quotation marks, as well as statements of
mathematical theorems, to avoid false positives."
Only when there is no obvious reason for an author to have copied
significant chunks of text from already published work—particularly if
that previous work is not cited and has no overlap in authorship—does
the software affix a "flag" to the article, including links to the
papers with which it shares text. That standard "is much more
lenient" than those used by most scientific journals, Ginsparg says.
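The flagging rule described above can be sketched in a few lines. This is only an illustration, not arXiv's classifier: the 7-word window, the `min_shared` threshold, the dictionary fields, and the function names are all assumptions made for the example, and the real system is reported to be a machine learning classifier with extra handling for quotes and theorems.

```python
import re

def shingles(text, n=7):
    """Return the set of word n-grams in text (n=7 is an assumed window size)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_shingles(new_text, old_text, n=7):
    """Count word n-grams that appear in both texts."""
    return len(shingles(new_text, n) & shingles(old_text, n))

def should_flag(new_paper, old_paper, min_shared=20, n=7):
    """Hypothetical rule: flag only large overlap that is uncited and
    comes from a paper with no authors in common."""
    overlap = shared_shingles(new_paper["text"], old_paper["text"], n)
    cites = old_paper["id"] in new_paper["cited_ids"]
    shared_author = bool(set(new_paper["authors"]) & set(old_paper["authors"]))
    return overlap >= min_shared and not cites and not shared_author
```

Under these assumptions, a paper that reuses a long passage is let through if it cites the source or shares an author with it, which mirrors the two innocent cases the article describes.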
To explore some of the consequences of "text reuse," Ginsparg and
Cornell physics Ph.D. student Daniel Citron compared the text from each
of the 757,000 articles submitted to arXiv between 1991 and 2012. The
headline from that study, published Monday in the Proceedings of the
National Academy of Sciences (PNAS), is that the more text a paper
poaches from already published work, the less frequently that paper
tends to be cited. (The full paper is also available for free on arXiv.)
It also found that text reuse is surprisingly common. After filtering
out review articles and legitimate quoting, about one in 16 arXiv
authors were found to have copied long phrases and sentences from their
own previously published work that add up to about the same amount of
text as this entire article. More worryingly, about one out of every
1000 of the submitting authors copied the equivalent of a paragraph's
worth of text from other people's papers without citing them.
So where in the world is all this text reuse happening? Conspicuously missing from the PNAS
paper is a global map of potential plagiarism. Whenever an author
submits a paper to arXiv, the author declares his or her country of
residence. So it should be possible to reveal which countries have the
highest proportion of plagiarists. The reason no map was included,
Ginsparg told ScienceInsider, is that not all the text overlap detected in their study is necessarily plagiarism.
Ginsparg did agree, however, to share arXiv’s flagging data with ScienceInsider.
Since 1 August 2011, when arXiv began systematically flagging for text
overlap, 106,262 authors from 151 nations have submitted a total of
301,759 articles. (Each paper can have additional co-authors beyond the
submitting author.) Overall,
3.2% (9591) of the papers were flagged. It's not just papers submitted
en masse by a few bad apples, either. Those flagged papers came from 6%
(6737) of the submitting authors. Put another way, one out of every 16
researchers who have submitted a paper to arXiv since August 2011 has
been flagged by the plagiarism detector at least once.
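The percentages in this paragraph follow directly from the four counts quoted. A quick check, using only those counts (the variable names are ours):

```python
# Counts quoted in the article for submissions since 1 August 2011
papers_total = 301_759
papers_flagged = 9_591
authors_total = 106_262
authors_flagged = 6_737

paper_rate = papers_flagged / papers_total      # share of papers flagged
author_rate = authors_flagged / authors_total   # share of submitting authors flagged
one_in_n = authors_total / authors_flagged      # "one out of every N" authors

print(f"{paper_rate:.1%} of papers flagged")        # 3.2%
print(f"{author_rate:.1%} of authors flagged")      # 6.3%
print(f"about one in {one_in_n:.0f} authors")       # about one in 16
```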
The map above, prepared by ScienceInsider,
takes a conservative approach. It shows only the incidence of flagged
authors for the 57 nations with at least 100 submitted papers, to
minimize distortion from small sample sizes. (In Ethiopia, for example,
there are only three submitting authors and two of them have been
flagged.)
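The small-sample concern can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method the article attributes to ScienceInsider or arXiv. It compares Ethiopia's two flagged authors out of three with Japan's 269 out of 4759.

```python
from math import sqrt

def wilson_interval(flagged, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96 assumed)."""
    p = flagged / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

for country, (flagged, total) in {"Ethiopia": (2, 3), "Japan": (269, 4759)}.items():
    low, high = wilson_interval(flagged, total)
    print(f"{country}: {flagged / total:.0%} flagged, plausible range {low:.0%} to {high:.0%}")
```

With only three authors, the plausible range for Ethiopia spans most of the scale, while Japan's is narrow. That is the distortion a minimum-sample threshold is meant to avoid.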
Researchers from the countries that submit the lion's share of arXiv
papers (the United States, Canada, and a small number of industrialized
countries in Europe and Asia) tend to be flagged less often than
researchers elsewhere. For example, more than 20% (38 of 186) of authors
who submitted papers from Bulgaria were flagged, more than eight times
the proportion from New Zealand (five of 207). In Japan, about 6% (269
of 4759) of submitting authors were flagged, compared with over 15% (164
out of 1054) from Iran.
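The country comparisons can be reproduced from the author counts given in this paragraph. The short script below uses only those figures; the dictionary layout is ours.

```python
# (flagged, total) submitting authors, as quoted in the article
flagged_authors = {
    "Bulgaria": (38, 186),
    "New Zealand": (5, 207),
    "Japan": (269, 4759),
    "Iran": (164, 1054),
}

# Share of submitting authors flagged at least once, per country
rates = {country: f / t for country, (f, t) in flagged_authors.items()}

for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {rate:.1%}")

ratio = rates["Bulgaria"] / rates["New Zealand"]
print(f"Bulgaria vs. New Zealand: {ratio:.1f}x")
```

This gives roughly 20.4% for Bulgaria, 15.6% for Iran, 5.7% for Japan, and 2.4% for New Zealand, with Bulgaria's rate about 8.5 times New Zealand's.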
Such disparities may be due in part to different academic cultures, Ginsparg and Citron say in their PNAS
study. They chalk up scientific plagiarism to "differences in academic
infrastructure and mentoring, or incentives that emphasize quantity of
publication over quality."
*Correction, 11 December, 4:57 p.m.: The map has been corrected to reflect current national boundaries.