CrossCheck Plagiarism Screening: Understanding the Similarity Score

Posted by Guest Blogger on Aug 11, 2011 10:17:00 PM

Written by Kirsty Meddings, Product Manager at CrossRef

CrossCheck, the plagiarism screening initiative from CrossRef and iParadigms, has recently welcomed its 240th publisher and is becoming an established part of the editorial process for many journals. CrossCheck members use the iThenticate plagiarism detection system to screen submitted papers for originality and can quickly tell whether a paper contains passages of text that also appear in other publications or resources.

When a manuscript is first uploaded to iThenticate, a Similarity Score is returned indicating the percentage of text in the uploaded document that matches text in other published documents or web pages.

The similarity score is the first thing you see when a document is processed and, because it’s easy to focus on this number as signifying a problem, a common question new users of the system ask is ‘what level of similarity score indicates a problem?’

The answer is that there is no such thing as a ‘magic number’ that will tell you whether a document contains problematic content. The similarity score gives you a rough ‘headline’ that ensures heavily duplicated papers are brought straight to your attention and allows you to quickly disregard papers with hardly any matches. Beyond that, the score itself doesn’t give you definitive answers and cannot tell you whether you have a case of plagiarism.

Why is this?

Well, there are a number of factors that need to be taken into account when assessing a paper’s overall similarity score.

Firstly, it’s important to note the similarity score is telling you the total amount of matching text. This is probably going to be made up of a number of smaller matches. It is possible a 30% score will turn out to be a 30% match to one source, but it’s much more likely that when you look at the reports you’ll find the 30% is made up of a number of smaller matches, the largest of which might be just 4 or 5%.

Of course, a paper with six separate matches of 5% could well be as problematic as one that has copied 30% of its content from a single source, but it’s impossible to tell whether this is the case without looking at the reports.
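To make the comparison concrete, here is a minimal Python sketch. The input format (a mapping from a source label to the percentage of the manuscript attributed to it) and the function name are assumptions made for illustration; they are not part of iThenticate.

```python
def summarize_matches(matches):
    """Summarize per-source match percentages for one manuscript.

    `matches` maps a source label to the percentage of the manuscript
    attributed to that source (a hypothetical input format).
    """
    largest_source = max(matches, key=matches.get)
    return {
        "total": sum(matches.values()),      # the headline score
        "sources": len(matches),             # how many sources contribute
        "largest_source": largest_source,
        "largest": matches[largest_source],  # size of the biggest single match
    }


# One 30% match to a single source
single = summarize_matches({"source 1": 30})

# Six separate matches of 5% each
spread = summarize_matches({f"source {i}": 5 for i in range(1, 7)})

print(single)
print(spread)
```

Both manuscripts have the same headline total of 30, but the breakdown differs: one source at 30% in the first case, six sources at 5% each in the second. Only the breakdown, and then the reports themselves, shows which situation you are dealing with.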

Secondly, where the match appears can sometimes be more important than how big the match is. For example, editors in certain subject areas may be less concerned about sizable matches in methods sections, where there are only so many ways to describe a certain process. A match in the discussion or conclusions with no appropriate citation, on the other hand, could set alarm bells ringing even though it only accounts for a small percentage of the manuscript.

Similarly, acceptable thresholds for one type of article may not be appropriate for another: Review articles could be expected to have a higher overall similarity score than original research articles.

It is also important to bear in mind there could be simple errors in the unedited manuscript that mean matches are picked up incorrectly. The exclude bibliography feature of iThenticate relies on the reference section having a title on its own line within the document. If this is omitted from the manuscript, the references will not be excluded.

Similarly, the exclude quotes feature looks for quotation marks. If the author has not used quotation marks, or has missed one at the start or end of the passage, the system will not recognize it as a quote, even though it might be apparent to the editor due to its layout and reference.
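This article does not describe how iThenticate implements these two features, so the Python sketch below is only an illustration of why detection of this kind can fail. The list of heading names, the function names, and the use of straight double quotes only are all assumptions made for the example.

```python
import re

# Hypothetical heading names; the headings iThenticate actually
# recognizes are not listed in this article.
REFERENCE_HEADINGS = {"references", "bibliography"}


def find_reference_heading(lines):
    """Return the index of the first line that is only a reference heading.

    Returns None when no line consists solely of such a heading.
    """
    for index, line in enumerate(lines):
        if line.strip().lower() in REFERENCE_HEADINGS:
            return index
    return None


def quoted_passages(text):
    """Return passages enclosed in a complete pair of straight double quotes."""
    return re.findall(r'"([^"]*)"', text)


# Heading on its own line: the reference section can be located.
with_heading = ["Conclusions ...", "References", "1. Smith J. (2009) ..."]

# Heading run into the first entry: nothing to anchor on.
without_heading = ["Conclusions ...", "References 1. Smith J. (2009) ..."]

print(find_reference_heading(with_heading))
print(find_reference_heading(without_heading))
print(quoted_passages('He wrote "only so many ways" to describe it.'))
print(quoted_passages('He wrote "only so many ways to describe it.'))
```

In this simplified model, a heading that shares a line with the first reference is not found, and a passage with a missing closing quotation mark is not treated as a quote, which mirrors the two manuscript errors described above.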

For all of these reasons it’s important to look at the reports rather than rely on the similarity score alone.

Using the Content Tracking Report

The default report in iThenticate is the Similarity Report. This shows you content matches from highest to lowest. It highlights all areas of the uploaded manuscript that match one or more sources in iThenticate’s comprehensive databases and gives you a very good indication of whether the paper contains significant sections of duplicated text.

A quick glance at the Similarity Report will often be all that is needed to confirm a manuscript only contains small matches comprised of frequently used terms or phrases, or at worst, poorly cited content that can be corrected. If, however, the Similarity Report identifies one or more matches that are quite large, or lots of smaller matches even with the bibliography excluded, the Content Tracking report should be your next port of call.

Content Tracking compares the uploaded manuscript to one source at a time. The Similarity Report combines the top matches from multiple sources into a summary, and in doing so can only attribute each match to one source when it may in fact appear in several.

This is best explained using an example. Say a document has an overall similarity score of 25%, which the Similarity Report shows as one match of 20% to source A and a second match of 5% to source B. Switching to Content Tracking reveals that the match to source B is in fact 15% of the manuscript, but 10% is a passage of text located within the match to source A and is therefore masked by the larger match. This 10% cannot show as matching both source documents in the Similarity Report because it can only be highlighted once, but in Content Tracking, where you can toggle between individual sources with a radio button, it will be attributed to each source separately.
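The arithmetic in this example can be reproduced with a small Python sketch. It models the manuscript as 100 equal units of text, each worth 1%, and each source as the set of units it matches. The function names and the rule that the larger match claims shared text first are assumptions chosen to reproduce the numbers above, not a description of iThenticate's internals.

```python
def content_tracking(matches):
    """Full overlap with each source, counted independently."""
    return {source: len(units) for source, units in matches.items()}


def similarity_report(matches):
    """Attribute each unit of text to one source only, largest match first.

    Returns the per-source percentages and the overall score.
    """
    claimed = set()
    report = {}
    by_size = sorted(matches.items(), key=lambda item: len(item[1]), reverse=True)
    for source, units in by_size:
        unclaimed = units - claimed       # text not already highlighted
        report[source] = len(unclaimed)
        claimed |= unclaimed
    return report, len(claimed)


# Manuscript of 100 units; each unit is 1% of the text.
matches = {
    "A": set(range(0, 20)),    # units 0-19: a 20% match
    "B": set(range(10, 25)),   # units 10-24: 15%, of which 10-19 overlap A
}

report, overall = similarity_report(matches)
print(report, overall)
print(content_tracking(matches))
```

With these inputs the Similarity Report view gives 20 for A and 5 for B with an overall score of 25, while the Content Tracking view gives 20 for A and 15 for B, because the shared 10 units are counted against each source separately.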

An example of where this can be particularly useful is when there is a combination of duplicate or redundant publication and possible plagiarism: Large matches to the author’s previous work could hide smaller passages copied from other articles. Content Tracking will lay out the full extent of overlap with each source with no masking.

Side-by-Side Comparison

Finally, don’t forget that for any match you can view the full text of the source article or web page alongside the uploaded manuscript by clicking on the highlighted passage in the left-hand screen of either the Similarity Report or Content Tracking. Clicking on the link in the right-hand pane will take you to the article or web page in its original location, which can be useful for identifying website matches or checking unfamiliar sources, but only the side-by-side view will show you the matching passages next to each other.

Monthly online demos of iThenticate are available at no cost for any staff from CrossCheck member publishers. Sign up at www.crossref.org/crosscheck.html.

Reference

First published in the International Society of Managing and Technical Editors’ (ISMTE) newsletter, Editorial Office News (EON), February 2011 issue. Available at www.ismte.org

