Hubbub Explainer: Suspect Plagiarism? Turnitin.

Published May 18, 2010

In the Web age, it would seem difficult to plagiarize and get very far. Can’t an admissions officer just, you know, Google it?

Of course, there’s more to the Web than what Google knows about. That’s where Turnitin, a service that helps teachers root out plagiarism, comes in. It seems the service could have saved a lot of trouble in the case of Adam Wheeler, the ex-Harvard student accused of an elaborate and calculated lie. The district attorney’s press release alone reads like a scene from “Catch Me If You Can.” (Wheeler pleaded not guilty in superior court Tuesday.)

The company was started about a decade ago by grad students, including an MIT math guy, who couldn’t believe the amount of plagiarism they found while grading papers. Now there’s a spinoff of the service for university admissions — apparently a highly requested feature.

The Turnitin people have a huge database of existing work. I’m talking huge. Imagine the entire Internet. Petabytes of data. Millions and millions of gigabytes. Turnitin has a copy of all of that, plus newspaper archives and paid academic journals, plus all of the documents ever submitted to the service previously: something like 120 million homework assignments and term papers.

When a student’s work is uploaded to the service, Turnitin’s ever-evolving algorithm flags any derivative patterns and alerts the client (the university). What fascinates me is how the company keeps up with new forms of plagiarism.

I talked to Jeff Lorton in Oakland, Calif., who runs Turnitin for Admissions.

“There are literally thousands and thousands of companies and websites that you can commission just about anything you want, including everything through your Ph.D. thesis, if you could convince your review board that you wrote it,” Lorton tells me.

I can drop $200 for a professional to ghostwrite my college application essay. The No. 1 site for fake admissions essays, at least in one batch of Turnitin’s data (PDF), is free: PersonalStatement.info, which provides “ideas and inspiration to help you craft the perfect personal statement.”

Lorton’s BS detector goes through the roof. He says “ideas and inspiration” means “copy and paste.”

A few years ago, in the 2006-07 admissions cycle, Turnitin did a beta test of the new admissions service with a batch of about 450,000 college applications. The software found “matches” in about 200,000 of those. Mind you, a match doesn’t automatically mean plagiarism. Of those matches, 36 percent were “significant” matches — very likely plagiarism. At that point, it’s up to the client (the university) to decide how to interpret the results and whether or not to deem something stolen.
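To put those beta-test figures in absolute terms, here is a quick back-of-the-envelope calculation. It is only a sketch: the inputs are the rounded numbers quoted above ("about 450,000" and "about 200,000"), so the results are approximate, and the variable names are mine.

```python
# Rounded figures from the 2006-07 beta test, as quoted in the article.
applications = 450_000       # applications in the batch (approximate)
matches = 200_000            # applications with any "match" (approximate)
significant_percent = 36     # share of matches deemed "significant"

# Integer arithmetic avoids floating-point noise in the headline number.
significant = matches * significant_percent // 100

match_rate = 100 * matches / applications            # percent of all applications with a match
significant_rate = 100 * significant / applications  # percent of all applications flagged as significant

print(f"Significant matches: about {significant:,}")
print(f"Applications with any match: about {match_rate:.1f}%")
print(f"Applications with a significant match: about {significant_rate:.1f}%")
```

In other words, if the rounded figures hold, roughly 72,000 essays, or about one application in six, drew a significant match.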

So what about false positives? In the comments on my story Tuesday for Radio Boston, people took issue with the idea that it’s virtually impossible for two students to write the same phrase without intending to plagiarize.

I put that question to Lorton. He said, first of all, the system pulls out all the common words — the of’s and the’s and such. That reduces phrases to patterns, which have to be 16 words or more to be considered unique.

“The chances of the same two people writing the exact same 16 words, or pattern, is less than one in 1 trillion,” he says. (That’s 12 zeroes.)
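What Lorton describes can be sketched in a few lines of Python. To be clear, this is not Turnitin's algorithm, which is proprietary. It is a minimal illustration of the general idea, assuming a tiny made-up stop-word list and simple sliding windows of 16 remaining words. The names `STOP_WORDS`, `content_words`, `patterns` and `shared_patterns` are all hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list. The real list is not public.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

# Window size taken from the article: patterns must be 16 words or more.
PATTERN_LENGTH = 16


def content_words(text):
    """Lowercase the text, split it into words and drop the common ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def patterns(text, n=PATTERN_LENGTH):
    """Return every run of n consecutive content words as a set of tuples."""
    words = content_words(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_patterns(submission, source, n=PATTERN_LENGTH):
    """Return the n-word patterns that appear in both texts."""
    return patterns(submission, n) & patterns(source, n)


if __name__ == "__main__":
    source = "The quick brown fox jumps over the lazy dog"
    submission = "A quick brown fox jumps in the park"
    # A 3-word window keeps the demo short; the article's threshold is 16.
    for match in sorted(shared_patterns(submission, source, n=3)):
        print(" ".join(match))
```

With the default 16-word window, two short sentences like these share nothing at all, which is the point: the longer the required pattern, the less likely an accidental match.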

Tyrone, in the comments, brilliantly (and unscientifically) uses Google to illustrate:

Take any phrase that you want to test. Say, the lead sentence of (a previous comment), “I’ve worked with many Indian engineers in various companies in this country.” If you google progressively longer parts of this sentence, the progressive loss of hits tells you with high confidence that it is an originally composed sentence. Try it yourself, and use quotes around the words, as you want to test the exact sentence.

“I’ve” you get 394,000,000 results
“I’ve worked” you get 29,700,000 results
“I’ve worked with” you get 10,300,000 results
“I’ve worked with many” you get 289,000 results
“I’ve worked with many Indian” you get 4,130 results
“I’ve worked with many Indian engineers” you get 6 results
“I’ve worked with many Indian engineers in” you get 2 results
“I’ve worked with many Indian engineers in various” No results found

And that’s just eight words. Turnitin’s algorithm starts at 16.
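Tyrone's experiment is easy to reproduce. The sketch below builds the same progressively longer quoted queries and pairs them with the hit counts he reported. It does not call Google or any search API. The counts are copied from his comment, and the function name `progressive_queries` is my own.

```python
def progressive_queries(sentence):
    """Build exact-phrase queries from progressively longer prefixes of a sentence."""
    words = sentence.split()
    return ['"' + " ".join(words[:i]) + '"' for i in range(1, len(words) + 1)]


# Hit counts as reported in the comment (0 stands for "No results found").
reported_hits = [394_000_000, 29_700_000, 10_300_000, 289_000, 4_130, 6, 2, 0]

queries = progressive_queries("I've worked with many Indian engineers in various")

for query, hits in zip(queries, reported_hits):
    print(f"{hits:>12,}  {query}")

# Each added word can only narrow an exact-phrase search, so counts should never rise.
never_rises = all(a >= b for a, b in zip(reported_hits, reported_hits[1:]))
print("Hit counts never rise:", never_rises)
```

Paste each printed query into a search engine to try it with a sentence of your own. The exact counts will differ from the ones above, but the collapse toward zero should look much the same.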

If an essay is flagged as plagiarism, it very likely is. The takeaway here is that high-school seniors are under enormous pressure to get into college. The Internet is right there, free, filled with inspirational quotes and pithy wisdom to copy and paste. Now teachers and admissions officers are using the same tools to fight back.