Unless you have been living under a rock, you probably know that artificial intelligence (AI) euphoria is sweeping the world, and the enthusiasm is not limited to the business and investing communities. While many people use AI for legitimate purposes, the technology is also being exploited to cheat in school, with many students using AI tools to take shortcuts on their assignments.
To address the worsening problem of “AI cheating,” universities and companies have turned to AI detectors that claim to identify whether someone used AI for their work. Unfortunately, it is becoming clear that these detectors are not as accurate as they may seem: they have a propensity for false positives, and students are paying the price.
AI Detectors Can Incorrectly Flag Original Work As AI-Generated
As an experiment, Businessweek tested GPTZero and Copyleaks, two of the leading AI detectors, on a random sample of 500 college application essays submitted to Texas A&M University in the summer of 2022. Since that cut-off date predates ChatGPT’s release, and the essays were not part of the datasets on which AI tools are trained, it was virtually certain they were not written using AI. The tools nonetheless flagged some of the essays as AI-generated, in some cases with 100% certainty.
It is morally wrong to use AI detectors when they produce false positives that smear students in ways that hurt them and where they can never prove their innocence.
Do not use them.
— Ethan Mollick (@emollick) October 18, 2024
Such wrongful classification of legitimate work as AI-generated can have disastrous consequences for students, college applicants, job seekers, and even graduate students and postdocs. It can also lead to serious charges of plagiarism, which can have far-reaching implications for people in some industries.
Incorrect flagging by these detectors can fundamentally damage the relationship between students and their teachers, as the growing use of AI by students and false accusations by AI detectors both breed constant anxiety and mistrust. There is simply no 100% accurate way to tell whether a text was written by AI, and many teachers still don’t know how to proceed without either falsely accusing students or letting them get away with cheating.
How Do AI Detectors Work?
AI detectors analyze text using natural language processing and machine learning to determine whether the content was written by a human or an AI tool. They rely on measurable characteristics of the text, such as perplexity, embeddings, and burstiness, to reach a verdict. The detectors do not understand language; they simply compare new text against patterns in the historical data they were trained on.
Someone sent me a cold email proposing a novel project. Then I noticed it used the word "delve."
— Paul Graham (@paulg) April 7, 2024
The following are some of the signals these detectors commonly rely on (a code sketch of two of them follows the list).
- Word choice: Certain words, such as “delve,” appear far more often in AI-generated content than in typical human writing, which helps detectors flag AI text.
- Frequency and repetition: AI-generated content usually lacks variability and repeats words and phrases excessively, which helps detectors identify it.
- Burstiness: AI-generated content typically has low burstiness, meaning little variation in sentence structure and length, which helps AI detectors identify it.
- Perplexity: Human-written content tends to have higher perplexity, meaning a language model finds it less predictable, and to use more creative language choices than AI-generated content.
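To make these signals concrete, here is a minimal sketch of how perplexity and burstiness might be computed. It assumes the Hugging Face transformers library with GPT-2 as a stand-in scoring model; commercial detectors such as GPTZero and Copyleaks use proprietary models, more features, and calibrated thresholds, so this illustrates the idea rather than any vendor’s actual method.

```python
# Illustrative sketch of two detector signals: perplexity and burstiness.
# GPT-2 is only a stand-in scoring model; real detectors are proprietary.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity under GPT-2. Lower values mean the
    model finds the text predictable, which detectors read as AI-like."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss; exponentiating it yields perplexity.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths in words. Human prose mixes
    short and long sentences (high burstiness); AI text is often uniform."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    return (sum((n - mean) ** 2 for n in lengths) / len(lengths)) ** 0.5

essay = "The results surprised us. Nobody on the team, not even our most pessimistic reviewer, had predicted them."
print(f"perplexity: {perplexity(essay):.1f}, burstiness: {burstiness(essay):.2f}")
```

A detector would compare scores like these against thresholds learned from labeled training data, which is where the trouble starts: plain, predictable human prose can land on the AI side of the threshold through no fault of the writer.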
Should We Rely Only on AI Detectors?
Businessweek’s analysis also showed that people who speak English as a second language, neurodivergent people, and those who use straightforward vocabulary are at higher risk of having their work incorrectly flagged by AI detectors.
Notably, Stanford research from last year showed that while AI detectors were “near-perfect” at evaluating essays written by US-born eighth-graders, they classified more than half of TOEFL (Test of English as a Foreign Language) essays written by non-native English speakers as AI-generated.
James Zou, a professor of biomedical data science at Stanford University, said, “They (AI detectors) typically score based on a metric known as ‘perplexity,’ which correlates with the sophistication of the writing — something in which non-native speakers are naturally going to trail their U.S.-born counterparts.”
Zou also highlighted the ethical problem of these detectors disproportionately singling out non-native English speakers: “These numbers pose serious questions about the objectivity of AI detectors and raise the potential that foreign-born students and workers might be unfairly accused of or, worse, penalized for cheating.”
Why Do AI Detectors Incorrectly Flag Human Text?
Zou and his coauthors say that AI detectors single out non-native speakers because their writing tends to score lower on the measures that feed into perplexity, such as lexical diversity, lexical richness, and grammatical complexity. Non-native writers often simply know fewer words, so it is natural that their text has lower perplexity.
He adds that it is easy to game AI detectors by asking ChatGPT to use more literary language. This means that while AI detectors flag some original text as AI-generated, they also fail to identify some text that actually was generated with AI tools. According to Zou, “Current detectors are clearly unreliable and easily gamed, which means we should be very cautious about using them as a solution to the AI cheating problem.”
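As a hypothetical illustration of one such measure, the short sketch below computes lexical diversity as a simple type-token ratio (unique words divided by total words). The function and the sample sentences are invented for illustration; real detectors combine many such features with model-based perplexity scores.

```python
# Hypothetical illustration: lexical diversity as a type-token ratio.
# Text drawing on a smaller vocabulary scores lower, and low scores on
# measures like this feed into the low-perplexity judgments described above.
import re

def lexical_diversity(text: str) -> float:
    """Unique words divided by total words (type-token ratio)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

simple = "The test was hard. The test was long. I did not like the test."
varied = "The exam proved grueling, stretching interminably; I loathed every question."
print(f"simple: {lexical_diversity(simple):.2f}")  # lower ratio
print(f"varied: {lexical_diversity(varied):.2f}")  # higher ratio
```

Run on these samples, the plain passage scores noticeably lower than the ornate one, which mirrors why text from writers with smaller English vocabularies gets pushed toward the “AI-generated” verdict.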
AI Detectors Should Not Be the Final Word
Relying only on AI detectors to identify AI-generated content is not a sound approach: these tools are far from perfect, and they produce far too many false positives. According to Copyleaks co-founder and Chief Executive Officer Alon Yamin, “We’re making it very clear to the academic institutions that nothing is 100% and that it should be used to identify trends in students’ work.”
Yamin, who says his company’s technology is 99% accurate, added, “Kind of like a yellow flag for them to look into and use as an opportunity to speak to the students.” However, when a company claims its product is 99% accurate, many teachers will simply assume it is correct and leave it at that.
AI-detection tools should not be the final word on whether a text was written by a human or with AI writing assistants. Just like the AI tools themselves, AI detectors are currently prone to error.
Does AI actually help students learn? A recent experiment in a high school provides a cautionary tale. @jillbarshay @hechingerreport
Kids Who Use ChatGPT as a Study Assistant Do Worse on Tests
— MindShift (@MindShiftKQED) September 4, 2024
Are AI Tools Actually Good for Students?
There is still no good solution to the AI cheating epidemic in schools across the globe, and the false-positive rates of AI detectors are only making things worse. While most experts believe AI will contribute positively to society, others believe it will be a net negative.
For instance, a study by researchers at the University of Pennsylvania’s Wharton School, conducted on nearly 1,000 high school math students in Turkey, showed that using generative AI tools makes it harder for kids to learn and acquire new skills, even as it improves their performance in the short term.
Wharton professor Hamsa Bastani, who co-authored the paper, said, “We’re really worried that if humans don’t learn, if they start using these tools as a crutch and rely on it, then they won’t actually build those fundamental skills to be able to use these tools effectively in the future.”