OpenAI finds itself in a bind as it faces legal threats and investigations from companies whose copyrighted materials were used to train its popular artificial intelligence (AI) models.

The New York Times was the first major publisher to file a lawsuit against the firm headed by Sam Altman, in December of last year, claiming that the AI lab used over 10 million articles from its archive to train the algorithms and neural networks that power ChatGPT and the other AI products OpenAI offers.

Meanwhile, the UK Parliament’s Communications and Digital Committee opened an inquiry in July of last year examining how the technology should be regulated so that the opportunities it presents to society and businesses can be seized while its risks are mitigated.

“Our inquiry will therefore take a sober look at the evidence across the UK and around the world, and set out proposals to the Government and regulators to help ensure the UK can be a leading player in AI development and governance,” a statement from the Committee read.

Sam Altman and OpenAI Claim It Would Be Impossible to Train AI Models Without Infringing Copyright

OpenAI responded shortly afterward, providing answers to several of the inquiry’s questions, including how large language models (LLMs) may continue to grow and become more sophisticated over the next one to three years, and whether AI technology risks producing a catastrophe for mankind.

On the subject of copyright infringement, in particular, the company made a striking admission.

“Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials,” the letter sent to the Parliament reads.

The firm added: “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

This public acknowledgment has broad implications for the firm, its customers, regulators, and publishers, as it reveals two things. First, OpenAI has relied primarily on copyrighted materials (that it didn’t own or license) to feed its current models.

Second, the company needs to keep using this kind of material, which is protected under the law as intellectual property, because its advances in the AI field depend on unfettered access to these sources. If OpenAI were making anything other than an AI model, scraping millions of people’s intellectual property to shove into a commercial product would be obviously illegal. So why does AI get a pass?

The December lawsuit from the New York Times will give both OpenAI and the renowned news outlet the chance to argue the issue and let the court decide whether this claimed need to use copyrighted materials is lawful.

Altman’s company will argue that the Times’ IP was used under “fair use,” a legal doctrine that permits limited use of copyrighted material without the owner’s permission in certain circumstances.

Under US copyright law, four factors are weighed to determine whether a use of IP qualifies as fair use:

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
  2. the nature of the copyrighted work;
  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

One element that may weigh against OpenAI’s defense is that it profits directly from the use of copyrighted materials: it charges users a $20-per-month subscription fee and makes millions from those subscriptions every year.

“I do have every reason to believe that they would like to preserve their rights to use this under fair use,” commented Danielle Coffey, the head of the News Media Alliance. This suggests that OpenAI may opt not to settle the case, instead seeking a favorable ruling that sets a precedent and deters other publishers from filing similar lawsuits in the future.

IEEE Spectrum Researchers Find OpenAI Claims Absurd

Research published this year in IEEE Spectrum, the magazine of the Institute of Electrical and Electronics Engineers (IEEE), found that output generated by AI models developed by OpenAI and Midjourney can amount to plagiarism in multiple cases, as it recreates images and scenes from movies and TV shows.

“Both OpenAI and Midjourney are fully capable of producing materials that appear to infringe on copyright and trademarks,” the authors of the paper, Gary Marcus and Reid Southen, commented.

What is perhaps more worrying is that customers are not adequately informed when images produced by models like DALL-E could be infringing copyright. Hence, it’s not just OpenAI that could be held liable for this output, but also the user.

On the fair use debate, meanwhile, the two researchers highlighted that OpenAI is hoping to be treated differently than any other company despite knowingly breaking the law. Using millions of other people’s IP to make a commercial product without permission is pretty obviously illegal.

“We won’t get fabulously rich if you don’t let us steal, so please don’t make stealing a crime!” Marcus posted on social media referring to the argument made by OpenAI that it would be “impossible” to train its AI models without copyrighted materials.

“Don’t make us pay licensing fees, either! Sure Netflix might pay billions a year in licensing fees, but we shouldn’t have to! More money for us, moar!” he sarcastically added.

OpenAI Could Be Forced to Pay $7.5 Billion in Damages to NYT

The stakes are high for OpenAI: US copyright law provides for statutory damages starting at $750 for every work used without express permission from the rights holder. The Times claims that over 10 million of its articles were used to train the AI models developed by Altman’s company.

Hence, the company could be forced to pay at least $7.5 billion to the Times if the publisher can substantiate its claims.
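For scale, the math is straightforward. The back-of-the-envelope sketch below (illustrative only, not a legal damages estimate) multiplies the $750 statutory floor by the Times’ claimed article count and reproduces the $7.5 billion figure; since courts can award more than the floor per work, this is only a lower bound. No numbers are used beyond those already cited in this article.

```python
# Back-of-the-envelope statutory damages calculation (illustrative only,
# not a legal estimate). Both inputs come from this article: the $750
# statutory minimum per infringed work and the Times' claim that over
# 10 million articles were used in training.

STATUTORY_FLOOR_PER_WORK = 750     # dollars per infringed work (minimum)
CLAIMED_ARTICLES = 10_000_000      # the Times' claimed article count

minimum_exposure = STATUTORY_FLOOR_PER_WORK * CLAIMED_ARTICLES
print(f"Minimum exposure: ${minimum_exposure:,}")
# -> Minimum exposure: $7,500,000,000  (i.e., $7.5 billion)
```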

OpenAI is reportedly losing at least $5 billion every year and may soon need another funding round to shore up its finances. A lawsuit of this magnitude could deal a devastating blow to the business, not just financially but also operationally, as it would limit the company’s ability to train AI models on the most up-to-date information from publishers like the Times.

Recently, the firm struck deals with large publishers like Axel Springer to access their copyrighted materials and use them to train its models. The Axel Springer agreement would generate payments of up to $10 million per year for the publisher, a figure most experts categorized as low.

Industry observers note, however, that OpenAI has little motivation to pay more, as it can keep feeding its AI models copyrighted data until a court says otherwise. Hence, for large media companies, these deals at least let them make some money off the situation until they get a favorable ruling that stops the AI race in its tracks.

Tyler Ochoa, a law professor at Santa Clara University in California, views the copyright issue in another light. He argues that liability for plagiarized output should fall on the users who prompt the models, not on the companies that develop them.

He countered the arguments and conclusions of the IEEE Spectrum report, contending that the research was biased by prompts that explicitly mentioned the titles of specific movies or steered the model toward generating potential replicas of them.

“This [copyright infringement] should be analyzed as a case of contributory infringement: The person who prompted the model is the primary infringer, and the creators of the model are liable only if they were made aware of the primary infringement and they did not take reasonable steps to stop it,” Ochoa stated.

Ochoa suggested that the models were likely trained on movie posters and trailers rather than on entire films or scenes.

He argued that, because media companies distributed those promotional visual materials precisely so they would be seen by as many viewers as possible, it would be hard for them to claim copyright infringement over their use. Then again, this argument doesn’t tell the whole story, as OpenAI still took the IP of countless people and made a commercial product with it.