AI Drift Explained - Why the New ChatGPT Version Performs Much Worse in Some Tasks and Much Better in Others

ChatGPT debuted in November 2022 and quickly became a global sensation, captivating audiences worldwide.

As the most widely acclaimed AI-powered chatbot, it has garnered praise for its potential to revolutionize the field of artificial intelligence and bring about significant changes in the world.

Following its initial release, ChatGPT has undergone multiple updates and iterations to enhance its capabilities.

However, despite these efforts, users have reported varied performance across different tasks in the latest version of ChatGPT.

ChatGPT is Changing Over Time, Getting Better at Some Tasks and Worse at Others

ChatGPT, an AI language model designed for human conversation, has garnered significant attention for its remarkable ability to generate realistic-sounding text of all kinds.

However, there are growing concerns about its performance declining in some tasks over time.

A study conducted by Stanford University observed that ChatGPT’s proficiency in specific tasks declined between March and June 2023, showing fluctuations or “drifts” in its capabilities.

In an interview with Fortune, one of the study’s authors, James Zou, a computer science professor at Stanford University, expressed his astonishment at the substantial “magnitude of the change” exhibited by the advanced ChatGPT.

The study uncovered notable differences between the March and June results, as well as differences between GPT-3.5 and GPT-4, the two most recent models.

It underscored not only changes in the models' accuracy on specific tasks but also how modifying one aspect of a model can have unpredictable consequences for its other behaviors.

These findings have sparked questions about the factors contributing to the diminishing performance and cast doubts on the overall capabilities of AI.

To measure these changes, the study compared ChatGPT’s performance over several months.

It focused on four different tasks – solving math problems, answering sensitive questions, generating software code, and visual reasoning.

The tasks were chosen to showcase the diverse and valuable capabilities of large language models (LLMs).

However, the study revealed significant variations in the performance and behavior of GPT-3.5 and GPT-4 between their March and June versions, with certain tasks showing a notable decline in performance as time passed.

What the Monitoring Revealed

The researchers observed significant fluctuations, termed “drift,” in the performance of the technology when handling specific tasks.

In the case of GPT-4’s ability to solve certain math problems, the results were surprising. The researchers asked it to determine whether 17077 is a prime number, which should be an easy task for an advanced AI model like ChatGPT.

GPT-4’s accuracy plummeted from 97.6% in March to 2.4% in June, while GPT-3.5 exhibited remarkable improvement, with accuracy increasing from 7.4% to 86.8% during the same period.

Moreover, GPT-4’s responses became notably more concise, with its average verbosity dropping drastically from 821.2 characters in March to a mere 3.8 characters in June.

On the other hand, GPT-3.5 experienced around a 40% growth in response length. The answers provided by both models in their March and June versions for the respective tasks showed minimal overlap.
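The two metrics in this comparison, accuracy and verbosity, are simple to compute once the responses have been collected. The following is a minimal sketch, not the researchers' actual code. The helper names (`extract_answer`, `accuracy`, `verbosity`) and the toy responses are invented for illustration, and the rule of reading the first "yes" or "no" in a response as the model's answer is an assumption.

```python
import re
from statistics import mean


def extract_answer(response):
    """Return the first standalone 'yes' or 'no' in the response, lowercased.

    Assumption: the first such word is treated as the model's stated answer.
    Returns None when neither word appears.
    """
    match = re.search(r"\b(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def accuracy(responses, expected):
    """Fraction of responses whose extracted answer equals the expected label."""
    hits = sum(extract_answer(r) == e for r, e in zip(responses, expected))
    return hits / len(responses)


def verbosity(responses):
    """Average response length in characters."""
    return mean(len(r) for r in responses)


# Toy responses, made up for illustration; these are not the study's data.
expected = ["yes", "yes"]
march_toy = ["Checking divisors up to 130 finds none, so the answer is Yes.", "Yes"]
june_toy = ["No", "No"]

print(accuracy(march_toy, expected), verbosity(march_toy))
print(accuracy(june_toy, expected), verbosity(june_toy))
```

Running the same measurement on two snapshots of a model and comparing the numbers is the basic idea behind tracking drift.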

The researchers speculated that these fluctuations might be influenced by drift in how the models handle chain-of-thought prompting, a reasoning method commonly used in these tasks.


For instance, in March, GPT-4 meticulously followed the chain-of-thought instruction to determine if 17077 is a prime number.

It carefully decomposed the task into four steps, executed each step, and ultimately arrived at the correct answer, finding that 17077 is indeed a prime number.
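The claim that 17077 is prime is easy to verify independently. The sketch below uses plain trial division, which is one standard way to check a small number. It is not a reconstruction of the four steps GPT-4 produced, and the function name `is_prime` is chosen here for illustration.

```python
from math import isqrt


def is_prime(n):
    """Trial division: test odd divisors up to the integer square root of n."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for divisor in range(3, isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


print(is_prime(17077))  # True
```

Because the square root of 17077 is just under 131, only odd candidates up to 130 need to be tried.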

However, the chain-of-thought approach failed in June, and GPT-4 simply generated a blunt “No” without any intermediate steps.

Interestingly, GPT-3.5 exhibited a different drift pattern for chain-of-thought. In March, it tended to generate the answer “No” first and only then work through the reasoning steps, so its stated answer was wrong even when the reasoning that followed reached the correct conclusion.

Yet, in June, GPT-3.5’s update rectified this issue by presenting the reasoning steps and then generating the correct answer, “Yes.”

The researchers also observed varying results when the models were asked to write code and perform visual reasoning tests.

Both GPT-4 and GPT-3.5 showed a decline in the share of generated code that was directly executable: GPT-4 dropped from 52.0% to 10.0% and GPT-3.5 from 22.0% to 2.0% between March and June.

Additionally, GPT-4’s verbosity in code generation increased by 20%.
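The article does not spell out how the study decided whether a generation was "directly executable." As a rough stand-in, the sketch below checks whether a raw model output parses as Python without any cleanup. The function names and the two sample outputs are hypothetical, and a real evaluation would also need to run the code against test cases.

```python
import ast


def parses_as_python(generated):
    """Return True if the raw output is syntactically valid Python as-is."""
    try:
        ast.parse(generated)
        return True
    except SyntaxError:
        return False


def executable_rate(generations):
    """Share of generations that parse without any manual cleanup."""
    return sum(parses_as_python(g) for g in generations) / len(generations)


# Hypothetical outputs: the same function, once bare and once with extra prose.
bare = "def add(a, b):\n    return a + b\n"
wrapped = "Here is the solution:\n" + bare

print(executable_rate([bare, wrapped]))  # 0.5
```

Even a single line of explanatory text around otherwise correct code is enough to make the raw output fail this kind of check, so extra verbosity can lower the executable share without the code itself getting worse.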

For visual reasoning tasks involving abstract reasoning, both GPT-4 and GPT-3.5 showed an overall 2% improvement in the exact match rate. However, the generation length remained relatively constant.

Approximately 90% of the visual reasoning queries showed no change in generation from March to June.
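Both figures for the visual reasoning task, the exact match rate and the share of unchanged generations, come down to counting identical items in two lists. The following is a minimal sketch under that reading. The helper `match_rate` and the tiny grids are invented for illustration and are not taken from the study.

```python
def match_rate(left, right):
    """Fraction of positions where two equally long sequences hold equal items."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(left, right)) / len(left)


# Toy output grids, made up for illustration only.
targets = [[[0, 1], [1, 0]], [[2, 2], [2, 2]]]
march_outputs = [[[0, 1], [1, 0]], [[2, 2], [2, 0]]]
june_outputs = [[[0, 1], [1, 0]], [[2, 2], [2, 2]]]

print(match_rate(march_outputs, targets))       # exact match rate, older version
print(match_rate(june_outputs, targets))        # exact match rate, newer version
print(match_rate(march_outputs, june_outputs))  # share of unchanged generations
```

Comparing each version against the targets measures quality, while comparing the two versions against each other measures how much the behavior moved, and the two can differ.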

A Need for Constant Fine Tuning

The study highlights significant changes in the behavior of GPT-3.5 and GPT-4 within a relatively short period.

However, the lack of visibility into the models used in ChatGPT has hindered a thorough understanding of specific unintended side effects.

This lack of transparency became more pronounced after OpenAI decided not to make its code open source in March, leading to backlash from AI experts and tech analysts.

One such expert, Ben Schmidt from Nomic AI, criticized OpenAI for not disclosing its code.

He pointed to a 98-page paper introducing GPT-4, which revealed no information about the content of the training set.

Schmidt also shared a snippet from the GPT-4 Technical Report to support his argument.

Regarding the closed-source approach, Zou noted that these models are like black boxes, making it difficult to understand how their neural architectures and training data have evolved.

However, he emphasized the importance of establishing that drifts occur and that they can produce significantly different results.

Zou further noted that the central point conveyed through the research is the widespread occurrence of drifts in large language models.
