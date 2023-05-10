OpenAI, the company behind the GPT language models, is developing a tool to explain the behavior of these models.

Large language models (LLMs), such as the ones behind ChatGPT, have been difficult even for data scientists to understand, in part because of the lack of transparency around their training data.

However, OpenAI is attempting to demystify their inner workings with a tool that automatically identifies which parts of an LLM are responsible for which of its behaviors.

“We’re trying to [develop ways to] anticipate what the problems with an AI system will be,” William Saunders, the interpretability team manager at OpenAI, told TechCrunch in an interview.

“We want to really be able to know that we can trust what the model is doing and the answer that it produces.”

How Does the Tool Explain the Behavior of GPT Models?

The tool uses OpenAI’s newest and most advanced LLM, GPT-4, to work out what the components of earlier models, such as GPT-2, are doing.

Understanding the tool’s process requires a brief look at how LLMs work.

Loosely like the human brain, these models are made up of many “neurons.” Each neuron responds to a specific pattern in text, which in turn influences how the model responds to a prompt.

For instance, when a model is asked to name the superheroes with the best superpowers, a neuron focused on Marvel superheroes may increase the chances that the model mentions characters from Marvel comics and movies.
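In GPT-2, the “neurons” examined in this work are the units of the hidden layer inside each transformer block’s MLP. The toy sketch below is an illustration, not OpenAI’s code: it computes one such neuron’s activation as the GELU of a weighted sum of a token’s hidden state. The four-dimensional vectors, weights, and bias are hypothetical values chosen so that the neuron fires on a “Marvel” token and stays near zero on an unrelated one; real GPT-2 hidden states have 768 to 1,600 dimensions.

```python
import math


def gelu(x: float) -> float:
    """Tanh approximation of GELU, the activation GPT-2 uses in its MLP layers."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def neuron_activation(hidden: list[float], weights: list[float], bias: float) -> float:
    """One MLP neuron: GELU applied to a weighted sum of a token's hidden state."""
    pre_activation = sum(h * w for h, w in zip(hidden, weights)) + bias
    return gelu(pre_activation)


# Hypothetical toy values (real GPT-2 hidden states have 768 to 1,600 dimensions).
weights = [1.5, -0.5, 0.0, 2.0]
bias = -1.0
hidden_states = {
    "Marvel": [0.9, 0.1, 0.3, 0.8],    # pre-activation 1.9: neuron fires strongly
    "banana": [-0.2, 0.7, 0.5, -0.4],  # pre-activation -2.45: output close to zero
}

for token, hidden in hidden_states.items():
    print(token, round(neuron_activation(hidden, weights, bias), 3))
```

The point of the toy is that a “neuron” is a simple function of the model’s internal state; what makes it interpretable, or not, is which inputs happen to push it high.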

OpenAI’s researchers have indicated that this neuron-based structure makes it possible to break GPT-2 down into its constituent parts.

To do this, the tool runs text sequences through the model and looks for instances in which a specific neuron activates frequently.
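A minimal sketch of this first step, assuming the per-token activations of one neuron have already been recorded for a set of text sequences. The record format, the `top_activating` helper, and the sample data are all hypothetical; the idea is simply to rank sequences by the neuron’s peak activation and keep the top k.

```python
# Hypothetical record format: one text sequence as (token, activation) pairs
# for a single neuron. Every record is assumed to be non-empty.
Record = list[tuple[str, float]]


def top_activating(records: list[Record], k: int) -> list[Record]:
    """Return the k sequences in which the neuron reaches its highest activation."""
    return sorted(records, key=lambda rec: max(act for _, act in rec), reverse=True)[:k]


records: list[Record] = [
    [("The", 0.0), (" best", 0.1), (" Avengers", 3.2), (" movie", 0.4)],
    [("I", 0.0), (" ate", 0.0), (" a", 0.0), (" banana", 0.1)],
    [("Iron", 2.5), (" Man", 2.9), (" flies", 0.3)],
]

for rec in top_activating(records, k=2):
    print("".join(tok for tok, _ in rec))
```

Ranking by the maximum rather than the average activation favors excerpts where the neuron clearly “lights up,” which are the most informative ones to show an explainer.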

It then shows GPT-4 these highly active neurons, together with the text that activated them, and asks it to generate an explanation of what each neuron responds to.
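The explanation step amounts to turning those excerpts into a prompt. The sketch below assumes a hypothetical prompt layout: each token is listed next to its activation, rescaled to an integer from 0 to 10 relative to the neuron’s peak, and the resulting text would then be sent to GPT-4. The `format_for_explainer` helper and the prompt wording are illustrative; OpenAI’s actual prompt format may differ, and no API call is made here.

```python
def format_for_explainer(records: list[list[tuple[str, float]]]) -> str:
    """Build a hypothetical explanation prompt from (token, activation) excerpts.

    Activations are clamped at zero and rescaled to integers from 0 to 10,
    relative to the highest activation seen across all excerpts.
    """
    peak = max(act for rec in records for _, act in rec)
    lines = ["Neuron activations per token (0-10 scale):"]
    for rec in records:
        lines.append("<start>")
        for tok, act in rec:
            scaled = round(10 * max(act, 0.0) / peak) if peak > 0 else 0
            lines.append(f"{tok}\t{scaled}")
        lines.append("<end>")
    lines.append("Explanation of what this neuron responds to:")
    return "\n".join(lines)


excerpts = [
    [("Iron", 2.5), (" Man", 2.9), (" flies", 0.3)],
    [("The", -0.1), (" Avengers", 2.0)],
]
print(format_for_explainer(excerpts))
```

Rescaling to a small integer range keeps the prompt compact and makes activations from different neurons look comparable to the explaining model.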

To check each explanation, the tool then asks GPT-4 to simulate how the neuron would behave on text sequences and compares the simulated behavior with that of the actual neuron.

This methodology allows OpenAI to generate an explanation for each of GPT-2’s neurons and to score that explanation by how closely it matches the neuron’s actual behavior.
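To score an explanation, the simulated activations can be compared with the real ones token by token. The sketch below assumes a Pearson correlation score (1.0 means the simulation tracks the neuron perfectly, values near 0 mean it explains little) and replaces GPT-4 with a hypothetical keyword-matching `toy_simulator` so that it runs offline. Both are illustrative stand-ins rather than OpenAI’s implementation, and the activation values are made up.

```python
import math


def correlation_score(simulated: list[float], actual: list[float]) -> float:
    """Pearson correlation between simulated and actual activations.

    Returns 0.0 when either series is constant (correlation undefined).
    Correlation ignores scale, so a 0-10 simulation can be compared with raw activations.
    """
    n = len(actual)
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        return 0.0
    return cov / math.sqrt(var_s * var_a)


def toy_simulator(tokens: list[str], keywords: set[str]) -> list[float]:
    """Hypothetical stand-in for GPT-4: predict 10 for keyword tokens, 0 otherwise."""
    return [10.0 if tok.strip().lower() in keywords else 0.0 for tok in tokens]


tokens = ["The", " best", " Avengers", " movie", " stars", " Thor"]
actual = [0.0, 0.1, 3.2, 0.4, 0.2, 2.8]  # made-up activations of the real neuron

# Hypothetical explanation "Marvel superhero names", reduced to a keyword set.
good = toy_simulator(tokens, {"avengers", "thor", "hulk"})
bad = toy_simulator(tokens, {"banana"})
print(round(correlation_score(good, actual), 3))  # close to 1: explanation fits
print(correlation_score(bad, actual))             # 0.0: simulation is flat
```

A correlation-based score rewards an explanation for predicting *where* the neuron fires, without requiring the simulator to match the neuron’s exact activation magnitudes.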

“Using this methodology, we can basically, for every single neuron, come up with some kind of preliminary natural language explanation for what it’s doing and also have a score for how well that explanation matches the actual behavior,” Jeff Wu, who leads the scalable alignment team at OpenAI, said.

OpenAI Hopes to Increase Trust in GPT Models

Saunders said OpenAI hopes the tool will increase trust in the decisions made by GPT models.

While there are hopes that such tools will eventually make these models less biased or toxic, there is still a long way to go before they are genuinely useful.

The team has generated explanations for all 307,200 neurons in GPT-2 XL, the largest version of GPT-2, and has released both the dataset and the tool’s code.

However, of those 307,200 neurons, the tool produced explanations it was confident in, meaning ones that closely matched the neuron’s actual behavior, for only about 1,000.
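The 307,200 figure follows from the architecture of GPT-2 XL, the largest GPT-2 model: 48 transformer blocks, each with an MLP hidden layer four times as wide as the 1,600-dimensional model width. The short calculation below reproduces that count and shows what fraction roughly 1,000 well-explained neurons represents.

```python
# GPT-2 XL configuration (the largest GPT-2 model).
n_layers = 48            # transformer blocks
d_model = 1_600          # hidden size of the residual stream
mlp_width = 4 * d_model  # each MLP hidden layer has 6,400 neurons

total_neurons = n_layers * mlp_width
print(total_neurons)  # 307200

well_explained = 1_000  # approximate count reported in the article
print(f"{well_explained / total_neurons:.2%}")  # 0.33%
```

In other words, only about one neuron in 300 received an explanation the tool was confident in.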

Despite these limitations, the team hopes this work will open a promising avenue toward automating interpretability.

The tool could also be adapted to work with LLMs other than GPT-4. The team believes the underlying methodology would still hold regardless of the size of the model being explained or the data it was trained on.

