Tian Yuandong pours cold water on OpenAI's mysterious Q* project: synthetic data is no AGI savior, and the capability is limited to simple math problems

Source: New Zhiyuan

Image source: Generated by Unbounded AI

The discussion of the Q* conjecture continues. Today, AI guru Tian Yuandong publicly stated that Q* can likely only solve entry-level math problems, and that AGI may also be out of reach via synthetic data alone.

Q* conjecture continues to be popular in the AI community.

Everyone is speculating whether Q* is “Q-learning + A*”.

Tian Yuandong has also analyzed in detail how plausible the “Q* = Q-learning + A*” hypothesis is.

At the same time, more and more people are judging that synthetic data is the future of LLMs.

However, Tian Yuandong poured cold water on this statement.

I partially disagree with the claim that AGI can be achieved simply by scaling up synthetic data.
Search is powerful because, if the environment is designed properly, it will create an infinite number of new patterns for models to learn and adapt to.
However, whether billions of data points are needed to learn such new patterns remains an open question, which may point to some fundamental flaw in our architecture/learning paradigm.
By contrast, humans often find it much easier to discover new paradigms through “aha” moments.

Jim Fan, a senior scientist at NVIDIA, agrees that synthetic data will play an important role, but blindly scaling it up will not be enough to achieve AGI.

Q* = Q-learning + A*: how likely is it?

Tian Yuandong said that based on his past experience with OpenGo (a reproduction of AlphaZero), A* can be regarded as a deterministic version of MCTS that uses only the value (i.e., heuristic) function Q.

A* is well suited for tasks where the state is easy to assess after a given action, but the action is difficult to predict after a given state. A prime example of this is a math problem.

Go, by contrast, is a different story: the next candidate is relatively easy to predict (just by checking the local shape), but it is much trickier to assess the situation on the board.

That’s why we have quite powerful Go bots that use only policy networks.

For LLMs, there may be an added advantage to using Q(s,a): evaluating Q(s,a) may only require a prefill pass, whereas predicting the policy a = π(s) requires autoregressive sampling, which is much slower. Also, with a decoder-only model, the KV cache of s can be shared across multiple candidate actions.
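To make “A* with a value function Q” concrete, here is a minimal sketch of best-first search in which a Q-style value estimate plays the role of the A* heuristic. The `is_goal`, `actions`, and `q_value` callables are hypothetical stand-ins for illustration; nothing here is known about Q* itself.

```python
import heapq

def a_star(start, is_goal, actions, q_value):
    """Best-first search over states. Each frontier entry is ranked by
    cost-so-far plus a Q-style heuristic estimate of the remaining cost."""
    frontier = [(0.0, 0.0, start, [start])]  # (priority, cost, state, path)
    seen = set()
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if state in seen:
            continue
        seen.add(state)
        for action, next_state, step_cost in actions(state):
            # Q(s, a) serves as the heuristic h: an estimate of the cost
            # remaining after taking `action` in `state`.
            new_cost = cost + step_cost
            priority = new_cost + q_value(state, action)
            heapq.heappush(frontier, (priority, new_cost, next_state,
                                      path + [next_state]))
    return None  # goal unreachable
```

As with standard A*, shortest paths are only guaranteed when `q_value` never overestimates the true remaining cost (an admissible heuristic).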

So how likely is it that the legendary Q* has already made a major leap forward in solving mathematical problems?

Tian Yuandong said his guess is that the value function should be relatively easy to set up for the entry-level math problems reportedly being solved (for example, it can be predicted from the natural-language specification of the goal).

If you want to solve a difficult math problem and don’t know how to do it, this approach may not be enough.

LeCun retweeted Tian’s discussion and agreed with his point: “He explained the difference in applicability between A* (searching for the shortest path in a graph) and MCTS (searching in an exponentially growing tree).”

Regarding LeCun’s retweet, Tian Yuandong said that he has been working on many different things, including planning, understanding Transformers/LLMs, and efficient optimization techniques, and that he hopes to combine these technologies.

Some netizens expressed skepticism: “For A* to be valid, a provably admissible and consistent heuristic function is needed. But I very much doubt anyone can come up with such a function, because it’s not easy to determine the value of a partial sequence.”

Even solving elementary-school math problems would make Q* a big deal

Anyone who knows even a little about large models knows that if a model can reliably solve basic mathematical problems, that is a major leap in capability.

This is because it is difficult for large models to generalize beyond their training data.

Charles Higgins, co-founder of AI training startup Tromero, said that the key problem plaguing large models now is how to reason logically about abstract concepts; achieving this would undoubtedly be a major leap.

Mathematics is the study of symbolic reasoning: for example, if X is greater than Y and Y is greater than Z, then X is greater than Z.
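This kind of symbolic chaining is trivial to do exactly in code, which is part of the contrast being drawn with pattern-matching LLMs. A toy sketch (the facts and names are illustrative only):

```python
def entails_greater(facts, x, z):
    """Decide whether `x > z` follows from a set of (a, b) pairs meaning
    'a > b', by chaining them transitively: X > Y and Y > Z imply X > Z."""
    frontier = [b for (a, b) in facts if a == x]  # everything x directly beats
    seen = set()
    while frontier:
        y = frontier.pop()
        if y == z:
            return True
        if y not in seen:
            seen.add(y)
            frontier.extend(b for (a, b) in facts if a == y)
    return False
```

For instance, `entails_greater([("X", "Y"), ("Y", "Z")], "X", "Z")` follows the two facts and returns True, while the reverse query returns False.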

If Q* is indeed Q-learning + A*, it suggests that OpenAI’s new model can combine the deep learning techniques behind ChatGPT with rules programmed by humans. And this method could help solve LLMs’ hallucination problem.

According to Tromero co-founder Sophia Kalanovska, this has very important symbolic significance, but on a practical level, it is unlikely to end the world.

So why is there a rumor that “Q* has already appeared in the prototype of AGI”?

Kalanovska argues that, according to the current claims, Q* can combine the two halves of the brain: learning from experience while also reasoning about facts.

Obviously, this is one step closer to what we recognize as intelligence, because Q* may be able to give large models new ideas, which ChatGPT cannot do.

The biggest limitation of existing models is that they can only regurgitate information from the training data, but cannot reason and develop new ideas.

Solving the unseen problem is a key step in creating AGI.

Andrew Rogoyski, a director at the Institute for People-Centred AI at the University of Surrey, said that today’s large models can handle undergraduate-level math problems, but they all fail on more advanced ones.

But if LLMs can really solve new, unseen problems, that is a big deal, even if the math involved is relatively simple.

Synthetic data is the key to the future of LLMs?

So, is synthetic data king?

The Q* buzz has prompted much speculation among the big names, some of whom suspect that the rumored “huge computing resources that enable the new model to solve certain mathematical problems” point to RLAIF (reinforcement learning from AI feedback).

RLAIF is a technique that replaces human preference labeling with labels from an off-the-shelf LLM, making the alignment of LLMs more scalable by automating human feedback.

RLHF (Reinforcement Learning from Human Feedback), which has previously shone in LLM training, can effectively align large language models with human preferences, but collecting high-quality human preference labels is a key bottleneck.

As a result, companies such as Anthropic and Google have tried to turn to RLAIF, using AI to replace humans in the process of feedback training.
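In outline, the RLAIF data-collection loop can be sketched as below. The `policy_samples` and `judge` callables are hypothetical stand-ins for a real LLM policy and a real LLM-based judge; no specific vendor API is implied.

```python
def rlaif_preferences(prompts, policy_samples, judge):
    """Build a preference dataset with no human labelers: for each prompt,
    draw two candidate responses from the policy and let an AI judge pick
    the better one. `policy_samples` and `judge` are hypothetical stand-ins
    for a real LLM policy and a real LLM-based judge."""
    dataset = []
    for prompt in prompts:
        a, b = policy_samples(prompt)  # two candidate completions
        winner = judge(prompt, a, b)   # "a" or "b", decided by the AI judge
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen,
                        "rejected": rejected})
    return dataset
```

The resulting prompt/chosen/rejected triples have the same shape as a human-labeled RLHF preference dataset, which is exactly what makes the swap scalable.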

This would mean that synthetic data is king, and that using a tree structure gives the model more and more options later on for arriving at the right answer.

Not long ago, Jim Fan tweeted that synthetic data will provide the next trillion tokens of high-quality training data.

“I bet most serious LLM groups know that. The key question is how to maintain quality and avoid premature stagnation.”

Jim Fan also cites Richard S. Sutton’s essay “The Bitter Lesson” to argue that only two paradigms in AI scale indefinitely with computation: learning and search.

“It was true in 2019 when the essay was written, and it is true today. I bet it will stay true until the day we solve AGI.”

Richard S. Sutton is a Fellow of the Royal Society of Canada and the Royal Society, and he is considered one of the founders of modern computational reinforcement learning, with several significant contributions to the field, including temporal-difference learning and policy gradient methods.

In this article, Sutton makes the following points:

A generic approach that leverages computation is ultimately the most effective and efficient. The reason is Moore’s Law, or more precisely the continuous exponential fall in the cost per unit of computation.

Initially, researchers tried to avoid search by exploiting human knowledge or the special features of the game, but all of that became irrelevant once search was applied effectively at scale.

Once again, statistical methods triumphed over methods based on human knowledge, leading to a significant shift across natural language processing, where statistics and computation have gradually become dominant over the decades.

AI researchers often try to build knowledge into systems, which can be helpful in the short term, but may hinder further progress in the long run.

Breakthroughs will eventually be achieved through a search-and-learn-based approach.

The actual content of the mind is extremely complex. We should stop trying to find simple ways to represent thought, and instead build only meta-methods that can find and capture this arbitrary complexity.

So, it seems that Q* has grasped the crux of the problem (search and learning), and synthetic data may further enable it to break through past limitations and make its own leap.

Regarding synthetic data, Musk also said that humans really can’t beat machines.

“You could put the text of every book humans have ever written on a hard drive (sigh), and synthetic data would be far more than that.”

In response, Jim Fan told Musk:

“If we can simulate them at scale, a lot of synthetic data will come from embodied agents, such as Tesla Optimus.”

Jim Fan thinks RLAIF, or RL from ground-truth feedback, will go a long way if scaled correctly. In addition, synthetic data includes simulators, which in principle can help LLMs develop world models.

“Ideally, it is infinite. But the concern is that if the self-improvement loop is not effective enough, it risks stalling.”

Seeing the two echo each other, LeCun said he had something to add:

LeCun believes that animals and humans quickly become very smart with very little training data.

So, using more data (synthetic or not) is a temporary stopgap, simply because our current approach has limitations.

In this regard, netizens who support the “big data faction” expressed their dissatisfaction:

“Shouldn’t millions of years of evolutionary adaptation resemble pre-training, and our lifetime experience resembles continuous fine-tuning?”

LeCun then gave an example: the only means humans have of passing on the results of millions of years of evolution is genes, and the human genome contains very little data, only about 800MB.

Even a small 7B LLM requires 14GB of storage; next to that, the human genome really isn’t much data at all.

Also, the difference between the chimpanzee and human genomes is about 1% (8MB). This small difference is nowhere near enough to explain the difference in abilities between humans and chimpanzees.

When it comes to the amount of data learned, a 2-year-old child has seen only a tiny amount of visual data: his total learning time is about 32 million seconds (2 × 365 × 12 × 3600).

Humans have 2 million optic nerve fibers, each transmitting about 10 bytes per second, for a total of about 6E14 bytes.

In contrast, LLM training typically uses about 1E13 tokens, or roughly 2E13 bytes. So a 2-year-old child takes in about 30 times as much data as an LLM.
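LeCun’s back-of-the-envelope numbers check out; a quick sketch (all quantities are the approximations quoted above):

```python
# Rough figures from LeCun's argument (all approximate).
seconds_awake = 2 * 365 * 12 * 3600      # 2 years, ~12 waking hours/day
optic_bytes = 2e6 * 10 * seconds_awake   # 2M optic nerve fibers x ~10 B/s each
llm_bytes = 1e13 * 2                     # ~1e13 training tokens x ~2 B/token

print(round(seconds_awake / 1e6))        # ~32 million seconds of waking time
print(optic_bytes / llm_bytes)           # the child sees roughly 30x the LLM's data
```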

Whatever the big names argue, big tech companies like Google, Anthropic, and Cohere are already using process supervision or RLAIF-like methods to create pre-training datasets, at enormous cost in resources.

So it’s clear to everyone that synthetic data is a shortcut to expanding a dataset. In the short term, we can obviously use it to create some useful data.

But is this the way to the future? We will have to wait for the answer.
