More than half a year has passed and ChatGPT’s ranking is almost at the bottom.

金色财经_

2023-09-09 03:11:22

Author: Sanyan Technology

Today, I accidentally came across a picture.

According to the picture, OpenAI’s GPT-4 has ranked last among the 11 large models (the first one is numbered 0). Some netizens added the words “GPT4: How can I complain about my grievances?”

This makes people curious. At the beginning of this year, after ChatGPT became popular, other companies began to mention the concept of large models.

It’s only been more than half a year, and GPT is already “at the bottom”?

Therefore, the author wanted to see what the GPT ranking was like.

Testing time is different Testing team is different GPT-4 ranks eleventh

Judging from the information displayed on the picture in the previous article, this ranking is from the C-list.

C-List, the full name of C-Global Large Model Comprehensive Examination Test List, is a Chinese language model comprehensive examination evaluation suite jointly constructed by Tsinghua University, Shanghai Jiao Tong University and the University of Edinburgh.

It is reported that the suite covers four major directions: humanities, social sciences, science and engineering, and other majors, including 52 subjects, covering multiple knowledge fields such as calculus and linear algebra. There are a total of 13,948 Chinese knowledge and reasoning questions, with difficulty divided into four test levels: middle school, undergraduate, graduate, and vocational.

So the author checked the latest C-list.

The latest ranking of the C-list is consistent with the ranking shown in the previous picture. Among the top 11 large models, GPT-4 ranks last.

According to the C-list, these results represent zero-shot (zero-shot learning) or few-shot (few-shot learning) tests, but few-shot is not necessarily better than zero-shot.

C- said that in its tests it was found that many models after instruction fine-tuning were better under zero-shot. Many of the models tested have both zero-shot and few-shot results, and the ranking shows the setting with the better overall average score.

The C-list also indicates that the names of large models with “*” indicate that the model results were tested by the C-team, while other results were obtained through user submissions.

In addition, the author also noticed that the time for submitting test results for these large models varies greatly.

The test result submission time for GPT-4 is May 15th, while Yuntianshu, which ranks first, submits on August 31st; Galaxy, which ranks second, submits on August 23rd; and YaYi, which ranks third, submits its results on August 31st. for September 4th.

Moreover, among the top 16 large models, only GPT-4 has “*” added to its name and was tested by the C-team.

So the author checked the complete C-list again.

The latest C-list includes a total of 66 large model rankings.

Among them, there are only 11 with “*” in their names, which are tested by the C-team, and the time of submission for testing was May 15th.

For these large models tested by the C-team, OpenAI’s GPT-4 ranked 11th, ChatGPT ranked 36th, Tsinghua Zhipu AI’s ChatGLM-6B ranked 60th, and Fudan’s MOSS ranked 6th. fourteen.

Although these rankings can show the rapid development momentum of domestic large models, the author believes that, after all, they are not tested by the same team at the same time, which is not enough to fully prove who is stronger and who is weaker among these large models.

This is like a class of students who each have different test times and answer different papers. How can we rely on each student’s score to compare?

What do the big model developers say? Many people said they surpassed ChatGPT in Chinese and other abilities

Recently, the circle of large models has been quite lively.

In addition, the large model products of eight companies including Baidu and Byte have passed the registration of the “Interim Measures for the Management of Generative Artificial Intelligence Services” and can be officially launched online to provide services to the public. Other companies have successively released their own large model products.

So how do the developers of these large models introduce their products?

On July 7, at the 2023 World Artificial Intelligence Conference “Opportunities and Risks for the Development of the General Artificial Intelligence Industry in the Big Model Era” forum, Qiu Xipeng, professor at the School of Computer Science and Technology at Fudan University and head of the MOSS system, said that Fudan’s conversational large-scale language model MOSS After being released in February this year, it has been continuously iterating, “The latest MOSS has been able to surpass ChatGPT in Chinese capabilities.”

At the end of July, NetEase Youdao launched a large translation model. NetEase Youdao CEO Zhou Feng publicly stated that in internal tests, in the direction of Chinese-English translation, it has surpassed the translation capabilities of ChatGPT and surpassed Google Translate. level. **

In late August, at the 2023 Yabuli Forum Summer Summit, Liu Qingfeng, founder and chairman of iFlytek, gave a speech and said, “**The code generation and completion capabilities of the iFlytek Spark model have surpassed ChatGPT, and other This capability is catching up quickly. **The logic, algorithms, method systems, and data preparations for the current code capability are ready, and all that is needed is time and computing power.”

SenseTime stated in a recent press release that in August this year, the new model internlm-123b completed training and the number of parameters increased to 123 billion. **On the global 51 well-known evaluation sets with a total of 300,000 questions, the overall test results ranked second in the world, surpassing models such as gpt-3.5-turbo and the newly released llama2-70b by Meta Company. **

According to Shangtang, **internlm-123 ranked first in 12 major evaluations. Among them, the agi score in the comprehensive test of the evaluation set is 57.8, surpassing gpt-4 and ranking first; the evaluation score of **knowledge commonsenseqa is 88.5, ranking first; internlm-123b scores in the five reading comprehension evaluations All top the list.

In addition, it ranked first in the five evaluations of reasoning.

Earlier this month, Zuoyebang officially released its self-developed Galaxy model.

Zuoyebang said that the Galaxy model has achieved results on the two authoritative large language model evaluation benchmarks of C- and CMMLU. The data shows that Zuoyebang Galaxy Big Model ranks first in C- with an average score of 73.7 points; at the same time, it ranks in the CMMLU list Five-shot and Zero-shot evaluations with average scores of 74.03 points and 73.85 points respectively. First, it became the first major education model to rank first in average score on the two authoritative lists mentioned above.

Yesterday, Baichuan Intelligent announced the official open source fine-tuned Baichuan 2-7B, Baichuan 2-13B, Baichuan 2-13B-Chat and their 4-bit quantized version.

Wang Xiaochuan, founder and CEO of Baichuan Intelligence, said that in the Chinese field, the actual performance of the fine-tuned Chat model in the Q&A environment or summary environment has exceeded that of closed-source models such as ChatGPT-3.5. **

Today, at the 2023 Tencent Global Digital Ecology Conference, Tencent officially released the Hunyuan large model. Jiang Jie, vice president of Tencent Group, said that the Chinese language capability of **Tencent Hunyuan large model has exceeded GPT-3.5. **

In addition to the self-introductions of these developers, some media and teams also evaluated a large model.

In early August, the team of Shen Yang, a professor and doctoral supervisor at the School of Journalism and Communication at Tsinghua University, released the “Comprehensive Performance Evaluation Report of Large Language Models.” The report shows that **Baidu Wenxinyiyan’s comprehensive score in 20 indicators in three major dimensions leads the country, and is better than ChatGPT. Among them, Chinese semantic understanding ranks high, and some Chinese abilities are better than GPT-4. **

In mid-August, some media reported that on August 11, Xiaomi’s large model MiLM-6B appeared on the C- and CMMLU large model evaluation lists. As of now, MiLM-6B ranks 10th in the C-overall list, 1st in the same parameter magnitude, and 1st in CMMLU Chinese large models.

On August 12, Tianjin University released the “Large Model Evaluation Report”. The report shows that the comprehensive performance of **GPT-4 and Baidu Wenxinyiyan is significantly ahead of other models, and their scores are not much different and are at the same level. Wen Xinyiyan has surpassed ChatGPT in most Chinese tasks and gradually narrowed the gap with GPT-4. **

In late August, some media reported that Kuaishou’s self-developed large language model “KwaiYii” had started internal testing. In the latest CMMLU Chinese-oriented rankings, KwaiYii-13B, the 13B version of KwaiYi, ranked first in both five-shot and zero-shot categories. It is strong in humanities, Chinese specific topics, etc., with an average score of over 61 points.

It can be seen from the above that although these large models claim to be at the top of a certain ranking or surpass ChatGPT in certain aspects, most of them perform well in some specific fields.

In addition, some comprehensive scores exceed GPT-3.5 or GPT-4, but the GPT test was stopped in May. Who can guarantee that GPT has not improved in the past three months?

OpenAI’s situation

According to a report from UBS Group in February, just two months after ChatGPT was launched, its monthly active users had exceeded 100 million at the end of January 2023, making it the fastest-growing consumer application in history.

But the development of ChatGPT is not so smooth.

In July this year, many GPT-4 users complained that compared with previous reasoning capabilities, GPT-4’s performance had declined.

Some users pointed out problems on Twitter and the OpenAI online developer forum, focusing on weaker logic, more incorrect answers, an inability to keep track of the information provided, difficulty following instructions, forgetting to add parentheses in basic software code, and only remembering the most recent tips, etc.

In August, another report stated that OpenAi may be in potential financial crisis and may go bankrupt by the end of 2024.

The report stated that OpenAI costs approximately US$700,000 per day just to run its artificial intelligence service ChatGPT. Currently, the company is trying to become profitable with GPT-3.5 and GPT-4, but has yet to generate enough revenue to break even.

However, OpenAI may also have new opportunities.

Recently, OpenAI announced that it will hold its first developer conference in November.

Although OpenAI stated that it will not release GPT-5, OpenAI said that hundreds of developers from around the world will work with the OpenAI team to preview “new tools” in advance and exchange ideas.

This may mean that ChatGPT has made new progress.

According to The Paper, on August 30, a person familiar with the matter revealed that OpenAI is expected to achieve more than $1 billion in revenue in the next 12 months by selling AI software and the computing power to drive its operation.

Today, another media report stated that Morgan Stanley will launch a generative artificial intelligence chatbot jointly developed with OpenAI later this month.

People who deal with bankers at Morgan Stanley are either rich or wealthy. If this upcoming generative artificial intelligence chatbot can bring a different experience to Morgan Stanley’s clients, it may be a huge gain for OpenAI.

The arrival of the era of artificial intelligence has become unstoppable. As for who is better, you can’t just tell yourself, you have to let users rate it. We also believe that large domestic models will definitely catch up with ChatGPT in terms of specific capabilities and comprehensive capabilities.

View Original

Disclaimer: The information on this page may come from third parties and does not represent the views or opinions of Gate. The content displayed on this page is for reference only and does not constitute any financial, investment, or legal advice. Gate does not guarantee the accuracy or completeness of the information and shall not be liable for any losses arising from the use of this information. Virtual asset investments carry high risks and are subject to significant price volatility. You may lose all of your invested principal. Please fully understand the relevant risks and make prudent decisions based on your own financial situation and risk tolerance. For details, please refer to Disclaimer.

Comment

0/400

No comments

More than half a year has passed and ChatGPT’s ranking is almost at the bottom.

Testing time is different Testing team is different GPT-4 ranks eleventh

**What do the big model developers say? **Many people said they surpassed ChatGPT in Chinese and other abilities

OpenAI’s situation

What do the big model developers say? Many people said they surpassed ChatGPT in Chinese and other abilities