Understanding GPT-5.5 in One Article: Starting Today, OpenAI Will "Stop Selling" Tokens
Author: Li Hailun, Tencent Technology
On April 23rd, local time, OpenAI officially released its new flagship model, GPT-5.5, positioned as a “completely new intelligence layer for real-world work” and marking an important step toward a new way of operating computers.
This release has two core focuses:
First is a breakthrough in efficiency: the model is larger, yet serves at the same latency as before. GPT-5.5’s context window reaches 1 million tokens, and it is not merely a capability upgrade over GPT-5.4: it achieves higher intelligence at unchanged latency.
Second, GPT-5.5 participated in optimizing its own reasoning infrastructure during training. In short, AI has learned to help tune its own parameters for the first time.
In the Terminal-Bench 2.0 test for complex command-line workflows, GPT-5.5 scored 82.7%, surpassing Claude Opus 4.7’s 69.4% by over 13 percentage points; in the OSWorld-Verified test for AI independently operating real computers, the success rate was 78.7%, exceeding human baseline; in the GDPval test covering 44 professional knowledge tasks, 84.9% of tasks met or exceeded industry expert levels.
However, the price of GPT-5.5 has also increased significantly.
API pricing is $5 per million input tokens and $30 per million output tokens, double GPT-5.4’s $2.50 and $15. However, OpenAI emphasizes that GPT-5.5 needs significantly fewer tokens to complete the same tasks, so overall costs may not rise much. The GPT-5.5 Pro API is priced at $30 per million input tokens and $180 per million output tokens. Batch and flex processing are discounted by half, while priority processing costs 2.5 times the standard rate.
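As a rough sanity check on these numbers, a small script can compare per-task costs at the quoted list prices. The token counts per task below are illustrative assumptions, not measured figures:

```python
# Hypothetical cost comparison using the per-million-token rates quoted above.
# The example token counts are illustrative assumptions.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one task at the quoted list prices."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# If GPT-5.5 really needs fewer tokens for the same task, the doubled
# rate is partly offset. Example: same input, half the output tokens.
old = task_cost("gpt-5.4", 20_000, 8_000)   # 0.05 + 0.12 = 0.17
new = task_cost("gpt-5.5", 20_000, 4_000)   # 0.10 + 0.12 = 0.22
print(f"GPT-5.4: ${old:.4f}  GPT-5.5: ${new:.4f}")
```

In this assumed scenario the newer model still costs somewhat more per task; whether the claimed token savings fully close the gap depends on the workload.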
In ChatGPT, GPT-5.5 is launched as “GPT-5.5 Thinking,” gradually replacing previous versions.
A small new design touch: before thinking, the model first gives a brief overview of its reasoning, and users can interject at any point during execution to adjust its direction.
To summarize GPT-5.5 in one sentence: past models are collections of capabilities; GPT-5.5 is closer to a working system that can plan, check, and continuously advance its tasks.
84.9% of tasks meet professional standards
Image: Comparison of GPT-5.5 and various competitors in core benchmarks like Terminal-Bench 2.0, GDPval, OSWorld-Verified
First, let’s look at how the model performs in real professional scenarios. OpenAI used a benchmark called “GDPval,” which requires models to complete full sets of professional tasks. The test covers 44 occupational scenarios, including financial modeling, legal analysis, data science reports, operational planning, and more.
Results show: GPT-5.5 achieves or surpasses industry professional levels in 84.9% of tasks. In comparison, GPT-5.4 is at 83.0%, Claude Opus 4.7 at 80.3%, and Gemini 3.1 Pro only 67.3%.
This gap is not only reflected in the total scores. In spreadsheet modeling tasks, GPT-5.5’s internal test scored 88.5%; it also leads in investment banking-level modeling tasks. Early testers’ feedback is quite consistent: responses from GPT-5.5 Pro show clear improvements over GPT-5.4 Pro in comprehensiveness, structure, and practicality, especially in business, legal, education, and data science fields.
Numbers alone can be numbing, so OpenAI simply pulled back the curtain on its own workplace.
OpenAI states that over 85% of its internal staff use Codex weekly, across departments including finance, communications, marketing, product, and data science. The communications team used it to analyze six months of speaking-invitation data and build an automated classification process; the finance team reviewed 1M-1 tax forms totaling 71,637 pages, finishing two weeks ahead of schedule; the marketing team automated its weekly reports, saving each person 5 to 10 hours per week.
This is no longer a lab demo; it has become part of daily work routines.
The most powerful autonomous programming model
OpenAI claims that GPT-5.5 is currently its strongest autonomous coding model.
In Terminal-Bench 2.0 (testing complex command-line workflows requiring planning, iteration, and tool coordination), GPT-5.5 scored 82.7%, compared to GPT-5.4’s 75.1%, an improvement of nearly 8 percentage points, with less token consumption. In SWE-Bench Pro (evaluating the ability to solve real GitHub issues in a one-shot manner), GPT-5.5 scored 58.6%. In internal Expert-SWE evaluations (long-term programming tasks with a median human completion time of about 20 hours), GPT-5.5 also outperformed GPT-5.4.
Image: Scatter plots of Terminal-Bench 2.0 and Expert-SWE
Driven by Codex, GPT-5.5 can start from a single prompt and independently complete the entire development process—from code generation, functionality testing, to visual debugging.
OpenAI’s official demos show a space-mission application built on NASA’s real orbital data, supporting 3D interactive control with orbital-mechanics simulation at real physical accuracy, and an earthquake tracker that connects to live data sources and visualizes the results, demonstrating the model’s full ability to call external APIs, handle dynamic data, and render in real time.
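The earthquake-tracker demo amounts to consuming a GeoJSON event feed. As a minimal sketch of that data-handling step, the following parses a hand-written payload in the shape of the USGS earthquake summary feed; the sample values and the `significant_quakes` helper are illustrative assumptions, not part of OpenAI's demo:

```python
import json

# Hand-written sample in the USGS GeoJSON summary-feed shape; the real
# feed at earthquake.usgs.gov serves the same structure. No network
# access is needed for this sketch.
SAMPLE_FEED = """
{
  "type": "FeatureCollection",
  "features": [
    {"properties": {"mag": 4.6, "place": "offshore example region"},
     "geometry": {"type": "Point", "coordinates": [-122.4, 37.8, 10.0]}},
    {"properties": {"mag": 2.1, "place": "another example region"},
     "geometry": {"type": "Point", "coordinates": [139.7, 35.7, 30.0]}}
  ]
}
"""

def significant_quakes(feed_json: str, min_mag: float = 3.0):
    """Return (magnitude, place, lon, lat) for events above min_mag."""
    feed = json.loads(feed_json)
    out = []
    for feature in feed["features"]:
        mag = feature["properties"]["mag"]
        if mag >= min_mag:
            lon, lat = feature["geometry"]["coordinates"][:2]
            out.append((mag, feature["properties"]["place"], lon, lat))
    return out

print(significant_quakes(SAMPLE_FEED))  # only the magnitude-4.6 event
```

A real tracker would fetch the live feed on a timer and hand the filtered events to a map renderer; the filtering step is the whole of the data logic.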
As for user feedback, Dan Shipper, founder and CEO of Every, shared an experience: he once hit a bug after a deployment, spent several days trying to fix it himself, and finally had to ask his company’s top engineer to rewrite part of the system. After GPT-5.5 was released, he ran an experiment: he reverted the code to its bugged state and checked whether the model could arrive at the same solution as the engineer. GPT-5.4 failed; GPT-5.5 succeeded. His comment: “This is the first programming model I’ve used that truly has a clear concept.”
An NVIDIA engineer put it bluntly: “Losing access to GPT-5.5 feels like losing a limb.”
Co-founder and CEO of Cursor, Michael Truell, added: GPT-5.5 is smarter and more resilient than GPT-5.4, capable of sticking with long, complex tasks longer without stopping prematurely—which is exactly what engineering work needs.
Knowledge work: AI’s first real ability to “use” a computer
In the OSWorld-Verified test (assessing whether models can independently operate real computer environments), GPT-5.5 achieved a success rate of 78.7%, higher than GPT-5.4’s 75.0% and Claude Opus 4.7’s 78.0%.
This is not just screenshot analysis but actual screen control: seeing interfaces, clicking, inputting, switching between tools until the task is completed. GPT-5.5 makes it possible for AI to truly work alongside you on the same computer.
In the Tau2-bench telecom customer service workflow test, GPT-5.5 achieved an accuracy of 98.0% without prompt tuning, compared to 92.8% for GPT-5.4.
This indicates the model’s understanding of task intent is deep enough to handle complex multi-step dialogues without carefully crafted prompts.
In tool search capability, GPT-5.5 scored 84.4% on BrowseComp, with GPT-5.5 Pro reaching 90.1%, demonstrating strong sustained retrieval and information integration abilities in research tasks requiring cross-source reasoning.
Scientific research: assisting in discovering new mathematical proofs
This release’s performance in scientific research may be the most surprising aspect.
In the GeneBench benchmark (multi-stage data analysis in genetics and quantitative biology), GPT-5.5 scored 25.0%, compared to 19.0% for GPT-5.4. These tasks typically take days for scientific experts; the model must reason with potentially erroneous data, handle hidden confounders, and correctly apply modern statistical methods with minimal supervision.
From the charts, it’s clear that as output tokens increase, GPT-5.5’s score improvement always outpaces GPT-5.4’s, with a noticeable gap emerging around 15,000 tokens—indicating that for long, deep-reasoning tasks, GPT-5.5’s advantage grows with task complexity.
On BixBench (real-world bioinformatics and data analysis benchmark), GPT-5.5 scored 80.5%, leading GPT-5.4’s 74.0%, ranking among the top models.
A particularly noteworthy case: an internal version of GPT-5.5 equipped with custom tool frameworks helped discover a new proof related to Ramsey numbers, verified within the formal proof assistant Lean. Ramsey numbers are central objects in combinatorics, with rare and highly challenging results. This was not just AI generating code or explanations but actually contributing a mathematical proof.
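For context on the object involved, the Ramsey number is a standard combinatorial quantity; the definition below is textbook material, not taken from OpenAI's announcement:

```latex
% R(s,t): the smallest n such that every red/blue coloring of the edges
% of the complete graph K_n contains a red K_s or a blue K_t.
R(s,t) = \min\bigl\{\, n \in \mathbb{N} :
  \text{every 2-coloring of } E(K_n)
  \text{ yields a red } K_s \text{ or a blue } K_t \,\bigr\}
% The classic small case is R(3,3) = 6; already R(5,5) is not known
% exactly, which is why new Ramsey-related results are rare.
```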
In practical applications, Jackson Laboratory immunologist Derya Unutmaz used GPT-5.5 Pro to analyze a dataset of 62 samples and nearly 28,000 genes, generating detailed research reports and key findings—work that would normally take months for a team.
Bartosz Naskręcki, an assistant professor at Adam Mickiewicz University in Poznań, used a single prompt to have GPT-5.5, running in Codex, build an algebraic-geometry application in just 11 minutes: it visualizes the intersection of two quadrics and converts the resulting curve into a Weierstrass model. The equations displayed in real time can be used directly in further mathematical research; the whole process, from prompt to executable research tool, ran independently.
Image: Screenshot of the algebraic geometry application built by Professor Bartosz Naskręcki—visualization of quadric intersections and real-time Weierstrass equation calculation interface
Brandon White, co-founder of Axiom Bio, commented more directly: “If OpenAI keeps this momentum, the foundation for drug discovery will change by the end of the year.”
Reasoning efficiency: AI’s first self-optimized infrastructure
Here lies a subtle but potentially the most significant technical advance of this release.
GPT-5.5 is a larger, more powerful model, but its per-token latency in real service remains on par with GPT-5.4. To maintain the same latency with increased capability, OpenAI redesigned the entire reasoning system—and Codex and GPT-5.5 directly participated in this optimization.
From the Artificial Analysis intelligence index chart, it’s clear: the horizontal axis shows total output tokens (log scale), the vertical axis shows overall intelligence score. GPT-5.5’s curve not only surpasses GPT-5.4, Claude Opus 4.7, and Gemini 3.1 Pro Preview in score but, more importantly, reaches comparable scores at lower token consumption—indicating higher efficiency, better capability at lower cost.
Image: Artificial Analysis intelligence index line chart
Specifically, the challenge was load balancing: requests used to be split into fixed-size chunks to balance GPU workload, but no static chunk size is optimal for every traffic pattern. Codex analyzed weeks of production traffic data and devised a custom heuristic that increased token generation speed by over 20%.
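The scheduling problem can be made concrete with a toy model. This is entirely illustrative (OpenAI has not published its actual heuristic): a static round-robin policy ignores load, so an unlucky request mix overloads one worker, while a simple load-aware policy sends each request to the currently idlest worker:

```python
import heapq

# Toy model: each request is a unit of work (e.g. tokens to process)
# assigned to one of num_workers GPUs. We measure the makespan, i.e.
# the load on the busiest worker, which bounds latency.

def round_robin(requests, num_workers):
    """Static assignment: deal requests out in order, ignoring load."""
    loads = [0] * num_workers
    for i, r in enumerate(requests):
        loads[i % num_workers] += r
    return max(loads)

def least_loaded(requests, num_workers):
    """Load-aware assignment: each request goes to the idlest worker."""
    heap = [0] * num_workers
    heapq.heapify(heap)
    for r in requests:
        heapq.heappush(heap, heapq.heappop(heap) + r)
    return max(heap)

# Alternating long/short requests defeat round-robin: one worker gets
# all the long ones. The load-aware policy keeps both workers busy.
reqs = [900, 100, 800, 200, 700, 300, 600, 400]
print(round_robin(reqs, 2), least_loaded(reqs, 2))  # 3000 2000
```

The real system balances many more dimensions (KV-cache locality, batch shapes, latency targets), but the core trade-off, static simplicity versus load-aware adaptivity, is the same one the article describes.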
GPT-5.5 was co-designed, co-trained, and co-deployed with NVIDIA’s GB200 and GB300 NVL72 systems. In other words, this generation of models helped optimize its own inference architecture—this is not just metaphorical; it’s literally “AI improved its own system.”
Cybersecurity: capability enhancement and tighter controls
GPT-5.5 shows clear improvements in cybersecurity capabilities. In CyberGym tests, GPT-5.5 scored 81.8%, GPT-5.4 scored 79.0%, and Claude Opus 4.7 scored 73.1%. In internal Capture The Flag (CTF) challenge tasks, GPT-5.5 scored 88.1%, compared to 83.7% for GPT-5.4.
Image: CyberGym bar chart and CTF challenge scatter plot
OpenAI rates GPT-5.5’s cybersecurity and biological/chemical capabilities as “high” under its Preparedness Framework, not yet “critical,” but clearly improved over previous versions. The company also admits that its newly deployed, stricter risk classifiers “may initially cause some inconvenience for certain users,” and says it will keep adjusting them.
To balance defenders’ needs against access restrictions, OpenAI launched a “Cybersecurity Trusted Access” program: qualified security researchers and critical-infrastructure defenders can apply for lower-friction access to advanced cybersecurity capabilities.
The underlying logic is: capability diffusion is an irreversible trend. A more realistic approach than restricting diffusion is enabling defenders to access the strongest tools before attackers do.