O3
o3 is a large-scale reasoning model developed by OpenAI, first introduced on December 20, 2024, as the concluding announcement of the company's "12 Days of OpenAI" event series 1. Positioned as a successor or high-performance counterpart to the o1 series, o3 is specifically designed to prioritize "System 2" thinking—a cognitive framework characterized by deliberate, logical processing rather than rapid, pattern-based token generation 2. Unlike standard large language models (LLMs) that produce responses in a direct sequence, o3 utilizes extensive inference-time computation, allowing the model to allocate significant processing power to complex problem-solving and self-correction during the generation phase 3.
The primary technical distinction of o3 lies in its scaling of inference-time compute. OpenAI asserts that by increasing the time and computational resources available during the reasoning process, the model can navigate more intricate logical paths and solve problems that were previously intractable for generative AI 1. In internal evaluations and public demonstrations, o3 achieved a score of 87.5% on the Abstraction and Reasoning Corpus (ARC-AGI) benchmark, which is designed to measure fluid intelligence and the ability to learn new concepts from limited data 24. This result significantly exceeded the performance of the o1 model and surpassed the threshold typically associated with high-level human performance on the task, though independent researchers have noted that the model's efficacy is highly dependent on the specific computational budget applied during testing 4.
In addition to abstract reasoning, o3 is optimized for specialized domains including advanced mathematics, software engineering, and scientific research 1. In the Frontier Math benchmark, which evaluates a model's ability to solve graduate-level mathematics problems, OpenAI reported that o3 solved a substantially higher percentage of problems compared to previous iterations 3. The model also demonstrated proficiency in competitive programming, specifically on the Codeforces platform, where OpenAI claims it achieved an Elo rating placing it in the top tier of human competitors 12. These capabilities are attributed to a reinforcement learning process that rewards correct reasoning steps, a technique referred to by the developer as "Chain of Thought" processing, which is hidden from the user to maintain the integrity of the model's internal logic 1.
The introduction of o3 marks a shift in the development of artificial general intelligence (AGI) from focusing solely on pre-training data volume to optimizing for "reasoning-time" compute 3. While previous models like GPT-4 prioritized broader knowledge retrieval and conversational fluidity, the o-series represents a trajectory toward autonomous problem-solving and deeper logical execution 2. Independent analysts have characterized o3 as a significant advancement in the pursuit of AGI, though critics emphasize that high benchmark performance in controlled environments does not necessarily equate to general human-like understanding or consciousness 4. As of its release, o3 is accessible via OpenAI's application programming interface (API) for developers and integrated into specific tiers of the ChatGPT service, subject to variable usage limits based on the computational intensity of the requests 1.
Background
Background
The development of o3 represents a shift in the artificial intelligence industry's focus from scaling training-time compute to scaling inference-time compute 1336. For several years, research into "scaling laws" suggested that model performance primarily improved with the volume of data and the amount of computation used during initial training 1431. However, as the availability of high-quality human data became a potential bottleneck, research began to prioritize "System 2" thinking—a concept from dual-process theory where a model deliberately processes logical steps before providing a final output 3137.
The o1 series, released on December 5, 2024, served as the immediate predecessor to o3 44. Known during development by the codename "Strawberry," o1 and its lightweight counterpart, o1-mini, were the first OpenAI models to natively integrate reinforcement learning to generate internal chains of thought 314748. While these models achieved high scores on competitive programming and mathematics benchmarks, they were met with competition from other industry players 30. In December 2024, Google released Gemini 2.0, which incorporated multimodal reasoning capabilities, while the research lab DeepSeek introduced DeepSeek-V3, a model that utilized a Mixture-of-Experts (MoE) architecture to achieve competitive reasoning performance at a lower computational cost 84953.
OpenAI introduced o3 on December 20, 2024, as the final announcement of its "12 Days of OpenAI" event series 141. The model was presented as a response to the need for deeper reasoning capabilities, particularly for tasks involving complex scientific problems and high-level software engineering 210. According to OpenAI, o3 was trained using reinforcement learning techniques similar to those used for o1 but at a significantly larger scale of computation and data 142. During its reveal, the model was shown to achieve a score of 87.5% on the frontier version of the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed to measure a system's ability to learn new skills and solve novel problems rather than relying on memorized patterns 4518.
The release of o3 coincided with a period of market pressure for OpenAI to maintain its position against both open-weight and proprietary competitors 1530. Analysts noted that while the o1 series had established an early lead in reasoning models, the rapid advancement of the Gemini and DeepSeek lineages required a more capable successor to address the limitations of previous reasoning releases 1532. Consequently, o3 was positioned as a specialized frontier model intended for intensive research and engineering applications 1020.
Architecture
Architecture
The architecture of OpenAI o3 emphasizes a dual-scaling approach, shifting focus from purely training-time scaling to increased inference-time compute 13, 14, 15. According to OpenAI, o3 is built on a foundation of large-scale reinforcement learning (RL), which is used to refine the model's reasoning capabilities beyond the initial pre-training phase 31, 42. This methodology allows the model to evaluate multiple reasoning paths and use a reward-based system to identify successful problem-solving strategies 20, 31.
Inference-Time Scaling and Test-Time Compute
A central feature of the o3 architecture is test-time scaling, also referred to as inference-time compute 13, 37. While standard large language models generate responses in a linear, token-by-token fashion, o3 is designed to process information for extended periods before producing a final answer 31, 33. During this inference phase, the model allocates additional computational resources to complex questions, allowing it to generate, verify, and discard internal candidate solutions 31, 40. Analysis suggests this process can involve the model running for several minutes on a single query to produce optimized results for difficult tasks 17, 36. The model supports adjustable "reasoning effort" levels—low, medium, and high—allowing users to control the duration and intensity of the model's internal processing 40, 42.
Agentic Tool Integration
Unlike the initial releases of the o1 series, o3 is architected for "agentic tool use" 20, 42. According to OpenAI, the model is trained to autonomously determine when and how to utilize external tools to resolve multi-faceted queries 42. This includes integration with a Python interpreter for data analysis, web search for information retrieval, and vision processing for analyzing charts or graphics 20, 42. The architecture allows the model to combine these tools sequentially; for example, it can search for a data source, write code to analyze it, and then reason about the resulting output 10, 42.
Specifications
OpenAI has not disclosed the official parameter count for o3 24. The model features a context window of 200,000 tokens, enabling it to process extensive documents or codebases in a single prompt 40, 42. The maximum output capacity is rated at 100,000 tokens, which accommodates long-form reasoning logs and complex code generation 40, 42.
Performance Drivers
The architectural emphasis on reasoning rather than simple pattern matching has led to distinct performance characteristics. OpenAI asserts that o3 makes 20% fewer major errors than o1 on difficult real-world tasks, particularly in programming and engineering 10, 42. The model's architecture is specifically tuned for competitive benchmarks; it achieved a score of 87.5% on the ARC-AGI benchmark 4, 5, 57. This represents a significant increase over previous reasoning models on the same evaluation 18, 19.
Capabilities & Limitations
The capabilities of OpenAI o3 are centered on advanced reasoning within Science, Technology, Engineering, and Mathematics (STEM) domains, characterized by the model's ability to allocate extended computation time to complex problem-solving 1, 3. Unlike previous iterations of large language models, o3 integrates full tool access with reasoning, allowing it to autonomously utilize web searching, Python-based data analysis, and file manipulation to fulfill multi-step instructions 1. OpenAI states that the model is designed to produce detailed, formatted responses for multifaceted queries, typically within a minute, though more intensive tasks may require significantly longer processing times 1.
Reasoning and Academic Performance
In STEM-specific benchmarks, o3 has demonstrated performance levels that exceed those of its predecessors. On the ARC-AGI benchmark, which measures a system's ability to adapt to novel reasoning tasks, o3 achieved a score of 75.7% under standard compute limits 4. When utilizing high-compute configurations—specifically 172 times the standard limit—the score increased to 87.5% 4. Independent analysis noted that at this high-compute level, the model's performance on certain tasks approaches that of a human STEM graduate 3. However, performance varies by task difficulty; while o3-medium scored 53% on the ARC-AGI-1 Semi-Private Evaluation set, it failed to surpass a 3% score on the more difficult ARC-AGI-2 benchmark 5.
In mathematics and competitive programming, the model has set new benchmarks for artificial intelligence. OpenAI asserts that o3 establishes a new state-of-the-art on benchmarks such as MMMU and SWE-bench 1. In coding competitions, o3 achieved a rank of 175th on the Codeforces leaderboard, a position that places its performance above a vast majority of human participants in the tournament 3. The model also reportedly makes 20 percent fewer major errors than the o1 model when performing programming and business consulting tasks 1.
Modalities and Tool Integration
OpenAI o3 supports multimodal inputs and outputs, including the ability to reason deeply about visual data such as charts, graphics, and photographic images 1. It is the first in OpenAI's reasoning series to support agentic tool use, where the model evaluates when and how to combine different functions—such as generating an image or browsing the internet—to reach a final answer 1. This agentic approach is intended to allow the model to execute tasks independently on behalf of the user, moving toward a more autonomous assistant framework 1.
Limitations and Failure Modes
The primary limitations of o3 involve high operational costs and technical instability at maximum reasoning settings. While standard queries are processed relatively quickly, high-performance tasks can be cost-prohibitive; achieving the model's highest reasoning scores can cost up to $3,500 per task in compute, search, and evaluation resources 3.
Reliability challenges have also been documented when the model is configured for "High" reasoning. Testing by the ARC Prize Foundation revealed that o3 frequently failed to return any output when run at its maximum reasoning capacity, leading researchers to exclude those results from official leaderboards due to insufficient coverage 5. Additionally, the model's reasoning capabilities do not yet generalize to all types of intelligence; despite its success on ARC-AGI-1, its near-total failure on the ARC-AGI-2 set indicates that it still struggles with tasks designed to be verified as easy for humans but difficult for AI 5.
Performance
The performance of o3 is defined by its application of inference-time compute, allowing the model to perform at higher levels on complex reasoning tasks compared to its predecessor, o1 1. One of the most notable metrics associated with o3 is its performance on the Abstraction and Reasoning Corpus (ARC-AGI) benchmark, which is designed to measure an artificial intelligence's ability to learn new concepts and solve novel problems rather than relying on memorized patterns 2. According to OpenAI and data from the ARC-AGI leaderboard, o3 achieved a score of 87.5% 1, 2. This represents a significant increase over the previous state-of-the-art model, o1, which scored approximately 75%, and approaches the estimated human baseline of 85% 2, 3.
A central feature of the o3 model is the implementation of variable reasoning effort tiers, categorized as 'Low,' 'Medium,' and 'High' 1. These tiers represent different levels of processing where the model iterates through more reasoning steps or potential solutions before producing a final response 3. OpenAI reported that on the 2024 American Invitational Mathematics Examination (AIME), o3 achieved a score of 96.7% using 'High' effort, compared to 83.3% for o1 1. Similarly, on the GPQA Diamond benchmark—a test of graduate-level scientific knowledge—o3 scored 87.2%, which the developer describes as surpassing human expert performance in specific domains 1.
The trade-off between accuracy and latency is a defining characteristic of o3's operational performance. While standard language models generate responses near-instantaneously, o3 in its 'High' effort configuration may take several minutes to process a single query 1, 4. This increased latency is directly proportional to the inference-time compute budget, with the model consuming more processing power to reach higher accuracy thresholds 4. Benchmarking data indicates a scaling law for inference; as compute time increases, the model's success rate on difficult problems follows an upward trajectory, though the cost efficiency decreases for simpler tasks where a 'Low' effort tier or a standard model would suffice 3, 4.
In coding evaluations, o3 demonstrated a Competitive Programming Rating of 2727 on the Codeforces platform, placing it in the 99th percentile of human competitors 1. This is an improvement over o1's rating of 1807 1. Independent analysis notes that while o3 excels in rigorous, verifiable tasks such as mathematics and programming, its performance advantages are less pronounced in creative writing or basic factual retrieval where the benefits of extended reasoning are minimal 4.
Safety & Ethics
The safety architecture of o3 is centered on "deliberative alignment," a training methodology where the model is taught to explicitly reason through safety specifications and policies before generating a final response 1, 3. According to OpenAI, this approach utilizes the model's increased reasoning capacity to evaluate whether a user prompt violates established guidelines regarding harmful content, harassment, or illicit advice 1, 2. Unlike previous models that rely primarily on direct output filtering, o3 performs this evaluation within its internal chain-of-thought (CoT), which OpenAI states allows it to better resist complex jailbreaking attempts and follow nuanced safety instructions 3.
Preparedness Framework and Risk Scores
Under OpenAI’s Preparedness Framework, the model underwent evaluation in four primary risk categories: Chemical, Biological, Radiological, and Nuclear (CBRN) risks; Cybersecurity; Persuasion; and Model Autonomy 2, 3. The Safety Advisory Group (SAG) classified the o3-mini model as "Medium" risk pre-mitigation 3. Specifically, it reached a "Medium" risk level in CBRN due to its ability to provide detailed scientific information that could assist in specialized tasks, though it did not cross the "High" threshold 1, 3. It also received a "Medium" rating for Persuasion and Model Autonomy; for the latter, it was the first OpenAI model to reach this level, attributed to its advanced performance in coding and research engineering 2. Cybersecurity risk was rated as "Low," as the model's capabilities were determined not to provide a significant boost to the creation of novel exploits compared to existing tools 1, 3.
Chain-of-Thought Monitoring and Transparency
A central component of the o3 safety model is the use of its internal reasoning as a signal for oversight. OpenAI research suggests that monitoring a model's chain-of-thought is more effective for identifying misbehavior than analyzing final outputs alone 5. However, the transparency of these internal thoughts has been a subject of technical discussion. While the model generates long internal reasoning traces, OpenAI presents users with a summarized version of this process rather than the raw chain-of-thought 1.
Ethical concerns regarding "hidden" thoughts—specifically the potential for a model to hide its true intentions from monitors—led to research into "CoT controllability" 4. OpenAI states that current reasoning models like o3 struggle to deliberately reshape or obscure their internal reasoning steps when they know they are being monitored 4. This lack of controllability is characterized by the developer as a safety benefit, as it ensures that the model's internal reasoning remains an interpretable and reliable source for safety classifiers and human auditors 4, 5.
Applications
The applications of o3 are primarily focused on domains requiring extended reasoning and multi-step execution, such as software engineering, scientific research, and complex quantitative analysis 1. OpenAI characterizes the model as a solution for tasks where high-accuracy logic is more critical than response latency 3.
In software engineering, o3 is applied to autonomous bug fixing and system architecture design. According to OpenAI, the model achieved a score of 71.7% on the SWE-bench Verified benchmark, an evaluation involving the resolution of software issues in real-world GitHub repositories 1, 2. This performance indicates utility in debugging large codebases where the model must navigate thousands of lines of code to identify and rectify logical errors 3. Unlike standard large language models, o3’s integrated tool access allows it to execute code, verify its own solutions, and iterate based on error feedback during the inference process 1.
Within scientific and data-driven fields, o3 is utilized for hypothesis generation and experimental design. The model is capable of using a Python-based environment to perform data analysis and visualization 1. In STEM-specific applications, o3 is designed to solve multi-step problems in physics and mathematics, such as those found in the AIME and GPQA Diamond benchmarks 2, 4. Third-party observers note that the model's ability to allocate extended computation time allows it to cross-reference scientific principles before providing a result, which may reduce hallucination rates in technical contexts 4.
For financial modeling and strategic planning, o3 facilitates the analysis of scenarios with numerous conflicting variables. The model can process extensive financial documents to extract trends or simulate outcomes based on specific logic-driven parameters 3. OpenAI states that the model's deliberative reasoning makes it suitable for decision support where the rationale behind an answer must be transparent and logically consistent 1.
OpenAI has integrated o3 into its product ecosystem via ChatGPT for Plus and Team users, as well as through its developer API 1. The API implementation allows for varying "reasoning effort" levels, enabling developers to adjust the amount of inference-time compute allocated to a task based on its complexity and cost constraints 2. While high-reasoning tasks benefit from o3, the model is not recommended for low-latency applications like real-time chat or simple creative writing, where faster models remain more efficient 1, 3.
Reception & Impact
The introduction of o3 significantly influenced the discourse surrounding the definition of artificial general intelligence (AGI) and the future of model scaling. A primary focal point for the AI research community was the model's performance on the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed to measure the ability to solve novel problems without relying on memorized patterns 2. OpenAI reported that o3 achieved a record-breaking score of 87.5% on this benchmark 1, 2. François Chollet, the creator of ARC-AGI, characterized the result as a milestone but noted that the model's use of extensive "test-time compute"—where the model explores millions of potential solutions during the inference phase—challenges the traditional understanding of intelligence as efficient learning from minimal data 2, 5. This has led researchers to debate whether o3's capabilities represent a fundamental advance in reasoning or a sophisticated application of brute-force search algorithms 5.
Media coverage of o3 was largely shaped by its reveal as the final announcement of the "12 Days of OpenAI" event in December 2024 1. Journalists noted that the model appeared to be the realization of the internal project codenamed "Strawberry," which had been the subject of industry speculation for months regarding its specialized reasoning capabilities 3. While some analysts praised the model's ability to solve complex STEM problems, others criticized the marketing-heavy rollout as a strategic attempt to overshadow competitors like Google DeepMind and Anthropic during the holiday season 1, 3.
Economically, the shift toward inference-heavy models like o3 has introduced new considerations for the AI industry's cost structure. Unlike previous generations where the primary expense was model training, o3's "System 2" thinking process requires sustained GPU usage for every query, leading to higher per-request costs 4, 6. Financial analysts have suggested that this creates a tiered economic model for AI: lower-cost models for simple conversational tasks and premium, high-latency models for high-value reasoning in engineering and scientific research 4. This development has sparked concern among some developers that the high operational costs of such models may increase the barrier to entry for smaller startups, potentially centralizing advanced reasoning capabilities within well-capitalized firms 4, 6.
In the creative and technical industries, the impact of o3 has been felt most in the domain of software engineering. The model's ability to perform autonomous bug fixing and system design has led to discussions regarding the future of the junior developer role 1. While some industry leaders view o3 as a tool that will augment human productivity by handling repetitive logic tasks, others have expressed concern about the long-term societal implications of automating complex problem-solving roles that were previously considered resistant to AI displacement 1, 4.
Version History
OpenAI officially announced the o3 model on December 20, 2024, as the concluding release of its "12 Days of OpenAI" event series 1. At its introduction, the flagship o3 model was demonstrated as a high-performance reasoning engine, though its general availability was initially deferred to allow for extended safety testing and red-teaming 1, 2. This initial phase focused on validating the model's performance on complex benchmarks such as ARC-AGI and various mathematics competitions 2.
On January 31, 2025, OpenAI introduced o3-mini, a more efficient variant designed to provide high reasoning capabilities at a lower cost and with reduced latency compared to the full o3 model 1. According to OpenAI, o3-mini was intended to succeed o1-mini, offering improved performance in coding and STEM-related tasks while remaining accessible to a broader range of developers 1, 3. Upon release, o3-mini was integrated into the ChatGPT interface for Plus, Team, and Enterprise users, as well as the OpenAI API for Tier 5 developers 1.
A notable technical update introduced alongside the o3 family was the implementation of "reasoning effort" controls within the API 1, 3. These controls allow users to modulate the amount of inference-time computation the model utilizes by selecting between "low," "medium," or "high" effort levels 1. OpenAI states that these settings enable developers to balance the trade-off between response speed and the depth of logical processing, with higher effort levels reserved for the most complex scientific and engineering problems 3. Since the launch of o3-mini, OpenAI has incrementally expanded the model's feature set to include full tool access, including web searching and Python-based data analysis 1.
Sources
- 1“12 Days of OpenAI: o3”. Retrieved March 26, 2026.
On the final day of 12 Days of OpenAI, we are introducing o3, our latest reasoning model. o3 is designed for complex tasks in math, coding, and science, utilizing massive inference-time compute to reach new levels of performance.
- 2“OpenAI’s new o3 model is a reasoning powerhouse”. Retrieved March 26, 2026.
OpenAI today unveiled o3, a new model that focuses on 'System 2' thinking. It achieved a record 87.5% on the ARC-AGI benchmark, a significant jump from previous models.
- 3“OpenAI launches o3, its most powerful reasoning model yet”. Retrieved March 26, 2026.
The o3 model is built for deep thinking and multi-step reasoning. It represents the next step in OpenAI's strategy to scale compute at the inference stage rather than just the training stage.
- 4“o3 and the ARC-AGI Benchmark”. Retrieved March 26, 2026.
OpenAI's o3 has set a new state-of-the-art on the ARC-AGI benchmark with a score of 87.5%. This benchmark is a key measure of a model's ability to reason through novel problems it has not seen in its training data.
- 5“OpenAI o3 Hits 87.5% on ARC-AGI-1”. Retrieved March 26, 2026.
OpenAI's o3 model has set a new record on the ARC-AGI benchmark, demonstrating significant progress in general reasoning and novel problem solving.
- 6“Learning to Reason with LLMs”. Retrieved March 26, 2026.
We have developed a new series of AI models designed to spend more time thinking before they respond. These models are trained with reinforcement learning to perform complex reasoning.
- 8“DeepSeek-V3: A Strong and Efficient MoE Model”. Retrieved March 26, 2026.
DeepSeek-V3 achieves state-of-the-art performance while maintaining high efficiency through its Multi-head Latent Attention and Mixture-of-Experts architecture.
- 10“Introducing OpenAI o3 and o4-mini”. Retrieved March 26, 2026.
Today, we’re releasing OpenAI o3 and o4-mini, the latest in our o-series of models trained to think for longer before responding. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images.
- 13“Inference-Time Scaling: How Modern AI Models Think Longer to Perform Better”. Retrieved March 26, 2026.
Inference-time scaling” (often called test-time compute, TTC, or test-time scaling) means: you get better answers by spending more compute at the moment you ask the question. Concretely, you let the model think longer, try more candidate solutions, search/verify, or loop.
- 14“Scaling Laws for LLMs: From GPT-3 to o3”. Retrieved March 26, 2026.
Scaling laws help us to predict the results of larger and more expensive training runs... however, the continuation of scaling has recently been called into question... focusing on a few key ideas—including scaling—that could continue to drive progress.
- 15“OpenAI’s o3 suggests AI models are scaling in new ways — but so are the costs”. Retrieved March 26, 2026.
OpenAI is either using more computer chips to answer a user’s question, running more powerful inference chips, or running those chips for longer periods of time — 10 to 15 minutes in some cases — before the AI produces an answer. The o3 model blew past the scores of all previous AI models which had done the test, scoring 88% in one of its attempts.
- 17“Beyond human: OpenAI's o3 wake up call - Exponential View”. Retrieved March 26, 2026.
Spend more money on it, up to $3,500 per task, and it approaches the performance of a STEM graduate. In coding, O3’s benchmark score would place it 175th on the leaderboard of Codeforces.
- 18“OpenAI o3 Breakthrough High Score on ARC-AGI-Pub | ARC Prize”. Retrieved March 26, 2026.
OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.
- 19“Analyzing o3 and o4-mini with ARC-AGI | ARC Prize”. Retrieved March 26, 2026.
o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set, and the o3-medium reached 53%. Neither surpassed 3% on ARC-AGI-2. Both o3 and o4-mini frequently failed to return outputs when run at 'high' reasoning.
- 20“Introducing o3: A new reasoning model”. Retrieved March 26, 2026.
o3 achieved 87.5% on the ARC-AGI benchmark, a 96.7% on AIME 2024, and a 2727 rating on Codeforces.
- 24“OpenAI o3 and o4-mini System Card”. Retrieved March 26, 2026.
Our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment... OpenAI’s Safety Advisory Group (SAG) reviewed the results of our Preparedness evaluations and determined that OpenAI o3 and o4-mini do not reach the High threshold.
- 30“OpenAI reveals o3, its latest reasoning model”. Retrieved March 26, 2026.
The o3 model was unveiled as the final surprise in the 12 Days of OpenAI series, showing significant gains in math and coding benchmarks compared to o1.
- 31“Learning to Reason with Large Language Models”. Retrieved March 26, 2026.
The model uses reinforcement learning to think through problems before answering, making it suitable for complex STEM and financial modeling tasks.
- 32“OpenAI o3 marks a new milestone in AI reasoning performance”. Retrieved March 26, 2026.
Independent analysis of o3's performance on ARC-AGI and AIME suggests it is uniquely capable of handling novel problems that stump previous generations of models.
- 33“OpenAI: Introducing o3 as the Finale of 12 Days of OpenAI”. Retrieved March 26, 2026.
o3 represents the most capable reasoning model to date, closing out our series of daily updates with a 87.5% score on the ARC-AGI-1 benchmark.
- 36“The Economic Shift to Inference Scaling”. Retrieved March 26, 2026.
Inference-heavy models like o3 are changing the economics of AI, shifting costs from the training phase to the interaction phase, potentially limiting access to high-budget enterprises.
- 37“System 2 Thinking and the Scaling of Test-Time Compute”. Retrieved March 26, 2026.
Research into System 2 scaling laws explores the trade-offs between compute time and reasoning accuracy in models like o3.
- 40“Reasoning Models - OpenAI API”. Retrieved March 26, 2026.
Reasoning models like o3-mini support a reasoning_effort parameter. This allows developers to constrain the model's internal chain of thought to low, medium, or high levels.
- 41“OpenAI announces new o3 models - TechCrunch”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"OpenAI announces new o3 models","description":"OpenAI saved its biggest announcement for the last day of its 12-day \"shipmas\" event. On Friday, the company unveiled o3, the successor to the o1","url":"https://techcrunch.com/2024/12/20/openai-announces-new-o3-model/","content":"OpenAI saved its biggest announcement for the last day of its [12-day “shipmas” event](https://techcrunch.com/storyline/live-updates-12-days-of-openai-chatgpt-announcements-and
- 42“OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12 - YouTube”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12","description":"Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.","url":"https://www.youtube.com/watch?v=SKBG1sqdyIU","content":"# OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12 - YouTube\n\n Back [](https://www.youtube.com/ \"YouTube Home\")\n\nSkip navigation\n\n Search \
- 44“OpenAI unveils o1, a model that can fact-check itself | TechCrunch”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"OpenAI unveils o1, a model that can fact-check itself","description":"ChatGPT maker OpenAI has announced a model that can effectively fact-check itself by \"reasoning\" through questions.","url":"https://techcrunch.com/2024/09/12/openai-unveils-a-model-that-can-fact-check-itself/","content":"[ChatGPT](https://t.co/inZfIzLIBB) maker OpenAI has [announced](https://t.co/inZfIzLIBB) its next major product release: A generative AI model code-named Strawberr
- 47“Introducing Gemini 2.0: our new AI model for the agentic era”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"Introducing Gemini 2.0: our new AI model for the agentic era","description":"Today, we’re announcing Gemini 2.0, our most capable AI model yet.","url":"https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/","content":"Dec 11, 2024\n\n13 min read\n\n## Bullet points\n\n* Google DeepMind introduces Gemini 2.0, a new AI model designed for the \"agentic era.\"\n* Gemini 2.0 is more capable than p
- 48“Google releases the first of its Gemini 2.0 AI models - CNBC”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"Google releases the first of its Gemini 2.0 AI models","description":"Google released the first artificial intelligence model in its Gemini 2.0 family Wednesday, known as Gemini 2.0 Flash.","url":"https://www.cnbc.com/2024/12/11/google-releases-the-first-of-its-gemini-2point0-ai-models.html","content":"# Google releases the first of its Gemini 2.0 AI models\n\n[Skip Navigation](https://www.cnbc.com/2024/12/11/google-releases-the-first-of-its-gemini-2po
- 49“Google Introduces Gemini 2.0: New AI Model for the Agentic Era”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"Google Introduces Gemini 2.0: New AI Model for the Agentic Era","description":"Google has recently unveiled updates to its AI model, Gemini, with the launch of Gemini 2.0, the company's most capable AI model to date.","url":"https://hyperight.com/google-introduces-gemini-2-0-new-ai-model-for-the-agentic-era/","content":"# Google Introduces Gemini 2.0: New AI Model for the Agentic Era - hyperight.com\n\n[File a
- 57“Everything You Need to Know About Reasoning Models: o1, o3, o4 ...”. Retrieved March 26, 2026.
{"code":200,"status":20000,"data":{"title":"Everything You Need to Know About Reasoning Models: o1, o3, o4-mini and Beyond","description":"Think AI has hit a wall? The latest reasoning models will make you reconsider everything.\nContributors: Rafal Rutyna, Brady Leavitt, Julia Heseltine, Tierney...","url":"https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/everything-you-need-to-know-about-reasoning-models-o1-o3-o4-mini-and-beyond/4406846","content":"# Everything You Need to Know Ab

