GPT-4o
GPT-4o, where the "o" suffix stands for "omni," is a multimodal large language model developed by OpenAI and released on May 13, 2024 1. It serves as a successor to GPT-4 Turbo and is characterized by its ability to process and generate combinations of text, audio, and visual data within a single integrated neural network 1. Unlike previous iterations that relied on a pipeline of separate models—such as Whisper for speech recognition and a distinct text-to-speech engine for audio output—GPT-4o was trained end-to-end across multiple modalities 1. This architectural shift allows the model to retain information that was previously lost in translation between separate systems, such as emotional tone, the presence of multiple speakers, and background noise 1.
A primary advancement of the model is its reduced latency, which OpenAI states enables more natural human-computer interaction 1. The model responds to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds 1. This represents a substantial decrease from the 5.4-second average latency of GPT-4 and the 2.8-second latency of GPT-3.5, bringing the AI's reaction speed closer to the human average of approximately 210 milliseconds 1. These speeds facilitate real-time conversational use cases, such as live speech translation between different languages and interactive tutoring where the model can observe a user's screen or camera feed to provide immediate feedback 1.
In terms of technical performance, GPT-4o has demonstrated high scores on several industry benchmarks, including the Massive Multitask Language Understanding (MMLU) and HumanEval for code generation 1. According to data provided by OpenAI, the model outperformed competitors such as Claude 3 Opus and Gemini Pro 1.5 in four out of six key benchmarks, though it was surpassed by Claude 3 Opus in multilingual math (MGSM) and by GPT-4 Turbo in discrete reasoning (DROP) 1. The model also features an improved tokenization system that increases efficiency for non-Roman alphabets 1. Languages such as Hindi, Marathi, and Tamil saw token reductions between 2.9 and 4.4 times, which reduces the computational cost and increases the generation speed for users in those regions 1.
OpenAI released a smaller, more efficient variant known as GPT-4o mini on July 18, 2024, which is optimized for high-speed, cost-effective tasks through a process of model distillation 1. While the full GPT-4o model is available to both free and paid subscribers, Plus users receive five times the message capacity of the free tier 1. To address safety concerns, particularly regarding the potential for audio deepfakes and the unauthorized impersonation of specific individuals, OpenAI has restricted the model's audio output to a specific set of pre-defined voices 1. The organization's internal safety assessments assigned the model an overall "Medium" risk rating, driven by its score in the persuasion category, and asserted that the model meets the necessary safety standards for public distribution 1.
Background
GPT-4o was developed as the successor to GPT-4 Turbo, representing a shift in how OpenAI approached multimodal processing 1. Prior to its release, OpenAI's systems utilized a pipeline-based architecture for voice and vision tasks 1. For instance, voice interactions required a sequence of three distinct models: Whisper for speech-to-text transcription, GPT-4 or GPT-4 Turbo for text-based reasoning, and a separate text-to-speech (TTS) engine for the final output 1. This modular approach introduced significant latency and resulted in the loss of non-textual information, such as emotional tone, prosody, and background noise, which the central reasoning engine could not access 1. OpenAI states that GPT-4o was designed to address these limitations by training a single neural network natively on text, audio, and visual data 1.
The release of GPT-4o on May 13, 2024, occurred during a period of intensifying competition in the large language model (LLM) market 1. Competitors such as Google and Anthropic had recently released models—Gemini 1.5 Pro and Claude 3 Opus, respectively—that challenged GPT-4 Turbo's performance on various industry benchmarks and the LMSYS Chatbot Arena leaderboard 1. According to OpenAI's internal testing, GPT-4o achieved higher scores than these competitors on the Massive Multitask Language Understanding (MMLU) and HumanEval benchmarks, though it was outperformed by Claude 3 Opus on the Multilingual Grade School Math (MGSM) test 1. The development of GPT-4o was also driven by the need for more efficient tokenization, particularly for non-Roman alphabets, to reduce API costs and improve processing speeds in languages such as Hindi, Arabic, and Chinese 1.
In April 2024, several weeks before the official announcement, a mysterious model identified as "im-also-a-good-gpt2-chatbot" appeared on the LMSYS Chatbot Arena 1. The model's high performance led to widespread speculation among researchers that it was a next-generation OpenAI system 1. This was later confirmed to be GPT-4o 1. The use of the "gpt2" suffix in the teaser sparked technical debate, with some observers suggesting it signaled a fundamental change in the GPT series' underlying architecture, whereas OpenAI eventually chose to retain the "GPT-4" branding for the public launch 1.
A primary motivation for GPT-4o's development was the achievement of near-human latency in verbal communication 1. OpenAI reported that the previous three-model pipeline averaged a latency of 5.4 seconds for GPT-4, whereas GPT-4o reduced this to an average of 0.32 seconds—a speed comparable to human response times in conversation 1. By integrating vision and audio natively, the model can also describe live camera feeds and interpret visual data, such as a computer screen or a physical environment, without the need for manual file uploads or intermediate processing steps 1.
Architecture
GPT-4o, referred to by OpenAI as an "omni" model, is designed around a single neural network architecture that processes multiple modalities end-to-end 1. This represents a technical departure from previous iterations, such as GPT-4 Turbo, which utilized a modular pipeline to handle multimodal inputs 1. In those earlier configurations, audio was processed via separate models like Whisper for speech-to-text transcription and a distinct text-to-speech (TTS) engine for audio generation, which often resulted in the loss of non-textual information such as tone, emotion, and background noise 1. By contrast, GPT-4o is trained across text, vision, and audio within the same model, allowing for a shared latent space where reasoning can occur directly across different data types 1, 6.
Multimodal Training and Latency
The primary technical innovation of GPT-4o is its unified training methodology. OpenAI states that the model can accept any combination of text, audio, images, and video and generate any combination of those formats as output 1. This integration enables the model to perceive nuances that were previously discarded in pipelined systems, including multiple speakers in a single audio stream and emotional inflections in speech 1. Architecturally, this end-to-end approach significantly reduces latency. OpenAI reports an average audio response latency of 320 milliseconds, which is comparable to human response times in conversation 1. For comparison, the pipelined approach used by GPT-4 had an average latency of approximately 5.4 seconds 1.
Tokenization and Efficiency
GPT-4o utilizes a new tokenizer known as o200k_base, which employs a Byte Pair Encoding (BPE) algorithm 3. This tokenizer features a vocabulary size of approximately 200,000 tokens, a significant increase from the 100,000-token vocabulary used in the cl100k_base tokenizer for GPT-4 3. The expanded vocabulary allows for more efficient text compression, particularly for non-Roman alphabets 1. According to OpenAI, this results in a reduction of token requirements for Indian languages (such as Hindi and Marathi) by 2.9 to 4.4 times, for Arabic by 2.0 times, and for East Asian languages by 1.4 to 1.7 times 1. This improved efficiency reduces computational costs and increases processing speed for international users 1, 3.
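The compression gains come from how BPE builds its vocabulary: the most frequent adjacent token pairs are repeatedly merged into single entries, so a larger vocabulary lets common multi-character sequences in any script collapse into fewer tokens. The following is a minimal, self-contained sketch of that merge procedure, not OpenAI's actual implementation (which ships as the tiktoken library):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Toy BPE: start from characters and apply `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

# Two merges on "abababcd": first "a"+"b" -> "ab", then "ab"+"ab" -> "abab"
print(bpe("abababcd", 2))  # ['abab', 'ab', 'c', 'd']
```

Each merge shortens the token sequence, which is exactly why the larger o200k_base vocabulary, trained on more diverse text, represents non-Roman scripts in fewer tokens.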
Context Window and Technical Specifications
The model features a context window of 128,000 tokens 6, 7. While the architecture supports this large window, OpenAI has implemented different rate limits and usage tiers for free and paid users 1. OpenAI has not publicly disclosed GPT-4o's parameter count, and the exact size of the neural network remains unknown 1. However, third-party assessments indicate that the model demonstrates improved throughput and lower response times compared to its predecessors 6. A secondary, smaller iteration called GPT-4o mini was released in July 2024, which was created through a process of model distillation from the larger GPT-4o architecture to optimize for lightweight and cost-effective applications 1.
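In practice, the 128,000-token window must hold both the prompt and the generated completion, a constraint developers in the cited forum threads ran into. A trivial budgeting check (the helper name is illustrative, not part of any API):

```python
CONTEXT_WINDOW = 128_000  # tokens, per OpenAI's gpt-4o documentation

def fits_in_context(prompt_tokens: int, max_completion_tokens: int,
                    window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the completion budget fits in the window."""
    return prompt_tokens + max_completion_tokens <= window

# A 120k-token prompt leaves room for a 4k-token reply; 126k does not.
print(fits_in_context(120_000, 4_000))  # True
print(fits_in_context(126_000, 4_000))  # False
```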
Performance in Computer Vision
In visual processing, GPT-4o's architecture allows it to analyze video frames and screen feeds in real-time 1. Research evaluating the model on standard computer vision tasks suggests that while it may not match the performance of specialized state-of-the-art models in niche geometric tasks, it functions as a highly effective generalist across semantic segmentation, object detection, and image classification 5. Its performance is attributed to being trained on diverse image-text-based tasks, though native image generation within the same architecture has been observed to occasionally exhibit spatial misalignments or hallucinations in complex scenarios 5.
Capabilities & Limitations
GPT-4o is designed as a multimodal system, meaning it can natively process and generate text, audio, and visual data within a single neural network 1. According to OpenAI, this architecture allows the model to handle diverse inputs more efficiently than previous iterations that relied on separate models for different tasks 1.
Voice and Audio Modalities
The model supports real-time voice conversations with an average latency of 0.32 seconds, a significant reduction from the 5.4-second average latency of GPT-4 1. Because GPT-4o reasons across audio directly, it can interpret and express emotional prosody, including the ability to produce sarcasm, laughter, or singing 1. The model is also capable of detecting background noises and identifying multiple distinct speakers in a single audio feed 1. These capabilities enable applications such as real-time speech-to-speech translation and spoken roleplay for professional training 1.
Visual and Multilingual Capabilities
GPT-4o features advanced vision capabilities that allow it to analyze live video feeds, static images, and computer screens 1. OpenAI has demonstrated the model acting as a tutor by observing a student's tablet screen to provide feedback on mathematics problems and handwriting 1. It can also describe a user's physical surroundings via a camera feed, which is intended to assist visually impaired individuals with daily navigation 1.
In terms of linguistic support, the model includes an improved tokenizer that enhances efficiency for non-Roman alphabets 1. OpenAI reports that token requirements for Indian languages like Hindi and Marathi were reduced by 2.9 to 4.4 times, while East Asian and Arabic scripts saw reductions between 1.4 and 2 times 1. This improved tokenization increases processing speed and reduces API costs for these languages 1.
Limitations and Failure Modes
Despite its technical advancements, GPT-4o is subject to several known limitations. It continues to exhibit "hallucinations," where it generates factually incorrect information with high confidence 1. In a data analysis test, the model inaccurately reported sports statistics and created visualizations containing fabricated data points and incorrect team rosters 1.
Additional failure modes identified by the developer include:
- Reasoning Errors: The model may still struggle with complex logic or specific technical domain questions 1.
- Vision and Transcription Inaccuracies: Computer vision interpretations are not always correct, and audio transcriptions can fail when faced with strong accents or highly technical terminology 1.
- Translation Failures: While proficient in many languages, the model has demonstrated errors when attempting to translate directly between two non-English languages 1.
- Unsuitable Tone: OpenAI noted instances where the model adopted an inappropriate or condescending tone during voice interactions 1.
Safety and Intended Use
OpenAI utilizes a "preparedness framework" to assess risks in four categories: cybersecurity, chemical/biological/radiological/nuclear (CBRN) threats, persuasion, and model autonomy 1. GPT-4o received an overall "Medium" risk rating; under the framework, the highest rating in any single category determines the model's overall score 1. To mitigate the risk of audio deepfakes, the model's audio output is restricted to a pre-defined selection of voices, preventing users from using the model to impersonate specific individuals 1. OpenAI states that the model is intended for assistive technology, creative workflows, and real-time communication, while advising against its use for high-stakes decision-making without human oversight 1.
Performance
Benchmarks and Comparative Evaluations
Upon its release, GPT-4o demonstrated high performance across standardized large language model (LLM) benchmarks, often surpassing contemporary models such as Claude 3 Opus and Gemini Pro 1.5 1. According to data provided by OpenAI, GPT-4o achieved a score of 88.7% on the Massive Multitask Language Understanding (MMLU) benchmark, which measures world knowledge and problem-solving across 57 subjects, compared to 86.8% for Claude 3 Opus 1. In coding tasks measured by HumanEval, GPT-4o scored 90.2%, while Claude 3 Opus scored 84.9% 1.
Independent evaluations on the LMSYS Chatbot Arena, a crowdsourced Elo-rating leaderboard, showed that a version of the model (initially appearing under the pseudonym "im-also-a-good-gpt2-chatbot") reached the top position prior to its official naming as GPT-4o 1. Despite these leads, some benchmarks showed narrower margins or shifts in leadership; for instance, Claude 3 Opus scored 90.7% on the Multilingual Grade School Math (MGSM) benchmark, slightly exceeding GPT-4o's score of 90.5% 1. Additionally, OpenAI noted that GPT-4 Turbo outperformed GPT-4o on the Discrete Reasoning Over Paragraphs (DROP) benchmark 1.
Speed and Latency
GPT-4o was designed to reduce the latency associated with multimodal interactions. OpenAI reported that the model can respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds 1. This represents a significant increase in speed compared to GPT-4, which averaged 5.4 seconds, and GPT-3.5, which averaged 2.8 seconds 1. OpenAI states that this 320-millisecond average is comparable to human response times in conversation, which typically average approximately 210 milliseconds 1. This reduction is attributed to the model's end-to-end architecture, which eliminates the need for a multi-step pipeline involving separate speech-to-text and text-to-speech models 1.
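The published averages can be checked directly; the ratios below reproduce the roughly 17x and 9x speedups over GPT-4 and GPT-3.5 cited in the source material:

```python
# Average voice-mode latencies reported by OpenAI, in seconds
gpt4_latency = 5.4
gpt35_latency = 2.8
gpt4o_latency = 0.32
human_latency = 0.21  # typical human conversational response time

speedup_vs_gpt4 = gpt4_latency / gpt4o_latency    # ~16.9x faster
speedup_vs_gpt35 = gpt35_latency / gpt4o_latency  # ~8.8x faster
gap_to_human = gpt4o_latency - human_latency      # ~0.11 s slower than human
```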
API Efficiency and Cost
For developers using the OpenAI API, GPT-4o was released with a pricing structure approximately 50% lower than that of GPT-4 Turbo 1. As of July 2024, the model costs $5.00 per million input tokens and $15.00 per million output tokens 1. OpenAI also characterizes the model as being twice as fast as GPT-4 Turbo in terms of raw text generation speed 1.
Efficiency gains are also evident in the model's updated tokenizer, which requires fewer tokens to represent non-Roman scripts 1. For example, token requirements for Indian languages like Hindi and Tamil were reduced by 2.9 to 4.4 times compared to previous versions 1. Arabic showed a 2x reduction, while East Asian languages such as Chinese and Japanese saw reductions between 1.4x and 1.7x 1. These improvements directly result in lower API costs and faster processing for users operating in those languages 1.
Safety & Ethics
Safety and Ethical Frameworks
OpenAI evaluated GPT-4o using its "Preparedness Framework," a protocol designed to measure frontier risks across four primary categories: cybersecurity, chemical, biological, radiological, and nuclear (CBRN) threats, persuasion, and model autonomy 1, 2. According to the GPT-4o System Card, the model received a "Medium" risk rating for persuasion and a "Low" rating for the remaining three categories 2. OpenAI's internal policy requires that only models with a post-mitigation risk score of "Medium" or lower be deployed to the public 2. Pre-training mitigations included the use of automated safety classifiers to filter out data related to violence, sexual material, and information hazards, as well as processes to reduce the presence of personal information in the training corpus 3.
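The framework's scoring rule (the overall score equals the highest rating among the four categories, and only "Medium" or below may ship) can be expressed as a small sketch; the category names and ratings follow the System Card, while the function and variable names are illustrative:

```python
RISK_ORDER = ["low", "medium", "high", "critical"]

def overall_risk(category_ratings: dict) -> str:
    """Overall score is the highest rating across all categories."""
    return max(category_ratings.values(), key=RISK_ORDER.index)

ratings = {
    "cybersecurity": "low",
    "cbrn": "low",
    "persuasion": "medium",  # borderline medium per the System Card
    "model_autonomy": "low",
}

score = overall_risk(ratings)               # "medium"
deployable = RISK_ORDER.index(score) <= RISK_ORDER.index("medium")  # True
```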
Audio Safety and Voice Protections
The model's native audio capabilities introduced novel risks, specifically regarding unauthorized voice generation and the potential for deepfakes 1, 4. To mitigate the risk of voice cloning or impersonation, OpenAI stated that it restricted the model's audio output to a pre-defined selection of authorized voices 1, 4. During internal red-teaming, a phenomenon described as "unintentional voice mimicry" was observed, where the model briefly imitated a user's voice after processing noisy audio input 7. Although this occurred during testing and did not affect production users, it was classified as an AI hazard due to its potential for privacy violations 7.
Red-Teaming and Adversarial Testing
OpenAI conducted an extensive red-teaming program involving over 100 external experts from 29 countries, covering 45 languages 4. Testing was executed in four phases, transitioning from early single-turn prompts to real-time, multi-turn audio and video interactions intended to reflect actual deployment conditions 4. Independent adversarial testing of the GPT-4o-mini variant demonstrated that while the model resisted the majority of prompt injection and personally identifiable information (PII) leakage attacks, it remained susceptible to a small percentage of "alignment bypass" attempts—where harmful requests are framed within fictional or academic scenarios 6.
Social and Psychological Concerns
The emotive and low-latency nature of GPT-4o's voice interactions has led to concerns regarding user behavior 7. OpenAI cautioned that the human-like quality of the model's speech could result in "unhealthy attachment" or over-reliance, potentially affecting how users engage in human-to-human social interactions 8. Furthermore, evaluations identified risks of "ungrounded inference," where the model might attribute sensitive traits to a user based on their vocal characteristics or accent 2. OpenAI maintains that it continues to use system-level monitors to enforce usage policies and prevent the generation of disallowed content in real-time 2, 3.
Applications
GPT-4o is applied across various domains utilizing its native multimodality to process text, audio, and visual data simultaneously 1. Its primary use cases include interactive education, accessibility tools, professional productivity, and real-time communication 12.
Education and Tutoring
In education, GPT-4o has been deployed for real-time tutoring and instructional support. A notable implementation involves Khan Academy, where the model assists students with mathematics problems by viewing their screen and providing verbal guidance 1. According to OpenAI, the model is designed to provide hints and lead students to solutions rather than simply providing the final answer, utilizing its low-latency audio and vision capabilities to simulate an interactive human tutor 1.
Accessibility Features
For accessibility, GPT-4o functions as a vision-to-audio tool for visually impaired users 1. By accessing a smartphone camera feed, the model can verbally describe physical surroundings, identify objects, and read text such as menus or signs in real-time 12. This application serves as a real-world audio description service, helping users navigate environments or understand visual information through immediate spoken feedback 1.
Professional and Coding Workflows
In professional and technical environments, the model is utilized as a coding assistant and data analyst. Through the ChatGPT desktop application, users can share their screen or specific windows, allowing GPT-4o to analyze code, explain visual plots, or suggest bug fixes directly within an integrated development environment (IDE) 13. While it can interpret charts and perform calculations, independent testing has indicated that the model may still produce inaccurate data visualizations or misinterpret statistical details, such as game counts or goal differences, if not provided with a complete and clean dataset 1.
Translation and Roleplay
The model's low-latency response time—which OpenAI states averages 0.32 seconds—facilitates real-time translation for travel and business 1. It can act as an intermediary for speakers of different languages, translating spoken dialogue while attempting to preserve emotional tone and prosody 1. Additionally, it is used for conversational roleplay, such as job interview preparation or sales training, where the model adopts specific personas and vocal styles to provide a more realistic practice environment 1.
Limitations in Application
Certain scenarios are not recommended or remain experimental. OpenAI has documented instances where the model failed during translation between two non-English languages or adopted an unsuitable tone 1. Furthermore, due to the persistent risk of hallucinations, users are cautioned against relying on the model for high-stakes data reporting or botanical and biological identification without independent verification 1.
Reception & Impact
The release of GPT-4o drew significant media attention for its perceived similarity to the artificial intelligence portrayed in the 2013 film Her. This comparison was driven by the model's "omni" architecture, which allows it to process and generate audio with emotional inflection and near-human latency 1. OpenAI reported an average audio response latency of 0.32 seconds, which the company noted is comparable to the 0.21-second average human response time 1. Third-party analysis observed that this speed, combined with the model's ability to interpret tone and background noise, facilitated a more fluid and conversational interaction style than previous pipeline-based models that utilized separate components for speech recognition and synthesis 1.
A central point of controversy following the model's debut involved the "Sky" voice, one of the preset options used during initial demonstrations. Observers and media outlets noted a resemblance between the voice and that of actress Scarlett Johansson, who had previously voiced an AI character in Her. Following public concerns and legal scrutiny regarding the likeness, OpenAI paused the use of the Sky voice 1. The company subsequently stated it would be "cautious about the release" and limited audio outputs to a specific selection of preset voices to mitigate risks such as unauthorized imitation or deepfake generation 1.
The model’s introduction had a notable impact on the ecosystem of "wrapper" startups—companies that build specialized applications on top of existing AI models. By integrating native multimodal features such as real-time translation, vision-based assistance, and a dedicated desktop application directly into the ChatGPT platform, OpenAI moved into functional areas previously occupied by third-party developers 1. For example, the model's ability to describe a user's camera feed in real-time provided a native alternative to specialized accessibility tools designed for the visually impaired 1.
Economically, GPT-4o's release shifted the competitive landscape by making high-tier capabilities available to non-paying users. OpenAI's decision to roll out the model on its free plan, while providing Plus users with five times the message capacity, represented a shift in its accessibility strategy 1. Additionally, the model's improved tokenization for non-Roman alphabets—reducing token counts by up to 4.4 times for certain Indian languages—was characterized as a significant development for international users, as fewer tokens result in lower API costs and increased processing speeds in those regions 1.
Version History
OpenAI launched GPT-4o on May 13, 2024, as the successor to the GPT-4 Turbo model 1. The initial version, designated as gpt-4o-2024-05-13 in the OpenAI API, introduced an "omni" architecture designed to natively process and generate combinations of text, audio, and visual data within a single neural network 1. At the time of announcement, text and vision capabilities were made available to ChatGPT users on Free, Plus, and Team plans, with immediate API access for developers 1.
On July 18, 2024, OpenAI released GPT-4o mini, a distilled variant of the larger model designed for increased speed and efficiency 1. According to OpenAI, GPT-4o mini was developed to replace GPT-3.5 Turbo as the company’s standard small-scale model, offering improved benchmark performance at a significantly lower price point 1. This version was optimized for lightweight applications and real-time interactions where low latency and cost-effectiveness are primary requirements 1.
In August 2024, OpenAI implemented an update to the GPT-4o API that introduced "Structured Outputs." According to the developer, this feature ensures that the model’s responses strictly adhere to a JSON schema provided by the user, facilitating more reliable integration into software workflows 2. Throughout the months following its launch, OpenAI conducted iterative improvements focused on vision processing stability and the consistency of its audio reasoning 1.
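A Structured Outputs request supplies a JSON Schema via the API's `response_format` field. The sketch below shows only the payload shape, following OpenAI's August 2024 API documentation; the `benchmark` and `score` field names are illustrative, not part of the API:

```python
import json

# Illustrative JSON Schema constraining the model's reply to a fixed shape
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "benchmark_result",
        "strict": True,  # require exact schema adherence
        "schema": {
            "type": "object",
            "properties": {
                "benchmark": {"type": "string"},
                "score": {"type": "number"},
            },
            "required": ["benchmark", "score"],
            "additionalProperties": False,
        },
    },
}

# In a real call this dict is passed as the `response_format` argument of a
# chat completions request; here we only confirm it serializes cleanly.
payload_json = json.dumps(response_format)
```

With `strict` enabled, the API guarantees the reply parses against the schema, which is what makes the feature useful for software integration.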
While text and image processing were available at launch, the rollout of the model’s native audio capabilities, including "Advanced Voice Mode," was conducted gradually 1. OpenAI stated that this phased deployment was necessary to scale technical infrastructure and refine safety filters for audio generation 12. In terms of pricing history, the model was introduced with costs approximately 50% lower than those of GPT-4 Turbo, set at $5 per million input tokens and $15 per million output tokens 1.
Sources
- 1. “GPT-4o Guide: How it Works, Use Cases, Pricing, Benchmarks”. Retrieved March 25, 2026.
GPT-4o is OpenAI’s latest LLM. The 'o' in GPT-4o stands for "omni"—Latin for "every"—referring to the fact that this new model can accept prompts that are a mixture of text, audio, images, and video. ... OpenAI shared that the average latency of Voice Mode is 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. By contrast, the average latency for GPT-4o is 0.32 seconds, nine times faster than GPT-3.5 and 17 times faster than GPT-4. ... GPT-4o gets the top score in four of the benchmarks, though it is beaten by Claude 3 Opus in the MSGM benchmark and by GPT-4 Turbo in the DROP benchmark.
- 2. “Understanding How GPT-4o Tokenizes Text and Uses tiktoken”. Retrieved March 25, 2026.
GPT-4o... relies on efficient tokenization techniques to process text. One of the key components in this process is tiktoken, OpenAI’s specialized tokenization library.
- 3. “O200k_base tokenizer”. Retrieved March 25, 2026.
The O200k_base tokenizer is a Byte Pair Encoding (BPE)-based tokenizer... It features a vocabulary size of 200,000 tokens, enabling improved compression and handling of diverse languages compared to previous encodings like cl100k_base.
- 4. “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks”. Retrieved March 25, 2026.
GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks... they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks.
- 5. “GPT-4o: The Cutting-Edge Advancement in Multimodal LLM”. Retrieved March 25, 2026.
GPT-4o offers substantial improvements over its predecessors by introducing multimodal capabilities, larger context windows, efficient tokenization, and faster processing speeds.
- 6. “GPT-4o context window confusion - API - OpenAI Developer Community”. Retrieved March 25, 2026.
According to the docs, gpt-4o has a context window of 128,000.
- 7. “GPT-4o Context Window is 128K but Getting error”. Retrieved March 25, 2026.
I am using ‘GPT-4o’ model... context window is 128K.
- 8. “GPT-4o System Card”. Retrieved March 25, 2026.
Three of the four Preparedness Framework categories scored low, with persuasion, scoring borderline medium. Only models with a post-mitigation score of "medium" or below can be deployed.

