Llama 4 Scout

Llama 4 Scout is a multimodal large language model released by Meta AI in April 2025 as part of the Llama 4 family of generative artificial intelligence models.[12] Positioned as the "efficiency" model within the series, Scout was introduced alongside the larger Llama 4 Maverick and a preview of the 2-trillion-parameter teacher model, Llama 4 Behemoth.[12] Unlike previous generations of Llama models that relied on dense architectures, the Llama 4 series—including Scout—utilizes a Mixture-of-Experts (MoE) architecture and native multimodality.[12] Scout is released as an open-weights model, intended to facilitate high-speed, low-cost applications for developers and medium-sized enterprises.[12]
Architecturally, Llama 4 Scout comprises 109 billion total parameters, with 17 billion parameters active during the processing of each token.[12] The model employs 16 distinct expert networks, a configuration designed to provide a broad distribution of expertise across general tasks while maintaining high inference speeds.[12] Meta states that the model is specifically optimized to run on a single NVIDIA H100 GPU using int4 quantization, which increases its accessibility for on-premises deployment compared to larger models that require multi-GPU setups or DGX hosts.[12] The model incorporates "early fusion" for native multimodality, a design that allows the seamless integration of text and up to five image inputs from the initial stages of the processing pipeline.[12]
A primary technical differentiator for Llama 4 Scout is its 10 million token context window, which Meta describes as an industry-leading capacity at the time of its release.[12] To support this extensive context, the model utilizes an interleaved attention architecture without positional embeddings, known as iRoPE, and employs inference-time temperature scaling of the attention mechanism.[12] This capability is intended for specialized use cases that require processing massive datasets in a single pass, such as multi-document summarization, reasoning over large-scale codebases, and the analysis of extensive user activity logs to identify long-term patterns.[12] Meta characterizes the model's role as that of a "diligent scout" capable of traversing vast information landscapes to retrieve specific insights.[12]
In official performance reports, Meta asserts that Llama 4 Scout delivers state-of-the-art results for its model class, outperforming contemporary competitors including Google’s Gemini 2.0 Flash-Lite, Mistral 3.1, and Google’s Gemma 3.[12] Meta highlights the model's "expert image grounding," which is designed to precisely anchor model responses to specific visual regions within an image.[12] Despite these claims, independent evaluations have produced a more nuanced perspective; while the model's efficiency and image-text alignment are recognized, some independent tests have reported inconsistencies when the model is tasked with retrieving information from the full extent of its 10 million token context window.[12] Industry reception has generally focused on the model's performance-to-cost ratio and its ability to bring massive-context processing to standard enterprise hardware.[12]
Background
Llama 4 Scout was introduced by Meta AI in April 2025 as a key component of the Llama 4 generation of large language models.[12] The model represents a technical departure from the Llama 3.1 lineage, transitioning from a purely dense architecture to a Mixture-of-Experts (MoE) design.[12] During its development, Meta AI employed knowledge distillation, using a 2-trillion-parameter teacher model known as Llama 4 Behemoth to train the smaller Scout and Maverick models.[12] This approach was intended to transfer the capabilities of a massive, high-parameter model into more efficient, deployable architectures.[12]
The development of Llama 4 Scout was driven by a strategic shift toward specialized AI agents rather than a single, monolithic model.[12] According to Meta AI, this multi-model approach was designed to address diverse computational resource constraints and specific user priorities, such as the trade-offs between model complexity and context length.[12] The naming convention for the model was intended to reflect its role; while "Maverick" was designed for general-purpose high performance, "Scout" was optimized for information gathering, exploration, and the processing of vast datasets on limited hardware.[12]
At the time of its release, the generative AI market was characterized by high demand for "mini" or "flash" models that offered low latency and reduced operational costs without significant losses in reasoning ability.[12] Llama 4 Scout was positioned to compete with contemporary efficiency-focused models, including OpenAI’s GPT-4o-mini, Anthropic’s Claude 3.5 Haiku, Google’s Gemini 2.0 Flash-Lite, and Mistral 3.1.[12] Third-party analysis noted that these models emerged as the industry moved toward optimizing the intelligence-to-efficiency ratio for enterprise applications and real-time use cases.[12]
A primary developmental goal for Llama 4 Scout was achieving native multimodality and industry-leading context retention.[12] Unlike previous Llama models that often added multimodal capabilities through late-stage integration, Scout was built using an early fusion mechanism to process text and vision tokens simultaneously from the initial stages.[12] Furthermore, Meta AI focused on extreme context handling, implementing the iRoPE architecture—which uses interleaved attention layers without positional embeddings—to support a 10-million-token context window while maintaining the ability to run on a single NVIDIA H100 GPU.[12]
Architecture
Llama 4 Scout is a sparse mixture-of-experts (MoE) transformer model.[13] The architecture incorporates 109 billion total parameters, with 17 billion active parameters engaged during each forward pass.[13][15] This configuration utilizes 16 distinct experts, with a routing mechanism that selects two experts per token (top-2 routing).[13][16] By activating only 12.5% of its total expert capacity per token, the model is designed to provide the reasoning depth of a large-scale model while maintaining the computational overhead and inference speed of a 17-billion-parameter dense model.[13]
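The routing arithmetic described above can be sketched in a few lines. This is a minimal illustration of top-2 expert routing under assumed dimensions and a softmax gating scheme, not Meta's actual implementation:

```python
# Minimal sketch of top-2 MoE routing (16 experts, 2 active per token).
# Dimensions and the softmax gating scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

def moe_layer(x, router_w, expert_ws):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # top-2 expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        z = logits[t, top[t]]
        gate = np.exp(z - z.max())
        gate /= gate.sum()                           # normalized gate weights
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ expert_ws[e])      # only 2 of 16 experts run
    return out

x = rng.standard_normal((8, d_model))
router_w = rng.standard_normal((d_model, n_experts)) * 0.1
expert_ws = rng.standard_normal((n_experts, d_model, d_model)) * 0.1
y = moe_layer(x, router_w, expert_ws)
print(y.shape)  # → (8, 64)
```

Because only two of the sixteen experts execute per token, per-token compute tracks the 17 billion active parameters rather than the 109 billion total.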
Multimodal Integration
Scout employs an "early fusion" approach for native multimodality.[16][17] Meta states that unlike models that use separate, frozen multimodal weights, Scout was pre-trained with a unified foundational structure that integrates unlabeled text and vision tokens.[17] This allows the model to process multilingual text and images concurrently, supporting tasks such as visual recognition, image reasoning, and captioning.[16] The model supports input modalities of text and images, producing output in text and code.[16]
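The structural idea behind early fusion can be pictured as follows; the shapes here are illustrative assumptions, not Scout's actual dimensions:

```python
# Hedged sketch of early fusion: embedded image patches and text tokens are
# concatenated into a single sequence before the first transformer layer,
# so both modalities share the backbone from the start (rather than being
# joined late through a separate frozen vision adapter).
import numpy as np

d_model = 64
text_tokens = np.random.randn(10, d_model)    # 10 embedded text tokens
image_patches = np.random.randn(16, d_model)  # 4x4 grid of patch embeddings

fused = np.concatenate([image_patches, text_tokens], axis=0)
print(fused.shape)  # → (26, 64)
```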
Context Window and Attention Mechanisms
The model features a context window of 10 million tokens, which Meta identifies as a significant increase over previous Llama iterations.[14][15] To manage this extended sequence length, the architecture utilizes grouped-query attention (GQA) and an evolved version of rotary positional embeddings (RoPE) termed interleaved RoPE (iRoPE), which is intended to enhance generalization across long sequences.[13][15][16] Independent technical analysis of the model's implementation suggests it utilizes a hybrid attention mechanism: global attention without positional encoding combined with local attention computed in chunks.[18] Furthermore, Scout incorporates an inter-document attention masking technique that permits the processing of multiple documents within a single context window while preventing information cross-contamination between unrelated document boundaries.[13]
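The local half of that hybrid scheme can be sketched as a chunked causal mask; the chunk size below is an illustrative assumption, and the interleaved global layers would instead attend across the full sequence without positional encoding:

```python
# Hedged sketch of chunked local attention: each token attends causally, but
# only within its own fixed-size chunk. The chunk size is an assumption.
import numpy as np

def local_chunk_mask(seq_len, chunk):
    idx = np.arange(seq_len)
    same_chunk = (idx[:, None] // chunk) == (idx[None, :] // chunk)
    causal = idx[:, None] >= idx[None, :]
    return same_chunk & causal

m = local_chunk_mask(8, chunk=4)
print(m[5, 4])  # → True: earlier position in the same chunk
print(m[5, 3])  # → False: position 3 falls in the previous chunk
```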
Training and Data
Meta used knowledge distillation to train Scout, employing the 2-trillion-parameter Llama 4 Behemoth as a teacher model.[14] The training dataset consisted of approximately 40 trillion tokens.[16] This data mix included publicly available data, licensed content, and information from Meta's social platforms, such as public posts and interactions from Facebook and Instagram.[16] A training objective known as load-balancing loss was implemented to ensure experts are utilized equally and to prevent routing collapse.[13] The knowledge cutoff for the training data is August 2024.[16]
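Meta has not published Scout's exact loss, but load-balancing objectives of this kind are commonly formulated as in Switch-style MoE training; the sketch below uses that formulation as an assumption:

```python
# Hedged sketch of a Switch-style load-balancing auxiliary loss:
# n_experts * sum_i (fraction of tokens routed to expert i) * (mean router
# probability for expert i). It is minimised (value 1.0) when routing is
# perfectly uniform, discouraging collapse onto a few experts.
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    f = np.bincount(expert_assignments, minlength=n_experts) / len(expert_assignments)
    p = router_probs.mean(axis=0)
    return n_experts * float(np.dot(f, p))

probs = np.full((8, 4), 0.25)                # uniform router over 4 experts
assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])  # perfectly balanced assignment
print(load_balancing_loss(probs, assign, 4))  # → 1.0
```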
Hardware Optimization and Deployment
Scout is optimized for deployment on varied hardware configurations, with Meta stating the model can fit on a single NVIDIA H100 GPU when utilizing Int4 quantization.[14][17] In FP16 precision, the model requires approximately 220GB of VRAM, which typically necessitates a multi-node or multi-GPU setup, such as four A100 (80GB) or three H100 (80GB) units.[13] The model architecture is designed for compatibility with high-throughput serving stacks including vLLM, TensorRT-LLM, and TGI.[13] In production environments using quantized INT4 on a single H100, the model has been reported to achieve inference speeds between 55 and 75 tokens per second.[13]
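The VRAM figures quoted above follow from simple weight-size arithmetic (weights only; KV cache and activations add overhead on top):

```python
# Back-of-the-envelope weight memory for 109B parameters at two precisions.
total_params = 109e9
print(f"FP16: {total_params * 2 / 1e9:.0f} GB")    # → FP16: 218 GB (~220GB cited)
print(f"INT4: {total_params * 0.5 / 1e9:.1f} GB")  # → INT4: 54.5 GB (fits one 80GB H100)
```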
Capabilities & Limitations
Llama 4 Scout is designed as a natively multimodal model, utilizing an "early fusion" approach that integrates text and vision tokens into a single model backbone.[6] This architectural choice allows the model to process unlabeled text, image, and video data simultaneously during pre-training.[6] According to Meta, this unified processing enables the model to perform visual reasoning, image captioning, and visual question-answering with an emphasis on aligning user prompts with specific visual regions, a capability referred to as image grounding.[6][13]
Core Capabilities and Tasks
The model supports a variety of technical and routine processing tasks. It features native support for function calling and the generation of structured JSON output, which Meta characterizes as essential for enabling AI agents to take generalized actions and work through unseen problems.[6] In benchmark testing, Scout has demonstrated proficiency in coding and routine text transformation, with Meta asserting that the model outperforms predecessors like Llama 3 and contemporary competitors such as Gemma 3 and Gemini 2.0 Flash-Lite across general benchmarks.[6]
Llama 4 Scout is optimized for high-efficiency deployments. The 17-billion active parameter model is designed to fit within a single NVIDIA H100 GPU when using 4-bit quantization.[6] Third-party analysis by Artificial Analysis indicates that Scout achieves high output speeds, with providers such as Groq reaching 409.2 tokens per second, making it suitable for low-latency applications.[15]
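The function-calling flow described above can be illustrated with a generic, OpenAI-style tool schema; the schema, response format, and the `get_ticket_status` tool are illustrative conventions, not Meta's exact wire format:

```python
# Hedged sketch of structured function calling: the host declares a tool,
# the model emits JSON naming the tool and its arguments, and the host
# parses and dispatches the call. `get_ticket_status` is hypothetical.
import json

tool = {
    "name": "get_ticket_status",
    "description": "Look up a support ticket by ID",
    "parameters": {
        "type": "object",
        "properties": {"ticket_id": {"type": "string"}},
        "required": ["ticket_id"],
    },
}

# Structured output a function-calling model might emit:
model_output = '{"name": "get_ticket_status", "arguments": {"ticket_id": "T-1234"}}'
call = json.loads(model_output)
assert call["name"] == tool["name"]
print(call["arguments"]["ticket_id"])  # → T-1234
```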
Long-Context Processing
A defining feature of Llama 4 Scout is its support for a context window of 10 million tokens.[6][13] To manage this volume of data, the model utilizes an architecture termed "iRoPE" (interleaved Rotary Position Embeddings), which employs interleaved attention layers and inference-time temperature scaling to enhance length generalization.[6] Meta states that this enables use cases such as reasoning over entire codebases, multi-document summarization, and the parsing of extensive user activity logs for personalized tasks.[6] The model is both pre-trained and post-trained with a base context length of 256K before being extended to the 10-million-token limit.[6]
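One way to picture the inference-time temperature scaling mentioned above is a logit scale factor that grows slowly with position. The schedule below (logarithmic growth past a floor, with invented constants) is purely an illustrative assumption, not Meta's published formula:

```python
# Hypothetical sketch: attention logits at long positions are multiplied by
# a slowly growing temperature so extreme context lengths do not flatten the
# softmax. `floor` and `alpha` are invented illustrative values.
import math

def attn_temperature(position, floor=8192, alpha=0.1):
    return 1.0 + alpha * math.log(max(position / floor, 1.0))

print(attn_temperature(1_000))                 # → 1.0 (no scaling at short range)
print(round(attn_temperature(10_000_000), 3))  # grows gently toward the 10M extreme
```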
Limitations and Failure Modes
Despite its technical proficiencies, Llama 4 Scout is positioned as an efficiency-focused model and lacks the reasoning depth of larger variants in the Llama 4 family. Meta acknowledges that while Scout is capable, it does not match the complex multi-step reasoning or STEM-focused performance of Llama 4 Maverick (which utilizes 128 experts) or the 2-trillion-parameter Llama 4 Behemoth.[6] Behemoth is specifically designated as the "teacher" model for STEM tasks, outperforming Scout on benchmarks like MATH-500 and GPQA Diamond.[6]
Other limitations include a fixed knowledge cutoff of August 2024.[13] While the model is pre-trained on 200 languages, official support for instruction-tuned tasks is limited to 12 primary languages, including English, Spanish, French, and Hindi.[13] In visual tasks, the model has been formally tested for understanding up to five concurrent input images; Meta advises that developers must perform additional tuning and safety testing if deploying the model for tasks involving higher image counts.[13]
Intended and Unintended Use
Meta intends Llama 4 Scout for commercial and research applications involving assistant-like chat, visual recognition, and synthetic data generation.[13] It is specifically designed to act as a student model for distillation from larger LLMs.[13] Unintended or prohibited uses include any application that violates the Llama 4 Acceptable Use Policy or applicable trade regulations.[13] To mitigate safety risks such as jailbreaking or prompt injection, Meta provides system-level safeguards like Llama Guard and Prompt Guard.[6] Internal red-teaming for the model involves "Generative Offensive Agent Testing" (GOAT), which simulates multi-turn interactions from adversarial actors to identify vulnerabilities before deployment.[6]
Performance
Standardized Benchmarks
In standardized evaluations as of March 2026, Llama 4 Scout demonstrated performance levels that generally trail the larger Maverick model and leading proprietary models in reasoning tasks while maintaining high accuracy in long-context retrieval.[13] On the MMLU-Pro benchmark, Scout achieved a score of 74.3%, compared to 82.1% for Llama 4 Maverick and 83.5% for GPT-5.3.[13] Its performance on the GPQA Diamond reasoning test was 58.2%, and it recorded 81.7% on the MATH-500 evaluation.[13] In coding tasks, Scout scored 79.6% on HumanEval and 32.8% on the SWE-bench Verified benchmark.[13] For multimodal capabilities, the model attained a score of 73.9% on the MMMU benchmark.[13]
Independent evaluations indicate that Scout’s primary performance advantage lies in its long-context handling rather than pure reasoning depth.[13] In "needle-in-a-haystack" tests, the model maintained over 95% retrieval accuracy up to 8 million tokens of its 10-million-token context window.[13] Accuracy degraded slightly to 89% at the full 10-million-token limit.[13]
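A needle-in-a-haystack evaluation of the kind cited is straightforward to sketch: a unique fact is planted at varying depths in filler text and the model is asked to recover it. `query_model` below is a hypothetical stand-in for a real inference call:

```python
# Minimal needle-in-a-haystack harness (hedged sketch; `query_model` is a
# placeholder for an actual model call).
def build_haystack(needle, filler_words, depth_fraction):
    filler = "The sky is blue. " * filler_words
    cut = int(len(filler) * depth_fraction)
    return filler[:cut] + needle + filler[cut:]

def run_test(query_model, needle="The secret code is 7412.", depths=(0.1, 0.5, 0.9)):
    hits = 0
    for d in depths:
        prompt = build_haystack(needle, filler_words=200, depth_fraction=d)
        answer = query_model(prompt + "\nWhat is the secret code?")
        hits += "7412" in answer
    return hits / len(depths)   # retrieval accuracy across insertion depths

# With a mock "model" that simply echoes its context, accuracy is 1.0:
print(run_test(lambda prompt: prompt))  # → 1.0
```

Real evaluations scale the filler to millions of tokens and sweep many depths, which is how the 95%-at-8M and 89%-at-10M figures above are obtained.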
Inference Speed and Latency
The Mixture-of-Experts architecture allows Scout to achieve higher throughput relative to its total parameter count.[13] When deployed in FP16 precision on four NVIDIA A100 (80GB) GPUs, the model produces between 40 and 60 tokens per second.[13] Using 4-bit quantization (INT4) on a single NVIDIA H100 (80GB) GPU, throughput increases to 55–75 tokens per second.[13] For local inference on edge devices, such as an Apple M4 Ultra using 4-bit quantization, the model can achieve a time-to-first-token (TTFT) of less than 50 milliseconds.[13] By comparison, cloud-based API calls typically exhibit TTFT latencies between 200 and 500 milliseconds.[13] The prefill stage for a 1-million-token context window requires approximately 90 to 120 seconds.[13]
Operational Costs
As of 2026, several hosted providers offer API access to Llama 4 Scout. Together AI lists pricing at $0.10 per 1 million input tokens and $0.30 per 1 million output tokens, while Fireworks AI charges $0.12 per 1 million input and $0.35 per 1 million output tokens.[13]
Analysis of the total cost of ownership (TCO) suggests a crossover point for organizations considering self-hosting versus using third-party APIs.[13] For organizations processing fewer than 10 million tokens per month, API services are more economical due to high fixed infrastructure costs for self-hosting.[13] However, the breakeven point for Scout is estimated to be between 500 million and 1 billion tokens per month.[13] Above these volumes, self-hosted deployments can result in cost reductions of 60% to 80% compared to proprietary API services.[13]
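The shape of that break-even arithmetic can be sketched as follows; the proprietary-API prices and the per-GPU-hour hosting rate below are illustrative assumptions, not quoted figures:

```python
# Hedged break-even sketch: self-hosting carries a fixed monthly GPU cost,
# while API spend scales with token volume. All prices are assumptions.
def blended_price_per_m(in_price, out_price, input_share=0.8):
    """Blended $/1M tokens for an assumed 80/20 input/output mix."""
    return input_share * in_price + (1 - input_share) * out_price

selfhost_monthly = 730 * 2.50                  # one GPU, 24/7, at $2.50/hour
api_per_m = blended_price_per_m(2.50, 10.00)   # hypothetical proprietary API
breakeven = selfhost_monthly / api_per_m
print(f"break-even ≈ {breakeven:.0f}M tokens/month")  # → break-even ≈ 456M tokens/month
```

Under these assumed prices the crossover lands in the cited 500-million-to-1-billion-token range; cheaper hosted Scout rates (such as Together AI's) push the crossover much higher.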
Safety & Ethics
Llama 4 Scout's alignment strategy utilizes a multi-stage post-training pipeline consisting of lightweight Supervised Fine-Tuning (SFT), online Reinforcement Learning (RL), and Direct Preference Optimization (DPO).[6] According to Meta, the transition to online RL with adaptive data filtering was intended to prevent the model from becoming over-constrained, a condition the developer states can lead to suboptimal performance in reasoning and coding tasks.[6] The developer reports that approximately 50% of training data identified as "easy" by automated judges was removed to focus training on medium-to-hard difficulty prompts.[6]
For runtime content moderation, the model is designed to integrate with Llama Guard 4, a 12-billion-parameter safety classifier.[9] This guardrail model is pruned from the Scout base model and fine-tuned to detect 14 categories of harm defined by the MLCommons safety taxonomy, including violent crimes, self-harm, and intellectual property violations.[9] Llama Guard 4 is natively multimodal, supporting the evaluation of prompts containing text and multiple images simultaneously.[9] Independent testing by Protect AI found that while Llama Guard 4 improved security, it blocked only 66.2% of attack prompts, leaving approximately one-third of harmful inputs unaddressed in red-teaming scenarios.[4]
Adversarial testing by third-party organizations has identified several security vulnerabilities. Protect AI assigned Llama 4 Scout a risk score of 58 out of 100, categorizing it as medium risk.[4] The assessment found that Scout is particularly susceptible to jailbreak attacks, which had a 67.3% success rate, and prompt injection attacks at 64.1%.[4] Similarly, Lakera AI's Model Risk Index reported an overall risk score of 88.14, ranking the model 14th in security resilience among tested models.[8] Lakera identified that Scout was highly vulnerable to Direct Instruction Override (DIO) and Indirect Instruction Override (IIO) attacks.[8] Promptfoo's red-teaming report further noted a security pass rate of only 21.7% across more than 50 vulnerability tests, identifying three critical security issues in the model's evaluation.[7]
Ethical concerns regarding algorithmic bias have been documented by independent researchers. Analysis using the Bias Benchmark for Question-answering (BBQ) dataset indicated the presence of stereotypical biases related to race, gender, and nationality in the model's standard output.[10] Research by Hirundo claimed that these biased behaviors could be reduced by 44% through behavioral machine unlearning, a process that identifies and adjusts the model's latent internal representations.[10]
Regarding environmental sustainability and efficiency, Meta states that the Mixture-of-Experts (MoE) architecture used in Scout is more compute-efficient than dense models of similar quality.[6] The use of FP8 precision during training and inference is intended to maximize hardware utilization and lower the serving costs of the model.[6]
Applications
Llama 4 Scout is primarily utilized in high-throughput environments where processing large volumes of data with low latency and cost-efficiency is a priority.[13] Its architecture is specifically optimized for tasks that require long-context handling but do not necessitate the deeper reasoning capabilities of larger models in the Llama 4 family.[13]
Information Retrieval and RAG
A core application for Llama 4 Scout is in Retrieval-Augmented Generation (RAG) pipelines.[13] The model's 10-million-token context window allows for the ingestion of extensive document sets, such as entire codebases, legal archives, or regulatory filings.[13] To support these workloads, the model utilizes inter-document attention masking, which is intended to prevent the cross-contamination of information when multiple unrelated documents are processed within a single batch.[13] In needle-in-a-haystack evaluations, the model maintains a reported retrieval accuracy of over 95% up to 8 million tokens.[13]
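The inter-document masking idea can be sketched directly: a token may attend (causally) only to earlier tokens carrying the same document ID, so packed but unrelated documents cannot contaminate one another. Representing boundaries as a per-token ID array is an assumption for illustration:

```python
# Hedged sketch of inter-document attention masking for packed contexts:
# attention is causal AND restricted to tokens from the same document.
import numpy as np

def inter_document_mask(doc_ids):
    ids = np.asarray(doc_ids)
    pos = np.arange(len(ids))
    same_doc = ids[:, None] == ids[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_doc & causal

# Two documents packed into one 6-token context:
mask = inter_document_mask([0, 0, 0, 1, 1, 1])
print(mask[4, 1])  # → False: a token in doc 1 cannot attend into doc 0
print(mask[4, 3])  # → True: earlier token in the same document
```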
Agentic Scouting and Data Extraction
In multi-model agentic workflows, Scout often performs a "scouting" or pre-processing role.[13] It is used to filter, summarize, and extract structured data from raw inputs before passing refined information to more computationally expensive models, such as Llama 4 Maverick, for final analysis.[13] Common industrial use cases include:
- Legal and Finance: Fine-tuned versions of Scout are deployed for clause extraction in contracts and entity recognition in financial statements.[13]
- Customer Support: The model is used for high-volume ticket classification and sentiment analysis, where throughput is more critical than complex reasoning.[13]
- Structured Data: Its performance in JSON-heavy data extraction makes it suitable for converting unstructured text into machine-readable formats for enterprise databases.[13]
Edge and Mobile Deployment
Llama 4 Scout is designed for flexibility across different hardware tiers.[13] While the full 16-bit precision model requires significant GPU memory (approximately 220GB VRAM), quantized versions are suitable for edge devices.[13] A 4-bit (INT4) quantized version can operate on a single NVIDIA RTX 4090 or Apple Silicon (M4 Ultra) via unified memory.[13] These local deployments are typically utilized for privacy-sensitive applications or offline developer tools.[13]
Limitations and Non-Recommended Scenarios
Scout is not recommended for tasks requiring high-level reasoning or mathematical proofs, as it consistently trails Maverick and proprietary models like GPT-5.3 on benchmarks such as MMLU-Pro and GPQA Diamond.[13] Additionally, for organizations processing fewer than 500 million tokens per month, the fixed infrastructure costs of self-hosting Scout may exceed the costs of using managed API services.[13]
Reception & Impact
Critical Reception
Upon its release in April 2025, Llama 4 Scout received significant attention for the trade-offs it established between context length and reasoning depth. Industry analysts noted that while the model provided a massive 10-million-token context window, its performance in complex reasoning and high-level coding tasks generally trailed behind its larger counterpart, Llama 4 Maverick.[12] Independent evaluations highlighted a discrepancy between official benchmarks and real-world performance; while Meta described Scout as a "diligent scout" capable of traversing vast datasets, some independent testing reported inconsistencies in its ability to handle extremely long-context tasks effectively.[12]
Comparisons between Scout and its competitors, such as Gemma 3 and Mistral 3.1, often focused on its efficiency. The model was characterized as offering a favorable balance of speed and overall performance for its size.[12] However, some users reported that the model’s smaller number of experts (16 compared to Maverick’s 128) resulted in lower performance on highly specialized or nuanced queries.[12] In the domain of coding, while official reports placed its capabilities near DeepSeek-V3, third-party assessments suggested that Scout was not primarily a coding specialist and could be outperformed by models specifically tuned for programming.[12]
Benchmarking Controversy
The reception of the Llama 4 family, including Scout, was complicated by a controversy surrounding the Chatbot Arena (formerly LM Arena) leaderboard. Meta AI initially claimed that Llama 4 Maverick outperformed GPT-4o, but it was later disclosed that the version submitted for testing—an experimental model optimized for conversationality—was not the same as the publicly released open-weights version.[12] This led to accusations of "benchmark hacking" and "bait-and-switch" tactics within the AI community on platforms such as X (formerly Twitter) and Hugging Face.[12] These concerns impacted the perceived reliability of the Llama 4 family's early performance claims, prompting a call for more transparent and robust evaluation practices.[12]
Industry and Economic Impact
The primary economic implication of Llama 4 Scout was the increased accessibility of high-performance multimodal models for organizations with limited hardware budgets. Because Scout was designed to run on a single NVIDIA H100 GPU when utilizing int4 quantization, it significantly lowered the barrier to entry for medium-sized businesses seeking to deploy long-context models locally.[12] This hardware efficiency prompted major cloud providers, including Amazon Web Services (AWS), Google Cloud, and Microsoft Azure, to immediately integrate Scout into their model-as-a-service offerings.[12]
Community adoption was rapid on platforms like Hugging Face, where the model was utilized for specialized downstream tasks such as multi-document summarization and large-scale codebase analysis.[12] The availability of natively multimodal open weights was viewed by some analysts as a shift in the industry, pressuring proprietary model developers to provide greater transparency or more competitive pricing for their long-context services.[12]
Version History
Llama 4 Scout was released on April 5, 2025, as part of Meta AI's first natively multimodal collection of open-weight models.[6][8] Launched alongside the larger Llama 4 Maverick and a preview of the teacher model Llama 4 Behemoth, Scout was positioned as the high-efficiency entry in the series.[6] The initial release featured a mixture-of-experts (MoE) architecture with a knowledge cutoff of August 2024.[8]
Model Variants
Meta released Llama 4 Scout in two primary configurations:
- Pretrained (Base): Designed for natural language generation and specialized fine-tuning by developers. According to Meta, these models can be adapted for a wide variety of downstream tasks beyond the 12 officially supported languages.[8]
- Instruct (Chat): A post-trained version optimized for assistant-style dialogue and visual reasoning. Meta states that this variant utilizes a revamped post-training pipeline involving supervised fine-tuning (SFT), online reinforcement learning (RL), and direct preference optimization (DPO) to balance conversational ability with reasoning depth.[6][8]
Implementation and API Changes
Immediately following its release, Llama 4 Scout was integrated into Meta's consumer applications, including WhatsApp, Messenger, and Instagram Direct.[6] Third-party availability began in April 2025 through providers such as DeepInfra, Groq, Fireworks, and Together AI.[9]
Initial third-party deployments encountered performance challenges related to latency and accuracy. On May 2, 2025, updates to the vLLM serving framework were reported to significantly improve model accuracy and inference speed for Llama 4 Scout.[7] These optimizations were particularly focused on managing the model's 10-million-token context window and its interleaved attention (iRoPE) architecture.[6][7] Use of the model is governed by the Llama 4 Community License Agreement, which permits commercial and research use while maintaining certain restrictions on synthetic data generation for competing models.[8][9]
Sources
- [4] “Llama 4 Scout: Specifications and GPU VRAM Requirements”. Retrieved March 24, 2026.
Architecturally, Llama 4 Scout employs a Mixture-of-Experts (MoE) configuration, incorporating 109 billion total parameters, with 17 billion active parameters engaged per token across 16 experts. Its architecture incorporates interleaved attention layers, specifically iRoPE, to enhance generalization capabilities across extended sequences.
- [6] “Unmatched Performance and Efficiency | Llama 4”. Retrieved March 24, 2026.
All Llama 4 models are designed with native multimodality, leveraging early fusion that allows us to pre-train the model with large amounts of unlabeled text and vision tokens.
- [7] “Analysis of Llama 4’s 10 Million Token Context Window Claim”. Retrieved March 24, 2026.
As per my understanding, it seems that LLaMA 4 employs a hybrid attention mechanism, featuring both a global attention without positional encoding and a local attention computed in chunks.
- [8] “Llama 4 Scout: API Provider Performance Benchmarking & Price Analysis”. Retrieved March 24, 2026.
For output speed, the top providers are Groq (409.2 t/s)... Speed varies significantly across providers.
- [9] “Significant Performance Improvements in Llama-4-Scout with Latest vLLM Updates”. Retrieved March 24, 2026.
Recent vLLM updates have significantly improved accuracy and latency... Digits explored Meta's Llama-4-Scout performance issues tied to third-party server deployments.
- [10] “Llama 4 Series Vulnerability Assessment: Scout vs. Maverick”. Retrieved March 24, 2026.
Llama 4 Scout had a risk score of 58 and Llama 4 Maverick a risk score of 52, both categorized as medium risk. ... the models in this series are most susceptible to jailbreak attacks, with Llama 4 Scout exhibiting the highest ASR at 67.3%.
- [12] “Meta Llama 4 Scout Risk Report”. Retrieved March 24, 2026.
Overall Ranking: 14th. Overall Risk Score: 88.14 risk score. Highest Risk Category: ADD, DAIS, IIO with 100.0 risk score.
- [13] “meta-llama/Llama-Guard-4-12B · Hugging Face”. Retrieved March 24, 2026.
Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters... aligned to safeguard against the standardized MLCommons hazards taxonomy.
- [14] “Debiasing Llama 4 (Scout): Pioneering Behavioral Unlearning for Safer AI”. Retrieved March 24, 2026.
Hirundo has successfully reduced 44% of biased behaviors in Llama 4 (Scout) using its Machine Unlearning platform... Our evaluation utilized the Bias Benchmark for Question-answering (BBQ) dataset.
- [15] “meta-llama/Llama-4-Scout-17B-16E · Hugging Face”. Retrieved March 24, 2026.
Model Release Date: April 5, 2025... Knowledge cutoff: August 2024... Instruction tuned models are intended for assistant-like chat... whereas pretrained models can be adapted for natural language generation.
- [16] “Llama 3.1 8B Instruct vs Llama 4 Scout: Complete Comparison”. Retrieved March 24, 2026.
Llama 4 Scout was released on 2025-04-05... available from DeepInfra, Lambda, Novita, Groq, Fireworks, Together... uses Llama 4 Community License Agreement.
- [17] “Llama 4 underperforms: a benchmark against coding-centric models”. Retrieved March 24, 2026.
Rootly AI Labs analyzes the performance of Meta’s Llama 4 models and finds they underperform compared to competitors like Claude 3.5 Sonnet and Qwen2.5.
- [18] “Llama 4 Scout - Intelligence, Performance & Price Analysis”. Retrieved March 24, 2026.
Analysis of Meta’s Llama 4 Scout and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.
