V3.1
DeepSeek-V3.1 is a large language model (LLM) developed by DeepSeek, a Chinese artificial intelligence research laboratory 2. Released in August 2025, the model serves as an update to the DeepSeek-V3 series and is designed to integrate the general-purpose capabilities of V3 with reasoning features from the DeepSeek-R1 lineage 2, 35. It is available under an open-weights license, which allows for both commercial application and private self-hosting 12.
The model is built upon the architecture of DeepSeek-V3, which utilizes a Mixture-of-Experts (MoE) framework featuring 671 billion total parameters and 37 billion activated parameters per token 4, 26. DeepSeek characterizes V3.1 as a hybrid inference model that maintains performance while managing the computational overhead associated with dense models of a similar scale 32, 35. V3.1 supports a context window of up to 128,000 tokens 2, 28. According to the developer, the model underwent an expanded long-context training process that included 630 billion tokens for its 32K context extension phase and an additional 209 billion tokens to achieve the 128K phase 2.
A central functional feature of V3.1 is its "hybrid thinking mode," which allows the model to toggle between two distinct response styles by modifying the chat template 32, 35. In "thinking" mode, the model employs chain-of-thought (CoT) reasoning to provide step-by-step logical deductions for complex mathematics, programming, and scientific inquiries 2, 35. In its standard "non-thinking" mode, it provides direct answers suited for general question-answering and content generation 32, 35. DeepSeek asserts that the V3.1-Think variant achieves quality comparable to their specialized R1-0528 reasoning model but operates with greater speed 35. The developer states that through chain-of-thought compression training, V3.1-Think reduces output tokens by 20% to 50% while maintaining nearly the same average performance 35.
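Because the toggle lives in the chat template rather than in a separate model, switching modes amounts to changing how the assistant turn is opened. The sketch below illustrates the idea with simplified stand-in tokens; the role markers and `<think>`/`</think>` placement are assumptions for illustration, and the published template on the model card is authoritative.

```python
def build_prompt(messages, thinking: bool) -> str:
    """Illustrative mode-toggling chat template.

    The `<|role|>` markers and thinking tokens here are simplified
    stand-ins, not DeepSeek's exact special tokens.
    """
    parts = [f"<|{m['role']}|>{m['content']}" for m in messages]
    # The assistant turn opens in one of two states: with an open
    # thinking span the model must fill in, or with the span already
    # closed so the model answers directly.
    parts.append("<|assistant|>" + ("<think>" if thinking else "</think>"))
    return "".join(parts)

msgs = [{"role": "user", "content": "What is 2 + 2?"}]
print(build_prompt(msgs, thinking=True))   # ends with an open <think> span
print(build_prompt(msgs, thinking=False))  # span pre-closed: direct answer
```

On the hosted API, the same switch is surfaced as two model names, deepseek-chat (non-thinking) and deepseek-reasoner (thinking), both backed by V3.1.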
The market impact of V3.1 is frequently attributed to its cost-to-performance ratio 2, 40. While independent estimates suggest that training proprietary models such as GPT-4 may cost between $50 million and $100 million, DeepSeek reported that the underlying V3-Base model was trained for approximately $5.6 million using 2.788 million H800 GPU hours 4, 38, 41. By providing a model of this scale with open weights, DeepSeek has influenced the competitive landscape, offering capabilities that the developer claims are comparable to leading closed-source models like GPT-4o and Llama 3.1 405B at a lower infrastructure cost for end-users 11, 43. Post-training optimizations in V3.1 were also intended to improve tool usage and agentic workflows, with DeepSeek reporting that the model outperforms earlier iterations in code and search agent benchmarks 35.
Background
The development of DeepSeek-V3.1 occurred during a period of expansion in the large language model (LLM) sector, characterized by competition between American and Chinese research laboratories 13. By mid-2025, the field included proprietary systems such as OpenAI's GPT series and Anthropic's Claude 4 series 13, 24. DeepSeek-V3.1 was released in August 2025 as an open-weight alternative to closed-source frontier models 13, 35.
The model's design was informed by the architecture of its predecessors, DeepSeek-V2 and DeepSeek-V3 1, 33. DeepSeek-V3 demonstrated that competitive performance could be achieved with limited computational budgets; its training run cost approximately $5.6 million 26, 40. By comparison, contemporary frontier models such as GPT-4 were estimated to cost approximately $100 million to train 38. DeepSeek-V3, released in December 2024, utilized a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, activating 37 billion per token to manage inference costs 4, 26, 27.
Prior to V3.1, the developer maintained two distinct lineages: the V-series for general-purpose tasks and the R-series (such as DeepSeek-R1) for advanced reasoning 1, 13. DeepSeek-R1, introduced in early 2025, focused on chain-of-thought (CoT) reasoning through large-scale reinforcement learning 33. A primary motivation for V3.1 was the integration of these capabilities into a single hybrid architecture capable of switching between reasoning-heavy "thinking" modes and direct "non-thinking" responses 32, 35.
The development timeline involved updates aimed at refining model capabilities. In May 2025, the developer released DeepSeek-R1-0528, which aimed to improve reasoning coherence and reduce hallucination rates 31, 35. DeepSeek-V3.1 was built on an updated base model that underwent specialized long-context training, which included 630 billion tokens for its 32K extension and 209 billion tokens for its 128K extension 2. According to DeepSeek, the model was designed to challenge the established economics of frontier AI development by delivering high-performance capabilities at a fraction of the costs associated with leading proprietary systems 13.
Architecture
DeepSeek-V3.1 is built upon a transformer-decoder architecture that utilizes a Mixture-of-Experts (MoE) structure to manage its large-scale parameter count while maintaining computational efficiency 1, 6. The model consists of 671 billion total parameters, of which 37 billion are activated for any single token during inference 3, 6. This sparsity is distributed across 61 transformer layers with a hidden dimension of 7168 3. The architecture is designed to integrate the general capabilities of the DeepSeek-V3 lineage with reasoning features from the R1 series 2.
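The degree of sparsity implied by these figures can be checked directly:

```python
total_params_b = 671   # total parameters, billions
active_params_b = 37   # parameters activated per token, billions

active_fraction = active_params_b / total_params_b
print(f"active per token: {active_fraction:.1%}")
```

Only about one parameter in eighteen participates in any single token's forward pass, which is what lets a 671B-parameter model run with the compute profile of a much smaller dense model.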
Multi-Head Latent Attention (MLA)
A central component of the architecture is Multi-Head Latent Attention (MLA), a mechanism designed to alleviate the memory bottleneck associated with the Key-Value (KV) cache in traditional attention 6, 5. Unlike standard Multi-Head Attention (MHA), MLA compresses the keys and values into a low-rank latent vector during the processing step 6, 3. According to technical assessments, this compression achieves approximately a 10-fold reduction in the memory required for the KV cache, which is particularly significant for long-context generation 6, 4.
In DeepSeek-V3.1, the latent vector dimension is set to 512, and the model employs 128 attention heads 3. Because the low-rank compression is incompatible with traditional Rotary Positional Embeddings (RoPE), the architecture uses a decoupled RoPE strategy 6. In this implementation, positional information is integrated into a separate component of the query and key vectors, ensuring that the model maintains spatial awareness despite the compressed representation of content 6, 3.
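The cache arithmetic behind this design can be sketched as follows. Only the layer count (61), head count (128), and latent width (512) come from the text; the per-head dimension, the decoupled-RoPE width, and the 16-bit cache precision are illustrative assumptions, so the printed ratio reflects this idealized setup rather than the measured savings cited above.

```python
# Illustrative KV-cache sizing: MLA latent cache vs. standard MHA.
n_layers, n_heads = 61, 128      # from the text
d_head, d_rope = 128, 64         # assumptions for the sketch
d_latent = 512                   # from the text
bytes_per_val = 2                # fp16/bf16 cache assumption

def kv_bytes_per_token(vals_per_layer: int) -> int:
    return vals_per_layer * n_layers * bytes_per_val

mha = kv_bytes_per_token(2 * n_heads * d_head)  # keys + values, every head
mla = kv_bytes_per_token(d_latent + d_rope)     # shared latent + RoPE part

ctx = 128_000
print(f"MHA cache @128K: {mha * ctx / 2**30:.1f} GiB")
print(f"MLA cache @128K: {mla * ctx / 2**30:.1f} GiB")
print(f"idealized compression: {mha / mla:.1f}x")
```

The gap between this idealized ratio and published estimates comes from implementation details (precision, per-head layouts, and what baseline is being compared against).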
DeepSeekMoE and Expert Configuration
The model utilizes a "finer-grained" expert strategy within its MoE layers compared to traditional MoE models such as Mixtral or Grok-1 1, 6. While those models typically employ a small number of large experts (e.g., 8), DeepSeek-V3.1 features 256 routed experts and one shared expert per layer 1, 6. The shared expert is permanently activated to capture and store common knowledge across all tasks, which the developer asserts allows the routed experts to achieve higher levels of specialization 6, 4.
To manage these experts, the model employs a novel auxiliary-loss-free load-balancing strategy 1, 6. Traditional MoE models use an auxiliary loss function to prevent "routing collapse," where all tokens are sent to only a few experts; however, this can inadvertently degrade model performance by forcing suboptimal routing 6, 4. DeepSeek-V3.1 replaces this with a dynamic bias term added to the affinity scores during the top-K routing process 6. The model monitors expert load and adjusts these bias terms—increasing them for underloaded experts and decreasing them for overloaded ones—to ensure balanced utilization without penalizing the primary training objective 6, 4.
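A toy simulation illustrates the mechanism. The expert count, update rate, and skew below are arbitrary illustrative choices, not DeepSeek's values (V3.1 routes among 256 experts), and the controller here is a deliberately minimal sign-based version of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, steps = 8, 2, 2000   # toy scale for the sketch
skew = np.array([3.0, 1.5, 0, 0, 0, 0, 0, 0], dtype=float)  # experts 0-1 "naturally" favored
bias = np.zeros(n_experts)   # routing bias, tuned online instead of an auxiliary loss
load = np.zeros(n_experts)   # cumulative tokens routed to each expert
update = 0.01                # bias adjustment speed

for _ in range(steps):
    affinity = skew + rng.normal(size=n_experts)  # token-expert affinity scores
    # The bias shifts only *which* experts are selected; in the real model
    # the unbiased affinity still weights the selected experts' outputs.
    chosen = np.argsort(affinity + bias)[-top_k:]
    load[chosen] += 1
    # Raise the bias of underloaded experts, lower it for overloaded ones.
    bias += update * np.sign(load.mean() - load)

print("per-expert load:", load.astype(int))  # roughly balanced despite the skew
```

Without the bias term, experts 0 and 1 would absorb nearly every token; with it, cumulative load equalizes while the training loss never sees a balancing penalty.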
Multi-Token Prediction (MTP)
DeepSeek-V3.1 incorporates Multi-Token Prediction (MTP), a technique where the model is trained to predict multiple future tokens sequentially at each position rather than just the single next token 6, 4. Unlike other MTP implementations that predict tokens in parallel, DeepSeek-V3.1 maintains a complete causal chain, using previous predictions as context for subsequent ones 6, 5. This approach is intended to provide a denser training signal, improving the model's data efficiency and its ability to plan ahead during generation 4, 5. While the MTP modules are primarily used during training, they can also be utilized during inference to accelerate generation through speculative decoding, with a reported acceptance rate for the second token between 85% and 90% 4.
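The reported acceptance rate translates into decoding throughput via a simple expected-value argument. This is an idealized sketch that drafts a single extra token per step and ignores verification overhead:

```python
def expected_tokens_per_step(accept_rate: float) -> float:
    """Expected tokens committed per decoding step when the MTP head
    drafts one extra token: the base token always lands, and the drafted
    second token lands with probability `accept_rate`."""
    return 1.0 + accept_rate

for rate in (0.85, 0.90):
    print(f"acceptance {rate:.0%}: {expected_tokens_per_step(rate):.2f} tokens/step")
```

At 85-90% acceptance, each step commits roughly 1.85-1.9 tokens, so the number of sequential decode steps nearly halves in this idealized model.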
Training Methodology and Precision
The model was pre-trained on a dataset of 14.8 trillion tokens using 2,048 NVIDIA H800 GPUs 4. A significant innovation in the training pipeline is the use of FP8 mixed-precision training 1, 2. DeepSeek states that this is the first instance of a large-scale open-weights model being pre-trained using FP8 precision, which reduces memory usage and computational overhead 1. To maintain accuracy at this low precision, the model utilizes fine-grained quantization and a customized data format to minimize accumulation errors 1. The training infrastructure also employs "DualPipe," a pipeline parallelism strategy designed to overlap computation and communication, maximizing the efficiency of the GPU cluster 2, 5.
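The role of fine-grained scaling can be illustrated with a simplified round-trip. The tile size, the 3-mantissa-bit stand-in for the FP8 cast, and the outlier scenario below are assumptions for the sketch, not DeepSeek's actual kernels; only the e4m3 maximum of 448 is a property of the FP8 format itself.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 e4m3 format

def quantize_dequantize(x: np.ndarray, tile: int = 128) -> np.ndarray:
    """Round-trip through simulated tile-wise FP8 quantization.

    Each contiguous tile gets its own scale, so one outlier only affects
    its own tile instead of the whole tensor. The cast is a crude
    stand-in that keeps ~3 mantissa bits of relative precision.
    """
    xt = x.reshape(-1, tile)
    scale = np.abs(xt).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)
    q = xt / scale                                            # fits the e4m3 range
    exp = np.where(q == 0.0, 0.0, np.floor(np.log2(np.abs(q) + 1e-30)))
    step = 2.0 ** (exp - 3)                                   # 3-bit mantissa spacing
    q = np.round(q / step) * step
    return (q * scale).reshape(x.shape)

x = np.random.default_rng(0).normal(size=1024)
x[5] = 1e4  # a large outlier; one global scale would push all other
            # values toward the bottom of the e4m3 range
y = quantize_dequantize(x)
rel_err = float(np.max(np.abs(y - x) / np.maximum(np.abs(x), 1e-12)))
print(f"worst per-element relative error: {rel_err:.4f}")
```

With per-tile scales, the relative round-trip error stays bounded by the mantissa precision (here 1/16) even in the presence of activation outliers, which is the failure mode fine-grained quantization is designed to contain.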
Capabilities & Limitations
V3.1 is characterized as a hybrid reasoning model, an architectural approach that integrates reasoning and conversational functionalities within a single system 11. This represents an evolution from previous iterations, such as DeepSeek-R1, which focused on pure reasoning, or earlier versions of V3 designed for general-purpose interaction 2, 11. According to developer evaluations, the hybrid architecture allows the model to automatically adjust its reasoning depth based on the complexity of the task, which is intended to minimize unnecessary computational overhead 11.
The model demonstrates specific proficiency in programming and technical tasks. In independent Aider programming benchmark tests conducted in August 2025, the model achieved a 71.6% second-pass rate, which was reported as the highest among non-reasoning models at the time of testing 11. Performance data suggests a 41.3% first-pass rate on similar tasks, with high metrics for format accuracy (95.6%) and zero recorded syntax or indentation errors 11. The model is utilized for code generation, debugging, and refactoring, with third-party testing noting its ability to identify issues in large-scale codebases and generate complex JavaScript and WebGL code 11.
Mathematical reasoning is a central capability, inheriting features from the R1 lineage 2. The model is designed to solve complex mathematical problems—reflected in benchmarks like MATH—and scientific reasoning challenges by generating a step-by-step chain of thought before providing a final response 2. It supports a 128k token context window, doubling the 64k limit of the prior R1 model, which facilitates the processing of extensive documentation and long-form code projects 11. Its training data has a knowledge cutoff of July 2025, and the model exhibits multilingual fluency, although its primary optimizations focus on English and Chinese language tasks 11.
The model exists in two primary functional variants: a Base model and a Chat-instructed model 2. The Base model serves as the foundation, trained on web content and e-books to predict subsequent text tokens 2. The Chat version undergoes supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to optimize its conversational utility and adherence to safety guidelines 2. Analysis from BentoML indicates that while the Base model was trained without intentional inclusion of synthetic data, it may have indirectly absorbed OpenAI-generated content through crawled web pages 2.
Significant limitations have been identified in the model's creative and aesthetic outputs. Developer feedback suggests that the model’s aesthetic design capabilities are less developed than its technical logic, often producing abstract or visually underwhelming results in frontend development or UI/UX tasks 11. Users are advised to exercise caution when using the model for high-creativity design requirements or critical security code generation 11. While the hybrid model reduces redundant computation, some specialized tasks may still be handled more effectively by dedicated reasoning-only models 11.
V3.1 is subject to common failure modes, including the potential for hallucinations in niche domains and occasional timeouts during high-latency inference 11. Early reasoning-heavy variants like R1-Zero reportedly struggled with repetitive outputs and language mixing; while V3.1 aims to mitigate these through a multi-stage training pipeline, edge case handling remains an area cited for ongoing improvement 2, 11. Intended applications include software development, enterprise-scale code review, and algorithm implementation, while use in high-stakes aesthetic design is typically discouraged 11.
Performance
DeepSeek-V3.1 maintains a performance profile characterized by high computational efficiency and competitive results across standardized benchmarks in mathematics, coding, and general reasoning. The model's performance is often compared to proprietary Western systems like OpenAI's GPT-4o and Meta's Llama 3.1 series, particularly in its ability to achieve comparable accuracy with significantly lower resource requirements 2, 7.
Benchmark Evaluations
The underlying architecture for V3.1, first established in the DeepSeek-V3 release, demonstrated high scores on complex datasets. According to third-party analysis, the model has achieved state-of-the-art results on several difficult evaluations, including MATH 500 and the 2024 American Invitational Mathematics Examination (AIME) 7. On the AIME 2024 benchmark, V3-based models have been noted for outperforming the combination of GPT-4o and Claude 3.5 Sonnet 7. DeepSeek-V3.1-Think, the reasoning-focused variant, is reported to deliver reasoning quality comparable to the previous DeepSeek-R1-0528 but with faster response times 2.
In general language tasks, V3.1 is ranked within the top 10 on the Chatbot Arena leaderboard, placing it above several iterations of Google's Gemini Pro and xAI's Grok-2 7. Additionally, the model shows improved performance in tool usage and agentic workflows; developer tests indicate it outperforms both DeepSeek-V3-0324 and DeepSeek-R1-0528 in code and search agent benchmarks 2.
Training Efficiency and Cost
A defining characteristic of the V3 series is its training efficiency. The model was trained using approximately 2.788 million NVIDIA H800 GPU hours to process 14.8 trillion tokens 2, 5. DeepSeek reports that the total training cost for this phase was approximately $5.6 million 2, 5. This represents a significant reduction in expenditure compared to contemporary dense models; for instance, the training of Llama 3.1 405B is estimated to have required roughly 30.8 million GPU hours at a cost between $92.4 million and $123.2 million 5.
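The reported figures imply straightforward unit economics; this sketch assumes the quoted $5.6 million covers exactly the quoted GPU-hours:

```python
gpu_hours = 2.788e6   # reported H800 GPU-hours for the pre-training run
total_cost = 5.6e6    # reported cost, USD
tokens = 14.8e12      # pre-training tokens

rate = total_cost / gpu_hours                    # implied $/GPU-hour
cost_per_b_tokens = total_cost / (tokens / 1e9)  # implied $/billion tokens
print(f"implied rental rate: ${rate:.2f} per GPU-hour")
print(f"training cost per billion tokens: ${cost_per_b_tokens:.0f}")
```

The roughly $2 per GPU-hour figure matches the rental-price assumption DeepSeek used in its own published cost estimate.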
This efficiency is attributed to architectural innovations such as the use of FP8 precision during training, which reduces memory footprint and increases throughput, and the DualPipe algorithm, which minimizes idle GPU time during data processing 5. V3.1 further optimizes inference through chain-of-thought compression training, which reduces the number of output tokens by 20% to 50% while maintaining performance levels 2.
Economic and Latency Metrics
API pricing for the model reflects the developer's focus on cost-efficiency. Comparisons of API costs show that DeepSeek-V3 (the base for V3.1) is approximately 17.9 times cheaper for input tokens and 35.7 times cheaper for output tokens than GPT-4o 6. For a daily workload of 10 million input tokens, the estimated cost is $1.40 with DeepSeek compared to $25.00 with GPT-4o 6. While the model supports a 128K context window, the use of Sparse Attention in later iterations like V3.2-Exp was specifically designed to further reduce latency and costs in long-context scenarios 2.
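The quoted totals back out to per-million-token input prices; the per-million figures below are derived from the quoted daily costs rather than taken from an official rate card:

```python
deepseek_input_per_m = 0.14  # USD per 1M input tokens, implied by $1.40 per 10M
gpt4o_input_per_m = 2.50     # USD per 1M input tokens, implied by $25.00 per 10M

daily_m_tokens = 10          # workload: 10M input tokens per day

deepseek_daily = deepseek_input_per_m * daily_m_tokens
gpt4o_daily = gpt4o_input_per_m * daily_m_tokens
print(f"DeepSeek: ${deepseek_daily:.2f}/day, GPT-4o: ${gpt4o_daily:.2f}/day")
print(f"input-price ratio: {gpt4o_input_per_m / deepseek_input_per_m:.1f}x")
```

The ratio reproduces the approximately 17.9x input-token figure cited in the comparison.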
Safety & Ethics
The safety and ethics framework of V3.1 is centered on a multi-stage post-training pipeline designed to align the model’s outputs with human expectations of helpfulness and safety. DeepSeek utilizes a combination of supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to ensure that the model remains "harmless and honest" during interactions 2. This alignment process is further augmented by techniques derived from the DeepSeek-R1 lineage, which uses large-scale reinforcement learning to refine reasoning patterns and mitigate common large language model (LLM) issues such as endless repetition and poor readability 2.
Alignment and Content Filtering
According to the developer, V3.1 employs a hybrid approach to alignment, integrating Direct Preference Optimization (DPO) and RLHF to manage content filtering and adherence to safety guidelines. This is particularly relevant in the model’s "thinking" mode, where chain-of-thought reasoning is utilized to solve complex tasks. The developer states that this mode has undergone specific training to ensure that the internal reasoning process remains focused and consistent, with internal tests indicating that hallucinations have been reduced by 45% to 50% compared to earlier iterations in tasks such as summarization and reading comprehension 2. Furthermore, V3.1 features "smarter tool calling" and improved agentic workflow capabilities, which are intended to prevent the model from generating unsafe or nonsensical commands when interacting with external environments or multi-step tasks 2.
Data Ethics and Bias
Ethical considerations regarding the model's training data include the potential for indirect bias from other AI systems. DeepSeek researchers have noted that during the pre-training of the base model, some crawled web data included content generated by OpenAI models 2. This suggests that V3.1 may have indirectly absorbed the behavioral patterns or biases of these systems, despite researchers' efforts to avoid intentional inclusion of synthetic data during the pre-training cooldown phase 2.
Regulatory Compliance and Regional Bias
As a system developed by a Chinese research laboratory, V3.1 is subject to regional regulatory frameworks governing artificial intelligence and information services. These regulations necessitate specific content filtering layers to ensure compliance with local legal requirements. Critics and independent analysts have noted that such compliance measures can influence the model's output bias, particularly in relation to sensitive political, social, or cultural topics. While the model is available under an open-weights license for commercial use, the developer emphasizes that private deployments remain responsible for implementing additional safety layers consistent with their specific regional laws 2.
Applications
The V3.1 model is utilized across several high-compute domains, primarily distinguished by its hybrid reasoning architecture that balances conversational fluency with logical depth. Its applications range from individual developer workflows to large-scale enterprise automation and academic research.
Software Development and Engineering
Due to its high proficiency in programming tasks, V3.1 is frequently integrated into software development environments and Integrated Development Environments (IDEs) like VS Code and JetBrains 11. In standardized Aider programming benchmarks, the model achieved a 71.6% second-pass rate, which independent evaluations noted as surpassing the performance of several proprietary non-reasoning models 11. Developers employ the model for real-time code generation, complex debugging, and modular refactoring 11. Its 128k token context window enables the processing of extensive codebases and long-form technical documentation, allowing the model to maintain coherence over large engineering projects 11.
Enterprise and High-Volume Processing
In enterprise settings, V3.1 is positioned as a cost-effective alternative for high-volume text and data processing. Comparative analysis indicates that the model can be up to 68 times less expensive to operate than proprietary competitors like Claude Opus for similar programming and reasoning tasks 11. Organizations deploy V3.1 for automated code reviews, unit test generation, and integration into CI/CD pipelines to streamline quality assurance processes 11. Its Mixture-of-Experts (MoE) architecture allows for high-parameter performance with lower inference costs, making it suitable for startups and large-scale software firms seeking to reduce AI service expenditures 2, 11.
Research and Academic Use
V3.1 serves as a high-performance open-weights resource for the scientific and academic communities. It is used for complex mathematical problem-solving, scientific reasoning, and the implementation of advanced algorithms 2, 11. Because the base model is hosted on platforms like Hugging Face, researchers can deploy it on private infrastructure, which is a significant factor for institutions requiring strict data privacy or those conducting experiments that cannot be performed through closed-source APIs 2, 11.
Limitations and Not-Recommended Scenarios
Despite its technical capabilities, certain use cases are not recommended for V3.1. Developer feedback suggests that the model’s creative design and aesthetic judgment for UI/UX tasks are limited, often producing abstract or visually inconsistent results 11. Additionally, industry recommendations advise caution when using the model for generating critical security-sensitive code, where human oversight remains essential 11.
Reception & Impact
Industry Reception and the 'DeepSeek Moment'
The release of DeepSeek-V3.1 marked a significant period of market volatility, frequently referred to as the "DeepSeek moment" by industry analysts 13. Its arrival shortly after the launches of OpenAI’s GPT-5 and Anthropic’s Claude 4.1 was seen as a direct challenge to the high-cost, closed-source business models of American AI providers 13. This impact was reflected in the strategic shifts of competitors; OpenAI CEO Sam Altman noted that the rise of Chinese open-source models, specifically mentioning DeepSeek, influenced his company's decision to release its own open-weight models 13.
Economic and Efficiency Implications
The model's "efficiency-first" philosophy received critical acclaim for demonstrating that high-tier AI performance could be achieved with significantly lower capital expenditure 13. This is primarily attributed to its Mixture-of-Experts (MoE) architecture, which activates only 37 billion parameters during inference, keeping operational costs low despite its total 671-billion-parameter size 1, 13. Third-party evaluations by VentureBeat highlighted that V3.1 could complete coding tasks for approximately $1.01, compared to equivalent workloads on certain proprietary systems that cost nearly $70 13. Although training costs for V3.1 were not immediately disclosed, its predecessor in the V3 lineage cost roughly $5.6 million to train, a fraction of the cost associated with equivalent models from U.S. laboratories 13.
Technical and Scientific Impact
DeepSeek-V3.1 was noted by technical commentators for its innovative use of FP8 precision during the pre-training phase 1. While FP8 is widely used for inference, the DeepSeek team’s application of it during training—facilitated by fine-grained quantization and tile-wise grouping—was characterized as a novel approach that doubled training speed and reduced memory consumption 1. According to technical reports, the developers maintained a relative loss error of less than 0.25% compared to traditional BF16 training, a result described as a significant engineering achievement for open-source large language models 1.
Global and Community Adoption
The model saw rapid adoption within the global developer community, quickly ascending the trending lists on platforms like Hugging Face 13. Its availability under the permissive MIT license has allowed for extensive commercial modification and use 13. Despite this popularity, practical barriers remain; the model's 700GB size requires hardware resources beyond the capacity of most smaller organizations, often restricting use to lower-cost API access rather than local hosting 13.
Furthermore, the model’s emergence has intensified a global debate regarding the shifting balance of AI development between the United States and China 13. Analysts suggest that V3.1 indicates a shift in the AI race from a focus on raw power to a focus on accessibility and cost-effectiveness 13. However, adoption in U.S. enterprise environments remains tempered by geopolitical tensions and a preference for domestic vendors offering integrated security and support frameworks 13.
Version History
The version history of the V3 architecture began with the release of DeepSeek-V3 in December 2024, a 671-billion parameter Mixture-of-Experts (MoE) model 2. This was followed in January 2025 by the introduction of DeepSeek-R1, which utilized the V3-Base architecture to implement large-scale reinforcement learning for advanced reasoning 2. Alongside these flagship models, DeepSeek released a suite of distilled versions ranging from 1.5B to 70B parameters, based on Llama and Qwen architectures, to enable reasoning capabilities on smaller hardware configurations 2.
In March 2025, the developer released DeepSeek-V3-0324, which integrated reasoning techniques from the R1 lineage into the general-purpose V3 pipeline to improve coding and mathematical performance 2, 14. This was followed in May 2025 by DeepSeek-R1-0528, an update that added support for system prompts and, according to the developer, reduced hallucination rates by approximately 45–50% in tasks such as summarization 2, 14.
DeepSeek-V3.1 was officially released in August 2025 12. According to DeepSeek, this version represents a hybrid architecture that allows a single model to toggle between "thinking" (chain-of-thought) and "non-thinking" (direct response) modes through chat template adjustments 2, 14. Key updates in V3.1 included an expansion of the context window to 128,000 tokens and optimized tool-use capabilities for agentic workflows 12, 14.
Subsequent updates focused on stability and efficiency. DeepSeek-V3.1-Terminus was released in September 2025 to address language consistency issues, such as English-Chinese mixing in outputs 14. Later that month, the V3.2-Exp model introduced "DeepSeek Sparse Attention" (DSA) to optimize long-context inference 2, 14. The lineage culminated in December 2025 with the release of DeepSeek-V3.2, which integrated reasoning directly into tool-use tasks, and V3.2-Speciale, a high-compute research variant designed for formal theorem proving 14, 15.
Sources
- 1“The Complete Guide to DeepSeek Models: V3, R1, V3.1, V3.2 and Beyond”. Retrieved March 25, 2026.
In August 2025, DeepSeek released DeepSeek-V3.1, a major update that combines the strengths of V3 and R1 into a single hybrid model. It features a total of 671B parameters (37B activated) and supports context lengths up to 128K. ... DeepSeek-V3.1-Think achieves quality comparable to DeepSeek-R1-0528, but responds more quickly. ... V3.1-Think reduces output tokens by 20–50% while maintaining almost the same average performance.
- 2“DeepSeek-V3.1 is here. Here's what you should know.”. Retrieved March 25, 2026.
DeepSeek-V3.1 has received widespread attention and early tests revealed performance that rivals proprietary systems from American AI giants... This cost-efficiency stems partly from a low training cost. We still need training figures for the new model. But we know that its predecessor, DeepSeek-V2, required just $5.6 million for one training run.
- 3“DeepSeek v3 and R1 Model Architecture: Why it's powerful and economical”. Retrieved March 25, 2026.
DeepSeek v3 and R1 continue to use the traditional Transformer block... incorporates Multi-head Latent Attention (MLA) and radical Mixture-of-Experts (MoE)... first ever FP8 Precision OSS LLM pre-training.
- 4“DeepSeek-V3 Technical Report”. Retrieved March 25, 2026.
DeepSeek-V3 adopts Multi-Head Latent Attention (MLA) and DeepSeekMoE... multi-token prediction (MTP) objective... FP8 mixed-precision training.
- 5“DeepSeek-V3: Technical Details”. Retrieved March 25, 2026.
DeepSeek-V3 contains 671B total parameters, of which 37B are active for each token. It has 61 transformer layers with the hidden dimension, d_h=7168.
- 6“DeepSeek-V3 Technical Report - Gonzo ML”. Retrieved March 25, 2026.
V3 is trained on 14.8T high-quality tokens... Trained with 2048 NVIDIA H800 GPUs. Multi-Token Prediction (MTP) allows predictions of future tokens at each prediction... acceptance rate of the second token prediction from MTP is between 85 and 90%.
- 7“DeepSeek Technical Analysis — (3) Multi-Token Prediction”. Retrieved March 25, 2026.
The MLA reduced the KV cache size by 93.3%... Multi-Token Prediction which can improve the performance(accuracy) of the model.
- 11“DeepSeek vs. Llama 3: The True LLM Game Changer?”. Retrieved March 25, 2026.
API pricing comparisons show DeepSeek V3 can be ~17.9x cheaper for input tokens and ~35.7x cheaper for output tokens compared to models like GPT-4o.
- 12“DeepSeek V3 and the cost of frontier AI models”. Retrieved March 25, 2026.
DeepSeek V3... trained on 14.8T tokens with 671B total and 37B active parameters. ... Beating the pair of GPT-4o and Claude 3.5 together, and by some margin, is extremely rare.
- 13“DeepSeek's V3.1 update and missing R1 label spark speculation over fate of R2 AI model”. Retrieved March 25, 2026.
DeepSeek announced on Tuesday the release of the V3.1 model in a brief message to one of its WeChat user groups. The update expands the context window to 128k, allowing the model to hold more information.
- 14“Change Log | DeepSeek API Docs”. Retrieved March 25, 2026.
Date: 2025-08-21 DeepSeek-V3.1 Both deepseek-chat and deepseek-reasoner have been upgraded to DeepSeek-V3.1. Hybrid reasoning architecture: A single model supports both thinking mode and non-thinking mode.
- 15“DeepSeek-V3.2 Release | DeepSeek API Docs”. Retrieved March 25, 2026.
Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents! V3.2 now supports Thinking in Tool-Use.
- 24“DeepSeek V3.1 Base Suddenly Launched: Outperforms Claude 4 in ...”. Retrieved March 25, 2026.
DeepSeek V3.1's new version is officially launched. With a 128k context length, its programming capabilities surpass Claude 4 Opus, and it costs as low as $1.
- 26“Introducing DeepSeek-V3”. Retrieved March 25, 2026.
Biggest leap forward yet: 60 tokens/second (3x faster than V2!), enhanced capabilities, API compatibility intact, fully open-source models & papers.
- 31“DeepSeek-V3.1 model now available in Amazon Bedrock - AWS”. Retrieved March 25, 2026.
AWS launches DeepSeek-V3.1 as a fully managed model in Amazon Bedrock. DeepSeek-V3.1 is a hybrid open weight model that switches between thinking mode for detailed step-by-step analysis and non-thinking mode for faster responses.
- 32“DeepSeek-V3.1 Release”. Retrieved March 25, 2026.
Introducing DeepSeek-V3.1: our first step toward the agent era! Hybrid inference: Think & Non-Think, one model, two modes. Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528. Stronger agent skills.
- 33“I would bet money against that. Replicating GPT-4 pre-training with ...”. Retrieved March 25, 2026.
- 35“AI Training Costs Soar: GPT-4's $100M Price Tag Sets New Standard”. Retrieved March 25, 2026.
The real race is who can afford the training run. GPT-4 reportedly cost over $100M to train; future frontier models may cost over $1B.
- 38“DeepSeek V3.1 vs Llama 3.1 405b - AnotherWrapper”. Retrieved March 25, 2026.
Compare DeepSeek V3.1 vs Llama 3.1 405b: input $0.56/M vs $3.5/M, output $1.68/M vs $3.5/M tokens. DeepSeek V3.1 is 68% cheaper overall.

