Phi-4 Reasoning Plus
Phi-4-reasoning-vision-15B, often referred to in the context of its multimodal reasoning capabilities as a "reasoning plus" model 811, is a 15-billion parameter open-weight multimodal model developed by Microsoft 114. Released on March 4, 2026, the model is designed to integrate visual perception with structured reasoning, allowing it to perform tasks that require both image understanding and multi-step logical processing 1415. It is part of the Phi family of small language models (SLMs), which emphasizes high-quality data curation and architectural efficiency over the massive parameter counts typical of contemporary large language models 121. The model is distributed under an open-weight license and is available through platforms such as HuggingFace, GitHub, and Microsoft Foundry 1810.
The architecture of the model utilizes a mid-fusion approach, combining a SigLIP-2 "Naflex" vision encoder with the Phi-4-Reasoning language backbone 116. Microsoft researchers adopted this design to enable cross-modal reasoning while leveraging pre-trained components, which the developer asserts is more computationally efficient than early-fusion models that process image and text tokens in a single transformer 115. The model was trained on approximately 200 billion tokens of multimodal data, a figure Microsoft highlights as significantly lower than the one trillion or more tokens used for similar models such as Qwen 2.5 VL or Gemma 3 117. This training efficiency was reportedly achieved through the use of synthetic data for text-rich visual reasoning and the meticulous filtering of open-source datasets 115.
Microsoft states that the model is broadly capable across general vision-language tasks, including image captioning, document and receipt reading, and sequential image inference 110. However, the developer specifically highlights its proficiency in mathematical and scientific reasoning and "UI grounding"—the ability to identify and interact with elements on computer and mobile screens 112. According to Microsoft's internal evaluations, the model maintains a competitive position on the "Pareto frontier" of accuracy versus compute costs, delivering performance comparable to larger or slower models while consuming fewer tokens and requiring less inference time 115. For instance, Microsoft reported that increasing mathematical data during training improved performance not only in science benchmarks but also in computer-use scenarios 1.
The significance of the model lies in its focus on bringing advanced reasoning capabilities to smaller, more deployable architectures 111. By incorporating "reasoning blocks" that allow the model to verify perceptual information before generating a final response, it attempts to resolve common issues where vision models fail due to an inability to select relevant visual details 115. Microsoft characterizes the release as a contribution to the community's understanding of how to build efficient multimodal systems that can operate in resource-constrained or interactive environments without relying on extreme-scale training data or hardware 1.
Background
The development of Phi-4-reasoning-plus is rooted in Microsoft’s "Phi" series of small language models (SLMs), which moved away from traditional scaling laws in favor of a data-centric training philosophy 79. Historically, the Phi project focused on the utilization of high-quality, "textbook-quality" data—both human-curated and synthetically generated—to enable models with fewer parameters to achieve performance levels previously reserved for much larger systems 79. This methodology was established through predecessor models such as Phi-1, Phi-2, and Phi-3, and was later integrated with research from the "Orca" lineage to emphasize logical reasoning and explanation-tuning 729.
The release of reasoning-focused variants marked a strategic shift from simple next-token prediction toward complex multimodal reasoning pipelines 110. By the time of the model's development, the field of artificial intelligence had begun to focus on "inference-time scaling," where models allocate additional computational effort during the generation process to improve performance on difficult tasks 112325. This trend was characterized by the emergence of competitive models such as OpenAI’s o1 and o3 series, DeepSeek-R1, and Anthropic’s Claude 3.7 Sonnet, which demonstrated the efficacy of extended chain-of-thought (CoT) reasoning 112426. According to Microsoft, Phi-4-reasoning-plus was developed to match the reasoning capabilities of these frontier-class models while maintaining a compact parameter count—ranging from 14 billion to 15 billion—suitable for efficient deployment 4111429.
The development timeline involved specialized post-training phases intended to "distill" reasoning capabilities into the Phi-4 base model 211. The technical report for Phi-4-reasoning details a supervised fine-tuning (SFT) stage using 1.4 million prompts specifically selected for their difficulty and "teachability" 229. According to the developers, these prompts were paired with reasoning traces generated by OpenAI's o3-mini to provide the model with demonstrations of internal reflection and problem decomposition 21129. To create the "plus" variant, Microsoft implemented a final phase of outcome-based reinforcement learning (RL) 211. This phase targeted approximately 6,000 math problems with verifiable solutions, incentivizing the model to produce longer, more accurate reasoning paths than the standard SFT version 229. In various scientific and mathematical benchmarks, the model's dual-stage training allowed it to outperform significantly larger models, such as the 70-billion parameter distilled versions of DeepSeek-R1 211.
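The outcome-based reward described above can be sketched as a simple verifier over math answers. This is a minimal illustration, assuming a `\boxed{...}` final-answer convention and a binary correct/incorrect reward; the actual reward shaping used in the "plus" RL phase is not described in the source:

```python
import re

def outcome_reward(model_output: str, reference_answer: str) -> float:
    """Outcome-based reward for a math problem with a verifiable solution.

    Returns 1.0 when the final boxed answer matches the reference, else 0.0.
    The \\boxed{...} convention and the binary reward are illustrative
    assumptions, not Microsoft's documented reward design.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no verifiable final answer produced
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0
```

Because the reward depends only on the verifiable outcome, the policy is free to produce longer reasoning traces whenever doing so raises the chance of a correct final answer, which is consistent with the reported behavior of the "plus" variant.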
Architecture
Phi-4-reasoning-vision-15B is a 15-billion parameter multimodal model designed to integrate visual perception with logical processing 1. The architecture utilizes a mid-fusion framework, which incorporates a pretrained vision encoder to project image data into the embedding space of a pretrained large language model (LLM) 1. Microsoft states that this design choice was intended to leverage the reasoning capabilities of the existing Phi-4 and Phi-4-Reasoning backbones while maintaining a smaller compute footprint compared to early-fusion models, which process images and text in a single transformer 1.
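A minimal sketch of the mid-fusion idea: patch embeddings from a pretrained vision encoder are mapped through a small MLP connector into the language model's embedding space, then concatenated with the text token embeddings. The dimensions and the two-layer connector shape below are toy assumptions for illustration, not the model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real encoder width and LLM hidden size are much
# larger and are not specified in the source.
D_VISION, D_MODEL = 64, 128
N_PATCHES, N_TEXT = 80, 16

def project_visual_tokens(patches, w1, b1, w2, b2):
    """Two-layer MLP connector mapping vision features into the LLM
    embedding space -- the 'mid-fusion' projection step."""
    hidden = np.maximum(patches @ w1 + b1, 0.0)  # ReLU stand-in for the nonlinearity
    return hidden @ w2 + b2

# Randomly initialised stand-ins for pretrained weights.
w1 = rng.normal(size=(D_VISION, D_MODEL)) * 0.02
b1 = np.zeros(D_MODEL)
w2 = rng.normal(size=(D_MODEL, D_MODEL)) * 0.02
b2 = np.zeros(D_MODEL)

patches = rng.normal(size=(N_PATCHES, D_VISION))        # vision-encoder output
text_embeddings = rng.normal(size=(N_TEXT, D_MODEL))    # LLM token embeddings

visual_tokens = project_visual_tokens(patches, w1, b1, w2, b2)
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)  # fed to the LLM
```

The design point is that only the connector needs to learn the alignment from scratch: both the encoder and the language backbone arrive pretrained, which is why this is cheaper than early-fusion training of a single transformer over raw patches and text.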
Vision and Resolution Processing
The model employs the SigLIP-2 Naflex vision encoder to handle visual inputs 1. Microsoft chose this dynamic-resolution encoder following ablation studies that compared various processing techniques, such as "pan-and-scan" and "Dynamic S2" methods 1. The Naflex variant allows for an adjustable patch count, enabling the model to process high-resolution images by generating up to 3,600 visual tokens 1. This capability is intended to improve performance on information-dense visual tasks, such as interpreting user interfaces and reading small scientific notation, which Microsoft identifies as a prerequisite for high-quality reasoning 2.
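The relationship between resolution, patch count, and the 3,600-token cap can be illustrated with a simple calculation. A uniform 16-pixel patch grid is an assumption made here for illustration; the actual Naflex patching scheme is more involved:

```python
import math

def visual_token_count(width: int, height: int, patch: int = 16,
                       max_tokens: int = 3600) -> int:
    """Number of visual tokens for an image under a token budget.

    Assumes one token per square patch (patch size 16 is an assumption).
    If the native grid exceeds the budget, the image is downscaled
    uniformly until the grid fits.
    """
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens <= max_tokens:
        return tokens
    # Downscale both sides by the same factor so the grid fits the budget.
    scale = math.sqrt(max_tokens / tokens)
    return math.floor(width * scale / patch) * math.floor(height * scale / patch)
```

At this assumed patch size, a 1280x720 screenshot yields exactly (1280/16) x (720/16) = 80 x 45 = 3,600 visual tokens, which matches the "3,600 tokens, approximately 720p" figure quoted elsewhere in this article.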
Training Methodology and Data Mixture
The training of Phi-4-reasoning-vision-15B followed a three-stage pipeline: Multi-Layer Perceptron (MLP) pretraining for alignment, instruction tuning, and a final stage focusing on long-context, multi-image reasoning, and safety alignment 3. The model was trained on approximately 200 billion multimodal tokens, which Microsoft highlights as a significantly smaller dataset than the one trillion or more tokens utilized for competing models like Qwen 3 VL or Gemma 3 1.
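The three-stage pipeline can be summarized schematically. Only the stage ordering comes from the source; which components are trainable at each stage is an assumption modeled on common vision-language training practice (connector-only alignment first, then full fine-tuning):

```python
# Hypothetical sketch of the three-stage pipeline described above.
# The "trainable" lists are assumptions, not documented configuration.
TRAINING_STAGES = [
    {
        "name": "mlp_pretraining",
        "goal": "align vision features with the LLM embedding space",
        "trainable": ["mlp_connector"],
    },
    {
        "name": "instruction_tuning",
        "goal": "general vision-language instruction following",
        "trainable": ["mlp_connector", "vision_encoder", "language_model"],
    },
    {
        "name": "long_context_and_safety",
        "goal": "long-context, multi-image reasoning and safety alignment",
        "trainable": ["mlp_connector", "vision_encoder", "language_model"],
    },
]

def stage_order():
    """Return the pipeline stages in execution order."""
    return [s["name"] for s in TRAINING_STAGES]
```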
The training data mixture prioritized quality and specific reasoning domains over sheer volume 1. The dataset was constructed from three primary sources: meticulously filtered open-source data, high-quality domain-specific internal data, and targeted acquisitions 1. Researchers utilized GPT-4o and o4-mini to re-generate responses for open-source datasets with high error rates and to create synthetic text for high-quality images used as seeds 1. Ablation studies conducted during development indicated that increasing the proportion of mathematical and scientific data did not degrade performance in other areas, such as computer-use tasks, but instead improved overall benchmark results 1.
Hybrid Reasoning Framework
A key architectural innovation in the model is the use of explicit mode tokens to manage the mixture of reasoning and non-reasoning data 2. This framework allows the model to operate in two distinct modes: a direct-response mode for simple perception tasks (e.g., image captioning) and a chain-of-thought mode for complex logical problems 2. According to Microsoft, this approach enables the system to deliver fast answers when beneficial while maintaining the ability to perform structured, step-by-step reasoning when the task requires it 1. The model has been released with an open-weight distribution to facilitate community deployment and auditing 1.
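The mode-token mechanism might look like the following sketch. The control-token strings and the `<think>` delimiter are hypothetical; the source does not specify the actual token vocabulary:

```python
import re

# Hypothetical mode/control tokens -- the actual strings used by
# Phi-4-reasoning-vision-15B are not specified in the source.
REASON_MODE, DIRECT_MODE = "<|reason|>", "<|direct|>"

def build_prompt(user_message: str, reasoning: bool) -> str:
    """Prepend an explicit mode token so a single model can serve both
    fast direct answers and chain-of-thought reasoning."""
    mode = REASON_MODE if reasoning else DIRECT_MODE
    return f"{mode}{user_message}"

def split_output(model_output: str) -> tuple[str, str]:
    """Separate the internal <think>...</think> trace (withheld from
    the user) from the final answer shown to the user."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", model_output, re.DOTALL)
    if m is None:
        return "", model_output.strip()  # direct-mode output: no trace
    return m.group(1).strip(), m.group(2).strip()
```

For example, a reasoning-mode generation like `"<think>2+2=4</think> The answer is 4."` splits into a hidden trace (`"2+2=4"`) and the user-visible answer.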
Capabilities & Limitations
Phi-4-reasoning-vision-15B is designed as a multimodal system capable of processing and reasoning across both text and visual inputs. Its primary capabilities include image captioning, visual question answering (VQA), and the extraction of structured information from visual documents such as receipts and academic papers 1. Microsoft states that the model is particularly suited for tasks requiring multi-step logical processing of visual data, including helping with complex homework problems and interpreting mathematical charts 1.
Vision and Reasoning Capabilities
The model supports advanced vision-language tasks, including the ability to infer changes across sequences of images, which is intended to facilitate temporal reasoning and sequential navigation 1. In internal evaluations, the model demonstrated the ability to interpret laundry care symbols, split restaurant bills based on photographic evidence, and generate evocative captions for travel photography 1.
Microsoft reports that the model achieves accuracy levels on benchmarks such as MathVista, MMMU, and ChartQA that are competitive with much larger models 1. Specifically, the developer asserts that Phi-4-reasoning-vision-15B provides similar performance to models requiring ten times the compute and token generation 1. This efficiency is attributed to a mixture of reasoning and non-reasoning data during training, allowing the model to balance perception-focused tasks with deep logical inference 1.
User Interface and Agentic Use
A significant focus of the model’s design is the understanding and grounding of elements on computer and mobile screens 1. It is intended for use in agentic scenarios where an AI must identify and interact with specific user interface (UI) components 1. The model utilizes the SigLIP-2 Naflex vision encoder, a dynamic resolution system that allows it to process high-resolution screenshots up to approximately 720p (3600 tokens) 1. This architectural choice was made to address common failure modes in smaller models, such as the inability to detect small, information-dense interactive elements in complex GUIs 1. On the ScreenSpot-v2 benchmark, the model showed a marked improvement in grounding accuracy compared to models using static or lower-resolution image processing techniques 1.
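ScreenSpot-style UI grounding is commonly scored by whether the predicted click point falls inside the target element's bounding box. The sketch below assumes that scoring convention; the benchmark's exact protocol may differ in details:

```python
def click_hits_element(click_xy, bbox):
    """True if a predicted click point lands inside the target UI
    element's bounding box (x0, y0, x1, y1)."""
    x, y = click_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_accuracy(predictions, targets):
    """Fraction of examples where the predicted click hits the element --
    the headline number reported for grounding benchmarks."""
    hits = sum(click_hits_element(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

Under this metric, even a perfect reasoner scores zero if perception misses a small element, which is why the high-resolution encoder matters so much for GUI tasks.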
Intended Use and Efficiency
The model is intended to be lightweight enough for deployment on modest hardware or in interactive settings where low latency is required 1. It was trained on 200 billion multimodal tokens, a significantly smaller dataset than the 1 trillion or more tokens used for larger open-weight models like Qwen-3 VL or Gemma-3 1. Microsoft positions this as pushing the "Pareto frontier" of the tradeoff between accuracy and compute costs, making it a viable option for resource-constrained environments 1.
Limitations and Failure Modes
As a 15-billion parameter model, Phi-4-reasoning-vision-15B may face limitations in tasks requiring the vast world knowledge or extreme detail processing typically found in models with 100 billion parameters or more 1. While the dynamic resolution encoder mitigates many perceptual issues, the model's effectiveness remains dependent on its ability to extract relevant information from images; Microsoft researchers noted that failure in such models often stems from perceptual errors—such as missing a small UI element—rather than a deficit in logical reasoning 1. Furthermore, while the model is capable of structured reasoning, its responsiveness in latency-sensitive scenarios or when processing extremely dense visual sequences may still be constrained by the underlying hardware's throughput 1.
Performance
Phi-4-reasoning-vision-15B is positioned as an efficiency-focused multimodal model that attempts to balance high-accuracy reasoning with lower computational requirements 1. Microsoft states that the model pushes the Pareto frontier of the trade-off between accuracy and operational costs, including latency and token generation 1. The model was trained using 200 billion tokens of multimodal data, a figure significantly lower than the one trillion or more tokens used for comparable open-weight models such as Qwen 2.5/3 VL, Kimi-VL, and Gemma-3 1.
Benchmark Evaluations
In standardized performance evaluations, the model has been tested across several multimodal reasoning datasets. On the MMMU_VAL (Massive Multi-discipline Multimodal Understanding) benchmark, the model achieved scores as high as 45.4, depending on the specific configuration of training data 1. For visual mathematical reasoning, evaluated through MathVista_MINI, the model reached a score of 45.2 when utilizing a dynamic resolution vision encoder 1.
Performance in high-resolution and computer-use scenarios was measured using the ScreenSpot and ScreenSpot-Pro benchmarks. According to Microsoft, the model achieved a score of 79.7 on ScreenSpot and 17.5 on ScreenSpot-Pro when configured with a 3,600-token limit, which corresponds approximately to 720p resolution 1. On the V*Bench visual search benchmark, the model recorded a score of 56.0 1.
Comparative Efficiency
Microsoft characterizes the model as a competitive alternative to larger or more resource-intensive systems. Internal testing by the developer suggests that the model provides accuracy levels similar to models that require ten times more compute time and output tokens 1. In an aggregate analysis of four benchmarks—ChartQA, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2—the model maintained an average accuracy of approximately 75% while remaining more token-efficient than several larger competitors 1.
Research conducted during the model's development indicated that its performance is sensitive to the composition of its training data. Specifically, Microsoft researchers found that increasing the proportion of mathematical and scientific data by a factor of three did not detract from other capabilities but instead improved performance across both math and computer-use (CUA) benchmarks 1. This suggests a cross-domain benefit where mathematical training data enhances the model's general perception and grounding abilities 1.
Safety & Ethics
The safety framework for Phi-4-reasoning-plus is based on Microsoft’s Responsible AI standards, utilizing a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to mitigate potential harms 3. During the SFT phase, Microsoft included alignment-focused data comprising safety and ethics prompts sourced from previous Phi models and licensed collections 3. These datasets were augmented with synthetic responses generated by a reference "teacher" model, which were designed to follow specific safety guidelines regarding security, sensitive topics, and respectful engagement 3.
A distinct feature of the model’s safety architecture is its use of structured reasoning blocks. The model is trained to process safety guidelines within its internal "thinking" tokens, though Microsoft notes that the model may occasionally regurgitate these internal guidelines during the reasoning process 3. To reduce user overreliance and cognitive load, the model is intentionally taught to withhold these internal chain-of-thought traces and safety-specific instructions from the final output block 3.
For the "Plus" variant, Microsoft states that outcome-based reinforcement learning is used to further refine model behavior, primarily focused on verifiable solutions in reasoning tasks 3. In the context of the multimodal variant, Microsoft identifies specific measures to prevent visual jailbreaks—attacks where adversarial images are used to bypass text-based safety filters 1. These measures involve training the model to recognize and reject harmful instructions embedded within visual data, such as documents or receipts 1.
Evaluation of the model's safety performance includes testing on the ToxiGen benchmark, which measures the detection of toxic language. According to the technical report, both Phi-4-reasoning and the "Plus" variant show a modest increase in accuracy over the base Phi-4 model in detecting toxicity, exhibiting a more balanced performance between neutral and toxic content 3.
Despite these alignment techniques, Microsoft identifies certain known risks, particularly regarding the open-weight nature of the model. While open-weight availability promotes transparency and allows for independent safety auditing, it also presents a risk that users may attempt to bypass internal safety guidelines through secondary fine-tuning 3. To address this, Microsoft provides associated safety documentation and recommended system prompts to guide responsible deployment 13.
Applications
Phi-4-reasoning-vision-15B is designed for applications requiring a balance between visual perception and logical inference, particularly in environments with limited computational resources 1. Microsoft states that the model's architecture makes it suitable for interactive settings where low latency and token efficiency are prioritized over the higher parameter counts found in larger multimodal systems 1.
Document Processing and Finance
A primary use case for the model is automated document understanding and structured data extraction. Microsoft asserts that the model can process visual inputs such as receipts, invoices, and financial statements to perform multi-step arithmetic and return formatted data 1. For example, the model has demonstrated the ability to split restaurant bills among multiple individuals, calculating individual shares and taxes while outputting the results in JSON format 1. In logistics and retail contexts, its ability to interpret laundry care symbols or identify specific objects suggests utility in automated cataloging and consumer-facing assistance 1.
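The bill-splitting behavior described above amounts to multi-step arithmetic followed by structured JSON output. The sketch below shows the kind of result such a model is expected to produce; the field names are illustrative, not a documented schema:

```python
import json

def split_bill(items, tax_rate, num_people):
    """Split a receipt evenly among diners: the multi-step arithmetic
    plus JSON formatting described above. Field names are illustrative."""
    subtotal = round(sum(items.values()), 2)
    tax = round(subtotal * tax_rate, 2)
    total = round(subtotal + tax, 2)
    per_person = round(total / num_people, 2)
    return json.dumps({
        "subtotal": subtotal,
        "tax": tax,
        "total": total,
        "per_person": per_person,
    })

# Hypothetical receipt extracted from a photo.
receipt = {"pasta": 14.00, "salad": 9.50, "soda": 3.50}
result = split_bill(receipt, tax_rate=0.10, num_people=3)
```

The point of the structured output is that a downstream application can consume the answer directly (e.g. `json.loads(result)["per_person"]`) instead of parsing free-form prose.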
Computer Use and Agentic Systems
The model is optimized for understanding and grounding elements on computer and mobile screens, which is a prerequisite for agentic AI systems designed to automate tasks across software interfaces 1. According to Microsoft, the model is trained to identify interactive elements within high-resolution screenshots, allowing it to infer sequences of actions or track changes across multiple images for navigation tasks 1. Development testing included benchmarks such as ScreenSpot, where the model's performance on user interface (UI) grounding was a primary metric for evaluating its capability to interact with digital environments 1.
Education and Scientific Reasoning
In educational contexts, the model is intended to assist with mathematics and science tasks that require visual interpretation. Microsoft states that Phi-4-reasoning-vision-15B can provide step-by-step reasoning for homework problems involving visual components, such as geometric diagrams, charts, or rendered equations from academic documents 1. Its training included specific datasets aimed at improving the interpretation of scientific visual data, such as population charts and technical drawings 1.
Deployment Scenarios
Due to its 15-billion parameter size and mid-fusion architecture, the model is intended to run on modest hardware compared to larger, proprietary models 1. This makes it a candidate for local deployment or edge computing scenarios where operational costs or data privacy requirements prohibit the use of massive cloud-based systems 1. Microsoft notes that the model is specifically targeted at the "Pareto frontier" of accuracy versus compute cost, aiming to provide high-reasoning capabilities without the high token consumption typical of many recent multimodal language models 1.
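A back-of-the-envelope estimate gives a sense of what "modest hardware" means for a 15-billion parameter model. This counts weight storage only; activations, KV cache, and runtime overhead are ignored, and the deployment precisions shown are common options rather than documented configurations:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rule-of-thumb memory needed just to hold the model weights."""
    return n_params * bytes_per_param / 1e9

N = 15e9  # 15 billion parameters

fp16 = weight_memory_gb(N, 2)    # float16/bfloat16
int8 = weight_memory_gb(N, 1)    # 8-bit quantized
int4 = weight_memory_gb(N, 0.5)  # 4-bit quantized
```

At 16-bit precision the weights alone occupy roughly 30 GB, which already exceeds most consumer GPUs, so edge deployment of a model this size typically implies quantization (about 15 GB at 8-bit, about 7.5 GB at 4-bit).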
Reception & Impact
The release of Phi-4-reasoning-vision-15B in March 2026 has been noted for its adherence to Microsoft’s strategy of releasing "open-weight" models, a classification that provides access to the model parameters via platforms like HuggingFace and GitHub while typically withholding the full training code and datasets 1. This approach has been a subject of discussion within the open-source community, where the distinction between open-weight and strictly open-source software is often debated 1. Microsoft states that the release is intended to contribute practical insights and best practices to the community regarding the construction of smaller, efficient multimodal reasoning models 1.
Industry analysts have characterized Microsoft's "reasoning-first" strategy as an intentional countertrend to the industry-wide focus on scaling parameter counts and token consumption 1. By utilizing approximately 200 billion tokens for training—significantly less than the one trillion or more tokens used for comparable models such as Qwen 2.5 VL and Gemma 3—Microsoft positions the model as a more efficient alternative for organizations prioritizing lower training and inference costs 1. Microsoft asserts that the model pushes the "Pareto frontier" of the trade-off between accuracy and compute power, claiming performance levels that are competitive with models that require ten times more compute time and token generation 1.
The economic and technological impact of the model is largely tied to its potential for deployment in resource-constrained environments 1. Microsoft states that the 15-billion parameter architecture is lightweight enough to run on "modest hardware," which has encouraged its evaluation for edge computing and mobile application development 1. In particular, its specialized performance in computer-use and Graphical User Interface (GUI) grounding suggests potential implications for the development of AI agents capable of navigating mobile and desktop screens with low latency 1. According to the developer, this focus on token efficiency and reduced latency makes the model more suitable for interactive downstream deployments than larger, more resource-intensive multimodal systems 1.
Furthermore, Microsoft’s research into data composition—specifically the finding that increasing mathematics and science data can improve performance on unrelated tasks like computer use—has been presented as a lesson for the broader AI field 1. This data-centric approach aims to enable broad general-purpose capabilities without the need for extremely large training datasets, which may influence how future small language models (SLMs) are developed for mobile and edge platforms 1.
Version History
The Phi-4-reasoning-vision-15B model was released by Microsoft on March 4, 2026, as an expansion of the Phi-4 family of small language models 1. It was made available as an open-weight model through Microsoft Foundry, HuggingFace, and GitHub 1. The model's development was motivated by a trend in vision-language models (VLMs) toward increasing parameter counts and token consumption, which the developers sought to address by focusing on efficiency and structured reasoning in a more compact 15-billion parameter architecture 1.
Evolution from Language to Multimodal Reasoning
Phi-4-reasoning-vision-15B represents a technical progression from the text-only Phi-4 and Phi-4-Reasoning backbones 1. While the core Phi-4 model was trained on 400 billion unique tokens and the specialized Phi-4-Reasoning variant on 16 billion tokens, the multimodal "plus" variant was trained on approximately 200 billion tokens of multimodal data 1. Microsoft states that this training volume is significantly lower than the one trillion tokens or more typically used for comparable open-weight multimodal models, such as Qwen 2.5 VL or Gemma 3 1.
Architectural Iterations and Refinement
During the development process, Microsoft conducted ablation studies to determine the most effective methods for fusing visual and textual information 1. Researchers opted for a mid-fusion architecture, which projects visual tokens into the embedding space of a pretrained large language model (LLM), rather than an early-fusion approach that processes patches and text in a single transformer 1.
To improve the model's performance on high-resolution and information-dense images, several vision encoders and resolution handling techniques were tested, including Dynamic S2 and multi-crop methods 1. The final version utilizes the SigLIP-2 Naflex variant as its vision encoder 1. Microsoft researchers selected this dynamic resolution technique after testing showed it provided a substantial boost in accuracy on high-resolution benchmarks such as ScreenSpot-Pro, which focuses on user-interface grounding 1. The version history is further characterized by the use of targeted synthetic data to improve text-rich visual reasoning and the programmatic fixing of formatting errors found in traditional open-source datasets 1.
Sources
- 1“Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model”. Retrieved March 25, 2026.
Phi-4-reasoning-vision-15B is a 15 billion parameter open‑weight multimodal reasoning model... It is a broadly capable model that allows for natural interaction for a wide array of vision-language tasks and excels at math and science reasoning and understanding user-interfaces... We build on the SigLIP-2 vision encoder and the Phi-4-Reasoning backbone.
- 2“Phi-4-reasoning Technical Report”. Retrieved March 25, 2026.
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model... Trained via supervised fine-tuning of Phi-4 on carefully curated set of “teach-able” prompts... and reasoning demonstrations generated using o3-mini. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning.
- 3“Phi-4-reasoning-vision-15B”. Retrieved March 25, 2026.
Phi-4-reasoning-vision-15B, often referred to in the context of its multimodal reasoning capabilities as a 'reasoning plus' model, is a 15-billion parameter open-weight multimodal model developed by Microsoft. Released on March 4, 2026.
- 4“Phi-4-reasoning-vision-15B Technical Report”. Retrieved March 25, 2026.
Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.
- 7“Phi-Reasoning: Once again redefining what is possible with small and efficient AI - Microsoft Research”. Retrieved March 25, 2026.
Phi-4-reasoning is a 14-billion parameter model specialized in complex reasoning tasks. It is trained using supervised finetuning (SFT) on diverse prompts and reasoning demonstrations from o3-mini. The model generates detailed reasoning chains and leverages inference-time compute effectively. Phi-4-reasoning-plus, an enhanced version...
- 8“microsoft/Phi-4-reasoning-plus”. Retrieved March 25, 2026.
Phi-4-reasoning-plus Model Card. Phi-4-reasoning Technical Report.
- 9“Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning”. Retrieved March 25, 2026.
Today we are introducing Phi-4, our 14B parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math...
- 10“Introducing Phi-4-Reasoning-Vision to Microsoft Foundry”. Retrieved March 25, 2026.
Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason...
- 11“Microsoft launches Phi-4-Reasoning-Plus, a small, powerful, open weights reasoning model!”. Retrieved March 25, 2026.
The release demonstrates that with carefully curated data and training techniques, small models can deliver strong reasoning performance.
- 12“Microsoft Releases Phi-4-Reasoning-Vision-15B - MarkTechPost”. Retrieved March 25, 2026.
Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal Model for Math, Science, and GUI Understanding.
- 14“Microsoft Releases Phi-4-Reasoning-Vision-15B Open-Weight Multimodal AI Model - MLQ.ai”. Retrieved March 25, 2026.
Microsoft Research has released Phi-4-reasoning-vision-15B, a 15-billion parameter open-weight multimodal model designed for vision-language tasks...
- 15“Papers Explained 541: Phi 4 Reasoning Vision 15B | by Ritvik Rastogi”. Retrieved March 25, 2026.
Phi-4-reasoning-vision-15B is a compact open-weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs...
- 17“Benchmarking Best Open-Source Vision Language Models: Gemma 3 vs. MiniCPM vs. Qwen 2.5 VL - Clarifai”. Retrieved March 25, 2026.
Benchmarking Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct for latency, throughput, and scalability.
- 21“Reasoning models | OpenAI API”. Retrieved March 25, 2026.
Learn how to use OpenAI reasoning models in the Responses API, choose a reasoning effort, manage reasoning tokens, and keep reasoning state across turns.
- 23“OpenAI o1 and o3 Explained: How “Thinking” Models Work”. Retrieved March 25, 2026.
Explaining the most advanced LLMs and why they are smarter than the previous generation.
- 24“Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1 - Vellum AI”. Retrieved March 25, 2026.
Anthropic just dropped Claude 3.7 Sonnet, and it’s a textbook case of second-mover advantage. With OpenAI’s o1 and DeepSeek’s R1 already setting the stage for reasoning models, Anthropic had time to analyze what worked and what didn’t...
- 25“Claude 3.7 Sonnet thinking vs. Deepseek r1 - Composio”. Retrieved March 25, 2026.
Anthropic finally broke the silence and released Claude 3.7 Sonnet, a hybrid model that can think step-by-step...
- 26“Claude 3.7 Sonnet scores less than DeepSeek R1 on ARC-AGI”. Retrieved March 25, 2026.

