DeepSeek finally opened its eyes
DeepSeek just added image and video understanding to its flagship chatbot for the first time. There's a new image-recognition mode sitting right next to expert and flash on the chat interface, and it landed just days after the V4 release. South China Morning Post put it perfectly: the whale can now see.
Here's why this is a bigger deal than it sounds. DeepSeek was the last major frontier player whose consumer product was still text-only. GPT, Gemini, Claude, Qwen, Kimi, GLM: everyone else had eyes already. DeepSeek built its whole reputation on being the cheap open-weight model that punches above its price, but it was taking in the world through text alone. That gap just closed.
For anyone building agents, this matters even more than it does for chatbot users. Computer use, screen reading, document and chart parsing, UI navigation: all of it needs vision. A text-only model can't drive a browser from screenshots, and it can only read a PDF as extracted text, without the layout, charts, or scanned pages. DeepSeek going multimodal means the cheapest serious model on the market can now do the perception half of agent work, not just the reasoning half.
The rollout is a limited release to select users for now, through chat.deepseek.com. But the direction is clear: the price-performance leader on the open-weight side just stopped being half-blind. If you were waiting for a cheap multimodal base to build agents on, the wait got shorter.