Ai2 Releases Molmo 2, an Open Video Model That Knows Where the Action Happens
SOURCE: THEAIECONOMY.SUBSTACK.COM
DEC 16, 2025

Signage at the Seattle, Wash., offices of the Allen Institute for AI (Ai2), taken on Dec. 9, 2025. Credit: Ken Yeung
Ai2 has unveiled Molmo 2, the latest iteration of its open-source vision-language model (VLM). Arriving over a year after the original, this state-of-the-art update brings the most notable upgrades yet: support for multiple images and video, and grounding. The next-generation Molmo can now count and track objects or actions within videos. And just like its predecessor, Molmo 2 comes in three variants—4B and 8B models built on Qwen 3, and a 7B model based on Ai2’s own Olmo—though the parameter sizes differ from the original.
“Perhaps the most exciting part…is that our 7B model now outperforms last year’s 72B model,” Ranjay Krishna, Ai2’s computer vision research lead, remarks in a briefing last week. “It’s a huge reduction in the amount of parameters you need to be able to get really good capabilities.”
He highlights that Molmo 2 performs on par with video language models such as Meta’s PerceptionLM while being trained on a fraction of the data: PerceptionLM was trained on 72.5 million videos, while Ai2 used 9.19 million, roughly one-eighth as many.
“With Molmo 2, models can move beyond simple descriptive answers to pinpoint exactly where and when an event occurs in space and in time,” the nonprofit AI lab proclaims in a blog post.
Developers, researchers, and scientists can now download everything related to Molmo 2, including its weights, training data, and training code. In addition, Ai2 is releasing specialized versions tailored for structured tasks, its Molmo 2 dataset, detailed data recipes, and open benchmarks and tooling for grounded video evaluation. All of this is available on GitHub and Hugging Face under a permissive license.
What’s more, Ai2 is open-sourcing captions for over 100,000 unique videos, along with 431,000 clip-level captions. This large, diverse, and descriptive corpus is the lab’s contribution to the community to “study, benchmark against, and extend.”
Released in 2024, Molmo is a fully open-source model with several sizes: 1B, 7B, and 72B. To Ai2’s surprise at the time, it outperformed many closed-source models, such as OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro. “As soon as we released [it], it blew up online. There’s quite a lot of excitement,” Krishna shares. “There were a lot of people who immediately started comparing our model against Meta’s Llama 3.2 that came out on the very same day. We’re very excited because we ended up outperforming Meta even though [it] had a lot more data.”
He adds that, to date, Molmo has been downloaded more than three million times. The model has also been received positively by the research and scientific community and has led to the development of new capabilities not only in open-source research but also in proprietary models.

A graph showing how Ai2’s Molmo image-pointing capability has influenced other AI models in the industry. Credit: Ai2
“When [it] was released, our model was able to point and ground its reasoning capabilities directly in the pixels themselves. That capability is something that’s been adopted by all of the proprietary models. So GPT [and] Gemini, these are all models that can point and refer to things in the pixels nowadays. So, we’re pushing the boundaries of not just what open science can do, but also what closed science can be capable of,” Krishna elaborates.
Similar to its predecessor, Molmo 2 maintains an architectural design that combines an LLM with an image encoder. The encoder converts images or video frames into numerical representations before the language model processes the data and the text prompt to generate an answer. In other words, the LLM is at the center of the whole operation and reasons about everything at once, eliminating the need for separate vision and language systems to interact with each other.
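The single-sequence design described above can be sketched in a few lines. This is a toy illustration with stand-in stubs (the names, dimensions, and encoders are mine, not Ai2's code); it only shows the data flow: each frame is encoded into visual tokens, concatenated with the prompt tokens, and handed to one LLM.

```python
import numpy as np

EMBED_DIM = 64          # assumed embedding width for this sketch
TOKENS_PER_FRAME = 16   # assumed visual tokens per frame

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Stub image encoder: maps one H x W x 3 frame to visual tokens."""
    rng = np.random.default_rng(int(frame.sum()) % (2**32))
    return rng.standard_normal((TOKENS_PER_FRAME, EMBED_DIM))

def embed_text(prompt: str) -> np.ndarray:
    """Stub text embedder: one token vector per whitespace word."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(prompt.split()), EMBED_DIM))

def answer(frames: list[np.ndarray], prompt: str) -> str:
    # 1. Encode every frame independently into visual tokens.
    visual = np.concatenate([encode_frame(f) for f in frames], axis=0)
    # 2. Concatenate with the tokenized prompt: the LLM sees one sequence
    #    and reasons over vision and language jointly, with no separate
    #    vision and language systems talking to each other.
    sequence = np.concatenate([visual, embed_text(prompt)], axis=0)
    # 3. Stub "LLM": here we just report the sequence it would consume.
    return f"LLM input: {sequence.shape[0]} tokens of width {sequence.shape[1]}"

frames = [np.zeros((32, 32, 3)) for _ in range(4)]
print(answer(frames, "How many penguins are on the iceberg?"))
# -> LLM input: 71 tokens of width 64  (4 frames x 16 tokens + 7 words)
```

The point of the sketch is the shape of the problem: everything, visual or textual, ends up as rows of one token matrix before the language model ever runs.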
Even though it’s multimodal, Molmo 2 isn’t a generative model in the vein of Luma’s Ray3, Adobe Firefly, OpenAI’s Sora, Google’s Imagen, Midjourney, Pika, Runway, or Stable Diffusion. Instead, it’s about understanding visual scenes. “Users take videos, or they might have videos in their data collection—maybe they want to analyze it, understand what happened in the video, summarize [it], [or] understand how a player moved in a particular game. Those are the kinds of applications we’re going after,” Krishna clarifies. “The input is the video and the output is a reasoning about the contents of that video.”
What are the differences between the three versions of Molmo 2? According to Ai2:
Molmo 2 (8B) is built on Qwen 3 and is optimized for video grounding and QA.
Molmo 2 (4B) is also Qwen 3-based, but is made for efficiency.
Molmo 2-O (7B) is the Olmo-backed model that caters to researchers who “want full control over every part of the stack,” including the vision encoder, connector, and the language model.
The Molmo 2 model family undergoes two stages of training. The first is image-captioning and image-pointing pretraining aimed at aligning the visual encoder and LLM. This process is made up of 60 percent captioning data, 30 percent image pointing, and 10 percent natural language data.
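The stage-one mixture can be written down as a simple weighted sampler. Only the ratios (60 percent captioning, 30 percent image pointing, 10 percent natural language) come from Ai2's description; the category names and sampling code are an illustrative sketch, not Ai2's pipeline.

```python
import random

# Stage-1 pretraining mixture as described by Ai2 (ratios from the article;
# the keys and the sampler itself are illustrative).
STAGE1_MIX = {
    "image_captioning": 0.60,
    "image_pointing":   0.30,
    "natural_language": 0.10,
}

def sample_batch_categories(batch_size: int, seed: int = 0) -> list[str]:
    """Draw a batch of task categories according to the mixture weights."""
    rng = random.Random(seed)
    names = list(STAGE1_MIX)
    weights = [STAGE1_MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

batch = sample_batch_categories(1000)
print({name: batch.count(name) for name in STAGE1_MIX})
# roughly 600 / 300 / 100 over a 1,000-example batch
```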
The second stage focuses on joint supervised fine-tuning (SFT). Ai2 combines its multimodal dataset—spanning images, video, and multi-image inputs—with PixMo, Tulu, and other open-source image and video datasets to refine Molmo 2. Here’s a look at what else happens behind the scenes:
Multiple data types—categorized by captions, image and video QA, pointing, tracking, and natural language processing—are deliberately mixed together.
Larger datasets are prevented from monopolizing training, helping to minimize the chances of Molmo adopting narrow or biased behaviors.
The model observes samples from key moments in videos (up to 128 frames at no more than two frames per second), not the entire footage. Doing so ensures efficient training while preserving context.
Visual information is compressed before being passed to the language model, allowing longer videos to be processed without incurring overwhelming compute costs.
Molmo 2 can connect what it observes across multiple frames and images, improving its ability to track objects and understand events over time.
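The frame-sampling rule above (at most 128 frames, never denser than 2 fps) is easy to state as code. The function below is my own sketch of such a rule, under the assumption that samples are spread evenly over the clip; Ai2's actual keyframe selection will differ.

```python
MAX_FRAMES = 128     # cap on total sampled frames (from Ai2's description)
MAX_RATE_FPS = 2.0   # cap on sampling density (from Ai2's description)

def pick_frame_times(duration_s: float) -> list[float]:
    """Return timestamps (seconds) to sample from a video of given length."""
    # Cap the rate at 2 fps, then cap the total count at 128 frames.
    n = min(MAX_FRAMES, max(1, int(duration_s * MAX_RATE_FPS)))
    # Spread the samples evenly over the full duration to keep context.
    step = duration_s / n
    return [round(i * step + step / 2, 3) for i in range(n)]

print(len(pick_frame_times(10)))    # 10 s clip  -> 20 frames (2 fps)
print(len(pick_frame_times(600)))   # 10 min clip -> capped at 128 frames
```

Note how the two caps trade off: short clips are limited by the 2 fps rate, long ones by the 128-frame budget, which is what keeps training cost bounded regardless of video length.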
Ai2 says Molmo 2 uses a novel token-weighting approach to learn tasks more evenly, applies sequence packing and message-tree scheduling to accelerate training, and employs bi-directional attention between visual tokens to better connect information across frames and images.
Approximately 102 datasets were used to train Molmo 2, drawn from openly licensed video datasets, Creative Commons and YouTube videos, and academic collections. The company says Molmo 2 was trained entirely without distillation from proprietary models, making its data and results fully transparent and reproducible and demonstrating that strong performance is possible without relying on closed systems.
“A lot of the open-weight models that you see coming out from China—for example, Qwen—a lot of them do distill from GPT and other sources, which makes it difficult to understand why these models behave in a certain way, sometimes because it’s so opaque as to what went into their training data and what went into the training data of the models they were distilled from,” Krishna shares.
In Ai2’s case, “all of the training data is publicly available, so at a very high level, the architecture looks a lot similar to Molmo and all the other models that are out there. It takes a video, encodes each frame individually, and tokenizes all of them. Then it converts it into something that an LLM can understand and get all of that, along with some question, tokenized and fed into the LLM, and outputs either answers, points, tracks, and other kinds of possible combinations of those three.”
A signature feature of the original Molmo was pointing, the model’s ability to identify and indicate the exact location of an object or event. Initially, it only worked with images, but now it works with videos.
“There’s a lot of neat applications [for pointing], especially in having users understand what the model is looking at and thinking about,” Chris Clark, the lead scientist on Ai2’s Molmo team, explained to The AI Economy last October. “If Molmo counts, it’ll point to each thing it counts, which actually makes it better at counting, because it’s like a human—if you point to things as you count, maybe you do better. But it also means that if you, as a user, look at that, you can see what went wrong. What did it miss? Did it see something I didn’t? You can’t do that at all with just a language model because it’ll just give you a number, and you have no idea what it was actually counting. You can’t tell.”
In addition to pointing, Molmo 2 can also track objects as they move through multiple frames in a video. While the two capabilities are closely related, they serve different purposes in how the model understands visual content.
Ai2 illustrates the distinction using a video clip showing penguins standing on an iceberg. With pointing, Molmo 2 can identify the location of a specific penguin in a single frame, essentially answering the question, “Where is this penguin at this moment in the video?”
Tracking extends this capability. Instead of marking a single moment, Molmo 2 follows the same penguin across multiple frames as it moves through the scene. The model assigns each penguin a unique ID, allowing it to keep track of an individual even if it briefly moves off-screen or is obscured by another object.
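The pointing-versus-tracking distinction can be made concrete with a toy data model: a point answers "where is it in this frame?", while a track carries a stable per-object ID across frames, surviving short gaps such as occlusion. The classes below are my own sketch, not Molmo 2's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Point:
    frame: int
    x: float   # normalized image coordinates in [0, 1]
    y: float

class Tracker:
    def __init__(self):
        # One list of points per object ID: this is the "track".
        self.tracks: dict[int, list[Point]] = {}

    def observe(self, obj_id: int, point: Point) -> None:
        # The same ID may reappear after frames with no observation,
        # which is how an object survives occlusion in this sketch.
        self.tracks.setdefault(obj_id, []).append(point)

tracker = Tracker()
tracker.observe(1, Point(frame=0, x=0.20, y=0.50))   # penguin #1 appears
tracker.observe(1, Point(frame=1, x=0.25, y=0.50))
# frames 2-3: penguin #1 hidden behind another penguin (no observation)
tracker.observe(1, Point(frame=4, x=0.40, y=0.52))   # same ID, same penguin

frames_seen = [p.frame for p in tracker.tracks[1]]
print(frames_seen)   # [0, 1, 4] -- one identity persists across the gap
```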
That said, all the examples featured by Ai2 had a small group of objects. What happens if there are larger groups? Could Molmo 2 assist with crowd control at an event, such as an inauguration, or in Times Square for New Year’s Eve? Krishna reiterates that because this is a new capability and experimental, Ai2 is “starting very small. Right now, I think we can handle…tracking ten things. There’s no technological limit to this. It just requires more data, more examples of really crowded scenes, and then exposing the model to that kind of data.”
Still, he asserts that, compared to Molmo 2’s peers, the model’s tracking is “where we excel the most” and that “this is something that most models really struggle with, whereas we’re outperforming by quite a large margin.” He predicts that other open-source models will adopt this capability quickly now that Molmo 2 has been released. Still, Ai2 plans to continue investing in tracking. “There’s still a lot to do. This is a completely new capability, so it’s very experimental,” Krishna acknowledges. “There’s a ton of things that can still go wrong, but we expect that this is something that we’re going to continue to improve on.”

Credit: Ai2
Another point Krishna’s team highlights is efficiency: Molmo 2 outperforms its predecessor and other leading models—with significantly fewer parameters. “The secret sauce behind a lot of the things that we’re building really comes down to data—high-quality data matters a lot,” he admits.
In a chart comparing Molmo 1 and Molmo 2 on image results, Ai2 shows that Molmo 2 7B beats the much larger Molmo 1 72B “on average across different image tasks.”
“It really comes down to the quality of the data itself,” Krishna points out. “We put a lot of effort into vetting really good annotators and also training the annotators so they give us the kind of data that’s missing from the internet. A lot of those other proprietary models, what they do is they just hope that the data is available online, they download as much of it as possible, and then they train on that data. In our case, it’s very different. We’re trying to specifically go after and annotate the data that’s missing.”
Models like Molmo 2 open the door to a wide range of applications that benefit from AI with pointing and tracking—from security cameras and wearable devices to smart glasses, robotics, wildlife monitoring, automated dataset annotations, video analytics, converting educational videos into detailed step-by-step written instructions, multi-image processing, and more.
Sports may be one of the clearest ways to see Molmo 2’s capabilities in action. Imagine the model powering instant replays or video review systems, perhaps giving tools like AWS’ NFL Next Gen Stats a run for their money. Ai2 leaned heavily on sports footage throughout the media briefing, using clips from a Seattle Mariners-Los Angeles Angels game to identify players and visually flag key moments on the field. In another example, the team prompted Molmo 2 to analyze a soccer clip and explain why a goal was conceded, effectively simulating a coaching scenario.
Don’t mistake Molmo 2 for a traditional computer vision model—it isn’t. While Ai2 highlights its pointing and tracking capabilities, this LLM can extract insights from a single image, multiple images, or videos. Krishna’s team also stresses that Molmo 2 is fundamentally a conversational model, designed to work with natural language through a chat interface. It can understand and reason about the meaning of what it sees, bridging visual content and semantic language.
Robotics is another application the team sees as benefiting from Molmo 2. While the briefing showcased recorded footage, there is potential in using the model for livestreams and events. Krishna remarks that Ai2 hopes Molmo 2 is used to help robots navigate and manipulate objects. “Obviously, your observations are coming in as you’re taking action,” he comments. “One thing you can do is…every time you get a new input frame, you rerun the whole model. That’s the easiest solution, but it’s expensive because you need to preprocess the entire video again. There are methods that people have come up with in terms of processing video over time. Those are directions we’re looking into next.”
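The cost problem Krishna describes for live streams can be made concrete with a back-of-the-envelope count: rerunning the whole model on every new frame means re-encoding all earlier frames at each step, while caching per-frame encodings pays for each frame exactly once. The functions below are illustrative arithmetic, not a description of Ai2's planned solution.

```python
def encoder_calls_rerun_everything(num_frames: int) -> int:
    """Naive approach: at step t, re-encode frames 1..t from scratch."""
    return sum(t for t in range(1, num_frames + 1))   # 1 + 2 + ... + T

def encoder_calls_with_cache(num_frames: int) -> int:
    """Cache frame encodings: each new frame is encoded exactly once."""
    return num_frames

T = 100  # a 50-second stream sampled at 2 fps
print(encoder_calls_rerun_everything(T))  # 5050 encoder calls (quadratic)
print(encoder_calls_with_cache(T))        # 100 encoder calls (linear)
```

The quadratic blow-up of the naive approach is why streaming video methods that process frames incrementally are, as Krishna puts it, a direction the team is looking into next.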
In a surprise announcement, Krishna reveals that Molmo 2 will soon be integrated into MolmoAct, Ai2’s model that helps robots navigate the real world with greater spatial awareness.
“These are really important…foundational capabilities that now the models can ground in the pixel, and this becomes the jumping off point for…wearables, maps, all of these things,” Sophie Lebrecht, Ai2’s chief operating officer, states.

Credit: Ai2
Now that we know what Molmo 2 can do, let’s take a look at its performance when compared against other peer models. Here’s what Ai2’s evaluations turned up:
In its VideoPoint benchmark, which tests how well an AI model can pinpoint precisely where and when something happens in a video, Molmo 2’s 8B model outperformed Google’s Gemini 2.5 Pro and OpenAI’s GPT-5.
In the BURST-VideoCount benchmark, which looks at how accurate AI models are in counting objects or events across an entire video, Ai2 says Molmo 2’s 8B surpasses GPT-5 and Gemini 2.5 Pro. And when tested on the Molmo 2-VideoCount dataset, Molmo 2 is “highly competitive” with other strong models, though it slightly underperforms Gemini 2.5 Pro.
When it comes to multi-domain object tracking, Molmo 2 4B and 8B lead not only proprietary models, but also open-weight vision-language models, such as Qwen3-VL-8B, and specialized open video models, including Sa2VA-8B and Molmo+SAM 2.
Molmo 2 8B performs better at understanding, reasoning about, and analyzing short video clips than other open-weight models, achieving the strongest overall short-video score. Its 4B sibling provides nearly comparable performance with greater efficiency for video QA.
When it comes to processing longer videos, Molmo 2 reportedly holds its own, though larger open-weight or proprietary models remain the top performers. “Long videos are really hard to support, and training models for long video understanding requires a lot of compute,” Krishna explains. “This is something that we want to improve on, but it does require a lot more compute than we’re comfortable spending at the moment.”
Molmo 2 also performs very well on image and multi-image benchmarks. Ai2 says its 4B and 8B models outperform “previous open models we evaluated” and achieve higher average scores than open-weight baselines like Qwen3-VL-8B and InternVL3.5-8B.
When it comes to visual question answering (VQA), Molmo 2 8B achieves “state-of-the-art results” compared to fully open and open-weight models. Still, larger open-weight systems like GLM-4.1V-9B dominate on some multi-image reasoning benchmarks.
Both Molmo 2 4B and 8B are “strong” on counting-heavy benchmarks, a test of a model’s ability to accurately count objects or events in images or videos.
Molmo 2 “lags slightly” on open-weight reasoning benchmarks, though Ai2 attributes this to its model having less multimodal supervision. That said, it performs competitively with most open-weight models in multi-image tests, but comes up short against similarly sized systems.
Lastly, Molmo 2 is outperformed by GLM-4.1V-9B on the most challenging multimodal tasks.

Credit: Ai2
Ai2 believes Molmo 2 embodies its mission of delivering state-of-the-art AI that’s fully open, transparent, and accessible for research and real-world innovation. The nonprofit AI lab had set “aggressive” goals for itself: to move the industry towards open science and build with transparent, open tools. Now that Ai2 has built models for text, images, video, and audio, it’s setting its sights on 2026. Chief Executive Ali Farhadi hopes innovations in robotics, streaming, and other sectors will be introduced. Moreover, he expects a new unified model capable of reasoning across all modalities to be developed.
“We’ll figure out what to call it, but it’s going to be the same style of fully transparent, fully open, key enabler in the market,” Farhadi states. “The reaction of the market to our models has been phenomenal. We surpassed 21 million downloads…we’re getting very, very close to three billion live queries across all the live systems that we have at Ai2 across [text, images, and video].”
He believes the case for open-source AI has been made, but he hopes for greater adoption next year. “There’s a lot of demand for it. There’s a lot of momentum,” Farhadi says. “We hope to see more of a true open source approach, the same way open source software development revolutionized software development. We do believe that open-source—truly open-source—AI will win this race and will become the standard for all practices. That’s the best thing that could happen to AI because it gives us more transparency, certifiability, understanding…and also the fastest, more advanced innovation rate beyond what small groups could do independently in isolation.”
Ai2’s latest model is now openly accessible. As mentioned earlier, Molmo 2, along with its weights and training code, is available on both GitHub and Hugging Face. You can also try it out on Ai2’s developer playground, where you can upload video clips up to a few seconds long and 50 MB in size to experiment with pointing and tracking, as well as see exactly where the model is looking as it answers.
The time and size constraints aren’t technical in nature—they’re practical decisions to avoid high compute costs from processing large video files.
But the playground also supports multi-image workflows, so you can upload a series of images to test the model’s capabilities.
Molmo 2 will soon be available through an API, though when exactly remains unclear.