Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
A team at the AI development platform Hugging Face has released what it claims are the smallest AI models that can analyze images, short videos and text.
The models, SmolVLM-256M and SmolVLM-500M, are designed to work well on “constrained devices” such as laptops with less than about 1 GB of RAM. The team says they’re also ideal for developers trying to process large amounts of data very cheaply.
SmolVLM-256M and SmolVLM-500M have just 256 million and 500 million parameters, respectively. (A model's parameter count roughly corresponds to its problem-solving abilities, such as its performance on math tests.) Both models can perform tasks such as describing images or video clips and answering questions about PDF documents and their contents, including scanned text and diagrams.
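To put those parameter counts next to the roughly 1 GB RAM figure, here is a back-of-the-envelope sketch of how much memory the weights alone would occupy at a few common numeric precisions. It assumes the nominal counts of 256 million and 500 million parameters from the model names, and it ignores activations, caches and runtime overhead, so real usage will be higher. The function name and the precision table are illustrative, not part of any Hugging Face API.

```python
# Rough, weights-only memory estimate (assumption: nominal parameter counts;
# activations, caches and framework overhead are NOT included).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

# Nominal sizes taken from the model names, not measured values.
MODELS = {"SmolVLM-256M": 256_000_000, "SmolVLM-500M": 500_000_000}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in MODELS.items():
        for dtype in BYTES_PER_PARAM:
            print(f"{name} {dtype}: ~{weight_memory_gb(params, dtype):.2f} GB")
```

Under these assumptions, both models' weights come to about 1 GB or less at 16-bit precision, which is consistent with the team's focus on constrained devices.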
To train SmolVLM-256M and SmolVLM-500M, the Hugging Face team used The Cauldron, a collection of 50 “high-quality” image and text datasets, and Docmatix, a set of file scans paired with detailed captions. Both were created by Hugging Face’s M4 team, which develops multimodal AI technologies.
The team claims that SmolVLM-256M and SmolVLM-500M outperform the much larger Idefics 80B model on benchmarks including AI2D, which tests models’ ability to analyze school-level science diagrams. SmolVLM-256M and SmolVLM-500M are available on the web and for download from Hugging Face under the Apache 2.0 license, a permissive license that allows use, modification and redistribution, including for commercial purposes, subject to its attribution and notice terms.
Smaller models such as SmolVLM-256M and SmolVLM-500M may be inexpensive and versatile, but they may also contain flaws that are less pronounced in larger models. A recent study by Google DeepMind, Microsoft Research and the Mila research institute in Quebec found that many small models perform worse than expected on complex reasoning tasks. The researchers speculated that this may be because smaller models recognize surface-level patterns in the data but have difficulty applying that knowledge to new contexts.