A glimpse at how multimodal AI will transform robotics

By Editor-In-Chief | February 21, 2025
The newly announced Magma is a multimodal AI model that enables agentic tasks ranging from UI navigation to robotic manipulation.

Magma – the work of researchers from Microsoft, the University of Maryland, the University of Wisconsin-Madison, KAIST, and the University of Washington – expands the capabilities of traditional Vision-Language (VL) models by introducing groundbreaking features for action planning, spatial reasoning, and multimodal understanding.

The new-generation multimodal foundation model not only retains the verbal intelligence of its VL predecessors but introduces advanced spatial intelligence. It’s capable of understanding visual-spatial relationships, planning actions, and executing them with precision.

Whether navigating digital interfaces or commanding robotic arms, Magma can accomplish tasks that previously required separate domain-specific AI models.

According to the research team, Magma’s development was guided by two principal goals:

  • Unified abilities across the digital and physical worlds: Magma integrates capabilities for digital environments like web and mobile navigation with robotics tasks, which fall squarely in the physical domain.
  • Combined verbal, spatial, and temporal intelligence: The model is designed to analyse images, videos, and text inputs while converting higher-level goals into concrete action plans.

Innovative training techniques  

Magma achieves its advanced capabilities through a novel pretraining framework underpinned by two core paradigms: Set-of-Mark (SoM) and Trace-of-Mark (ToM). These methods focus on grounding actions effectively and planning future movements based on visual and temporal cues.

Set-of-Mark (SoM): Action grounding

SoM is pivotal for action grounding in static images. It involves labelling actionable visual objects, such as clickable buttons in UI screenshots or robotic arms in manipulation tasks, with numeric markers. This enables Magma to precisely identify and target visual elements for action, whether in user interfaces or physical manipulation settings.  
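
The marker scheme can be sketched in a few lines. The box coordinates, the centre-point grounding convention, and the prompt format below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of Set-of-Mark (SoM) prompting: candidate actionable
# regions (e.g. detected UI buttons) are assigned numeric marks, and the
# model acts by referring to a mark instead of raw pixel coordinates.

def set_of_mark(boxes):
    """Assign a numeric mark to each candidate actionable region.

    boxes: list of (x1, y1, x2, y2) tuples for detected elements.
    Returns a dict mapping mark id -> box, plus a textual prompt snippet.
    """
    marks = {i + 1: box for i, box in enumerate(boxes)}
    lines = [f"[{i}] region at {box}" for i, box in marks.items()]
    return marks, "\n".join(lines)

def ground_action(marks, chosen_mark):
    """Resolve the model's chosen mark back to pixel coordinates
    (here: the centre of the marked box, a common convention)."""
    x1, y1, x2, y2 = marks[chosen_mark]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

buttons = [(10, 10, 110, 40), (10, 60, 110, 90)]  # two detected buttons
marks, prompt = set_of_mark(buttons)
# If the model answers "click mark 2", we ground it to a click point:
click_xy = ground_action(marks, 2)
```

The key point is that the model only needs to emit a small integer, while grounding back to coordinates stays deterministic.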

Trace-of-Mark (ToM): Action planning

For dynamic environments, ToM trains the model to recognise temporal video dynamics, anticipate future states, and create action plans. By tracking object movements, such as the trajectory of a robotic arm, ToM captures long-term dependencies in video data without being distracted by extraneous ambient changes.  

The researchers note that this method is far more efficient than traditional next-frame prediction approaches, as it uses fewer tokens while retaining the ability to foresee extended temporal horizons.
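
A toy illustration of why this is token-efficient: the prediction target is a short coordinate trace rather than full future frames. The tracker output and the target serialisation below are invented for illustration, not Magma's actual format:

```python
# Illustrative Trace-of-Mark (ToM) target construction: instead of
# predicting whole future frames, the model predicts the future positions
# ("traces") of the marks placed by SoM. Real traces would come from a
# point tracker run over video; this track is made up.

def trace_of_mark(track, horizon):
    """Split a marked object's per-frame (x, y) track into an observed
    prefix and a compact prediction target over the next `horizon` steps."""
    observed, future = track[:-horizon], track[-horizon:]
    # The target is a short token sequence of coordinates, far smaller
    # than the pixel content of `horizon` full frames.
    target = " ".join(f"({x},{y})" for x, y in future)
    return observed, target

track = [(5, 5), (6, 5), (7, 6), (8, 7), (9, 8)]  # marked gripper positions
observed, target = trace_of_mark(track, horizon=2)
```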

Pretraining data and methodology  

To equip Magma with its multimodal prowess, the researchers curated a vast, heterogeneous training dataset combining various modalities:  

  • Instructional videos
  • Robotics manipulation datasets
  • UI navigation data
  • Existing multimodal understanding datasets

Pretraining involved both annotated agentic data and unlabeled data “in the wild,” including unstructured video content. To ensure action-specific supervision, camera motion was meticulously removed from the videos, and model training focused on meaningful interactions, such as object manipulation and button clicking.  

The pretraining pipeline unifies text, image, and action modalities into a cohesive framework, laying the foundation for diverse downstream applications.
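
One plausible way to picture such a unification (our interpretation; the modality tags and serialisation format are assumptions, not Magma's actual schema) is flattening every source dataset into one tagged sequence format:

```python
# Sketch of serialising heterogeneous samples -- UI navigation, robot
# manipulation, video QA -- into a single text-like token stream so one
# model can be pretrained jointly on all of them. Tags are hypothetical.

def to_unified_sample(modality, instruction, observation_ref, action):
    """Serialise a training example from any source dataset into a
    single sequence with modality-tagged segments."""
    return (f"<{modality}> <instruction> {instruction} "
            f"<image> {observation_ref} <action> {action}")

ui = to_unified_sample("ui", "enable flight mode", "screenshot_014.png",
                      "click mark 3")
robot = to_unified_sample("robot", "place mushroom in pot", "frame_207.jpg",
                          "move_to mark 1")
```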

State-of-the-art multimodal AI for robotics and beyond

Magma’s versatility and performance were validated through extensive zero-shot and fine-tuning evaluations across multiple categories:

Robotics manipulation

In robotic pick-and-place operations and soft object manipulation tasks, evaluated on platforms such as the WidowX series and LIBERO, Magma established itself as the state-of-the-art model.

Even in out-of-distribution tasks (scenarios not covered during training), Magma demonstrated robust generalisation capabilities, surpassing OpenVLA and other robotics-specific AI models.

Videos released by the team showcase Magma in action on real-world tasks, such as placing objects like mushrooms into a pot or smoothly pushing fabric across a surface.

UI navigation

In tasks such as web and mobile UI interaction, Magma demonstrated exceptional precision, even without domain-specific fine-tuning. For example, the model could autonomously execute a sequence of UI actions like searching for weather information and enabling flight mode—the kind of tasks humans perform daily.
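
The perceive-plan-act cycle behind such sequences can be sketched as a simple loop. The `predict_action` interface, the stub model, and the scripted screens below are hypothetical stand-ins, not Magma's real API:

```python
# Minimal agentic loop for UI navigation: observe the screen, ask the
# model for the next action, apply it, repeat until the model says done.

class StubEnv:
    """Stand-in UI environment returning scripted screen states."""
    def __init__(self, screens):
        self.screens, self.i = screens, 0
    def observe(self):
        return self.screens[self.i]
    def apply(self, action):
        self.i += 1  # pretend the action advanced the UI state

class StubModel:
    """Scripted answers standing in for the model's predictions."""
    def predict_action(self, screenshot, goal, history):
        plan = {"search": "type 'weather'", "results": "click mark 1",
                "detail": "done"}
        return plan[screenshot]

def run_agent(model, env, goal, max_steps=10):
    """Repeatedly observe, predict, and act until completion."""
    history = []
    for _ in range(max_steps):
        action = model.predict_action(env.observe(), goal, history)
        if action == "done":
            break
        env.apply(action)
        history.append(action)
    return history

history = run_agent(StubModel(), StubEnv(["search", "results", "detail"]),
                    "check weather")
```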

When fine-tuned on datasets like Mind2Web and AITW, Magma achieved leading results on digital navigation benchmarks, outperforming earlier domain-specific models.

Spatial reasoning 

Magma exhibited strong spatial reasoning, outperforming other models, including GPT-4, on complex evaluations. Its ability to understand verbal, spatial, and temporal relationships across multimodal inputs represents a significant stride in general intelligence capabilities.

Video Question Answering (Video QA)

Even with access to a smaller volume of video instruction tuning data, Magma excelled at video-related tasks, such as question-answering and temporal interpretation. It surpassed state-of-the-art approaches like Video-Llama2 on most benchmarks, proving its generalisation power.

Implications for multimodal AI 

Magma represents a fundamental leap in developing foundation models for multimodal AI agents. Its ability to perceive, plan, and act marks a shift in AI usability—from being reactive and single-functional to proactive and versatile across domains.  

By integrating verbal and spatial-temporal reasoning, Magma bridges the gap between understanding and executing actions—bringing it one step closer to human-like capabilities.  

While Magma is an impressive leap forward, the researchers acknowledge several limitations. Being primarily designed for research, the model is not optimised for every downstream application and may exhibit biases or inaccuracies in high-risk scenarios. 

Developers working with fine-tuned versions of Magma are advised to evaluate them for safety, fairness, and regulatory compliance.

Looking forward, the team envisions leveraging the Magma framework for applications like:

  • Image/video captioning
  • Advanced question answering
  • Complex navigation systems
  • Robotics task automation

By refining and expanding its dataset and pretraining objectives, they aim to continue enhancing Magma’s multimodal and agentic intelligence.  

Magma is undoubtedly a milestone, demonstrating what’s possible when foundational models are extended to unite digital and physical domains.

From controlling robots in factories to automating digital workflows, Magma is a promising blueprint for a future where AI can seamlessly toggle between screens, cameras, and robotics to solve real-world challenges.

(Photo by Marc Szeglat)

See also: Smart Machines 2035: Addressing challenges and driving growth

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including IoT Tech Expo, Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.
