ICRA 2026 just showed us the future of manipulation. But one piece is missing.

2026-06-04

I’m in Vienna this week for ICRA 2026. And the exhibition floor is absolutely on fire.


Dexterous hands are everywhere: single booths launching multiple systems, hybrid drives, high grip forces. Walking the floor, you can see the industry converging. Dexterous manipulation is no longer a future concept. It is the defining trend of general-purpose embodied AI.


Photos taken at ICRA 2026 by FMC³ Robotics


As foundation models and simulation mature, the end-effector has become the ultimate differentiator for handling complex contacts and diverse real-world tasks.


But hardware is only half the equation.


How to acquire high-quality datasets for dexterous hands remains an underexplored topic.


Large language models ate the internet first. Billions of pages of text. Images. Video. All scraped, all ingested, all used to build foundations that keep getting stronger.


Embodied intelligence doesn't have that luxury.


Your robot can't scrape the internet for how a connector feels when it seats. Can't download how much force it takes to grip a circuit board without cracking it. Can't find pre-trained touch data on any server anywhere.


The data gap isn't small. It's 99%.


Industry consensus: we need millions of hours of real physical interaction data for general-purpose embodied AI. Current global supply? Around 500,000 hours. Almost all of it vision-only.


Force and tactile? The scarcest data type in robotics. Seventy-two percent of robotics teams cite "incomplete multimodal data" as their number one barrier to deployment. Not compute. Not algorithms. Data.


NVIDIA already proved the paradigm.


EgoScale — NVIDIA's three-stage training pipeline, released February 2026 by GEAR Lab and UC Berkeley.


· Stage one: 20,000 hours of egocentric video for visual pretraining.
· Stage two: human-robot alignment using Manus data gloves plus wearable sensors.
· Stage three: task-specific fine-tuning.


Here's what matters: Stage one runs on internet-scale video. But Stage two? That requires someone to actually wear sensors and perform real manipulation tasks. No shortcut. No scraping. The internet does not have mid-training force data.


NVIDIA's answer was force-sensing gloves plus head-mounted vision. The industry's convergence is clear: ego-centric, multimodal, with force.


Why ego-centric changes the equation.


Most robot datasets are filmed from a third-person view: cameras mounted above or to the side, looking at the robot as it works. Clean images. Clear views of the workspace.


But your robot doesn't see the workspace from above. It sees through its own eyes. From its own position. With its own occlusions.


Third-person training data creates a perception-action mismatch. Contact points get lost. Hand trajectories distort. Depth cues break. The model learns the scene but fails at the task.


Ego-centric data closes that gap. What the robot sees during training matches what it sees during deployment. Same perspective. Same occlusions. Same physical relationship between hand and object.


Consistency produces reliability. Mismatch produces failure.


What FMC³ Omni-Grab brings to this pipeline.


We built Omni-Grab because we saw the same bottleneck everyone else did, and we needed a solution ourselves.


At FMC³, we run our own VTLA models across multiple robot platforms. One brain, many bodies. Our research needs force-rich ego-centric data. So we built the tool we couldn't find.


Head-mounted RGB-D camera. 400-point tactile gloves at 200 Hz. Portable sync box. Microsecond-level time alignment across sight, motion, and force streams.


Not vision-only. Not vision-plus-IMU. Sight + motion + force. Three modalities. Zero robot body in the frame.
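Time alignment is easiest to picture with a small sketch. It assumes, hypothetically, that the sync box stamps every sample from every stream in microseconds on one shared clock, and that the camera runs at roughly 30 fps. The function names and the 2.5 ms skew tolerance (half the 5 ms tactile period) are illustrative choices, not part of Omni-Grab. The sketch pairs each camera frame with the nearest 200 Hz tactile sample and flags frames that have no sample close enough, rather than silently matching them.

```python
from bisect import bisect_left

def nearest_index(timestamps_us, t_us):
    """Index of the sample in sorted timestamps_us closest to t_us (ties go earlier)."""
    i = bisect_left(timestamps_us, t_us)
    if i == 0:
        return 0
    if i == len(timestamps_us):
        return len(timestamps_us) - 1
    before, after = timestamps_us[i - 1], timestamps_us[i]
    return i if after - t_us < t_us - before else i - 1

def align_to_frames(frame_ts_us, tactile_ts_us, tactile_frames, max_skew_us=2_500):
    """Pair each camera frame with its nearest tactile sample.

    Hypothetical helper: assumes all timestamps share one microsecond clock.
    Frames with no tactile sample within max_skew_us are paired with None.
    """
    aligned = []
    for t in frame_ts_us:
        j = nearest_index(tactile_ts_us, t)
        if abs(tactile_ts_us[j] - t) <= max_skew_us:
            aligned.append((t, tactile_frames[j]))
        else:
            aligned.append((t, None))
    return aligned

# One second of 200 Hz tactile samples (5,000 us apart) and three ~30 fps frames.
tactile_ts = [k * 5_000 for k in range(200)]
tactile = [f"taxels_{k}" for k in range(200)]
print(align_to_frames([0, 33_333, 66_667], tactile_ts, tactile))
# -> [(0, 'taxels_0'), (33333, 'taxels_7'), (66667, 'taxels_13')]
```

At 400 taxels and 200 Hz, each glove produces 400 × 200 = 80,000 taxel readings per second. Binary search keeps the per-frame lookup logarithmic in the number of tactile samples, so this step stays cheap even over long sessions.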


Photos taken at ICRA 2026 by FMC³ Robotics


The economics matter as much as the specs.


98% lower hardware cost than traditional multi-camera mocap or force plate setups. Industrial-grade wearable modules replace lab-grade fixed infrastructure. Research teams that couldn't afford data capture before can now run their own pipelines.


Datahub cuts dataset preparation labor by 80%. Automated five-step pipeline: capture → anonymize → align → annotate → validate. Raw sensor output becomes training-ready data with minimal human intervention.
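The five steps fit naturally into a chain of functions, where each function takes a record and returns a cleaner one. The sketch below is hypothetical. The `Record` schema, the trim-to-length "alignment", and the 0.5 contact threshold are stand-ins chosen for illustration, not how Datahub works. What it shows is the structure: validation runs last and rejects anything that was not anonymized or whose streams disagree in length.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Record:
    """One captured episode (hypothetical schema, not the Datahub format)."""
    frames: List[str]            # stand-in for RGB-D frames
    tactile: List[List[float]]   # per-sample taxel pressures
    anonymized: bool = False
    labels: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)  # audit trail of stages run

def capture(raw: dict) -> Record:
    # Wrap raw sensor dumps into a Record.
    rec = Record(frames=list(raw["frames"]), tactile=list(raw["tactile"]))
    rec.steps.append("capture")
    return rec

def anonymize(rec: Record) -> Record:
    # Placeholder: a real stage would remove identifying image content.
    rec.anonymized = True
    rec.steps.append("anonymize")
    return rec

def align(rec: Record) -> Record:
    # Toy alignment: trim both streams to a common length.
    # A real stage would pair samples by timestamp.
    n = min(len(rec.frames), len(rec.tactile))
    rec.frames, rec.tactile = rec.frames[:n], rec.tactile[:n]
    rec.steps.append("align")
    return rec

def annotate(rec: Record, contact_threshold: float = 0.5) -> Record:
    # Auto-label each sample as "contact" or "free" from its peak taxel pressure.
    rec.labels = ["contact" if max(s) > contact_threshold else "free" for s in rec.tactile]
    rec.steps.append("annotate")
    return rec

def validate(rec: Record) -> Record:
    # Reject records that are not training-ready.
    if not rec.anonymized:
        raise ValueError("record was not anonymized")
    if not (len(rec.frames) == len(rec.tactile) == len(rec.labels)):
        raise ValueError("stream lengths disagree")
    rec.steps.append("validate")
    return rec

STAGES: List[Callable[[Record], Record]] = [anonymize, align, annotate, validate]

def run_pipeline(raw: dict) -> Record:
    rec = capture(raw)
    for stage in STAGES:
        rec = stage(rec)
    return rec

raw = {"frames": ["f0", "f1", "f2"],
       "tactile": [[0.1, 0.2], [0.9, 0.3], [0.0, 0.0], [0.7, 0.7]]}
rec = run_pipeline(raw)
print(rec.labels)  # -> ['free', 'contact', 'free']
print(rec.steps)   # -> ['capture', 'anonymize', 'align', 'annotate', 'validate']
```

Keeping each stage a plain function makes it easy to swap one step, such as a stricter anonymizer, without touching the others.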


One dataset maps to multiple robot form factors. A dual-arm system. A dexterous hand. A humanoid platform. Same captured behavior, different hardware target — because the data is bound to the task, not the robot.
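One way to keep data "bound to the task, not the robot" is to store what the hand did in an embodiment-agnostic form and let each robot retarget it to its own commands. The sketch below assumes a hypothetical schema: fingertip positions in the object frame plus a total grip force. It also uses two toy embodiments. It is not FMC³'s retargeting method, and real retargeting also has to handle kinematics and joint limits.

```python
import math
from dataclasses import dataclass
from typing import Dict, Protocol, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class TaskStep:
    """One embodiment-agnostic step of a demonstration (hypothetical schema)."""
    fingertips: Dict[str, Vec3]  # fingertip positions in the object frame, metres
    grip_force_n: float          # total grip force, newtons

class Embodiment(Protocol):
    def retarget(self, step: TaskStep) -> dict: ...

class ParallelGripper:
    def __init__(self, max_width_m: float, max_force_n: float):
        self.max_width_m = max_width_m
        self.max_force_n = max_force_n

    def retarget(self, step: TaskStep) -> dict:
        # Use the thumb-index distance as the jaw opening, clipped to hardware limits.
        width = math.dist(step.fingertips["thumb"], step.fingertips["index"])
        return {"width_m": min(width, self.max_width_m),
                "force_n": min(step.grip_force_n, self.max_force_n)}

class MultiFingerHand:
    def retarget(self, step: TaskStep) -> dict:
        # Reuse fingertip targets directly and split the grip force evenly.
        n = len(step.fingertips)
        return {"fingertip_targets": dict(step.fingertips),
                "per_finger_force_n": step.grip_force_n / n}

# The same captured step drives two different robots.
step = TaskStep(fingertips={"thumb": (0.0, 0.0, 0.0), "index": (0.03, 0.04, 0.0)},
                grip_force_n=12.0)
robots: Dict[str, Embodiment] = {"gripper": ParallelGripper(0.08, 10.0),
                                 "hand": MultiFingerHand()}
for name, robot in robots.items():
    print(name, robot.retarget(step))
```

The captured `TaskStep` never changes. Only the small adapter per robot does, which is what lets one dataset serve several form factors.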


And for teams operating in Europe: EU-compliant by design. Cut-not-copy anonymization. Air-gapped processing stations. Data never leaves your facility unless you decide it should.


The core conviction.


Algorithms will eventually converge. What won't converge is access to high-fidelity physical interaction data at scale. The teams that solve data acquisition win the next phase of embodied AI, not the teams with the cleverest architecture.


Force data is the missing modality. Ego-centric capture is the proven path. And right now, there are only a handful of systems that deliver both in one integrated package.


The question isn't whether this paradigm will dominate. The question is who gets there first with enough scale to matter.


What manipulation task would you crack if you had 1,000 hours of force-rich ego-centric data tomorrow?


One Brain. Any Robot. Infinite Possibilities.

© 2026 FMC3 Robotics GmbH.

All rights reserved. Last Updated: 18.03.2026