OctoSense: Self-Supervised Learning for Multimodal Robot Perception

Why multimodal self-supervised learning?

Self-supervised foundation models such as DINO, SigLIP, and V-JEPA have transformed robot perception, but they are vision-only, and in the real world no single sensor suffices. Cameras degrade under low light, high dynamic range, and rapid motion; LiDAR is accurate but sparse and carries little semantic information. Sensors differ in rate, resolution, and noise, and they fail in different ways. Robust robot perception needs representations that survive these failures, yet self-supervised learning has focused almost entirely on vision and text.

“The defining enabling technology across all field applications is multi-sensor fusion robust to environmental degradation, the problem that keeps the most capable field robots indoors.”
Global Robotics Technology Roadmap 2025–2035

OctoSense takes this on directly. We release an open sensor platform, a 59-hour time-synchronized dataset, and a method: a late-fusion masked autoencoder that fuses every sensor into one representation and stays robust when sensors degrade.

Platform & dataset

An open-source platform with eight time-synchronized sensors, plus 59 hours (2,474 km) of driving across day, night, and degraded-sensor conditions. It is one of the largest robotics datasets to include event cameras.

Multimodal MAE

Modality-specific tokenizers feed a shared late-fusion masked autoencoder. Token caching at inference makes it real-time: 6.68 ms on an RTX 5090 and 112 ms on an embedded Orin NX.

Robust perception

Beats image-only foundation models on depth, flow, segmentation, and ego-motion, and the advantage grows at night and under sensor degradation.

Modality	Sensor	Resolution / signals	Rate
RGB (stereo)	2× FLIR Blackfly S	1920×1456	100 Hz
Event (stereo)	2× SilkyEV VGA (Prophesee)	640×480	≈7 MEv/s
Thermal	FLIR A35	320×256	50 Hz
LiDAR	Ouster OS1-64	64 × 2048	10 Hz
IMU	VectorNav VN-100T	Acc/Gyro/Mag/Baro	400 Hz
GNSS	u-blox ZED-F9P	RTK (NTRIP)	5 Hz
Proprioception	Vehicle CAN / quadruped joints	steering, throttle, brake / joint angles	50–100 Hz

Diverse conditions

Representative OctoSense scenes: night glare, the Big Duck, a bridge over water, a snowy forest descent, sun glare toward Philadelphia City Hall, and a sandy pine-forest road

Degraded perception: tunnels, blinding sun, lens flare, fog, and near-darkness washing out the camera

OctoSense spans a wide range of environments and lighting across Long Island and Philadelphia, from night-time glare and low-sun lens flare to roadside landmarks, water crossings, snow-covered descents, and unpaved forest roads. This breadth is exactly where single-camera models struggle and multi-sensor fusion pays off.

Play with the time-synchronized data

A short clip from a city drive is available in a hosted Rerun viewer (desktop-only), with every sensor on one shared timeline: stereo RGB and event cameras, thermal infrared, the LiDAR point cloud, IMU, GPS, and CAN signals, plus scene captions. The viewer lets you scrub the timeline, rotate the 3D view, and toggle streams in the browser.

The 15 s clip is streamed into the hosted Rerun viewer; the first load takes a few seconds, and sensors load progressively. Full recordings can be generated from any sequence with the open-source visualization tooling in the repository.

Where we drove

GPS routes covering Philadelphia, PA and Long Island, NY

2,474 km of routes across Philadelphia, PA and Long Island, NY, spanning highway, residential, urban, and rural driving across many sessions, days, and times of day.

Search the data in natural language

Every 5-second window is captioned with Gemma 4 and embedded with Qwen3-Embedding-8B into a hybrid FAISS + BM25 index, so all 2,474 km are searchable by description. Type a phrase such as “police vehicle”, “wet road at night”, or “pedestrian crossing” and get back clips with their captions, source sequence, and timestamps. We release the prebuilt index alongside the data.

Semantic search web UI: the query 'Police Vehicle' returns ranked video clips with captions, source sequences, and timestamps

The search UI: a natural-language query returns the most relevant moments across the dataset, each with its caption, source sequence, and time range.
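To make the idea of a hybrid dense + lexical index concrete, here is a minimal, dependency-free sketch. It is not the released search code: FAISS is replaced by a brute-force cosine search, the 2-D embeddings are made-up stand-ins for real caption embeddings, and combining the two rankings with reciprocal rank fusion is an assumption, since the page does not say how the dense and BM25 scores are merged. All function names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every caption in docs for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query embedding and each caption embedding."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [sum(q * d for q, d in zip(query_vec, vec)) / (norm(query_vec) * norm(vec))
            for vec in doc_vecs]

def ranking(scores):
    """Indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per item."""
    fused = Counter()
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

# Toy captions and made-up 2-D embeddings (assumptions for illustration only).
captions = [
    "police vehicle parked at an intersection",
    "wet road at night with oncoming headlights",
    "pedestrian crossing near a bus stop",
]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query, query_vec = "police vehicle", [0.9, 0.1]

hits = reciprocal_rank_fusion([
    ranking(bm25_scores(query, captions)),
    ranking(cosine_scores(query_vec, embeddings)),
])
```

Rank fusion is one common way to combine the two signals because it needs no score calibration between the lexical and dense retrievers; a weighted sum of normalized scores would be an equally plausible choice.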

Ground truth

From LiDAR-inertial odometry (RKO-LIO) we derive metric depth by accumulating 61 deskewed scans and projecting them into the rectified RGB image; moving objects are removed with YOLO26-medium masks, and the minimum depth is kept per pixel. Ego-motion optical flow follows by reprojecting that depth into future frames, and semantic segmentation pseudo-labels come from EoMT trained on Cityscapes. Together with the fused odometry trajectory, this gives per-sequence supervision for every downstream task.
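The two geometric steps, keeping the minimum depth per pixel and turning depth plus camera motion into optical flow, can be sketched in NumPy as follows. This is an illustrative sketch under simplifying assumptions, not the released pipeline: it uses a plain pinhole model with a hypothetical intrinsics matrix `K`, assumes the points are already accumulated and expressed in the camera frame, and leaves out deskewing and the removal of moving objects. The function names are hypothetical.

```python
import numpy as np

def project_min_depth(points_cam, K, height, width):
    """Project (N, 3) points in the camera frame into a depth image.

    When several points land on one pixel, the nearest one wins
    (minimum depth per pixel). Pixels without a return are set to 0.
    """
    z = points_cam[:, 2]
    front = z > 0                                  # drop points behind the camera
    pts, z = points_cam[front], z[front]
    uvw = pts @ K.T                                # pinhole projection
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    depth[np.isinf(depth)] = 0.0
    return depth

def ego_flow(depth, K, T_future_from_now):
    """Flow induced by camera motion alone; NaN where depth is missing.

    T_future_from_now is a 4x4 transform taking points from the current
    camera frame to the future camera frame.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    pts = (pix @ np.linalg.inv(K).T) * depth[..., None]   # back-project to 3D
    R, t = T_future_from_now[:3, :3], T_future_from_now[:3, 3]
    proj = (pts @ R.T + t) @ K.T                           # reproject in future frame
    valid = (depth > 0) & (proj[..., 2] > 0)
    flow = np.full((h, w, 2), np.nan)
    flow[valid] = proj[valid][:, :2] / proj[valid][:, 2:3] - pix[valid][:, :2]
    return flow

# Toy example: two points on the optical axis and one off-axis point.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
depth = project_min_depth(points, K, height=48, width=64)

forward_1m = np.eye(4)
forward_1m[2, 3] = -1.0            # camera moves 1 m forward, so points get 1 m closer
flow = ego_flow(depth, K, forward_1m)
```

In the toy example the two points on the optical axis compete for pixel (32, 24) and the nearer one at 5 m is kept, while the off-axis point drifts outward as the camera advances.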

Method: a late-fusion multimodal MAE

The sensors look nothing alike: a dense 2D array for RGB, a high-frequency point process for the event camera, a sparse point cloud for LiDAR, and a fast multi-channel time series for the IMU. We learn one fused representation in a self-supervised way, by masking part of the input and reconstructing it. The pipeline has four stages, walked through below: per-modality input representations, frozen tokenized targets, a late-fusion masked autoencoder, and lightweight task probes.

1 · Sensor representations

Every modality has to become a sequence of tokens, but each arrives in a different form. The event stream is high-frequency and noisy: we drop isolated events with a spatio-temporal filter, then run a bank of leaky integrators at several bandwidths to turn the asynchronous stream into a multi-channel image that captures motion across timescales. LiDAR is deskewed and projected into a forward 64×512 range image; RGB is undistorted and rectified; and the IMU accelerometer and gyroscope signals are convolved over a 1.6 s window and pooled with cross-attention. Each image-like modality is then split into ViT-style patches.
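The event representation described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the released preprocessing: the neighbourhood filter is a brute-force O(N²) check, the filter radius, time window, and time constants are made-up values, and the function names are hypothetical.

```python
import numpy as np

def drop_isolated(events, radius=1, dt=0.01):
    """Keep events that have at least one other event nearby in space and time.

    events: rows of (t, x, y, polarity), with polarity in {-1, +1}.
    """
    t, x, y = events[:, 0], events[:, 1], events[:, 2]
    near = ((np.abs(x[:, None] - x[None]) <= radius)
            & (np.abs(y[:, None] - y[None]) <= radius)
            & (np.abs(t[:, None] - t[None]) <= dt))
    np.fill_diagonal(near, False)          # an event is not its own neighbour
    return events[near.any(axis=1)]

def leaky_integrator_bank(events, height, width, taus, t_end):
    """One image channel per time constant tau.

    Each pixel holds sum_i p_i * exp(-(t_end - t_i) / tau) over its events,
    which is the state of a leaky integrator read out at time t_end.
    """
    t, p = events[:, 0], events[:, 3]
    x, y = events[:, 1].astype(int), events[:, 2].astype(int)
    keep = t <= t_end
    out = np.zeros((len(taus), height, width))
    for c, tau in enumerate(taus):
        weight = p[keep] * np.exp(-(t_end - t[keep]) / tau)
        np.add.at(out[c], (y[keep], x[keep]), weight)   # accumulate per pixel
    return out

# Toy stream: two neighbouring events and one isolated noise event.
events = np.array([
    [0.000, 5, 5, +1],
    [0.002, 5, 6, +1],
    [0.050, 1, 1, -1],
])
filtered = drop_isolated(events)
frames = leaky_integrator_bank(filtered, height=8, width=8,
                               taus=[0.01, 0.1], t_end=0.01)
```

Short time constants respond only to the most recent events, while long ones retain a trace of older motion, which is why stacking several of them gives a multi-timescale image.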

2 · Tokenized targets

Reconstructing raw pixels breaks down across modalities: loss scales and token counts differ wildly, so training collapses onto the easy modalities. Instead, each modality gets its own autoencoder trained ahead of time with finite scalar quantization (FSQ), mapping every patch to a bounded discrete code. The MAE then predicts these frozen codes rather than pixels, putting all modalities on comparable footing. We make a few per-modality adjustments: LiDAR adds a validity mask for ray-drop holes, events use a weighted active/inactive loss, and the IMU is mean-removed and normalized.
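As a rough illustration of what finite scalar quantization does to a latent vector, the sketch below bounds each channel with `tanh`, rounds it to a small integer grid, and turns the result into a single code index. The level counts are made-up and restricted to odd values to keep the grid symmetric; the page does not state the tokenizers' actual settings. The rounding step is typically paired with a straight-through gradient during training, which this NumPy sketch leaves out.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization of latents z with shape (..., d).

    levels[i] is the (odd) number of quantization levels of channel i.
    Returns the quantized latents scaled to [-1, 1] and one integer code
    per latent vector.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2                  # 5 levels -> integers in [-2, 2]
    bounded = np.tanh(z) * half              # squash each channel into its range
    q = np.round(bounded)                    # snap to the integer grid
    digits = (q + half).astype(int)          # shift to 0 .. levels - 1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    codes = (digits * basis).sum(axis=-1)    # mixed-radix index into the codebook
    return q / half, codes

# Three channels with 5, 5, and 3 levels give an implicit codebook of 75 codes.
z = np.array([[0.0, 0.0, 0.0], [10.0, -10.0, 10.0]])
z_q, codes = fsq_quantize(z, levels=[5, 5, 3])
```

Because every channel is bounded and rounded independently, the codebook is implicit and every code stays in a fixed range regardless of the modality, which is what puts the reconstruction targets on a comparable scale.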

3 · Late-fusion MAE

The fusion model masks the tokens in spatio-temporal tubes, with each modality's keep-ratio drawn from a Dirichlet distribution so that, across training, the model sees everything from a single dominant sensor to a near-uniform mix, including cases where whole sensors are dropped. The encoder is factorized into three attention stages: spatial within each frame, temporal along each patch's tube across the 8 timesteps, and finally multimodal across all surviving tokens, with a 4D rotary embedding over (time, u, v, sensor). A shared decoder with per-modality heads reconstructs the FSQ codes under an ℓ₁ loss. At inference, per-modality tokens are cached within the window so only new measurements are re-encoded, which keeps the late-fusion encoder real-time and about 60% faster than early fusion.

Late-fusion MAE architecture: tube masking (top left), Dirichlet allocation across modalities (top right), and the three-stage encoder/decoder (bottom).
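The masking scheme can be sketched as follows. This is one plausible reading rather than the released sampler: the page says keep-ratios are drawn from a Dirichlet distribution, and here that is implemented by splitting a global budget of visible tubes across modalities, which is an assumption. The token counts, `keep_total`, and `alpha` are made-up values, and the function name is hypothetical.

```python
import numpy as np

def dirichlet_tube_masks(tokens_per_frame, n_steps, keep_total, alpha, rng):
    """Sample one tube mask per modality.

    tokens_per_frame: dict modality -> number of spatial patches per timestep.
    keep_total: overall fraction of tubes left visible.
    alpha: Dirichlet concentration; small values favour one dominant sensor,
           large values a near-uniform split.
    Returns dict modality -> boolean array (n_steps, patches), True = visible.
    """
    names = list(tokens_per_frame)
    share = rng.dirichlet(alpha * np.ones(len(names)))    # shares sum to 1
    budget = keep_total * sum(tokens_per_frame.values())  # visible tubes overall
    masks = {}
    for name, s in zip(names, share):
        n = tokens_per_frame[name]
        n_keep = min(n, int(round(s * budget)))           # 0 drops the whole sensor
        visible = np.zeros(n, dtype=bool)
        visible[rng.choice(n, size=n_keep, replace=False)] = True
        # A tube hides or shows the same patch at every timestep.
        masks[name] = np.tile(visible, (n_steps, 1))
    return masks

rng = np.random.default_rng(0)
token_counts = {"rgb": 196, "event": 196, "lidar": 128, "imu": 4}  # hypothetical
masks = dirichlet_tube_masks(token_counts, n_steps=8, keep_total=0.25,
                             alpha=0.5, rng=rng)
```

With a small `alpha`, most draws concentrate the visible budget on one or two modalities, so the model regularly has to reconstruct a sensor it barely sees from the others.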

4 · Downstream probes

The pretrained encoder is then frozen, and lightweight probes read off task predictions. A Dense Prediction Transformer fuses features from four encoder layers to predict optical flow, depth, and segmentation; a separate attentive probe cross-attends to the final encoder layer to regress ego-motion, relative pose, steering, speed, and angular/linear velocity. Only these heads are trained per task; the representation itself is never fine-tuned, so the numbers reflect the quality of the learned features directly.
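For intuition about the attentive probe, here is a minimal single-head version in NumPy: a learned query attends over the frozen encoder's tokens and a linear layer maps the pooled feature to the regression target. The dimensions, the single head, and the 6-dimensional output are assumptions made for illustration; the weights are random rather than trained, and the function name is hypothetical.

```python
import numpy as np

def attentive_probe(tokens, query, W_k, W_v, W_out):
    """Single-head attentive probe over frozen encoder tokens.

    tokens: (n, d) features from the frozen encoder's final layer.
    query:  (d,) learned query vector, trained together with the probe.
    Returns a (k,) regression output.
    """
    keys, values = tokens @ W_k, tokens @ W_v
    logits = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over tokens
    pooled = weights @ values                # attention-weighted summary, (d,)
    return pooled @ W_out

rng = np.random.default_rng(0)
d, n_tokens, k = 16, 10, 6                   # hypothetical sizes
tokens = rng.normal(size=(n_tokens, d))      # stand-in for frozen encoder output
query = rng.normal(size=d)
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(d, k))
prediction = attentive_probe(tokens, query, W_k, W_v, W_out)
```

Only `query`, `W_k`, `W_v`, and `W_out` would receive gradients in such a setup; the tokens come from the frozen encoder and are treated as fixed inputs.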

Reconstruction under many masks

A single driving sequence is tiled into non-overlapping ~1.4 s windows. Each window draws a fresh Dirichlet mask, so the per-sensor masking ratio varies widely as the clip plays, sometimes hiding most of the camera and sometimes most of the LiDAR. For every window we show Ground Truth, Masked Input, and Reconstruction across RGB, event, LiDAR, and IMU. The model rebuilds each modality from whatever sparse, cross-modal context survives the mask; this is the core self-supervised objective behind the representation.

Encoder (daytime test)	Depth (m) ↓	Flow (px) ↓	Seg (mIoU) ↑	Trans. (m) ↓	Rot. (°) ↓
DINO v3	6.93	19.43	0.382	0.93	0.79
V-JEPA 2.1	6.38	9.13	0.402	0.77	0.47
Late-fusion MAE (ours)	4.73	1.97	0.411	0.06	0.24

Beyond driving

The same sensor platform has been deployed beyond the car. The data release additionally includes boat and Unitree Go2-W quadruped sequences, tagged by a platform column in the metadata, a first step toward one perception model across very different platforms.

Limitations & future work

OctoSense is a research prototype. The dataset is orders of magnitude smaller than those used to train image foundation models like DINO and V-JEPA; scaling it up, ideally through a community-wide effort to pool multi-sensor data, is the path toward a true foundation model for multimodal robot perception.

A key next step is moving beyond the 1.4 s window to longer temporal context with recurrent architectures, which would unlock tasks like object tracking and 3D instance detection and further cut inference cost. Event cameras and the IMU contribute little at our 5 Hz evaluation rate but should shine on high-speed, low-latency tasks. We leave that direction open, along with studying the unique, synergistic, and redundant information these sensors carry.

BibTeX

@misc{bisulco2026octosense,
  title        = {{OctoSense}: Self-Supervised Learning for Multimodal Robot Perception},
  author       = {Bisulco, Anthony and Wang, Jeremy and Daniilidis, Kostas and Balestriero, Randall and Chaudhari, Pratik},
  year         = {2026},
  howpublished = {Preprint},
}

Acknowledgments

This work was supported by grants from the National Science Foundation (IIS-2145164, CCF-2212519), the NSF and DoD OUSD (R&E) under Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence), DSO National Laboratories, Singapore, and the Office of Naval Research DURIP.
