OctoSense

Self-Supervised Learning for Multimodal Robot Perception

¹GRASP Laboratory, University of Pennsylvania; ²Brown University

One platform, eight sensors, one clock: synchronized multimodal driving data across day, night, and degraded conditions.

Why multimodal self-supervised learning?

Self-supervised foundation models such as DINO, SigLIP, and V-JEPA have transformed robot perception, but they are vision-only, and in the real world no single sensor suffices. Cameras degrade under low light, high dynamic range, and rapid motion; LiDAR is accurate but sparse and carries little semantic information. Sensors differ in rate, resolution, and noise, and they fail in different ways. Robust robot perception needs representations that survive these failures, yet self-supervised learning has focused almost entirely on vision and text.

“The defining enabling technology across all field applications is multi-sensor fusion robust to environmental degradation, the problem that keeps the most capable field robots indoors.”
Global Robotics Technology Roadmap 2025–2035

OctoSense takes this on directly. We release an open sensor platform, a 59-hour time-synchronized dataset, and a method: a late-fusion masked autoencoder that fuses every sensor into one representation and stays robust when individual sensors degrade.

Platform & dataset

An open-source sensor platform with eight time-synchronized sensors, plus 59 hours / 2,474 km of driving data: one of the largest event-inclusive robotics datasets, covering day, night, and degraded-sensor conditions.

Multimodal MAE

Modality-specific tokenizers feed a shared late-fusion masked autoencoder. Token caching at inference makes it real-time: 6.68 ms on an RTX 5090 and 112 ms on an embedded Orin NX.

Robust perception

Beats image-only foundation models on depth, flow, segmentation, and ego-motion, and the advantage grows at night and under sensor degradation.

The platform & dataset

OctoSense aligns all sensors to a single timeline using our PPS time-sync hardware: a unique six-pulse identifier emitted every four minutes and fifteen seconds lets every stream realign even after a dropped trigger. At native rates the platform produces ~1.7 GB/s; on-board compression (LiDAR/event packets, H.265 video) cuts that 21× to 78.7 MB/s with no dropped data. Calibration uses a retro-reflective circle on an AprilGrid that is jointly visible to the cameras and the LiDAR.
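As a quick sanity check on the data rates quoted above, the short sketch below recomputes the compression ratio and estimates the compressed storage footprint. Two assumptions are not from the source: decimal units (1 GB = 1000 MB), and the compressed rate being sustained for all 59 hours. Because ~1.7 GB/s is a rounded figure, the ratio lands near the quoted 21× rather than exactly on it, and the storage number is only a rough estimate.

```python
# Back-of-the-envelope check of the quoted data rates.
# Assumptions (not from the source): decimal units, and the compressed
# rate held constant over the whole 59-hour release.
RAW_MB_PER_S = 1700.0        # ~1.7 GB/s at native sensor rates (rounded)
COMPRESSED_MB_PER_S = 78.7   # after on-board compression
HOURS = 59

ratio = RAW_MB_PER_S / COMPRESSED_MB_PER_S
total_tb = COMPRESSED_MB_PER_S * HOURS * 3600 / 1e6  # MB -> TB

print(f"compression ratio ~{ratio:.1f}x")
print(f"estimated compressed size ~{total_tb:.1f} TB")
```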

The first release spans urban, suburban, and rural driving on Long Island and in Philadelphia across sunrise, daytime, sunset, and night, including sun-flare and packet-loss degradation. Every 5-second window is captioned (Gemma 4) and embedded (Qwen3) into a FAISS index, so the data is searchable in natural language.

Modality | Sensor | Info | Rate
RGB (stereo) | 2× FLIR Blackfly S | 1920×1456 | 100 Hz
Event (stereo) | 2× SilkyEV VGA (Prophesee) | 640×480 | ≈7 MEv/s
Thermal | FLIR A35 | 320×256 | 50 Hz
LiDAR | Ouster OS1-64 | 64 × 2048 | 10 Hz
IMU | VectorNav VN-100T | Acc/Gyro/Mag/Baro | 400 Hz
GNSS | u-blox ZED-F9P | RTK (NTRIP) | 5 Hz
Proprioception | Vehicle CAN / quadruped joints | steering, throttle, brake / joint angles | 50–100 Hz

Custom hardware

CAD rendering of the OctoSense sensor platform

CAD of the sensor platform: stereo RGB + event cameras, LiDAR, thermal, IMU, and GPS on one mount.

Custom time-synchronization circuit board

The custom SyncBoard that hardware-triggers every sensor off one clock.

Everything here is open source: the mechanical CAD, the sensor mounts, and the custom electronics. The platform carries all eight sensors on one adjustable bar above a desktop-class CPU, powered by a 24 V battery for about an hour of mobile operation.

Our custom SyncBoard is the heart of the platform's hardware time synchronization. A temperature-compensated oscillator and a microcontroller generate a pulse-per-second trigger and fan it out across the PCB to every sensor, so each stream can be aligned to a common timeline in post-processing.

Diverse conditions

OctoSense spans a wide range of environments and lighting across Long Island and Philadelphia, from night-time glare and low-sun lens flare to roadside landmarks, water crossings, snow-covered descents, and unpaved forest roads. This breadth is exactly where single-camera models struggle and multi-sensor fusion pays off.

Play with the time-synchronized data

A live Rerun viewer with a short clip (desktop-only) from a city drive, every sensor on one shared timeline: stereo RGB + event cameras, infrared, the LiDAR point cloud, IMU, GPS, and CAN signals, plus scene captions. Scrub the timeline, rotate the 3D view, and toggle streams, right here in the browser.

A 15 s clip is streamed into the hosted Rerun viewer; the first load takes a few seconds, and sensors load progressively. The clip can also be opened in a full-page viewer. Full recordings can be generated from any sequence with the open-source viz tooling in the repository.

Where we drove

GPS routes covering Philadelphia, PA and Long Island, NY

2,474 km of routes across Philadelphia, PA and Long Island, NY, spanning highway, residential, urban, and rural driving across many sessions, days, and times of day.

How OctoSense compares

Among autonomous-driving and event-inclusive multimodal datasets, OctoSense offers the largest amount of event-inclusive data (59 hrs / 2,474 km), and includes other sensors such as RGB, thermal, LiDAR, IMU, GPS, and CAN.

Comparison of OctoSense against autonomous-driving and event-inclusive multimodal datasets by duration, sensors, and tasks

Search the data in natural language

Every 5-second window is captioned with Gemma 4 and embedded with Qwen3-Embedding-8B into a hybrid FAISS + BM25 index, so all 2,474 km are searchable by description. Type a phrase such as “police vehicle”, “wet road at night”, or “pedestrian crossing” and get back clips with their captions, source sequence, and timestamps. We release the prebuilt index alongside the data.

Semantic search web UI: the query 'Police Vehicle' returns ranked video clips with captions, source sequences, and timestamps

The search UI: a natural-language query returns the most relevant moments across the dataset, each with its caption, source sequence, and time range.
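To make the hybrid dense + lexical idea concrete, here is a minimal, standard-library-only sketch. It is not the released index: plain cosine similarity stands in for FAISS, the 3-d vectors are hand-made placeholders for real caption embeddings, and combining the two rankings with reciprocal rank fusion is an assumption, since the source does not say how the FAISS and BM25 scores are merged. The function names and sample captions are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain BM25 over whitespace-tokenized, lower-cased captions."""
    toks = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in toks) / len(toks)
    df = Counter(w for t in toks for w in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (len(docs) - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avg_len))
        scores.append(s)
    return scores

def rank(scores):
    """Map document index -> 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {doc: r + 1 for r, doc in enumerate(order)}

def hybrid_search(query, query_vec, captions, caption_vecs, k=60, top=3):
    """Fuse dense and lexical rankings with reciprocal rank fusion (assumed)."""
    dense = rank([cosine(query_vec, v) for v in caption_vecs])
    sparse = rank(bm25_scores(query, captions))
    fused = {i: 1 / (k + dense[i]) + 1 / (k + sparse[i]) for i in range(len(captions))}
    return sorted(fused, key=fused.get, reverse=True)[:top]

# Hypothetical captions and placeholder embeddings.
captions = [
    "police vehicle parked on the shoulder",
    "wet road at night with oncoming headlights",
    "pedestrian crossing at a suburban intersection",
]
caption_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]

hits = hybrid_search("police vehicle", [0.9, 0.2, 0.1], captions, caption_vecs)
print([captions[i] for i in hits])
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on a common scale, which is one reason it is a common default for hybrid retrieval.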

Ground truth

From LiDAR-inertial odometry (RKO-LIO) we derive metric depth by accumulating 61 deskewed scans and projecting them into the rectified RGB image. Moving objects are removed with YOLO26-medium masks, and the minimum depth is kept per pixel. Ego-motion optical flow follows by reprojecting that depth into future frames, and semantic segmentation pseudo-labels come from EoMT trained on Cityscapes. Together with the fused odometry trajectory, this gives per-sequence supervision for every downstream task.
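The two geometric steps described above (minimum-depth projection and ego-motion flow) can be sketched in a few lines. This is an illustration under stated assumptions, not the released pipeline: it assumes an ideal pinhole camera with intrinsics `fx, fy, cx, cy`, points already expressed in the camera frame, a relative pose convention `X_future = R · X_now + t`, and a moving-object mask given as a set of pixels. All function names are hypothetical.

```python
def project_min_depth(points_cam, fx, fy, cx, cy, width, height, dynamic_mask=None):
    """Project 3D points (camera frame, metres) with a pinhole model and keep
    the nearest depth per pixel. Returns a sparse map {(u, v): depth}."""
    depth = {}
    for x, y, z in points_cam:
        if z <= 0:                      # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue
        if dynamic_mask is not None and (u, v) in dynamic_mask:
            continue                    # pixel covered by a moving-object mask
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z           # keep the minimum depth per pixel
    return depth

def ego_flow(u, v, z, fx, fy, cx, cy, R, t):
    """Flow of a static pixel with depth z under camera motion (R, t),
    assuming X_future = R @ X_now + t. Returns (du, dv) or None."""
    X = ((u - cx) * z / fx, (v - cy) * z / fy, z)          # back-project
    Xf = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xf[2] <= 0:
        return None                     # point ends up behind the camera
    return (fx * Xf[0] / Xf[2] + cx - u, fy * Xf[1] / Xf[2] + cy - v)

# Toy example: two points land on the same pixel, the nearer one wins.
K = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
points = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (1.0, 0.0, 10.0), (0.0, 0.0, -1.0)]
depth = project_min_depth(points, width=100, height=100, **K)
masked = project_min_depth(points, width=100, height=100, dynamic_mask={(60, 50)}, **K)

# The camera drives 5 m forward, so a static point moves 5 m closer along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flow = ego_flow(60, 50, depth[(60, 50)], R=identity, t=(0.0, 0.0, -5.0), **K)
print(depth, flow)
```

Keeping the minimum depth per pixel is a simple z-buffer: when accumulated scans place several points on one pixel, the nearest surface is the one the camera actually sees.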

Method: a late-fusion multimodal MAE

The sensors look nothing alike: a dense 2D array for RGB, a high-frequency point process for the event camera, a sparse point cloud for LiDAR, and a fast multi-channel time series for the IMU. We learn one fused representation by masking part of the input and reconstructing it, without labels. The pipeline has four stages, walked through below: per-modality input representations, frozen tokenized targets, a late-fusion masked autoencoder, and lightweight task probes.

1 · Sensor representations

Every modality has to become a sequence of tokens, but each arrives in a different form. The event stream is high-frequency and noisy: we drop isolated events with a spatio-temporal filter, then run a bank of leaky integrators at several bandwidths to turn the asynchronous stream into a multi-channel image that captures motion across timescales. LiDAR is deskewed and projected into a forward 64×512 range image; RGB is undistorted and rectified into ViT patches; and the IMU accelerometer/gyroscope are convolved over a 1.6 s window and pooled with cross-attention. Each image-like modality is then split into patches.

Sensor representations
Each modality is converted to a tokenizable form before patch tokenization.
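For the event stream, the filter-then-integrate idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: events are assumed to be `(t, x, y, polarity)` tuples sorted by time, the filter only looks at past activity in the 8-pixel neighbourhood, and the time constants in `taus` and the filter `window` are made-up values.

```python
import math

def drop_isolated(events, window=0.01):
    """Keep an event only if one of its 8 neighbours fired within `window`
    seconds before it. Events are (t, x, y, polarity), sorted by t."""
    last_seen = {}
    kept = []
    for t, x, y, p in events:
        supported = any(
            t - last_seen.get((x + dx, y + dy), -math.inf) <= window
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
        )
        if supported:
            kept.append((t, x, y, p))
        last_seen[(x, y)] = t
    return kept

def leaky_integrate(events, t_read, taus=(0.005, 0.05, 0.5)):
    """One sparse channel per time constant: {(x, y): value}. Each pixel
    decays as exp(-dt / tau) between events and up to the readout time."""
    channels = []
    for tau in taus:
        value, stamp = {}, {}
        for t, x, y, p in events:
            key = (x, y)
            decay = math.exp(-(t - stamp.get(key, t)) / tau)
            value[key] = value.get(key, 0.0) * decay + p
            stamp[key] = t
        channels.append(
            {k: v * math.exp(-(t_read - stamp[k]) / tau) for k, v in value.items()}
        )
    return channels

events = [
    (0.000, 5, 5, +1),    # first event of a cluster: no earlier neighbour
    (0.001, 6, 5, +1),
    (0.002, 5, 5, +1),
    (0.500, 40, 40, -1),  # lone event far from everything else
]
kept = drop_isolated(events)
channels = leaky_integrate(kept, t_read=0.01)
print(len(kept), [round(c[(6, 5)], 3) for c in channels])
```

Short time constants forget quickly and respond to fast motion, while long ones retain slow structure, which is why stacking several of them yields a multi-channel image spanning timescales.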

2 · Tokenized targets

Reconstructing raw pixels breaks down across modalities: their loss scales and token counts differ wildly, so training collapses onto the easy ones. Instead, each modality gets its own autoencoder, trained ahead of time with finite scalar quantization (FSQ), that maps every patch to a bounded discrete code. The MAE then predicts these frozen codes rather than pixels, putting all modalities on comparable footing. There are three per-modality adjustments: LiDAR adds a validity mask for ray-drop holes, events use a weighted active/inactive loss, and the IMU is mean-removed and normalized.

Tokenized targets
A frozen per-modality FSQ encoder/decoder supplies bounded discrete targets for the MAE.
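The quantization step of FSQ is small enough to show directly. The sketch below covers only the forward pass for odd level counts: each latent dimension is bounded with tanh and rounded to one of L integer levels. The level configuration `LEVELS` is a hypothetical choice, and the encoder, decoder, and the straight-through gradient used during training are omitted.

```python
import math

LEVELS = (7, 5, 5, 5)  # hypothetical: odd number of levels per latent dimension

def fsq_quantize(z, levels=LEVELS):
    """Bound each latent with tanh, then round to the nearest integer level.
    Codes for a dimension with L levels lie in [-(L-1)/2, (L-1)/2]."""
    codes = []
    for zi, L in zip(z, levels):
        half = (L - 1) / 2
        codes.append(int(round(math.tanh(zi) * half)))
    return codes

def fsq_index(codes, levels=LEVELS):
    """Flatten per-dimension codes into one token id (mixed-radix encoding)."""
    idx, base = 0, 1
    for c, L in zip(codes, levels):
        idx += (c + (L - 1) // 2) * base
        base *= L
    return idx

codes = fsq_quantize([0.0, 10.0, -10.0, 0.3])
token_id = fsq_index(codes)
codebook_size = math.prod(LEVELS)
print(codes, token_id, codebook_size)
```

Because every code is bounded by construction, reconstruction targets from different sensors share the same numeric range, which is the property the paragraph above relies on.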

3 · Late-fusion MAE

The fusion model masks the tokens in spatio-temporal tubes, each modality's keep-ratio drawn from a Dirichlet distribution so that, across training, the model sees everything from a single dominant sensor to a near-uniform mix, and even whole sensors dropped. The encoder is factorized into three attention stages: spatial within each frame, temporal along each patch's tube across the 8 timesteps, and finally multimodal across all surviving tokens, with a 4D rotary embedding over (time, u, v, sensor). A shared decoder with per-modality heads reconstructs the FSQ codes under an ℓ1 loss. At inference, per-modality tokens are cached within the window so only new measurements are re-encoded, keeping the late-fusion encoder real-time, about 60% faster than early fusion.
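The caching idea at the end of the paragraph above can be illustrated with a sliding window of per-modality tokens. This is only a sketch of the general pattern, since the source does not describe the cache itself: the class name `TokenCache`, the tokenizer callables, and the bookkeeping counter are hypothetical, and the window length of 8 is taken from the 8 timesteps mentioned above.

```python
from collections import deque

class TokenCache:
    """Sliding window of per-timestep tokens for each modality."""

    def __init__(self, tokenizers, window=8):
        # tokenizers: modality name -> callable(measurement) -> tokens
        self.tokenizers = tokenizers
        self.cache = {m: deque(maxlen=window) for m in tokenizers}
        self.encode_calls = 0

    def push(self, modality, measurement):
        """Tokenize only the new measurement; older tokens are reused."""
        self.cache[modality].append(self.tokenizers[modality](measurement))
        self.encode_calls += 1

    def window_tokens(self):
        """Tokens that would be handed to the multimodal fusion stage."""
        return {m: list(q) for m, q in self.cache.items()}

# Stand-in tokenizers that just tag the measurement with its modality.
cache = TokenCache({"rgb": lambda x: ("rgb", x), "lidar": lambda x: ("lidar", x)})
for frame in range(10):
    cache.push("rgb", frame)
cache.push("lidar", 0)

tokens = cache.window_tokens()
print(len(tokens["rgb"]), tokens["rgb"][0], cache.encode_calls)
```

With this pattern each new measurement costs one tokenizer call, whereas re-encoding the full window at every step would cost one call per timestep in the window.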

Late-fusion MAE architecture
Tube masking (top left), Dirichlet allocation across modalities (top right), and the three-stage encoder/decoder (bottom).
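A minimal sketch of Dirichlet-allocated tube masking follows. Several details are assumptions rather than the authors' recipe: the Dirichlet draw is interpreted as each modality's share of a fixed budget of visible tubes, the concentration `alpha`, the `budget`, and the patch counts are made-up values, and the function names are hypothetical.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def tube_mask(patches_per_modality, budget, alpha=0.5, timesteps=8, seed=0):
    """Return {modality: sorted list of kept (t, patch) tokens}.
    A kept spatial patch stays visible at every timestep (a tube)."""
    rng = random.Random(seed)
    names = list(patches_per_modality)
    shares = dirichlet([alpha] * len(names), rng)
    kept = {}
    for name, share in zip(names, shares):
        n_patches = patches_per_modality[name]
        n_keep = min(n_patches, int(round(share * budget)))  # may be 0: sensor dropped
        tubes = rng.sample(range(n_patches), n_keep)
        kept[name] = sorted((t, p) for p in tubes for t in range(timesteps))
    return kept

# Hypothetical patch counts per modality.
kept = tube_mask({"rgb": 196, "event": 196, "lidar": 128, "imu": 1}, budget=64)
for name, toks in kept.items():
    print(name, len(toks) // 8, "visible tubes")
```

A small concentration such as `alpha < 1` pushes draws toward the corners of the simplex (one dominant sensor), while larger values give near-uniform mixes, which matches the range of masking patterns described above.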

4 · Downstream probes

The pretrained encoder is then frozen, and lightweight probes read off task predictions. A Dense Prediction Transformer fuses features from four encoder layers to predict optical flow, depth, and segmentation; a separate attentive probe cross-attends to the final encoder layer to regress ego-motion, relative pose, steering, speed, and angular/linear velocity. Only these heads are trained per task; the representation itself is never fine-tuned, so the numbers reflect the quality of the learned features directly.

Downstream probes
Frozen-feature probes: a DPT for dense tasks and an attentive probe for ego-motion.
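The attentive probe can be pictured as a learned query that cross-attends to the frozen tokens, followed by a small regression head. The sketch below is deliberately stripped down and is not the authors' probe: it uses a single head, no key/value projections, and hand-picked numbers in place of trained weights, and the three outputs are placeholders.

```python
import math

def attentive_pool(query, tokens):
    """Single-head cross-attention of one query vector over frozen tokens.
    Keys and values are the tokens themselves (no projections)."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(logits)                               # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    weights = [w / z for w in weights]
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights

def linear_head(pooled, W, b):
    """Regression head applied to the pooled feature."""
    return [sum(w * x for w, x in zip(row, pooled)) + bi for row, bi in zip(W, b)]

# Frozen encoder tokens (placeholders) and a query that would be learned.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
pooled, weights = attentive_pool(query, tokens)

# Hypothetical 3-output head, e.g. two translation components and a yaw rate.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
prediction = linear_head(pooled, W, b)
print([round(w, 3) for w in weights], [round(p, 3) for p in prediction])
```

Only the query and the head would be trained in such a setup; the tokens come from the frozen encoder, so probe accuracy reflects what the representation already contains.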

Reconstruction under many masks

A single driving sequence, tiled into non-overlapping ~1.4 s windows. Each window draws a fresh Dirichlet mask, so the per-sensor masking ratio varies widely as the clip plays, sometimes hiding most of the camera, sometimes most of the LiDAR. For every window we show Ground Truth, Masked Input, and Reconstruction across RGB, event, LiDAR, and IMU. The model rebuilds each modality from whatever sparse, cross-modal context survives the mask. This is the core self-supervised objective behind the representation.

Results: the gap widens when vision fails

Frozen and probed across depth, optical flow, semantic segmentation, and ego-motion, the late-fusion MAE outperforms every image-only foundation model (DINOv2/v3, SigLIP 2, Perception Encoder, V-JEPA 2.1) on all tasks. The story is sharpest under degraded and nighttime conditions.

4.77 m: depth RMSE under degradation (vs 7.40 m)

2.12 px: flow error under degradation (vs 10.85 px)

≈2×: larger depth advantage at night vs. day

112 ms: to encode all sensors on an embedded Jetson Orin NX

Ego-motion (translation, rotation, and steering) is predicted almost exactly (~0.06 m, ~0.23°), because the model reads LiDAR and inertial signals that RGB-only encoders simply lack. Leave-one-out probes show where each sensor matters. LiDAR dominates both ego-motion and dense prediction: dropping it alone sends depth from 4.73 to 6.74 m and flow from 1.97 to 7.19 px. RGB, in contrast, anchors the tokens for segmentation. The representation also transfers zero-shot to M3ED, a different platform and sensor configuration, staying competitive with foundation models trained on orders of magnitude more data.

Qualitative dense-perception rollout on one test sequence, rows: optical flow, depth, segmentation; columns: input RGB, ground truth, DINO v3, V-JEPA 2.1, and our late-fusion MAE.

Encoder (daytime test) | Depth (m) ↓ | Flow (px) ↓ | Seg (mIoU) ↑ | Trans. (m) ↓ | Rot. (°) ↓
DINO v3 | 6.93 | 19.43 | 0.382 | 0.93 | 0.79
V-JEPA 2.1 | 6.38 | 9.13 | 0.402 | 0.77 | 0.47
Late-fusion MAE (ours) | 4.73 | 1.97 | 0.411 | 0.06 | 0.24

Daytime test split; lower is better except segmentation (mIoU over 19 Cityscapes classes), reported where speed ≥ 5 mph. Full results, nighttime, and degraded splits are in the paper.
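For readers who want to reproduce numbers of this kind, the three dense metrics can be written compactly. These are the textbook definitions, sketched under assumptions: depth is scored only where ground truth is valid (here taken to mean `gt > 0`), the flow error is assumed to be mean end-point error, and mIoU is averaged over classes that appear in the prediction or the ground truth. The paper's exact evaluation protocol may differ.

```python
import math

def depth_rmse(pred, gt):
    """RMSE in metres over pixels with valid ground truth (gt > 0)."""
    errs = [(p - g) ** 2 for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum(errs) / len(errs))

def flow_epe(pred, gt):
    """Mean end-point error in pixels between predicted and true (du, dv)."""
    dists = [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]
    return sum(dists) / len(dists)

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Tiny flattened examples; the third depth pixel has no valid ground truth.
rmse = depth_rmse([10.0, 12.0, 5.0], [10.0, 8.0, 0.0])
epe = flow_epe([(3.0, 4.0)], [(0.0, 0.0)])
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(rmse, 3), epe, round(miou, 3))
```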

Beyond driving

The same sensor platform has been deployed beyond the car. The data release additionally includes boat and Unitree Go2-W quadruped sequences, tagged by a platform column in the metadata, a first step toward one perception model across very different platforms.

Limitations & future work

OctoSense is a research prototype. The dataset is orders of magnitude smaller than those used to train image foundation models like DINO and V-JEPA; scaling it up, ideally through a community-wide effort to pool multi-sensor data, is the path toward a true foundation model for multi-modal robot perception.

A key next step is moving beyond the 1.4 s window to longer temporal context with recurrent architectures, which would unlock tasks like object tracking and 3D instance detection and further cut inference cost. Event cameras and the IMU contribute little at our 5 Hz evaluation rate but should shine on high-speed, low-latency tasks. We leave that direction open, along with studying the unique, synergistic, and redundant information these sensors carry.

Get the data

The dataset (HDF5 schema, per-sequence metadata, downloads) lives on the Hugging Face dataset page. Start with the getting-started Colab, and find the collection / processing tooling on GitHub.

hf download anthonytec2/OctoSense --repo-type dataset --local-dir ./octosense

BibTeX

@misc{bisulco2026octosense,
  title        = {{OctoSense}: Self-Supervised Learning for Multimodal Robot Perception},
  author       = {Bisulco, Anthony and Wang, Jeremy and Daniilidis, Kostas and Balestriero, Randall and Chaudhari, Pratik},
  year         = {2026},
  howpublished = {Preprint},
}

Acknowledgments

This work was supported by grants from the National Science Foundation (IIS-2145164, CCF-2212519), the NSF and DoD OUSD (R&E) under Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence), DSO National Laboratories, Singapore, and the Office of Naval Research DURIP.