Robot Data Annotation: How to Label Robot Demonstrations for Training
Annotation is the least glamorous part of robot learning and the most consequential. A dataset of 500 well-annotated demonstrations will train a better policy than 2,000 poorly labeled ones. Here is what annotation means for robot data and how to do it right.
What Annotation Means for Robot Data
Unlike image classification, where annotation means drawing boxes or clicking labels, robot demonstration annotation is richer and more structured. A single robot episode — typically 20-200 seconds of manipulation — needs to be labeled at multiple levels: whether the episode was a success or failure, what language describes the task, where semantically distinct phases begin and end, and which frames, if any, should be excluded from training due to hardware errors or operator mistakes.
Annotation is typically done by human reviewers watching video replays of recorded episodes alongside plots of joint states and gripper aperture. Good annotation tools display synchronized video from multiple cameras simultaneously, making it easy to judge success from perspectives the robot's own cameras might not capture clearly.
Annotation Types: A Complete Taxonomy
Robot demonstration data supports a range of annotation types, each serving different downstream purposes. Understanding which annotations you need — and which you do not — is critical for managing annotation budget and timeline.
Episode-Level Annotations
| Annotation Type | Format | Required For | Cost per Episode |
|---|---|---|---|
| Binary success/failure | Boolean flag | All IL training (ACT, Diffusion Policy, VLA) | $0.05-0.10 |
| Language instruction | Free-text string | Language-conditioned policies, VLA fine-tuning | $0.10-0.25 |
| Task category ID | Integer enum | Multi-task policies, dataset organization | $0.02-0.05 |
| Demonstration quality score | 1-5 integer scale | Quality-weighted training, dataset curation | $0.10-0.20 |
| Object instance IDs | List of string IDs | Object-centric policies, diversity analysis | $0.05-0.10 |
Frame-Level Annotations
| Annotation Type | Format | Required For | Cost per Frame |
|---|---|---|---|
| Task phase segmentation | Timestamped phase boundaries | Hierarchical policies, sub-task analysis | $0.50-2.00/episode |
| Contact event labels | Timestamped contact start/end | Contact-aware policies, force control | $0.30-0.80/episode |
| Object bounding boxes | 2D bbox per frame per object | Object detection, visual grounding | $0.05-0.15/frame |
| Segmentation masks | Per-pixel object mask | Semantic segmentation, sim-to-real | $0.50-2.50/frame |
| Keypoint annotations | 2D pixel coordinates per keypoint | Keypoint-based policies, pose estimation | $0.10-0.30/frame |
| Frame exclusion flags | Boolean per frame | Removing hardware glitches, operator errors | $0.01-0.03/frame |
Most imitation learning projects need only the first two episode-level annotations (success flag + language instruction) and optionally task phase segmentation. Frame-level annotations like segmentation masks and keypoints are needed for specific policy architectures or perception pre-training — they are expensive and should not be applied by default.
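The episode-level annotations above can be captured in a single record per episode. A minimal sketch (the field names and class are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EpisodeAnnotation:
    """Episode-level labels for one recorded demonstration (illustrative)."""
    episode_id: str
    success: bool                      # binary success/failure flag
    instruction: str                   # natural language instruction
    task_category: int = 0             # integer task enum
    quality_score: int = 3             # 1-5 demonstration quality
    object_ids: List[str] = field(default_factory=list)

ann = EpisodeAnnotation(
    episode_id="ep_0001",
    success=True,
    instruction="pick up the white cup and place it on the blue plate",
    object_ids=["cup_white_01", "plate_blue_02"],
)
```

Storing annotations in one structured record per episode, separate from the raw sensor data, makes it cheap to re-review or re-export labels without touching the episode files.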
Success Flags: The Most Important Annotation
Every episode in a robot training dataset must be labeled with a binary success flag: did the robot complete the task successfully? This sounds simple, but success criteria must be defined precisely before annotation begins. "Place the cup on the plate" requires a specification: does the cup need to be upright, does the handle orientation matter, how much positional error is acceptable? Annotators applying different implicit standards to the same dataset create noisy labels that degrade training performance.
Write a one-page success specification document before annotation begins, with example images of success and failure cases. Use this document to calibrate annotators. Measure inter-annotator agreement on a shared subset of episodes — if agreement is below 90%, your success criteria need clarification. SVRC's annotation pipeline requires explicit success criteria documents and inter-annotator agreement checks before any dataset is marked ready for training.
Automated Success Classification
For datasets of 1,000+ episodes, manual success labeling becomes expensive. Automated classifiers can handle the bulk of labeling with human review on uncertain cases. The standard approach:
- Heuristic classifiers: Rule-based checks using final gripper state, end-effector position, and force/torque readings. Example: for a pick-place task, check if the gripper is open (released) AND the end-effector is within 3cm of the target position at the final frame. Covers 60-70% of episodes with >95% accuracy.
- Learned classifiers: Train a lightweight CNN (ResNet-18) on the final 10 frames of the wrist camera to predict success/failure. Requires 200+ manually labeled examples per task to train. Handles the remaining ambiguous cases with 85-92% accuracy.
- Human review: Reserve for the 10-15% of episodes where automated classifiers are uncertain (confidence between 0.3 and 0.7). This hybrid approach reduces annotation cost by 5-8x compared to full manual labeling.
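The heuristic stage and the confidence-based routing described above can be sketched as follows. The thresholds (4 cm gripper aperture for "open", 3 cm placement tolerance, 0.3/0.7 confidence bounds) are illustrative and should be tuned per task:

```python
import numpy as np

def heuristic_success(final_gripper_aperture, ee_pos, target_pos,
                      open_threshold=0.04, dist_threshold=0.03):
    """Rule-based success check for a pick-place episode.

    Returns True/False when both rules agree, or None to route the
    episode to the learned classifier / human review stage.
    """
    released = final_gripper_aperture > open_threshold
    at_target = np.linalg.norm(
        np.asarray(ee_pos) - np.asarray(target_pos)) < dist_threshold
    if released and at_target:
        return True
    if not released and not at_target:
        return False
    return None  # ambiguous: defer to the next stage

def route(confidence, lo=0.3, hi=0.7):
    """Route a learned-classifier confidence: auto-label or human review."""
    if confidence <= lo:
        return "failure"
    if confidence >= hi:
        return "success"
    return "human_review"
```

In practice the heuristic runs first; only episodes it returns `None` for are scored by the learned classifier, and only mid-confidence scores reach a human.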
Language Labels
Language annotations attach natural language descriptions to episodes or episode segments. These are required for training language-conditioned policies — policies that follow instructions like "pick up the red block" rather than having the task hardcoded. Language annotations also enable compatibility with vision-language-action (VLA) models and allow datasets to be searched and filtered by task description.
Write language annotations at two levels of specificity: a short task name ("cup placement") and a natural language instruction ("pick up the white cup and place it on the blue plate"). The instruction should describe what a human observer sees happening, not the robot's internal state. If your task involves task variations — different objects, different target locations — each variation should have a corresponding instruction that distinguishes it from the others.
Language Annotation Best Practices for VLA Training
If you plan to fine-tune VLA models like OpenVLA or RT-2, your language annotations need more care than simple task labels:
- Vary phrasing: Do not copy-paste the same instruction across all episodes. "Pick up the red cup" and "Grab the red mug and lift it" teach the model that different phrasings map to similar actions. Aim for 5-10 instruction variants per task.
- Include referring expressions: "Pick up the object on the left" rather than only "pick up the cup." This teaches spatial reasoning.
- Describe the full action: "Pick up the red cup from the table and place it on the blue plate" is better than "cup to plate." The full instruction maps to the complete demonstration trajectory.
- Use consistent object names: Decide whether it is a "cup," "mug," or "glass" and stick with it within a dataset. Inconsistent naming confuses the language encoder.
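A simple way to enforce phrasing variety while keeping object names consistent is to sample from a fixed template pool at annotation (or export) time. A sketch with hypothetical templates:

```python
import random

# Illustrative instruction templates for one pick-place task. Every
# variant describes the full action, and object names stay consistent
# because they are filled in from a single vocabulary.
TEMPLATES = [
    "pick up the {obj} and place it on the {target}",
    "grab the {obj}, then set it down on the {target}",
    "move the {obj} from the table to the {target}",
    "take the {obj} and put it on the {target}",
    "lift the {obj} and place it onto the {target}",
]

def sample_instruction(obj="red cup", target="blue plate", rng=None):
    """Sample one phrasing variant for a given object/target pair."""
    rng = rng or random.Random()
    return rng.choice(TEMPLATES).format(obj=obj, target=target)
```

Templates get you the 5-10 variants per task recommended above; referring expressions ("the object on the left") can be added as extra templates where workspace layout makes them unambiguous.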
Contact Labels for Force-Aware Policies
For tasks involving significant contact forces — insertion, assembly, surface following — timestamped contact event labels provide critical information that visual annotations cannot capture. Contact labels mark the frame at which the robot makes or breaks contact with an object, the type of contact (initial touch, sustained grasp, insertion contact, release), and optionally the contact force magnitude from a wrist F/T sensor.
Contact labels can be partially automated using force/torque sensor data: a threshold detector on the F/T signal identifies contact onset (force magnitude exceeds 0.5N from baseline) and release (force drops below threshold). Human review is needed to classify contact type (intentional grasp vs. collision) and to catch cases where the F/T threshold misclassifies incidental contact as task-relevant.
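The threshold detector described above is a few lines of code. A minimal sketch, assuming a precomputed force-magnitude signal and per-frame timestamps:

```python
import numpy as np

def detect_contact_events(force_mag, timestamps,
                          threshold=0.5, baseline_window=20):
    """Detect contact onset/release from a force-magnitude signal.

    force_mag: 1-D array of ||F|| readings from the wrist F/T sensor.
    Contact starts when force exceeds baseline + threshold (0.5 N, as
    above) and ends when it drops back below. Returns (start, end)
    timestamp pairs; contact *type* still needs human classification.
    """
    force_mag = np.asarray(force_mag, dtype=float)
    baseline = force_mag[:baseline_window].mean()  # pre-contact baseline
    in_contact = force_mag > baseline + threshold
    events, start = [], None
    for t, c in zip(timestamps, in_contact):
        if c and start is None:
            start = t                       # contact onset
        elif not c and start is not None:
            events.append((start, t))       # release
            start = None
    if start is not None:                   # episode ends in contact
        events.append((start, timestamps[-1]))
    return events
```

A real pipeline would also low-pass filter the signal and apply hysteresis (separate onset/release thresholds) to avoid chattering near the threshold; this sketch omits both for brevity.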
Task Segmentation
For long-horizon tasks involving multiple sequential sub-tasks, segmentation labels mark the boundaries between phases. A table-setting task might be segmented into: reach cup, grasp cup, transport cup, place cup, release cup. Segmentation enables hierarchical policy training, sub-task-level success metrics, and selective data augmentation. It also enables surgical debugging: if a policy fails during transport but succeeds during grasping, segmentation labels let you measure sub-task success rates and target data collection effort where it is needed most.
Segmentation annotation is more expensive than success flagging and not always necessary. Prioritize segmentation for tasks with three or more semantically distinct phases, or when you plan to use a hierarchical policy architecture.
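One common storage format for phase segmentation is an ordered list of (phase name, start, end) tuples per episode, which makes the sub-task success rates mentioned above trivial to compute. A sketch with made-up data:

```python
# Illustrative phase-segmentation format: ordered (name, start_s, end_s)
# tuples plus a per-phase success flag from review.
episodes = [
    {"phases": [("reach", 0.0, 1.8), ("grasp", 1.8, 3.1),
                ("transport", 3.1, 6.4), ("place", 6.4, 8.0)],
     "phase_success": {"reach": True, "grasp": True,
                       "transport": False, "place": False}},
    {"phases": [("reach", 0.0, 2.0), ("grasp", 2.0, 3.5),
                ("transport", 3.5, 7.0), ("place", 7.0, 9.2)],
     "phase_success": {"reach": True, "grasp": True,
                       "transport": True, "place": True}},
]

def subtask_success_rates(episodes):
    """Per-phase success rate across episodes, for targeting collection."""
    counts, wins = {}, {}
    for ep in episodes:
        for phase, ok in ep["phase_success"].items():
            counts[phase] = counts.get(phase, 0) + 1
            wins[phase] = wins.get(phase, 0) + int(ok)
    return {p: wins[p] / counts[p] for p in counts}
```

In this toy example, transport fails in half the episodes while reaching and grasping always succeed — exactly the signal you would use to target further data collection at the transport phase.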
Annotation Tooling Comparison
| Tool | License | Video Support | Time-Series | Multi-Camera Sync | Best For |
|---|---|---|---|---|---|
| Label Studio | Apache 2.0 | Yes | Limited | No (single video) | General annotation, extensible via templates |
| V7 (Darwin) | Commercial | Yes | No | No | Auto-labeling with SAM, managed workforce |
| CVAT | MIT | Yes | No | No | Bounding boxes, segmentation masks on video |
| Custom Gradio/Streamlit | In-house | Yes (full control) | Yes | Yes (custom build) | Robot-specific workflows with joint state plots |
| SVRC Platform | SaaS | Yes | Yes | Yes (native) | Purpose-built for robot episodes with all sensors |
The main limitation of general-purpose annotation tools (Label Studio, CVAT, V7) for robot data is that they are designed for image/video annotation, not synchronized multi-modal episode annotation. Reviewing a robot demonstration requires simultaneous display of 2-4 camera streams plus joint state time-series plus force/torque plots, all time-synchronized. Most teams building serious annotation pipelines end up with a custom Gradio or Streamlit app that reads HDF5 episodes directly. SVRC's data platform provides this as a web-based interface integrated with the data collection pipeline.
Annotation Cost Benchmarks
Based on SVRC's annotation operations, here are realistic cost benchmarks for different annotation configurations:
| Annotation Configuration | Cost per Episode | Time per Episode | Suitable For |
|---|---|---|---|
| Success flag only (automated + spot-check) | $0.05-0.10 | 5-10 sec | ACT, Diffusion Policy (single-task) |
| Success + language instruction | $0.15-0.35 | 20-40 sec | VLA fine-tuning, multi-task BC |
| Success + language + phase segmentation | $0.50-1.50 | 2-5 min | Hierarchical policies, detailed debugging |
| Full annotation (all above + bboxes) | $2.00-5.00 | 10-20 min | Object detection training, perception research |
| Segmentation masks (per frame, SAM-assisted) | $0.50-2.50/frame | 30-120 sec/frame | Sim-to-real domain adaptation, visual pre-training |
For a 500-episode dataset annotated with success flags and language instructions (the most common configuration for VLA fine-tuning), total annotation cost is approximately $75-175. This is a small fraction of the data collection cost and should never be skimped on — the marginal cost of annotation is vastly lower than the cost of retraining on poorly labeled data.
Quality Control: Inter-Annotator Agreement
Inter-annotator agreement (IAA) measures how consistently multiple annotators label the same data. For robot data, the relevant metric is Cohen's kappa (for two annotators) or Fleiss' kappa (for three or more).
Target IAA thresholds for robot data:
- Binary success labels: kappa > 0.85 (achievable with clear success criteria document)
- Task phase boundaries: kappa > 0.75 (some boundary ambiguity is inherent)
- Demonstration quality scores: kappa > 0.65 (quality is more subjective; weighted kappa is more appropriate)
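Cohen's kappa for two annotators is straightforward to compute from the observed agreement and the chance agreement implied by each annotator's label frequencies. A self-contained sketch (scikit-learn's `cohen_kappa_score` gives the same result):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same episodes."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of episodes where annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance: two annotators who both label 90% of episodes "success" will agree often by luck alone, and raw percent agreement overstates their consistency.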
If your IAA falls below these thresholds, the annotation criteria need refinement before proceeding. Common fixes:
- Low success label agreement: Add photo examples of edge cases to the success criteria document. Define exact positional tolerance (e.g., "object center within 2cm of target center").
- Low segmentation agreement: Define phase transitions in terms of observable physical events ("gripper closes on object" not "approach phase ends"). Timestamp should correspond to the frame where the event occurs, not the frame before or after.
- Low quality score agreement: Reduce the scale from 5 to 3 levels (good/acceptable/reject). Finer scales are unreliable for subjective quality judgments.
Annotation Quality Standards
SVRC applies a three-stage quality gate to all datasets: operator self-annotation immediately after recording, secondary review by a trained annotator, and automated consistency checks comparing annotations against joint state statistics (e.g., episodes marked success where the gripper never closed are flagged for re-review).
Automated consistency checks that catch common errors:
- Success-labeled episode where end-effector never moves more than 5cm from start position → flag
- Success-labeled episode shorter than 50% of median episode length → flag
- Language instruction mentions "left" but all objects are on the right side of the workspace → flag
- Phase segmentation where any phase is shorter than 0.5s or longer than 60s → flag
- Two adjacent episodes with identical language instructions but different task categories → flag
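Several of the checks above can be implemented directly against per-episode kinematic statistics. A sketch, assuming an illustrative episode dict (real pipelines would read these fields from HDF5/RLDS episode files):

```python
import numpy as np

def consistency_flags(episode):
    """Flag annotation/kinematics mismatches like those listed above.

    `episode` is an illustrative dict: success (bool), ee_positions
    (N x 3), gripper_aperture (N), duration_s, median_duration_s.
    Returns a list of human-readable flag reasons (empty = passes).
    """
    flags = []
    pos = np.asarray(episode["ee_positions"], dtype=float)
    travel = np.linalg.norm(pos - pos[0], axis=1).max()
    if episode["success"] and travel < 0.05:
        flags.append("success but end-effector never moved >5cm")
    if episode["success"] and \
            np.asarray(episode["gripper_aperture"]).min() > 0.02:
        flags.append("success but gripper never closed")
    if episode["success"] and \
            episode["duration_s"] < 0.5 * episode["median_duration_s"]:
        flags.append("success but episode <50% of median length")
    return flags
```

Flagged episodes go back to human review rather than being auto-relabeled; the checks exist to catch annotation mistakes, not to overrule annotators.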
SVRC's Annotation Pipeline
When you use SVRC's data collection services, annotation is part of the deliverable. Our operators annotate each episode with success flags and language labels during the recording session, and our annotation team performs secondary review before dataset export. You receive a dataset with high-confidence annotations, annotator agreement scores, and a full quality report.
For teams bringing their own collected data, SVRC offers annotation-only services at the following tiers:
- Basic ($0.10/episode): Success flag verification and cleanup of existing labels
- Standard ($0.30/episode): Success flags + language instructions + quality score
- Full ($1.50/episode): All annotations including phase segmentation and contact labels
We can process existing datasets collected on any supported hardware platform (ALOHA, OpenArm, Franka, UR, Unitree). Datasets must be in HDF5 or RLDS format with video streams accessible for review. Contact us to discuss your dataset annotation needs, or explore our annotation interface through the SVRC data platform.