Robot Data Annotation: How to Label Robot Demonstrations for Training
Annotation is the least glamorous part of robot learning and the most consequential. A dataset of 500 well-annotated demonstrations will train a better policy than 2,000 poorly labeled ones. Here is what annotation means for robot data and how to do it right.
What Annotation Means for Robot Data
Unlike image classification, where annotation means drawing boxes or clicking labels, robot demonstration annotation is richer and more structured. A single robot episode — typically 20-200 seconds of manipulation — needs to be labeled at multiple levels: whether the episode was a success or failure, what language describes the task, where semantically distinct phases begin and end, and which frames, if any, should be excluded from training due to hardware errors or operator mistakes.
Annotation is typically done by human reviewers watching video replays of recorded episodes alongside plots of joint states and gripper aperture. Good annotation tools display synchronized video from multiple cameras simultaneously, making it easy to judge success from perspectives the robot's own cameras might not capture clearly.
Annotation Types: A Complete Taxonomy
Robot demonstration data supports a range of annotation types, each serving different downstream purposes. Understanding which annotations you need — and which you do not — is critical for managing annotation budget and timeline.
Episode-Level Annotations
| Annotation Type | Format | Required For | Cost per Episode |
|---|---|---|---|
| Binary success/failure | Boolean flag | All IL training (ACT, Diffusion Policy, VLA) | $0.05-0.10 |
| Language instruction | Free-text string | Language-conditioned policies, VLA fine-tuning | $0.10-0.25 |
| Task category ID | Integer enum | Multi-task policies, dataset organization | $0.02-0.05 |
| Demonstration quality score | 1-5 integer scale | Quality-weighted training, dataset curation | $0.10-0.20 |
| Object instance IDs | List of string IDs | Object-centric policies, diversity analysis | $0.05-0.10 |
Frame-Level Annotations
| Annotation Type | Format | Required For | Cost per Frame |
|---|---|---|---|
| Task phase segmentation | Timestamped phase boundaries | Hierarchical policies, sub-task analysis | $0.50-2.00/episode |
| Contact event labels | Timestamped contact start/end | Contact-aware policies, force control | $0.30-0.80/episode |
| Object bounding boxes | 2D bbox per frame per object | Object detection, visual grounding | $0.05-0.15/frame |
| Segmentation masks | Per-pixel object mask | Semantic segmentation, sim-to-real | $0.50-2.50/frame |
| Keypoint annotations | 2D pixel coordinates per keypoint | Keypoint-based policies, pose estimation | $0.10-0.30/frame |
| Frame exclusion flags | Boolean per frame | Removing hardware glitches, operator errors | $0.01-0.03/frame |
Most imitation learning projects need only the first two episode-level annotations (success flag + language instruction) and optionally task phase segmentation. Frame-level annotations like segmentation masks and keypoints are needed for specific policy architectures or perception pre-training — they are expensive and should not be applied by default.
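The episode-level annotations above can be captured in a single record per episode. A minimal sketch (the field names and class are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EpisodeAnnotation:
    """Episode-level labels for one recorded demonstration (illustrative)."""
    episode_id: str
    success: bool                      # binary success/failure flag
    instruction: str                   # natural language instruction
    task_category: int = 0             # integer task enum
    quality_score: int = 3             # 1-5 demonstration quality
    object_ids: List[str] = field(default_factory=list)

ann = EpisodeAnnotation(
    episode_id="ep_0001",
    success=True,
    instruction="pick up the white cup and place it on the blue plate",
    object_ids=["cup_white_01", "plate_blue_02"],
)
```

Storing annotations in one structured record per episode, separate from the raw sensor data, makes it cheap to re-review or re-export labels without touching the episode files.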
Success Flags: The Most Important Annotation
Every episode in a robot training dataset must be labeled with a binary success flag: did the robot complete the task successfully? This sounds simple, but success criteria must be defined precisely before annotation begins. "Place the cup on the plate" requires a specification: does the cup need to be upright, does the handle orientation matter, how much positional error is acceptable? Annotators applying different implicit standards to the same dataset create noisy labels that degrade training performance.
Write a one-page success specification document before annotation begins, with example images of success and failure cases. Use this document to calibrate annotators. Measure inter-annotator agreement on a shared subset of episodes — if agreement is below 90%, your success criteria need clarification. SVRC's annotation pipeline requires explicit success criteria documents and inter-annotator agreement checks before any dataset is marked ready for training.
Automated Success Classification
For datasets of 1,000+ episodes, manual success labeling becomes expensive. Automated classifiers can handle the bulk of labeling with human review on uncertain cases. The standard approach:
- Heuristic classifiers: Rule-based checks using final gripper state, end-effector position, and force/torque readings. Example: for a pick-place task, check if the gripper is open (released) AND the end-effector is within 3cm of the target position at the final frame. Covers 60-70% of episodes with >95% accuracy.
- Learned classifiers: Train a lightweight CNN (ResNet-18) on the final 10 frames of the wrist camera to predict success/failure. Requires 200+ manually labeled examples per task to train. Handles the remaining ambiguous cases with 85-92% accuracy.
- Human review: Reserve for the 10-15% of episodes where automated classifiers are uncertain (confidence between 0.3 and 0.7). This hybrid approach reduces annotation cost by 5-8x compared to full manual labeling.
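The heuristic stage and the confidence-based routing described above can be sketched as follows. The thresholds (4 cm gripper aperture for "open", 3 cm placement tolerance, 0.3/0.7 confidence bounds) are illustrative and should be tuned per task:

```python
import numpy as np

def heuristic_success(final_gripper_aperture, ee_pos, target_pos,
                      open_threshold=0.04, dist_threshold=0.03):
    """Rule-based success check for a pick-place episode.

    Returns True/False when both rules agree, or None to route the
    episode to the learned classifier / human review stage.
    """
    released = final_gripper_aperture > open_threshold
    at_target = np.linalg.norm(
        np.asarray(ee_pos) - np.asarray(target_pos)) < dist_threshold
    if released and at_target:
        return True
    if not released and not at_target:
        return False
    return None  # ambiguous: defer to the next stage

def route(confidence, lo=0.3, hi=0.7):
    """Route a learned-classifier confidence: auto-label or human review."""
    if confidence <= lo:
        return "failure"
    if confidence >= hi:
        return "success"
    return "human_review"
```

In practice the heuristic runs first; only episodes it returns `None` for are scored by the learned classifier, and only mid-confidence scores reach a human.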
Language Labels
Language annotations attach natural language descriptions to episodes or episode segments. These are required for training language-conditioned policies — policies that follow instructions like "pick up the red block" rather than having the task hardcoded. Language annotations also enable compatibility with vision-language-action (VLA) models and allow datasets to be searched and filtered by task description.
Write language annotations at two levels of specificity: a short task name ("cup placement") and a natural language instruction ("pick up the white cup and place it on the blue plate"). The instruction should describe what a human observer sees happening, not the robot's internal state. If your task involves task variations — different objects, different target locations — each variation should have a corresponding instruction that distinguishes it from the others.
Language Annotation Best Practices for VLA Training
If you plan to fine-tune VLA models like OpenVLA or RT-2, your language annotations need more care than simple task labels:
- Vary phrasing: Do not copy-paste the same instruction across all episodes. "Pick up the red cup" and "Grab the red mug and lift it" teach the model that different phrasings map to similar actions. Aim for 5-10 instruction variants per task.
- Include referring expressions: "Pick up the object on the left" rather than only "pick up the cup." This teaches spatial reasoning.
- Describe the full action: "Pick up the red cup from the table and place it on the blue plate" is better than "cup to plate." The full instruction maps to the complete demonstration trajectory.
- Use consistent object names: Decide whether it is a "cup," "mug," or "glass" and stick with it within a dataset. Inconsistent naming confuses the language encoder.
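A simple way to enforce phrasing variety while keeping object names consistent is to sample from a fixed template pool at annotation (or export) time. A sketch with hypothetical templates:

```python
import random

# Illustrative instruction templates for one pick-place task. Every
# variant describes the full action, and object names stay consistent
# because they are filled in from a single vocabulary.
TEMPLATES = [
    "pick up the {obj} and place it on the {target}",
    "grab the {obj}, then set it down on the {target}",
    "move the {obj} from the table to the {target}",
    "take the {obj} and put it on the {target}",
    "lift the {obj} and place it onto the {target}",
]

def sample_instruction(obj="red cup", target="blue plate", rng=None):
    """Sample one phrasing variant for a given object/target pair."""
    rng = rng or random.Random()
    return rng.choice(TEMPLATES).format(obj=obj, target=target)
```

Templates get you the 5-10 variants per task recommended above; referring expressions ("the object on the left") can be added as extra templates where workspace layout makes them unambiguous.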
Contact Labels for Force-Aware Policies
For tasks involving significant contact forces — insertion, assembly, surface following — timestamped contact event labels provide critical information that visual annotations cannot capture. Contact labels mark the frame at which the robot makes or breaks contact with an object, the type of contact (initial touch, sustained grasp, insertion contact, release), and optionally the contact force magnitude from a wrist F/T sensor.
Contact labels can be partially automated using force/torque sensor data: a threshold detector on the F/T signal identifies contact onset (force magnitude exceeds 0.5N from baseline) and release (force drops below threshold). Human review is needed to classify contact type (intentional grasp vs. collision) and to catch cases where the F/T threshold misclassifies incidental contact as task-relevant.
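The threshold detector described above is a few lines of code. A minimal sketch, assuming a precomputed force-magnitude signal and per-frame timestamps:

```python
import numpy as np

def detect_contact_events(force_mag, timestamps,
                          threshold=0.5, baseline_window=20):
    """Detect contact onset/release from a force-magnitude signal.

    force_mag: 1-D array of ||F|| readings from the wrist F/T sensor.
    Contact starts when force exceeds baseline + threshold (0.5 N, as
    above) and ends when it drops back below. Returns (start, end)
    timestamp pairs; contact *type* still needs human classification.
    """
    force_mag = np.asarray(force_mag, dtype=float)
    baseline = force_mag[:baseline_window].mean()  # pre-contact baseline
    in_contact = force_mag > baseline + threshold
    events, start = [], None
    for t, c in zip(timestamps, in_contact):
        if c and start is None:
            start = t                       # contact onset
        elif not c and start is not None:
            events.append((start, t))       # release
            start = None
    if start is not None:                   # episode ends in contact
        events.append((start, timestamps[-1]))
    return events
```

A real pipeline would also low-pass filter the signal and apply hysteresis (separate onset/release thresholds) to avoid chattering near the threshold; this sketch omits both for brevity.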
Task Segmentation
For long-horizon tasks involving multiple sequential sub-tasks, segmentation labels mark the boundaries between phases. A table-setting task might be segmented into: reach cup, grasp cup, transport cup, place cup, release cup. Segmentation enables hierarchical policy training, sub-task-level success metrics, and selective data augmentation. It also enables surgical debugging: if a policy fails during transport but succeeds during grasping, segmentation labels let you measure sub-task success rates and target data collection effort where it is needed most.
Segmentation annotation is more expensive than success flagging and not always necessary. Prioritize segmentation for tasks with three or more semantically distinct phases, or when you plan to use a hierarchical policy architecture.
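One common storage format for phase segmentation is an ordered list of (phase name, start, end) tuples per episode, which makes the sub-task success rates mentioned above trivial to compute. A sketch with made-up data:

```python
# Illustrative phase-segmentation format: ordered (name, start_s, end_s)
# tuples plus a per-phase success flag from review.
episodes = [
    {"phases": [("reach", 0.0, 1.8), ("grasp", 1.8, 3.1),
                ("transport", 3.1, 6.4), ("place", 6.4, 8.0)],
     "phase_success": {"reach": True, "grasp": True,
                       "transport": False, "place": False}},
    {"phases": [("reach", 0.0, 2.0), ("grasp", 2.0, 3.5),
                ("transport", 3.5, 7.0), ("place", 7.0, 9.2)],
     "phase_success": {"reach": True, "grasp": True,
                       "transport": True, "place": True}},
]

def subtask_success_rates(episodes):
    """Per-phase success rate across episodes, for targeting collection."""
    counts, wins = {}, {}
    for ep in episodes:
        for phase, ok in ep["phase_success"].items():
            counts[phase] = counts.get(phase, 0) + 1
            wins[phase] = wins.get(phase, 0) + int(ok)
    return {p: wins[p] / counts[p] for p in counts}
```

In this toy example, transport fails in half the episodes while reaching and grasping always succeed — exactly the signal you would use to target further data collection at the transport phase.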
Annotation Tooling Comparison
| Tool | License | Video Support | Time-Series | Multi-Camera Sync | Best For |
|---|---|---|---|---|---|
| Label Studio | Apache 2.0 | Yes | Limited | No (single video) | General annotation, extensible via templates |
| V7 (Darwin) | Commercial | Yes | No | No | Auto-labeling with SAM, managed workforce |
| CVAT | MIT | Yes | No | No | Bounding boxes, segmentation masks on video |
| Custom Gradio/Streamlit | In-house | Yes (full control) | Yes | Yes (custom build) | Robot-specific workflows with joint state plots |
| SVRC Platform | SaaS | Yes | Yes | Yes (native) | Purpose-built for robot episodes with all sensors |
The main limitation of general-purpose annotation tools (Label Studio, CVAT, V7) for robot data is that they are designed for image/video annotation, not synchronized multi-modal episode annotation. Reviewing a robot demonstration requires simultaneous display of 2-4 camera streams plus joint state time-series plus force/torque plots, all time-synchronized. Most teams building serious annotation pipelines end up with a custom Gradio or Streamlit app that reads HDF5 episodes directly. SVRC's data platform provides this as a web-based interface integrated with the data collection pipeline.
Annotation Cost Benchmarks
Based on SVRC's annotation operations, here are realistic cost benchmarks for different annotation configurations:
| Annotation Configuration | Cost per Episode | Time per Episode | Suitable For |
|---|---|---|---|
| Success flag only (automated + spot-check) | $0.05-0.10 | 5-10 sec | ACT, Diffusion Policy (single-task) |
| Success + language instruction | $0.15-0.35 | 20-40 sec | VLA fine-tuning, multi-task BC |
| Success + language + phase segmentation | $0.50-1.50 | 2-5 min | Hierarchical policies, detailed debugging |
| Full annotation (all above + bboxes) | $2.00-5.00 | 10-20 min | Object detection training, perception research |
| Segmentation masks (per frame, SAM-assisted) | $0.50-2.50/frame | 30-120 sec/frame | Sim-to-real domain adaptation, visual pre-training |
For a 500-episode dataset annotated with success flags and language instructions (the most common configuration for VLA fine-tuning), total annotation cost is approximately $75-175. This is a small fraction of the data collection cost and should never be skimped on — the marginal cost of annotation is vastly lower than the cost of retraining on poorly labeled data.
Quality Control: Inter-Annotator Agreement
Inter-annotator agreement (IAA) measures how consistently multiple annotators label the same data. For robot data, the relevant metric is Cohen's kappa (for two annotators) or Fleiss' kappa (for three or more).
Target IAA thresholds for robot data:
- Binary success labels: kappa > 0.85 (achievable with clear success criteria document)
- Task phase boundaries: kappa > 0.75 (some boundary ambiguity is inherent)
- Demonstration quality scores: kappa > 0.65 (quality is more subjective; weighted kappa is more appropriate)
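Cohen's kappa for two annotators is straightforward to compute from the observed agreement and the chance agreement implied by each annotator's label frequencies. A self-contained sketch (scikit-learn's `cohen_kappa_score` gives the same result):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same episodes."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of episodes where annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance: two annotators who both label 90% of episodes "success" will agree often by luck alone, and raw percent agreement overstates their consistency.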
If your IAA falls below these thresholds, the annotation criteria need refinement before proceeding. Common fixes:
- Low success label agreement: Add photo examples of edge cases to the success criteria document. Define exact positional tolerance (e.g., "object center within 2cm of target center").
- Low segmentation agreement: Define phase transitions in terms of observable physical events ("gripper closes on object" not "approach phase ends"). Timestamp should correspond to the frame where the event occurs, not the frame before or after.
- Low quality score agreement: Reduce the scale from 5 to 3 levels (good/acceptable/reject). Finer scales are unreliable for subjective quality judgments.
Annotation Quality Standards
SVRC applies a three-stage quality gate to all datasets: operator self-annotation immediately after recording, secondary review by a trained annotator, and automated consistency checks comparing annotations against joint state statistics (e.g., episodes marked success where the gripper never closed are flagged for re-review).
Automated consistency checks that catch common errors:
- Success-labeled episode where end-effector never moves more than 5cm from start position → flag
- Success-labeled episode shorter than 50% of median episode length → flag
- Language instruction mentions "left" but all objects are on the right side of the workspace → flag
- Phase segmentation where any phase is shorter than 0.5s or longer than 60s → flag
- Two adjacent episodes with identical language instructions but different task categories → flag
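Several of the checks above can be implemented directly against per-episode kinematic statistics. A sketch, assuming an illustrative episode dict (real pipelines would read these fields from HDF5/RLDS episode files):

```python
import numpy as np

def consistency_flags(episode):
    """Flag annotation/kinematics mismatches like those listed above.

    `episode` is an illustrative dict: success (bool), ee_positions
    (N x 3), gripper_aperture (N), duration_s, median_duration_s.
    Returns a list of human-readable flag reasons (empty = passes).
    """
    flags = []
    pos = np.asarray(episode["ee_positions"], dtype=float)
    travel = np.linalg.norm(pos - pos[0], axis=1).max()
    if episode["success"] and travel < 0.05:
        flags.append("success but end-effector never moved >5cm")
    if episode["success"] and \
            np.asarray(episode["gripper_aperture"]).min() > 0.02:
        flags.append("success but gripper never closed")
    if episode["success"] and \
            episode["duration_s"] < 0.5 * episode["median_duration_s"]:
        flags.append("success but episode <50% of median length")
    return flags
```

Flagged episodes go back to human review rather than being auto-relabeled; the checks exist to catch annotation mistakes, not to overrule annotators.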
SVRC's Annotation Pipeline
When you use SVRC's data collection services, annotation is part of the deliverable. Our operators annotate each episode with success flags and language labels during the recording session, and our annotation team performs secondary review before dataset export. You receive a dataset with high-confidence annotations, annotator agreement scores, and a full quality report.
For teams bringing their own collected data, SVRC offers annotation-only services at the following tiers:
- Basic ($0.10/episode): Success flag verification and cleanup of existing labels
- Standard ($0.30/episode): Success flags + language instructions + quality score
- Full ($1.50/episode): All annotations including phase segmentation and contact labels
We can process existing datasets collected on any supported hardware platform (ALOHA, OpenArm, Franka, UR, Unitree). Datasets must be in HDF5 or RLDS format with video streams accessible for review. Contact us to discuss your dataset annotation needs, or explore our annotation interface through the SVRC data platform.