
# Evaluate Model Predictions in FiftyOne

**Skill:** `fiftyone-model-evaluation`

Evaluate model predictions against ground truth using COCO, Open Images, or custom protocols. Use when computing mAP, precision, recall, confusion matrices, or analyzing TP/FP/FN examples for detection, classification, segmentation, or regression tasks.
## Key Directives

ALWAYS follow these rules:

**1. Check if dataset exists and has required fields**

```
list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
```

Verify the dataset has both prediction and ground truth fields of compatible types.

**2. Install evaluation plugin if not available**

```
list_plugins()

# If @voxel51/evaluation not listed:
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")
```

**3. Ask user for evaluation parameters**

Always confirm with the user:

- Prediction field name
- Ground truth field name
- Evaluation key (unique identifier for this evaluation)
- Evaluation method (`coco`, `open-images`, `simple`, `top-k`, `binary`)
- Whether to compute mAP (for detection tasks)

**4. Launch App for evaluation operators**

```
launch_app(dataset_name="my-dataset")
```

**5. Close app when done**

```
close_app()
```
## Workflow

### Step 1: Verify Dataset and Fields

```
list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
```

Review:

- Sample count
- Available label fields and their types
- Identify the prediction field (model outputs)
- Identify the ground truth field (annotations)
**Label Types and Compatible Evaluations:**

| Label Type | Evaluation Function | Supported Methods |
|---|---|---|
| `Detections` | `evaluate_detections()` | `coco`, `open-images` |
| `Polylines` | `evaluate_detections()` | `coco`, `open-images` |
| `Keypoints` | `evaluate_detections()` | `coco`, `open-images` |
| `TemporalDetections` | `evaluate_detections()` | `activitynet` |
| `Classification` | `evaluate_classifications()` | `simple`, `top-k`, `binary` |
| `Segmentation` | `evaluate_segmentations()` | `simple` |
| `Regression` | `evaluate_regressions()` | `simple` |
### Step 2: Ensure Evaluation Plugin is Installed

```
list_plugins()
```

If `@voxel51/evaluation` is not in the list:

```
download_plugin(
    url_or_repo="voxel51/fiftyone-plugins",
    plugin_names=["@voxel51/evaluation"]
)
enable_plugin(plugin_name="@voxel51/evaluation")
```

### Step 3: Launch App

```
launch_app(dataset_name="my-dataset")
```
### Step 4: Run Evaluation

Ask the user for:

- Prediction field (`pred_field`)
- Ground truth field (`gt_field`)
- Evaluation key (`eval_key`), which must be a unique identifier
- Evaluation method

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval",
        "method": "coco",
        "iou": 0.5,
        "compute_mAP": true
    }
)
```
### Step 5: View Results

After evaluation, the dataset will have new fields:

- `{eval_key}_tp`: true positive count per sample
- `{eval_key}_fp`: false positive count per sample
- `{eval_key}_fn`: false negative count per sample

View only samples with false positives:

```
set_view(filters={"eval_fp": {"$gt": 0}})
```

Use the Model Evaluation Panel in the App to interactively explore:

- Summary metrics (mAP, precision, recall)
- Confusion matrices
- Per-class performance
- Scenario analysis
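The confusion matrices in the panel summarize (ground truth, prediction) label pairs. As a rough pure-Python illustration of that tally (the labels below are made up for the example):

```python
from collections import Counter

def confusion_matrix(gt_labels, pred_labels):
    """Tally (ground_truth, prediction) pairs into a nested dict."""
    matrix = {}
    for (gt, pred), n in Counter(zip(gt_labels, pred_labels)).items():
        matrix.setdefault(gt, {})[pred] = n
    return matrix

gt = ["cat", "cat", "dog", "dog", "dog"]
pred = ["cat", "dog", "dog", "dog", "cat"]
print(confusion_matrix(gt, pred))
# {'cat': {'cat': 1, 'dog': 1}, 'dog': {'dog': 2, 'cat': 1}}
```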
### Step 6: View Evaluation Patches (TP/FP/FN)

To examine individual true positives, false positives, and false negatives, guide users to the Python SDK:

```python
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-dataset")

# Convert to evaluation patches view
eval_patches = dataset.to_evaluation_patches("eval")

# Count patches by type
print(eval_patches.count_values("type"))
# Example output: {'fn': 246, 'fp': 4131, 'tp': 986}

# View only false positives
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
```

### Step 7: Clean Up

```
close_app()
```
## Evaluation Types

### Detection Evaluation

For `Detections`, `Polylines`, and `Keypoints` labels.

**COCO-style (default):**

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_coco",
        "method": "coco",
        "iou": 0.5,
        "classwise": true,
        "compute_mAP": true
    }
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `iou` | float | 0.5 | IoU threshold for matching |
| `classwise` | bool | true | Only match objects with the same class |
| `compute_mAP` | bool | false | Compute mAP, mAR, and PR curves |
| `use_masks` | bool | false | Use instance masks for IoU (if available) |
| `iscrowd` | string | null | Attribute name for crowd annotations |
| `iou_threshs` | string | null | Comma-separated IoU thresholds for mAP |
| `max_preds` | int | null | Max predictions per sample for mAP |
**Open Images-style:**

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_oi",
        "method": "open-images",
        "iou": 0.5
    }
)
```

Supports additional parameters:

- `pos_label_field`: `Classifications` field specifying which classes should be evaluated
- `neg_label_field`: `Classifications` field specifying which classes should NOT be evaluated
**ActivityNet-style (temporal):**

For `TemporalDetections` in video datasets:

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_temporal",
        "method": "activitynet",
        "compute_mAP": true
    }
)
```
### Classification Evaluation

For `Classification` labels.

**Simple (default):**

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_cls",
        "method": "simple"
    }
)
```

The per-sample field `{eval_key}` stores a boolean indicating whether the prediction was correct.

**Top-k:**

Requires predictions with a populated `logits` field:

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_topk",
        "method": "top-k",
        "k": 5
    }
)
```
**Binary:**

For binary classifiers:

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_binary",
        "method": "binary"
    }
)
```

The per-sample field `{eval_key}` stores one of `"tp"`, `"fp"`, `"tn"`, or `"fn"`.
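From these per-sample outcomes, precision and recall follow directly; a minimal sketch of the arithmetic (the outcome list here is illustrative):

```python
from collections import Counter

def binary_metrics(outcomes):
    """Compute (precision, recall) from a list of 'tp'/'fp'/'tn'/'fn' strings."""
    c = Counter(outcomes)
    tp, fp, fn = c["tp"], c["fp"], c["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(binary_metrics(["tp", "tp", "fp", "fn", "tn"]))  # precision 2/3, recall 2/3
```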
### Segmentation Evaluation

For `Segmentation` labels.

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg",
        "method": "simple",
        "bandwidth": 5  # Optional: evaluate only boundary pixels
    }
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `bandwidth` | int | null | Width in pixels along mask contours to evaluate (null = entire mask) |
| `average` | string | "micro" | Averaging strategy: `micro`, `macro`, `weighted`, or `samples` |

Per-sample fields:

- `{eval_key}_accuracy`
- `{eval_key}_precision`
- `{eval_key}_recall`
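For intuition, the accuracy component can be sketched over flattened label masks (a simplified stand-in that ignores `bandwidth` and class averaging; the masks below are made up):

```python
def pixel_accuracy(gt_mask, pred_mask):
    """Fraction of pixels whose predicted class matches ground truth."""
    assert len(gt_mask) == len(pred_mask)
    correct = sum(g == p for g, p in zip(gt_mask, pred_mask))
    return correct / len(gt_mask)

gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(pixel_accuracy(gt, pred))  # 4/6 ≈ 0.667
```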
### Regression Evaluation

For `Regression` labels.

```
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_reg",
        "method": "simple",
        "metric": "squared_error"  # or "absolute_error"
    }
)
```
The per-sample field `{eval_key}` stores the error value.

Available metrics:

- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Median Absolute Error
- R² Score
- Explained Variance Score
- Max Error
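Several of these aggregates can be sketched in plain Python from paired ground truth/prediction values (illustrative only; the operator computes them for you):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R² from paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot if ss_tot else 0.0
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```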
## Managing Evaluations

### List Existing Evaluations

```
execute_operator(
    operator_uri="@voxel51/evaluation/get_evaluation_info",
    params={
        "eval_key": "eval"
    }
)
```

### Load Evaluation View

Load the exact view on which an evaluation was performed:

```
execute_operator(
    operator_uri="@voxel51/evaluation/load_evaluation_view",
    params={
        "eval_key": "eval",
        "select_fields": false
    }
)
```

### Rename Evaluation

```
execute_operator(
    operator_uri="@voxel51/evaluation/rename_evaluation",
    params={
        "eval_key": "eval",
        "new_eval_key": "eval_v2"
    }
)
```

### Delete Evaluation

```
execute_operator(
    operator_uri="@voxel51/evaluation/delete_evaluation",
    params={
        "eval_key": "eval"
    }
)
```
## Common Use Cases

### Use Case 1: Evaluate Object Detection Model

```
# Verify dataset has detection fields
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")

# Launch app
launch_app(dataset_name="my-dataset")

# Run COCO-style evaluation with mAP
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval",
        "method": "coco",
        "iou": 0.5,
        "compute_mAP": true
    }
)

# View samples with more than 5 false positives
set_view(filters={"eval_fp": {"$gt": 5}})
```
### Use Case 2: Compare Two Detection Models

```
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")

# Evaluate first model
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "model_a_predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_model_a",
        "method": "coco",
        "compute_mAP": true
    }
)

# Evaluate second model
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "model_b_predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_model_b",
        "method": "coco",
        "compute_mAP": true
    }
)

# Use the Model Evaluation Panel to compare results
```
### Use Case 3: Evaluate Classification Model

```
set_context(dataset_name="my-classification-dataset")
launch_app(dataset_name="my-classification-dataset")

# Simple classification evaluation
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_cls",
        "method": "simple"
    }
)

# View misclassified samples
set_view(filters={"eval_cls": false})
```
### Use Case 4: Evaluate at Different IoU Thresholds

```
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")

# Strict evaluation (IoU 0.75)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_strict",
        "method": "coco",
        "iou": 0.75,
        "compute_mAP": true
    }
)

# Lenient evaluation (IoU 0.25)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_lenient",
        "method": "coco",
        "iou": 0.25,
        "compute_mAP": true
    }
)
```
### Use Case 5: Evaluate Segmentation Model

```
set_context(dataset_name="my-segmentation-dataset")
launch_app(dataset_name="my-segmentation-dataset")

# Full mask evaluation
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg",
        "method": "simple"
    }
)

# Boundary-only evaluation (5 pixel bandwidth)
execute_operator(
    operator_uri="@voxel51/evaluation/evaluate_model",
    params={
        "pred_field": "predictions",
        "gt_field": "ground_truth",
        "eval_key": "eval_seg_boundary",
        "method": "simple",
        "bandwidth": 5
    }
)
```
## Python SDK Alternative

For more control over evaluation and access to the full results, guide users to the Python SDK:

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load dataset
dataset = fo.load_dataset("my-dataset")

# Evaluate detections
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    method="coco",
    iou=0.5,
    compute_mAP=True,
)

# Print classification report
results.print_report()

# Get mAP value
print(f"mAP: {results.mAP():.3f}")

# Plot confusion matrix (interactive)
plot = results.plot_confusion_matrix()
plot.show()

# Plot precision-recall curves
plot = results.plot_pr_curves(classes=["person", "car", "dog"])
plot.show()

# Convert to evaluation patches to view TP/FP/FN
eval_patches = dataset.to_evaluation_patches("eval")
print(eval_patches.count_values("type"))

# View false positives in the App
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
```
Python SDK evaluation methods:

- `dataset.evaluate_detections()`: object detection
- `dataset.evaluate_classifications()`: classification
- `dataset.evaluate_segmentations()`: semantic segmentation
- `dataset.evaluate_regressions()`: regression

Results object methods:

- `results.print_report()`: print classification report
- `results.print_metrics()`: print aggregate metrics
- `results.mAP()`: get mAP value (detection only)
- `results.mAR()`: get mAR value (detection only)
- `results.plot_confusion_matrix()`: interactive confusion matrix
- `results.plot_pr_curves()`: precision-recall curves
- `results.plot_results()`: scatter plot (regression only)
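For intuition about what `results.mAP()` reports: per-class AP is the area under the precision-recall curve built from confidence-ranked predictions, and mAP averages it over classes (COCO additionally averages over IoU thresholds with interpolation). A simplified all-points sketch for one class, with illustrative inputs:

```python
def average_precision(matches, num_gt):
    """All-points AP from confidence-sorted match flags (True = TP) and GT count."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for is_tp in matches:
        tp += is_tp
        fp += not is_tp
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the PR curve
        prev_recall = recall
    return ap

# 3 ground truth objects; predictions ranked by confidence: TP, FP, TP, TP
print(average_precision([True, False, True, True], 3))  # 1/3 + 2/9 + 1/4 ≈ 0.806
```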
## Troubleshooting

### Error: "No suitable label fields"

- The dataset must have label fields of compatible types
- Use `dataset_summary()` to see the available fields and their types

### Error: "No suitable ground truth fields"

- The ground truth field must be the same type as the prediction field
- `Detections` predictions cannot be compared against `Classification` ground truth

### Error: "Evaluation key already exists"

- Each evaluation must have a unique key
- Delete the existing evaluation or use a different key name

### Error: "Plugin not found"

Install and enable the evaluation plugin:

```
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")
```

### mAP is not computed

- Set `"compute_mAP": true` in the params
- mAP requires multiple predictions per image to be meaningful

### Evaluation is slow

- Large datasets take time to evaluate
- Consider evaluating a filtered view first
- Use delegated execution for background processing
## Best Practices

- **Use descriptive eval keys**: for example, `eval_yolov8_coco` or `eval_resnet_topk5`
- **Don't overwrite evaluations**: use a unique key for each evaluation run
- **Compare at the same IoU**: when comparing models, use consistent IoU thresholds
- **Check field types first**: ensure the prediction and ground truth fields are compatible
- **Use the Model Evaluation Panel**: interactive exploration is easier than scripting
- **Examine patches**: use `to_evaluation_patches()` to understand errors