Trend

The trend command analyzes score movement across multiple historical run manifests and reports whether quality is improving, degrading, or stable over time.

Use it when a pairwise compare is too narrow and you want to detect gradual drift across a sequence of runs.

Analyze the last 8 canonical runs in the current workspace:

agentv trend --last 8

This is the primary day-to-day workflow. In most cases, users should start with --last.

Filter to one dataset and target:

agentv trend --last 8 --dataset code-review --target claude-sonnet

Point directly at run workspaces or index.jsonl manifests when you need a specific historical slice or want a reproducible example:

agentv trend \
  .agentv/results/runs/2026-03-01T10-00-00-000Z/ \
  .agentv/results/runs/2026-03-08T10-00-00-000Z/index.jsonl \
  .agentv/results/runs/2026-03-15T10-00-00-000Z/

Concrete regression-gating example:

agentv trend --last 8 --dataset code-review --target claude-sonnet \
  --fail-on-degrading --slope-threshold 0.01

trend accepts only canonical run workspaces or the index.jsonl manifests inside them:

  • .agentv/results/runs/<run-id>/
  • .agentv/results/runs/<run-id>/index.jsonl

Legacy flat results.jsonl files are rejected. The command reads only the lightweight index.jsonl manifests and does not require per-test artifact hydration.
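
The acceptance rule above can be sketched as a small path check. This is a minimal illustration, not agentv's actual implementation: resolve_manifest is a hypothetical helper, and the error message is invented for the example.

```python
from pathlib import PurePosixPath


def resolve_manifest(source: str) -> PurePosixPath:
    """Map a run workspace or manifest path to its index.jsonl manifest.

    Hypothetical helper that mirrors the documented rule only.
    """
    path = PurePosixPath(source)
    if path.name == "results.jsonl":
        # Legacy flat result files are rejected outright.
        raise ValueError(f"legacy flat results file is not supported: {source}")
    if path.name == "index.jsonl":
        return path
    # Anything else is assumed to be a run workspace directory.
    return path / "index.jsonl"


manifest = resolve_manifest(".agentv/results/runs/2026-03-01T10-00-00-000Z/")
```

Both accepted forms resolve to the same manifest path, which is why workspaces and manifests can be mixed freely on the command line.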

| Option | Description |
| --- | --- |
| --last <n> | Use the most recent n runs from .agentv/results/runs/ |
| --dataset <name> | Filter records to one dataset |
| --target <name> | Filter records to one target inside each run |
| --slope-threshold <n> | Minimum absolute slope required to classify improving or degrading (default: 0.01) |
| --fail-on-degrading | Exit non-zero when the detected trend is degrading beyond the threshold |
| --allow-missing-tests | Aggregate each run independently instead of intersecting test IDs across runs |
| --format, -f | Output format: table (default) or json |
| --json | Shorthand for --format=json |

How the command works:

  1. Loads each selected index.jsonl manifest.
  2. Applies dataset and target filters per record.
  3. By default, reduces every run to the intersection of test IDs present in all selected runs.
  4. Computes one mean score per run.
  5. Fits a simple linear regression over run index 0..N-1.
  6. Classifies the slope as improving, degrading, or stable.
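
Steps 4 to 6 can be sketched as follows. This is an illustration under stated assumptions rather than agentv's code: fit_line and classify are hypothetical names, the scores are invented, and treating a slope whose magnitude is below the threshold as stable is an assumed reading of --slope-threshold.

```python
from statistics import mean


def fit_line(ys):
    """Ordinary least squares of ys against run index 0..N-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two runs to fit a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def classify(slope, threshold=0.01):
    """Label a slope; magnitudes below the threshold count as stable (assumed)."""
    if abs(slope) < threshold:
        return "stable"
    return "improving" if slope > 0 else "degrading"


# Hypothetical per-run scores keyed by test ID, already matched across runs.
runs = [
    {"t1": 0.9, "t2": 0.7},
    {"t1": 0.8, "t2": 0.7},
    {"t1": 0.7, "t2": 0.6},
]
run_means = [mean(scores.values()) for scores in runs]  # step 4
slope, intercept = fit_line(run_means)                  # step 5
direction = classify(slope)                             # step 6
```

Because the x-axis is the run index rather than the timestamp, the slope reads as "score change per run", regardless of how far apart the runs were in time.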

Strict matched-test analysis is the default because changing test composition across runs can create false drift signals.
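
To see why test composition matters, consider a sketch where scores on the shared tests never change but a later run adds a harder test. The function names and data here are hypothetical; independent_means only approximates what --allow-missing-tests is described as doing.

```python
from statistics import mean


def matched_means(runs):
    """Mean score per run over test IDs present in every run."""
    shared = set.intersection(*(set(r) for r in runs))
    if not shared:
        raise ValueError("no test IDs are common to all runs")
    return [mean(r[t] for t in shared) for r in runs]


def independent_means(runs):
    """Mean score per run over whatever tests that run contains."""
    return [mean(r.values()) for r in runs]


# Hypothetical data: the second run adds one hard test, nothing else changes.
runs = [
    {"easy-1": 0.9, "easy-2": 0.9},
    {"easy-1": 0.9, "easy-2": 0.9, "hard-1": 0.3},
]
strict = matched_means(runs)
loose = independent_means(runs)
```

The matched means stay flat, while the independent means drop even though no existing test got worse. That drop is the false drift signal the default mode avoids.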

Suppose three historical runs for dataset=code-review and target=claude-sonnet produce matched mean scores of 0.92, 0.86, and 0.80.

  • The slope is negative: the fitted line drops 0.06 per run, well beyond the 0.01 threshold.
  • The command reports direction=degrading.
  • With --fail-on-degrading --slope-threshold 0.01, the command exits with code 1.

This is the intended CI workflow for detecting slow drift that a single pairwise comparison can miss.

Example table output:

Trend Analysis
Runs: 3 | Range: 2026-03-01T10:00:00.000Z → 2026-03-15T10:00:00.000Z
Filters: dataset=code-review target=claude-sonnet mode=matched-tests
Matched Tests: 42 | Verdict: degrading
Run                          Tests Mean Score
---------------------------- ----- ----------
2026-03-01T10:00:00.000Z        42      0.920
2026-03-08T10:00:00.000Z        42      0.905
2026-03-15T10:00:00.000Z        42      0.892
Summary: slope=-0.014 intercept=0.920 r²=0.943
Regression Gate: threshold=0.010 fail_on_degrading=true triggered=true
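
Example JSON output for the same three runs:

{
  "runs": [
    {
      "label": "2026-03-01T10:00:00.000Z",
      "path": "/repo/.agentv/results/runs/2026-03-01T10-00-00-000Z/index.jsonl",
      "timestamp": "2026-03-01T10:00:00.000Z",
      "matched_test_count": 42,
      "mean_score": 0.92
    },
    {
      "label": "2026-03-08T10:00:00.000Z",
      "path": "/repo/.agentv/results/runs/2026-03-08T10-00-00-000Z/index.jsonl",
      "timestamp": "2026-03-08T10:00:00.000Z",
      "matched_test_count": 42,
      "mean_score": 0.905
    },
    {
      "label": "2026-03-15T10:00:00.000Z",
      "path": "/repo/.agentv/results/runs/2026-03-15T10-00-00-000Z/index.jsonl",
      "timestamp": "2026-03-15T10:00:00.000Z",
      "matched_test_count": 42,
      "mean_score": 0.892
    }
  ],
  "filters": {
    "dataset": "code-review",
    "target": "claude-sonnet",
    "allow_missing_tests": false
  },
  "summary": {
    "run_count": 3,
    "matched_test_count": 42,
    "date_range": {
      "start": "2026-03-01T10:00:00.000Z",
      "end": "2026-03-15T10:00:00.000Z"
    },
    "slope": -0.014,
    "intercept": 0.92,
    "r_squared": 0.943,
    "direction": "degrading"
  },
  "regression": {
    "slope_threshold": 0.01,
    "fail_on_degrading": true,
    "triggered": true
  }
}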
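
A script that consumes the JSON report might look like the sketch below. The field names come from the sample output above; should_fail is a hypothetical helper, and the inline sample is a trimmed stand-in for real command output.

```python
import json


def should_fail(report_json: str) -> bool:
    """Return True when the report shows a triggered degrading trend."""
    report = json.loads(report_json)
    summary = report["summary"]
    regression = report["regression"]
    return summary["direction"] == "degrading" and bool(regression["triggered"])


# Trimmed stand-in for the output of `agentv trend --json ...`.
sample = json.dumps({
    "summary": {"slope": -0.014, "direction": "degrading"},
    "regression": {"slope_threshold": 0.01, "fail_on_degrading": True, "triggered": True},
})
gate_failed = should_fail(sample)
```

Reading the report instead of relying on the exit status alone can be useful because exit code 1 covers both invalid input and a triggered gate.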

Exit codes:

| Code | Meaning |
| --- | --- |
| 0 | Informational mode, or no degrading trend triggered |
| 1 | Invalid input, analysis error, or --fail-on-degrading detected a degrading trend |

How trend relates to compare:

  • compare answers: “Did this run beat that run?”
  • trend answers: “Across many runs, are scores drifting up or down?”

Use compare for pairwise regressions. Use trend for longitudinal drift detection.