Benchmarks

GET /v1/orgs/{org_id}/benchmarks/{run_id}/results

Get Benchmark Results

Get per-sample results for a benchmark run.

Runtime, CI or Admin token
scope: read
operation_id: benchmarks.getResults

Authentication

Any bearer token belonging to the org can read this endpoint.

SDK install

pip install znyx-sdk
npm install @znyx/sdk

Path parameters

Name     Type     Required   Description
org_id   string   required
run_id   string   required

Query parameters

Name     Type      Required   Description
limit    integer   optional
offset   integer   optional

Header parameters

Name            Type            Required   Description
X-API-Key       string | null   optional
authorization   string | null   optional

Responses

Status   Description
200      Successful Response
422      Validation Error

Response schema

total     required   integer
limit     required   integer
offset    required   integer
results   required

Errors & what triggers them

Code   Trigger                                                                Fix
401    Missing or expired Authorization header.
403    Token does not have org access (wrong org_id, or insufficient role).
404    Resource does not exist in this org.
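
To make the table above actionable in client code, a status-to-hint lookup is one option. This is a minimal sketch: BenchmarkResultsError and raise_for_status are hypothetical names, not part of the znyx SDK, and the hint strings are copied from the tables on this page (422 comes from the Responses table).

```python
# Hint text copied from the error and response tables in this reference.
ERROR_HINTS = {
    401: "Missing or expired Authorization header.",
    403: "Token does not have org access (wrong org_id, or insufficient role).",
    404: "Resource does not exist in this org.",
    422: "Validation Error",
}


class BenchmarkResultsError(Exception):
    """Hypothetical exception type for this sketch; not provided by the SDK."""

    def __init__(self, status: int, hint: str) -> None:
        super().__init__(f"HTTP {status}: {hint}")
        self.status = status
        self.hint = hint


def raise_for_status(status: int) -> None:
    """Return silently on 2xx; otherwise raise with the documented hint."""
    if 200 <= status < 300:
        return
    hint = ERROR_HINTS.get(status, "Unexpected status.")
    raise BenchmarkResultsError(status, hint)
```

Calling raise_for_status(resp.status) right after the HTTP call keeps the documented trigger text next to the failure in logs.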

Notes & examples

When to use this

After a benchmark run completes (status == completed), call this endpoint to get the per-sample detail that the summary from GET /v1/orgs/{org_id}/benchmarks/{run_id} leaves out. It returns one row per evaluated sample. Typical uses:

  • Triage regressions — filter for is_correct=false to find where the new policy disagrees with the dataset's expected decision.
  • Latency drill-down — sort by latency_ms to find the slow detector hits.
  • Generate retraining data — export false-positive rows directly into an annotated dataset via POST /v1/orgs/{org_id}/annotations/export.
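
The first two uses can be sketched as client-side helpers over rows that have already been fetched. This assumes filtering and sorting happen in the client, since limit and offset are the only query parameters documented here. The helper names and the sample rows are made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def failed_rows(rows: List[Row]) -> List[Row]:
    """Rows where is_correct is explicitly False (null values are skipped)."""
    return [r for r in rows if r.get("is_correct") is False]


def slowest(rows: List[Row], n: int = 10) -> List[Row]:
    """The n rows with the highest latency_ms; a missing latency counts as 0."""
    return sorted(rows, key=lambda r: r.get("latency_ms") or 0, reverse=True)[:n]


# Illustrative data only, not real API output.
sample_rows = [
    {"sample_id": "s1", "is_correct": True, "latency_ms": 12},
    {"sample_id": "s2", "is_correct": False, "latency_ms": 340},
    {"sample_id": "s3", "is_correct": None, "latency_ms": 95},
]

regressions = failed_rows(sample_rows)
slow = slowest(sample_rows, n=2)
```

Testing with `is False` rather than truthiness keeps rows whose is_correct is null out of the regression list, which matters because the example response shows that field as nullable.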

Response row shape

Each result row carries:

  • input_text / context — the sample's original input, for grepping.
  • expected_decision / expected_rule_hits — what the dataset says should happen.
  • actual_decision / actual_risk_score / actual_rule_hits — what the runtime actually did.
  • detector_results — per-detector breakdown, same shape as the Traces page.
  • is_correct — boolean: matched the expected decision.

Pagination

limit is capped at 1000. For runs with more samples than that, page through the results with offset and concatenate the pages client-side.
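
The offset loop can be sketched as follows. It assumes that total in the response is the number of rows in the whole run, which the field name suggests but this page does not state. fetch_page is a hypothetical callable that you supply: it performs one GET with the given limit and offset and returns the parsed JSON body.

```python
from typing import Any, Callable, Dict, List

MAX_LIMIT = 1000  # cap stated in this reference

Page = Dict[str, Any]


def fetch_all_results(
    fetch_page: Callable[[int, int], Page],
    limit: int = MAX_LIMIT,
) -> List[Dict[str, Any]]:
    """Collect every result row by advancing offset until total is reached."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    rows: List[Dict[str, Any]] = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        batch = page["results"]
        rows.extend(batch)
        offset += len(batch)
        # Stop once everything is read; the empty-batch check prevents an
        # endless loop if total is larger than what the server returns.
        if not batch or offset >= page["total"]:
            return rows
```

Passing the fetch function in keeps the loop independent of the HTTP client, so it can be exercised with a stub before pointing it at the API.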

Related endpoints

  • GET /benchmarks/compare?a=…&b=… — diff two runs side-by-side without pulling per-sample rows yourself.
  • GET /v1/orgs/{org_id}/benchmarks/{run_id} — run summary + aggregate pass/fail counts.

Request

curl -X GET 'https://api.znyx.ai/v1/orgs/00000000-0000-0000-0000-000000000000/benchmarks/00000000-0000-0000-0000-000000000000/results' \
  -H "Authorization: Bearer $ZNYX_TOKEN"
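
The same request in Python, as a minimal sketch that uses only the standard library instead of the znyx SDK, whose client API is not shown on this page. build_results_request and get_results are hypothetical helper names. The base URL and the Authorization header are taken from the curl example.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "https://api.znyx.ai"  # host taken from the curl example


def build_results_request(
    org_id: str,
    run_id: str,
    token: str,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> urllib.request.Request:
    """Build the GET request without sending it."""
    path = "/v1/orgs/{}/benchmarks/{}/results".format(
        urllib.parse.quote(org_id, safe=""),
        urllib.parse.quote(run_id, safe=""),
    )
    # Only send the query parameters the caller actually set.
    params = {k: v for k, v in (("limit", limit), ("offset", offset)) if v is not None}
    query = "?" + urllib.parse.urlencode(params) if params else ""
    return urllib.request.Request(
        BASE_URL + path + query,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def get_results(org_id: str, run_id: str, token: str, **paging: int) -> Dict[str, Any]:
    """Send the request and return the parsed JSON body."""
    req = build_results_request(org_id, run_id, token, **paging)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A call such as get_results(org_id, run_id, os.environ["ZNYX_TOKEN"], limit=100) would read the token from the same environment variable as the curl example. urlopen raises urllib.error.HTTPError for non-2xx responses.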

Response

application/json

Successful Response

{
  "total": 0,
  "limit": 0,
  "offset": 0,
  "results": [
    {
      "id": "string",
      "sample_id": "string",
      "input_text": null,
      "context": null,
      "expected_decision": null,
      "expected_rule_hits": null,
      "actual_decision": null,
      "actual_risk_score": 0,
      "actual_rule_hits": null,
      "latency_ms": 0,
      "is_correct": null,
      "detector_results": null
    }
  ]
}

Schema: object