Benchmarks
Get Benchmark Results
GET /v1/orgs/{org_id}/benchmarks/{run_id}/results
Get per-sample results for a benchmark run.
Runtime, CI or Admin token
scope: read
operation_id: benchmarks.getResults
Authentication
Any bearer token belonging to the org can read this endpoint.
SDK install
pip install znyx-sdk
npm install @znyx/sdk
Header parameters
| Name | Type | Required | Description |
|---|---|---|---|
| X-API-Key | string or null | optional | — |
| authorization | string or null | optional | — |
Responses
| Status | Description |
|---|---|
| 200 | Successful Response |
| 422 | Validation Error |
Response schema
total (required): integer
limit (required): integer
offset (required): integer
results (required): array of result rows (see "Response row shape")
Errors & what triggers them
| Code | Trigger | Fix |
|---|---|---|
| 401 | Missing Authorization header, or an expired token. | — |
| 403 | Token does not have org access (wrong org_id, or insufficient role). | — |
| 404 | Resource does not exist in this org. | — |
Notes & examples
When to use this
After a benchmark run completes (status == completed), call this to get the detail that the summary on GET /benchmarks/{id} glosses over. One row per evaluated sample. Typical use:
- Triage regressions: filter for is_correct=false to find where the new policy disagrees with the dataset's expected decision.
- Latency drill-down: sort by latency_ms to find the slow detector hits.
- Generate retraining data: export false-positive rows directly into an annotated dataset via POST /v1/orgs/{org_id}/annotations/export.
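The first two uses above can be sketched client-side in Python. This is a minimal sketch that assumes the rows have already been fetched and parsed into dictionaries. The helper names `failed_rows` and `slowest_rows` are hypothetical, and whether the endpoint also accepts server-side filter or sort parameters is not stated here, so everything happens locally.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def failed_rows(rows: List[Row]) -> List[Row]:
    """Rows where the run disagreed with the dataset's expected decision."""
    # is_correct may be null in the payload, so compare against False
    # explicitly instead of relying on truthiness.
    return [row for row in rows if row.get("is_correct") is False]


def slowest_rows(rows: List[Row], n: int = 10) -> List[Row]:
    """The n rows with the highest latency_ms, slowest first."""
    # A missing or null latency_ms is treated as 0 (an assumption).
    return sorted(rows, key=lambda row: row.get("latency_ms") or 0, reverse=True)[:n]
```

Comparing against `False` with `is` keeps rows with a null `is_correct` out of the regression list, so they can be reviewed separately.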
Response row shape
Each result row carries:
- input_text / context: the sample's original input, for grepping.
- expected_decision / expected_rule_hits: what the dataset says should happen.
- actual_decision / actual_risk_score / actual_rule_hits: what the runtime actually did.
- detector_results: per-detector breakdown, same shape as the Traces page.
- is_correct: boolean, true when the actual decision matched the expected decision.
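For typed client code, the row fields listed above could be modeled roughly as follows. This is a sketch, not the SDK's own type: the field names come from the example response, but the example shows most values as null, so the field types here are assumptions (decisions as strings, scores and latency as numbers, everything else left as `Any`). `BenchmarkResultRow` and `summarize_row` are hypothetical names.

```python
from typing import Any, Optional, TypedDict


class BenchmarkResultRow(TypedDict):
    # Field names follow the example response; types are assumed.
    id: str
    sample_id: str
    input_text: Optional[str]
    context: Any
    expected_decision: Optional[str]
    expected_rule_hits: Any
    actual_decision: Optional[str]
    actual_risk_score: float
    actual_rule_hits: Any
    latency_ms: float
    is_correct: Optional[bool]
    detector_results: Any


def summarize_row(row: BenchmarkResultRow) -> str:
    """One-line, grep-friendly summary of a single result row."""
    verdict = {True: "ok", False: "MISMATCH", None: "unknown"}[row["is_correct"]]
    return (
        f'{row["sample_id"]}: expected={row["expected_decision"]} '
        f'actual={row["actual_decision"]} [{verdict}]'
    )
```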
Pagination
limit caps at 1000. For datasets above that size, paginate with offset and concatenate client-side.
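The offset loop described above might look like the following sketch. It takes a caller-supplied `fetch_page(limit, offset)` function (a hypothetical callable that returns one parsed response body with `total` and `results`), so the loop itself makes no assumption about how the HTTP call is made.

```python
from typing import Any, Callable, Dict, List

Page = Dict[str, Any]


def fetch_all_results(
    fetch_page: Callable[[int, int], Page], limit: int = 1000
) -> List[Dict[str, Any]]:
    """Collect every result row by advancing offset until total is reached."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    rows: List[Dict[str, Any]] = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        batch = page["results"]
        rows.extend(batch)
        offset += len(batch)
        # Stop on an empty page too, so a changing total cannot loop forever.
        if not batch or offset >= page["total"]:
            break
    return rows
```

Advancing by `len(batch)` rather than by `limit` keeps the offsets correct even if the server returns fewer rows than requested.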
Related
- GET /benchmarks/compare?a=…&b=…: diff two runs side-by-side without pulling per-sample rows yourself.
- GET /v1/orgs/{org_id}/benchmarks/{run_id}: run summary + aggregate pass/fail counts.
Request
curl -X GET 'https://api.znyx.ai/v1/orgs/00000000-0000-0000-0000-000000000000/benchmarks/00000000-0000-0000-0000-000000000000/results' \
  -H "Authorization: Bearer $ZNYX_TOKEN"
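A rough Python equivalent of the curl request, using only the standard library rather than the SDK (whose method names are not shown on this page). The function names `build_results_request` and `get_results` are hypothetical. Passing `limit` and `offset` as query-string parameters is an assumption, since the page names them without saying how they are sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "https://api.znyx.ai"


def build_results_request(
    org_id: str,
    run_id: str,
    token: str,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> urllib.request.Request:
    """Build the GET request for a run's per-sample results."""
    path = "/v1/orgs/{}/benchmarks/{}/results".format(
        urllib.parse.quote(org_id, safe=""), urllib.parse.quote(run_id, safe="")
    )
    # Assumption: limit and offset travel as query parameters.
    params = {k: v for k, v in (("limit", limit), ("offset", offset)) if v is not None}
    query = "?" + urllib.parse.urlencode(params) if params else ""
    return urllib.request.Request(
        BASE_URL + path + query,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def get_results(org_id: str, run_id: str, token: str, **paging: int) -> Dict[str, Any]:
    """Send the request and parse the JSON body; raises HTTPError on 4xx/5xx."""
    request = build_results_request(org_id, run_id, token, **paging)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting request construction from sending makes the URL and header logic checkable without network access.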
Response
application/json
Successful Response
{
  "total": 0,
  "limit": 0,
  "offset": 0,
  "results": [
    {
      "id": "string",
      "sample_id": "string",
      "input_text": null,
      "context": null,
      "expected_decision": null,
      "expected_rule_hits": null,
      "actual_decision": null,
      "actual_risk_score": 0,
      "actual_rule_hits": null,
      "latency_ms": 0,
      "is_correct": null,
      "detector_results": null
    }
  ]
}
Schema: object