Eval set format.
How we test a prompt before it ships. The schema. The labels. The reject feedback loop.
An eval set is a YAML file with labelled examples and a small grader configuration. Every prompt has one. The set lives next to the prompt and ships with it.
name: extract_signals_v0_9_x
grader:
  type: structured_match
  fields: [signal_type, date, citation_url]
  tolerance:
    date: ±7d
cases:
  - id: case_001
    inputs:
      company: "Acme Logistics"
      web_audit: { ... }
      social: { ... }
    expected:
      signals:
        - signal_type: "carrier_capacity_squeeze"
          date: "2026-02-14"
          citation_url: "https://acme.com/blog/q1-update"
  - id: case_002
    inputs: { ... }
    expected:
      signals: []  # negative example, must produce nothing
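
To make the grader concrete, here is a minimal sketch of what structured_match grading could look like, assuming ISO-formatted dates and the field list from the config above. The helper names (grade_case, _dates_close, _signals_match) are hypothetical, not a fixed API.

# Minimal sketch of a structured_match grader. Field names and the ±7d
# tolerance mirror the YAML config above; the function names are hypothetical.
from datetime import date, timedelta

DATE_TOLERANCE = timedelta(days=7)  # corresponds to "date: ±7d" in the config

def _dates_close(a: str, b: str) -> bool:
    """ISO dates within the configured tolerance count as a match."""
    return abs(date.fromisoformat(a) - date.fromisoformat(b)) <= DATE_TOLERANCE

def _signals_match(expected: dict, actual: dict, fields: list[str]) -> bool:
    """Compare only the graded fields; 'date' gets the tolerant comparison."""
    for field in fields:
        if field == "date":
            if not _dates_close(expected[field], actual[field]):
                return False
        elif expected[field] != actual[field]:
            return False
    return True

def grade_case(expected_signals: list[dict], actual_signals: list[dict],
               fields: list[str]) -> bool:
    """A case passes when every expected signal has a matching actual signal
    and no extras appear. An empty expected list (a negative case, like
    case_002) therefore passes only on empty output."""
    if len(expected_signals) != len(actual_signals):
        return False
    remaining = list(actual_signals)
    for exp in expected_signals:
        match = next((a for a in remaining if _signals_match(exp, a, fields)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True

Under this reading, a model output whose date is off by up to seven days still passes case_001, while any output at all fails case_002.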
Reject feedback
When a reviewer rejects a draft in human_review(), the rejection reason and the offending output are added to the eval set for the prompt that produced them. The next version of the prompt is graded against the larger set. That is the self-teaching loop.
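
A minimal sketch of that hook, assuming the eval set lives in the YAML file shown above. append_rejection and the record shape are illustrative; the expected labels are left as a placeholder for a reviewer to fill in, so the next prompt version is graded against the failure it just produced.

# Sketch of turning a human_review() rejection into a new eval case.
# The function name and record fields are hypothetical, not a fixed API.
import yaml  # PyYAML

def append_rejection(eval_set_path: str, inputs: dict,
                     rejected_output: dict, reason: str) -> None:
    """Record a rejection as a new labelled case in the prompt's eval set."""
    with open(eval_set_path) as f:
        eval_set = yaml.safe_load(f)
    eval_set["cases"].append({
        "id": f"case_{len(eval_set['cases']) + 1:03d}",
        "note": f"human_review rejection: {reason}",
        "inputs": inputs,
        "rejected_output": rejected_output,  # the offending draft, kept for reference
        "expected": {"signals": []},         # placeholder until a reviewer labels it
    })
    with open(eval_set_path, "w") as f:
        yaml.safe_dump(eval_set, f, sort_keys=False)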