Eval set format.
How we test a prompt before it ships. The schema. The labels. The reject feedback loop.
An eval set is a YAML file with labelled examples and a small grader configuration. Every prompt has one. The set lives next to the prompt and ships with it.
name: extract_signals_v0_9_x
grader:
  type: structured_match
  fields: [signal_type, date, citation_url]
  tolerance:
    date: ±7d
cases:
  - id: case_001
    inputs:
      company: "Acme Logistics"
      web_audit: { ... }
      social: { ... }
    expected:
      signals:
        - signal_type: "carrier_capacity_squeeze"
          date: "2026-02-14"
          citation_url: "https://acme.com/blog/q1-update"
  - id: case_002
    inputs: { ... }
    expected:
      signals: []  # negative example, must produce nothing
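
To make the grader concrete, here is a minimal sketch of what structured_match grading could look like, assuming ISO-formatted dates and the field list from the config above. The helper names (grade_case, _dates_close, _signals_match) are hypothetical, not a fixed API.

# Minimal sketch of a structured_match grader. Field names and the ±7d
# tolerance mirror the YAML config above; the function names are hypothetical.
from datetime import date, timedelta

DATE_TOLERANCE = timedelta(days=7)  # corresponds to "date: ±7d" in the config

def _dates_close(a: str, b: str) -> bool:
    """ISO dates within the configured tolerance count as a match."""
    return abs(date.fromisoformat(a) - date.fromisoformat(b)) <= DATE_TOLERANCE

def _signals_match(expected: dict, actual: dict, fields: list[str]) -> bool:
    """Compare only the graded fields; 'date' gets the tolerant comparison."""
    for field in fields:
        if field == "date":
            if not _dates_close(expected[field], actual[field]):
                return False
        elif expected[field] != actual[field]:
            return False
    return True

def grade_case(expected_signals: list[dict], actual_signals: list[dict],
               fields: list[str]) -> bool:
    """A case passes when every expected signal has a matching actual signal
    and no extras appear. An empty expected list (a negative case, like
    case_002) therefore passes only on empty output."""
    if len(expected_signals) != len(actual_signals):
        return False
    remaining = list(actual_signals)
    for exp in expected_signals:
        match = next((a for a in remaining if _signals_match(exp, a, fields)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True

Under this reading, a model output whose date is off by up to seven days still passes case_001, while any output at all fails case_002.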
Reject feedback
When a reviewer rejects a draft in human_review(), the rejection reason and the offending output are added to the eval set for the prompt that produced them. The next version of the prompt is graded against the larger set. That is the self-teaching loop.
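
A minimal sketch of that hook, assuming the eval set lives in the YAML file shown above. append_rejection and the record shape are illustrative; the expected labels are left as a placeholder for a reviewer to fill in, so the next prompt version is graded against the failure it just produced.

# Sketch of turning a human_review() rejection into a new eval case.
# The function name and record fields are hypothetical, not a fixed API.
import yaml  # PyYAML

def append_rejection(eval_set_path: str, inputs: dict,
                     rejected_output: dict, reason: str) -> None:
    """Record a rejection as a new labelled case in the prompt's eval set."""
    with open(eval_set_path) as f:
        eval_set = yaml.safe_load(f)
    eval_set["cases"].append({
        "id": f"case_{len(eval_set['cases']) + 1:03d}",
        "note": f"human_review rejection: {reason}",
        "inputs": inputs,
        "rejected_output": rejected_output,  # the offending draft, kept for reference
        "expected": {"signals": []},         # placeholder until a reviewer labels it
    })
    with open(eval_set_path, "w") as f:
        yaml.safe_dump(eval_set, f, sort_keys=False)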