Every LLM call in Paitho is versioned. You can read the prompt. You can fork it. What you cannot do is lose track of what the model is doing, because we do not allow black boxes in the pipeline.
extract_signals() is Stage 05. It takes the output of Stage 02 (Web Audit) and Stage 03 (Social Research) and extracts a structured list of pain signals, each with a cited source and a confidence score. It is the most important call in the pipeline.
Here is what changed between v0.9 and v1.0, and why.
What v0.9 was doing
v0.9 of extract_signals() used a single-pass extraction prompt. The model received the full web audit and social research output — typically 3,200–6,100 tokens of structured text — and produced a list of signals in one call.
The system prompt in v0.9 looked like this (abbreviated):
# extract_signals_v0.9.yaml
system: |
  You are a B2B research analyst. Given the following company research,
  extract all observable pain signals relevant to [vertical].
  For each signal, provide:
    - signal_type: (from the provided taxonomy)
    - description: (one sentence)
    - evidence: (direct quote or paraphrase from the research)
    - confidence: (high / medium / low)
  Return signals in descending confidence order.
  Taxonomy: {{ vertical_pack.taxonomy | yaml }}
user: |
  Company: {{ company.name }}
  Web audit: {{ web_audit.output }}
  Social research: {{ social_research.output }}
This worked. Reply rates on v0.9 signals ran 11–13% across the beta operators. The failure mode was not quality — it was coverage. The model, given everything at once, consistently prioritized the most obvious signals and missed subtler ones.
We found this by auditing the gap between "signals available in the raw research" and "signals the model extracted." A human analyst, reading the same web audit, would extract 4.2 signals on average. The v0.9 model was extracting 2.1. It was not wrong. It was incomplete.
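A minimal sketch of how that coverage gap can be measured, assuming each lead's extraction is a list of signal dicts with a signal_type field (the field name and helper below are illustrative, not the pipeline's actual schema):

# coverage_gap.py — hedged sketch of the coverage audit (illustrative schema)
from statistics import mean

def coverage_gap(model_runs, analyst_runs):
    """Compare model vs. human extraction across the same leads.

    model_runs / analyst_runs: lists of per-lead signal lists, where each
    signal is a dict with at least a 'signal_type' key.
    """
    model_avg = mean(len(signals) for signals in model_runs)
    analyst_avg = mean(len(signals) for signals in analyst_runs)

    # Per-lead recall: how many of the analyst's signal types the model also found.
    recalls = []
    for model_signals, analyst_signals in zip(model_runs, analyst_runs):
        found = {s["signal_type"] for s in model_signals}
        expected = {s["signal_type"] for s in analyst_signals}
        if expected:
            recalls.append(len(found & expected) / len(expected))

    return {
        "model_avg_signals": model_avg,      # v0.9 measured at 2.1
        "analyst_avg_signals": analyst_avg,  # human baseline measured at 4.2
        "mean_recall": mean(recalls) if recalls else None,
    }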
The diagnostic
To understand why, we ran a comparison study. We took 40 web audit outputs from the beta corpus — leads where we had both the AI signal output and a human analyst's independent signal extraction — and compared them.
The gap was not random. The v0.9 model consistently missed:
- Signals that required cross-referencing two separate sections of the research. A hiring post in the web audit combined with a social post from the CTO mentioning a specific pain — the model did not naturally join these. It processed them as separate facts.
- Signals with weak individual evidence but strong corroboration across sources. If three data points each had "low" individual confidence, the model marked the signal as low confidence. A human analyst who saw three low-confidence signals pointing at the same conclusion would synthesize them into a medium-confidence signal.
- Signals outside the top 5 in the taxonomy ranking. The prompt sorted signals by confidence, and the model concentrated its effort on the signals it placed at the top of the output. Signals 6 through 10 were systematically underexplored.
What v1.0 changed
Three changes, each addressing one failure mode.
Change 1: Two-pass extraction with explicit cross-referencing
v1.0 splits the extraction into two calls.
Pass one extracts signals from each source independently — web audit signals, social research signals — without asking the model to combine them.
Pass two receives the two independent signal lists and asks the model to perform two tasks: identify cross-source corroborations, and produce a unified signal list ranked by combined evidence strength.
The cost of two calls versus one, at Stage 05 complexity: approximately 35–45% more tokens. The output improvement: average signals extracted per lead went from 2.1 to 3.7. The extra cost is worth it.
# extract_signals_v1.0.yaml — pass one (abbreviated)
system: |
  Extract observable pain signals from the following source.
  Do not attempt to combine with other sources — that comes later.
  For each signal, you must provide a direct citation to the source text.
  If you cannot cite it, do not include it.
  Source type: {{ source_type }}
  Taxonomy: {{ vertical_pack.taxonomy | yaml }}
user: |
  {{ source_output }}
# extract_signals_v1.0.yaml — pass two (abbreviated)
system: |
  You have received independent signal extractions from two sources.
  Your tasks:
    1. Identify signals that appear in both sources (corroborations).
       Upgrade corroborated signals' confidence by one level.
    2. Identify signals that appear in one source and are contradicted
       in the other. Flag as conflicted; do not include in final list.
    3. Produce a unified signal list ranked by: corroborated first,
       then single-source by confidence.
  Required output fields:
    signal_id, signal_type, description, evidence_web,
    evidence_social (null if single-source), confidence,
    corroborated (bool), source_urls (list)
user: |
  Web audit signals: {{ pass_one_web.signals | json }}
  Social signals: {{ pass_one_social.signals | json }}
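Wiring the two passes together is straightforward. Below is a hedged orchestration sketch, with render_prompt() and call_model() as hypothetical stand-ins for the pipeline's template and provider layers; they are not Paitho's actual API.

# two_pass_extraction.py — hedged sketch; helpers are injected, hypothetical names
def extract_signals(web_audit_output, social_research_output, render_prompt, call_model):
    # Pass one: extract from each source independently. Citations are required
    # by the prompt, so uncited signals never reach pass two.
    web_signals = call_model(render_prompt(
        "extract_signals_v1.0_pass_one",
        source_type="web_audit", source_output=web_audit_output))
    social_signals = call_model(render_prompt(
        "extract_signals_v1.0_pass_one",
        source_type="social_research", source_output=social_research_output))

    # Pass two: cross-reference the two lists, flag conflicts,
    # and produce the unified ranked signal list.
    return call_model(render_prompt(
        "extract_signals_v1.0_pass_two",
        pass_one_web=web_signals, pass_one_social=social_signals))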
Change 2: Confidence synthesis rule
In v0.9, confidence was assigned by the model on a per-signal basis using natural language instructions. "High" meant the model was confident. This produced inconsistent calibration.
v1.0 replaces model-assigned confidence with a rule-based calculation:
def confidence_score(signal):
    score = 0
    # Corroboration across both sources is the strongest evidence.
    if signal.corroborated:
        score += 2
    if signal.evidence_web is not None:
        score += 1
    if signal.evidence_social is not None:
        score += 1
    # Recency: fresh signals gain a point, stale ones lose one.
    if signal.source_date_days_ago < 30:
        score += 1
    elif signal.source_date_days_ago < 60:
        score += 0
    else:
        score -= 1
    if score >= 4:
        return "high"
    elif score >= 2:
        return "medium"
    else:
        return "low"
The model no longer assigns confidence. It provides the inputs; the function calculates the output. Confidence calibration is now consistent and auditable.
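For illustration, here is a minimal signal container and a call to the function above; the field names mirror the required output fields from pass two, and the example values are made up:

# usage sketch — illustrative values only
from dataclasses import dataclass
from typing import Optional

@dataclass
class Signal:
    corroborated: bool
    evidence_web: Optional[str]
    evidence_social: Optional[str]
    source_date_days_ago: int

# Corroborated, evidence from both sources, 12 days old:
# score = 2 + 1 + 1 + 1 = 5, which maps to "high".
s = Signal(corroborated=True,
           evidence_web="Hiring post: 'Senior data engineer, ETL rewrite'",
           evidence_social="CTO post about pipeline reliability",
           source_date_days_ago=12)
print(confidence_score(s))  # "high"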
Change 3: Taxonomy coverage enforcement
v0.9 allowed the model to stop after finding the most obvious signals. v1.0 adds a taxonomy coverage check at the end of pass two:
# appended to pass two system prompt
After producing the unified signal list, check: are any of the
following high-priority signal types absent from your output?
{{ vertical_pack.priority_signals | yaml }}
If absent, briefly note whether you looked for evidence and
found none, or whether you did not examine that signal type.
This is required. Do not skip it.
This forces the model to confirm it checked for priority signals, even when it did not find them. The output includes a coverage report alongside the signal list. Operators can see not just what signals were found but what was checked and not found.
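Downstream, that coverage report can be checked mechanically. A sketch follows, assuming the report is parsed into dicts with signal_type and checked fields; this shape is an assumption, not the pipeline's actual output format.

# coverage_check.py — hedged sketch, assumed report schema
def missing_priority_coverage(coverage_report, priority_signals):
    """Return priority signal types the model never reported checking.

    coverage_report: list of dicts like {"signal_type": str, "checked": bool, "found": bool}
    priority_signals: list of signal type names from the vertical pack.
    """
    reported = {entry["signal_type"] for entry in coverage_report if entry.get("checked")}
    return [sig for sig in priority_signals if sig not in reported]

# If anything comes back, the run can be flagged for re-extraction or operator review.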
The diff
Here is the diff of the prompt file header for the v0.9 → v1.0 transition:
# prompts/extract_signals.yaml
- version: 0.9
+ version: 1.0
- extraction_passes: 1
+ extraction_passes: 2
- confidence_assignment: model
+ confidence_assignment: rule_based (see confidence.py)
- taxonomy_coverage_check: false
+ taxonomy_coverage_check: true
- avg_signals_per_lead: 2.1
+ avg_signals_per_lead: 3.7
- avg_tokens_per_run: 4200
+ avg_tokens_per_run: 5900
The token cost increase is real. It shows up in BYOK billing. At the split-model configuration most operators use for Stage 05 (Haiku or Gemini Flash), the additional 1,700 tokens adds approximately $0.001 per lead. At 1,000 leads, that is $1. The reply-rate improvement paid for it in the first week.
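The arithmetic behind that figure, assuming a blended Haiku-class rate of roughly $0.60 per million tokens (actual rates vary by provider and by the input/output split):

# cost_sketch.py — back-of-envelope with an assumed blended rate
extra_tokens_per_lead = 5900 - 4200          # +1,700 tokens from the diff above
blended_rate_per_million = 0.60              # assumption, USD per million tokens

extra_cost_per_lead = extra_tokens_per_lead * blended_rate_per_million / 1_000_000
print(f"{extra_cost_per_lead:.4f}")          # ~0.0010, about $0.001 per lead
print(f"{extra_cost_per_lead * 1000:.2f}")   # ~1.02, about $1 per 1,000 leads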
How to fork this prompt
If you are on a paid plan, the prompt files are readable in your account settings under Prompt Versions. The extract_signals_v1.0.yaml file is there along with every prior version.
To fork: copy the file, modify it, and save it as a custom version under your workspace. Your custom version will apply to all future leads in your pipeline. The original v1.0 is preserved and unmodified.
Things operators have done with forks in beta:
One operator running the Devtools pack added a custom signal type — "OSS community health" — that checks for declining GitHub star velocity and contributor dropout as a proxy for platform risk. It is not in the default taxonomy. It was relevant to their specific ICP. They forked the prompt, added the signal type definition, and the extraction model picked it up in the next run.
One operator in fintech SaaS restricted the taxonomy to 8 of the 28 signals — the ones most relevant to their specific sub-vertical (B2B payments infrastructure). Fewer signals, higher precision. Their rejection rate at Stage 09 dropped 11 percentage points after the fork.
One operator forked the confidence function to add a recency weight for their vertical, where signals older than 45 days were less relevant than our default 60-day threshold. Three lines of Python. The result was measurable in their pipeline within two weeks.
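A hypothetical sketch of that kind of fork, tightening the recency threshold in the default confidence_score() shown above from 60 days to 45 (illustrative, not the operator's actual code):

# forked_confidence.py — hypothetical fork of the default function
def confidence_score_forked(signal):
    score = 0
    if signal.corroborated:
        score += 2
    if signal.evidence_web is not None:
        score += 1
    if signal.evidence_social is not None:
        score += 1
    if signal.source_date_days_ago < 30:
        score += 1
    elif signal.source_date_days_ago < 45:   # forked line: default threshold is 60 days
        score += 0
    else:
        score -= 1
    if score >= 4:
        return "high"
    elif score >= 2:
        return "medium"
    else:
        return "low"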
Receipts
All figures: Paitho internal data. Prompt performance measured across production pipeline, Q1 2026. Illustrative.
- v0.9 average signals extracted per lead: 2.1
- v1.0 average signals extracted per lead: 3.7
- v1.0 token cost increase vs. v0.9: +40% at Stage 05
- v1.0 incremental cost per lead at Haiku pricing: ~$0.001
- v1.0 reply rate improvement vs. v0.9 across all angles: +2.6 percentage points
- Operators who have forked extract_signals() in beta: 4
- Forks that produced measurable reply-rate improvement: 3 of 4
- Lines of diff between v0.9 and v1.0 YAML: 47
Closing
Principle 6 ("Transparent prompts. Versioned angles.") exists because the prompt is the product logic. If you cannot read it, you cannot trust it. If you cannot fork it, you cannot improve it for your specific situation.
v1.0 is better than v0.9 on the metrics that matter. v1.1 will be better than v1.0 for the same reason: the eval set grows, the failure modes clarify, and the prompt changes to address them. That process is visible to every operator who wants to look.
You should look. The prompt is yours to read and yours to fork. We shipped it that way on purpose.
Related:
- Reply Rates by Angle: 14 Months of Versioned Prompt Data
- BYOK Economics: What 1,000 Leads Actually Cost Across 6 Model Providers
- Prompt Versioning — Docs
— Alex Kim, Engineering
Principle 6 — Transparent prompts. Versioned angles.