Methodology Calibration — long-run audit of AlphaSteve's valuation outputs

Per-thesis calibration tracks whether the specific call was right (see _calibration-template). This file tracks something different: whether the methodology itself is systematically biased. A single thesis can be wrong for honest reasons; if every thesis in a category misses in the same direction by a similar magnitude, the kit has a structural bug — and the bug propagates into every future thesis until found.

This is the calibration-loop discipline applied to the kit's own machinery <span class="tier-t2" title="Marks, *The Most Important Thing* (Columbia, 2011), ch. 17 on tracking one's record honestly">T2</span>.

What we track

For every thesis at publication and at each calibration checkpoint:

Field	Meaning
Ticker	The name
Verdict	buy / pass-with-trigger / pass / avoid
Verdict date	When the central value was set
AlphaSteve central value	Our number at the time
Price at verdict	Market price the day the thesis published
Consensus mean PT (at verdict)	Sell-side mean from the consensus map
Sell-side range (low–high)	Distribution width
Independent intrinsic (Simply Wall St / GuruFocus / Morningstar avg)	Mean of non-sell-side intrinsic values
Gap to consensus mean	(AlphaSteve − consensus mean) / consensus mean
Gap direction	below / above / aligned
Gap classification	structural (methodology) / specific (evidence-driven) / mixed
Subsequent price (1y / 3y / 5y)	Filled in at checkpoints
Verdict-correctness	Was our call right based on subsequent price action?
Methodology component most responsible for the gap	growth credit / WACC / terminal / multiple / methodology choice itself
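As a minimal sketch of how the gap fields above could be derived from the recorded values: the `ThesisRecord` class, its field names, and the ±5% "aligned" band are all assumptions for illustration. The source does not say how wide "aligned" is.

```python
from dataclasses import dataclass

# Assumption: gaps within +/-5% of consensus count as "aligned".
# The source does not define this band; adjust as needed.
ALIGNED_BAND = 0.05


@dataclass
class ThesisRecord:
    ticker: str
    verdict: str              # buy / pass-with-trigger / pass / avoid
    central_value: float      # AlphaSteve central value at verdict
    price_at_verdict: float
    consensus_mean_pt: float  # sell-side mean price target at verdict

    @property
    def gap_to_consensus(self) -> float:
        """(AlphaSteve - consensus mean) / consensus mean."""
        return (self.central_value - self.consensus_mean_pt) / self.consensus_mean_pt

    @property
    def gap_direction(self) -> str:
        gap = self.gap_to_consensus
        if abs(gap) <= ALIGNED_BAND:
            return "aligned"
        return "below" if gap < 0 else "above"


# Hypothetical ticker and numbers, for illustration only.
record = ThesisRecord("XYZ", "pass-with-trigger", 40.0, 70.0, 100.0)
print(f"{record.gap_to_consensus:+.0%} {record.gap_direction}")  # -60% below
```

Gap classification (structural / specific / mixed) is a judgment call and is deliberately left out of the sketch.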

What counts as a methodology bias

A single thesis below consensus by 50%+ is not evidence of bias. It is evidence of an idiosyncratic call that may or may not be right.

Five consecutive theses below consensus by 40%+, all on growth/quality names, with consensus directionally right on 4 of 5 over 12-24 months, is evidence of bias: most likely that EPV-only framing systematically under-credits growth value on businesses where Greenwald-style growth value is real.

Ten theses across mixed business types showing a median gap of 50%+ is evidence of either a bias or a systematic discipline (deep value is supposed to be conservative). The test: when AlphaSteve is on the bear side and prices subsequently appreciate, were we wrong on input assumptions specifically (e.g., we modeled 10% growth, actual was 25%), or were we right on inputs but wrong on multiple (we assumed multiple compression, market expanded multiple)? The former is a bias to investigate; the latter is the deep-value discipline functioning as designed (and may still produce alpha over a full cycle when multiples eventually mean-revert).
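The inputs-versus-multiple test above could be sketched roughly as follows. The function name, both tolerance values, and the return labels are hypothetical; the source gives only the 10%-modeled versus 25%-actual growth example and no numeric thresholds.

```python
def classify_bear_miss(modeled_growth, actual_growth,
                       assumed_multiple, realized_multiple,
                       growth_tol=0.05, multiple_tol=0.15):
    """Label why a bear-side call missed after the price appreciated.

    growth_tol is in absolute points of growth; multiple_tol is relative
    to the assumed multiple. Both defaults are assumptions, not kit values.
    """
    inputs_off = (actual_growth - modeled_growth) > growth_tol
    multiple_off = (realized_multiple - assumed_multiple) / assumed_multiple > multiple_tol
    if inputs_off and multiple_off:
        return "mixed"
    if inputs_off:
        return "input bias: investigate"
    if multiple_off:
        return "multiple: discipline working as designed"
    return "unexplained"


# The source's example: modeled 10% growth, actual was 25%.
print(classify_bear_miss(0.10, 0.25, 20, 21))  # input bias: investigate
```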

The two failure modes to watch for

Failure mode 1 — systematic over-conservatism (the deep-value trap)

The kit produces central values that are systematically below intrinsic. Over multi-year holding periods, our pass-with-trigger names appreciate without us, our triggers never fire, capital sits in cash, and we under-perform.

Tells:

Median gap of AlphaSteve central value to consensus mean below -50% across 10+ theses
Re-engagement triggers fire on fewer than 20% of pass-with-trigger names within 24 months
When triggers do fire, subsequent appreciation is modest (kit is buying late, market already revalued)
Theses on durable-growth-quality names (compounders) consistently undervalue them, while theses on cigar-butt/asset-rich names track consensus

If detected: examine WACC inputs (are we using equity risk premium and beta inputs that are too conservative for the relevant business risk class?), terminal-multiple inputs (are we using historical-median multiples in a regime where structurally higher multiples are warranted?), growth-value treatment (are we leaving Greenwald's growth value at zero when it should be positive given moat verification?).

Failure mode 2 — narrative leakage (the consensus-anchoring trap)

The opposite. Our central values drift toward consensus over time because the agent absorbs sell-side narratives via its research notes and starts crediting growth and AI-optionality assumptions the kit was designed to discount.

Tells:

Median gap of AlphaSteve central value to consensus mean above -20% across 10+ theses (gap shrinking toward, or through, consensus)
"Buy" verdicts increasingly on names that already screen as quality/growth in mainstream filters
Reverse-DCF outputs increasingly find "embedded growth looks reasonable" even on richly-priced names
Thesis prose increasingly cites sell-side estimates as evidence rather than as consensus to be analyzed

If detected: the agent has been epistemically captured by the consensus information set. The fix is to re-anchor on the soul files, re-emphasize source-tier discipline (sources-policy.md), and audit recent theses for cited assumptions that came from T3 aggregator data rather than T1 filings.
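The quantitative tells from both failure modes (median gap below -50%, median gap above -20%, triggers firing on fewer than 20% of names) can be checked mechanically. This is a sketch under assumptions: the function and argument names are hypothetical, and the caller is assumed to have already limited the trigger counts to the 24-month window. The qualitative tells still need a human read.

```python
from statistics import median

MIN_THESES = 10  # both median-gap tells require 10+ theses


def failure_mode_flags(gaps, triggers_fired=None, pass_with_trigger_count=None):
    """gaps: one (AlphaSteve - consensus) / consensus value per thesis."""
    flags = []
    if len(gaps) < MIN_THESES:
        return flags  # too few theses to read either tell
    med = median(gaps)
    if med < -0.50:
        flags.append("mode 1: median gap below -50%")
    elif med > -0.20:
        flags.append("mode 2: median gap above -20%")
    if triggers_fired is not None and pass_with_trigger_count:
        # Caller supplies counts already restricted to a 24-month window.
        if triggers_fired / pass_with_trigger_count < 0.20:
            flags.append("mode 1: triggers fired on under 20% of names")
    return flags
```

A median gap between -50% and -20% raises neither flag, which matches the band the two tells leave open.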

Calibration cadence

Frequency	What happens
Per thesis (at publication)	Consensus map and gap recorded. Gap classification assigned.
Per thesis (at 12-month checkpoint)	Subsequent price recorded. Gap re-measured if price moved materially.
Per thesis (at 24-month checkpoint)	Verdict-correctness assessed: was our central value within ±25% of actual price by month 24? If not, classify why.
Quarterly	Aggregate scorecard: median gap across all active theses, distribution of gap directions, count of triggers fired vs total pass-with-trigger names.
Annually	Methodology audit: do we see Failure mode 1 or 2 patterns? Are there specific business-type buckets (e.g., high-growth software) where the kit is systematically off? If yes, queue a kit-debrief and consider parameter revisions.

Calibration events log

Material changes to the kit's methodology are logged here with date, trigger, change, and expected impact.

2026-05-24 — Doctrine recalibration to Greenwald-modified deep value

Trigger: First cross-thesis analysis (PLTR-consensus-gap) surfaced that the kit's EPV-only default and 40-50% MoS requirement were producing central values 60-70% below intrinsic-value sources on a verified-quality compounder (PLTR). The structural-vs-specific gap decomposition showed ~70% of the gap was kit-side methodology, not evidence-driven.

Change:

Doctrine explicitly named as Greenwald-modified deep value (was: pure Klarman/Graham on the conservative end). See 02-philosophy-deep-value new section "Doctrine calibration."
Default valuation: EPV-plus-growth when all three Greenwald gating tests pass (ROIIC > WACC, durable moat, multi-year runway). EPV-only reserved for cases where one or more tests fail. See earnings-power-value-greenwald updated decision tree.
MoS bands narrowed for verified compounders: ~25-30% (was: 50%+). Standard quality ~40%; cyclical/contested ~50%; turnaround/binary 60-70% or pass. See margin-of-safety-pricing.
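The new default-valuation rule is a three-way gate, which can be sketched as below. The function name and the boolean inputs are assumptions; how moat durability and runway are actually verified lives in the EPV skill file, not here.

```python
def default_valuation(roiic, wacc, durable_moat, multi_year_runway):
    """Return (method, failed_tests) under the three Greenwald gating tests."""
    tests = {
        "ROIIC > WACC": roiic > wacc,
        "durable moat": durable_moat,
        "multi-year runway": multi_year_runway,
    }
    failed = [name for name, passed in tests.items() if not passed]
    # EPV-plus-growth only when all three pass; otherwise fall back to EPV-only.
    method = "EPV-plus-growth" if not failed else "EPV-only"
    return method, failed


print(default_valuation(0.18, 0.09, True, True))   # ('EPV-plus-growth', [])
print(default_valuation(0.07, 0.09, True, True))   # ('EPV-only', ['ROIIC > WACC'])
```

Returning the failed tests alongside the method keeps the reason for an EPV-only fallback visible in the thesis record.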

Expected impact:

Central values on verified-quality compounders will lift materially (PLTR: $52 → $85, +63%)
Re-engagement triggers lift proportionally (PLTR: $29 → $60, +107% — i.e., the price needed for a buy is now realistic rather than catastrophe-required)
Median gap to consensus mean across future theses expected to narrow from 60-70% toward 30-40%
Verdict distribution likely to shift: more "buy" verdicts on quality compounders during normal cyclical drawdowns; fewer "pass-with-trigger" verdicts that never fire
Risk: narrative leakage (Failure mode 2 above). The discipline against crediting unobservable growth value gets harder to hold as the kit credits more growth value in general. Source-tier discipline (sources-policy) and the three gating tests are the guardrails.
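The PLTR figures above can be checked with a few lines of arithmetic. The implied margin of safety assumes the re-engagement trigger is simply the central value less the MoS discount; the source does not state that relationship outright, so treat the last two lines as a consistency check, not a definition.

```python
def pct_change(old, new):
    return new / old - 1


def implied_mos(central_value, trigger_price):
    """Discount of the trigger price to the central value (assumed relationship)."""
    return 1 - trigger_price / central_value


print(f"central lift: {pct_change(52, 85):+.0%}")   # +63%
print(f"trigger lift: {pct_change(29, 60):+.0%}")   # +107%
print(f"MoS before:   {implied_mos(52, 29):.0%}")   # 44%
print(f"MoS after:    {implied_mos(85, 60):.0%}")   # 29%
```

Under that assumption, the old trigger sits inside the prior 40-50% MoS requirement and the new one inside the ~25-30% band for verified compounders.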

Counter-test: Applied retroactively to PLTR — central value lifts to $85, trigger to $60. Subsequent price action will calibrate whether this revision improves or hurts the kit's record. First checkpoint: PLTR Q2 FY26 print (August 2026); 12-month checkpoint May 2027.

Documentation: Updates to 4 soul/skill files (philosophy, EPV, MoS, the PLTR thesis itself). Two closed backlog items (Closed/2026-05-24-epv-only-default-fix.md, Closed/2026-05-24-pltr-refresh.md).

2026-05-25 — Autonomy doctrine update

Trigger: User observation that the existing pipeline gated thesis-building on user pre-approval — first-read decisions required user action via the Backlog before any thesis work began. This made the user the bottleneck on which ideas got developed and caused the funnel to stall when the Backlog wasn't actively managed.

Change:

All four first-read decisions (continue, pass, shelve-with-trigger, avoid) reclassified to Tier 1 (autonomous). See Rules updated tier table.
New alphasteve-thesis-builder scheduled task at 11:00 AM weekdays — picks oldest first-read continue decision from the Backlog, autonomously writes the full thesis bundle per thesis-bundle-standard (thesis, shadow-matrix, consensus-gap if gap≥25%, calibration), and assigns the final verdict. Max one thesis per run.
Watchlist row updates derived from a published verdict are now Tier 1 (the previous P2 Backlog item on PLTR watchlist propagation is closed).
Active-idea refresh (walking existing theses for new evidence and appending calibration checkpoints) folded into the consolidated 7:30 AM scan-and-triage task.

Expected impact:

Pipeline throughput increases; no manual gate between first-read continue and full thesis publication
User time shifts from "approving the start of deep work" to "reviewing completed bundles" — a higher-leverage use of attention
Risk: agent commits compute and analytical capacity to names the user might have passed on. Mitigation: review cycle catches misallocations; verdict override via revision file is preserved.
Risk: narrative leakage as the agent has more autonomy — guarded by sources-policy discipline and the bundle standard

Counter-test: First autonomous thesis (likely FCN, from 2026-05-25 first-read) lands on 2026-05-26. Subsequent calibration over 12-24 months will show whether the autonomy improves the kit's record or surfaces a bias (e.g., over-promotion to continue, under-rigor on shadow-matrix discipline).

Documentation: Updates to Rules (tier reclassifications + autonomy doctrine section), first-read-standard (decision authority section), the consolidated alphasteve-daily-deep-value-scan SKILL.md, new alphasteve-thesis-builder SKILL.md, disabled alphasteve-first-read task (kept for audit).

Shadow matrix scorecard

The shadow valuation matrix (see shadow-valuation-matrix) records four central values on every thesis: pure Klarman, Greenwald-modified (chosen), Buffett-modern, and Mauboussin-compounder. The results are aggregated here as the forward-test data accumulates.

Scorecard format

### Shadow matrix — YYYY-MM (quarterly)

| Methodology | Theses tracked | Hit rate (±25% of 12-mo price) | Median absolute error | Bias direction |
|---|---|---|---|---|
| Pure Klarman / Graham | N | X% | Y% | low / high / aligned |
| **Greenwald-modified** *(chosen)* | N | X% | Y% | ... |
| Buffett-modern | N | X% | Y% | ... |
| Mauboussin-compounder | N | X% | Y% | ... |

**Regime context this quarter:** [bull / bear / neutral; vol regime; rates regime]
**Methodology-fitness read:** [N<10 = no inference; N=10-20 = directional only; N>20 = signal forming]
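One row of the shadow matrix could be computed along these lines. This is a sketch: `shadow_row`, its argument shape, and the 5% tolerance for calling the bias "aligned" are assumptions, while the ±25% hit band comes from the table header.

```python
from statistics import median


def shadow_row(pairs, band=0.25, aligned_tol=0.05):
    """pairs: non-empty list of (central_value, price_12mo) for one methodology."""
    errors = [(central - price) / price for central, price in pairs]
    hits = sum(1 for e in errors if abs(e) <= band)
    med_signed = median(errors)
    if abs(med_signed) <= aligned_tol:  # assumption: tolerance not in the source
        bias = "aligned"
    else:
        bias = "low" if med_signed < 0 else "high"
    return {
        "theses_tracked": len(pairs),
        "hit_rate": hits / len(pairs),
        "median_abs_error": median(abs(e) for e in errors),
        "bias": bias,
    }


# Hypothetical (central value, 12-month price) pairs.
row = shadow_row([(80, 100), (50, 100), (90, 100), (120, 100)])
print(row["hit_rate"], row["bias"])  # 0.75 low
```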

Interpretation discipline (carried from shadow-valuation-matrix)

N<10 theses: matrix is descriptive only; do not draw inference about methodology fitness
N=10-20: weak directional signal possible
N>20 across multiple regime windows: signal becomes credible
Even at N>30: methodology fitness is regime-dependent; track regime context per quarter
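The sample-size thresholds above reduce to a small lookup. The function name and the `regime_windows` argument are hypothetical; the labels paraphrase the rules in the list.

```python
def fitness_read(n_theses, regime_windows=1):
    """Map sample size (and regime coverage) to how much inference is allowed."""
    if n_theses < 10:
        return "descriptive only"
    if n_theses <= 20:
        return "weak directional signal"
    if regime_windows > 1:
        return "credible signal"
    # N>20 inside a single regime window: not yet credible per the rules above.
    return "signal forming, single regime"


print(fitness_read(8))       # descriptive only
print(fitness_read(25, 3))   # credible signal
```

Even a "credible signal" read stays regime-dependent, so the regime context line in the scorecard should be read alongside it.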

Annual methodology-fitness review

Every year on the kit's anniversary (2027-05-24 first), the scorecard data drives an explicit "should we recalibrate?" review. Format:

What does the scorecard say about the chosen methodology's hit rate vs the shadows?
What regime have we been in, and is that regime persistent or about to change?
Has any shadow methodology produced systematically better central values across multiple checkpoints?
Does the data justify a doctrine shift? (Bar is high; see "When to revise the kit's parameters" below.)
If no shift: confirm the current calibration explicitly and document why.

The first review is May 2027. Until then the matrix accumulates data and drives no doctrine changes.

Aggregate scorecard

The scorecard lives in this file as it accumulates. Format:

### Scorecard — YYYY-MM (quarterly)

| Metric | Value | Trend vs prior quarter |
|---|---|---|
| Active theses | N | ... |
| Median gap to consensus mean | -X% | ... |
| Gap direction distribution (below / aligned / above) | X / Y / Z | ... |
| Triggers fired this quarter | N | ... |
| New buy verdicts | N | ... |
| New pass-with-trigger | N | ... |
| Verdict revisions this quarter | N | ... |

### Notes
[What changed in the kit's behavior this quarter? Anything to investigate?]
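A rough sketch of how the core scorecard metrics could be aggregated from per-thesis records. The dict keys are hypothetical, and the sketch covers only the point-in-time rows (active theses, median gap, direction distribution, triggers fired versus pass-with-trigger names). The "new this quarter" rows would need verdict dates, which are omitted here.

```python
from collections import Counter
from statistics import median


def quarterly_scorecard(theses):
    """theses: non-empty list of dicts with 'gap', 'direction', 'verdict',
    and optionally 'trigger_fired'."""
    directions = Counter(t["direction"] for t in theses)
    return {
        "active_theses": len(theses),
        "median_gap": median(t["gap"] for t in theses),
        # Order matches the template row: below / aligned / above.
        "direction_distribution": tuple(
            directions.get(d, 0) for d in ("below", "aligned", "above")
        ),
        "triggers_fired": sum(1 for t in theses if t.get("trigger_fired")),
        "pass_with_trigger": sum(
            1 for t in theses if t["verdict"] == "pass-with-trigger"
        ),
    }
```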

Where this connects

Connection	How
consensus-benchmarking	The skill that produces the data this file aggregates
_calibration-template	Per-thesis tracking; this file is the cross-thesis aggregate
kit-debrief-001-PLTR	The first kit retrospective; future retrospectives feed methodology-calibration findings
Backlog	If a methodology bias is detected, the fix becomes a P1 backlog item
position-sizing-kelly	A persistent over-conservatism bias could be expressed in sizing as well as valuation

When to revise the kit's parameters

The standard for revising load-bearing kit parameters (WACC defaults, terminal-multiple assumptions, EPV vs EPV-plus-growth treatment, MoS thresholds) is high:

Evidence: 10+ theses across multiple sectors showing the same directional bias
Mechanism: A clear hypothesis for why the current parameter produces the bias
Counter-test: The proposed revision, applied retroactively to closed theses, would have changed verdicts in a way that improved the record
Reversibility: The parameter change is documented in the kit-debrief log so it can be unwound if it makes the record worse

Revisions of load-bearing parameters are Tier 2 changes per Rules — they require user review, not autonomous execution.

Sources

Marks, H. (2011). The Most Important Thing. Columbia Business School Publishing. — ch. 17 on tracking one's record.
Mauboussin, M. (2012). The Success Equation. Harvard Business Review Press. — separating skill from luck in investment outcomes.
Greenwald, B. (2001). Value Investing. Wiley. — the EPV vs EPV-plus-growth distinction whose calibration this file partially audits.
Klarman, S. (1991). Margin of Safety. HarperBusiness. — the discipline behind why a deep-value kit should sit below consensus on average, and the discipline behind catching when the gap is bias rather than discipline.

This file is the cross-thesis audit infrastructure. The first entries land as the kit accumulates a record.