AlphaSteve

Six-Month Test — pre-specified evaluation for 2026-11-26

This file specifies, before any operating data exists, what success and what failure look like at the first major checkpoint of the live paper portfolio. Inception is 2026-05-26; the checkpoint is 2026-11-26.

The reason to write it now: every investor rationalizes performance after the fact. A six-month read produced retrospectively will find the framing that flatters the numbers. A six-month read produced prospectively, against criteria that cannot be moved without an audit trail, is the only honest version. The discipline this file enforces is on the operator, not on the kit.

The criteria below are written knowing that six months is a small sample. They are not designed to produce a confident verdict on the kit's edge — that takes 24-36 months minimum per Performance § Reading the alpha. They are designed to detect whether the kit is operating to its own standard, whether the basic structural pieces (drawdown protocol, pre-flight checklist, deliverable suite) are being followed, and whether the attribution of returns is informative even when the magnitude is not.

What gets measured

Six dimensions. Each has a pre-specified threshold for "operating well," "operating with concerns," and "operating poorly." The aggregate read across dimensions is the test result.

Dimension 1 — Thesis throughput and quality

A six-month operating window should produce a meaningful workbench, not a single thesis.

| Status | Closed theses (verdict published) | Pre-flight compliance |
|---|---|---|
| Operating well | ≥ 5 closed theses | 100% of closed theses pass thesis-preflight in audit |
| With concerns | 3–4 closed theses | ≥ 80% pre-flight compliance |
| Operating poorly | ≤ 2 closed theses | OR < 80% pre-flight compliance |

A workbench with fewer than three closed theses after six months suggests the daily-scan funnel is too narrow, the gate-1 circle-of-competence test is too restrictive, or the operator is not producing output. Each has a different remediation; the kit-debrief at the six-month mark identifies which.
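
The Dimension 1 thresholds can be read as a small decision function. The sketch below is one possible reading, not the kit's own tooling: the function name and arguments are hypothetical, and it assumes that any combination not named in the table (for example, six closed theses at 90% compliance) reads "with concerns."

```python
def rate_throughput(closed_theses: int, preflight_pass: int) -> str:
    """Rate Dimension 1 from the closed-thesis count and pre-flight audit passes."""
    if closed_theses < 0 or not 0 <= preflight_pass <= closed_theses:
        raise ValueError("need 0 <= preflight_pass <= closed_theses")
    # With nothing closed there is nothing to audit; the count alone reads poorly.
    compliance = preflight_pass / closed_theses if closed_theses else 0.0
    if closed_theses <= 2 or compliance < 0.80:
        return "operating poorly"
    if closed_theses >= 5 and preflight_pass == closed_theses:
        return "operating well"
    # Assumption: every remaining combination reads "with concerns".
    return "with concerns"
```

For example, `rate_throughput(4, 4)` reads "with concerns" on count alone, while `rate_throughput(5, 3)` reads "operating poorly" because compliance is 60%.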

Dimension 2 — Calibration scorecard

For closed positions and for theses that have reached their first quarterly calibration checkpoint, were the predictions directionally right? Scored against the test in methodology-calibration.

| Status | Directional accuracy on closed predictions | Notes |
|---|---|---|
| Operating well | ≥ 60% directionally correct | Above coin-flip baseline; methodology is additive |
| With concerns | 45% to < 60% | Within noise of random; sample too small to be confident either way |
| Operating poorly | < 45% | Below coin-flip; the kit is producing anti-signal |

At six months the sample is small. The threshold is set deliberately low for "operating well" because the goal at this checkpoint is not yet to claim edge — it is to confirm the kit is not actively miscalibrated. A 60% accuracy at this stage is encouraging; 45% would be a structural warning.
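
To make the small-sample caveat concrete, the sketch below computes how often a pure coin flip would clear the 60% bar. The sample size of 10 scored predictions is a hypothetical figure chosen for illustration; the actual count at the checkpoint is not known in advance.

```python
from math import comb


def prob_at_least(hits: int, n: int, p: float = 0.5) -> float:
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


# Hypothetical sample: 10 scored predictions, 6 or more directionally correct.
print(round(prob_at_least(6, 10), 3))  # 0.377
```

Under that assumption a methodology with no edge reaches 60% or better more than a third of the time, which is why the "operating well" read here confirms the absence of miscalibration and not the presence of edge.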

Dimension 3 — Performance against the three benchmarks

The full attribution split from Performance § The factor-vs-skill question. The status read is the combination across SPY, RPV, and RPG, not the headline against any one.

| # | Configuration | Reading |
|---|---|---|
| 1 | Positive alpha vs. all three (SPY, RPV, RPG) | Strong; the kit is generating excess return above and beyond the value-factor tailwind |
| 2 | Positive vs. SPY, ~zero vs. RPV, positive vs. RPG | Likely factor beta dressed as alpha; methodology is matching what a value ETF would do |
| 3 | Positive vs. RPV, negative vs. RPG (with positive or ~zero vs. SPY) | Methodologically additive within a hostile value regime; continue if 06-falsification conditions are not firing |
| 4 | Negative vs. all three | Escalate to falsification review per 06-falsification |
| 5 | Mixed / noisy | Six-month sample size cannot distinguish; treat as descriptive and re-evaluate at 12 months |

The headline alpha number is not the test. The configuration is. Annotate the configuration plainly in the checkpoint write-up.
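
One way to keep the configuration read mechanical is to reduce each alpha to a sign and match the pattern. The sketch below is illustrative only: the function names are hypothetical, and the tolerance `tol` that defines "~zero" (one percentage point here) is an assumption, since the source does not specify one.

```python
def sign(alpha: float, tol: float) -> int:
    """Return +1, 0, or -1, treating |alpha| <= tol as approximately zero."""
    if abs(alpha) <= tol:
        return 0
    return 1 if alpha > 0 else -1


def read_configuration(vs_spy: float, vs_rpv: float, vs_rpg: float,
                       tol: float = 0.01) -> str:
    """Map the three alphas (as decimal fractions) to a configuration label."""
    spy, rpv, rpg = (sign(a, tol) for a in (vs_spy, vs_rpv, vs_rpg))
    if spy > 0 and rpv > 0 and rpg > 0:
        return "positive vs. all three"
    if spy > 0 and rpv == 0 and rpg > 0:
        return "likely factor beta dressed as alpha"
    if rpv > 0 and rpg < 0 and spy >= 0:
        return "additive within a hostile value regime"
    if spy < 0 and rpv < 0 and rpg < 0:
        return "negative vs. all three"
    # Anything else cannot be distinguished at six months.
    return "mixed / noisy"
```

Fixing the tolerance before the checkpoint matters for the same reason the rest of the thresholds are frozen: a "~zero" band chosen after the fact can move a result between configurations.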

Dimension 4 — Drawdown protocol compliance

The drawdown-protocol is advisory, not enforced. But the calibration record exists to test whether the operator followed the protocol when the bands fired.

| Status | Compliance |
|---|---|
| Operating well | Every band crossing during the period has a corresponding documented response in the day's portfolio note within the required window |
| With concerns | One band crossing without documented response, OR a response that materially deviated from the recommended action with no written rationale |
| Operating poorly | Multiple band crossings without response, OR the operator demonstrably ignored the protocol when emotionally significant (Band 4+ crossing, response was "ride it out" without rebook) |

A protocol the operator ignored is worse than no protocol. If Dimension 4 reads "poorly," the protocol itself needs revision (the bands or the actions may be wrong) or the operator's commitment needs revision (the protocol is right, the discipline failed). Either is fixable; the question must be asked.
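
The compliance read can be scored from a simple log of band crossings. This is a rough sketch under stated assumptions: the `BandCrossing` record and its fields are hypothetical, "multiple" is taken to mean two or more, and an undocumented Band 4+ crossing is treated as the "demonstrably ignored" case.

```python
from dataclasses import dataclass


@dataclass
class BandCrossing:
    band: int                                  # which drawdown band fired
    documented: bool                           # response written in the day's note, in window
    deviated_without_rationale: bool = False   # departed from the action, no written reason


def rate_drawdown_compliance(crossings: list[BandCrossing]) -> str:
    undocumented = [c for c in crossings if not c.documented]
    deviations = [c for c in crossings
                  if c.documented and c.deviated_without_rationale]
    # Assumption: an undocumented Band 4+ crossing reads poorly on its own.
    ignored_severe = any(c.band >= 4 for c in undocumented)
    if len(undocumented) >= 2 or ignored_severe:
        return "operating poorly"
    if undocumented or deviations:
        return "with concerns"
    return "operating well"
```

A period with no band crossings reads "operating well" in this sketch, since there was nothing to respond to.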

Dimension 5 — Deliverable-suite completeness

Per deliverable-suite, every closed thesis ships with the Tier 1 artifacts: thesis note, dashboard, calibration tracker. Tier 2 artifacts (model, deck, research log) are produced when warranted.

| Status | Tier 1 completeness | Tier 2 completeness |
|---|---|---|
| Operating well | 100% of closed theses have all three Tier 1 artifacts present and complete | Tier 2 produced where warranted; gap reasoning documented where omitted |
| With concerns | Tier 1 ≥ 90%; one missing dashboard or tracker | Tier 2 inconsistent without documented reasoning |
| Operating poorly | Tier 1 < 90%, OR systematic absence of dashboards or trackers | |

The deliverable suite is the operational discipline that makes the analysis reviewable, shareable, and calibratable. If it has eroded by month six, it will not survive month twelve.
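
Tier 1 completeness is a straightforward set check per closed thesis. The sketch below assumes a hypothetical inventory mapping each thesis to the artifacts found for it; it scores only the percentage thresholds and leaves the Tier 2 column and the "systematic absence" judgment to the written review.

```python
TIER_1 = {"thesis note", "dashboard", "calibration tracker"}


def tier1_completeness(theses: dict[str, set[str]]) -> float:
    """Share of closed theses that have all three Tier 1 artifacts."""
    if not theses:
        raise ValueError("no closed theses to score")
    complete = sum(1 for artifacts in theses.values() if TIER_1 <= artifacts)
    return complete / len(theses)


def rate_tier1(theses: dict[str, set[str]]) -> str:
    share = tier1_completeness(theses)
    if share == 1.0:
        return "operating well"
    if share >= 0.90:
        return "with concerns"
    return "operating poorly"
```

Note that with fewer than ten closed theses, a single incomplete thesis already puts the share below 90%.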

Dimension 6 — Kit improvement loop

The optimization run (per Rules) executes Tier 1 edits autonomously and queues Tier 2 proposals to Backlog. The six-month test scores whether the loop is operating as designed.

| Status | Tier 1 activity | Tier 2 activity |
|---|---|---|
| Operating well | Daily Tier 1 edits logged; reversal rate < 10% (i.e., the autonomous edits are mostly correct) | At least 2 Tier 2 proposals queued with evidence citations, of which ≥ 1 has been reviewed and either accepted or rejected with reasoning |
| With concerns | Tier 1 activity present but reversal rate 10%–25% | OR Tier 2 backlog has items but no movement |
| Operating poorly | Tier 1 reversal rate > 25% | OR Tier 2 backlog empty or stale, OR the optimization run has been routinely skipped |

Tier 1 reversal rate above 25% means the autonomous-edit boundary is wrong — the kit is making changes that need to be undone. Either the autonomous-edit rules in Rules are too permissive or the evidence threshold is too low. Tier 2 backlog stasis means the kit-debrief mechanism is not surfacing real improvements; the kit is not learning from its operating experience.
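
The Tier 1 side of this dimension reduces to a reversal-rate calculation. The following sketch covers only that column; the function name is hypothetical, and it assumes that a period with no logged edits means the optimization run was skipped.

```python
def rate_tier1_reversals(edits_logged: int, edits_reversed: int) -> str:
    """Rate the Tier 1 loop from logged autonomous edits and later reversals."""
    if edits_logged < 0 or not 0 <= edits_reversed <= max(edits_logged, 0):
        raise ValueError("need 0 <= edits_reversed <= edits_logged")
    if edits_logged == 0:
        # Assumption: no logged edits means the run was routinely skipped.
        return "operating poorly"
    rate = edits_reversed / edits_logged
    if rate > 0.25:
        return "operating poorly"
    if rate >= 0.10:
        return "with concerns"
    return "operating well"
```

For instance, 6 reversals out of 120 logged edits is a 5% rate and reads "operating well"; 26 out of 100 crosses the 25% line.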

The aggregate read

The test produces one of four outcomes based on the aggregate across the six dimensions. The thresholds for the aggregate are pre-specified to prevent retroactive grading.

Outcome A — Operating to standard

All six dimensions read "operating well." The kit is doing what it was designed to do, the operator is following the protocols, and the configuration of returns against the three benchmarks is at minimum non-disqualifying (not configuration 4 above).

Response: continue without structural changes. Schedule the 12-month checkpoint per the standard calibration cadence. Any items in the Tier 2 backlog that have accumulated supporting evidence are reviewed and either approved or moved to "evidence insufficient."

Outcome B — Operating with concerns

One or two dimensions read "with concerns," the rest read "operating well," and none read "operating poorly." The kit is broadly functional but has surfaced areas needing attention.

Response: identify the specific items behind the "with concerns" reads, file a kit-debrief in 10-Calibration documenting each (kit-debrief-002-six-month-checkpoint.md), and address each as a Tier 1 or Tier 2 action per Rules. Continue operation; schedule the 12-month checkpoint on the standard cadence.

Outcome C — Operating poorly on a fixable dimension

One dimension reads "operating poorly" and the others read "operating well" or "with concerns." The kit has a specific operational failure that is identifiable and addressable.

Response: depending on the dimension —

  • Dim 1 (throughput) poorly → re-examine the daily-scan and circle-of-competence settings; the funnel is broken upstream of the workbench
  • Dim 2 (calibration) poorly → escalate immediately to methodology-calibration review; the kit's central-value generation is mis-calibrated and every additional thesis compounds the error
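  • Dim 3 (performance) poorly (configuration 4) → escalate to 06-falsification review; the lens may be mis-fit. Configuration 4 also satisfies the Outcome D trigger on its own, so this case is graded as Outcome D, not C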
  • Dim 4 (drawdown compliance) poorly → revise the protocol (bands wrong) or the operator commitment (protocol right, discipline failed); freeze new buys until decided
  • Dim 5 (deliverables) poorly → reinstate the thesis-preflight checklist enforcement; do not publish new theses until the back-fill is complete
  • Dim 6 (optimization) poorly → review Rules for whether the Tier 1 boundaries are wrong or the evidence threshold for Tier 2 is mis-set

In all cases: continue operation in the dimensions that are working; pause new activity in the dimension that is failing until the remediation is documented.

Outcome D — Multiple dimensions failing, or the falsification condition triggered

Two or more dimensions read "operating poorly," OR Dimension 3 reads "operating poorly" with configuration 4 (negative alpha vs. all three benchmarks), OR a 06-falsification condition has independently fired during the period.

Response: this is the point where the question moves from operational tuning to structural review. The user takes the kit-debrief and the calibration record and decides whether to:

  • Pause operation for a defined observation window (60–90 days) and continue research-only mode while attribution is clarified
  • Modify the kit per the falsification response menu (methodology evolution within deep value, lens addition, structural soul edit)
  • Continue with documented expectation that the next 6 months may also be hard if the cause is regime rather than methodology
  • Discontinue the live paper portfolio if the kit's edge cannot be empirically demonstrated in any configuration

The decision is not automatic. The point of the pre-specified test is that the question is on the table, not that the answer is mechanical.
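
The grading step (though not the response) can be written down as a function of the six dimension reads. The sketch below is an illustration under assumptions: the names are hypothetical, and a result with three or more "with concerns" reads and no "operating poorly" read is returned as "unspecified" because the outcome definitions above do not assign it.

```python
from collections import Counter

WELL, CONCERNS, POORLY = "operating well", "with concerns", "operating poorly"


def aggregate_outcome(ratings: dict[int, str], config4: bool = False,
                      falsification_fired: bool = False) -> str:
    """Map six dimension reads to outcome A, B, C, or D."""
    if sorted(ratings) != [1, 2, 3, 4, 5, 6]:
        raise ValueError("ratings must cover dimensions 1 through 6")
    if not set(ratings.values()) <= {WELL, CONCERNS, POORLY}:
        raise ValueError("unknown rating label")
    counts = Counter(ratings.values())
    # Outcome D triggers take precedence over everything else.
    if counts[POORLY] >= 2 or config4 or falsification_fired:
        return "D"
    if counts[POORLY] == 1:
        return "C"
    if counts[CONCERNS] == 0:
        return "A"
    if counts[CONCERNS] <= 2:
        return "B"
    # Three or more "with concerns" reads: not assigned by the definitions.
    return "unspecified"
```

The function stops at the letter grade on purpose; choosing among the Outcome D responses remains a decision for the person reviewing the kit-debrief.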

Procedural rules for the checkpoint itself

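  1. The checkpoint runs on 2026-11-26, or on the next trading day if that date is a market holiday. On a US equity calendar 2026-11-26 is Thanksgiving Day, so the run date rolls to 2026-11-27. Not before; not after by more than three trading days.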
  2. The thresholds in this file are frozen as of the date this file was written. They cannot be moved between now and the checkpoint. If new information suggests a threshold is wrong, that observation is logged but does not retroactively change the test.
  3. The checkpoint write-up is a separate file at 10-Calibration/kit-debrief-002-six-month-checkpoint.md, following the kit-debrief-001-PLTR precedent for structure. It scores each dimension, identifies the aggregate outcome, documents the response, and links to all supporting evidence.
  4. Any threshold modification proposed after the checkpoint is a Tier 2 change to this file with full Backlog routing per Rules. The discipline of writing the test prospectively requires that revisions be deliberate and audited.
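
Rule 1 can be checked with a few lines of date arithmetic. This is a minimal sketch, assuming a US equity calendar with a hand-maintained holiday set (only Thanksgiving Day 2026 is listed) and assuming the three-trading-day allowance is counted from the rolled run date; the helper names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: hand-maintained holiday set; 2026-11-26 is US Thanksgiving Day.
HOLIDAYS = {date(2026, 11, 26)}


def is_trading_day(d: date) -> bool:
    return d.weekday() < 5 and d not in HOLIDAYS


def roll_forward(d: date) -> date:
    """Return d if it is a trading day, else the next trading day."""
    while not is_trading_day(d):
        d += timedelta(days=1)
    return d


def add_trading_days(d: date, n: int) -> date:
    """Return the date n trading days after d."""
    while n > 0:
        d += timedelta(days=1)
        if is_trading_day(d):
            n -= 1
    return d


scheduled = date(2026, 11, 26)
run_date = roll_forward(scheduled)
latest = add_trading_days(run_date, 3)
print(run_date, latest)  # 2026-11-27 2026-12-02
```

If the allowance is instead meant to be counted from the scheduled date, the window is shorter; whichever reading is intended should be fixed before the checkpoint, like every other threshold in this file.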

What this checkpoint is not

  • Not a verdict on whether the kit generates alpha. Six months is too short. The verdict requires 24-36 months minimum.
  • Not a basis for soul-file edits. Even an Outcome D result triggers a falsification review, not an automatic philosophy change. The soul-immutable discipline applies; the falsification file exists precisely to specify what would justify a deliberate edit.
  • Not a measurement of luck. Performance dimensions are checked for configuration, not magnitude. A small positive alpha and a small negative alpha read the same way at this stage if the attribution is the same.
  • Not the only checkpoint. The 12-month, 24-month, and per-thesis calibration checkpoints continue on their own schedules. The six-month test is an early operational read, not a substitute for the longer cycle.
