Six-Month Test — pre-specified evaluation for 2026-11-26

This file specifies, before any operating data exists, what success and what failure look like at the first major checkpoint of the live paper portfolio. Inception is 2026-05-26; the checkpoint is 2026-11-26.

The reason to write it now: every investor rationalizes performance after the fact. A six-month read produced retrospectively will find the framing that flatters the numbers. A six-month read produced prospectively, against criteria that cannot be moved without an audit trail, is the only honest version. The discipline this file enforces is on the operator, not on the kit.

The criteria below are written knowing that six months is a small sample. They are not designed to produce a confident verdict on the kit's edge — that takes 24-36 months minimum per Performance § Reading the alpha. They are designed to detect whether the kit is operating to its own standard, whether the basic structural pieces (drawdown protocol, pre-flight checklist, deliverable suite) are being followed, and whether the attribution of returns is informative even when the magnitude is not.

What gets measured

Six original dimensions, plus a seventh (Dimension 7, capital deployment) added by amendment on 2026-06-28. Each has a pre-specified threshold for "operating well," "operating with concerns," and "operating poorly." The aggregate read across dimensions is the test result.

Dimension 1 — Thesis throughput and quality

A six-month operating window should produce a meaningful workbench, not a single thesis.

Status	Closed theses (verdict published)	Pre-flight compliance
Operating well	≥ 5 closed theses	100% of closed theses pass thesis-preflight in audit
With concerns	3–4 closed theses	≥ 80% pre-flight compliance
Operating poorly	≤ 2 closed theses, OR < 80% pre-flight compliance

A workbench with fewer than three closed theses after six months suggests the daily-scan funnel is too narrow, the gate-1 circle-of-competence test is too restrictive, or the operator is not producing output. Each has a different remediation; the kit-debrief at the six-month mark identifies which.
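The Dimension 1 thresholds can be written as a small scoring function. This is a minimal sketch, not part of the kit: the function name and its count-based inputs are hypothetical. The table does not say how to read five or more closed theses with pre-flight compliance between 80% and 100%; the sketch assumes that case reads "with concerns."

```python
def score_throughput(closed_theses: int, passed_preflight: int) -> str:
    """Classify Dimension 1 from counts (hypothetical helper, not kit code).

    closed_theses: theses with a published verdict in the window.
    passed_preflight: how many of those pass thesis-preflight in audit.
    """
    if passed_preflight > closed_theses or min(closed_theses, passed_preflight) < 0:
        raise ValueError("counts must satisfy 0 <= passed_preflight <= closed_theses")

    # Integer comparison avoids float rounding at the 80% boundary.
    below_80 = passed_preflight * 100 < 80 * closed_theses
    if closed_theses <= 2 or below_80:
        return "operating poorly"
    if closed_theses >= 5 and passed_preflight == closed_theses:
        return "operating well"
    # Assumption: everything else, including >= 5 closed with 80-99%
    # compliance, reads "with concerns".
    return "with concerns"
```

For example, `score_throughput(4, 4)` reads "with concerns" on throughput alone, even with full pre-flight compliance.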

Dimension 2 — Calibration scorecard

For closed positions and for theses that have reached their first quarterly calibration checkpoint, were the predictions directionally right? Scored against the test in methodology-calibration.

Status	Directional accuracy on closed predictions	Notes
Operating well	≥ 60% directionally correct	Above coin-flip baseline; methodology is additive
With concerns	50%–60%	Within noise of random; sample too small to be confident either way
Operating poorly	< 45%	Below coin-flip baseline; the kit is producing anti-signal

At six months the sample is small. The "operating well" threshold is set deliberately low because the goal at this checkpoint is not yet to claim edge; it is to confirm the kit is not actively miscalibrated. Accuracy of 60% at this stage is encouraging; accuracy below 45% would be a structural warning.
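A minimal sketch of the Dimension 2 bands, assuming predictions are tallied as simple correct/total counts (the function name and inputs are hypothetical). The table assigns no status to accuracy from 45% up to 50%, so the sketch reports that range as unspecified rather than guessing a band.

```python
def score_calibration(correct: int, total: int) -> str:
    """Classify directional accuracy on closed predictions (illustrative only)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")

    pct_x_total = correct * 100  # compare against threshold * total in integers
    if pct_x_total >= 60 * total:
        return "operating well"
    if pct_x_total >= 50 * total:
        return "with concerns"
    if pct_x_total < 45 * total:
        return "operating poorly"
    # 45% <= accuracy < 50%: not assigned by the table; surfaced explicitly.
    return "unspecified"
```

A result of "unspecified" is a prompt to record the raw accuracy in the checkpoint write-up, not a status in its own right.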

Dimension 3 — Performance against the three benchmarks

This dimension uses the full attribution split from Performance § The factor-vs-skill question. The status read is the combination of results against SPY, RPV, and RPG, not the headline number against any one of them.

Configuration	Reading
Positive alpha vs. all three (SPY, RPV, RPG)	Strong; the kit is generating excess return above and beyond the value-factor tailwind
Positive vs. SPY, ~zero vs. RPV, positive vs. RPG	Likely factor beta dressed as alpha; methodology is matching what a value ETF would do
Positive vs. RPV, negative vs. RPG (with positive or ~zero vs. SPY)	Methodologically additive within a hostile value regime — continue if 06-falsification conditions are not firing
Negative vs. all three	Escalate to falsification review per 06-falsification
Mixed / noisy	Six-month sample size cannot distinguish; treat as descriptive and re-evaluate at 12 months

The headline alpha number is not the test. The configuration is. Annotate the configuration plainly in the checkpoint write-up.
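The configuration table can be sketched as a classifier over the three alpha figures. Everything here is an assumption for illustration: the function names, the use of decimal fractions for alpha, and especially the tolerance `eps` that decides what counts as "~zero," which this file does not define.

```python
def _sign(alpha: float, eps: float) -> int:
    """Return +1, 0, or -1, treating |alpha| <= eps as ~zero."""
    if alpha > eps:
        return 1
    if alpha < -eps:
        return -1
    return 0


def classify_configuration(vs_spy: float, vs_rpv: float, vs_rpg: float,
                           eps: float = 0.01) -> int:
    """Map alpha vs. SPY, RPV, RPG to table rows 1-5 (row 5 = mixed / noisy).

    eps is a hypothetical tolerance band; pick it before the checkpoint.
    """
    s, v, g = (_sign(a, eps) for a in (vs_spy, vs_rpv, vs_rpg))
    if (s, v, g) == (1, 1, 1):
        return 1  # positive vs. all three
    if (s, v, g) == (1, 0, 1):
        return 2  # likely factor beta dressed as alpha
    if v == 1 and g == -1 and s >= 0:
        return 3  # additive within a hostile value regime
    if (s, v, g) == (-1, -1, -1):
        return 4  # negative vs. all three: falsification review
    return 5      # mixed / noisy: descriptive only


CONFIG_READING = {
    1: "strong",
    2: "likely factor beta",
    3: "additive in hostile value regime",
    4: "escalate to falsification review",
    5: "mixed / noisy",
}
```

Because the result is sensitive to `eps`, fixing its value before the checkpoint keeps the classifier consistent with the prospective intent of the test.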

Dimension 4 — Drawdown protocol compliance

The drawdown-protocol is advisory, not enforced. But the calibration record exists to test whether the operator followed the protocol when the bands fired.

Status	Compliance
Operating well	Every band crossing during the period has a corresponding documented response in the day's portfolio note within the required window
With concerns	One band crossing without documented response, OR a response that materially deviated from the recommended action with no written rationale
Operating poorly	Multiple band crossings without a documented response, OR the operator demonstrably ignored the protocol when the crossing was emotionally significant (a Band 4+ crossing where the response was "ride it out" without a rebook)

A protocol the operator ignored is worse than no protocol. If Dimension 4 reads "poorly," the protocol itself needs revision (the bands or the actions may be wrong) or the operator's commitment needs revision (the protocol is right, the discipline failed). Either is fixable; the question must be asked.

Dimension 5 — Deliverable-suite completeness

Per deliverable-suite, every closed thesis ships with the Tier 1 artifacts: thesis note, dashboard, calibration tracker. Tier 2 artifacts (model, deck, research log) are produced when warranted.

Status	Tier 1 completeness	Tier 2 completeness
Operating well	100% of closed theses have all three Tier 1 artifacts present and complete	Tier 2 produced where warranted; gap reasoning documented where omitted
With concerns	Tier 1 ≥ 90%; one missing dashboard or tracker	Tier 2 inconsistent without documented reasoning
Operating poorly	Tier 1 < 90%, OR systematic absence of dashboards or trackers

The deliverable suite is the operational discipline that makes the analysis reviewable, shareable, and calibratable. If it has eroded by month six, it will not survive month twelve.

Dimension 6 — Kit improvement loop

The optimization run (per Rules) executes Tier 1 edits autonomously and queues Tier 2 proposals to Backlog. The six-month test scores whether the loop is operating as designed.

Status	Tier 1 activity	Tier 2 activity
Operating well	Daily Tier 1 edits logged; reversal rate < 10% (i.e., the autonomous edits are mostly correct)	At least 2 Tier 2 proposals queued with evidence citations, of which ≥ 1 has been reviewed and either accepted or rejected with reasoning
With concerns	Tier 1 activity present but reversal rate 10%–25%; OR Tier 2 backlog has items but no movement
Operating poorly	Tier 1 reversal rate > 25%, OR Tier 2 backlog empty or stale, OR the optimization run has been routinely skipped

Tier 1 reversal rate above 25% means the autonomous-edit boundary is wrong — the kit is making changes that need to be undone. Either the autonomous-edit rules in Rules are too permissive or the evidence threshold is too low. Tier 2 backlog stasis means the kit-debrief mechanism is not surfacing real improvements; the kit is not learning from its operating experience.
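The Tier 1 column reduces to a reversal-rate band. The sketch below covers only that column and assumes the edit log can be reduced to two counts; the function name and inputs are hypothetical. The Tier 2 conditions (backlog movement, skipped runs) still need to be read separately.

```python
def score_tier1_reversals(edits: int, reversed_edits: int) -> str:
    """Band the Tier 1 reversal rate: < 10%, 10%-25%, > 25% (illustrative)."""
    if edits <= 0 or not 0 <= reversed_edits <= edits:
        raise ValueError("need edits > 0 and 0 <= reversed_edits <= edits")

    # Integer comparisons keep the 10% and 25% boundaries exact.
    if reversed_edits * 100 > 25 * edits:
        return "operating poorly"
    if reversed_edits * 100 >= 10 * edits:
        return "with concerns"
    return "operating well"
```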

Dimension 7 — Capital deployment and decisiveness (added 2026-06-28)

Authorized amendment per external-audit-2026-06-28 (user-approved). The original six dimensions could all read "operating well" while the portfolio held 100% cash and never bought a single share. The audit's central finding was that "you can ace the six-month test having never invested a dollar." A research process that never deploys capital produces no realized data and cannot be evaluated. This dimension closes that hole. Per the procedural rules it is logged as a Tier 2 amendment in Backlog and the audit-log; the original six thresholds remain frozen.

The question: did the kit act, or did it only document? Measured over the inception-to-checkpoint window.

Status	Capital deployed / positions taken
Operating well	≥ 3 distinct positions opened (probes count), AND the Shadow-Book and Near-Miss-Ledger are populated with real marked prices (not placeholders), AND time-at-100%-cash is < 50% of trading days
With concerns	1–2 positions opened, OR the miss-tracking ledgers exist but are partially populated, OR time-at-100%-cash is 50–80% of trading days
Operating poorly	Zero positions opened, OR time-at-100%-cash > 80% of trading days, OR the Shadow-Book / Near-Miss ledgers remain empty placeholders

Zero positions after six months reads "operating poorly," full stop, regardless of how well the other six dimensions read. Patience is a virtue; total inaction sustained for half a year, with empty calibration ledgers, is the over-conservatism failure mode that methodology-calibration (Failure Mode 1) warned about. If this dimension fails, the remediation is to re-run the watchlist through the recalibrated margin-of-safety-pricing ruler, check whether the failure reflects a genuinely barren opportunity set or a still-miscalibrated ruler, and verify that the probe-book mandate (position-sizing-kelly) is being honored.
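A minimal sketch of the Dimension 7 table, assuming the ledger state is summarized as one of three labels and cash exposure is tracked as a count of trading days at 100% cash. The function name, parameter names, and ledger labels are all hypothetical.

```python
def score_deployment(positions_opened: int, cash_only_days: int,
                     trading_days: int, ledgers: str) -> str:
    """Classify Dimension 7 (illustrative only).

    ledgers: "populated" (real marked prices), "partial", or "empty"
             for the Shadow-Book / Near-Miss-Ledger taken together.
    """
    if ledgers not in {"populated", "partial", "empty"}:
        raise ValueError("ledgers must be 'populated', 'partial', or 'empty'")
    if trading_days <= 0 or not 0 <= cash_only_days <= trading_days:
        raise ValueError("need trading_days > 0 and a valid cash_only_days")

    cash_x100 = cash_only_days * 100  # compare as integers, no float rounding
    if positions_opened == 0 or cash_x100 > 80 * trading_days or ledgers == "empty":
        return "operating poorly"
    if (positions_opened >= 3 and ledgers == "populated"
            and cash_x100 < 50 * trading_days):
        return "operating well"
    return "with concerns"
```

The "operating poorly" branch is checked first so that zero positions can never be masked by the other two inputs.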

The aggregate read

The test produces one of four outcomes based on the aggregate across the seven dimensions. The thresholds for the aggregate are pre-specified to prevent retroactive grading. Dimension 7 (capital deployment) reading "operating poorly" is, by itself, sufficient to bar an Outcome A "operating to standard" result — the kit cannot be "operating to standard" if it has never acted.

Outcome A — Operating to standard

All six dimensions read "operating well." The kit is doing what it was designed to do, the operator is following the protocols, and the configuration of returns against the three benchmarks is at minimum non-disqualifying (not configuration 4 above).

Response: continue without structural changes. Schedule the 12-month checkpoint per the standard calibration cadence. Any items in the Tier 2 backlog that have accumulated supporting evidence are reviewed and either approved or moved to "evidence insufficient."

Outcome B — Operating with concerns

One or two dimensions read "with concerns," the rest read "operating well," and none read "operating poorly." The kit is broadly functional but has surfaced areas needing attention.

Response: identify the specific items behind the "with concerns" reads, file a kit-debrief in 10-Calibration documenting each (kit-debrief-002-six-month-checkpoint.md), and address each as Tier 1 or Tier 2 actions per Rules. Continue operation; schedule the 12-month checkpoint on the standard cadence.

Outcome C — Operating poorly on a fixable dimension

One dimension reads "operating poorly" and the others read "operating well" or "with concerns." The kit has a specific operational failure that is identifiable and addressable.

Response: depending on the dimension —

Dim 1 (throughput) poorly → re-examine the daily-scan and circle-of-competence settings; the funnel is broken upstream of the workbench
Dim 2 (calibration) poorly → escalate immediately to methodology-calibration review; the kit's central-value generation is mis-calibrated and every additional thesis compounds the error
Dim 3 (performance) poorly (configuration 4) → escalate to 06-falsification review; the lens may be mis-fit
Dim 4 (drawdown compliance) poorly → revise the protocol (bands wrong) or the operator commitment (protocol right, discipline failed); freeze new buys until decided
Dim 5 (deliverables) poorly → reinstate the thesis-preflight checklist enforcement; do not publish new theses until the back-fill is complete
Dim 6 (optimization) poorly → review Rules for whether the Tier 1 boundaries are wrong or the evidence threshold for Tier 2 is mis-set

In all cases: continue operation in the dimensions that are working; pause new activity in the dimension that is failing until the remediation is documented.

Outcome D — Multiple dimensions failing, or the falsification condition triggered

Two or more dimensions read "operating poorly," OR Dimension 3 reads "operating poorly" with configuration 4 (negative alpha vs. all three benchmarks), OR a 06-falsification condition has independently fired during the period.

Response: this is the band where the question moves from operational tuning to structural review. The user takes the kit-debrief and the calibration record and decides whether to:

Pause operation for a defined observation window (60–90 days) and continue research-only mode while attribution is clarified
Modify the kit per the falsification response menu (methodology evolution within deep value, lens addition, structural soul edit)
Continue with a documented expectation that the next six months may also be hard if the cause is regime rather than methodology
Discontinue the live paper portfolio if the kit's edge cannot be empirically demonstrated in any configuration

The decision is not automatic. The point of the pre-specified test is that the question is on the table, not that the answer is mechanical.
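The four outcomes can be sketched as a single function over the per-dimension reads. Several choices here are assumptions, not statements from this file: the function name and inputs are hypothetical; all seven dimensions are treated uniformly; Outcome D is checked first, so configuration 4 always resolves to D; and three or more "with concerns" reads with no "operating poorly" read, which the outcome definitions do not cover, is returned as unspecified. The function only identifies the outcome; the response remains the operator's decision.

```python
from collections import Counter


def aggregate_outcome(reads: dict, configuration: int,
                      falsification_fired: bool = False) -> str:
    """Return "A", "B", "C", "D", or "unspecified" (illustrative only).

    reads: dimension number -> "operating well" | "with concerns" |
           "operating poorly".
    configuration: Dimension 3 table row, 1-5 (4 = negative vs. all three).
    """
    counts = Counter(reads.values())
    poorly = counts["operating poorly"]
    concerns = counts["with concerns"]

    # Assumption: D takes precedence over every other outcome.
    if poorly >= 2 or configuration == 4 or falsification_fired:
        return "D"
    if poorly == 1:
        return "C"  # includes Dimension 7 alone reading poorly, which bars A
    if concerns == 0:
        return "A"
    if concerns <= 2:
        return "B"
    return "unspecified"  # three or more "with concerns", none poorly
```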

Procedural rules for the checkpoint itself

The checkpoint runs on 2026-11-26 or the next trading day if that is a holiday. Not before; not after by more than three trading days.
The thresholds in this file are frozen as of the date this file was written. They cannot be moved between now and the checkpoint. If new information suggests a threshold is wrong, that observation is logged but does not retroactively change the test.
The checkpoint write-up is a separate file at 10-Calibration/kit-debrief-002-six-month-checkpoint.md, following the kit-debrief-001-PLTR precedent for structure. It scores each dimension, identifies the aggregate outcome, documents the response, and links to all supporting evidence.
Any threshold modification proposed after the checkpoint is a Tier 2 change to this file with full Backlog routing per Rules. The discipline of writing the test prospectively requires that revisions be deliberate and audited.
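The timing rule can be sketched as a window calculation. Note that 2026-11-26 is the fourth Thursday of November, which is US Thanksgiving; if the portfolio follows a US equity calendar (an assumption, since this file does not name an exchange), the checkpoint resolves to Friday 2026-11-27. The helper names are hypothetical, the holiday set must be supplied by the caller, and the three-trading-day limit is assumed to count from the first eligible day.

```python
from datetime import date, timedelta


def is_trading_day(d: date, holidays: set) -> bool:
    """Weekdays that are not in the caller-supplied holiday set."""
    return d.weekday() < 5 and d not in holidays


def checkpoint_window(target: date, holidays: set, max_late: int = 3):
    """Return (earliest, latest) permissible checkpoint dates.

    earliest: target itself, or the next trading day if target is not one.
    latest:   max_late trading days after earliest (assumed interpretation).
    """
    earliest = target
    while not is_trading_day(earliest, holidays):
        earliest += timedelta(days=1)

    latest, remaining = earliest, max_late
    while remaining > 0:
        latest += timedelta(days=1)
        if is_trading_day(latest, holidays):
            remaining -= 1
    return earliest, latest


# Usage, assuming Thanksgiving is the only holiday in range:
earliest, latest = checkpoint_window(date(2026, 11, 26), {date(2026, 11, 26)})
```

With that holiday set, `earliest` is 2026-11-27 and `latest` is 2026-12-02.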

What this checkpoint is not

Not a verdict on whether the kit generates alpha. Six months is too short. The verdict requires 24-36 months minimum.
Not a basis for soul-file edits. Even an Outcome D result triggers a falsification review, not an automatic philosophy change. The soul-immutable discipline applies; the falsification file exists precisely to specify what would justify a deliberate edit.
Not a measurement of luck. Performance dimensions are checked for configuration, not magnitude. A small positive alpha and a small negative alpha read the same way at this stage if the attribution is the same.
Not the only checkpoint. The 12-month, 24-month, and per-thesis calibration checkpoints continue on their own schedules. The six-month test is an early operational read, not a substitute for the longer cycle.

Linked

Performance — the daily ledger and the three-benchmark configuration this test reads
Portfolio — current positions and the drawdown state
drawdown-protocol — Dim 4 reads against this
thesis-preflight — Dim 1 reads compliance against this
deliverable-suite — Dim 5 reads completeness against this
methodology-calibration — the long-run bias audit Dim 2 escalates to if it fails
06-falsification — the framework-level question Dim 3 routes toward if configuration 4 fires
Rules — Dim 6 reads against the optimization tier discipline
kit-debrief-001-PLTR — the template for the checkpoint write-up
Backlog — where Tier 2 items accumulate between now and the checkpoint