Six-Month Test — pre-specified evaluation for 2026-11-26
This file specifies, before any operating data exists, what success and what failure look like at the first major checkpoint of the live paper portfolio. Inception is 2026-05-26; the checkpoint is 2026-11-26.
The reason to write it now: every investor rationalizes performance after the fact. A six-month read produced retrospectively will find the framing that flatters the numbers. A six-month read produced prospectively, against criteria that cannot be moved without an audit trail, is the only honest version. The discipline this file enforces is on the operator, not on the kit.
The criteria below are written knowing that six months is a small sample. They are not designed to produce a confident verdict on the kit's edge — that takes 24-36 months minimum per Performance § Reading the alpha. They are designed to detect whether the kit is operating to its own standard, whether the basic structural pieces (drawdown protocol, pre-flight checklist, deliverable suite) are being followed, and whether the attribution of returns is informative even when the magnitude is not.
What gets measured
Six dimensions. Each has a pre-specified threshold for "operating well," "operating with concerns," and "operating poorly." The aggregate read across dimensions is the test result.
Dimension 1 — Thesis throughput and quality
A six-month operating window should produce a meaningful workbench, not a single thesis.
| Status | Closed theses (verdict published) | Pre-flight compliance |
|---|---|---|
| Operating well | ≥ 5 closed theses | 100% of closed theses pass thesis-preflight in audit |
| With concerns | 3–4 closed theses | ≥ 80% pre-flight compliance |
| Operating poorly | ≤ 2 closed theses | OR < 80% pre-flight compliance |
A workbench with fewer than three closed theses after six months suggests the daily-scan funnel is too narrow, the gate-1 circle-of-competence test is too restrictive, or the operator is not producing output. Each has a different remediation; the kit-debrief at the six-month mark identifies which.
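A minimal sketch of the Dimension 1 read, assuming the audit yields two counts: closed theses, and closed theses that passed thesis-preflight. The function name and the zero-thesis handling are illustrative assumptions, not part of the kit. The table does not say how to read five or more closed theses with compliance between 80% and 100%; the sketch reads that case conservatively as "with concerns."

```python
def score_throughput(closed: int, passed_preflight: int) -> str:
    """Return the Dimension 1 status from closed-thesis and pre-flight counts."""
    if closed < 0 or not 0 <= passed_preflight <= closed:
        raise ValueError("counts must satisfy 0 <= passed_preflight <= closed")
    # "Operating poorly": two or fewer closed theses, OR compliance below 80%.
    # Integer comparison avoids float rounding at the 80% boundary.
    if closed <= 2 or passed_preflight * 100 < 80 * closed:
        return "operating poorly"
    # "Operating well": at least five closed theses, every one passing pre-flight.
    if closed >= 5 and passed_preflight == closed:
        return "operating well"
    # Everything else (3-4 closed, or >= 5 closed with 80%-99% compliance).
    # The second case is an assumption; the table does not cover it explicitly.
    return "with concerns"
```

For example, `score_throughput(4, 4)` reads "with concerns" on count alone, even though compliance is perfect.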
Dimension 2 — Calibration scorecard
For closed positions and for theses that have reached their first quarterly calibration checkpoint, were the predictions directionally right? Scored against the test in methodology-calibration.
| Status | Directional accuracy on closed predictions | Notes |
|---|---|---|
| Operating well | ≥ 60% directionally correct | Above coin-flip baseline; methodology is additive |
| With concerns | 50%–60% | Within noise of random; sample too small to be confident either way |
| Operating poorly | < 45% | Below coin-flip; the kit is producing anti-signal |
At six months the sample is small. The threshold is set deliberately low for "operating well" because the goal at this checkpoint is not yet to claim edge — it is to confirm the kit is not actively miscalibrated. A 60% accuracy at this stage is encouraging; 45% would be a structural warning.
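The thresholds above can be sketched as a small classifier. This is an illustration under stated assumptions: the function name is hypothetical, and the inputs are assumed to be simple counts of directionally correct and total scored predictions. The table assigns no status to accuracy from 45% up to (but not including) 50%, so the sketch returns "unspecified" for that range rather than guessing.

```python
def score_calibration(correct: int, total: int) -> str:
    """Return the Dimension 2 status from directional-accuracy counts."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    # Integer comparisons keep the boundaries exact (no float rounding).
    if correct * 100 >= 60 * total:
        return "operating well"
    if correct * 100 >= 50 * total:
        return "with concerns"
    if correct * 100 < 45 * total:
        return "operating poorly"
    # 45% <= accuracy < 50%: not covered by the table as written.
    return "unspecified"
```

With only a handful of closed predictions, a single outcome moves the read by a full band: 5 of 10 is "with concerns," 6 of 10 is "operating well."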
Dimension 3 — Performance against the three benchmarks
The full attribution split from Performance § The factor-vs-skill question. The status read is the combination across SPY, RPV, and RPG, not the headline against any one.
| Configuration | Reading |
|---|---|
| Positive alpha vs. all three (SPY, RPV, RPG) | Strong; the kit is generating excess return above and beyond the value-factor tailwind |
| Positive vs. SPY, ~zero vs. RPV, positive vs. RPG | Likely factor beta dressed as alpha; methodology is matching what a value ETF would do |
| Positive vs. RPV, negative vs. RPG (with positive or ~zero vs. SPY) | Methodologically additive within a hostile value regime — continue if 06-falsification conditions are not firing |
| Negative vs. all three | Escalate to falsification review per 06-falsification |
| Mixed / noisy | Six-month sample size cannot distinguish; treat as descriptive and re-evaluate at 12 months |
The headline alpha number is not the test. The configuration is. Annotate the configuration plainly in the checkpoint write-up.
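One way to make the configuration read mechanical is to reduce each alpha to a sign and match the pattern against the table. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the `tol` band that defines "~zero" (here one percentage point) is a placeholder, since this file does not specify one.

```python
def _sign(alpha: float, tol: float) -> int:
    """Map an alpha to +1, 0 (within +/- tol of zero), or -1."""
    if alpha > tol:
        return 1
    if alpha < -tol:
        return -1
    return 0


def read_configuration(vs_spy: float, vs_rpv: float, vs_rpg: float,
                       tol: float = 0.01) -> str:
    """Return the Dimension 3 reading for alphas expressed as fractions.

    tol is a hypothetical noise band for "~zero"; it is not defined in the kit.
    """
    spy, rpv, rpg = (_sign(a, tol) for a in (vs_spy, vs_rpv, vs_rpg))
    if (spy, rpv, rpg) == (1, 1, 1):
        return "strong"
    if (spy, rpv, rpg) == (1, 0, 1):
        return "likely factor beta"
    if rpv == 1 and rpg == -1 and spy >= 0:
        return "additive within hostile value regime"
    if (spy, rpv, rpg) == (-1, -1, -1):
        return "escalate to falsification review"
    # Any other sign pattern falls through to the table's last row.
    return "mixed / noisy"
```

Because only signs are compared, `read_configuration(0.02, 0.02, 0.02)` and `read_configuration(0.20, 0.20, 0.20)` return the same reading, which matches the rule that the configuration, not the magnitude, is the test.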
Dimension 4 — Drawdown protocol compliance
The drawdown-protocol is advisory, not enforced. But the calibration record exists to test whether the operator followed the protocol when the bands fired.
| Status | Compliance |
|---|---|
| Operating well | Every band crossing during the period has a corresponding documented response in the day's portfolio note within the required window |
| With concerns | One band crossing without documented response, OR a response that materially deviated from the recommended action with no written rationale |
| Operating poorly | Multiple band crossings without response, OR the operator demonstrably ignored the protocol at an emotionally significant moment (a Band 4+ crossing where the response was "ride it out" without a rebook) |
A protocol the operator ignored is worse than no protocol. If Dimension 4 reads "poorly," the protocol itself needs revision (the bands or the actions may be wrong) or the operator's commitment needs revision (the protocol is right, the discipline failed). Either is fixable; the question must be asked.
Dimension 5 — Deliverable-suite completeness
Per deliverable-suite, every closed thesis ships with the Tier 1 artifacts: thesis note, dashboard, calibration tracker. Tier 2 artifacts (model, deck, research log) are produced when warranted.
| Status | Tier 1 completeness | Tier 2 completeness |
|---|---|---|
| Operating well | 100% of closed theses have all three Tier 1 artifacts present and complete | Tier 2 produced where warranted; gap reasoning documented where omitted |
| With concerns | Tier 1 ≥ 90%; one missing dashboard or tracker | Tier 2 inconsistent without documented reasoning |
| Operating poorly | Tier 1 < 90%, OR systematic absence of dashboards or trackers | |
The deliverable suite is the operational discipline that makes the analysis reviewable, shareable, and calibratable. If it has eroded by month six, it will not survive month twelve.
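A minimal sketch of the Tier 1 completeness check. It assumes completeness is counted per artifact (three expected per closed thesis), which is one possible reading of "Tier 1 ≥ 90%"; counting per thesis would be stricter. The function name and the input shape (a mapping from a thesis identifier to the set of artifacts present) are illustrative assumptions. The qualitative tests in the table, such as "systematic absence," are not modeled.

```python
TIER1_ARTIFACTS = frozenset({"thesis note", "dashboard", "calibration tracker"})


def score_tier1(theses: dict[str, set[str]]) -> tuple[float, str]:
    """Return (share of expected Tier 1 artifacts present, Dimension 5 status)."""
    if not theses:
        raise ValueError("no closed theses to score")
    expected = len(theses) * len(TIER1_ARTIFACTS)
    # Count only recognized Tier 1 artifacts; anything else is ignored.
    present = sum(len(found & TIER1_ARTIFACTS) for found in theses.values())
    share = present / expected
    if present == expected:
        status = "operating well"
    elif present * 10 >= expected * 9:  # share >= 90%, compared in integers
        status = "with concerns"
    else:
        status = "operating poorly"
    return share, status
```

Under the per-artifact assumption, one missing dashboard across four closed theses is 11 of 12 artifacts (about 92%), which reads "with concerns."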
Dimension 6 — Kit improvement loop
The optimization run (per Rules) executes Tier 1 edits autonomously and queues Tier 2 proposals to Backlog. The six-month test scores whether the loop is operating as designed.
| Status | Tier 1 activity | Tier 2 activity |
|---|---|---|
| Operating well | Daily Tier 1 edits logged; reversal rate < 10% (i.e., the autonomous edits are mostly correct) | At least 2 Tier 2 proposals queued with evidence citations, of which ≥ 1 has been reviewed and either accepted or rejected with reasoning |
| With concerns | Tier 1 activity present but reversal rate 10%–25% | OR Tier 2 backlog has items but no movement |
| Operating poorly | Tier 1 reversal rate > 25%, OR the optimization run has been routinely skipped | OR Tier 2 backlog empty or stale |
Tier 1 reversal rate above 25% means the autonomous-edit boundary is wrong — the kit is making changes that need to be undone. Either the autonomous-edit rules in Rules are too permissive or the evidence threshold is too low. Tier 2 backlog stasis means the kit-debrief mechanism is not surfacing real improvements; the kit is not learning from its operating experience.
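The Tier 1 reversal bands can be sketched as follows. This covers only the reversal-rate part of the Tier 1 column; the Tier 2 backlog conditions and the skipped-run condition are judgment calls that the sketch leaves out. The function name and the count-based inputs are assumptions for illustration.

```python
def tier1_reversal_band(edits: int, reversals: int) -> str:
    """Return the Dimension 6 band implied by the Tier 1 reversal rate alone."""
    if edits <= 0 or not 0 <= reversals <= edits:
        raise ValueError("need edits > 0 and 0 <= reversals <= edits")
    # Integer comparisons keep the 10% and 25% boundaries exact.
    if reversals * 100 < 10 * edits:
        return "operating well"
    if reversals * 100 <= 25 * edits:
        return "with concerns"
    return "operating poorly"
```

The final Dimension 6 status is the worse of this band and the separate Tier 2 read.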
The aggregate read
The test produces one of four outcomes based on the aggregate across the six dimensions. The thresholds for the aggregate are pre-specified to prevent retroactive grading.
Outcome A — Operating to standard
All six dimensions read "operating well." The kit is doing what it was designed to do, the operator is following the protocols, and the configuration of returns against the three benchmarks is at minimum non-disqualifying (not configuration 4, negative alpha vs. all three benchmarks).
Response: continue without structural changes. Schedule the 12-month checkpoint per the standard calibration cadence. Any items in the Tier 2 backlog that have accumulated supporting evidence are reviewed and either approved or moved to "evidence insufficient."
Outcome B — Operating with concerns
At most two dimensions read "with concerns" and none read "operating poorly." The kit is broadly functional but has surfaced areas needing attention.
Response: identify the specific items behind the "with concerns" reads, file a kit-debrief in 10-Calibration documenting each (kit-debrief-002-six-month-checkpoint.md), and address each as a Tier 1 or Tier 2 action per Rules. Continue operation; schedule the 12-month checkpoint on the standard cadence.
Outcome C — Operating poorly on a fixable dimension
One dimension reads "operating poorly" and the others read "operating well" or "with concerns." The kit has a specific operational failure that is identifiable and addressable.
Response: depending on the dimension —
- Dim 1 (throughput) poorly → re-examine the daily-scan and circle-of-competence settings; the funnel is broken upstream of the workbench
- Dim 2 (calibration) poorly → escalate immediately to methodology-calibration review; the kit's central-value generation is mis-calibrated and every additional thesis compounds the error
- Dim 3 (performance) poorly (configuration 4) → escalate to 06-falsification review; the lens may be mis-fit
- Dim 4 (drawdown compliance) poorly → revise the protocol (bands wrong) or the operator commitment (protocol right, discipline failed); freeze new buys until decided
- Dim 5 (deliverables) poorly → reinstate the thesis-preflight checklist enforcement; do not publish new theses until the back-fill is complete
- Dim 6 (optimization) poorly → review Rules for whether the Tier 1 boundaries are wrong or the evidence threshold for Tier 2 is mis-set
In all cases: continue operation in the dimensions that are working; pause new activity in the dimension that is failing until the remediation is documented.
Outcome D — Multiple dimensions failing, or the falsification condition triggered
Two or more dimensions read "operating poorly," OR Dimension 3 reads "operating poorly" with configuration 4 (negative alpha vs. all three benchmarks), OR a 06-falsification condition has independently fired during the period.
Response: this is the band where the question moves from operational tuning to structural review. The operator takes the kit-debrief and the calibration record and decides whether to:
- Pause operation for a defined observation window (60–90 days) and continue research-only mode while attribution is clarified
- Modify the kit per the falsification response menu (methodology evolution within deep value, lens addition, structural soul edit)
- Continue with a documented expectation that the next six months may also be hard if the cause is regime rather than methodology
- Discontinue the live paper portfolio if the kit's edge cannot be empirically demonstrated in any configuration
The decision is not automatic. The point of the pre-specified test is that the question is on the table, not that the answer is mechanical.
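The aggregate rules for Outcomes A through D can be sketched as a single function. It assumes each of the six dimensions has already been reduced to one of the three status labels, and that configuration 4 and any 06-falsification trigger are passed in as flags; all names are illustrative. Two choices are assumptions rather than rules stated here: Outcome D is checked first, because a lone Dimension 3 failure with configuration 4 satisfies both C and D as written, and three or more "with concerns" reads with no "operating poorly" read returns "unspecified," because none of the four outcomes covers that case.

```python
VALID = {"operating well", "with concerns", "operating poorly"}


def aggregate_outcome(statuses: list[str], config4: bool = False,
                      falsification_fired: bool = False) -> str:
    """Return "A", "B", "C", "D", or "unspecified" for six dimension statuses."""
    if len(statuses) != 6 or any(s not in VALID for s in statuses):
        raise ValueError("expected six statuses drawn from the three labels")
    poorly = statuses.count("operating poorly")
    concerns = statuses.count("with concerns")
    # Outcome D takes precedence over C (assumption; see lead-in).
    if poorly >= 2 or config4 or falsification_fired:
        return "D"
    if poorly == 1:
        return "C"
    if concerns == 0:
        return "A"
    if concerns <= 2:
        return "B"
    # Three or more "with concerns" and none "poorly": not covered by A-D.
    return "unspecified"
```

The function only names the outcome. The response under Outcome D remains the operator's decision.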
Procedural rules for the checkpoint itself
- The checkpoint runs on 2026-11-26 or the next trading day if that is a holiday. Not before; not after by more than three trading days.
- The thresholds in this file are frozen as of the date this file was written. They cannot be moved between now and the checkpoint. If new information suggests a threshold is wrong, that observation is logged but does not retroactively change the test.
- The checkpoint write-up is a separate file at 10-Calibration/kit-debrief-002-six-month-checkpoint.md, following the kit-debrief-001-PLTR precedent for structure. It scores each dimension, identifies the aggregate outcome, documents the response, and links to all supporting evidence.
- Any threshold modification proposed after the checkpoint is a Tier 2 change to this file with full Backlog routing per Rules. The discipline of writing the test prospectively requires that revisions be deliberate and audited.
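The date rule in the first bullet can be sketched with the standard library. Two points are assumptions: the holiday set is supplied by the caller (no exchange calendar is specified in this file), and the three-trading-day limit is measured from the rolled-forward run date. The sketch also rolls weekends forward, which the rule implies but does not state. All function names are hypothetical.

```python
from datetime import date, timedelta


def is_trading_day(d: date, holidays: set[date]) -> bool:
    """Weekday (Mon-Fri) that is not in the caller-supplied holiday set."""
    return d.weekday() < 5 and d not in holidays


def next_trading_day(d: date, holidays: set[date]) -> date:
    """First trading day on or after d."""
    while not is_trading_day(d, holidays):
        d += timedelta(days=1)
    return d


def checkpoint_window(target: date, holidays: set[date]) -> tuple[date, date]:
    """Return (earliest run date, latest permitted run date)."""
    run = next_trading_day(target, holidays)
    latest = run
    for _ in range(3):  # allow at most three further trading days
        latest = next_trading_day(latest + timedelta(days=1), holidays)
    return run, latest


if __name__ == "__main__":
    # 2026-11-26 is the fourth Thursday of November (US Thanksgiving).
    us_holidays = {date(2026, 11, 26)}
    print(checkpoint_window(date(2026, 11, 26), us_holidays))
```

If the portfolio follows a US equity calendar, 2026-11-26 is a market holiday, so the run date rolls to Friday 2026-11-27 and the window closes on Wednesday 2026-12-02.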
What this checkpoint is not
- Not a verdict on whether the kit generates alpha. Six months is too short. The verdict requires 24-36 months minimum.
- Not a basis for soul-file edits. Even an Outcome D result triggers a falsification review, not an automatic philosophy change. The soul-immutable discipline applies; the falsification file exists precisely to specify what would justify a deliberate edit.
- Not a measurement of luck. Performance dimensions are checked for configuration, not magnitude. A small positive alpha and a small negative alpha read the same way at this stage if the attribution is the same.
- Not the only checkpoint. The 12-month, 24-month, and per-thesis calibration checkpoints continue on their own schedules. The six-month test is an early operational read, not a substitute for the longer cycle.
Linked
- Performance — the daily ledger and the three-benchmark configuration this test reads
- Portfolio — current positions and the drawdown state
- drawdown-protocol — Dim 4 reads against this
- thesis-preflight — Dim 1 reads compliance against this
- deliverable-suite — Dim 5 reads completeness against this
- methodology-calibration — the long-run bias audit Dim 2 escalates to if it fails
- 06-falsification — the framework-level question Dim 3 routes toward if configuration 4 fires
- Rules — Dim 6 reads against the optimization tier discipline
- kit-debrief-001-PLTR — the template for the checkpoint write-up
- Backlog — where Tier 2 items accumulate between now and the checkpoint