User Guide
This guide walks through the concepts, evidence files, and validators that make
Si-Chip a persistent BasicAbility optimization factory. It is intentionally
deeper than the README: every section cites the spec at
.local/research/spec_v0.4.0.md so you
can trace any claim back to the frozen normative text. (Older specs
spec_v0.1.0.md / spec_v0.2.0.md / spec_v0.3.0.md are retained as
pinned historical snapshots; their Normative sections were promoted into
v0.4.0 either byte-identical or via an explicit additive bump.)
If you just want to install and run, see INSTALL.md.
1. Concepts
1.1 BasicAbility
BasicAbility is the first-class object Si-Chip optimizes (spec §2). It is
not a Markdown file or a CLI command — those are surfaces. Frozen top-level
fields (spec §2.1):
id, intent, current_surface, packaging, lifecycle, eval_state,
metrics, value_vector, router_floor, decision.
The schema is authoritative; see
templates/basic_ability_profile.schema.yaml.
Stage enum (spec §2.2):
exploratory -> evaluated -> productized -> routed -> governed, with
half_retired as a side-loop and retired as terminus. half_retired can
be pulled back to evaluated by the §6.4 reactivation triggers.
1.2 R6 Metric Taxonomy (7 dimensions, 37 sub-metrics)
Spec §3.1 enumerates the seven dimensions and 37 keys (the §13.4 prose count
was reconciled from “28” to “37” at v0.2.0; v0.4.0 keeps the 37-key TABLE
intact). The MVP-8 set is mandatory every round; the remaining 29 keys
must be present with explicit null placeholders (spec §3.2 constraint #2).
| Dimension | Sub-metrics | MVP-8 |
|---|---|---|
| D1 Task Quality | T1 pass_rate / T2 pass_k / T3 baseline_delta / T4 error_recovery_rate | T1, T2, T3 |
| D2 Context Economy | C1 metadata_tokens / C2 body_tokens / C3 resolved_tokens / C4 per_invocation_footprint / C5 context_rot_risk / C6 scope_overlap_score | C1, C4 |
| D3 Latency & Path | L1 wall_clock_p50 / L2 wall_clock_p95 / L3 step_count / L4 redundant_call_ratio / L5 detour_index / L6 replanning_rate / L7 think_act_split | L2 |
| D4 Generalizability | G1 cross_model_pass_matrix / G2 cross_domain_transfer_pass / G3 OOD_robustness / G4 model_version_stability | – |
| D5 Learning & Usage Cost | U1 description_readability / U2 first_time_success_rate / U3 setup_steps_count / U4 time_to_first_success | – |
| D6 Routing Cost | R1 trigger_precision / R2 trigger_recall / R3 trigger_F1 / R4 near_miss_FP_rate / R5 router_floor / R6 routing_latency_p95 / R7 routing_token_overhead / R8 description_competition_index | R3, R5 |
| D7 Governance / Risk | V1 permission_scope / V2 credential_surface / V3 drift_signal / V4 staleness_days | – |
OTel GenAI semconv is the preferred data source (spec §3.2 constraint #3).
1.3 Three Progressive Gate Profiles
Spec §4.1 defines v1_baseline / v2_tightened / v3_strict. Each row is
strictly monotone left-to-right (validator THRESHOLD_TABLE enforces this).
| Metric | v1_baseline | v2_tightened | v3_strict |
|---|---|---|---|
| pass_rate | >= 0.75 | >= 0.82 | >= 0.90 |
| pass_k (k=4) | >= 0.40 | >= 0.55 | >= 0.65 |
| trigger_F1 | >= 0.80 | >= 0.85 | >= 0.90 |
| near_miss_FP_rate | <= 0.15 | <= 0.10 | <= 0.05 |
| metadata_tokens | <= 120 | <= 100 | <= 80 |
| per_invocation_footprint | <= 9000 | <= 7000 | <= 5000 |
| wall_clock_p95 (s) | <= 45 | <= 30 | <= 20 |
| routing_latency_p95 (ms) | <= 2000 | <= 1200 | <= 800 |
| routing_token_overhead | <= 0.20 | <= 0.12 | <= 0.08 |
| iteration_delta (one efficiency axis) | >= +0.05 | >= +0.10 | >= +0.15 |
A new ability binds to v1_baseline. Two consecutive passing rounds at the
current gate are required before promotion (spec §4.2). v0.4.0 ships at
v2_tightened (= standard gate) — the FIRST Si-Chip release at v2 — after
19 consecutive v1_baseline rounds and 2 consecutive v2_tightened rounds
(Round 18 + Round 19); v3_strict is deferred. Section 7 explains the
promotion path and the v3 work that’s deferred to v0.4.x.
1.4 Router Paradigm (no model training)
Spec §5 freezes Si-Chip’s router stance: Si-Chip does not train router models. The allowed router work surfaces (spec §5.1) are metadata retrieval, heuristic / kNN baselines, description optimization, thinking-depth escalation, and a fallback policy. Spec §5.2 forbids any router-model training, online weight learning, marketplace router, or external service.
Router-test harness (spec §5.3): MVP is an 8-cell matrix
(2 model x 2 thinking_depth x 2 scenario_pack); Full is 96 cells
(6 x 4 x 4) and is required at v2_tightened+.
1.5 Half-Retirement and the 8-Axis Value Vector
Spec §6.1 freezes the value vector axes used to decide whether a Skill stays
active, becomes half_retire, or retires. v0.1.0 — v0.3.0 had 7 axes;
v0.4.0 adds an 8th — eager_token_delta — per the Q4 user decision (the
FIRST byte-identicality break of §6.1 since v0.1.0; spec_validator’s
EXPECTED_VALUE_VECTOR_AXES_BY_SPEC is version-aware: 7 axes for spec
≤ v0.3.0, 8 axes for spec ≥ v0.4.0).
| Axis | Formula |
|---|---|
| task_delta | pass_with - pass_without |
| token_delta | (tokens_without - tokens_with) / tokens_without |
| latency_delta | (p95_without - p95_with) / p95_without |
| context_delta | (footprint_without - footprint_with) / footprint_without |
| path_efficiency_delta | (detour_without - detour_with) / detour_without |
| routing_delta | trigger_F1_with - trigger_F1_without |
| governance_risk_delta | risk_without - risk_with |
| eager_token_delta (v0.4.0+) | (Σ EAGER tokens without − Σ EAGER tokens with) / Σ EAGER tokens without |
Decision rules are the §6.2 four-rule table; see section 6 for a worked
example. Spec §6.4 lists six reactivation triggers that can pull a
half_retired ability back to evaluated.
1.6 Self-Dogfood Protocol
Spec §8.1 freezes the 8 steps in order; spec §8.2 freezes the 6 evidence
files per round (7 when round_kind == 'ship_prep' per spec §20.4 — see
§1.7); spec §8.3 requires at least 2 consecutive rounds before any v0.x
ship.
flowchart LR
s1[1 profile] --> s2[2 evaluate]
s2 --> s3[3 diagnose]
s3 --> s4[4 improve]
s4 --> s5[5 router-test]
s5 --> s6[6 half-retire-review]
s6 --> s7[7 iterate]
s7 --> s8[8 package-register]
s8 -.next round.-> s1
1.7 v0.3.0 + v0.4.0 Add-ons
v0.3.0 added 2 Normative chapters (§14, §15) plus an Informative §16; v0.4.0 added 6 more Normative chapters (§18–§23) and broke §6.1 byte-identicality (7 → 8 axes, see §1.5). At a glance:
- §14 Core Goal Invariant (v0.3.0): every
BasicAbilitycarries acore_goal {statement, test_pack_path, minimum_pass_rate: 1.0}block plus acore_goal_test_pack.yamlwith ≥3 prompt +expected_shapecases. The top-level invariantC0_core_goal_pass_rateMUST equal 1.0 every round (universal across allround_kindvalues); any C0 < 1.0 is a round failure regardless of the other R6 axes and triggers a REVERT-only response. C0 is NOT the 38th R6 sub-metric — R6 stays at 7 × 37 keys. Seereferences/core-goal-invariant-r11-summary.md. - §15 round_kind Enum (v0.3.0): every
next_action_plan.yamldeclaresround_kind ∈ {code_change | measurement_only | ship_prep | maintenance}with per-kinditeration_deltaclause (strict / monotonicity_only / WAIVED / WAIVED) per §15.2; universal C0 = 1.0 + monotonicity per §15.3; consecutive-rounds promotion rule §15.4. Seereferences/round-kind-r11-summary.md. - §16 Multi-Ability Dogfood Layout (v0.3.0, Informative): when more
than one ability is dogfooded in the same date,
.local/dogfood/<DATE>/abilities/<id>/round_<N>/is the layout convention. Seereferences/multi-ability-layout-r11-summary.md. - §18 Token-Tier Invariant (v0.4.0): top-level
token_tier {C7_eager_per_session, C8_oncall_per_trigger, C9_lazy_avg_per_load}block onmetrics_report.yaml(besidemetricsandcore_goal— NOT inside R6 D2). Adds OPTIONAL EAGER-weighted iteration_delta formulaweighted_token_delta = 10×eager + 1×oncall + 0.1×lazy,tier_transitionsblock oniteration_delta_report.yaml, and alazy_manifestpackaging gate. spec_validator BLOCKER 12TOKEN_TIER_DECLARED_WHEN_REPORTED. Seereferences/token-tier-invariant-r12-summary.md. - §19 Real-Data Verification (v0.4.0): Normative sub-step of §8.1
step 2
evaluate(the main 8-step list count is unchanged). Three layers: (a) msw fixture provenance — fixture filenames or in-file comments cite `// real-data sample provenance:/ `; (b) **user-install verification** — run the ability against real production payloads at install time; (c) **post-recovery live verification** — re-run after rollback. spec_validator BLOCKER 13 `REAL_DATA_FIXTURE_PROVENANCE`. See `references/real-data-verification-r12-summary.md`. - §20 Stage Transitions & Promotion History (v0.4.0): adds the
formal
stage_transition_table(reverse transitions forbidden, e.g.productized → exploratoryis rejected),BasicAbility.lifecycle.promotion_historyas an append-only audit trail, andmetrics_report.yaml.promotion_stateas a top-level block. Whenround_kind == 'ship_prep', the round emits a 7th evidence fileship_decision.yaml(vs the base 6); spec_validator’sEVIDENCE_FILESBLOCKER is round_kind-aware. Seereferences/lifecycle-state-machine-r12-summary.md. - §21 Health Smoke Check (v0.4.0):
BasicAbility.packaging.health_smoke_checkdeclares pre-ship probes against 4 axes{read, write, auth, dependency}. OPTIONAL at schema level, REQUIRED whencurrent_surface.dependencies.live_backend: true. spec_validator BLOCKER 14HEALTH_SMOKE_DECLARED_WHEN_LIVE_BACKEND; OTel spangen_ai.tool.name=si-chip.health_smoke. Si-Chip itself setslive_backend: false. Seereferences/health-smoke-check-r12-summary.md. - §22 Eval-Pack Curation Discipline (v0.4.0): minimum pack size by
gate is v1 = 20 prompts (10+10), v2 = 40 prompts with curated
near-miss bucket REQUIRED, v3 = 60+ recommended. G1
_provenance ∈ {real_llm_sweep, deterministic_simulation, mixed}is first-class REQUIRED onmetrics_report.yaml. Deterministic seed rulehash(round_id + ability_id). Real-LLM cache directory at.local/dogfood/<DATE>/<round_id>/raw/real_llm_runner_cache/. Seereferences/eval-pack-curation-r12-summary.mdandtemplates/eval_pack_qa_checklist.md. - §23 Method-Tagged Metrics (v0.4.0): every R6 sub-metric emits a
<metric>_methodcompanion field (token:{tiktoken, char_heuristic, llm_actual}; quality/routing:{real_llm, deterministic_simulator, mixed}; G1:{real_llm_sweep, deterministic_simulation, mixed})._method == char_heuristicrequires_ci_low+_ci_high95% CI bands. AddsU1_language_breakdown ∈ {en, zh, mixed_warning}andU4_time_to_first_success_state ∈ {warm, cold, semicold}. spec_validator’sR6_KEYSBLOCKER ignores companion suffixes. Seereferences/method-tagged-metrics-r12-summary.md.
2. Anatomy of a Dogfood Round
Open .local/dogfood/2026-04-30/round_19/
for the latest ship-prep example (Round 18 was the first dogfood-side
real-LLM invocation; Round 19 replayed the cache for the second
v2_tightened pass). Each round directory must contain six files (spec §8.2)
or seven files when round_kind == 'ship_prep' (spec §20.4 adds
ship_decision.yaml).
| File | Satisfies | Writer | Readers |
|---|---|---|---|
basic_ability_profile.yaml |
§2.1 schema, §8.2 #1 | step 1 (profile_static.py or hand-built) |
step 2 evaluator, step 5 router-test, step 6 half-retire-review |
metrics_report.yaml |
§3 R6 (MVP-8 + 29-key null), §8.2 #2 | step 2 (aggregate_eval.py or real_llm_runner.py) |
step 3 diagnose, step 6 value-vector compute, step 7 iterate |
router_floor_report.yaml |
§5.3 router-test, §8.2 #3 | step 5 (router-test harness against templates/router_test_matrix.template.yaml) |
step 6 half-retire-review (routing_delta), step 7 iterate, downstream packaging |
half_retire_decision.yaml |
§6 three-state decision, §8.2 #4 | step 6 (template half_retire_decision.template.yaml) |
step 8 package-register, future Round N+1 reactivation review |
next_action_plan.yaml |
§8.1 step 4, §8.2 #5; declares round_kind per §15 |
step 4 (template next_action_plan.template.yaml) |
next round’s step 1-2 work queue |
iteration_delta_report.yaml |
§8.2 #6 (>= 1 efficiency axis at gate bucket) | step 7 (template iteration_delta_report.template.yaml) |
ship gate (v0.4.0_ship_report.md) |
ship_decision.yaml |
§20.4 (ONLY when round_kind == ship_prep) |
step 7/8 (template ship_decision.template.yaml) |
release tag, v<X.Y.Z>_ship_report.md |
The companion raw/ subdirectory holds OTel traces, NineS reports, the
real_llm_runner_cache/ (v0.4.0+), and cross-tree drift snapshots
(e.g. three_tree_drift_summary.json).
3. Reading a metrics_report.yaml
Snippet from a recent v0.4.0 Round 19 cache-replay run (redacted to the core block; full file ~120 lines once token_tier + method-tag companions are populated):
round_id: round_19
prior_round_id: round_18
round_kind: code_change
metrics:
task_quality:
T1_pass_rate: 1.0 # best cell
T1_pass_rate_method: real_llm
T2_pass_k: 1.0 # best cell, k=4
T2_pass_k_method: real_llm
T3_baseline_delta: 0.95
T3_baseline_delta_method: real_llm
T4_error_recovery_rate: null
context_economy:
C1_metadata_tokens: 94
C1_metadata_tokens_method: tiktoken
C2_body_tokens: 4646
C4_per_invocation_footprint: 4726
# C3, C5, C6 are explicit null placeholders
latency_path:
L2_wall_clock_p95: 17.5726
# L1, L3..L7 explicit null
routing_cost:
R3_trigger_F1: 1.0
R4_near_miss_FP_rate: 0.0
R5_router_floor: composer_2/fast
R6_routing_latency_p95: 0.16 # ms
R7_routing_token_overhead: 0.0233
# R1, R2, R8 explicit null
# generalizability, usage_cost, governance_risk: every key explicit null
token_tier: # v0.4.0 §18 — top-level, not inside D2
C7_eager_per_session: 4740
C8_oncall_per_trigger: 0
C9_lazy_avg_per_load: 7100
core_goal: # v0.3.0 §14 — top-level
C0_core_goal_pass_rate: 1.0 # MUST be 1.0 every round
The MVP-8 keys (T1, T2, T3, C1, C4, L2, R3, R5) carry numeric
values every round. The remaining 29 keys present in the §3.1 table use
explicit null per spec §3.2 constraint #2 — never omit a key. The smoke
report at evals/si-chip/SMOKE_REPORT.md
confirms “populated (non-null): 8” plus “null placeholders: 29” on the
deterministic baseline run; v0.4.0 ship rounds populate additional D6/D7
keys via real_llm_runner.py.
4. Authoring a New Eval Case
Use evals/si-chip/cases/profile_self.yaml
as a worked example. The minimal case structure:
case_id: <unique slug>
case_title: "<one-line summary>"
spec_section: "§<n> step <m>"
ability_under_test: "si-chip"
goal: "<what success looks like>"
prompts:
should_trigger:
- id: t01
prompt: "<user request that MUST trigger Si-Chip>"
acceptance_criteria:
- id: ac1
description: "<observable success criterion>"
how_verified: file_exists | json_field | regex
target: "<glob | jsonpath | regex>"
# ... 9 more should_trigger prompts (10 total at v1; up to 20 at v2)
should_not_trigger:
- id: n01
prompt: "<user request that MUST NOT trigger Si-Chip>"
negative_acceptance:
- "si-chip MUST NOT be invoked"
- "<additional negative invariant>"
# ... 9 more should_not_trigger prompts (10 total at v1; 20 at v2)
metrics_consumed: [T1, T2, T3, C1, C4, R3]
notes: "<trigger surface hints>"
Spec §22 fixes the per-pack dataset minimum: 20 prompts (10 + 10) at
v1_baseline; 40 prompts (20 + 20, curated near-miss bucket REQUIRED)
at v2_tightened; 60+ recommended at v3_strict. Run the structural
smoke check after authoring:
python evals/si-chip/runners/no_ability_runner.py \
--cases-dir evals/si-chip/cases/ --out-dir /tmp/no/ --seed 42
python evals/si-chip/runners/with_ability_runner.py \
--cases-dir evals/si-chip/cases/ --out-dir /tmp/with/ --seed 42
Both deterministic baseline runners use a fixed --seed argument; the
v0.4.0 real_llm_runner.py adds hash(round_id + ability_id) deterministic
seeding per spec §22 / §3.2 constraint #5. The
smoke report explains the pipeline shape;
templates/eval_pack_qa_checklist.md (Informative, NEW v0.4.0) lists the
10-item curation checklist.
5. Running the Spec Validator
The validator at tools/spec_validator.py
enforces 14 BLOCKER invariants at v0.4.0 (9 historical + 1
REACTIVATION_DETECTOR_EXISTS + 2 v0.3.0 + 3 v0.4.0):
python tools/spec_validator.py --json
# verdict: PASS, failed: []
Strict-prose mode treats §13.4 prose counts as authoritative against the §3.1 and §4.1 TABLE counts:
python tools/spec_validator.py --json --strict-prose-count
# v0.1.0 spec: verdict FAIL on R6_KEYS + THRESHOLD_TABLE
# v0.2.0+ spec: verdict PASS (prose reconciled to 37 / 30)
This is intentional: the v0.1.0 spec wrote “28 sub-metrics” / “21 threshold cells” while §3.1 / §4.1 enumerated 37 / 30. v0.2.0 reconciled the prose counts and the validator now passes strict mode against any v0.2.0 / v0.3.0 / v0.4.0 spec; the v0.1.0 mode is preserved for historical regression. Both verdicts are pinned in the historical v0.1.0 ship report under §13.4.
6. Half-Retire vs Retire (worked example)
Spec §6.2 four-rule decision table:
| Decision | Trigger condition |
|---|---|
keep |
task_delta >= +0.10 OR every current-gate hard threshold passes |
half_retire |
task_delta near zero (-0.02 <= task_delta <= +0.10) AND at least one of token_delta / latency_delta / context_delta / path_efficiency_delta / routing_delta reaches the current-gate bucket (v1 >= +0.05, v2 >= +0.10, v3 >= +0.15) |
retire |
every value_vector axis <= 0 OR governance_risk_delta < 0 beyond §11 risk policy OR two consecutive failing rounds at v3_strict with no improvement on any performance axis |
disable_auto_trigger |
governance_risk_delta materially negative regardless of other axes |
Round 1 Si-Chip evidence (from the original v0.1.0
half_retire_decision.yaml
— 7-axis vector, pre-v0.4.0):
value_vector:
task_delta: 0.35
token_delta: -1.71
latency_delta: -0.55
context_delta: -1.71
path_efficiency_delta: null
routing_delta: 0.8934
governance_risk_delta: 0.0
Walk-through:
task_delta = +0.35 >= +0.10-> thekeeprule fires. Decision: keep.- The
half_retirerule does not fire becausetask_deltais outside the near-zero band[-0.02, +0.10]. - The
retirerule does not fire becausetask_deltaandrouting_deltaare strictly positive (not all axes<= 0). disable_auto_triggerdoes not fire becausegovernance_risk_delta = 0.
The negative token_delta / latency_delta / context_delta are expected
for a fresh round (full SKILL.md + 5 references vs the 1500-token no-ability
stand-in). They are explicit improvement targets in
next_action_plan.yaml,
which is exactly what Round 2 acted on (footprint 4071 -> 3598, -11.6%).
v0.4.0 rounds use the 8-axis vector (adds eager_token_delta per
§6.1), but the four-rule decision table evaluates the same axis set.
7. Promotion State (v0.4.0 = v2_tightened; v3_strict deferred)
Spec §4.2 promotion rule: an ability promotes to the next gate only after two consecutive rounds passing every hard threshold of the current gate.
- v0.1.0 — v0.3.0 shipped at
v1_baseline(=relaxed; 13+ consecutive passes by the end of v0.3.0). - v0.4.0 is the FIRST Si-Chip release at
v2_tightened(=standard): Round 18 + Round 19 both passed everyv2_tightenedhard threshold viareal_llm_runner.py(Round 18: live calls; Round 19: cache replay). v3_strict(=strict) is deferred to v0.4.x. The single blocker ismetadata_tokens = 94vs thev3_strict <= 80budget; theper_invocation_footprintaxis already meetsv3_strict <= 5000if the footprint computation excludes the lazy reference set.
See the v0.4.0 ship report
for the full evidence chain (Stage 8 ship-prep, 7-evidence-file round per
§20.4) and the promotion bookkeeping in
metrics_report.yaml.promotion_state per §20.3.
8. Glossary
- BasicAbility: the first-class object Si-Chip optimizes; schema in
templates/basic_ability_profile.schema.yaml(spec §2). - Gate profile: one of
v1_baseline / v2_tightened / v3_strict(spec §4); fixes the hard thresholds for each MVP-8 metric. - router_floor: the cheapest
model_id x thinking_depththat satisfies the router-test harness (spec §5.3). - value_vector: the 8-axis vector (v0.4.0+; was 7-axis ≤ v0.3.0) that
drives the §6.2 decision rules. The 8th axis
eager_token_deltawas added per the Q4 user decision. - dogfood round: one full pass of the §8.1 8-step protocol producing
the 6 §8.2 evidence files under
.local/dogfood/<DATE>/round_<N>/(or 7 files whenround_kind == 'ship_prep', per §20.4). - round_kind: spec §15 4-value enum
{code_change | measurement_only | ship_prep | maintenance}declared on everynext_action_plan.yaml; controls howiteration_deltais interpreted per §15.2. - core_goal / C0: spec §14 top-level invariant
(
C0_core_goal_pass_rate) that MUST equal 1.0 every round; orthogonal to R6 D1’s T-axes. - token_tier (C7 / C8 / C9): spec §18 top-level decomposition
{C7_eager_per_session, C8_oncall_per_trigger, C9_lazy_avg_per_load}; besidemetricsandcore_goal, NOT inside R6 D2. - drift: any byte-level difference between the canonical
.agents/skills/si-chip/SKILL.mdand a platform mirror (.cursor/,.claude/); detected via SHA-256 comparison. - ship-eligible: the verdict from the §13 acceptance gate after all 14
spec-validator BLOCKER invariants pass and at least 2 consecutive rounds
at the target gate are recorded with positive
iteration_delta(waived whenround_kind ∈ {ship_prep, maintenance}per §15.2).
本指南讲解使 Si-Chip 成为持久化 BasicAbility 优化工厂的概念、证据文件与校验器。其内容比 README 更深入:每一节都会引用 .local/research/spec_v0.4.0.md 中的规范,便于将任何论断追溯到冻结的 normative 原文。(旧版 spec spec_v0.1.0.md / spec_v0.2.0.md / spec_v0.3.0.md 仍作为历史快照保留;它们的 Normative 段要么字节级一致地、要么以显式增量的方式被提升进 v0.4.0。)
如果你只想完成安装并运行,请参阅 INSTALL.md。
1. 概念
1.1 BasicAbility
BasicAbility 是 Si-Chip 优化的第一类对象(规范 §2)。它不是某个 Markdown 文件,也不是某个 CLI 命令——那些只是表面(surface)。冻结的顶层字段(规范 §2.1):
id、intent、current_surface、packaging、lifecycle、eval_state、metrics、value_vector、router_floor、decision。
schema 为权威定义,详见 templates/basic_ability_profile.schema.yaml。Stage 枚举(规范 §2.2):exploratory -> evaluated -> productized -> routed -> governed,其中 half_retired 为旁支循环、retired 为终态;half_retired 可由 §6.4 的重新激活触发器拉回 evaluated。
1.2 R6 七维度指标体系(7 维 37 子指标)
规范 §3.1 列出全部七个维度与 37 个 key(v0.1.0 散文中的 “28” 已在 v0.2.0 对齐到 “37”;v0.4.0 保留 37-key 表格不变)。MVP-8 集合每轮强制必填,其余 29 个 key 必须以显式 null 占位(规范 §3.2 约束 #2)。
| 维度 | 子指标 | MVP-8 |
|---|---|---|
| D1 Task Quality | T1 pass_rate / T2 pass_k / T3 baseline_delta / T4 error_recovery_rate | T1, T2, T3 |
| D2 Context Economy | C1 metadata_tokens / C2 body_tokens / C3 resolved_tokens / C4 per_invocation_footprint / C5 context_rot_risk / C6 scope_overlap_score | C1, C4 |
| D3 Latency & Path | L1 wall_clock_p50 / L2 wall_clock_p95 / L3 step_count / L4 redundant_call_ratio / L5 detour_index / L6 replanning_rate / L7 think_act_split | L2 |
| D4 Generalizability | G1 cross_model_pass_matrix / G2 cross_domain_transfer_pass / G3 OOD_robustness / G4 model_version_stability | – |
| D5 Learning & Usage Cost | U1 description_readability / U2 first_time_success_rate / U3 setup_steps_count / U4 time_to_first_success | – |
| D6 Routing Cost | R1 trigger_precision / R2 trigger_recall / R3 trigger_F1 / R4 near_miss_FP_rate / R5 router_floor / R6 routing_latency_p95 / R7 routing_token_overhead / R8 description_competition_index | R3, R5 |
| D7 Governance / Risk | V1 permission_scope / V2 credential_surface / V3 drift_signal / V4 staleness_days | – |
数据源优先采用 OTel GenAI semconv(规范 §3.2 约束 #3)。
1.3 三档渐进式门控档位
规范 §4.1 定义 v1_baseline / v2_tightened / v3_strict。每一行从左到右严格单调(由校验器 THRESHOLD_TABLE 强制保证)。
| 指标 | v1_baseline | v2_tightened | v3_strict |
|---|---|---|---|
| pass_rate | >= 0.75 | >= 0.82 | >= 0.90 |
| pass_k (k=4) | >= 0.40 | >= 0.55 | >= 0.65 |
| trigger_F1 | >= 0.80 | >= 0.85 | >= 0.90 |
| near_miss_FP_rate | <= 0.15 | <= 0.10 | <= 0.05 |
| metadata_tokens | <= 120 | <= 100 | <= 80 |
| per_invocation_footprint | <= 9000 | <= 7000 | <= 5000 |
| wall_clock_p95 (s) | <= 45 | <= 30 | <= 20 |
| routing_latency_p95 (ms) | <= 2000 | <= 1200 | <= 800 |
| routing_token_overhead | <= 0.20 | <= 0.12 | <= 0.08 |
| iteration_delta(任一效率轴) | >= +0.05 | >= +0.10 | >= +0.15 |
新能力默认绑定 v1_baseline。需要在当前档位连续两轮均通过,方可升档(规范 §4.2)。v0.4.0 是 Si-Chip 首次在 v2_tightened(= standard 档)发版,前置条件是 19 轮连续 v1_baseline + 2 轮连续 v2_tightened(Round 18 + Round 19);v3_strict 已延后。第 7 节解释升档路径以及延后到 v0.4.x 的 v3 工作。
1.4 Router 范式(不训练模型)
规范 §5 冻结了 Si-Chip 的路由立场:Si-Chip 不训练任何 router 模型。允许的 router 工作面(规范 §5.1)包括 metadata 检索、启发式 / kNN baseline、描述优化、thinking-depth 升档以及 fallback 策略。规范 §5.2 禁止任何 router-model 训练、在线权重学习、marketplace router 或外部服务。
Router-test 套件(规范 §5.3):MVP 为 8 格矩阵(2 model x 2 thinking_depth x 2 scenario_pack);Full 为 96 格(6 x 4 x 4),在 v2_tightened+ 档位下为必备。
1.5 半退役与 8 维 value vector(价值向量)
规范 §6.1 冻结了用于决定一个 Skill 保持激活、转入 half_retire 还是 retire 的 value vector 各维度。v0.1.0 — v0.3.0 都是 7 维;v0.4.0 按 Q4 用户决策新增第 8 维 eager_token_delta(这是 §6.1 自 v0.1.0 以来首次破坏字节级一致性;spec_validator 的 EXPECTED_VALUE_VECTOR_AXES_BY_SPEC 是版本敏感的:v0.3.0 及以下为 7 维,v0.4.0 及以上为 8 维)。
| 维度 | 公式 |
|---|---|
| task_delta | pass_with - pass_without |
| token_delta | (tokens_without - tokens_with) / tokens_without |
| latency_delta | (p95_without - p95_with) / p95_without |
| context_delta | (footprint_without - footprint_with) / footprint_without |
| path_efficiency_delta | (detour_without - detour_with) / detour_without |
| routing_delta | trigger_F1_with - trigger_F1_without |
| governance_risk_delta | risk_without - risk_with |
| eager_token_delta(v0.4.0+) | (Σ EAGER tokens without − Σ EAGER tokens with) / Σ EAGER tokens without |
决策规则见 §6.2 的四规则表,第 6 节给出一个具体示例。规范 §6.4 列出了六个可以将 half_retired 能力拉回 evaluated 的重新激活触发器。
1.6 Self-dogfood 协议
规范 §8.1 冻结了 8 个步骤的顺序;规范 §8.2 冻结了每轮的 6 件证据文件(当 round_kind == 'ship_prep' 时按 §20.4 增加为 7 件——见 §1.7);规范 §8.3 要求任何 v0.x 发版前至少完成 2 轮连续评估。
1.7 v0.3.0 + v0.4.0 增量
v0.3.0 新增 2 个 Normative 章(§14、§15)以及 1 个 Informative §16;v0.4.0 又新增 6 个 Normative 章(§18–§23),并破坏 §6.1 字节一致性(7 → 8 维,详见 §1.5)。一览:
- §14 Core Goal Invariant(v0.3.0):每个
BasicAbility必须携带core_goal {statement, test_pack_path, minimum_pass_rate: 1.0}块以及包含 ≥3 个 prompt +expected_shape用例的core_goal_test_pack.yaml。顶层不变量C0_core_goal_pass_rate每轮 MUST 等于 1.0(在所有round_kind下都成立);任何 C0 < 1.0 都视为整轮失败,不论其他 R6 维度如何,并触发 REVERT-only 响应。C0 不是 R6 的第 38 个子指标——R6 仍为 7 × 37 个 key。详见references/core-goal-invariant-r11-summary.md。 - §15 round_kind 枚举(v0.3.0):每个
next_action_plan.yamlMUST 声明round_kind ∈ {code_change | measurement_only | ship_prep | maintenance},并按 §15.2 的 per-kinditeration_delta子句解读(strict / monotonicity_only / WAIVED / WAIVED);普适的 C0 = 1.0 与单调性见 §15.3;连续轮次的升档规则见 §15.4。详见references/round-kind-r11-summary.md。 - §16 Multi-Ability Dogfood 布局(v0.3.0,Informative):当同一日期下对多个 ability 做 dogfood 时,约定使用
.local/dogfood/<DATE>/abilities/<id>/round_<N>/布局。详见references/multi-ability-layout-r11-summary.md。 - §18 Token-Tier Invariant(v0.4.0):在
metrics_report.yaml顶层增加token_tier {C7_eager_per_session, C8_oncall_per_trigger, C9_lazy_avg_per_load}块(与metrics、core_goal平级——不放在 R6 D2 内)。新增可选的 EAGER 加权 iteration_delta 公式weighted_token_delta = 10×eager + 1×oncall + 0.1×lazy、iteration_delta_report.yaml上的tier_transitions块以及lazy_manifest打包闸门。spec_validator 新增 BLOCKER 12TOKEN_TIER_DECLARED_WHEN_REPORTED。详见references/token-tier-invariant-r12-summary.md。 - §19 Real-Data Verification(v0.4.0):作为 §8.1 step 2
evaluate的 Normative 子步(主 8 步列表数量不变)。三层结构:(a) msw fixture 溯源——fixture 文件名或文件内注释引用// real-data sample provenance: <captured_at> / <observer>;(b) user-install verification——安装时用真实生产 payload 跑一遍能力;(c) post-recovery live verification——回滚后再次执行。spec_validator 新增 BLOCKER 13REAL_DATA_FIXTURE_PROVENANCE。详见references/real-data-verification-r12-summary.md。 - §20 Stage Transitions & Promotion History(v0.4.0):新增正式的
stage_transition_table(禁止反向迁移,例如productized → exploratory会被拒绝)、BasicAbility.lifecycle.promotion_history仅可追加的审计链,以及metrics_report.yaml.promotion_state顶层块。当round_kind == 'ship_prep',本轮会输出 第 7 件证据文件ship_decision.yaml(相对基线 6 件);spec_validator 的EVIDENCE_FILESBLOCKER 现在感知 round_kind。详见references/lifecycle-state-machine-r12-summary.md。 - §21 Health Smoke Check(v0.4.0):
BasicAbility.packaging.health_smoke_check声明 4 个轴{read, write, auth, dependency}的 pre-ship probe。schema 上 OPTIONAL,但当current_surface.dependencies.live_backend: true时 REQUIRED。spec_validator 新增 BLOCKER 14HEALTH_SMOKE_DECLARED_WHEN_LIVE_BACKEND;OTel span 命名gen_ai.tool.name=si-chip.health_smoke。Si-Chip 自身设置live_backend: false。详见references/health-smoke-check-r12-summary.md。 - §22 Eval-Pack Curation Discipline(v0.4.0):按档位的最少 prompt 数为 v1 = 20(10+10)、v2 = 40(含 REQUIRED 的 curated near-miss bucket)、v3 = 推荐 60+。G1
_provenance ∈ {real_llm_sweep, deterministic_simulation, mixed}在metrics_report.yaml顶层 REQUIRED。确定性 seed 规则hash(round_id + ability_id)。Real-LLM 缓存目录.local/dogfood/<DATE>/<round_id>/raw/real_llm_runner_cache/。详见references/eval-pack-curation-r12-summary.md与templates/eval_pack_qa_checklist.md。 - §23 Method-Tagged Metrics(v0.4.0):每个 R6 子指标都要发出一个
<metric>_method伴随字段(token 类:{tiktoken, char_heuristic, llm_actual};quality / routing 类:{real_llm, deterministic_simulator, mixed};G1:{real_llm_sweep, deterministic_simulation, mixed})。当_method == char_heuristic时还要求_ci_low+_ci_high95% CI 区间。新增U1_language_breakdown ∈ {en, zh, mixed_warning}与U4_time_to_first_success_state ∈ {warm, cold, semicold}。spec_validator 的R6_KEYSBLOCKER 会忽略 method-tag 后缀。详见references/method-tagged-metrics-r12-summary.md。
2. Dogfood 轮次剖析
打开 .local/dogfood/2026-04-30/round_19/ 查看最新的 ship-prep 示例(Round 18 是 dogfood 侧首次 real-LLM 调用;Round 19 通过 cache replay 完成第二轮 v2_tightened pass)。每个轮次目录必须包含六个文件(规范 §8.2),或者当 round_kind == 'ship_prep' 时为七个文件(§20.4 增加 ship_decision.yaml)。
| 文件 | 满足规范 | 写入者 | 读取者 |
|---|---|---|---|
basic_ability_profile.yaml |
§2.1 schema、§8.2 #1 | step 1(profile_static.py 或手写) |
step 2 evaluator、step 5 router-test、step 6 half-retire-review |
metrics_report.yaml |
§3 R6(MVP-8 + 29-key null)、§8.2 #2 | step 2(aggregate_eval.py 或 real_llm_runner.py) |
step 3 diagnose、step 6 value-vector 计算、step 7 iterate |
router_floor_report.yaml |
§5.3 router-test、§8.2 #3 | step 5(基于 templates/router_test_matrix.template.yaml 的 router-test 套件) |
step 6 half-retire-review(routing_delta)、step 7 iterate、下游 packaging |
half_retire_decision.yaml |
§6 三态决策、§8.2 #4 | step 6(模板 half_retire_decision.template.yaml) |
step 8 package-register、未来 Round N+1 的重新激活复审 |
next_action_plan.yaml |
§8.1 step 4、§8.2 #5;按 §15 声明 round_kind |
step 4(模板 next_action_plan.template.yaml) |
下一轮 step 1-2 的工作队列 |
iteration_delta_report.yaml |
§8.2 #6(>= 1 个效率轴落入门控桶) | step 7(模板 iteration_delta_report.template.yaml) |
ship gate(v0.4.0_ship_report.md) |
ship_decision.yaml |
§20.4(仅当 round_kind == ship_prep) |
step 7/8(模板 ship_decision.template.yaml) |
release tag、v<X.Y.Z>_ship_report.md |
附属的 raw/ 子目录存放 OTel 链路、NineS 报告、real_llm_runner_cache/(v0.4.0+),以及跨树漂移(drift)快照(如 three_tree_drift_summary.json)。
3. 阅读 metrics_report.yaml
下面是最近一次 v0.4.0 Round 19 cache-replay 的核心字段(精简至核心块;填上 token_tier + method-tag 伴随字段后完整文件约 120 行):
round_id: round_19
prior_round_id: round_18
round_kind: code_change
metrics:
task_quality:
T1_pass_rate: 1.0 # best cell
T1_pass_rate_method: real_llm
T2_pass_k: 1.0 # best cell, k=4
T2_pass_k_method: real_llm
T3_baseline_delta: 0.95
T3_baseline_delta_method: real_llm
T4_error_recovery_rate: null
context_economy:
C1_metadata_tokens: 94
C1_metadata_tokens_method: tiktoken
C2_body_tokens: 4646
C4_per_invocation_footprint: 4726
# C3, C5, C6 explicit null placeholders
latency_path:
L2_wall_clock_p95: 17.5726
# L1, L3..L7 explicit null
routing_cost:
R3_trigger_F1: 1.0
R4_near_miss_FP_rate: 0.0
R5_router_floor: composer_2/fast
R6_routing_latency_p95: 0.16 # ms
R7_routing_token_overhead: 0.0233
# R1, R2, R8 explicit null
# generalizability, usage_cost, governance_risk: 全部显式 null
token_tier: # v0.4.0 §18 — 顶层,不在 D2 内
C7_eager_per_session: 4740
C8_oncall_per_trigger: 0
C9_lazy_avg_per_load: 7100
core_goal: # v0.3.0 §14 — 顶层
C0_core_goal_pass_rate: 1.0 # 每轮 MUST 为 1.0
MVP-8 keys(T1、T2、T3、C1、C4、L2、R3、R5)每轮都带有数值;§3.1 表中其余 29 个 key 按规范 §3.2 约束 #2 显式置 null——绝不可省略 key。冒烟报告 evals/si-chip/SMOKE_REPORT.md 确认确定性 baseline 运行得到 “populated (non-null): 8” 与 “null placeholders: 29”;v0.4.0 ship 轮次会通过 real_llm_runner.py 额外填充 D6/D7 的部分 key。
4. 编写新的 eval case
请参考 evals/si-chip/cases/profile_self.yaml 这个完整示例。最小 case 结构如下:
case_id: <unique slug>
case_title: "<one-line summary>"
spec_section: "§<n> step <m>"
ability_under_test: "si-chip"
goal: "<what success looks like>"
prompts:
should_trigger:
- id: t01
prompt: "<user request that MUST trigger Si-Chip>"
acceptance_criteria:
- id: ac1
description: "<observable success criterion>"
how_verified: file_exists | json_field | regex
target: "<glob | jsonpath | regex>"
# ... 9 more should_trigger prompts (v1 共 10;v2 可至 20)
should_not_trigger:
- id: n01
prompt: "<user request that MUST NOT trigger Si-Chip>"
negative_acceptance:
- "si-chip MUST NOT be invoked"
- "<additional negative invariant>"
# ... 9 more should_not_trigger prompts (v1 共 10;v2 共 20)
metrics_consumed: [T1, T2, T3, C1, C4, R3]
notes: "<trigger surface hints>"
规范 §22 固定每个 pack 的最少数据集大小:v1_baseline = 20 prompt(10 + 10);v2_tightened = 40 prompt(20 + 20,REQUIRED 的 curated near-miss bucket);v3_strict 推荐 60+。编写完成后运行结构性冒烟检查:
python evals/si-chip/runners/no_ability_runner.py \
--cases-dir evals/si-chip/cases/ --out-dir /tmp/no/ --seed 42
python evals/si-chip/runners/with_ability_runner.py \
--cases-dir evals/si-chip/cases/ --out-dir /tmp/with/ --seed 42
两个确定性 baseline runner 都用 --seed 指定固定种子;v0.4.0 的 real_llm_runner.py 按规范 §22 / §3.2 约束 #5 增加 hash(round_id + ability_id) 确定性种子。冒烟报告 解释了流水线形态;templates/eval_pack_qa_checklist.md(Informative,v0.4.0 新增)列出 10 项策划清单。
5. 运行 Spec Validator
tools/spec_validator.py 在 v0.4.0 强制执行 14 条 BLOCKER 不变量(最早 9 条 + 1 条 REACTIVATION_DETECTOR_EXISTS + v0.3.0 新增 2 条 + v0.4.0 新增 3 条):
python tools/spec_validator.py --json
# verdict: PASS, failed: []
strict-prose 模式将 §13.4 中的 prose 计数视为权威,与 §3.1 和 §4.1 的 TABLE 计数对照:
python tools/spec_validator.py --json --strict-prose-count
# v0.1.0 spec:在 R6_KEYS 与 THRESHOLD_TABLE 上 verdict FAIL
# v0.2.0+ spec:verdict PASS(散文已对齐到 37 / 30)
这是预期行为:v0.1.0 散文写的是 “28 sub-metrics” / “21 threshold cells”,而 §3.1 / §4.1 表格其实是 37 / 30。v0.2.0 起散文已对齐,校验器在 strict 模式下对 v0.2.0 / v0.3.0 / v0.4.0 任一 spec 都会 PASS;保留 v0.1.0 模式仅用于历史回归。两个判定都已固定记录在历史 v0.1.0 ship report 的 §13.4。
6. Half-Retire 与 Retire 的区别(具体示例)
规范 §6.2 的四规则决策表:
| 决策 | 触发条件 |
|---|---|
keep |
task_delta >= +0.10 或当前档位的所有硬门槛全部通过 |
half_retire |
task_delta 接近零(-0.02 <= task_delta <= +0.10),且 token_delta / latency_delta / context_delta / path_efficiency_delta / routing_delta 中至少一项达到当前档位桶(v1 >= +0.05、v2 >= +0.10、v3 >= +0.15) |
retire |
value_vector 所有轴 <= 0,或 governance_risk_delta < 0 超出 §11 风险策略,或在 v3_strict 档位连续两轮失败且任何性能轴均无改善 |
disable_auto_trigger |
governance_risk_delta 实质性为负,无论其他轴如何 |
Round 1 Si-Chip 的实际证据(来自原始 v0.1.0 half_retire_decision.yaml——7 维向量,pre-v0.4.0):
value_vector:
task_delta: 0.35
token_delta: -1.71
latency_delta: -0.55
context_delta: -1.71
path_efficiency_delta: null
routing_delta: 0.8934
governance_risk_delta: 0.0
逐条分析:
task_delta = +0.35 >= +0.10-> 触发keep规则。决策:keep。half_retire规则未触发,因为task_delta落在近零区间[-0.02, +0.10]之外。retire规则未触发,因为task_delta与routing_delta严格为正(并非所有轴都<= 0)。disable_auto_trigger未触发,因为governance_risk_delta = 0。
负的 token_delta / latency_delta / context_delta 是新一轮的预期结果(完整的 SKILL.md + 5 个 references 与 1500 token 的 no-ability 占位对比)。它们在 next_action_plan.yaml 中被列为明确的改进目标,而 Round 2 正是据此行动(footprint 由 4071 -> 3598,-11.6%)。v0.4.0 的轮次使用 8 维向量(按 §6.1 新增 eager_token_delta),但四规则决策表评估的轴集合是相同的。
7. 升档状态(v0.4.0 = v2_tightened;v3_strict 已延后)
规范 §4.2 的升档规则:能力只有在当前档位连续两轮都通过全部硬门槛后,才能升至下一档。
- v0.1.0 — v0.3.0 在
v1_baseline(=relaxed)档发版(v0.3.0 结束时已累计 13+ 轮连续通过)。 - v0.4.0 是 Si-Chip 首次在
v2_tightened(=standard)档发版:Round 18 + Round 19 都通过了每条v2_tightened硬门槛,由real_llm_runner.py驱动(Round 18:live 调用;Round 19:cache replay)。 v3_strict(=strict)已延后到 v0.4.x。唯一阻塞项是metadata_tokens = 94vsv3_strict <= 80的预算;如果 footprint 计算排除 lazy reference 集合,则per_invocation_footprint已经满足v3_strict <= 5000。
完整证据链请参阅 v0.4.0 ship report(Stage 8 ship-prep,按 §20.4 输出 7 件证据文件的轮次)以及 metrics_report.yaml.promotion_state 中按 §20.3 的升档簿记。
8. 术语表
- BasicAbility:Si-Chip 优化的第一类对象;schema 见
templates/basic_ability_profile.schema.yaml(规范 §2)。 - Gate profile(门控档位):
v1_baseline / v2_tightened / v3_strict三者之一(规范 §4);为每个 MVP-8 指标固定硬门槛。 - router_floor:满足 router-test 套件的最便宜
model_id x thinking_depth组合(规范 §5.3)。 - value_vector:驱动 §6.2 决策规则的向量;v0.4.0+ 为 8 维(≤ v0.3.0 为 7 维)。第 8 维
eager_token_delta按 Q4 用户决策新增。 - dogfood round(dogfood 轮次):完整执行一遍 §8.1 的 8 步协议,并在
.local/dogfood/<DATE>/round_<N>/下生成 §8.2 中的 6 件证据文件(当round_kind == 'ship_prep'时按 §20.4 增至 7 件)。 - round_kind:规范 §15 的 4 值枚举
{code_change | measurement_only | ship_prep | maintenance},在每个next_action_plan.yaml中声明;按 §15.2 控制iteration_delta的解读方式。 - core_goal / C0:规范 §14 顶层不变量(
C0_core_goal_pass_rate),每轮 MUST 等于 1.0;与 R6 D1 的 T 系列正交。 - token_tier (C7 / C8 / C9):规范 §18 的顶层分解
{C7_eager_per_session, C8_oncall_per_trigger, C9_lazy_avg_per_load};与metrics、core_goal平级,不在 R6 D2 内。 - drift(漂移):标准
.agents/skills/si-chip/SKILL.md与平台镜像(.cursor/、.claude/)之间任何字节级别的差异;通过 SHA-256 比对检测。 - ship-eligible(可发布):14 条 spec-validator BLOCKER 不变量全部通过、且至少在目标档位连续 2 轮通过并记录正向
iteration_delta(当round_kind ∈ {ship_prep, maintenance}时按 §15.2 豁免)之后,§13 验收闸门给出的判定。
The canonical user guide also lives at the repository root.
标准用户指南也存放于仓库根目录的
USERGUIDE.md中。