Demo
This page walks through the actual evidence Si-Chip produced when it
self-shipped v0.4.0 on 2026-04-30. v0.4.0 is the first Si-Chip release at
the v2_tightened (= standard) gate — Round 18 + Round 19 are the two
consecutive v2_tightened passes that unlocked promotion. The historical
v0.1.0 evidence (Round 1 + Round 2; v1_baseline; 2026-04-28) is preserved
in chapter 8 as an additive baseline.
本页面逐章呈现 Si-Chip 在 2026-04-30 自我交付 v0.4.0 时实际产出的证据。
v0.4.0 是 Si-Chip 首次在 v2_tightened(= standard)档位发版——
Round 18 + Round 19 是解锁升档的两轮连续 v2_tightened 通过。历史的 v0.1.0
证据(Round 1 + Round 2;v1_baseline;2026-04-28)作为对照基线保留在第 8 章。
// CHAPTER 01 //
1. Two consecutive v2_tightened passes (v0.4.0)
| Round | T1 pass_rate | T2 pass_k | trigger_F1 | metadata_tokens | per_invocation_footprint | wall_clock_p95 (s) | router_floor |
|---|---|---|---|---|---|---|---|
| 18 (live) | 1.0 best cell, 0.9719 mean | 1.0 best cell | 1.0 | 94 (Stage 8 frozen) | 4726 | 17.0028 | composer_2/fast |
| 19 (replay) | same (cache replay) | same (cache replay) | 1.0 | 94 | 4726 | 17.5726 | composer_2/fast |
Round 18 was the first dogfood-side evals/si-chip/runners/real_llm_runner.py
invocation against claude-haiku-4-5 + claude-sonnet-4-6 via Veil
litellm-local at 127.0.0.1:8086 — $0.20 spend, 640 calls, 16-min
wall-clock; honest k=4 sampling unblocked T2_pass_k from the deterministic
SHA-256 PROXY 0.5478 lower bound it had been stuck at since Round 1.
Round 19 replayed the cache at 100% hit ($0; ~20 ms wall-clock) for the
second consecutive v2 pass.
1. 连续两轮 v2_tightened 通过(v0.4.0)
| 轮次 | T1 pass_rate | T2 pass_k | trigger_F1 | metadata_tokens | per_invocation_footprint | wall_clock_p95(秒) | router_floor |
|---|---|---|---|---|---|---|---|
| 18(live) | 1.0 best cell,0.9719 mean | 1.0 best cell | 1.0 | 94(Stage 8 已冻结) | 4726 | 17.0028 | composer_2/fast |
| 19(replay) | 同上(cache replay) | 同上(cache replay) | 1.0 | 94 | 4726 | 17.5726 | composer_2/fast |
Round 18 是 dogfood 侧首次调用 evals/si-chip/runners/real_llm_runner.py,
通过 Veil litellm-local(127.0.0.1:8086)打到 claude-haiku-4-5 +
claude-sonnet-4-6 —— 共花 $0.20、640 次调用、16 分钟 wall-clock;honest
k=4 采样把 T2_pass_k 从 Round 1 起一直困在 SHA-256 PROXY 下界 0.5478 解放出来。
Round 19 用 100% 命中的 cache replay 完成第二轮 v2 通过($0、约 20 ms wall-clock)。
// CHAPTER 02 //
2. The 8-axis value vector (Round 19; v0.4.0)
v0.4.0 broke value_vector byte-identicality with the addition of an
8th axis eager_token_delta per the Q4 user decision (the FIRST §6.1
break since v0.1.0).
| Axis | Round 19 | Direction |
|---|---|---|
task_delta |
+0.95 | improvement vs no-ability baseline (real-LLM) |
token_delta |
+0.0 | unchanged at v0.4.0 ship |
latency_delta |
+0.0 | unchanged at v0.4.0 ship |
context_delta |
+0.0 | unchanged at v0.4.0 ship |
path_efficiency_delta |
null | not measured this round |
routing_delta |
+1.0 | improvement (full trigger_F1) |
governance_risk_delta |
0.0 | unchanged |
eager_token_delta (NEW @ v0.4.0) |
per token_tier block |
EAGER tokens / session decomposition |
Decision rule (spec §6.2): task_delta = +0.95 >= +0.10 → keep.
2. 8 维 value_vector(Round 19;v0.4.0)
v0.4.0 按 Q4 用户决策新增第 8 维 eager_token_delta,破坏了 §6.1 自 v0.1.0 以来的
字节级一致性(这是首次破坏)。
| 维度 | Round 19 | 方向 |
|---|---|---|
task_delta |
+0.95 | 相对 no-ability baseline 改善(real-LLM) |
token_delta |
+0.0 | v0.4.0 发版时未变 |
latency_delta |
+0.0 | v0.4.0 发版时未变 |
context_delta |
+0.0 | v0.4.0 发版时未变 |
path_efficiency_delta |
null | 本轮未测量 |
routing_delta |
+1.0 | 改善(trigger_F1 满分) |
governance_risk_delta |
0.0 | 未变化 |
eager_token_delta(v0.4.0 新增) |
由 token_tier 块描述 |
每会话 EAGER token 分解 |
判定规则(规范 §6.2):task_delta = +0.95 >= +0.10 → keep。
// CHAPTER 03 //
3. Real-LLM router-test sweep (8-cell MVP @ v0.4.0)
Round 18 ran the 8-cell MVP matrix via real_llm_runner.py against two real
models (and a replay vs deterministic baseline). Best-cell pass_rate hits
1.0 across all 4 trigger_basic cells; near_miss_FP_rate sits at 0.0 across
the entire matrix.
| model | thinking_depth | scenario_pack | T1 pass_rate (best) |
|---|---|---|---|
| claude-haiku-4-5 | fast | trigger_basic | 1.0 |
| claude-haiku-4-5 | fast | near_miss | 1.0 (FP=0) |
| claude-haiku-4-5 | default | trigger_basic | 1.0 |
| claude-haiku-4-5 | default | near_miss | 1.0 (FP=0) |
| claude-sonnet-4-6 | fast | trigger_basic | 1.0 |
| claude-sonnet-4-6 | fast | near_miss | 1.0 (FP=0) |
| claude-sonnet-4-6 | default | trigger_basic | 1.0 |
| claude-sonnet-4-6 | default | near_miss | 1.0 (FP=0) |
router_floor = composer_2/fast (the cheapest tuple where both packs reach
the v2_tightened pass_rate >= 0.82 hard threshold; recorded in
router_floor_report.yaml).
Si-Chip does not train router models (spec §5.2). The sweep evaluates existing
model x thinking-depthcombinations to find the cheapest cell that meets the gate, exactly as v0.1.0 did — only the harness backend swapped from deterministic SHA-256 simulation to real-LLM cache.
3. Real-LLM router-test 矩阵(8-cell MVP @ v0.4.0)
Round 18 用 real_llm_runner.py 跑了 8-cell MVP 矩阵,对接两个真实模型
(外加 replay 与确定性 baseline)。最佳 cell 的 pass_rate 在 4 个 trigger_basic
cell 上都达到 1.0;near_miss_FP_rate 在整个矩阵上都是 0.0。
| model | thinking_depth | scenario_pack | T1 pass_rate(best) |
|---|---|---|---|
| claude-haiku-4-5 | fast | trigger_basic | 1.0 |
| claude-haiku-4-5 | fast | near_miss | 1.0(FP=0) |
| claude-haiku-4-5 | default | trigger_basic | 1.0 |
| claude-haiku-4-5 | default | near_miss | 1.0(FP=0) |
| claude-sonnet-4-6 | fast | trigger_basic | 1.0 |
| claude-sonnet-4-6 | fast | near_miss | 1.0(FP=0) |
| claude-sonnet-4-6 | default | trigger_basic | 1.0 |
| claude-sonnet-4-6 | default | near_miss | 1.0(FP=0) |
router_floor = composer_2/fast(两个 scenario_pack 均达到 v2_tightened
pass_rate >= 0.82 硬门槛的最便宜组合;记录在 router_floor_report.yaml)。
Si-Chip 不训练 router 模型(规范 §5.2)。该扫描评估的是既有
model × thinking-depth组合,目的是找到达到 gate 阈值的最便宜单元——与 v0.1.0 的做法一致,只是 harness 后端从确定性 SHA-256 模拟换成 real-LLM cache。
// CHAPTER 04 //
4. Cross-platform sync (drift = 0; v0.4.0)
| Tree | Files | SHA-of-SKILL.md | Drift |
|---|---|---|---|
Source .agents/skills/si-chip/ |
21 (incl. DESIGN.md) |
identical | n/a (canonical) |
Mirror .cursor/skills/si-chip/ |
20 (no DESIGN.md) |
identical | DRIFT_ZERO |
Mirror .claude/skills/si-chip/ |
20 (no DESIGN.md) |
identical | DRIFT_ZERO |
Tarball docs/skills/si-chip-0.4.0.tar.gz |
21 (incl. DESIGN.md) |
identical (extracted) | reproducible |
Three-tree summary verdict: ALL_TREES_DRIFT_ZERO. Tarball SHA-256
2cfcce00f989faf2467014e638b0ea1fa67870b5a1ee6b0531942be5a4be21ab
(83060 bytes; deterministic options pinned via --mtime
'2026-04-30 00:00:00 UTC').
4. 跨平台同步(drift = 0;v0.4.0)
| 目录树 | 文件数 | SHA-of-SKILL.md | Drift |
|---|---|---|---|
源头 .agents/skills/si-chip/ |
21(含 DESIGN.md) |
identical | n/a(canonical) |
镜像 .cursor/skills/si-chip/ |
20(无 DESIGN.md) |
identical | DRIFT_ZERO |
镜像 .claude/skills/si-chip/ |
20(无 DESIGN.md) |
identical | DRIFT_ZERO |
Tarball docs/skills/si-chip-0.4.0.tar.gz |
21(含 DESIGN.md) |
identical(解压后) | reproducible |
三树汇总判定:ALL_TREES_DRIFT_ZERO。Tarball SHA-256
2cfcce00f989faf2467014e638b0ea1fa67870b5a1ee6b0531942be5a4be21ab
(83060 字节;确定性参数固定,--mtime '2026-04-30 00:00:00 UTC')。
// CHAPTER 05 //
5. The 14 spec invariants (v0.4.0)
The default python tools/spec_validator.py --json run reports verdict
PASS at v0.4.0 (9 historical + 1 REACTIVATION_DETECTOR_EXISTS +
2 v0.3.0 additive + 3 v0.4.0 additive = 14 BLOCKERs):
[BLOCKER] BAP_SCHEMA: PASS (§2.1)
[BLOCKER] R6_KEYS: PASS (37 keys per §3.1; ignores method-tag suffixes)
[BLOCKER] THRESHOLD_TABLE: PASS (30 cells per §4.1)
[BLOCKER] ROUTER_MATRIX_CELLS: PASS (mvp=8, intermediate=16, full=96)
[BLOCKER] VALUE_VECTOR_AXES: PASS (version-aware: 7 ≤ v0.3.0; 8 @ v0.4.0+)
[BLOCKER] PLATFORM_PRIORITY: PASS (Cursor -> Claude Code -> Codex)
[BLOCKER] DOGFOOD_PROTOCOL: PASS (8 steps + 6 evidence files; 7 when round_kind=ship_prep)
[BLOCKER] FOREVER_OUT_LIST: PASS (4 items; re-affirmed in §14.6/§18.7/§19.6/§20.6/§21.6/§22.7/§23.7)
[BLOCKER] REACTIVATION_DETECTOR_EXISTS: PASS (all 6 §6.4 trigger ids)
[BLOCKER] CORE_GOAL_FIELD_PRESENT: PASS (§14, v0.3.0)
[BLOCKER] ROUND_KIND_TEMPLATE_VALID: PASS (§15, v0.3.0)
[BLOCKER] TOKEN_TIER_DECLARED_WHEN_REPORTED: PASS (§18, v0.4.0)
[BLOCKER] REAL_DATA_FIXTURE_PROVENANCE: PASS (§19, v0.4.0)
[BLOCKER] HEALTH_SMOKE_DECLARED_WHEN_LIVE_BACKEND: PASS (§21, v0.4.0)
verdict: PASS
The --strict-prose-count mode passes against any v0.2.0 / v0.3.0 / v0.4.0
spec (since the §13.4 prose was reconciled to 37 / 30 at v0.2.0); it
intentionally fails on R6_KEYS and THRESHOLD_TABLE against the
historical spec_v0.1.0.md (28 / 21 prose) — both verdicts are pinned in
the v0.1.0 ship report under §13.4 for regression purposes.
5. 14 项规范不变量(v0.4.0)
默认的 python tools/spec_validator.py --json 运行结果在 v0.4.0 为 verdict
PASS(9 条历史 + 1 条 REACTIVATION_DETECTOR_EXISTS + v0.3.0 新增 2 条 +
v0.4.0 新增 3 条 = 14 BLOCKER):
[BLOCKER] BAP_SCHEMA: PASS (§2.1)
[BLOCKER] R6_KEYS: PASS (按 §3.1 共 37 key;忽略 method-tag 后缀)
[BLOCKER] THRESHOLD_TABLE: PASS (按 §4.1 共 30 cell)
[BLOCKER] ROUTER_MATRIX_CELLS: PASS (mvp=8, intermediate=16, full=96)
[BLOCKER] VALUE_VECTOR_AXES: PASS (版本敏感:v0.3.0 及以下 7 维;v0.4.0+ 8 维)
[BLOCKER] PLATFORM_PRIORITY: PASS (Cursor -> Claude Code -> Codex)
[BLOCKER] DOGFOOD_PROTOCOL: PASS (8 步 + 6 件证据;ship_prep 轮次 7 件)
[BLOCKER] FOREVER_OUT_LIST: PASS (4 项;§14.6/§18.7/§19.6/§20.6/§21.6/§22.7/§23.7 复申)
[BLOCKER] REACTIVATION_DETECTOR_EXISTS: PASS (§6.4 全部 6 个 trigger id)
[BLOCKER] CORE_GOAL_FIELD_PRESENT: PASS (§14,v0.3.0)
[BLOCKER] ROUND_KIND_TEMPLATE_VALID: PASS (§15,v0.3.0)
[BLOCKER] TOKEN_TIER_DECLARED_WHEN_REPORTED: PASS (§18,v0.4.0)
[BLOCKER] REAL_DATA_FIXTURE_PROVENANCE: PASS (§19,v0.4.0)
[BLOCKER] HEALTH_SMOKE_DECLARED_WHEN_LIVE_BACKEND: PASS (§21,v0.4.0)
verdict: PASS
--strict-prose-count 模式在 v0.2.0 / v0.3.0 / v0.4.0 任一 spec 上都 PASS
(§13.4 散文已在 v0.2.0 对齐到 37 / 30);只在历史 spec_v0.1.0.md(28 / 21
散文)上故意失败 R6_KEYS 与 THRESHOLD_TABLE——两个判定都已固定记录在
v0.1.0 ship report 的 §13.4,用于回归。
// CHAPTER 06 //
6. Reproduce the numbers locally
git clone https://github.com/YoRHa-Agents/Si-Chip.git
cd Si-Chip
# 1. Spec validator (14 invariants — verdict PASS)
python tools/spec_validator.py --json
# 2. Re-aggregate the included simulated baseline runs
python .agents/skills/si-chip/scripts/aggregate_eval.py \
--runs-dir evals/si-chip/baselines/with_si_chip \
--baseline-dir evals/si-chip/baselines/no_ability \
--skill-md .agents/skills/si-chip/SKILL.md \
--templates-dir templates \
--out /tmp/metrics_report.yaml
# 3. Confirm the v2_tightened packaging gate (metadata=94, body=4646, pass=true)
python .agents/skills/si-chip/scripts/count_tokens.py \
--file .agents/skills/si-chip/SKILL.md --both \
--budget-meta 100 --budget-body 5000 --json
# 4. (optional) Replay the Round 18 real-LLM cache at $0
# See .agents/skills/si-chip/scripts/real_llm_runner_quickstart.md
python evals/si-chip/runners/real_llm_runner.py --help
6. 在本地复现这些数字
git clone https://github.com/YoRHa-Agents/Si-Chip.git
cd Si-Chip
# 1. Spec validator (14 invariants — verdict PASS)
python tools/spec_validator.py --json
# 2. Re-aggregate the included simulated baseline runs
python .agents/skills/si-chip/scripts/aggregate_eval.py \
--runs-dir evals/si-chip/baselines/with_si_chip \
--baseline-dir evals/si-chip/baselines/no_ability \
--skill-md .agents/skills/si-chip/SKILL.md \
--templates-dir templates \
--out /tmp/metrics_report.yaml
# 3. Confirm v2_tightened packaging gate (metadata=94, body=4646, pass=true)
python .agents/skills/si-chip/scripts/count_tokens.py \
--file .agents/skills/si-chip/SKILL.md --both \
--budget-meta 100 --budget-body 5000 --json
# 4. (可选) 用 $0 回放 Round 18 real-LLM 缓存
# 详见 .agents/skills/si-chip/scripts/real_llm_runner_quickstart.md
python evals/si-chip/runners/real_llm_runner.py --help
// CHAPTER 07 //
7. What ships next (deferred from v0.4.0)
v3_strictpromotion (2 fresh rounds at v3 thresholds; single blocker ismetadata_tokens = 94vs<= 80).- Codex
.codex/native runtime (still bridge-only at v0.4.0 per spec §11.2; re-evaluable afterv3_strictis earned). - Broader IDE coverage (OpenCode / Copilot CLI / Gemini CLI; spec §11.2 deferred).
- Multi-tenant hosted API surface (spec §11.2 deferred).
See CHANGELOG.md
for the full v0.4.0 + v0.4.1 release notes (v0.4.1 is the doc-only patch
that produced this Pages tree).
7. 后续交付(从 v0.4.0 延后)
- 升档至
v3_strict(在 v3 阈值下完成 2 轮新的 dogfood;唯一阻塞项是metadata_tokens = 94vs<= 80)。 - Codex
.codex/原生 runtime(v0.4.0 仍按规范 §11.2 仅 bridge;待v3_strict达成后重新评估)。 - 更广义 IDE(OpenCode / Copilot CLI / Gemini CLI;规范 §11.2 延后)。
- Multi-tenant hosted API 表面(规范 §11.2 延后)。
完整的 v0.4.0 + v0.4.1 发版说明见
CHANGELOG.md
(v0.4.1 是产生本 Pages 树的纯文档 patch)。
// CHAPTER 08 //
8. Historical baseline — v0.1.0 ship (2026-04-28; v1_baseline)
The original Si-Chip ship was at v1_baseline with deterministic SHA-256
PROXY runners. Preserved here for regression / reproducibility.
| Round | pass_rate | trigger_F1 | metadata_tokens | per_invocation_footprint | wall_clock_p95 (s) | half_retire | router_floor |
|---|---|---|---|---|---|---|---|
| 1 | 0.85 | 0.89 | 78 | 4071 | 1.47 | keep | composer_2/default |
| 2 | 0.85 | 0.89 | 78 | 3598 (-11.6%) | 1.47 | keep | composer_2/default |
Round 2 also populated R4_near_miss_FP_rate = 0.05, slimming SKILL.md
body tokens from 2493 to 2020 (-18.97 %). The 7-axis value vector for
Round 1 is preserved below — note task_delta = +0.35 >= +0.10 triggers
the keep rule per spec §6.2:
value_vector: # v0.1.0 — v0.3.0 used 7 axes
task_delta: 0.35
token_delta: -1.71
latency_delta: -0.55
context_delta: -1.71
path_efficiency_delta: null
routing_delta: 0.8934
governance_risk_delta: 0.0
8. 历史基线 —— v0.1.0 发版(2026-04-28;v1_baseline)
最早的 Si-Chip 发版是在 v1_baseline 档位、使用确定性 SHA-256 PROXY runner。
此处保留以便回归 / 复现。
| 轮次 | pass_rate | trigger_F1 | metadata_tokens | per_invocation_footprint | wall_clock_p95(秒) | half_retire | router_floor |
|---|---|---|---|---|---|---|---|
| 1 | 0.85 | 0.89 | 78 | 4071 | 1.47 | keep | composer_2/default |
| 2 | 0.85 | 0.89 | 78 | 3598 (-11.6%) | 1.47 | keep | composer_2/default |
Round 2 同时填充了 R4_near_miss_FP_rate = 0.05,并将 SKILL.md 正文 token
数从 2493 压缩到 2020(-18.97 %)。Round 1 的 7 维 value vector 见下——按规范
§6.2,task_delta = +0.35 >= +0.10 触发 keep 规则:
value_vector: # v0.1.0 — v0.3.0 都是 7 维
task_delta: 0.35
token_delta: -1.71
latency_delta: -0.55
context_delta: -1.71
path_efficiency_delta: null
routing_delta: 0.8934
governance_risk_delta: 0.0