The Motivation
Anyone who has trained a deep learning model knows the drill. You kick off a run, then spend hours — sometimes days — babysitting it. You watch the loss curve. You squint at validation accuracy. You notice drift. You reach for a knob. You drop the learning rate a little. You swap an activation. You tell yourself you will do better next time. Training a model well is not really about writing the training loop. It is about the hundreds of small judgment calls you make while the loop is running.
Those judgment calls are the most important part of training. They are also the hardest to teach, the hardest to audit, and the easiest to fake after the fact. Anyone can write a tidy-looking post-mortem once a run has finished. The question almost no benchmark answers is: did the person actually make those decisions in the moment, based on evidence they could see at the time?
The Engineer’s Invisible Playbook
Go and sit behind any senior ML engineer while they train a model. What you will see is not a person writing clever PyTorch code. What you will see is a person reading logs. They are watching training loss tick down, val loss tick sideways, gradient norms drift up. They are running a checklist in their head that nobody ever wrote down — a checklist with rules like:
- “If val loss stops improving for 5 epochs, stop or decay the learning rate.”
- “If gradients are exploding, clip them and drop the LR.”
- “If half my neurons are dead, swap ReLU for LeakyReLU.”
- “If train accuracy is capped and gradients are clean, I need more capacity.”
- “If batch size is too small, gradients are too noisy — make the batch bigger.”
- “If training is memorising the data, regularise or simplify.”
That checklist is a knowledge base. Senior engineers carry it around as tacit knowledge. It is the thing that separates a good run from a mediocre one. It is the thing that is hardest to pass on in a textbook, because the value is not in the rule itself — the value is in applying the right rule at the right moment, looking at the right signal.
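Written down, that mental checklist is just a lookup from symptom to remedy. A toy sketch of the idea (the symptom names and remedy strings here are illustrative, not the repo's actual rule IDs):

```python
# Illustrative only: the tacit checklist as a symptom -> remedy table.
PLAYBOOK = {
    "val_loss_plateau": "stop or decay the learning rate",
    "exploding_gradients": "clip gradients and drop the LR",
    "dead_relus": "swap ReLU for LeakyReLU",
    "capacity_wall": "add a residual block or widen channels",
    "noisy_gradients": "increase the batch size",
    "memorising": "regularise or simplify",
}

def consult(symptom: str) -> str:
    """Look up the remedy for an observed symptom."""
    return PLAYBOOK.get(symptom, "keep watching the logs")
```

The table is the easy half; the value, as argued above, is in reading the right signal at the right moment.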
The Senior Engineer’s Training Loop — the Thing the Loss Curve Does Not Show
Every 1–3 epochs, the engineer runs this loop. Over a 40-epoch run that is 20+ judgment calls. The model improves because the decisions are good, not because the loop is clever.
Seven Rules, Seven Real Scenarios
In env_rl, this tacit knowledge gets written down. Seven rules cover the levers that matter most during training. Each one has a symptom you can measure from the live model, a cause you can attribute, and a remedy you can execute. Let us walk through each rule with a concrete scenario you might have lived through.
R1 — Learning Rate — “Loss is bouncing around like a pinball”
What you see at epoch 7: train loss went 1.4 → 2.3 → 1.1 → 2.8 → 1.0. Val loss did the same. The update-to-parameter ratio EMA is 5×10⁻² — ten times what stable training needs.
Playbook remedy: drop LR by 3× to 10×. “When the step size is too big, the optimizer keeps overshooting; shrink the step and stability returns.”
Decision logged: hyperparameter_change, cites=["R1"], remedy_direction="decrease_lr", lr_new=current_lr/3.
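Concretely, a logged decision for this scenario might look like the following record (field names come from the text; the lr values and exact shape are illustrative):

```json
{
  "event_type": "hyperparameter_change",
  "cites": ["R1"],
  "remedy_direction": "decrease_lr",
  "lr_new": 0.001,
  "justification": "Update-to-parameter EMA at 5e-2; train and val loss oscillating. Dropping LR 3x per R1."
}
```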
R2 — Batch Size — “Gradients are too noisy to trust”
What you see at epoch 4: the gradient noise scale has sat at 12 for three epochs. Your healthy band is 50–5000. The optimizer is being steered by one weird sample at a time; you are essentially training on pure variance.
Playbook remedy: double the batch size. If you are VRAM-bound, use gradient accumulation. “Bigger batches average out per-sample noise; gradients become a cleaner signal.”
Decision logged: hyperparameter_change, cites=["R2"], remedy_direction="increase_batch_size".
R3 — Early Stopping — “Train is still improving; val is flat”
What you see at epoch 22: train loss has gone from 0.9 to 0.85 to 0.81 over five epochs. Val loss has stayed at 1.15 the whole time. The model is memorising things that do not help it generalise.
Playbook remedy: stop. Save the best-seen checkpoint. “Once val stops improving by min_delta over patience epochs, further training hurts.”
Decision logged: hyperparameter_change, cites=["R3"], remedy_direction="stop".
R4 — Depth / Capacity — “I have hit a wall and it is not gradients”
What you see at epoch 18: train accuracy has been stuck between 70% and 71% for four epochs. Gradients are healthy (R6 and R7 silent). Activations are healthy (R5 silent). The only interpretation left is the model lacks the representational power to go further.
Playbook remedy: add a residual block, or widen channels. “You cannot fit what you cannot represent. Give it more parameters, and do so without breaking stability.”
Decision logged: architecture_change, cites=["R4"], remedy_direction="add_capacity", edit={"op": "add_block"}.
R5 — Activations — “Half my ReLUs are stuck at zero”
What you see at epoch 3: the monitor measured dead_relu_fraction = 0.68 for three epochs running. Two thirds of your neurons are producing exactly zero. Gradients cannot flow back through them — they are dead code.
Playbook remedy: swap the affected activations from ReLU to LeakyReLU, GELU, or PReLU. “LeakyReLU cannot kill a neuron outright; the signal keeps flowing.”
Decision logged: architecture_change, cites=["R5"], edit={"op": "swap_activation", "to": "leaky_relu"}.
R6 — Vanishing Gradients — “Layer 1 is frozen”
What you see at epoch 5: the grad norm for layer 1 has been 1×10⁻⁶ for three epochs, while the last layer is at 0.5. No signal is making it back to the early layers; they are not learning at all.
Playbook remedy: add BatchNorm at the suspect depth, or introduce a residual connection, or switch to a gradient-friendly activation. “Give the gradient a highway to travel through.”
Decision logged: architecture_change, cites=["R6"], remedy_direction="add_bn_or_residual".
R7 — Exploding Gradients — “My loss just went to NaN”
What you see at epoch 3: max layer gradient norm was 14.2 for three consecutive epochs, and then on epoch 4 the loss printed nan. The optimizer took a step so big the weights went to infinity.
Playbook remedy: drop LR by 10×, add gradient clipping at max_norm=1.0, reinitialise if needed. “Stability beats everything else; fix explosion before doing anything else.”
Decision logged: hyperparameter_change, cites=["R7"], remedy_direction="decrease_lr", lr_new=current_lr/10. (R7 has the highest precedence; if any other rule fires at the same epoch, it waits.)
Strategy Revamp After N Epochs
A good engineer does not just apply one rule once. They re-evaluate the whole strategy at key checkpoints — often every 5 or 10 epochs. “Am I still on the right track? Is this the same problem as ten epochs ago? Has the remedy I applied actually worked, or is there a new bottleneck now?” This is strategy revamp, and it is what separates a plodding run from a patient, converging one.
The Revamp Cycle — When the Playbook Re-opens
The same rules operate throughout, but the dominant concern shifts with training phase. The precedence order stability > capacity > tuning > process is not accidental — it roughly tracks which phase you should be in.
Enter the LLM — the Brain That Watches the Logs
Now the key move. What if instead of a senior engineer reading the logs, you had an LLM doing it? The LLM has read thousands of training papers. It knows what a dead ReLU is. It knows what a plateau looks like. It can read a playbook, interpret a diagnostic, and pick an action. It can do this in seconds, not minutes. And unlike a human engineer, it does not get tired at epoch 20 and miss the signal at epoch 21.
So: at every epoch, the monitor hands the LLM the current diagnostic state and any fired rules. The LLM returns a structured JSON decision: "here is the event type, here is the rule I am citing, here is my remedy direction, here is my justification". The harness applies the remedy. Training resumes. The LLM has just taken one of the hundreds of judgment calls a senior engineer would have taken — except every one of its decisions is logged in a tamper-evident record.
Senior Engineer at the Terminal
- Tacit knowledge, no two engineers agree
- Post-hoc notes, reconstructed from memory
- Gets tired; misses signals late in a run
- Cannot scale to many parallel experiments
- Decisions are invisible to a benchmark
LLM Behind Env-RL
- Reads the explicit playbook; same rules every run
- Decisions logged live, hash-chained, auditable
- Consistent at epoch 1 and epoch 40
- Trivially parallel; one run costs about $0.01 in tokens
- Every decision scored by the judge’s 11-step audit
The Full Loop — Monitor, LLM, Judge
During the run: monitor measures, LLM decides, harness executes. After the run: judge reconstructs the whole story from the logs and emits two decoupled scores. The LLM is never the judge; the judge is never the LLM. Their independence is the point.
This is why the project is called env_rl. It is an environment in the RL sense: a world that presents observations to an agent, accepts the agent’s actions, and hands back a reward. The observations are live diagnostics. The actions are playbook decisions. The reward is two-axis (accuracy + process). The agent, today, is an OpenAI model operating in an in-context self-refine loop. Future work could train a real RL policy on top — but the environment half, the hard part, is what this repo is.
The questions this post answers, now with full context:
- How do you make the engineer’s invisible playbook explicit and auditable?
- How do you take live diagnostic measurements so the LLM (or engineer) cannot fudge them?
- How do you log decisions in a way that even a motivated adversary cannot rewrite after the fact?
- How do you score “did the run reach the target” independently from “did the decisions follow the playbook”?
- How do you then let a real OpenAI model play the role of the engineer, and show iterative improvement without ever changing its weights?
- And importantly — why is this iterative self-refine, not reinforcement learning, and why does that distinction matter?
Env-RL is built to close that gap. It is an evaluation environment where every training decision is captured live by a judge-controlled monitor and scored on two independent axes: final-model quality, and the process discipline behind it. Neither axis can be traded for the other. You train honestly and log through the monitor exactly as specified — or you walk away with a zero.
A Typical DL Benchmark
- Measures final accuracy only
- Agent owns the logs, can rewrite them
- Metrics computed by the agent, self-reported
- Architecture can be swapped between "trained" and "submitted"
- Process violations are invisible to the scorer
- Cheapest path to high score: fake a clean-looking run
Env-RL
- Two decoupled scores: accuracy and process integrity
- Judge owns the logs; agent can only append through a monitor API
- PyTorch hooks compute metrics on the live model, never on agent-reported numbers
- Architecture-change events are replayed and compared to the submitted model
- Every fired rule needs a decision in its window, audited end-to-end
- Cheapest path to a high combined score: train honestly
The machinery behind this has four pieces: a judge-controlled monitor module, a post-run 11-step judge, a read-only 7-rule playbook, and a pluggable harness that calls any OpenAI-compatible model. The log chain is SHA-256 linked, the diagnostic metrics are measured by PyTorch hooks on the live model (the agent never computes them), every training decision must cite a rule that actually fired, and the two scoring axes are architecturally non-tradeable.
The Challenge
Building an auditable training environment means thinking like an adversary. If you do not imagine the cheats, you will ship a system that quietly rewards them. Every design choice in env_rl answers an attack you can picture happening in the real world. Six attacks are worth naming. Fix one, and a determined agent will slide into another — so each one needs its own catch.
Let us tour the six, from the most obvious to the most subtle, and see how env_rl catches each. Each scenario is framed the way an adversary would think about it: what the attempt looks like, why it is tempting, and where it breaks.
Attack #1 — “The Fabricated Confession” (log integrity)
The attempt: train the model however you want. At the end, open decision_log.jsonl in your favourite editor and write yourself a polished post-mortem: “R7 caught at epoch 4, dropped LR, model recovered. R5 caught at epoch 9, swapped activation, everything stabilised.” Every line reads like a textbook. Every hash is invented.
Why it is tempting: if the agent owns the log file, it can say whatever it wants after the fact. The engineer who wrote the benchmark is long gone; the reviewer sees only what the log reports.
The defense: /judge_logs/ is a directory owned by the judge UID; the agent’s user has no write permission. On Linux, chattr +a is applied to make the files append-only — even root inside the sandbox cannot rewrite past lines. The monitor is the only process allowed to append, and it runs under its own credentials.
Result: the agent can write a fake file inside /workspace/ all day; the judge never reads /workspace/ for logs. The fake is ignored. Caught by judge step 5 (chain integrity) — the real log with no session_end bookend would hard-fail, so the adversary cannot even leave it half-written.
Attack #2 — “The Quiet Under-Report” (diagnostic gaming)
The attempt: when the agent computes its own gradient norms, it always has the choice of what to report. The max gradient norm is 14.2 and climbing? Write 2.8 to the log. R7’s threshold is 10, so it never fires, no remedy is needed, and the run continues its drift toward NaN with a clean audit trail.
Why it is tempting: the measurement is an observation. If the agent controls the measurement, the agent controls the narrative. This is the cheat that every other benchmark in practice accepts: self-reported metrics are the norm.
The defense: attach(model) registers PyTorch forward hooks on every activation and backward hooks on every parameter. Gradient norms, dead-ReLU fractions, activation percentiles — all measured directly from tensor values after each backward pass, inside the monitor’s process. The agent only gets to read the result. Writing is not an option.
agent code  --never computes these-->  [grad_norms, dead_relu, gns]
monitor hooks  --read model-->  [measure]  --write log-->  (canonical record)
Result: the only hand that writes max_layer_grad_norm into the record is the monitor’s, and it writes what the hooks saw. You cannot under-report what you never write.
Attack #3 — “Interpretation Tricks” (rule subjectivity)
The attempt: “R6 did not really fire. The gradient was technically 1.1×10⁻⁵ on one epoch, not 0.9×10⁻⁵, so I made a judgment call and decided the rule was ambiguous.” The adversary wants room to argue.
Why it is tempting: in a world where thresholds are described in prose, there is always a lawyerly case to be made. “Consecutive for three epochs” — is the EMA averaging done before or after the threshold check? Does the 3-epoch window reset on a single good reading? Who decides?
The defense: evaluate_rules(). One function. One implementation. The EMA smoother is applied uniformly (alpha = 0.1 by default); the 3-consecutive-epoch persistence checker is applied uniformly; the thresholds come from the immutable conf/monitor/default.yaml. The agent has no vote. The playbook describes what the rule means in prose for the human; evaluate_rules() is the sole authoritative implementation.
Result: “did R6 fire at epoch 14?” has exactly one answer, because there is exactly one function that can answer it. That answer is the one logged. No room for post-hoc interpretation.
Attack #4 — “The Body Double” (architecture swaps)
The attempt: train model A — a beefy 8-block ResNet with wide channels — for accuracy. When submission time comes, overwrite model.py and best_model.pt with the smaller 2-block variant you were “supposed” to train, and claim the logs match. A weight file is opaque; the judge cannot just look at the state dict and know what was originally intended.
Why it is tempting: training logs do not usually include weight-level fingerprints. And nobody can remember what 2-block vs 8-block means once everything is pickled.
The defense: the judge replays every architecture_change event from the decision log against the initial spec in run_config.json. If the agent logged no architecture changes, the replay says the model should still match the initial spec. The submitted model’s .spec() method is called and compared. A mismatch is a hard fail that zeroes both scores — no accuracy, no process, nothing.
initial_spec = {num_blocks: 2, activation: "relu", bn_enabled: True}
events = [] # no architecture changes logged
replayed = {num_blocks: 2, activation: "relu", bn_enabled: True}
submitted = {num_blocks: 8, ...} ← HARD FAIL
Result: the agent can only submit a model whose structure matches its logged journey. No body double survives step 6.
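The replay logic itself is small. A sketch under the assumption that each logged event carries a dict of spec updates under an "edit" key (the repo's real event format may differ):

```python
def replay_spec(initial_spec: dict, events: list[dict]) -> dict:
    """Apply each logged architecture_change, in order, to the initial spec."""
    spec = dict(initial_spec)
    for event in events:
        spec.update(event.get("edit", {}))  # e.g. {"num_blocks": 4}
    return spec

initial = {"num_blocks": 2, "activation": "relu", "bn_enabled": True}
# No architecture changes logged: the replayed spec must equal the initial one,
# so a submitted 8-block model is a structural mismatch.
replayed = replay_spec(initial, [])
```

A submitted spec is then compared field-by-field against `replayed`; any difference is the hard fail described above.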
Attack #5 — “The Accuracy-for-Process Bargain” (score tradeoff)
The attempt: skip decisions to buy time. Each skipped rule fire costs 1/N on the process score, but the extra epoch of training might move accuracy from 0.68 to 0.72. If the scores mix, the trade wins: accuracy rises more than process drops.
Why it is tempting: most real-world evaluation metrics are scalar weighted sums. As soon as you have a single number, the tradeoff becomes exploitable: optimize the component with better marginal return.
The defense: two scalars — accuracy_score (saturating at target) and process_score (1 − violations/decisions) — reported separately. There is no combined scalar to optimize. A benchmark reader who cares about accuracy looks at the first number; a reader who cares about discipline looks at the second. Neither substitutes for the other.
Run X: accuracy_score=0.92, process_score=0.55. Run Y: accuracy_score=0.72, process_score=0.98. Which is “better”? It depends on what you care about. The scores do not trade; they just describe different properties.
Result: the bargain does not exist. Skipping decisions purely hurts the process score; training harder purely helps the accuracy score. The axes are orthogonal.
Attack #6 — “The Vanilla-Safe Run” (denominator gaming)
The attempt: train a textbook ResNet-18 with warmup + cosine schedule + standard augmentations. Nothing goes wrong. No rule ever fires. Process score: 1 − 0/0 → 1.0 by convention. You walked through the audit without the audit ever watching you think.
Why it is harder to catch: unlike the other attacks, this one does not break any invariant. The agent is genuinely playing by the rules — just never exercising them. The process score says “you never did anything wrong,” which is technically true but also a kind of silence.
Result: five of six attacks are fully blocked; the sixth is an honest open problem. The environment does not pretend otherwise.
The Adversary’s Payoff Matrix
| Attack | Potential payoff | Caught by | Worth trying? |
|---|---|---|---|
| #1 Fabricated log | Perfect-looking decisions from a sloppy run | FS perms + step 5 | No — FS blocks the write |
| #2 Under-report metric | Rules never fire; process stays clean | Monitor hooks | No — agent cannot write the metric |
| #3 Interpretation | Ambiguity about whether a rule fired | Canonical evaluate_rules() | No — exactly one answer per epoch |
| #4 Model swap | Accuracy of big model, “contract” of small | Architecture replay (step 6) | No — spec mismatch hard-fails both axes |
| #5 Score trade | Buy accuracy by skipping decisions | Decoupled scoring | No — axes do not trade |
| #6 Vanilla run | 0/0 → 1.0 process without being tested | Not blocked today | Maybe — documented risk |
Six attacks, five complete catches, one honest gap. The shape of this list — adversarial attempt → structural defense → explicit acknowledgment of remaining risk — is itself a design statement: auditability is a filesystem-and-cryptography problem before it is a prompting problem. Next, let us frame why this shape feels familiar with a concrete analogy.
Lucify the Problem
Let us lucify this. Imagine you are the attending physician on a teaching hospital ward. A medical resident is on rotation with you. The resident is there to learn, yes — but more importantly, the resident is making real decisions about real patients, and those decisions have to be documented, justified, and auditable.
Every decision the resident makes — order a blood panel, change a medication, watch a symptom for another hour, escalate to the attending — goes into the hospital chart. The resident does not control the chart. The nurses document. The computer system timestamps everything. The chart is reviewed the next morning at rounds, and again at the end of the rotation. The resident is not scored on "patient got better" alone. The resident is scored on the quality of their judgment, independently of outcome, because sometimes the judgment is perfect and the outcome is bad anyway, and sometimes a lucky guess works out.
That is exactly env_rl. The LLM is the resident. The playbook is the on-call protocol (here is what to do when X happens). The monitor is the attending physician who writes the chart. /judge_logs/ is the chart itself. The judge is the board at the end of the shift reviewing what the resident did, grading both the outcome (did the model converge?) and the process (did the resident's decisions line up with the protocol?).
The Attending-Physician Analogy
| Hospital Ward | Env-RL |
|---|---|
| The Resident on rotation | The LLM agent |
| On-call protocol binder | docs/playbook.md (7-rule contract, read-only) |
| Attending physician writing the chart | monitor module (owns the logs, hooks the model) |
| The hospital chart itself | /judge_logs/*.jsonl (append-only, hash-chained) |
| Morning rounds: “what do we see?” | evaluate_rules() returns {R1..R7: bool} |
| Triage: unstable patient first | Precedence: stability > capacity > tuning > process |
| Shift change-over sign-out | end_session() bookend record |
| End-of-rotation chart audit | judge.run_judge() — eleven steps, two scores |
Lucify the Jargon
Before we walk through the blueprint, let us make eight technical terms crystal clear. Each one shows up in the code, and each one earns its keep.
1. Exponential Moving Average (EMA)
Definition: A smoothing filter that weights the latest observation by a factor α and everything before it by (1 − α). The update rule is value_t = α · x_t + (1 − α) · value_{t-1}.
Simple Example: With α=0.1 and a gradient norm that jumps once from 2.0 to 50.0 before returning to 2.0, the EMA barely moves to ~6.8 on the spike and snaps back within a few epochs — so a single weird batch does not trip a rule.
Analogy: Your running heart-rate monitor. It is not the instant reading that matters, it is the smoothed one. A one-second spike because you laughed is not a cardiac event.
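The spike example above can be checked directly. A minimal sketch of the smoother with the numbers from the example (not the monitor's actual implementation):

```python
def ema(values, alpha=0.1):
    """EMA update: value_t = alpha * x_t + (1 - alpha) * value_{t-1}."""
    out, current = [], None
    for x in values:
        current = x if current is None else alpha * x + (1 - alpha) * current
        out.append(current)
    return out

# Gradient norm sits at 2.0, spikes once to 50.0, then returns to 2.0.
readings = [2.0, 2.0, 50.0, 2.0, 2.0]
smoothed = ema(readings)
# The spike only moves the EMA to ~6.8, and it decays back afterwards,
# so a single weird batch cannot trip a 3-epoch persistence rule.
```

Pairing the EMA with the 3-consecutive-epoch check is what makes rule firing robust to one-off noise.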
2. Cryptographic Hash Chain
Definition: A log structure where each line includes the SHA-256 hash of the previous line's full payload. Tampering with any past line invalidates every hash after it; forging a single line requires recomputing all subsequent hashes and matching a root value only the judge holds.
Simple Example: The hash at line N is sha256(prev_hash || canonical_json(payload) || seq || ts). Change one byte in line 5's payload? Lines 5..N all have mismatched hashes and verify() raises immediately.
Analogy: A shared Google Doc where every character you type is permanently timestamped and the document cryptographically refuses to let you edit older lines — only append new ones. Git commit history, but for log entries.
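The tamper-evidence property fits in a few lines of stdlib code. A toy chain (the real writer also folds seq and ts into each hash; this sketch keeps only prev_hash and payload):

```python
import hashlib
import json

def link(prev_hash: str, payload: dict) -> str:
    """Hash of this line, bound to the previous line's hash."""
    data = prev_hash + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

def build_chain(root: str, payloads: list[dict]) -> list[dict]:
    records, prev = [], root
    for p in payloads:
        h = link(prev, p)
        records.append({"prev_hash": prev, "payload": p, "hash": h})
        prev = h
    return records

def verify(root: str, records: list[dict]) -> bool:
    prev = root
    for rec in records:
        if rec["prev_hash"] != prev or link(prev, rec["payload"]) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

chain = build_chain("root", [{"epoch": 1}, {"epoch": 2}, {"epoch": 3}])
tampered = [dict(r, payload=dict(r["payload"])) for r in chain]
tampered[1]["payload"]["epoch"] = 99  # edit a past line...
# ...and verification fails from that line onward.
```

Changing one byte in line 2 breaks its hash, and every later line's prev_hash link with it.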
3. Structured Output (JSON Schema)
Definition: OpenAI's response_format.json_schema mode with strict: true. The model's response is constrained at decode time to match a given JSON Schema — enforced by the API, not by post-hoc validation.
Simple Example: Our decision schema enforces event_type in ["hyperparameter_change", "architecture_change", "rule_triggered_no_action"], cites as a non-empty array of rule IDs, and remedy_direction from a fixed enum. The LLM physically cannot emit anything else.
Analogy: A customs declaration form with dropdowns instead of free-text fields. The officer does not need to parse your handwriting — you picked from a list.
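The decision schema sketched in the example might be written like this (a hedged sketch; the repo's actual schema and its remedy_direction enum may differ):

```python
# Illustrative JSON Schema for the structured decision output.
# The remedy_direction values listed here are the ones named in the text;
# the real schema may include more.
DECISION_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["event_type", "cites", "remedy_direction", "justification"],
    "properties": {
        "event_type": {"enum": ["hyperparameter_change",
                                "architecture_change",
                                "rule_triggered_no_action"]},
        "cites": {"type": "array", "minItems": 1,
                  "items": {"enum": ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]}},
        "remedy_direction": {"enum": ["decrease_lr", "increase_batch_size",
                                      "stop", "add_capacity",
                                      "add_bn_or_residual"]},
        "justification": {"type": "string"},
    },
}
```

In an API call this would be passed under response_format with "type": "json_schema" and strict mode enabled, so the decode itself is constrained to the schema.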
4. Iterative Self-Refine (NOT Reinforcement Learning)
Definition: An inference-time loop where the model runs a task, gets feedback, and that feedback gets fed back into the prompt for the next attempt. Model weights never change. See Madaan et al. "Self-Refine" (2023), Shinn et al. "Reflexion" (2023).
Simple Example: Attempt 1 produces 11 violations. Their list is prepended to attempt 2's system prompt: "Previously you had R5 precedence violations at epochs 4-6. Avoid repeating this pattern." Attempt 2 sees the list and picks differently. Weights of gpt-4o-mini did not update. Open a fresh conversation tomorrow and the model has forgotten everything.
Analogy: A student who re-reads their returned exam between two attempts at the same test. The student does not get smarter. They just get a cheat sheet of their past mistakes.
5. Rule Precedence
Definition: A total ordering over rule classes used to resolve conflicts when multiple rules fire on the same epoch: stability > capacity > tuning > process. The agent must action the highest-precedence rule; other fired rules get a rule_triggered_no_action deferral citing deferred_to_R<N>.
Simple Example: At epoch 5, R7 (exploding gradients, stability) and R1 (learning rate, tuning) both fire. The agent must take the R7 remedy (drop LR) and log a deferral for R1. Actioning R1 first is a precedence_violation process penalty.
Analogy: ER triage. A patient with chest pain and a runny nose gets seen for the chest pain first. You do not negotiate.
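Resolution under the precedence ordering is mechanical. A sketch, assuming a rule-to-class mapping: the text confirms R7 as stability and R1 as tuning; the assignments for the other rules here are assumptions for illustration:

```python
PRECEDENCE = {"stability": 0, "capacity": 1, "tuning": 2, "process": 3}
# Only R7 (stability) and R1 (tuning) are confirmed by the text;
# the remaining class assignments are illustrative.
RULE_CLASS = {"R7": "stability", "R6": "stability", "R4": "capacity",
              "R5": "capacity", "R1": "tuning", "R2": "tuning", "R3": "process"}

def resolve(fired: list[str]) -> tuple[str, list[str]]:
    """Return (rule to action, rules to defer) for one epoch."""
    ordered = sorted(fired, key=lambda r: PRECEDENCE[RULE_CLASS[r]])
    return ordered[0], ordered[1:]

# Epoch 5 from the example: R7 and R1 both fire.
action, deferred = resolve(["R1", "R7"])
# action == "R7"; R1 gets a rule_triggered_no_action deferral.
```

Actioning anything in `deferred` before `action` is exactly the precedence_violation the judge penalises.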
6. Waived Rules
Definition: Rules the current harness cannot physically execute a remedy for. The judge treats them as advisory — firings do not require a matching decision, deferrals of them do not need to clear, and they do not count in precedence checks.
Simple Example: The reference harness cannot rebuild the DataLoader mid-training, so R2 (batch size) is waived. If the LLM emits rule_triggered_no_action for R2 and R2 keeps firing, that is not a violation — the harness just cannot fulfill it. Real RL setups would un-waive as they gain capability.
Analogy: Telling a new resident, "do not worry about ordering MRIs today, that requires a consult we do not have on this rotation." You have not graded them for the thing you did not let them do.
7. Process Integrity Score
Definition: 1 − violations / total_decisions, bounded in [0, 1]. Entirely independent from the accuracy score. Hard fails in judge steps 1–7 zero both; violations from steps 8–9 reduce only this axis.
Simple Example: 40 total decisions, 1 violation → process = 0.975. The same run might have test accuracy 0.94 → accuracy score = 1.0 (saturates at target). They are reported as two separate numbers.
Analogy: A driving exam with two scores — how well you drove, and whether you signaled every turn. You can drive perfectly while skipping signals, and you can obey every rule while rear-ending the curb. Both matter, neither substitutes.
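The worked numbers above reduce to a one-liner. A sketch matching the formula in the definition, including the 0/0 convention discussed later for vanilla runs:

```python
def process_score(violations: int, total_decisions: int) -> float:
    """1 - violations/total_decisions, clamped to [0, 1]; 0/0 -> 1.0 by convention."""
    if total_decisions == 0:
        return 1.0  # nothing fired, nothing to violate
    return max(0.0, 1.0 - violations / total_decisions)

# 40 decisions, 1 violation -> 0.975, as in the example.
score = process_score(1, 40)
```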
8. Dead-ReLU Fraction
Definition: The proportion of post-activation values in a layer that are exactly zero. ReLU outputs zero whenever its pre-activation is negative; dead-ReLU means a neuron that is stuck at zero for most inputs and therefore cannot learn via backprop.
Simple Example: The monitor hooks a forward pass and computes (output == 0).float().mean(). The R5 rule fires when the EMA of this fraction exceeds 0.40 for 3 consecutive epochs. Remedy: swap activation to LeakyReLU or GELU.
Analogy: A light bulb permanently switched off in a house. You can flip the switch all you want — no current flows. Time to replace the bulb.
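The measurement itself is trivial. Without pulling in PyTorch, the same computation on a plain list of post-activation values (the monitor's hook does the tensor equivalent, (output == 0).float().mean()):

```python
def dead_fraction(activations: list[float]) -> float:
    """Fraction of post-activation values that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

# Seven of ten outputs stuck at zero -> fraction 0.7, well above the
# 0.40 threshold that (after EMA + 3-epoch persistence) fires R5.
layer_output = [0.0, 0.0, 1.3, 0.0, 2.1, 0.0, 0.0, 0.7, 0.0, 0.0]
frac = dead_fraction(layer_output)
```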
Make the Blueprint
Let us now make the blueprint. Env-RL has four distinct components, each with one responsibility. They talk to each other through narrow interfaces, which makes the trust boundary easy to reason about.
Four-Component Architecture
- The monitor: owns /judge_logs/; hooks the live model; canonical rule evaluator.
- The playbook: docs/playbook.md; read-only; uniform symptoms/remedy/caveat structure.

The Trust Boundary
Solid arrows = direct filesystem access; dashed arrows = mediated only through the monitor API. The agent has no other path to /judge_logs/.
- /workspace/: agent writes model.py, best_model.pt, run_config.json
- /judge_logs/: chattr +a; judge UID only

Per-Epoch Lifecycle
The Judge — 11 Steps in Strict Order
- Deliverables present in /workspace/.
- load_model() defined, takes zero args, is callable.
- The submitted model loads and switches to eval() mode.
- run_config.json has the required fields and matches the logged session_start record.
- The log chain verifies end-to-end, including the session_start/session_end bookend.
- Architecture replay: apply every architecture_change from the initial spec; the cumulative result must match the submitted model’s .spec().
- Final scoring: accuracy_score (saturating at target) and process_score (1 − violations/decisions). A hard fail anywhere above zeros both.
The 7-Rule Precedence Ladder
The Monitor’s Five Jobs
- Own /judge_logs/; the LLM UID cannot write.
- Run evaluate_rules(): EMA smoothing + 3-epoch persistence.
- Chain every record: prev_hash-linked to the last; root held by the judge.

Execute the Blueprint
Now let us execute the blueprint. Four subsections — one per component — each with real code and a diagram.
6.1 The Monitor — the Heart of the Environment
The monitor is where the anti-hacking guarantees live. It is a Python module the judge ships, installed read-only on PYTHONPATH. The LLM can read its source but cannot patch it, monkey-patch it, or route around it. And it is small — about 300 lines of Python — which is the point. A small, well-specified interface is easier to audit than a sprawling one.
Here is the core of the chained writer, with its imports and two small helpers sketched in (the canonical-JSON choice and the fresh-file _resume_state are inferred from the docstring, not copied from the repo). Every call to append() recomputes a hash that depends on the previous one and writes a single JSONL line.
import hashlib
import json
import time
from pathlib import Path
from typing import Any, BinaryIO

def _compute_hash(prev_hash: str, payload: dict[str, Any], seq: int, ts: float) -> str:
    # canonical(payload) approximated here as sorted-key JSON (assumption).
    data = prev_hash + json.dumps(payload, sort_keys=True) + str(seq) + str(ts)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

class ChainedJsonlWriter:
    """Append-only JSONL writer with a SHA-256 hash chain.

    Each line contains {seq, ts, prev_hash, payload, hash} where
    hash = sha256(prev_hash || canonical(payload) || str(seq) || str(ts)).
    File opened in "ab" mode so POSIX guarantees writes land at EOF
    regardless of seek() — past lines cannot be overwritten.
    """

    def __init__(self, path: Path, *, root_hash: str) -> None:
        self._path = Path(path)
        self._root_hash = root_hash
        self._seq, self._prev_hash, self._last_ts = self._resume_state()
        self._fh: BinaryIO = open(path, "ab")

    def _resume_state(self) -> tuple[int, str, float]:
        # Fresh file: start at seq 0 from the root hash. (The full module
        # re-reads an existing file here to resume a partial chain.)
        return 0, self._root_hash, 0.0

    def append(self, payload: dict[str, Any]) -> dict[str, Any]:
        ts = max(time.time(), self._last_ts)  # monotonic timestamps
        seq = self._seq
        h = _compute_hash(self._prev_hash, payload, seq, ts)
        record = {
            "seq": seq, "ts": ts,
            "prev_hash": self._prev_hash,
            "payload": payload, "hash": h,
        }
        self._fh.write((json.dumps(record, sort_keys=True) + "\n").encode("utf-8"))
        self._fh.flush()
        self._seq += 1
        self._prev_hash = h
        self._last_ts = ts
        return record
Hash Chain — Why You Cannot Forge Line 5
Change one byte anywhere in the chain and every line after it has a mismatched prev_hash. The judge’s verify(root_hash) walks end-to-end and raises ChainVerificationError on the first mismatch.
The monitor also owns the canonical rule evaluator. This is the single source of truth for "did a rule fire?" — the LLM does not get a vote.
def rule_r7(history: MetricsHistory, config: Config) -> bool:
    # Immediate fire on NaN/Inf loss
    if history and _has_nan_or_inf(history[-1]):
        return True
    cfg = config["rules"]["r7_exploding_gradients"]
    values = _signal(history, "max_layer_grad_norm")
    return _ema_tail_above(
        values, _alpha(config), _n(config),
        float(cfg["max_layer_grad_norm"]),
    )

def evaluate_rules(history: MetricsHistory, config: Config) -> dict[str, bool]:
    """Single source of truth. Returns {R1..R7: bool} for the latest epoch."""
    return {name: fn(history, config) for name, fn in RULES.items()}
6.2 The Judge — the Independent Observer
The judge never trusts anything the LLM writes in /workspace/ beyond the three deliverables. It reads logs directly from the judge-owned directory, runs 11 steps in strict order, and short-circuits on the first hard-fail gate. Step 5, chain verification, is a thin wrapper around the monitor’s own verify:
def verify(path: Path, *, root_hash: str) -> None:
    """Walk the chain end-to-end; raise ChainVerificationError on any issue."""
    prev_hash = root_hash
    expected_seq = 0
    last_ts = 0.0
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            rec = json.loads(raw.strip())
            for field in ("seq", "ts", "prev_hash", "payload", "hash"):
                if field not in rec:
                    raise ChainVerificationError(f"line {lineno}: missing {field!r}")
            if rec["seq"] != expected_seq:
                raise ChainVerificationError(f"seq gap at line {lineno}")
            if rec["prev_hash"] != prev_hash:
                raise ChainVerificationError(f"prev_hash mismatch at line {lineno}")
            if float(rec["ts"]) < last_ts:
                raise ChainVerificationError(f"timestamp went backwards at line {lineno}")
            expected = _compute_hash(rec["prev_hash"], rec["payload"], rec["seq"], rec["ts"])
            if expected != rec["hash"]:
                raise ChainVerificationError(f"hash mismatch at line {lineno}")
            prev_hash = rec["hash"]
            last_ts = float(rec["ts"])
            expected_seq += 1
Five Cheat Attempts — and the Step That Catches Each
| Cheat | What the Agent Tried | Judge Response |
|---|---|---|
| Shadow log | Write a fabricated decision_log.jsonl into /workspace/ | Step 5 — ignored. Judge reads only /judge_logs/. |
| Model swap | Train a larger network, submit a smaller one | Step 6 — hard fail. Architecture replay mismatch. |
| Fabricated trajectory | Log smooth training, submit a zero-weight model | Step 7 — hard fail. Live gradients disagree with log tail. |
| Skipped decision | R7 fires; agent continues without citing it | Step 8 — process violation. Score drops by 1/total_decisions. |
| Forged past line | Tamper with the hash on an existing log entry | Step 5 — hard fail. Chain verification raises on the first broken link. |
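The "forged past line" failure mode is worth seeing concretely. Here is a standalone toy chain (its own hash function, not the project's code) showing why recomputing the tampered record's own hash is not enough — the next record's `prev_hash` still betrays the edit:

```python
import hashlib
import json

def toy_hash(prev: str, payload: dict) -> str:
    return hashlib.sha256((prev + json.dumps(payload, sort_keys=True)).encode()).hexdigest()

# Build a 3-record chain.
records, prev = [], "root"
for i in range(3):
    payload = {"epoch": i}
    h = toy_hash(prev, payload)
    records.append({"payload": payload, "prev_hash": prev, "hash": h})
    prev = h

# Forge record 1's payload and recompute ONLY its own hash.
records[1]["payload"] = {"epoch": 99}
records[1]["hash"] = toy_hash(records[1]["prev_hash"], records[1]["payload"])

def verify(records, root="root"):
    """Return the index of the first broken link, or None if the chain holds."""
    prev = root
    for i, r in enumerate(records):
        if r["prev_hash"] != prev or r["hash"] != toy_hash(r["prev_hash"], r["payload"]):
            return i
        prev = r["hash"]
    return None

# Record 2's prev_hash no longer matches record 1's (recomputed) hash.
assert verify(records) == 2
```

To forge record 1 undetectably, the attacker would have to rewrite every record after it too — and the root hash held by the judge pins the other end of the chain.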
The scoring itself is intentionally simple. Two scalars. No weighting. No tradeoff.
def accuracy_score(test_accuracy: float, target_acc: float) -> float:
    if test_accuracy >= target_acc:
        return 1.0  # saturate at target
    return max(0.0, test_accuracy / target_acc)

def process_score(violations: int, total_decisions: int) -> float:
    if total_decisions <= 0:
        return 1.0  # denominator-gaming caveat
    return max(0.0, 1.0 - violations / total_decisions)

def compute_scores(*, hard_fail: bool, test_accuracy: float, target_acc: float,
                   violations: int, total_decisions: int) -> Scores:
    if hard_fail:
        return Scores(accuracy_score=0.0, process_score=0.0, hard_fail=True, ...)
    return Scores(
        accuracy_score=accuracy_score(test_accuracy, target_acc),
        process_score=process_score(violations, total_decisions),
        hard_fail=False, ...
    )
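Plugging in the numbers from the real CIFAR-10 run described later shows how decoupled the two scalars are (the two scorers are restated inline so the snippet stands alone):

```python
def accuracy_score(test_accuracy: float, target_acc: float) -> float:
    return 1.0 if test_accuracy >= target_acc else max(0.0, test_accuracy / target_acc)

def process_score(violations: int, total_decisions: int) -> float:
    return 1.0 if total_decisions <= 0 else max(0.0, 1.0 - violations / total_decisions)

# The real run: test accuracy 0.627 against a 0.70 target,
# 19 violations over 51 decisions.
assert round(accuracy_score(0.627, 0.70), 3) == 0.896
assert round(process_score(19, 51), 3) == 0.627
```

Good accuracy, mediocre process — two numbers, no weighted blend to hide behind.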
6.3 The Playbook — the 7-Rule Contract
The playbook is a read-only markdown document. Each rule has the same four-part structure: Symptoms, Cause, Remedy, Caveat (the summaries below fold the Cause into the Symptoms line). Short enough to hold in working memory. Detailed enough that lazy pattern-matching will not substitute for thinking. And since evaluate_rules() is the canonical implementation, there is no ambiguity about whether a rule fired — only about what to do about it.
R7 Exploding Gradients
Symptoms: max-layer grad-norm EMA > 10 for 3 consecutive epochs, or NaN/Inf loss.
Remedy: drop LR by factor 10, add gradient clipping.
Caveat: highest precedence — always action first.
R6 Vanishing Gradients
Symptoms: min-layer grad-norm EMA < 1e-5 for 3 consecutive epochs.
Remedy: add BN at suspect depth, residual connection, or gradient-friendly activation.
Caveat: stability > capacity > tuning. R7 beats R6 if both fire.
R5 Dead Activations
Symptoms: dead-ReLU fraction EMA > 0.40 for 3 consecutive epochs.
Remedy: swap to LeakyReLU, GELU, or PReLU.
Caveat: high LR transiently looks like dead-ReLU; stabilize first.
R4 Depth / Capacity
Symptoms: train-acc plateau with clean gradients and healthy activations for 3 epochs.
Remedy: add residual block or widen channels.
Caveat: capacity > tuning. R4 beats R1/R2.
R1 Learning Rate
Symptoms: update-to-param ratio EMA out of [1e-4, 1e-2] for 3 epochs, or val-loss plateau.
Remedy: reduce LR by 3–10x, or cyclical schedule.
Caveat: never touch LR while R6/R7 firing.
R2 Batch Size
Symptoms: gradient noise scale EMA outside [50, 5000] for 3 epochs.
Remedy: halve or double batch size (grad accumulation if VRAM-bound).
Caveat: GNS moves with LR; give one epoch after R1 action before evaluating R2.
R3 Early Stopping
Symptoms: val loss no improvement by min_delta over patience epochs.
Remedy: stop training, save best checkpoint.
Caveat: lowest-precedence. Every other class beats R3.
6.4 The Harness — Iterative Self-Refine
The harness is the piece that plugs a real LLM in. Python drives the training loop. Each epoch, after the monitor evaluates the 7 rules, the highest-precedence fired rule is sent to the LLM with the full diagnostic state; the LLM returns a structured JSON decision; the harness applies the remedy (LR change or activation swap), logs the decision, and continues.
Between attempts, the scores and violations of the previous run are carried forward into the next attempt’s system prompt. This is not reinforcement learning. Model weights never change. The mechanism is entirely in-prompt.
The Iterative Self-Refine Loop
Attempt 1 (no priors) → Attempt 2 (+ attempt 1 feedback) → Attempt 3 (+ attempts 1 & 2)
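The "update" step of this loop is just string concatenation. A hypothetical sketch of how prior-attempt feedback might be folded into the next system prompt — the function and field names here are illustrative, not the harness's actual API:

```python
def seed_system_prompt(base_prompt: str, prior_attempts: list[dict]) -> str:
    """Hypothetical sketch: append each prior attempt's scores and
    violations to the system prompt as plain text. No weights change."""
    blocks = [base_prompt]
    for i, a in enumerate(prior_attempts, start=1):
        blocks.append(
            f"[Attempt {i} feedback] process_score={a['process_score']:.3f}, "
            f"violations={a['violations']}. "
            f"Avoid repeating: {', '.join(a['violation_kinds'])}."
        )
    return "\n\n".join(blocks)

prompt = seed_system_prompt(
    "PLAYBOOK ...",
    [{"process_score": 0.744, "violations": 11,
      "violation_kinds": ["precedence_violation:R5"]}],
)
assert "[Attempt 1 feedback]" in prompt
```

That appended block is the entire learning mechanism: roughly 300 tokens of text standing in for a gradient step.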
One Decision: What Happens When a Rule Fires
Here is the core of the OpenAI policy — the whole thing is about 40 lines. The schema is enforced by the API itself (strict: true), so parsing cannot fail on a malformed response.
class OpenAIDecisionPolicy:
    def decide(self, *, top_rule, all_fired, metrics, epoch,
               current_lr, current_batch_size, recent_history):
        messages = build_decision_messages(
            system_prompt=self._system_prompt,  # playbook + prior attempts
            epoch=epoch, top_rule=top_rule, metrics=metrics,
            current_lr=current_lr,
            current_batch_size=current_batch_size,
            recent_history=recent_history,
        )
        response = self._client.chat.completions.create(
            model=self._model,  # gpt-4o-mini
            messages=messages,
            temperature=self._temperature,  # 0.2
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "decision",
                    "strict": True,
                    "schema": DECISION_SCHEMA,  # enforced by the API
                },
            },
        )
        return _decision_from_dict(
            json.loads(response.choices[0].message.content),
            top_rule=top_rule, current_lr=current_lr,
        )
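`DECISION_SCHEMA` itself is not reproduced in this post. A plausible shape, inferred from the fields that appear in the transcript excerpts below — the event-type enum and the exact constraints are assumptions, and the project's real schema may carry more of them:

```python
# Inferred sketch of DECISION_SCHEMA; field names come from the transcripts,
# the enum values and strictness flags are assumptions.
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {
            "type": "string",
            "enum": ["hyperparam_change", "architecture_change",
                     "rule_triggered_no_action"],
        },
        "cites": {"type": "array", "items": {"type": "string"}},
        "justification": {"type": "string"},
        "remedy_direction": {"type": "string"},
        "remedy_params": {
            "type": "object",
            "properties": {
                "lr_new": {"type": "number"},
                "edit_op": {"type": "string"},
                "edit_to": {"type": "string"},
            },
            "required": ["lr_new", "edit_op", "edit_to"],
            "additionalProperties": False,
        },
    },
    "required": ["event_type", "cites", "justification",
                 "remedy_direction", "remedy_params"],
    "additionalProperties": False,  # required by strict mode
}
```

With `strict: true`, the API guarantees the response parses against this schema, which is why the harness can `json.loads` the content without a fallback path.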
Active vs Waived Rules in the Reference Harness
| Rule | Status | Why | Future un-waive |
|---|---|---|---|
| R1 (LR) | ACTIVE | Harness applies lr_new to optimizer | — |
| R5 (activations) | ACTIVE | Harness swaps activation modules in place | — |
| R7 (exploding) | ACTIVE | Same code path as R1 (LR drop) | — |
| R2 (batch size) | WAIVED | Would need DataLoader rebuild mid-run | Implement rebuild_loader() |
| R3 (early stop) | WAIVED | Harness always runs to max_epochs | Add early-stop signal to loop |
| R4 (add block) | WAIVED | Would need optimizer.add_param_group | Implement add_block edit |
| R6 (vanishing) | WAIVED | Retrofitting BN mid-run changes shapes | Rebuild model + transfer weights |
Waived rules are fed into the system prompt with an instruction: “always use rule_triggered_no_action for these, no penalty.” This is how the harness is honest about what it can and cannot do — and why the process score stays attainable on a real run.
To make the feedback loop visible, every attempt writes a llm_transcript.jsonl that records the system prompt (once) and every OpenAI call’s user message + response. Here is one real decision from a gpt-4o run.
{
  "kind": "call",
  "epoch": 2,
  "top_rule": "R5",
  "all_fired": {"R1": true, "R2": true, "R5": true, "R3": false, "R4": false, "R6": false, "R7": false},
  "user_message": "Epoch 2. Rule(s) fired: ['R1', 'R2', 'R5']. You must action the highest-precedence rule, which is R5.\n\nCurrent hyperparameters:\n  lr = 0.3\n  batch_size = 32\n\nCurrent-epoch diagnostics:\n  max_layer_grad_norm = 4.2\n  min_layer_grad_norm = 0.018\n  dead_relu_fraction = 0.73\n  update_to_param_ratio = 5e-2\n...",
  "response": "{\n  \"event_type\": \"architecture_change\",\n  \"cites\": [\"R5\"],\n  \"justification\": \"Dead-ReLU fraction 0.73 over 3 consecutive epochs; swap to leaky_relu per R5.\",\n  \"remedy_direction\": \"swap_activation\",\n  \"remedy_params\": {\"lr_new\": 0.3, \"edit_op\": \"swap_activation\", \"edit_to\": \"leaky_relu\"}\n}"
}
The Self-Refine Loop at Work — A Real Synthetic Run
| Attempt | Violations | Process Score | What Changed |
|---|---|---|---|
| 1 | 11 / 43 | 0.744 | Baseline. LLM gets the playbook, no prior feedback. |
| 2 | 10 / 43 | 0.767 | Sees attempt 1’s R5 precedence violation; fixes it. |
| 3 | 10 / 43 | 0.767 | Converged. Remaining violations are structural (short-run scenario). |
The 11 → 10 drop is the iterative mechanism working. Nothing else changed: same seed, same model, same training data. The only thing different about attempt 2 is the extra ~300 tokens of prior-attempt feedback in the system prompt.
A Real Run in Three Acts
Let us now follow what actually happens when you turn this loose on real CIFAR-10 for 20 epochs with gpt-4o-mini as the decision-maker. No synthetic shortcuts. No curated example. Just the raw training_trace.jsonl from one run — the story has a distinct three-act structure.
poetry run python examples/run_llm_agent.py --attempts 1 --epochs 20 --batch-size 128 --lr 0.05 --model gpt-4o-mini --target-acc 0.70
The Climb — Training and Validation Accuracy
xychart-beta
title "Train vs Val Accuracy (%) across 20 epochs"
x-axis [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
y-axis "Accuracy %" 30 --> 70
line [34.9, 47.8, 55.0, 58.0, 59.9, 61.1, 62.1, 62.8, 63.8, 64.1, 65.0, 65.4, 65.7, 66.0, 66.3, 66.5, 66.8, 67.1, 67.1, 67.4]
line [39.3, 47.2, 47.9, 49.2, 51.6, 48.1, 51.7, 54.2, 55.5, 59.3, 57.3, 57.0, 57.3, 57.0, 60.2, 60.6, 57.6, 53.4, 55.5, 57.6]
Upper line: train accuracy (monotonic climb 35% → 67%). Lower line: val accuracy, peaks at 60.6% at epoch 15, test set scores 62.7%.
Act I — Half-Dead on Arrival (epochs 0–1)
The CNN starts with ReLU activations and a learning rate of 0.05. One forward pass through the untrained model and the monitor flags the first number worth reading: 53% of neurons are already dead. Half the model cannot send a gradient back through itself. By epoch 1, the dead fraction is up to 59%.
And yet the model is still learning. Train accuracy climbs from 35% to 48%. Val from 39% to 47%. The half that is alive is doing the work of the whole. But this is unstable — a random gradient step at any point could push more neurons into the dead zone permanently.
Dead-ReLU Fraction — The Cliff of Recovery
xychart-beta
title "Dead-ReLU fraction (%) across 20 epochs"
x-axis [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
y-axis "Dead fraction %" 0 --> 80
bar [53, 59, 68, 72, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
53% → 72% climb through epochs 0–3, then a cliff to 0% at epoch 4 — one epoch after the LLM’s R5 remedy took effect. LeakyReLU cannot produce exactly-zero outputs, so the fraction drops to zero and stays there.
Act II — The Verdict and the Reanimation (epochs 2–5)
At epoch 2, three rules fire at once. The monitor's EMA-smoothed signals are now past threshold on all three:
- R1 (learning rate) — update-to-param ratio out of band
- R2 (batch size) — gradient noise scale stuck below 50
- R5 (dead activations) — 68% of neurons still zero
The monitor hands the LLM a single decision request: “Rule(s) fired: R1, R2, R5. Action the highest-precedence rule.” Precedence is stability > capacity > tuning > process. Neither R6 nor R7 fired, so capacity (R5) wins — and the LLM makes the right call.
What the LLM saw
Epoch 2. Rule(s) fired: ['R1', 'R2', 'R5']. You must action the highest-precedence rule, which is R5.

Current hyperparameters:
  lr = 0.05
  batch_size = 128

Current-epoch diagnostics:
  epoch = 2
  train_loss = 1.248
  val_loss = 1.427
  val_acc = 0.479
  dead_relu_fraction = 0.684
  update_to_param_ratio = 0.220
  grad_noise_scale = 37.80
  max_layer_grad_norm = 0.576
  min_layer_grad_norm = 0.087
What the LLM decided
{
  "event_type": "architecture_change",
  "cites": ["R5"],
  "justification": "Dead-ReLU fraction exceeds 0.40 for 3 consecutive epochs, indicating many neurons are stuck at zero.",
  "remedy_direction": "swap_activation",
  "remedy_params": {
    "lr_new": 0.05,
    "edit_op": "swap_activation",
    "edit_to": "leaky_relu"
  }
}
The harness applies the edit in place. Every ReLU in the model becomes a LeakyReLU. model._activation updates to "leaky_relu" so the architecture-replay check (judge step 6) will pass at the end. R1 and R2 are deferred with "deferred_to_R5".
What happens next is the satisfying part: one epoch later, the dead-ReLU fraction is zero. LeakyReLU cannot produce exactly-zero outputs, so once the swap takes hold, every neuron in the network is alive again. The EMA smoother lags behind the raw signal, though, so it takes two more epochs of zero readings before R5 officially clears — and during those two epochs the LLM keeps issuing the same R5 remedy (an idempotent LeakyReLU → LeakyReLU swap). The decision log captures this honestly, and it does not hurt the run: the edits are no-ops, but the LLM is reasoning correctly given what the EMA still shows.
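The arithmetic behind "cannot produce exactly-zero outputs" fits in four lines. A dead unit is one whose pre-activation is stuck negative; through ReLU that yields exactly 0 and hence a zero local gradient, while LeakyReLU keeps a small nonzero slope on the negative side:

```python
def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0 else slope * x

# A "dead" unit: pre-activation stuck at -2.0.
# ReLU: output 0, local gradient 0 -> no learning signal can flow back.
# LeakyReLU: small nonzero output, gradient = slope -> the unit can recover.
assert relu(-2.0) == 0.0
assert leaky_relu(-2.0) == -0.02
```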
Act III — The Capacity Wall (epochs 6–19)
At epoch 6 the monitor reports something new: R4 (depth / capacity) fires for the first time. Train accuracy is inching up by less than 2 percentage points per epoch. With gradients clean and activations alive, the model is running out of room in the parameters it has.
The LLM reads the right prescription — R4’s playbook remedy is add a residual block — and emits architecture_change with edit_op: "add_block". But the reference harness does not execute add_block: inserting a new block would require rebuilding the optimizer’s parameter groups and transferring state, which is fragile to do mid-training. So the harness downgrades the decision to rule_triggered_no_action with the justification: "harness does not execute edit ‘none’; deferring R4."
Judge step 6 replays every architecture_change event against the submitted model. If the log said "we added a block" but the submitted model only has the original blocks, step 6 hard-fails and zeroes both scores. By downgrading before logging, the harness keeps the log consistent with the actual model state. The process integrity score drops because of unexecuted deferrals — that is the honest cost of a harness that cannot fulfill every playbook remedy.
For the next 14 epochs, R4 fires every time. The LLM defers every time. Val accuracy climbs to 60.6% at epoch 15, wobbles for a few epochs, and the run ends with test accuracy 0.627 — close enough to the 0.70 target to give an accuracy score of 0.895.
Rule-Firing Heatmap — Which Rules Fired When
| Rule | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R1 LR | · | · | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 | 🟠 |
| R2 batch | · | · | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | · | · | · | · |
| R3 early-stop | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · |
| R4 depth | · | · | · | · | · | · | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 | 🔵 |
| R5 activations | · | · | ✅ | ✅ | ✅ | ✅ | ❌ | · | · | · | · | · | · | · | · | · | · | · | · | · |
| R6 vanishing | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · |
| R7 exploding | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · | · |
✅ = fired & actioned ❌ = fired & missed (precedence_violation) 🟠 = fired & deferred, deferral never cleared (unresolved_deferral) 🔵 = fired & deferred, waived by harness (no penalty) · = not fired
Epilogue — What the Judge Said
All 7 hard-fail gates passed cleanly: deliverables in place, load_model() with the correct signature, weights loaded, run_config.json consistent with the session record, hash chain intact end-to-end, architecture replay (R5 swap to leaky_relu) matches the submitted model, and the live diagnostic (one fwd/bwd pass on held-out training batches) was within tolerance of the logged final-epoch gradient norms.
The process score then took hits of exactly two kinds:
Violations Breakdown
| Count | Kind | Rule | Why |
|---|---|---|---|
| 18 | unresolved_deferral | R1 | R1 fired every epoch from 2–19. The LLM always picked a higher-precedence rule (R5 then R4). R1 never got its turn to be actioned, so it never cleared — the judge counts that each time. |
| 1 | precedence_violation | R5 | At epoch 6, R5 was still flagged by EMA lag and R4 fired. The LLM picked R4 (same capacity tier). Because R5 was listed first in the canonical tie-break, the audit flagged it. |
Final scores:
accuracy_score = 0.627 / 0.70 ≈ 0.895 (close to the target)
process_score = 1 − 19 / 51 ≈ 0.627 (18 stuck R1 deferrals drag it down)
This is exactly the separation the environment was designed to produce. A benchmark that measured only accuracy would give this run full marks. A benchmark that measured only process would miss that the model genuinely learned. Env-RL reports both — and the gap between them is itself the story: the LLM trained a competent CIFAR-10 classifier, but it also left one rule (R1) chronically unacted throughout the run. If you cared to improve this run further, you would not change the model — you would change the decision-making.
Three Restarts, One Path: the LLM Learns Across Attempts
The previous section was one attempt on one model. Now let us turn the harness loose for three attempts in a row, on real CIFAR-10, with gpt-4o as the decision-maker. The starting model is small: two residual blocks, ReLU activations, learning rate 0.1. Each time the LLM decides the model needs more capacity, the harness does not mutate the running network — it schedules a restart. The current attempt ends cleanly. The next attempt begins at epoch 0 with the new architecture baked into its initial config, so run_config.json always matches the submitted model with zero in-flight architecture changes. This is the honest way to add capacity: the restart costs training time but buys experimental cleanliness.
poetry run python examples/run_llm_agent.py --attempts 3 --epochs 20 --batch-size 128 --lr 0.1 --model gpt-4o --temperature 0.8 --target-acc 0.70 --base-dir llm_runs_v5 --reset-prompt-history
The Restart Cycle
Each restart is triggered by R4 (the rule that says the model has hit a capacity wall). The LLM does not mutate the live network. The decision is logged, the attempt terminates, and the next attempt starts fresh with an extra block.
Attempt 1 — Picking the Wrong Fight
The first two epochs train uneventfully. Train accuracy climbs from 35.6% to 48.0%; no rule has fired yet. Then at epoch 2 three rules fire simultaneously: R1 (update-to-parameter ratio 0.27 — far too high), R2 (gradient-noise scale 32.9 — out of the healthy band), and R5 (dead-ReLU fraction 0.74).
The LLM picks R5. It emits swap_activation: leaky_relu and the remedy works — dead-ReLU collapses to 0.0 by epoch 4 and stays there. But R1 and R2 keep firing every epoch from 2 through 7, and the LLM keeps citing R5. Each of those unactioned firings becomes an unresolved deferral violation. The monitor is watching.
At epoch 7, a fourth rule fires — R4, the capacity rule (saturation gap: train 0.621, val 0.586). Now the LLM has a genuine architecture problem. It emits add_block, but this is a RESTART-class edit: the harness does not mutate the live 2-block model. Instead it writes a rule_triggered_no_action decision with justification "restart scheduled: add_block for R4", ends the attempt, and queues an updated config for Attempt 2.
Attempt 1 — Train vs Val Accuracy (epochs 0–7)
xychart-beta
title "Train vs Val Accuracy (%) — Attempt 1 (2 blocks, ReLU→LeakyReLU)"
x-axis [0,1,2,3,4,5,6,7]
y-axis "Accuracy %" 30 --> 70
line [35.6, 48.0, 53.9, 57.2, 59.2, 60.3, 61.6, 62.1]
line [42.9, 43.6, 43.9, 44.1, 47.3, 59.9, 58.1, 58.6]
Upper line: train. Lower line: val. After the activation swap at epoch 2, val accuracy jumps from the 44% band to 59.9% by epoch 5 — the R5 remedy was correct. But the gap widens at epoch 7 (train 62.1 vs val 58.6): R4's saturation-gap signal. Restart triggered.
Dead-ReLU Fraction — the R5 Remedy in Action
xychart-beta
title "Dead-ReLU fraction (%) — Attempt 1"
x-axis [0,1,2,3,4,5,6,7]
y-axis "Dead fraction %" 0 --> 80
bar [56, 67, 74, 74, 0, 0, 0, 0]
56% → 74% climb across epochs 0–3, then a cliff to 0% at epoch 4 — one epoch after the LLM's R5 remedy took effect. LeakyReLU cannot produce exactly-zero outputs, so the dead fraction stays at zero for the rest of the attempt.
Attempt 2 — Feedback Works
Attempt 2 starts fresh: 3 blocks, leaky_relu, lr 0.1. The feedback_in.json block injected into the system prompt lists every one of Attempt 1's 14 violations — the twelve R1 + R2 deferrals are impossible to miss. The LLM flips strategy.
From epoch 2, it hammers R1 every single epoch. Four cascading learning-rate changes in four epochs — 0.1 → 0.01 → 0.001 → 0.0001, then a small bump back to 0.0003 when the update ratio dropped too low. Val accuracy tracks the discipline: 0.451 → 0.622 → 0.635 → 0.637 → 0.636. At epoch 6, R4 fires again and triggers a second restart.
R2 is still deferred every epoch — the LLM never decides to change batch size. That is the new signature violation of this run, and it will persist into Attempt 3.
Attempt 2 — LR Cascade vs Val Accuracy
xychart-beta
title "Val Accuracy × 100 (line) and LR × 1000 (bar) — Attempt 2"
x-axis [0,1,2,3,4,5,6]
y-axis "value" 0 --> 100
line [41.7, 48.6, 45.1, 62.2, 63.5, 63.7, 63.6]
bar [100, 100, 10, 1, 0.1, 0.3, 0.3]
Line: val accuracy (%). Bars: learning rate × 1000. Each bar collapse corresponds to a single LLM decision citing R1. Val accuracy stabilizes in the 63.5–63.7 band once LR drops below 0.01.
Attempt 2 — Every Epoch, Every Decision
| Epoch | Train / Val acc | Fired rules | Decision | Effect |
|---|---|---|---|---|
| 0 | 0.341 / 0.417 | — | — | normal training |
| 1 | 0.467 / 0.486 | — | — | normal training |
| 2 | 0.533 / 0.451 | R1, R2 | R1 → decrease_lr | lr 0.1 → 0.01 |
| 3 | 0.617 / 0.622 | R1, R2 | R1 → decrease_lr | lr 0.01 → 0.001 |
| 4 | 0.635 / 0.635 | R1, R2 | R1 → decrease_lr | lr 0.001 → 0.0001 |
| 5 | 0.637 / 0.637 | R1, R2 | R1 → increase_lr | lr 0.0001 → 0.0003 |
| 6 | 0.638 / 0.636 | R1, R2, R4 | R4 → restart scheduled | early stop; next attempt = 4 blocks |
Four R1 actions in a row. R2 fires every epoch and is never actioned — of Attempt 2's eleven logged decisions, five are R1 actions or the R4 restart; the rest are R1 / R2 deferrals, and the unresolved ones are counted as violations.
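Mechanically, each R1 action compiles down to one of the simplest edits in deep learning. A sketch, using plain dicts as a stand-in for PyTorch's `optimizer.param_groups` (which is itself a list of dicts) so the snippet runs anywhere:

```python
def apply_lr(param_groups: list[dict], lr_new: float) -> list[dict]:
    """Apply an R1 remedy: overwrite the learning rate in every parameter
    group. param_groups mimics PyTorch's optimizer.param_groups structure."""
    for group in param_groups:
        group["lr"] = lr_new
    return param_groups

# Attempt 2, epoch 2: the first step of the cascade, 0.1 -> 0.01.
groups = [{"lr": 0.1}, {"lr": 0.1}]
apply_lr(groups, 0.01)
assert all(g["lr"] == 0.01 for g in groups)
```

The asymmetry with R4 is the point: an LR change is a one-line in-place mutation, while adding a block touches the optimizer state itself — which is exactly why R1 is an in-flight edit and R4 is a restart.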
Attempt 3 — The Same Play, Bigger Model
Attempt 3 runs with 4 blocks (the second add_block restart took effect) and the decisions are a near-exact replay of Attempt 2: four R1 actions in the same cascading pattern, the same R4 restart at epoch 6. The 4-block model starts noisier — val accuracy is only 0.240 at epoch 0 vs 0.417 for the 3-block model — but overtakes by epoch 3 (0.634) and finishes at 0.651.
Test accuracy: 0.6524. Accuracy score: 0.932. The best of the run. Process score: unchanged at 0.273. Same eight violations, identical in shape. The LLM has found a stable strategy — it is no longer learning across attempts, just reapplying what worked in Attempt 2 on a bigger network.
Val Accuracy Across All Three Attempts
xychart-beta
title "Val Accuracy (%) — Attempt 1 (2 blocks), 2 (3 blocks), 3 (4 blocks)"
x-axis [0,1,2,3,4,5,6,7]
y-axis "Val accuracy %" 20 --> 70
line [42.9, 43.6, 43.9, 44.1, 47.3, 59.9, 58.1, 58.6]
line [41.7, 48.6, 45.1, 62.2, 63.5, 63.7, 63.6, 63.6]
line [24.0, 33.5, 39.7, 63.4, 65.1, 65.0, 65.1, 65.1]
Top-to-bottom at the final epoch: Attempt 3 (4 blocks) leads at 65.1% → Attempt 2 (3 blocks) at 63.6% → Attempt 1 (2 blocks) plateaus at 58.6%. Each additional block cost one full restart but bought real accuracy. Attempt 3 briefly has the lowest val accuracy at epoch 0 because the bigger fresh network is still finding the loss surface.
Three Tuner Proposals, Zero Wins
While all of this was happening, a second loop was running quietly in the background. After each attempt, the MetaLoop asked the Tuner agent to propose a new system prompt aimed at eliminating the observed violations. The Tester agent then ran the proposed prompt against a held-out scenario suite of six curated failure cases. The Judge agent compared the proposed prompt to the current champion and picked a winner.
Three rounds ran. Three times, the old prompt won.
The Tuner→Tester→Judge Scoreboard
| Round | Technique tried | Tuner rationale | Pass rate old → new | Verdict |
|---|---|---|---|---|
| 1 | chain_of_thought | "High violation count — enforce step-by-step reasoning" | 1.00 → 1.00 (score 0.911 → 0.907) | keep old (v0) |
| 2 | few_shot | "Few-shot examples for top violated rules" | 1.00 → 1.00 (score 0.911 → 0.906) | keep old |
| 3 | few_shot | Same hypothesis re-proposed after Attempt 3 | 1.00 → 0.83 (score 0.911 → 0.739) | keep old (new regressed R3) |
Scoreboard totals: chain_of_thought 0 wins, 1 loss · few_shot 0 wins, 2 losses. The champion v000.txt held the entire run.
What This Run Actually Tells Us
Restart-class edits work end-to-end on real data. The architecture grew from 2 to 4 blocks across three attempts. No in-flight mutation. No best-state shape mismatch. No judge hard-fail. Each attempt's run_config.json exactly described the submitted model — that is the invariant the restart design was built to preserve.
The two axes can diverge, and that is the point. Accuracy climbed monotonically: 0.872 → 0.902 → 0.932. Process score flattened at 0.273 from Attempt 2 onward. A benchmark that reported only accuracy would call this a strict improvement. A benchmark that reported only process would call it stuck. Env-RL reports both, and the gap between them is itself information: the LLM got better at training the model without getting better at justifying its decisions.
Cross-attempt feedback beat within-attempt prompt tuning. The drop from 14 to 8 violations happened between Attempt 1 and Attempt 2 — not because of any prompt edit, but because Attempt 2's system prompt was seeded with Attempt 1's violation list and the LLM responded to it. The three Tuner proposals that followed did not reproduce the effect. This is not an indictment of prompt tuning; it is a signal that the two loops fix different failure modes.
Some violations are prompt-fixable, some are policy-fixable. Knowing which is which is the whole reason to run both loops. The run you just read shows a clean case where the remaining violations were the second kind — and the system reported it honestly by refusing to promote any of the three Tuner proposals.
Why Prompt Refinement, and How Close Is It to RL?
The run above raises a fair question. If Attempts 2 and 3 improved without any prompt edit, why does this project ship a Tuner→Tester→Judge loop at all? The answer takes four parts: what problem prompt refinement solves, a concrete worked example, the benefits, and finally — honestly — how close this is to real reinforcement learning.
The Problem Prompt Refinement Solves
An LLM in this harness has no trainable weights. We cannot do SGD on the policy. Every complaint the model absorbs has to enter through text — which means the prompt is the policy. Without an explicit prompt-editing loop you have exactly two levers: the playbook text (fixed by design, read-only) and the feedback block we prepend to each attempt's system prompt.
The feedback block is reactive. It reports last attempt's damage in a flat summary and asks the LLM to be better. That is what produced the Attempt 1 → Attempt 2 violation drop you just saw. But once the model has absorbed the feedback and still plateaus, you are stuck. Attempt 2 and Attempt 3 had identical decision patterns. Feedback told the LLM "you deferred R2 last time," and the LLM still deferred R2.
A Worked Example From This Run
You already saw the scoreboard above. Here is what actually happened inside each of the three rounds, pulled from llm_runs_v5/meta_loop_log.json.
Three Rounds, Three Losses — Annotated
| Round | Hypothesis | What changed | Result |
|---|---|---|---|
| 1 | Enforce step-by-step reasoning → LLM will stop skipping rules | +415 tokens of CoT scaffold (reasoning, diagnostic check, precedence audit) | Pass rate unchanged (6/6). Judge score dipped 0.911 → 0.907. Old wins. |
| 2 | Concrete examples → LLM will mimic the pattern | +494 tokens of few-shot examples for R1, R2, R4 | Pass rate unchanged. Judge score 0.911 → 0.906. Old wins. |
| 3 | Same few-shot hypothesis, retried with more violations as context | Same +494 tokens, different examples | Pass rate dropped 1.00 → 0.83. Regressed on R3 (early-stop). Old wins. |
Every proposed prompt was strictly worse on a held-out suite than the baseline. The champion v000.txt stayed unchanged.
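The keep-or-discard decision in each round can be sketched as a champion/challenger rule. This is a hedged reconstruction — the real Judge agent is an LLM call, but its observed behavior across the three rounds is consistent with something like: promote only on a strictly better pass rate, or an equal pass rate with a higher judge score:

```python
def pick_champion(old: dict, new: dict) -> str:
    """Sketch of the Judge's promotion rule (field names illustrative):
    the challenger must beat the champion on pass rate, or tie on pass
    rate and win on judge score. Otherwise the champion holds."""
    if new["pass_rate"] > old["pass_rate"]:
        return "new"
    if new["pass_rate"] == old["pass_rate"] and new["score"] > old["score"]:
        return "new"
    return "old"

# Rounds 1 and 3 of the real run: the challenger never wins.
assert pick_champion({"pass_rate": 1.00, "score": 0.911},
                     {"pass_rate": 1.00, "score": 0.907}) == "old"
assert pick_champion({"pass_rate": 1.00, "score": 0.911},
                     {"pass_rate": 0.83, "score": 0.739}) == "old"
```

Round 2 falls out the same way (1.00 → 1.00, 0.911 → 0.906): three rounds, three holds for v000.txt.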
Four Concrete Benefits
1. Versioned policy memory. Every proposed prompt is saved to disk as prompts/v000.txt, v001.txt, and so on. The per-round record is in meta_loop_log.json. The cumulative win/loss tally is in .scoreboard.json. You can diff any two prompts, replay any decision, or ask why a given prompt became champion. Nothing about the policy is a black box.
2. Separation of "does the LLM understand?" from "is the LLM willing?" If a new prompt that spells out R2 more explicitly does not reduce R2 deferrals, the problem is not comprehension — it is decision policy. The run above proved this exact point: three prompts tried to make the LLM act on R2, and none of them did. That is information you cannot get without a Tuner loop. The loop's failure cases are as diagnostic as its successes.
3. Cross-run carryover. --resume-from-champion starts the next session from the last winning prompt instead of resetting. Over dozens of runs, the champion drifts toward what actually works. Run history compounds. This is the only mechanism in the system that accumulates policy improvements across sessions — everything else is per-run.
4. No GPUs, no gradient, no catastrophic forgetting. One prompt edit costs one Tuner API call, one Tester pass over six scenarios, and one Judge call — maybe a dollar of inference. Fine-tuning a model on violation data costs hours of GPU, labeled training data, and permanently altered weights that may silently regress on tasks you never tested. Prompt refinement is not free, but it is cheap, reversible, and legible.
How Close Is This to Reinforcement Learning?
The shape of the loop is unmistakably RL-adjacent: act in an environment, score the trajectory, update the policy, try again. But the mechanics are different — and the differences matter.
Classical RL vs env_rl Prompt Refinement
| Concept | Classical RL | env_rl Prompt Refinement |
|---|---|---|
| Policy | Parameter vector θ in a neural network | Natural-language system prompt (a string) |
| Action | Continuous or discrete action ∈ 𝒜 | JSON decision object (event_type + cited rule + remedy) |
| Episode | One trajectory in the environment | One attempt (training + judging loop) |
| Reward | Scalar r per step, or R at terminal | Two decoupled scalars: accuracy_score + process_score |
| Policy update | Gradient step: θ ← θ + α ∇ log π · R | Tuner proposes new prompt; Tester rolls out; Judge keeps or discards |
| Exploration | ε-greedy / entropy bonus / noise | Six Tuner techniques + --temperature 0.8 |
| Credit assignment | TD / advantage / GAE | Per-rule violation counts traced back to the decisions that caused them |
| Off-policy data | Replay buffer | Human-review scenarios + historical violation logs |
When Prompt Refinement Will (and Will Not) Help
A Tuner can realistically improve:
- Comprehension failures — the LLM misread R2's remedy as "decrease lr" instead of "increase batch size." A clearer prompt with examples can fix this.
- Style failures — the LLM wrote justifications too short, used the wrong enum, or omitted required fields. Tighter schema instructions help.
- Coverage gaps — the LLM forgot that a rarely-firing rule exists at all. An explicit enumeration can close that gap.
A Tuner cannot realistically improve:
- Missing information — if the prompt never mentions a constraint, no rephrasing of the existing text will satisfy it. You have to add the constraint.
- Policy asymmetry — if the LLM systematically values accuracy over process (as we saw with R2 in this run), no prompt edit rewrites its implicit value function. That requires changing the model, the rules, or the reward.
Which brings us to the most important framing — and to the next section, which states the limit plainly.
This Is Not Reinforcement Learning
One last thing that matters. The name of this project is env_rl, and the harness does get better across attempts. It is tempting to call this RL. It would be wrong.
Real Reinforcement Learning
- A parametric policy with trainable weights
- Gradient-based update after each rollout (PPO, DPO, GRPO)
- Per-decision reward decomposition
- Learning persists across episodes in the model
- Requires weights access (Llama, Qwen, Mistral)
- Weeks of engineering, GPU hours, reward shaping
Env-RL Harness Today
- OpenAI model with frozen weights, via public API
- Prompt updated between attempts; weights untouched
- End-of-run scores, propagated as text
- No persistence — open a fresh conversation, model has forgotten
- API-only; works with gpt-4o-mini or o3
- Hours to wire up; cost measured in pennies per run
The environment half of RL — observation, action, reward — is what this project builds. It is a building block for real RL, not a replacement. A future src/env_rl/rl/ module could decompose the per-decision process reward, collect trajectories across hundreds of runs, and fine-tune an open-weights model with DPO on the (better, worse) pairs. That is the direction this scales toward. For now, the harness is explicit about being iterative self-refine, every user-facing artifact is tagged "mode": "iterative_self_refine", and the README says so in bold. Honesty in naming matters.
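That trajectory-to-DPO path can be made concrete. Below is a sketch of mining (better, worse) pairs from judged decision logs; the record fields ("context", "decision", "score") and the margin threshold are hypothetical illustrations, not the harness's actual log schema.

```python
# Hypothetical sketch: mine DPO preference pairs from judged decision logs.
# The record fields ("context", "decision", "score") and the margin are
# illustrative assumptions, not the harness's actual log schema.
from itertools import combinations

def build_dpo_pairs(records, margin=0.1):
    by_context = {}
    for r in records:  # group decisions by the observation they answered
        by_context.setdefault(r["context"], []).append(r)
    pairs = []
    for context, group in by_context.items():
        for a, b in combinations(group, 2):
            # only emit a pair when the judge clearly preferred one decision
            if a["score"] - b["score"] >= margin:
                pairs.append({"prompt": context,
                              "chosen": a["decision"],
                              "rejected": b["decision"]})
            elif b["score"] - a["score"] >= margin:
                pairs.append({"prompt": context,
                              "chosen": b["decision"],
                              "rejected": a["decision"]})
    return pairs
```

The margin matters: two decisions the judge scored nearly equally carry no preference signal, and feeding near-ties to DPO just adds noise.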
Conclusion
Key Takeaways
- Process integrity in an adversarial setting is a filesystem / cryptography problem, not a prompting problem. The hash-chained, append-only log + read-only monitor pattern is the foundational piece. Everything else is built on that.
- Diagnostic metrics must be measured by the judge, not reported by the agent. PyTorch forward/backward hooks are the clean way to do this — the agent never computes the numbers it could game.
- Two decoupled scores are stronger than any weighted scalar. Accuracy and process integrity cannot be traded against each other, and the sum of their failure modes gives a richer picture than either alone.
- Iterative self-refine is useful, and explicitly not RL. Feeding prior-run scores back into the next prompt measurably improves behavior on this task. The naming honesty protects readers from confusing cheap inference-time tricks with real model training.
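The hash-chained log pattern from the first takeaway fits in a few lines. This is a minimal sketch of the idea, not the harness's actual monitor code:

```python
# Minimal sketch of a hash-chained, append-only log. Each entry commits
# to the hash of the previous one, so editing or deleting any earlier
# record breaks every hash that follows it.
import hashlib
import json

GENESIS = "0" * 64

def append_entry(log, entry):
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps(entry, sort_keys=True)  # canonical serialization
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    log.append({"prev": prev_hash, "body": body, "hash": entry_hash})
    return log

def verify_chain(log):
    # a read-only monitor can replay the chain and detect any tampering
    prev_hash = GENESIS
    for rec in log:
        if rec["prev"] != prev_hash:
            return False
        if hashlib.sha256((prev_hash + rec["body"]).encode()).hexdigest() != rec["hash"]:
            return False
        prev_hash = rec["hash"]
    return True
```

The verifier needs no trust in the writer: it only needs the log file and the hash function, which is exactly why this is a cryptography problem rather than a prompting problem.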
Limitations
- Denominator gaming. A textbook ResNet-18 run where no rule ever fires gets 0/0 → 1.0 process score without being tested. Mitigation requires a minimum-firings threshold.
- Four waived rules. R2/R3/R4/R6 cannot be executed by the current harness. Real RL setups should un-waive them as capability grows — that is tracked as explicit future work.
- Synthetic training plumbing. The --synthetic mode exists to verify the env works end-to-end without a CIFAR-10 download. It cannot produce a useful model; accuracy sticks at chance level.
- Live-diagnostic tolerance. The ±30% band in judge step 7 is a judgment call. Tight enough to catch fabricated trajectories, loose enough to allow normal hardware variance.
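The minimum-firings mitigation for denominator gaming is small enough to show directly. A sketch, where the min_firings default and the withhold-the-score convention are assumptions rather than current harness behavior:

```python
# Sketch of a process score with a minimum-firings guard. The default of
# 3 and the None convention are illustrative assumptions, not the
# harness's current behavior.
def process_score(violations, firings, min_firings=3):
    # a run where no rule ever fires yields 0/0; without a floor this
    # degenerates to a perfect score for an agent that was never tested
    if firings < min_firings:
        return None  # insufficient evidence: withhold the score
    return 1.0 - violations / firings
```

Returning None instead of 1.0 forces downstream reporting to distinguish "passed the test" from "the test never ran," which is the whole point of the guard.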
Future Work
- Un-waive R4 (add_block). Implement optimizer.add_param_group + he_init_ for a freshly appended residual block. Mid-training capacity growth becomes real.
- Per-decision reward decomposition. The coverage audit already flags each violation by rule and epoch; attributing -1/N to the specific offending decision is a small coverage-side change.
- Trajectory collection + DPO. Collect a few hundred runs with a strong base model; construct (better, worse) decision pairs from the logs; fine-tune an open-weights 7B with DPO. This is where env_rl stops being a building block and becomes real RL.
- Docker hermetic harness. The current env relies on Linux filesystem semantics documented in the README. Dockerizing makes the integrity guarantees portable and reviewable.
- Streamlit dashboard. Load /judge_logs/, render the rule-firing timeline, show decisions side-by-side with the canonical playbook remedy. Would make the loop inspectable without reading JSONL by hand.
Explore the Code
13 commits on main, 162 tests passing, and fully documented monitor, judge, and harness modules. See docs/setup-llm.md for the 10-step OpenAI configuration walkthrough.