> note: this is the technical follow-up to the public post. this one is for the people who want the actual commands, the failed branches, the scheduler choices, the hardware, and the recovery logic.
# how i actually brought my ai back from the dead
the first post was the human-readable version. this is the one for the people who actually care how it worked.
what failed was a btrfs root filesystem. what almost disappeared with it was a persistent openclaw agent, kai, whose value lived far less in the base model than in the accumulated state around it: prompts, memory, identity files, transcripts, continuity, and all the small pieces of scaffolding that make a persistent agent feel like itself.
this post is the technical path from “btrfs cannot open the chunk tree” to “i can hold a live discord conversation with the recovered agent again.”
## the machines that mattered
### the broken machine
this was the source system, the thing i was trying to recover from. the relevant details were a btrfs root filesystem with subvolumes including `@`, `@home`, `@games`, and others, plus desktop hardware built around a ryzen 7 9800x3d and an rx 7900 xtx. once the disk was pulled out into the recovery flow, the target showed up externally as `/dev/sdb2` in a sabrent 1 tb usb enclosure.
### the recovery workstation: framework laptop 16
the second machine mattered more than i expected, and i want to be explicit about that because it genuinely changed what was feasible. this was not just a side laptop. once plugged in and under sustained load, it functioned as a real second compute node.
captured hardware details:
- framework laptop 16 (amd ryzen ai 300 series)
- hardware sku: `FRAGAMCP09`
- cpu: amd ryzen ai 9 hx 370 with radeon 890m
- 12 cores / 24 threads
- max boost: 5.15 ghz
- ram: ~62 gib usable of 64 gb installed
- swap: 128 gib zram
- internal storage: wd_black sn7100 2 tb nvme
- kernel: `6.19.11-1-cachyos`
- firmware: `03.04`
it stayed on ac power throughout the carving run, ran for about 13 hours wall time, and spent roughly 10 of those hours with all 24 threads pegged at 100% cpu.
that machine let me:

- keep the recovery target separate from the analysis machine
- run the carver hard without also trying to do everything else on the same box
- keep transcripts, logs, and parallel reasoning sessions alive while scans were running
- treat the recovery like a live engineering effort instead of a one-shot hail mary

if i post this publicly, i probably will tag framework because, honestly, the laptop earned its place in the story.
## the scheduler and tuning angle
the laptop was running cachyos, which means sched-ext was in play.
captured post-run defaults:
- cpufreq governor: `powersave`
- energy performance preference: `balance_performance`
- amd pstate mode: `active`
- default sched-ext scheduler: `scx_lavd`
an important nuance here is that on modern amd pstate, `powersave` does not mean “stuck slow.” it still boosts aggressively under load, and this box could still climb to 5.15 ghz.
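all of those defaults are readable straight from sysfs. this is a generic sanity check against the standard amd_pstate and cpufreq paths, not a captured command log from the run:

```bash
# standard amd_pstate / cpufreq sysfs locations -- illustrative, not from the incident logs
cat /sys/devices/system/cpu/amd_pstate/status                           # "active"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor               # "powersave"
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference  # "balance_performance"
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq               # max boost, reported in khz
```

the last line is the one that demystifies `powersave`: the advertised ceiling stays at the full boost clock, and amd_pstate in active mode will chase it under sustained load.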
during the heavy carving phases, i switched the sched-ext scheduler from `scx_lavd` to `scx_p2dq`. that was the right move for the job. `scx_lavd` is latency-oriented and great for interactive use, while `scx_p2dq` is the better fit for long-running, throughput-heavy, heavily parallel cpu work. that is exactly what this recovery became. there is some uncertainty about the exact moment of the switch, because i do not have a kernel log proving when it happened, but the recollection and supporting notes are solid enough that i am comfortable saying it was part of the recovery tuning story.
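i did not capture the exact invocation, so this is a sketch of the usual cachyos switch path, assuming the stock `scx.service` setup with its `/etc/default/scx` config file; if your install manages schedulers differently, adjust accordingly:

```bash
# sketch, not command history: on cachyos the active sched-ext scheduler is
# normally selected via SCX_SCHEDULER in /etc/default/scx plus the scx systemd unit
cat /sys/kernel/sched_ext/root/ops                                         # name of the currently loaded scheduler
sudo sed -i 's/^SCX_SCHEDULER=.*/SCX_SCHEDULER=scx_p2dq/' /etc/default/scx
sudo systemctl restart scx.service
cat /sys/kernel/sched_ext/root/ops                                         # confirm the switch took effect
```

the nice property of sched-ext here is that this swap is live: no reboot, no interruption to the carver, just a different scheduling policy from that point on.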
## the trigger, as best i currently understand it
the filesystem had gotten into severe metadata and inode pressure. at that point, an image-backed workaround was used to temporarily give btrfs breathing room again.
in plain terms, a `.img` file was created and effectively used as extra storage to get the system unstuck. and, to be fair, it did work in the short term. the machine came back.
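i do not have the original commands, but the generic shape of that maneuver is a file-backed loop device grafted in as an extra btrfs member. filenames and sizes below are made up for illustration:

```bash
# reconstruction of the general technique, not the actual command history.
# the backing file name and size are hypothetical.
truncate -s 32G /var/spillover.img             # create a sparse backing file
LOOP=$(sudo losetup -f --show /var/spillover.img)
sudo btrfs device add "$LOOP" /                # add the loop device to the root fs
sudo btrfs balance start -dusage=10 /          # let btrfs relocate into the new space
```

the lifecycle cost is visible right in the shape of it: the backing file typically lives on the same filesystem it was added to, and the loop device has to exist again before the filesystem can mount cleanly. miss either of those on a later boot and the failure shows up far from the original workaround.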
the problem was that it was a temporary maneuver with lifecycle cost, not a stable fix. neither that cost nor the cleanup path was made clear enough at the time. a few days later, on reboot, the filesystem failed hard with chunk tree and `open_ctree` errors.
that is the whole incident in miniature: something can be directionally helpful and still dangerously incomplete.
## first visible failure
the system failed on boot with errors including:
- `failed to read chunk tree: -2`
- `open_ctree failed: -2`
at that point this was clearly a low-level metadata problem, not a normal boot problem.
## phase 1: conventional recovery
the first phase was the obvious one. boot a live environment and see whether btrfs could still recover itself.
that included, at various points:
```bash
sudo mount -o ro,degraded,rescue=all /dev/sdb2 /mnt/old
sudo btrfs-find-root /dev/sdb2
sudo btrfs-find-root -a /dev/sdb2
sudo btrfs inspect-internal dump-super -a /dev/sdb2
sudo btrfs restore -i -v -t 1263568306176 -r 256 \
--path-regex '^(/|/var|/var/lib|/var/lib/openclaw(/.*)?)