AetherShell 1.5 and agentic-eval: making “the shell for AI agents” a measured claim
Every shell that wants to work with AI agents claims it was built for them. We decided to stop claiming it and start measuring it. So this release does two things: it introduces a brand-new open-source tool, agentic-eval, that scores any program for how well it works with AI agents — and it ships AetherShell 1.5, which uses that tool to harden and benchmark itself.
On a zero-to-ten score across four areas, AetherShell scores 9.6. Nushell scores 2.3, PowerShell 2.2, and the traditional Bash, Zsh, and Fish shells score 1.4.
| Shell | Score (0–10) |
|---|---|
| AetherShell | 9.6 |
| Nushell | 2.3 |
| PowerShell | 2.2 |
| Bash / Zsh / Fish | 1.4 |
You can reproduce it yourself with a single command: cargo run --example shell_agentic_eval --features real-tokens. The rest of this post explains how that number is built, and why we trust it.
agentic-eval: a yardstick for agentic programs
agentic-eval is a small, self-contained Rust library that scores a program — a command, a script, or any text an agent reads or writes — on the four things that decide an agent’s cost and trust:
- Token efficiency. The tokens an agent actually pays for: standing context, input, output, and retries, measured with real tokenizers and averaged over a session.
- Determinism. Is the output the same every run, so an agent can parse, cache, and compare it?
- Reliability. How often does it succeed, and when it fails, are the errors clear enough for the agent to fix itself?
- Safety. Of everything a program could damage, how much is gated behind approval under an agent policy?
It needs no live model, no network, and no special setup. It is just a repeatable measurement.
Nine measures in the first release
This first release of agentic-eval ships nine measures across those four areas. These five matter most at scale:
- Output scaling. How output grows as results get bigger — the cost of paging five hundred rows, not three.
- Prompt-cache efficiency. A stable prefix is paid for once and then reused cheaply. A mostly-stable prefix is about four times cheaper over a long session, and deterministic output is what makes that possible.
- Error quality. Grades how actionable an error is, not just whether it is.
- Reversibility. The share of dangerous actions that can be undone.
- Exfiltration risk. Flags programs that both read local data and have a way to send it out.
It works on real tools
Two things make agentic-eval useful out of the box. It ships with a classifier for about two hundred common command-line tools, so it knows that rm deletes, curl reaches the network, and sudo escalates — and it treats anything unknown as risky by default. And it describes itself: an agent can ask it for its full structure instead of reading documentation.
AetherShell 1.5: fewer tokens, smaller footprint
agentic-eval did not just grade AetherShell. It showed us where to improve it.
Smaller output
AetherShell’s agent output format, AECON, already prints column names once and factors out repeated values. This release adds prefix compression: when the values in a column share a common start, such as file paths, it writes that start once and removes it from every row. On path-heavy listings that cuts tokens by 44 to 69 percent, with no loss of information.
Less context up front
This was the big one. AetherShell’s agent server used to list all eleven hundred built-in commands every time an agent connected — about forty-nine thousand tokens sitting in the agent’s memory for the whole session. Now it offers just three discovery tools by default, which cuts that to about two hundred forty tokens, roughly two hundred times smaller. Safety checks are unchanged, and the full listing is still available when you want it.
Where the token savings come from
Against traditional shells, AECON uses about 2.8× fewer tokens. Against PowerShell, the gap depends on which output an agent parses — from about 1.4× for display-only tables up to three times for the standard JSON format, because JSON repeats every field name on every row while AECON names each column once.
| PowerShell output | vs AetherShell |
|---|---|
| Display table (not reliably parseable) | 1.4× |
| Compact JSON (hand-tuned) | 1.6× |
| Standard JSON (the common form) | 2.4 to 3.0× |
Security hardening
We audited AetherShell against the major security frameworks — CVE, NIST FIPS, MITRE ATT&CK, and CMMC — and fixed everything that was safe to automate:
- Zero known vulnerabilities in dependencies, down from seven.
- SHA-256 integrity checks in place of MD5, which can be forged.
- The native plugin loader is blocked by default in agent mode unless explicitly allowed.
- A network allowlist can restrict where the shell is allowed to connect.
- A strict FIPS mode enforces approved algorithms only.
The scorecard, in full
AetherShell leads on every measure. Its reliability score is seven rather than ten on purpose: the test set includes programs that fail, so the number stays honest.
| Shell | tokens | scaling | determinism | reliability | errors | safety | reversibility | Score |
|---|---|---|---|---|---|---|---|---|
| AetherShell | 10.0 | 10.0 | 10.0 | 7.0 | 10.0 | 10.0 | 10.0 | 9.6 |
| Nushell | 7.1 | 6.4 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 2.3 |
| PowerShell | 5.9 | 6.4 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 2.2 |
| Bash / Zsh / Fish | 3.6 | 2.6 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 1.4 |
Tokens, scaling, and safety are measured for every shell. Determinism, reliability, errors, and reversibility are measured on AetherShell and scored as zero for the others, because traditional shells have no stable output, no machine-readable errors, and no rollback. Every number runs in a small open-source crate, so it is reproducible.
Try it
# The cross-shell benchmark (real GPT-4 tokenizer)
cargo run --example shell_agentic_eval --features real-tokens
# The prefix token saving on path-heavy output
cargo run --example prefix_gain --features real-tokens
# Standing-context cost: compact discovery vs all tool specs
cargo run --example standing_context --features real-tokens
We think the right way to earn the title “the shell for AI agents” is to publish a number anyone can reproduce, and then keep improving it. Feedback and tough benchmarks are welcome.
Check out AetherShell on GitHub.
Per aspera ad astra.
Subscribe for new dispatches
Research updates, technical deep-dives, and announcements from the frontier of embodied AI — delivered to your inbox.
Check your inbox to confirm your subscription.
