Every other eval tool watches your agent after it runs. AGeval judges each step as it runs — against four layers of evaluation memory — and hands back a trustworthy verdict the agent can act on: warn, escalate, or block before bad output ever reaches a user.
One line to integrate · session.evaluate_step() for the verdict, import ageval.auto for zero-code capture
A realistic run: each step is scored against memory the instant it happens.
LangSmith, Langfuse, Braintrust and Arize all do the same thing: observe, ingest, score later. By the time you see the number, the bad output already shipped.
You learn a run was bad after it finished. The dashboard is an autopsy: useful for debugging, useless for stopping the failure that already happened.
A no-LLM, in-process verdict on every step against the agent's memory. The agent gets allow / warn / escalate / block in milliseconds — and can repair, route to review, or stop before output reaches a user.
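As a rough illustration of how an agent loop might act on the four verdicts, here is a minimal, self-contained sketch. It does not use the AGeval API: the `Action` enum, `handle_verdict`, and the log list are hypothetical names invented for this example, and the mapping of each verdict to a behavior is an assumption based on the description above.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical stand-in for the four verdicts named above.
    ALLOW = "allow"
    WARN = "warn"
    ESCALATE = "escalate"
    BLOCK = "block"


def handle_verdict(action, output, log):
    """Return the output to deliver, or None if it is withheld.

    Assumed policy: allow and warn deliver the output, escalate holds it
    for human review, block drops it.
    """
    if action is Action.ALLOW:
        return output
    if action is Action.WARN:
        log.append(f"warn: delivered with a warning: {output!r}")
        return output
    if action is Action.ESCALATE:
        log.append(f"escalate: held for review: {output!r}")
        return None
    log.append(f"block: dropped: {output!r}")
    return None
```

The point of the sketch is the control flow: because the verdict arrives before delivery, a step that is escalated or blocked never reaches the user.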
The verdict isn't a guess — it's scored against everything the agent has done before. The more it runs, the sharper the verdict gets.
Not toy demos. A fleet of real business agents — credit analysts pulling SEC EDGAR 10-Ks, pharmacovigilance bots scanning openFDA recalls, logistics planners hitting live transit data — plus 20 elaborate multi-step workflows, all scored end-to-end.
Real business processes — several live tool stages that feed each other, an LLM synthesis, and sometimes a real side-effect action. Each run is a ≥4-step trajectory scored against the golden path. Watch one play through its pipeline.
Deterministic reliability and efficiency metrics, three independent scorers (rules, LLM judge, custom), error classification, backtracking and token economy — computed on every episode, and ranked by what dragged the score so you see why.
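One way to picture "ranked by what dragged the score" is a weighted average in which each metric's shortfall is reported separately. The sketch below is an assumption, not AGeval's actual formula: the function name, the metric names, and the equal weights are all hypothetical, and scores are assumed to lie in [0, 1].

```python
def rank_score_drags(scores, weights):
    """Return (overall, ranked_drags) for per-metric scores in [0, 1].

    overall is the weighted mean of the scores. Each drag is the weighted
    shortfall from a perfect 1.0, so overall plus the sum of all drags
    equals 1.0. Drags are sorted from largest to smallest.
    """
    total_weight = sum(weights[m] for m in scores)
    overall = sum(scores[m] * weights[m] for m in scores) / total_weight
    drags = {m: weights[m] * (1.0 - scores[m]) / total_weight for m in scores}
    ranked = sorted(drags.items(), key=lambda item: item[1], reverse=True)
    return overall, ranked


# Hypothetical episode: three scorers with equal weight.
scores = {"rules": 1.0, "llm_judge": 0.5, "custom": 0.75}
weights = {"rules": 1.0, "llm_judge": 1.0, "custom": 1.0}
overall, ranked = rank_score_drags(scores, weights)
```

In this example the LLM judge accounts for the largest share of the lost score, so it would be listed first.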
Wrap your loop for tracing, or ask for a verdict mid-run.
from ageval import AgentSession
s = AgentSession(agent_id="credit_v1")
# ask BEFORE you run a step
v = s.evaluate_step("process_payment", {"amount": 4200})
if v.action == "escalate":
    route_to_human(v.explain())  # caught
Drop in one call, run your agent, and watch each step get a live verdict, then see the provenance behind every score in the dashboard.