In-the-loop evaluation · render a verdict mid-run

The autopilot for AI agents. Not a flight recorder.

Every other eval tool watches your agent after it runs. AGeval judges each step as it runs — against four layers of evaluation memory — and hands back a trustworthy verdict the agent can act on: warn, escalate, or block before bad output ever reaches a user.

One line to integrate · session.evaluate_step() for the verdict, import ageval.auto for zero-code capture

Live eval · credit_analyst_v1 · verdict stream (0/4 steps) · policy action

A real-shaped run — each step scored against memory the instant it happens.

142 real agents on live APIs
20 industry verticals
20 multi-step workflows
28 built-in metrics

The wedge

Stop reading post-mortems. Act mid-run.

LangSmith, Langfuse, Braintrust and Arize all do the same thing: observe, ingest, score later. By the time you see the number, the bad output already shipped.

Everyone else — flight recorder

Observe → ingest → score later

You learn a run was bad after it finished. The dashboard is an autopsy: useful for debugging, useless for stopping the failure that already happened.

run → finish → …minutes later… → score
AGeval — autopilot

Judge each step → act before it ships

A no-LLM, in-process verdict on every step against the agent's memory. The agent gets allow / warn / escalate / block in milliseconds — and can repair, route to review, or stop before output reaches a user.

step → verdict → act → step → …
Shadow-first by default — verdicts are advisory until you opt a policy into enforce mode. It can only ever make actions stricter, never looser.
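
As a rough illustration of the shadow-first, stricter-only rule, here is a minimal sketch. The `Action`, `Policy`, and `apply_policy` names are hypothetical and are not AGeval's API; the sketch only assumes that the four actions are ordered from least to most strict.

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    """Verdict actions, ordered from least to most strict (assumed ordering)."""
    ALLOW = 0
    WARN = 1
    ESCALATE = 2
    BLOCK = 3


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a minimum action plus an enforce flag."""
    floor: Action = Action.ALLOW
    enforce: bool = False  # shadow mode unless explicitly opted in


def apply_policy(verdict: Action, policy: Policy) -> tuple[Action, bool]:
    """Return (effective_action, is_binding).

    max() means the policy can raise strictness but never lower it.
    In shadow mode the action is reported but is not binding.
    """
    effective = max(verdict, policy.floor)
    return effective, policy.enforce
```

With this shape, a policy whose floor is `ESCALATE` turns a `WARN` verdict into `ESCALATE`, while a `BLOCK` verdict stays `BLOCK` no matter how permissive the policy is.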
Why the verdict is trustworthy

Four layers of evaluation memory

The verdict isn't a guess — it's scored against everything the agent has done before. The more it runs, the sharper the verdict gets.

Failure signatures
Clusters failures into named signatures and tracks recurrence — so a repeat of a known mistake is caught the instant it reappears.
Peer baselines
Every score and tool input is placed against the distribution of comparable runs, so a 100× outlier charge is flagged, not averaged away.
Golden paths
Mines the ideal tool sequence per task cluster, then warns the moment a run wanders off it. This catches runs that reach the right answer by the wrong path.
Drift & regression
Diffs an agent's recent runs against its baseline — score deltas, new failures, and new trajectory shapes, version over version.
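
To make the peer-baseline idea concrete, here is one way an outlier check could work. This is a sketch under assumptions, not AGeval's implementation: it uses a median/MAD modified z-score, a common robust statistic, and the `is_outlier` name and the 3.5 threshold are illustrative choices.

```python
from statistics import median


def is_outlier(value: float, peers: list[float], threshold: float = 3.5) -> bool:
    """Flag `value` if its modified z-score against `peers` exceeds `threshold`.

    Median and MAD are used instead of mean and standard deviation so that
    one extreme value cannot shift the baseline it is compared against.
    """
    med = median(peers)
    mad = median(abs(p - med) for p in peers)
    if mad == 0:
        # All peers are (nearly) identical: anything different stands out.
        return value != med
    modified_z = 0.6745 * (value - med) / mad
    return abs(modified_z) > threshold


# Hypothetical peer charges for similar runs, and a 100x charge.
peer_amounts = [40, 42, 45, 38, 41, 44, 39, 43]
suspicious = is_outlier(4200, peer_amounts)
typical = is_outlier(43, peer_amounts)
```

Here `suspicious` is `True` and `typical` is `False`: the 4200 charge sits far outside the peer distribution instead of being folded into an average.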
Proven on real traffic

142 real agents. 20 industries. Live APIs.

Not toy demos. A fleet of real business agents — credit analysts pulling SEC EDGAR 10-Ks, pharmacovigilance bots scanning openFDA recalls, logistics planners hitting live transit data — plus 20 elaborate multi-step workflows, all scored end-to-end.

The fleet · 142 agents live on real APIs
Support & Success · 8
Sales & CRM · 8
Marketing & Content · 7
Finance & Banking · 7
Insurance & Risk · 7
Healthcare & Clinical · 7
Pharma & Life Sciences · 7
Retail & E-commerce · 7
Logistics & Supply Chain · 7
Manufacturing & IoT · 7
Energy & Utilities · 7
Real Estate & PropTech · 7
Travel & Hospitality · 7
Legal & Compliance · 7
HR & Recruiting · 7
IT Ops / SRE · 7
Government & Public · 7
Education & EdTech · 7
Agriculture & Food · 7
Scientific R&D · 7
Across every framework
LangGraph
StateGraph · ReAct · human-in-the-loop
CrewAI
multi-agent crews
AutoGen
group chat
MCP
tools served over Model Context Protocol
OpenAI
function calling
Anthropic
Claude tool use
Beyond single calls

20 elaborate, multi-stage workflows

Real business processes — several live tool stages that feed each other, an LLM synthesis, and sometimes a real side-effect action. Each run is a ≥4-step trajectory scored against the golden path. Watch one play through its pipeline.
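
Scoring a trajectory against a golden path could look something like the sketch below. It assumes adherence is measured as the longest common subsequence between the run's tool sequence and the golden sequence, divided by the golden length. That definition, the function names, and the treatment of the property underwriting stages as a golden path are illustrative assumptions, not AGeval's documented method.

```python
from typing import Optional, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def golden_path_adherence(run: Sequence[str], golden: Sequence[str]) -> float:
    """Fraction of the golden path that the run covered, in order."""
    if not golden:
        return 1.0
    return lcs_length(run, golden) / len(golden)


def first_deviation(run: Sequence[str], golden: Sequence[str]) -> Optional[int]:
    """Index of the first step that differs from the golden path, else None.

    A run that is still a matching prefix of the golden path is not
    treated as a deviation, so this can be called mid-run.
    """
    for i, (got, want) in enumerate(zip(run, golden)):
        if got != want:
            return i
    return None


# Stage names borrowed from the property underwriting workflow on this page.
golden = ["geocode", "recent earthquakes", "air quality", "world bank", "synthesize"]
run = ["geocode", "air quality", "world bank", "synthesize"]
score = golden_path_adherence(run, golden)
deviated_at = first_deviation(run, golden)
```

In this example the run skips one stage, so `score` is 0.8 and `deviated_at` is 1, the step where a mid-run warning would fire.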

property underwriting
Insurance · 5 stages
geocode → recent earthquakes → air quality → world bank → synthesize
M&A diligence
Finance · 5 stages
sec facts → sec facts → world bank → crossref → synthesize
drug-safety triage
Pharma · 5 stages
openfda → clinical trials → crossref → synthesize → slack
incident response
Logistics · 5 stages
earthquakes → geocode → citybikes → synthesize → webhook
compliance monitor
Legal · 5 stages
fed register → fbi wanted → sec facts → synthesize → db write
supplier onboarding QA
Retail · 4 stages
open food facts → fakestore → fx → synthesize
offer builder
HR · 4 stages
remote jobs → fx → country → synthesize
dispatch planner
Energy · 4 stages
carbon → weather → fed register → synthesize
acquisition screen
Real Estate · 5 stages
geocode → earthquakes → air quality → sec facts → synthesize
incident triage
IT Ops · 5 stages
hacker news → geocode → carbon → synthesize → webhook
grant oversight
Government · 4 stages
usaspending → fed register → earthquakes → synthesize
spray & export
Agriculture · 5 stages
weather → air quality → gbif → country → synthesize
research dashboard
Science R&D · 5 stages
arxiv → crossref → iss → launches → synthesize
account expansion
Sales · 4 stages
country → fx → hacker news → synthesize
campaign launch
Marketing · 5 stages
wikipedia → profanity → short link → qr → synthesize
VIP escalation
Support · 4 stages
zip lookup → define → hacker news → synthesize
network add
Healthcare · 4 stages
npi lookup → clinical trials → openfda → synthesize
supplier risk
Manufacturing · 4 stages
sec facts → earthquakes → fx → synthesize
destination readiness
Travel · 4 stages
wikipedia → air quality → fx → synthesize
course adoption
Education · 4 stages
open library → define → crossref → synthesize

28 built-in metrics + your own

Deterministic reliability and efficiency metrics, three independent scorers (rules, LLM judge, custom), error classification, backtracking and token economy — computed on every episode, and ranked by what dragged the score so you see why.

tool_call_precision · goal_progress · reasoning_action_alignment · backtrack_rate · token_economy · error_recovery_speed · golden_path_adherence
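
One plausible reading of "ranked by what dragged the score" is a weighted average in which each metric's drag is its weighted shortfall from a perfect score. The sketch below assumes exactly that, with every metric normalized to [0, 1] where 1.0 is best. The `rank_score_drags` helper, the weights, and the sample values are hypothetical, not AGeval's formula or real results.

```python
def rank_score_drags(
    scores: dict[str, float], weights: dict[str, float]
) -> tuple[float, list[tuple[str, float]]]:
    """Return (overall_score, metrics sorted by how much each lowered it).

    The drag values sum to 1.0 - overall_score, so the ranking accounts
    for the entire gap between this episode and a perfect one.
    """
    total_weight = sum(weights[name] for name in scores)
    overall = sum(scores[name] * weights[name] for name in scores) / total_weight
    drags = sorted(
        ((name, weights[name] * (1.0 - scores[name]) / total_weight) for name in scores),
        key=lambda item: item[1],
        reverse=True,
    )
    return overall, drags


# Illustrative values only.
episode_scores = {"tool_call_precision": 0.9, "goal_progress": 0.5, "golden_path_adherence": 1.0}
equal_weights = {name: 1.0 for name in episode_scores}
overall, drags = rank_score_drags(episode_scores, equal_weights)
```

With these sample numbers the overall score is about 0.8 and `goal_progress` is ranked first, since it accounts for most of the shortfall.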

Integrate in one line

Wrap your loop for tracing, or ask for a verdict mid-run.

from ageval import AgentSession

s = AgentSession(agent_id="credit_v1")

# ask BEFORE you run a step
v = s.evaluate_step("process_payment",
                    {"amount": 4200})
if v.action == "escalate":
    # route_to_human: your own review hook (not defined in this snippet)
    route_to_human(v.explain())   # caught

Give your agents an autopilot.

Drop in one call, run your agent, and watch each step get a live verdict — then see the provenance behind every score in the dashboard.

Open the dashboard