This page compares runtime evidence workflows for .NET agents. The call and token estimates are rough planning figures for agent interaction cost and prompt context size.
The bash CLI remains the source of truth for running commands. GliderTrace keeps the resulting runtime evidence structured, bounded, stored, and reusable across later agent steps.

| Workflow | LLM + bash CLI | GliderTrace | Why it matters |
|---|---|---|---|
| Failing test triage<br>`trace_run_tests` | Calls: 1-3 test commands plus log filtering and manual reruns.<br>Tokens: High. Raw test output, stack traces, and unrelated build messages often pile up.<br>Correctness / safety: The test result is authoritative, but the agent must extract failures, artifacts, and useful stacks itself. | Calls: 1 MCP call to run tests and store stdout, stderr, TRX, and normalized findings.<br>Tokens: Low to medium. Compact failure summaries with artifact references.<br>Correctness / safety: The agent can act on the first useful failure without losing access to the full local evidence. | Most debugging time is spent separating the actual failure from surrounding command noise. |
| Local repro command<br>`trace_run`, `trace_get_session`, `trace_query_events` | Calls: Run the command, inspect output, then rerun or paste logs as needed.<br>Tokens: Medium to high. Stdout and stderr are unstructured and easy to over-paste.<br>Correctness / safety: Useful evidence can disappear from context when the agent moves to code inspection. | Calls: 1 call to run the workspace-scoped command, then focused session reads when needed.<br>Tokens: Low to medium. Session summaries and filtered events keep evidence reusable.<br>Correctness / safety: Stored sessions preserve exit code, stdout, stderr, and normalized runtime findings. | Runtime failures often need several edits; the evidence should survive beyond one terminal scrollback. |
| Long-running app or service<br>`trace_start`, `trace_stop` | Calls: Start a process, watch logs manually, stop it, and copy relevant output.<br>Tokens: High. Long-running logs include repeated status lines and unrelated messages.<br>Correctness / safety: Manual start/stop flows are easy to desynchronize from the evidence the agent sees. | Calls: Start a stored session, stop it later, then inspect the finalized evidence.<br>Tokens: Low to medium. The agent reads summaries and artifacts instead of full logs by default.<br>Correctness / safety: The session boundary records which process produced which evidence. | Service debugging needs a repeatable run boundary, not a loose terminal transcript. |
| CI or existing artifact review<br>`trace_import_artifacts`, `trace_import_otlp`, `trace_export` | Calls: Download artifacts, inspect files, paste selected snippets, and hand-build a summary.<br>Tokens: High. TRX, logs, counters, and traces are too large for direct context.<br>Correctness / safety: Important artifact references can be lost when only snippets are pasted. | Calls: Import artifacts into a session, then export an agent-ready summary.<br>Tokens: Low. Compact summaries reference local artifacts instead of inlining them.<br>Correctness / safety: The agent can reason from deterministic summaries while raw artifacts stay local by default. | CI evidence should be structured enough to guide a fix without flooding the model. |
| Runtime counters and traces<br>`trace_attach`, `trace_counters`, `trace_hotspots` | Calls: Run diagnostics CLIs, capture files, then summarize samples or counters manually.<br>Tokens: Medium to high. Diagnostic outputs and trace artifacts need filtering before they help.<br>Correctness / safety: The agent can confuse collection commands, artifact paths, and interpreted findings. | Calls: Collect or import bounded artifacts, then ask for counters or hotspot summaries.<br>Tokens: Low to medium. Summaries keep artifact identity attached to the finding.<br>Correctness / safety: Runtime evidence is tied to stored sessions and local artifact references. | Performance and runtime investigations need summaries, not raw binary artifacts in prompt context. |
| Before/after verification<br>`trace_compare_sessions`, `trace_export` | Calls: Run commands before and after, compare logs manually, and write a summary.<br>Tokens: High. The model must keep two noisy outputs in context at once.<br>Correctness / safety: Manual comparisons can miss persisting failures or introduce false positives. | Calls: Compare two stored sessions and export fixed, new, and persisting evidence.<br>Tokens: Low. Comparison output is already grouped for an agent.<br>Correctness / safety: The agent can distinguish fixed issues from new regressions before reporting success. | A fix is not done until the runtime evidence changed in the expected direction. |
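
The before/after row groups evidence into fixed, new, and persisting findings. The sketch below illustrates that grouping idea in plain Python. It is not GliderTrace's implementation or output format: the `Finding` shape, the `(kind, location)` identity key, and the `compare_findings` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One normalized runtime finding (hypothetical shape, not GliderTrace's schema)."""
    kind: str      # e.g. "test_failure"
    location: str  # e.g. a fully qualified test name
    message: str

    @property
    def identity(self):
        # Assumption: two findings are "the same issue" when kind and location
        # match, even if the message text differs between runs.
        return (self.kind, self.location)


def compare_findings(before, after):
    """Group findings from two runs into fixed, new, and persisting."""
    before_by_id = {f.identity: f for f in before}
    after_by_id = {f.identity: f for f in after}
    fixed_ids = before_by_id.keys() - after_by_id.keys()
    new_ids = after_by_id.keys() - before_by_id.keys()
    persisting_ids = before_by_id.keys() & after_by_id.keys()
    return {
        "fixed": [before_by_id[i] for i in sorted(fixed_ids)],
        "new": [after_by_id[i] for i in sorted(new_ids)],
        "persisting": [after_by_id[i] for i in sorted(persisting_ids)],
    }


# Hypothetical findings from a run before and a run after an attempted fix.
before = [
    Finding("test_failure", "OrderTests.Total_IsRounded", "Expected 10.50 but was 10.49"),
    Finding("test_failure", "CartTests.Empty_ReturnsZero", "NullReferenceException"),
]
after = [
    Finding("test_failure", "CartTests.Empty_ReturnsZero", "NullReferenceException"),
    Finding("test_failure", "OrderTests.Tax_IsApplied", "Expected 1.05 but was 0"),
]

result = compare_findings(before, after)

# The fix only counts as verified when nothing is new and nothing persists.
fix_verified = not result["new"] and not result["persisting"]
```

In this example one failure is fixed, one persists, and one is new, so `fix_verified` is `False`. Keying on `(kind, location)` rather than the full message is a design choice for the sketch: it keeps a finding in the persisting group even when its message text shifts between runs.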