Inside the Fackel harness: parallel agent lanes in a terminal

June 1, 2026 · 8 min read · 📚 ai-agents



Fackel is an AI-powered offensive security assistant. Under the hood it runs multiple specialist agents in parallel, streams their reasoning live, pauses for human approval before active scanning, and persists every finding into a queryable knowledge graph.

This post is about the terminal harness that makes that workflow usable. (For the brain behind it—the LangGraph orchestrator, ReAct agents, and the LLM-as-a-judge that routes between phases—see the earlier post.)

The Fackel harness: parallel agent lanes, a live pipeline stepper, and inline approval gates running in the terminal.

The idea worth building

The interesting thing about Fackel isn’t the command list. It’s this: several specialist agents reasoning at once, inside a single interactive terminal you stay in control of.

That sentence hides three hard problems. Concurrent agents fight over one terminal. A human approval step has to interrupt work that’s running on another thread. And a long agent run fills its context window, so you need to see that happening and do something about it. The harness is the layer that solves those three problems—everything below is how.

Why a worker thread and a queue, not async

The first decision is the concurrency model. A scan is long-running and spawns work of its own; the terminal has to stay responsive. The obvious reach is asyncio, but Fackel runs LangGraph in sync mode, where parallel specialists execute on threads, not coroutines. Bolting an event loop on top of that buys nothing and complicates cancellation.

So the harness uses a plain producer/consumer split:

 main thread                         worker thread (daemon)
┌─────────────┐    queue.Queue      ┌──────────────────────┐
│ prompt +    │◀───────────────────│ orchestrator.run()   │
│ Rich Live   │   (phase, type,     │ → parallel specialist│
│ (sole owner)│    data) events     │   threads emit events│
└─────────────┘                     └──────────────────────┘
       │  __approval__ / __done__ / __error__ / __cancelled__

A scan runs in a daemon worker thread. Every event—an agent’s reasoning token, a tool result, a phase boundary—is pushed onto a queue.Queue. The main thread drains that queue and is the sole owner of the Rich Live display and the prompt. This single-owner rule matters: Rich’s live region and prompt_toolkit both assume one writer. Letting worker threads draw to the terminal directly would interleave escape sequences and corrupt the frame. The queue is the seam that keeps rendering on one thread and work on another.
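The split above can be sketched with nothing but the standard library. This is a minimal illustration, not Fackel's code: run_scan stands in for orchestrator.run(), and the function names and event payloads are assumptions made for the example.

```python
import queue
import threading

DONE = "__done__"  # sentinel head, mirroring the naming in the diagram


def run_scan(events: queue.Queue, target: str) -> None:
    """Hypothetical stand-in for orchestrator.run(): emits (phase, type, data)."""
    for phase in ("osint", "vuln"):
        events.put((phase, "token", f"working on {target}"))
    events.put((DONE, "", None))


def drain(events: queue.Queue) -> list:
    """Main-thread loop: the only place that would touch the terminal."""
    rendered = []
    while True:
        try:
            phase, kind, data = events.get(timeout=0.1)
        except queue.Empty:
            continue  # a real harness could refresh the live display here
        if phase == DONE:
            return rendered
        rendered.append(f"[{phase}] {kind}: {data}")


def scan(target: str) -> list:
    events: queue.Queue = queue.Queue()
    worker = threading.Thread(target=run_scan, args=(events, target), daemon=True)
    worker.start()
    lines = drain(events)  # rendering stays on the calling thread
    worker.join()
    return lines
```

The worker never prints; it only enqueues. Everything the user sees passes through drain, which is what keeps a single writer on the terminal.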

The ContextVars line that makes it work

There’s one load-bearing detail. The streaming wiring (the event callback and the cooperative cancel flag) is bound through contextvars inside a run_session context manager. On the Python 3.12 that Fackel targets, a raw threading.Thread does not inherit its parent’s ContextVars; it starts with an empty context. The worker would start blind: no callback to emit through, no cancel flag to watch. The fix is a single copy_context() call:

ctx = contextvars.copy_context()
worker = threading.Thread(target=lambda: ctx.run(_worker), name="fackel-scan", daemon=True)

Copying the context after the run_session bindings are in place means the worker—and the specialist threads LangGraph spawns from it—all see the same callback and cancel flag. Miss this, and the screen stays empty while the scan runs fine in the dark.
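To see the difference in isolation, here is a small standalone sketch. The variable name callback_var and the helper functions are made up for the demonstration; only contextvars.copy_context() and Context.run() come from the snippet above.

```python
import contextvars
import threading

# Hypothetical variable standing in for the harness's event callback binding.
callback_var = contextvars.ContextVar("callback", default="unbound")


def observe(results: dict, key: str) -> None:
    """Record what this thread sees in the ContextVar."""
    results[key] = callback_var.get()


def demo() -> dict:
    results: dict = {}
    callback_var.set("bound")          # what run_session would do
    ctx = contextvars.copy_context()   # snapshot taken after the binding

    plain = threading.Thread(target=observe, args=(results, "plain"))
    copied = threading.Thread(target=ctx.run, args=(observe, results, "copied"))
    for thread in (plain, copied):
        thread.start()
    for thread in (plain, copied):
        thread.join()
    return results
```

On Python 3.12, demo() should report the default for the plain thread and the bound value for the thread started through ctx.run. Newer interpreters can be configured to inherit context in threads, so the plain case is the one to treat as version-dependent.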

Parallel specialists, and why the screen has lanes

OSINT and vulnerability scanning don’t run as one big agent. They fan out. Each phase dispatches a set of focused specialist sub-agents via LangGraph’s Send, and they run in parallel:

from dataclasses import dataclass

@dataclass(frozen=True)
class Specialist:
    """A focused OSINT sub-agent: a domain, a task focus, and its tools."""
    name: str
    focus: str
    tool_names: tuple[str, ...]

OSINT splits into specialists like dns_infra (DNS, WHOIS, ASN, IP reputation) and subdomains (enumeration, wildcard filtering, takeover detection, TLS SAN harvesting). Vuln scanning splits into surface, nuclei, web_injection, app_config, and more—each with its own tool subset and focus string. They execute concurrently and fan back in through dedicated merge channels in the shared state.
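The fan-out and fan-in shape can be illustrated without LangGraph. The sketch below swaps Send for a plain ThreadPoolExecutor, so it shows the pattern and not Fackel's actual dispatch; run_specialist, fan_out, and the tool names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    name: str
    focus: str
    tool_names: tuple


def run_specialist(spec: Specialist, target: str) -> dict:
    """Placeholder for a specialist agent run: one finding per tool."""
    return {
        "specialist": spec.name,
        "findings": [f"{tool}:{target}" for tool in spec.tool_names],
    }


def fan_out(specs: list, target: str) -> dict:
    """Run every specialist concurrently, then merge results by name."""
    with ThreadPoolExecutor(max_workers=len(specs)) as pool:
        results = list(pool.map(lambda s: run_specialist(s, target), specs))
    # Fan-in: each specialist writes to its own key, so merges never collide.
    return {r["specialist"]: r["findings"] for r in results}


SPECS = [
    Specialist("dns_infra", "DNS and WHOIS", ("dns_lookup", "whois")),
    Specialist("subdomains", "enumeration", ("enumerate",)),
]
```

Keying the merged state by specialist name plays the role the dedicated merge channels play in the shared state: parallel branches never write to the same slot.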

Concurrency is great for latency and terrible for a scrolling log: five agents emitting reasoning at once produces an unreadable interleave. So each specialist binds a lane before it streams:

with streaming.lane(name):
    streaming.emit("osint", "lane_start", {"name": name})
    ...

Every event carries that lane tag, and the renderer keeps per-lane state—one _LaneState per agent, never a shared buffer. Parallel specialists render in separate lanes (the table you see in the demo above); sequential phases collapse to a single main lane with the richer thinking-panel view. There’s one Rich Live area per phase; when the phase ends, finalized summaries are printed above the live region and the transient lane table vanishes. You watch the work happen, then keep a clean transcript of what happened.
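A rough sketch of how lane tagging and per-lane state could fit together, assuming the lane is carried in a ContextVar. The names lane, emit, LaneState, and render_state are illustrative stand-ins, not the harness's real streaming module or its _LaneState class.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass, field

_current_lane = contextvars.ContextVar("lane", default="main")


@contextmanager
def lane(name: str):
    """Bind a lane name for everything emitted inside the with-block."""
    token = _current_lane.set(name)
    try:
        yield
    finally:
        _current_lane.reset(token)


def emit(sink: list, phase: str, kind: str, data: dict) -> None:
    """Tag each event with whichever lane is bound in the current context."""
    sink.append((phase, kind, data, _current_lane.get()))


@dataclass
class LaneState:
    tokens: list = field(default_factory=list)
    tool_calls: int = 0


def render_state(events: list) -> dict:
    """Fold events into one LaneState per lane; no buffer is shared."""
    lanes: dict = {}
    for _phase, kind, data, lane_name in events:
        state = lanes.setdefault(lane_name, LaneState())
        if kind == "token":
            state.tokens.append(data["text"])
        elif kind == "tool_result":
            state.tool_calls += 1
    return lanes
```

Events emitted outside any lane fall back to "main", which matches the way sequential phases collapse to a single lane.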

The honest trade-off: per-tool approval (--approve-tools) forces the monolithic, non-parallel path. Parallel branches can’t coherently share a single approval interrupt stream, so asking to approve every tool call costs you the lanes. That’s a real limitation, not a bug—two features that genuinely conflict.

Inline approval gates without freezing the render loop

Before any active scanning, a human-in-the-loop gate pauses execution and shows you the discovered targets. The wrinkle: the worker thread decides it needs approval, but only the main thread can read keyboard input. The handshake is a small dataclass carrying a threading.Event:

import threading
from dataclasses import dataclass, field
from typing import Any

@dataclass
class _PendingApproval:
    data: dict[str, Any]
    kind: str  # "gate" | "tool"
    event: threading.Event = field(default_factory=threading.Event)
    result: Any = None

The worker enqueues an __approval__ item and blocks on box.event.wait(). The main thread sees it, tears down the live region, renders the gate (or the specific tool and its arguments), reads a yes/no with prompt_toolkit’s confirm shortcut (prompt_toolkit.shortcuts.confirm), writes the answer back into the box, and sets the event. The worker unblocks with its decision. The same mechanism serves both the phase-level gate and per-tool approval; kind is the only difference. Approval never blocks the render loop, because the render loop is what answers it.
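The handshake is easy to try end to end. In this sketch the ask callable replaces the interactive prompt so the example runs unattended; worker, main_loop, run, and the two-element queue items are simplifications assumed for the illustration.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class PendingApproval:
    data: dict
    kind: str  # "gate" | "tool"
    event: threading.Event = field(default_factory=threading.Event)
    result: Any = None


def worker(events: queue.Queue, outcome: list) -> None:
    box = PendingApproval(data={"targets": ["example.com"]}, kind="gate")
    events.put(("__approval__", box))
    box.event.wait()  # block here until the main thread answers
    outcome.append("scanned" if box.result else "skipped")
    events.put(("__done__", None))


def main_loop(events: queue.Queue, ask: Callable[[PendingApproval], bool]) -> None:
    while True:
        head, payload = events.get()
        if head == "__approval__":
            payload.result = ask(payload)  # real harness: render gate, read y/n
            payload.event.set()            # wake the worker
        elif head == "__done__":
            return


def run(ask: Callable[[PendingApproval], bool]) -> str:
    events: queue.Queue = queue.Queue()
    outcome: list = []
    thread = threading.Thread(target=worker, args=(events, outcome), daemon=True)
    thread.start()
    main_loop(events, ask)
    thread.join()
    return outcome[0]
```

Because the answer is written into the box before the event is set, the worker reads a settled result as soon as wait() returns.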

Cooperative cancellation, not a kill

Pressing Ctrl-C during a scan doesn’t kill the worker. It sets a cancel flag and keeps draining the queue:

except KeyboardInterrupt:
    cancel.set()
    self._console.print("stopping…")
    continue

The flag rides the same ContextVar the specialist threads inherited, so they notice it at their next checkpoint and unwind cleanly; the worker then reports __cancelled__ back through the queue. A hard kill would leave the LangGraph checkpoint and the persistence layer in an unknown state. Cooperative cancellation costs a moment of latency and buys a session you can trust afterward: the scan completes, fails, or cancels, and the REPL survives all three. At the prompt (not mid-scan), Ctrl-C just clears the line and Ctrl-D exits.
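A minimal sketch of the same idea, assuming the cancel flag is a threading.Event and each step of work begins with a checkpoint. The worker and run functions are hypothetical; the cancel_after parameter simulates the moment the user presses Ctrl-C.

```python
import queue
import threading
from typing import Optional


def worker(events: queue.Queue, cancel: threading.Event, steps: int) -> None:
    for step in range(steps):
        # Checkpoint: simulate a slice of work, waking early if cancel is set.
        if cancel.wait(timeout=0.01):
            events.put(("__cancelled__", step))
            return
        events.put(("progress", step))
    events.put(("__done__", steps))


def run(steps: int, cancel_after: Optional[int] = None) -> str:
    events: queue.Queue = queue.Queue()
    cancel = threading.Event()
    thread = threading.Thread(target=worker, args=(events, cancel, steps), daemon=True)
    thread.start()
    seen = 0
    while True:
        head, _ = events.get()
        if head in ("__done__", "__cancelled__"):
            thread.join()
            return head
        seen += 1
        if cancel_after is not None and seen == cancel_after:
            cancel.set()  # what the KeyboardInterrupt handler does; keep draining
```

The main loop never stops reading after it sets the flag, so the worker's final sentinel always has somewhere to land.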

The context meter

LLM context windows are finite, and a long scan with chatty tools fills them. The harness tracks a running token estimate from the streamed events and renders it as a compact bar in the bottom toolbar:

_COUNTED_EVENTS = frozenset({"token", "reasoning", "reasoning_trace", "tool_result", "summary"})

Crucially, it reuses the orchestrator’s own text_tokens estimator rather than inventing a second count, so the meter lines up with the trimming guard-rail the pipeline already enforces. /context breaks the total down per phase; /compact summarises prior findings into session memory and resets the live meter, so a multi-scan session doesn’t drag its full history forward.
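Here is a sketch of what such a meter could look like. The four-characters-per-token heuristic is an assumption standing in for the orchestrator's text_tokens estimator, and the ContextMeter class, its limit parameter, and the bar format are invented for the example.

```python
COUNTED_EVENTS = frozenset(
    {"token", "reasoning", "reasoning_trace", "tool_result", "summary"}
)


def estimate_tokens(text: str) -> int:
    """Assumed stand-in heuristic: roughly four characters per token."""
    return len(text) // 4


class ContextMeter:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.per_phase: dict = {}

    def observe(self, phase: str, kind: str, text: str) -> None:
        """Count only event types that actually end up in the model's context."""
        if kind in COUNTED_EVENTS:
            self.per_phase[phase] = self.per_phase.get(phase, 0) + estimate_tokens(text)

    @property
    def total(self) -> int:
        return sum(self.per_phase.values())

    def bar(self, width: int = 10) -> str:
        filled = min(width, width * self.total // self.limit)
        return "[" + "#" * filled + "-" * (width - filled) + f"] {self.total}/{self.limit}"

    def compact(self) -> None:
        """Reset the live meter, as /compact does after summarising."""
        self.per_phase.clear()
```

Keeping the count per phase is what makes a /context style breakdown cheap: the total is just the sum of the buckets.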

The session is a workbench

When a scan finishes, the harness doesn’t exit—it hands you back a prompt over everything that just happened. That’s the part you spend the most time in:

Command	What it does
`/scan <target> [--no-active] [--approve-tools]`	Run a scan
`/ask <question>`	Ask the last scan’s knowledge graph in natural language
`/scans` · `/diff <old> <new>`	List persisted scans · diff two of them (new / resolved / changed assets)
`/graph [scan_id]`	Export the knowledge graph as Mermaid
`/context` · `/compact`	Inspect the live token meter · summarise findings into session memory
`/model [provider] [name]`	View or switch the LLM (persists to `.env`)

Bare text with no leading slash is treated as /ask, because once you’ve run a scan the most common thing you want to do is interrogate it. A command error never crashes the REPL—every handler is wrapped, so a bad /diff argument prints a red line and returns you to the prompt instead of taking the session down.
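Both behaviours fit in a few lines of dispatch. This is a sketch under assumptions: the dispatch function, the HANDLERS table, and the error strings are hypothetical, and the real handlers print through Rich where this version simply returns a string.

```python
from typing import Callable, Dict


def ask(question: str) -> str:
    return f"asked: {question}"


def diff(args: str) -> str:
    parts = args.split()
    if len(parts) != 2:
        raise ValueError("need two scan ids")
    return f"diff {parts[0]} -> {parts[1]}"


HANDLERS: Dict[str, Callable[[str], str]] = {"/ask": ask, "/diff": diff}


def dispatch(line: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    line = line.strip()
    if not line:
        return ""
    if not line.startswith("/"):
        line = "/ask " + line  # bare text is treated as a question
    name, _, arg = line.partition(" ")
    handler = handlers.get(name)
    if handler is None:
        return f"error: unknown command {name}"
    try:
        return handler(arg.strip())
    except Exception as exc:  # a failing handler must not take the REPL down
        return f"error: {exc}"
```

Catching at the dispatch boundary means individual handlers can raise freely and still leave the session intact.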

A few smaller choices earn their place too:

/model persists. Switching provider or model writes the choice to .env, so it survives the session. When the new provider needs a key that isn’t set, the harness prompts for it without echo. It does so on a fresh PromptSession, so the password flag never leaks into the shared REPL prompt, where it would mask everything you type afterward. Better to ask for the key now than let the next scan fail on a missing credential.

Nerd Font with a fallback. Glyphs are Nerd Font by default; set FACKEL_NERD_FONT=0 and every glyph falls back to a width-1 ASCII symbol that keeps table columns aligned. The TUI shouldn’t break because your terminal lacks a patched font.

What I’d do differently

Per-tool approval shouldn’t cost the lanes. The cleanest fix is a single serialized approval channel that parallel branches can all funnel through, so you keep both human-in-the-loop granularity and the parallel view. It’s a real piece of work—coordinating one interrupt stream across fan-out branches—which is exactly why it isn’t done yet.

The queue protocol is stringly-typed. Events are (phase, type, data) tuples with sentinel heads like __done__ and __approval__. It works, but a small typed envelope would make the producer/consumer contract explicit and catch a malformed event at the boundary instead of deep in the renderer.

The token meter is an estimate. It reuses the orchestrator’s estimator for consistency, but it’s still a heuristic, not the provider’s real accounting. Reconciling it against actual usage from the trace would make the bar trustworthy enough to drive hard limits.
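For the queue protocol point above, a typed envelope might look something like this. It is a sketch of one possible shape, not a design the project has adopted: StreamEvent, Done, Failed, and parse_legacy are hypothetical names, and only the __done__ and __error__ sentinels are handled.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class StreamEvent:
    phase: str
    kind: str
    data: Any


@dataclass(frozen=True)
class Done:
    pass


@dataclass(frozen=True)
class Failed:
    error: str


Envelope = Union[StreamEvent, Done, Failed]


def parse_legacy(item: Any) -> Envelope:
    """Validate a (phase, type, data) tuple at the queue boundary."""
    if not (isinstance(item, tuple) and len(item) == 3):
        raise TypeError(f"expected a 3-tuple, got {item!r}")
    head, kind, data = item
    if head == "__done__":
        return Done()
    if head == "__error__":
        return Failed(str(data))
    if not isinstance(head, str) or head.startswith("__"):
        raise ValueError(f"unknown sentinel or phase: {head!r}")
    return StreamEvent(head, kind, data)
```

With a function like this at the consumer's edge, a malformed event fails at the boundary with a message that names it, and the renderer only ever sees one of three known types.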

Why the harness exists

Offensive security is iterative. You don’t run a scan and leave—you investigate, compare against last week, ask questions, refine a hypothesis, and scan again. Most of that loop happens between scans, in the part a one-shot tool throws away.

The whole point of Fackel’s cockpit is to make that loop feel natural while keeping the operator in control: agents do the parallel grunt work and stream their thinking, but you approve the active steps, watch the context budget, and decide what to chase next.

git clone https://github.com/flaviomilan/fackel.git
cd fackel && uv sync --python 3.12
cp .env.example .env   # set OPENAI_API_KEY

fackel                 # interactive harness

If that sounds useful, try it and let me know what breaks. Open source under Apache 2.0: github.com/flaviomilan/fackel.
