The problem with "agent"

The word "agent" has made AI engineering illegible.

When someone says "we built an agentic system," they could mean anything from a while loop in Python to a custom serving engine with modified attention kernels. The word collapses every layer of the stack into one meaningless term.

Most AI innovation announced right now amounts to new arrangements of client-side orchestration. The genuinely novel engineering — the kind that changes what's physically possible — happens below the network boundary, in the serving and inference layers almost no one talks about.

And at the bottom of everything: the model itself. Passive. Static. Numbers on disk. It doesn't run. It doesn't decide. It doesn't do anything. Everything that feels intelligent is happening in the layers above it.
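
That range has a real low end: an "agentic system" can be, literally, a while loop. A minimal sketch in Python, with the model stubbed by a hypothetical `fake_model`, a pure function standing in for a network request — note that every piece of state lives in the loop, none in the model:

```python
# A complete "agent": a loop around a stateless model call.
# fake_model is a hypothetical stand-in for an LLM API call; it is a
# pure function of its input, so all memory lives in the transcript.
def fake_model(transcript: str) -> str:
    if "TOOL_RESULT" in transcript:
        return "FINAL: done"
    return "CALL_TOOL: lookup"

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = task
    for _ in range(max_steps):
        reply = fake_model(transcript)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        # "tool use" is just the client running code and appending text
        transcript += "\nTOOL_RESULT: 42"
    return "gave up"
```

Swap `fake_model` for an HTTP call and this is, structurally, a large share of shipping "agentic systems."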

The full stack, precisely

Client side — your machine / CPU

User / application: browser, CLI, mobile app, parent process

Orchestration layer: LangChain, LlamaIndex, DSPy, Prefect, custom Python

Agent loop: AgentExecutor, ReActAgent, AutoGen, custom while loop

Prompt builder / context manager: PromptTemplate, ChatMemoryBuffer, MemGPT/Letta, Tiktoken

Tool dispatcher: ToolExecutor, function call parsers, MCP, custom registry
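
The dispatcher at the bottom of the client stack is unglamorous plumbing: parse the model's structured output, look a name up in a registry, run a local function, hand text back. A minimal sketch; the registry, the `dispatch` function, and the JSON call shape are all hypothetical, not any particular framework's API:

```python
import json

# Hypothetical tool registry: plain Python functions keyed by name.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and execute it locally."""
    call = json.loads(model_output)   # e.g. {"tool": "add", "args": {...}}
    fn = TOOLS[call["tool"]]
    result = fn(call["args"])
    return json.dumps({"tool_result": result})  # fed into the next prompt
```

The model never "uses" a tool; it emits text that happens to parse, and the client does the rest.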

Seam — network boundary

Transport: HTTP REST, gRPC, WebSockets · Payload: raw text or token IDs

Tokenizer: HuggingFace Tokenizers, Tiktoken, SentencePiece (client or server side)
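
Whatever the client stack looks like, at the seam it all collapses into one serialized request. A sketch of what actually crosses the wire, assuming an OpenAI-compatible completions endpoint of the kind serving engines such as vLLM expose; the URL and model name are placeholders:

```python
import json

# Everything above the seam ultimately reduces to a request like this.
# Shape follows the OpenAI-compatible completions schema; the URL and
# model name below are placeholders, not real endpoints.
def build_request(prompt: str) -> tuple[str, bytes]:
    url = "http://localhost:8000/v1/completions"
    body = json.dumps({
        "model": "placeholder-model",
        "prompt": prompt,      # raw text; the server tokenizes it
        "max_tokens": 128,
    }).encode("utf-8")
    return url, body
```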

Server side

Serving — server CPU

Serving engine: vLLM, SGLang, TGI, TensorRT-LLM

BlockManager · continuous batching · KV cache allocator · REST/gRPC endpoint
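
Continuous batching is the heart of that list: finished sequences leave the batch and queued requests join at any decode step, instead of the whole batch draining before new work starts. A toy scheduler sketching the idea, with the real memory management (block tables, KV cache) stripped out:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous batching over (request_id, tokens_needed) pairs.
    Each step decodes one token per running request; finished requests
    leave and waiting ones join mid-flight, with no batch barrier."""
    waiting = deque(requests)
    running = {}              # request_id -> tokens still to generate
    steps = 0
    while waiting or running:
        # admit queued requests as slots free up, every step
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        # one decode step: every running request emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # evicted immediately
        steps += 1
    return steps
```

With requests of lengths 3, 1, and 2 and a batch of two, this finishes in 3 steps; a static batcher would idle the short request's slot and take 5.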

Inference — server GPU

Inference engine: PyTorch, llama.cpp, TensorRT, ONNX Runtime

FlashAttention · Triton kernels · custom CUDA kernels
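
What those kernels accelerate is compact enough to write out. Scaled dot-product attention for a single head, in plain Python; FlashAttention and friends change how this gets computed (tiling, memory traffic), not what it computes:

```python
import math

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for one head; Q, K, V are lists of rows."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # each row sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```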

The model

LLM: passive. No execution. No state. No side effects. A mathematical function defined entirely by its weights. Everything above this line is infrastructure built around it.

GGUF · safetensors · PyTorch .bin — loaded into VRAM once, read-only at inference time
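
That passivity is literal: a model is weights consumed by a pure function. A toy illustration, with two parameters hardcoded where the real thing would map a multi-gigabyte weights file into VRAM; the shapes are fake, the statelessness is the point:

```python
import math

# "The model": numbers, nothing else. In a real stack these would be
# billions of values loaded read-only from a GGUF/safetensors file.
WEIGHTS = {"w": [0.5, -0.25], "b": 0.1}

def forward(x, weights=WEIGHTS):
    """A pure function of (input, weights): no state, no side effects.
    Same input, same output, forever. Everything dynamic lives above."""
    z = sum(wi * xi for wi, xi in zip(weights["w"], x)) + weights["b"]
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid
```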