The problem with "agent"
When someone says "we built an agentic system," they could mean anything from a while loop in Python to a custom serving engine with modified attention kernels. The word collapses every layer of the stack into one meaningless term.
Most of the AI innovation being announced right now amounts to new arrangements of client-side orchestration. The genuinely novel engineering, the kind that changes what's physically possible, happens below the network boundary, in the serving and inference layers almost no one talks about.
And at the bottom of everything: the model itself. Passive. Static. Numbers on disk. It doesn't run. It doesn't decide. It doesn't do anything. Everything that feels intelligent is happening in the layers above it.
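The claim is easy to see in code. Strip a "custom while loop" agent to its skeleton and it is client-side plumbing around an opaque completion call. The sketch below is a minimal illustration, not any framework's real protocol: `call_model` stands in for the network request to a serving engine (stubbed here with scripted replies so it runs), and the `TOOL:name:arg` convention is a made-up example of a tool-call format.

```python
# Minimal "agentic system": a while loop around an opaque model call.
# call_model is a stand-in for the network request to a serving engine;
# it is stubbed with canned completions so the sketch is runnable.

def make_stub_model(replies):
    """Return a fake model that yields scripted completions in order."""
    it = iter(replies)
    return lambda prompt: next(it)

def run_agent(call_model, question, tools, max_steps=5):
    """Drive the loop: prompt -> completion -> maybe a tool call -> repeat."""
    transcript = question
    for _ in range(max_steps):
        reply = call_model(transcript)
        if reply.startswith("TOOL:"):
            # Hypothetical convention: "TOOL:name:arg" requests a tool call.
            _, name, arg = reply.split(":", 2)
            result = tools[name](arg)
            transcript += f"\n{reply}\nRESULT: {result}"
        else:
            return reply  # plain text means the agent is done

# Usage: one tool, two scripted model turns.
tools = {"upper": str.upper}
model = make_stub_model(["TOOL:upper:hello", "Final answer: HELLO"])
print(run_agent(model, "Shout hello.", tools))  # prints "Final answer: HELLO"
```

Everything "agentic" here lives in the caller: the loop, the transcript, the tool registry. The model call is a black box.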
The full stack, precisely
User / application
Browser, CLI, mobile app, parent process
Orchestration layer
LangChain, LlamaIndex, DSPy, Prefect, custom Python
Agent loop
AgentExecutor, ReActAgent, AutoGen, custom while loop
Prompt builder / context manager
PromptTemplate, ChatMemoryBuffer, MemGPT/Letta, tiktoken
Tool dispatcher
ToolExecutor, function call parsers, MCP, custom registry
Seam — network boundary
Transport: HTTP REST, gRPC, WebSockets · Payload: raw text or token IDs
Tokenizer: Hugging Face Tokenizers, tiktoken, SentencePiece (client or server side)
Serving — server CPU
Serving engine
vLLM, SGLang, TGI, TensorRT-LLM
BlockManager · continuous batching · KV cache allocator · REST/gRPC endpoint
Inference — server GPU
Inference engine
PyTorch, llama.cpp, TensorRT, ONNX Runtime
FlashAttention · Triton kernels · custom CUDA kernels
The model
LLM
Passive. No execution. No state. No side effects. A mathematical function defined entirely by its weights. Everything above this line is infrastructure built around it.
GGUF · safetensors · PyTorch .bin — loaded into VRAM once, read-only at inference time
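The bottom layer's passivity can be made concrete. The sketch below is a toy, not any real architecture: the "model" is a frozen table of numbers, and the forward pass is a pure function from (weights, tokens) to a next token. Every loop and every piece of state lives in the caller, above the model line.

```python
# A toy "model": weights are inert data; forward is a pure function.
# Nothing here runs, decides, or remembers; the caller does all of that.

WEIGHTS = {"bigram": {(1, 2): 0.9, (2, 3): 0.8, (3, 0): 1.0}}  # frozen "on disk"

def forward(weights, tokens):
    """Pure: same weights + same tokens -> same next token, no side effects."""
    table = weights["bigram"]
    last = tokens[-1]
    # Greedy pick among known continuations of the last token; 0 if none.
    candidates = {nxt: p for (prev, nxt), p in table.items() if prev == last}
    return max(candidates, key=candidates.get) if candidates else 0

def generate(weights, prompt, steps):
    """The decode loop: all statefulness lives here, above the model."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(forward(weights, tokens))
    return tokens

print(generate(WEIGHTS, [1], 3))  # → [1, 2, 3, 0]
```

Swap the dict for tens of gigabytes of safetensors and the lookup for attention kernels, and the shape is the same: the weights never act, they are only read.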