Reference

Glossary

The vocabulary of modern AI, in plain English. Bookmark this page.

Token
The basic unit an LLM reads and writes: a chunk of text produced by a tokenizer, typically a word, a sub-word fragment, or a single character. Pricing and context limits are measured in tokens.
Embedding
A dense numeric vector that represents the meaning of a piece of text, image, or audio. Similar things have similar vectors.
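One common way to measure "similar vectors" is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a model and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # prints True
```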
Transformer
The neural network architecture, introduced in 2017 and built on self-attention, that powers nearly every modern LLM.
Attention
A mechanism that lets each token in a sequence weigh how much every other token matters when computing its representation.
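A minimal sketch of scaled dot-product attention in plain Python may make the "weighing" concrete. It assumes the same vectors serve as queries, keys, and values; a real transformer derives each from learned projections and runs many attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How strongly this token's query matches every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # New representation: weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy token vectors, used as queries, keys, and values alike (a simplification).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

Each row of `out` is a blend of all three input vectors, weighted by how well they match that row's token.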
Context window
The maximum number of tokens an LLM can consider at once when generating a response.
Temperature
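A sampling parameter that controls randomness by rescaling the model's token probabilities. 0 = always pick the most likely token (near-deterministic in practice); higher = more varied and less predictable output.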
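As a rough sketch, temperature divides the model's raw scores (logits) before they are turned into probabilities. The logits below are invented, and treating temperature 0 as "pick the top token" is a common convention assumed here rather than a universal rule.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-scoring token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # made-up scores for three tokens
low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the first token
high = softmax_with_temperature(logits, 2.0)  # flatter, so sampling varies more
```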
Top-p / nucleus sampling
A sampling method often used alongside or instead of temperature: sample only from the smallest set of most-likely tokens whose cumulative probability reaches p.
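The selection step can be sketched as follows, assuming the model's probabilities are already available as a list and that the set is cut off once the running total reaches or exceeds p. The numbers are invented.

```python
def nucleus(probs, p):
    """Return {token_index: renormalized probability} for the top-p set."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break  # smallest set reached; drop the long tail
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

probs = [0.5, 0.3, 0.15, 0.05]       # made-up distribution over four tokens
print(sorted(nucleus(probs, 0.9)))   # prints [0, 1, 2]
```

The next token would then be sampled from the returned distribution, so the least likely token (index 3 here) can never be chosen.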
Hallucination
When an LLM produces fluent text that is factually wrong or fabricated.
Fine-tuning
Continuing to train a pretrained model on a smaller, task-specific dataset to specialize its behavior.
RLHF
Reinforcement Learning from Human Feedback: aligning a model with human preferences by training a reward model on human comparisons of outputs, then optimizing the LLM against that reward model with reinforcement learning.
RAG
Retrieval-Augmented Generation — fetching relevant documents at query time and giving them to the LLM as context.
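A toy sketch of the retrieve-then-prompt flow is below. It swaps in simple word overlap for the embedding similarity a real system would use, and the documents, prompt wording, and function names are all hypothetical.

```python
def score(query, document):
    """Count shared words (a stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def build_prompt(query, documents, k=2):
    """Pick the k best-matching documents and place them ahead of the question."""
    top = sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Invented knowledge base for illustration.
docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
prompt = build_prompt("how long do refunds take", docs)
```

The resulting `prompt` string is what would be sent to the LLM; the unrelated office-hours document is left out.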
Vector database
A database optimized for similarity search over embeddings (e.g. Pinecone, Qdrant, Weaviate, or PostgreSQL with the pgvector extension).
Diffusion model
A generative model that creates data, most commonly images, by learning to reverse a gradual noising process, often guided by a text prompt.
Multimodal
A model that handles multiple input/output modalities — text, image, audio, video — in one system.
Agent
An LLM running in a loop with access to tools and a goal, capable of planning and taking actions.
Function calling
A model capability where the LLM outputs a structured tool invocation (name + JSON arguments) for your app to execute.
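On the application side, handling a tool invocation might look like the sketch below. The JSON shape, the `get_weather` tool, and the `TOOLS` registry are assumptions for illustration; each provider defines its own wire format.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"

# Only tools listed here may be invoked by the model.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(raw):
    """Parse the model's structured output and run the named tool."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Example of what a model might emit (assumed format).
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = execute_tool_call(model_output)
print(result)  # prints Sunny in Oslo
```

Looking tools up in an explicit registry, rather than calling whatever name the model produces, keeps the model from reaching code you did not intend to expose.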
Prompt injection
An attack where untrusted input contains instructions that override the developer's system prompt.
Inference
Running a trained model to produce an output. Cheap per call compared to training.
Parameter
A learnable weight inside a neural network. The largest LLMs are reported to have hundreds of billions to trillions of parameters.
Quantization
Compressing model weights to lower precision (e.g. from 16-bit to 4-bit) so the model runs faster and fits on smaller hardware, usually at the cost of a small quality loss.
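A minimal sketch of the idea, assuming simple symmetric quantization with one shared scale per group of weights; production schemes are more elaborate, and the weights below are invented.

```python
def quantize(weights, bits=4):
    """Map floats to small signed integers plus one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for signed 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.07]   # made-up weights
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)  # prints [2, -7, 5, 1]
```

Each restored value differs from the original by at most half of `scale`, which is the quality loss traded for storing 4 bits per weight.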