Reference

Glossary

The vocabulary of modern AI, in plain English. Bookmark this page.

Token: The atomic unit an LLM reads and writes, typically a sub-word chunk (sometimes a whole word or a single character) produced by a tokenizer. Pricing and context limits are measured in tokens.
Embedding: A dense numeric vector that represents the meaning of a piece of text, image, or audio. Similar things have similar vectors.
Transformer: The neural network architecture, introduced in 2017 and based on self-attention, that powers nearly every modern LLM.
Attention: A mechanism that lets each token in a sequence weigh how much every other token matters when computing its representation.
Context window: The maximum number of tokens an LLM can consider at once. The prompt and the response being generated both count toward it.
Temperature: A sampling parameter that controls randomness by rescaling token probabilities before sampling. 0 = always pick the most likely token (near-deterministic), higher = more varied and less predictable.
Top-p / nucleus sampling: An alternative or complement to temperature: sample only from the smallest set of most-likely tokens whose cumulative probability reaches p.
Hallucination: When an LLM produces fluent text that is factually wrong or fabricated.
Fine-tuning: Continuing to train a pretrained model on a smaller, task-specific dataset to specialize its behavior.
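RLHF: Reinforcement Learning from Human Feedback, a technique that aligns a model with human preferences by training a reward model on human comparisons of outputs and then optimizing the LLM against that reward model with reinforcement learning.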
RAG: Retrieval-Augmented Generation — fetching relevant documents at query time and giving them to the LLM as context.
Vector database: A database optimized for similarity search over embeddings (e.g. Pinecone, Qdrant, Weaviate, or PostgreSQL with the pgvector extension).
Diffusion model: A generative model that creates images (and, increasingly, audio and video) by learning to reverse a gradual noising process, often guided by a text prompt.
Multimodal: A model that handles multiple input/output modalities — text, image, audio, video — in one system.
Agent: An LLM running in a loop with access to tools and a goal, capable of planning and taking actions.
Function calling: A model capability where the LLM outputs a structured tool invocation (name + JSON arguments) for your app to execute.
Prompt injection: An attack where untrusted input contains instructions that override the developer's system prompt.
Inference: Running a trained model to produce an output. A single call costs far less than a training run, though total inference cost can add up at scale.
Parameter: A learnable weight inside a neural network. Frontier LLMs have hundreds of billions to trillions of parameters.
Quantization: Compressing model weights to lower precision (e.g. from 16-bit to 4-bit) so the model needs less memory and can run faster and on smaller hardware, usually with only a small quality loss.
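
Example: temperature and top-p in code

To make the Temperature and Top-p entries above concrete, here is a minimal Python sketch of how the two settings could shape which token gets picked next. It is an illustration, not how any particular LLM library implements sampling. The function names (apply_temperature, top_p_filter, sample_token), the four-word vocabulary, and the scores are all made up for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores (logits) into probabilities, rescaled by temperature."""
    if temperature <= 0:
        # Assumption for this sketch: treat 0 as greedy decoding,
        # i.e. put all probability on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token index using temperature scaling, then top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical toy vocabulary and scores, for illustration only.
vocab = ["cat", "dog", "fish", "rock"]
logits = [2.0, 1.0, 0.5, -1.0]

if __name__ == "__main__":
    for t in (0.0, 0.7, 1.5):
        picks = [vocab[sample_token(logits, temperature=t, top_p=0.9)] for _ in range(10)]
        print(f"temperature={t}: {picks}")
```

With temperature set to 0 the sketch always picks "cat", the highest-scoring token. Raising the temperature flattens the probabilities so lower-ranked words appear more often, while lowering top_p removes the least likely words from consideration entirely.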