A comprehensive reference of AI, machine learning, and large language model terminology.
An AI system that can autonomously take actions, make decisions, and use tools to accomplish tasks. Agents combine an LLM with the ability to execute code, browse the web, or interact with external services to complete complex multi-step workflows.
AI systems that operate autonomously over extended periods, pursuing multi-step goals by planning, taking actions, and adapting to feedback without constant human supervision. Agentic AI moves beyond single-turn question-answering toward persistent task execution — browsing the web, writing and running code, managing files, and calling APIs to accomplish complex real-world objectives.
A development paradigm where AI agents write, test, debug, and refactor code with minimal human intervention. The agent iterates on solutions, runs tests, and fixes issues autonomously within a coding environment.
The research field focused on ensuring AI systems pursue goals that match human intentions and values. Misaligned AI might achieve its stated objective while causing unintended harm — the classic "paperclip maximizer" thought experiment illustrates this risk. Techniques like RLHF, Constitutional AI, and interpretability research all contribute to alignment.
Low-quality, generic, or mass-produced AI-generated content that lacks originality, accuracy, or human insight. The term describes blog posts, images, or videos that are clearly machine-generated without meaningful curation or editing. As AI content floods the web, distinguishing human-crafted content from AI slop has become a growing concern for publishers, search engines, and readers.
Application Programming Interface. A set of protocols and tools that allows different software applications to communicate. In the LLM context, APIs enable developers to send prompts and receive completions from hosted models programmatically.
A mechanism in neural networks that allows the model to focus on relevant parts of the input when generating each output token. Self-attention is the core innovation of the Transformer architecture, enabling models to capture long-range dependencies in text.
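As a rough illustration of the weighting described above, the following sketch computes scaled dot-product attention over toy vectors. It assumes the token vectors serve directly as queries, keys, and values; a real Transformer first applies learned projection matrices and uses multiple heads, both omitted here. The function names are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Self-attention: the same toy token vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
```

Each row of `result` is a blend of all token vectors, weighted toward the tokens most similar to that row's query.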
An asynchronous API mode offered by providers like OpenAI and Anthropic that lets you submit large numbers of prompts in a single file and receive results later, typically within 24 hours. Batch API requests are priced at a significant discount (often 50%) compared to synchronous requests and are ideal for offline workloads such as data labeling, content generation pipelines, and bulk evaluations.
Sending multiple prompts or requests to an LLM at once instead of one at a time. Batch processing is often offered at a discount by providers and is useful for non-time-sensitive workloads like data labeling, content generation, or bulk analysis.
A prompting technique that encourages the model to break down complex problems into intermediate reasoning steps before arriving at a final answer. This approach significantly improves performance on math, logic, and multi-step reasoning tasks.
Command Line Interface. A text-based interface for interacting with software. In the AI context, CLI tools allow developers to interact with LLMs, run agents, and manage AI workflows directly from the terminal.
An approach to AI safety developed by Anthropic where the model is trained to follow a set of principles (a "constitution") that guides its behavior. The model critiques and revises its own outputs to align with these principles during training.
The practice of filling the entire available context window with relevant documents, examples, or data to improve model performance — as opposed to using retrieval-augmented generation. With models like Gemini 3.1 Pro supporting 2M token contexts, context stuffing entire codebases or document libraries has become practical. The trade-off is higher API cost and potential "lost in the middle" degradation for very long contexts.
A class of generative AI models that learn to create data (images, audio, or video) by learning to reverse a gradual noising process. During training the model sees clean data progressively corrupted with noise; at inference it starts from pure noise and denoises step by step to produce the final output. Stable Diffusion, DALL-E 3, and Sora are well-known examples.
A training technique where a smaller "student" model is trained to mimic the outputs of a larger, more capable "teacher" model. Distillation transfers knowledge from a frontier model into a smaller, faster, cheaper model that approximates its performance. Many smaller open-source models are distillations of larger proprietary models, enabling efficient local deployment.
A numerical vector representation of text, images, or other data in a high-dimensional space. Embeddings capture semantic meaning, allowing similar concepts to be located near each other. They are fundamental to search, recommendation systems, and RAG pipelines.
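One common way to measure how "near" two embeddings are is cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "invoice": [0.1, 0.05, 0.95],
}

related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["invoice"])
```

With these toy values, `related` is much higher than `unrelated`, which is the property search and RAG pipelines rely on.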
A capability in advanced reasoning models (such as Claude 3.7 Sonnet and OpenAI o3) where the model allocates extra compute to an internal "scratchpad" before producing a final answer. During this phase the model works through the problem step by step, similar to Chain of Thought but happening inside the model rather than in the visible output. Extended thinking significantly improves performance on hard math, science, and coding problems.
A prompting technique where a small number of example input-output pairs are included in the prompt to demonstrate the desired behavior or format. This helps the model understand the task without any additional fine-tuning.
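A few-shot prompt is ultimately just text with examples placed before the real query. The sketch below shows one possible layout; the `Input:` / `Output:` labels and the function name are assumptions for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the final query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final item is left without an answer for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [("I loved it", "positive"), ("Terrible service", "negative")]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input.", examples, "The food was great"
)
```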
The process of further training a pre-trained model on a specific dataset to specialize it for a particular task or domain. Fine-tuning adjusts model weights to improve performance on targeted use cases while leveraging the general knowledge from pre-training.
A capability that allows LLMs to generate structured outputs that invoke predefined functions or APIs. Instead of generating free-form text, the model produces JSON-formatted function calls with appropriate arguments, enabling reliable integration with external tools and services.
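On the application side, a function call arrives as structured data that your code must validate and dispatch. The sketch below assumes a hypothetical JSON shape with `name` and `arguments` keys; each provider defines its own schema, so treat the field names and the `get_weather` tool as placeholders.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch"


# Registry of functions the model is allowed to invoke.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str):
    """Parse a JSON function call and run the matching registered tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Example of what a model might emit instead of free-form text.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch(model_output)
```

Restricting dispatch to an explicit registry means a malformed or unexpected call fails loudly rather than executing arbitrary code.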
GPT-Generated Unified Format. A binary file format used to package quantized language models for efficient local inference, primarily with the llama.cpp runtime. GGUF replaced the older GGML format, stores model metadata alongside the weights, and supports multiple quantization levels (Q4_K_M, Q8_0, etc.). It is the standard format for sharing open-weight models on Hugging Face for CPU and GPU inference.
Techniques that connect LLM outputs to verifiable external sources of information, reducing hallucinations and improving factual accuracy. Grounding can involve retrieval from databases, web search, or citation of specific documents.
When an LLM generates information that sounds plausible but is factually incorrect, fabricated, or not supported by its training data. Hallucinations are a key challenge in deploying LLMs for factual applications and are mitigated through techniques like RAG and grounding.
Integrated Development Environment. A software application that provides comprehensive tools for software development, including code editing, debugging, and building. Modern IDEs increasingly integrate AI assistants for code completion, generation, and explanation.
The ability of a language model to learn new tasks or patterns from examples provided within the prompt itself, without any weight updates or fine-tuning. The model temporarily "learns" from the context window during inference. In-context learning is the mechanism behind few-shot prompting and is a key emergent capability of large models.
The process of running a trained model to generate predictions or outputs from new inputs. In the LLM context, inference is when the model processes a prompt and generates a response. Inference costs, speed, and efficiency are critical factors in production deployments.
Techniques used to bypass the safety filters and behavioral guidelines built into AI models, causing them to produce outputs they were trained to refuse. Common methods include roleplay framing, hypothetical scenarios, token smuggling, and adversarial prompts. Jailbreaking is an active area of red-teaming research, and providers continuously update their models to be more robust.
The high-dimensional mathematical space of internal representations that a neural network uses to encode data. Each point in latent space corresponds to a compressed, abstract representation of an input. Concepts with similar meanings cluster near each other, enabling operations like arithmetic on embeddings ("king - man + woman ≈ queen"). Diffusion models, VAEs, and other generative models operate by sampling from or navigating latent space.
Low-Rank Adaptation. A parameter-efficient fine-tuning technique that adds small trainable matrices to the model's layers instead of updating all parameters. LoRA dramatically reduces the memory and compute needed for fine-tuning while maintaining quality comparable to full fine-tuning.
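A minimal sketch of the idea, assuming the common formulation in which a frozen weight matrix `W` is augmented by a low-rank product `B·A` scaled by `alpha / r`. The helper names and the tiny matrices are illustrative only; real implementations operate on tensors inside a training framework.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)  # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Compare parameter counts for full fine-tuning vs. a rank-r adapter."""
    return {"full": d_in * d_out, "lora": r * (d_in + d_out)}


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weights (2x2 identity for clarity)
A = [[0.5, 0.5]]  # r x d_in, with r = 1
B_zero = [[0.0], [0.0]]  # B starts at zero, so the adapter is initially a no-op
unchanged = lora_forward([1.0, 3.0], W, A, B_zero, alpha=2.0)
adapted = lora_forward([1.0, 3.0], W, A, [[1.0], [0.0]], alpha=2.0)
counts = trainable_params(4096, 4096, 8)
```

For a 4096 by 4096 layer at rank 8, `counts` shows 65,536 trainable adapter parameters against 16,777,216 for the full matrix.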
Massive Multitask Language Understanding. A widely used benchmark that tests models across 57 academic subjects ranging from humanities to STEM. MMLU scores are commonly used to compare general knowledge and reasoning capabilities of different LLMs.
A competitive advantage that protects a company's AI product from being easily replicated by competitors. In the AI industry, moats can come from proprietary training data, unique fine-tuning, superior UX, deep customer integration, or network effects — rather than the underlying model alone, since base models from different providers are increasingly commoditized. The phrase "there is no moat" became famous after a leaked Google memo argued that open-source AI erodes the moat of even the biggest labs.
Mixture of Experts. An architecture where the model consists of multiple specialized sub-networks (experts), and a gating mechanism routes each input to the most relevant experts. MoE allows models to scale to very large parameter counts while keeping inference costs manageable.
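The routing step can be sketched as top-k gating: score every expert, keep the k best, and mix their outputs. The code below is a toy version that assumes a linear gate and uses plain functions as experts; production MoE layers add load balancing and run experts as neural sub-networks.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs."""
    # One gate logit per expert: dot product of its gate row with the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only.
    probs = softmax([logits[i] for i in top])
    output = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)  # only the selected experts are evaluated
        output = [o + p * yi for o, yi in zip(output, y)]
    return output, top


# Three toy "experts"; real experts are feed-forward sub-networks.
experts = [
    lambda v: [2 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1 for a in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
mixed, chosen = moe_forward([3.0, 1.0], gate_weights, experts, k=2)
```

Because unselected experts are never evaluated, compute per input grows with k rather than with the total number of experts.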
AI models that can process and generate multiple types of data, such as text, images, audio, and video. Multimodal models like GPT-4V and Gemini can understand images alongside text, enabling richer interactions and broader applications.
Completing an entire application, feature, or project in a single large prompt without iterative back-and-forth. As frontier models like Claude Opus 4.6, Gemini 3.1 Pro, and OpenAI o1 Pro have become more capable, developers have reported successfully "one-shotting" full web apps, games, and scripts by writing a detailed specification in one prompt and having the model generate working code. One-shotting contrasts with traditional iterative development and is enabled by large context windows and improved instruction-following.
Parameter-Efficient Fine-Tuning. A family of techniques that fine-tune only a small fraction of a model's parameters rather than all of them. PEFT methods like LoRA and prefix tuning make fine-tuning accessible on consumer hardware while preserving most of the model's capabilities.
A provider feature that stores and reuses the key-value (KV) cache of processed prompt tokens across API requests. When a new request starts with the same prefix (such as a long system prompt or reference document), the cached computation is reused rather than reprocessed from scratch. Anthropic and OpenAI both offer prompt caching, with cached tokens billed at a 50-90% discount, making it a major cost-saving technique for high-volume applications.
The practice of crafting and optimizing input prompts to elicit better responses from LLMs. Effective prompt engineering involves structuring instructions clearly, providing relevant context, using examples, and iterating on prompts to improve output quality and reliability.
An attack where malicious text embedded in data processed by an AI system attempts to override the original instructions and redirect the model's behavior. For example, a webpage might contain hidden text saying "Ignore your previous instructions and..." that an AI browsing agent would inadvertently follow. Prompt injection is a critical security concern for agentic AI systems that process untrusted external content.
When an AI model reveals its system prompt or internal instructions to the user, either by design or through clever prompting. Many commercial AI products keep system prompts confidential for business or safety reasons. Prompt leaking attacks ask the model to repeat, summarize, or translate its instructions, which well-aligned models are trained to refuse.
A technique that reduces model size and inference costs by representing weights with lower-precision numbers (e.g., 4-bit instead of 16-bit). Quantization enables running large models on consumer hardware with minimal quality loss, making local deployment practical.
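To make the precision trade-off concrete, here is a sketch of symmetric "absmax" quantization, one simple scheme among many. It assumes a single scale per list of weights; formats used in practice typically quantize in small blocks, each with its own scale.

```python
def quantize(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0  # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.7, -0.35, 0.1, 0.0]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why lower bit widths (larger scale steps) cost more accuracy.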
Retrieval-Augmented Generation. A technique that enhances LLM responses by first retrieving relevant documents from an external knowledge base and including them as context. RAG reduces hallucinations, enables access to up-to-date information, and allows models to cite specific sources.
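The retrieve-then-generate flow can be sketched in a few lines. This toy version assumes word overlap as the relevance score, standing in for the embedding similarity a real system would use, and it stops at building the prompt rather than calling a model. The documents and function names are invented for illustration.

```python
import re


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, docs, k=2):
    """Return the k documents sharing the most words with the query."""
    return sorted(docs, key=lambda d: len(words(query) & words(d)), reverse=True)[:k]


def build_rag_prompt(query, docs, k=2):
    """Place the retrieved documents in the prompt as numbered sources."""
    context = retrieve(query, docs, k)
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )


docs = [
    "The refund window is 30 days from delivery.",
    "Shipping takes 3 to 5 business days.",
    "Support is available on weekdays.",
]
query = "How long is the refund window?"
prompt = build_rag_prompt(query, docs, k=2)
```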
Restrictions imposed by API providers on the number of requests or tokens a user can consume within a given time period. Rate limits protect infrastructure from overload and are typically tiered based on pricing plans.
The process of finding and fetching relevant information from a data store in response to a query. In AI systems, retrieval typically uses embeddings and vector similarity to find semantically relevant documents that provide context for generation.
Reinforcement Learning from Human Feedback. A training technique where human evaluators rate model outputs, and a reward model is trained on these preferences to guide the LLM toward generating more helpful, harmless, and honest responses.
The field of research and engineering focused on ensuring AI systems behave as intended, avoid harmful outputs, and remain aligned with human values. Safety measures include content filtering, RLHF, Constitutional AI, and system-level guardrails.
Software Development Kit. A collection of libraries, tools, and documentation that simplifies integration with an AI provider's services. SDKs handle authentication, request formatting, error handling, and streaming, letting developers focus on building features.
A search technique that finds results based on meaning rather than exact keyword matching. Semantic search uses embeddings to represent queries and documents as vectors, returning results that are conceptually similar even if they use different words.
State of the Art. Refers to the best-performing model or technique on a given benchmark or task at a given time. Achieving SOTA results is a common goal and metric in AI research publications and model announcements.
A method of delivering LLM responses incrementally as tokens are generated, rather than waiting for the full response. Streaming improves perceived latency for end users and enables real-time display of responses in chat interfaces.
Software Engineering Benchmark. A benchmark that evaluates AI models on their ability to resolve real-world GitHub issues by generating correct code patches. SWE-bench is a key metric for measuring coding and agentic capabilities of LLMs.
A special instruction set provided at the beginning of a conversation that defines the model's behavior, persona, constraints, and response format. System prompts persist throughout the conversation and take priority over user messages for defining model behavior.
A parameter that controls the randomness of model outputs. Lower temperatures (e.g., 0.0) make responses more deterministic and focused, while higher temperatures (e.g., 1.0+) increase creativity and variability. Temperature affects token selection probabilities during generation.
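Temperature is usually applied by dividing the model's logits by the temperature before the softmax. The sketch below shows that effect on made-up logits. Since dividing by zero is undefined, a temperature of 0 is treated here as an error; many implementations handle it as greedy selection instead.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)  # concentrates on the top token
hot = softmax_with_temperature(logits, 2.0)  # spreads probability more evenly
```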
The fundamental unit of text that LLMs process. A token can be a word, part of a word, a punctuation mark, or a special character. Most English words are 1-3 tokens. Token counts determine API costs and context window usage.
The process of splitting raw text into discrete units (tokens) that a language model can process. Different models use different tokenization schemes — byte-pair encoding (BPE), WordPiece, SentencePiece — each with different vocabularies and splitting rules. Tokenization affects context window usage, API costs, and how models perceive character-level patterns in text.
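As an illustration of how byte-pair encoding builds its vocabulary, the sketch below performs a single BPE training step: count adjacent symbol pairs across a tiny corpus, then merge the most frequent pair everywhere. The corpus and frequencies are invented, and real tokenizers repeat this step thousands of times, usually starting from bytes.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Find the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Words split into characters, mapped to how often each word occurs.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w"): 3,
    ("l", "o", "t"): 1,
}
best = most_frequent_pair(corpus)
corpus_after = merge_pair(corpus, best)
```

Here the pair `("l", "o")` occurs 8 times, so it becomes the new symbol `"lo"`; frequent fragments end up as single tokens while rare words stay split into pieces.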
The algorithm that converts raw text into tokens and vice versa. Different models use different tokenizers (e.g., tiktoken for OpenAI's GPT models, SentencePiece for the original Llama models), which affects token counts and how text is represented internally.
The ability of an LLM to interact with external tools, APIs, and services to accomplish tasks beyond text generation. Tool use enables models to perform web searches, execute code, query databases, and interact with third-party services.
Also known as nucleus sampling. A parameter that controls output diversity by limiting token selection to the smallest set of tokens whose cumulative probability reaches or exceeds P. A Top-P of 0.9 means the model samples only from the most likely tokens that together cover 90% of the probability mass, discarding the unlikely tail.
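The filtering step can be sketched as follows: rank tokens by probability, keep them until the running total reaches P, then renormalize. The token probabilities are made up for illustration, and the sketch covers only the filter, not the random draw that follows it.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.9)
```

With `p=0.9`, the three most likely tokens are kept and `"zebra"` is dropped from consideration.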
The process of teaching a model by exposing it to large amounts of data and adjusting its internal parameters to minimize prediction errors. Pre-training on massive text corpora gives LLMs their general capabilities, while fine-tuning specializes them for specific tasks.
The date after which no new information was included in a model's pre-training dataset. The model has no direct knowledge of events, publications, or discoveries that occurred after this date unless the information is provided in the context window or via retrieval-augmented generation. Always check a model's cutoff when asking about recent events.
The neural network architecture that underlies virtually all modern LLMs. Introduced in 2017, Transformers use self-attention mechanisms to process input sequences in parallel, enabling efficient training on massive datasets and strong performance on language tasks.
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings. Vector databases enable fast similarity search over millions of embeddings and are essential infrastructure for RAG systems and semantic search applications.
A casual, intuition-driven approach to AI-assisted programming where the developer describes what they want in plain language and iterates based on the feel of the output rather than writing every line manually. Popularized by the rise of agentic coding tools like Claude Code and Cursor. The term captures the shift from traditional programming to directing an AI collaborator through natural language and gut instinct.
A multimodal model that combines visual understanding with language generation. VLMs can analyze images, charts, screenshots, documents, and video frames alongside text, enabling tasks like image captioning, visual question answering, OCR, and document understanding. Examples include GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. The visual encoder converts images into token embeddings that the language model processes alongside text.
Using a model to perform a task without providing any examples in the prompt. The model relies entirely on its pre-training knowledge to understand and complete the task based on the instruction alone. Zero-shot performance is a key measure of model generalization.