Well, it has been a little over two years since I first published this AI Glossary. Amazing how time flies. The field has moved fast enough that significant portions needed a ground-up rewrite. LLMs, agentic workflows, and retrieval-augmented generation have gone from research curiosities to production infrastructure.
This glossary is designed for executives, operators, and practitioners who need to understand AI terminology clearly without wading through academic papers. Definitions are practical and grounded in how these concepts actually show up in production systems.
A
Agents / Agentic AI. AI systems that can take sequences of actions autonomously to accomplish a goal. Unlike a chatbot that responds to a single prompt, an agent plans, executes tools, evaluates results, and adjusts. Modern agents built on LLMs can write code, run commands, browse the web, and interact with APIs. The key distinction from traditional automation: agents handle ambiguity and can recover from unexpected results.
Attention mechanism. The core architectural innovation behind transformers (and therefore modern LLMs). Attention allows a model to weigh the relevance of each token in a sequence relative to every other token when generating a prediction. The phrase "pay attention to what matters" is close to a literal description of what these mechanisms do, at scale, across billions of parameters.
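For readers who want to see the weighting in action, here is a minimal sketch of scaled dot-product attention in plain Python. The three 2-dimensional token vectors are made up for illustration, and real models also apply learned projection matrices and multiple attention heads, which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Hypothetical token vectors, reused as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
```

Each row of `attended` is a blend of all three token vectors, weighted by how strongly that token's query matches each key.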
Artificial intelligence (AI). Software systems that perform tasks typically associated with human intelligence: reasoning, pattern recognition, language understanding, and decision-making. The term is broad. When people say "AI" today they usually mean machine learning, and specifically large language models.
Artificial neural network (ANN). A computational architecture loosely inspired by the human brain. Layers of interconnected nodes (neurons) transform input data through weighted connections. Training adjusts those weights so the network produces correct outputs. Deep neural networks have many layers; hence "deep learning."
Augmented generation. See Retrieval-Augmented Generation (RAG).
Autoregressive model. A model that generates output one token at a time, where each token is conditioned on all previously generated tokens. GPT-style models are autoregressive. The model does not plan the full output before starting — it commits to each token sequentially.
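The token-by-token loop can be sketched with a toy lookup table. Everything in `NEXT_TOKEN_PROBS` is invented for illustration, and the sketch simplifies in one important way: it conditions only on the last token, whereas a real LLM conditions on the entire sequence so far.

```python
# Hypothetical next-token table: last token -> candidate next tokens with probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"model": 0.7, "token": 0.3},
    "a": {"model": 1.0},
    "model": {"generates": 0.8, "<end>": 0.2},
    "token": {"<end>": 1.0},
    "generates": {"text": 0.6, "tokens": 0.4},
    "text": {"<end>": 1.0},
    "tokens": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Greedy autoregressive generation: commit to one token, then repeat."""
    sequence = ["<start>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[sequence[-1]]
        next_token = max(candidates, key=candidates.get)  # pick the most likely token
        if next_token == "<end>":
            break
        sequence.append(next_token)  # committed; never revised later
    return sequence[1:]

generated = generate()  # ['the', 'model', 'generates', 'text']
```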
B
Backpropagation. The algorithm used to train neural networks. During training, the network makes a prediction, the error is calculated (the loss), and backpropagation computes how much each weight contributed to that error. Weights are then adjusted in the direction that reduces the loss. This process repeats across millions of examples.
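To make the predict, measure, attribute, adjust cycle concrete, here is a minimal sketch on a two-weight "network" with a single training example. The network shape, starting weights, learning rate, and target are all assumptions chosen to keep the arithmetic readable; real networks have nonlinear activations and millions of examples.

```python
def forward(w1, w2, x):
    """Two-layer scalar network: hidden = w1 * x, prediction = w2 * hidden."""
    hidden = w1 * x
    prediction = w2 * hidden
    return hidden, prediction

def backward(w1, w2, x, target):
    """Return the loss and the gradient of the loss with respect to each weight."""
    hidden, prediction = forward(w1, w2, x)
    loss = (prediction - target) ** 2
    # Chain rule, applied from the output back toward the input.
    d_prediction = 2 * (prediction - target)
    d_w2 = d_prediction * hidden
    d_hidden = d_prediction * w2
    d_w1 = d_hidden * x
    return loss, d_w1, d_w2

w1, w2 = 1.0, 1.0
learning_rate = 0.1
losses = []
for _ in range(50):
    loss, d_w1, d_w2 = backward(w1, w2, x=1.0, target=2.0)
    losses.append(loss)
    # Move each weight in the direction that reduces the loss.
    w1 -= learning_rate * d_w1
    w2 -= learning_rate * d_w2
```

After the loop, `losses` shrinks toward zero as the product `w1 * w2` approaches the target.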
BERT (Bidirectional Encoder Representations from Transformers). A transformer model from Google that reads text in both directions simultaneously. Unlike GPT-style models (which predict the next token), BERT is optimized for understanding context. Common use: embeddings, classification, and search.
Bias (model). A systematic error in model outputs caused by patterns in training data or model design. Types include: historical bias (data reflects past discrimination), measurement bias (data collection errors), and representation bias (underrepresentation of certain groups). Bias is distinct from variance and from the informal meaning of "unfair prejudice," though the outcomes can overlap.
Benchmark. A standardized test used to evaluate model performance on a specific capability. Common LLM benchmarks include MMLU (knowledge), HumanEval (code), and HellaSwag (commonsense reasoning). Benchmarks have known limitations — models can be trained to score well without improving on the underlying capability.
C
Chain-of-thought (CoT) prompting. A prompting technique where the model is asked to reason step-by-step before answering. "Think through this problem before giving your final answer." CoT prompting reliably improves performance on math, logic, and multi-step reasoning tasks. It is an emergent capability of large enough models.
Classification. A supervised learning task where the model assigns inputs to discrete categories. Binary classification (spam/not-spam). Multi-class classification (news topic: sports/business/tech/politics). Classification models output probabilities for each class.
Clustering. An unsupervised learning approach that groups similar data points without predefined labels. K-means, hierarchical clustering, and DBSCAN are common algorithms. Useful for customer segmentation, anomaly detection, and data exploration.
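As an illustration, here is a bare-bones k-means on one-dimensional data. The data points and starting centroids are hypothetical (think order values for two customer groups), and production code would normally use a library implementation with smarter initialization.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical order values with two obvious groups and no labels.
order_values = [12, 15, 14, 80, 85, 90]
centroids, clusters = k_means_1d(order_values, centroids=[10.0, 100.0])
```

No labels were supplied; the two groups fall out of the distances alone.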
Context window. The maximum amount of text an LLM can process in a single interaction, counting the input and the output together. GPT-3.5 launched with a 4,096-token window (~3,000 words). Current models range from 128K to over 1M tokens. Larger context windows enable longer documents, multi-turn reasoning, and in-context learning across entire codebases.
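Because input and output share the same window, a quick budget check is often the first thing a production integration needs. The sketch below is a simplification with hypothetical function names; real APIs may also cap output length separately.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def max_output_budget(prompt_tokens, context_window):
    """Tokens left for the response after the prompt is counted."""
    return max(0, context_window - prompt_tokens)

# With a 4,096-token window, a 3,800-token prompt leaves little room to answer.
ok = fits_in_context(3500, 500, 4096)        # True: 4,000 <= 4,096
too_long = fits_in_context(3800, 500, 4096)  # False: 4,300 > 4,096
remaining = max_output_budget(3800, 4096)    # 296
```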
Convolutional neural network (CNN). A neural network architecture optimized for grid-like data, particularly images. Convolutional layers learn spatial patterns (edges, textures, shapes) by scanning small filters across the input. CNNs dominated computer vision before transformers began competing in that space.
D
Data augmentation. Techniques that artificially expand a training dataset by creating modified versions of existing examples. For images: rotation, flipping, cropping, color adjustment. For text: back-translation, synonym replacement, paraphrasing. Reduces overfitting and improves model robustness.
Deep learning. Machine learning using neural networks with many layers (hence "deep"). The depth allows the network to learn hierarchical representations — pixels become edges become shapes become objects. Deep learning powers modern computer vision, NLP, and audio processing.
Diffusion model. A generative model that learns to produce data (images, audio, video) by learning to reverse a process of gradually adding noise. DALL-E 2, Stable Diffusion, and Midjourney are diffusion-based. They produce high-quality, diverse outputs but are computationally expensive.
Dropout. A regularization technique where random neurons are temporarily set to zero during training. Forces the network to develop redundant representations and reduces overfitting. Not used during inference.
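A minimal sketch of the idea, using the common "inverted dropout" variant as an assumption: surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and signature are illustrative, not any particular library's API.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Zero each activation with probability `rate` during training (rate < 1)."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass everything through unchanged
    keep = 1.0 - rate
    # Scale the survivors by 1/keep so the expected total stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

values = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(values, rate=0.5, training=True, rng=random.Random(0))
infer_out = dropout(values, rate=0.5, training=False)
```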
E
Embedding. A numerical representation of a piece of data (word, sentence, image) as a vector in high-dimensional space. Embeddings capture semantic relationships: "king" and "queen" are close; "king" and "car" are far. Embeddings are the foundation of semantic search, RAG, and similarity-based retrieval.
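"Close" and "far" are usually measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked so related words point the same way.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.90, 0.15]
car = [0.10, 0.20, 0.95]

royal_similarity = cosine_similarity(king, queen)  # close to 1
unrelated_similarity = cosine_similarity(king, car)  # much lower
```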
Emergent capability. A capability that appears in a model only above a certain scale, not present in smaller versions. CoT reasoning, multi-step code generation, and in-context learning are commonly cited examples. Emergent capabilities are difficult to predict and have surprised researchers. They are a major reason scale continues to matter.
Ensemble learning. Combining predictions from multiple models to improve accuracy and robustness. Random forests are an ensemble of decision trees. Depending on the method, ensembles can reduce variance (as bagging does in random forests; high variance means the model is overly sensitive to its training data) or bias (as boosting does; high bias means the model is too simple).
F
Few-shot learning. Providing a model with a small number of examples in the prompt before asking it to perform a task. "Here are three examples of X. Now do X for this new input." Few-shot prompting allows LLMs to adapt to new tasks without retraining. Contrast with zero-shot (no examples) and fine-tuning (examples used in training).
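In practice, few-shot prompting is mostly string assembly. Here is a minimal sketch; the `Input:`/`Output:` layout and the example reviews are assumptions, and any consistent format works.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

# Hypothetical sentiment examples.
examples = [
    ("The delivery arrived two weeks late.", "negative"),
    ("Setup took five minutes and it just worked.", "positive"),
    ("Support never answered my ticket.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "The new dashboard is a big improvement.",
)
```

Passing an empty `examples` list produces the zero-shot version of the same prompt.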
Fine-tuning. Further training a pre-trained model on a specific dataset to adapt it for a particular task or domain. Less expensive than training from scratch. Common approaches: supervised fine-tuning (SFT), RLHF, LoRA, and QLoRA. Fine-tuning does not reliably add new knowledge — it primarily adjusts behavior and style.
Foundation model. A large model trained on broad data that can be adapted for many downstream tasks. GPT-4, Claude, Gemini, and Llama are foundation models. The term was coined at Stanford to describe the paradigm shift from task-specific models to general-purpose models.
G
Generative AI. AI systems that produce new content — text, images, code, audio, video — rather than classifying or predicting from existing data. LLMs, diffusion models, and voice synthesis systems are all generative AI. The distinction from discriminative AI: generative models learn the underlying distribution of data; discriminative models learn boundaries between categories.
GPT (Generative Pre-trained Transformer). OpenAI's model family. "Generative" means it produces text. "Pre-trained" means it was trained on large datasets before fine-tuning. "Transformer" is the underlying architecture. GPT-3 (2020) demonstrated that scale alone could produce surprisingly capable language models. GPT-4 (2023) remains widely deployed in production.
Gradient descent. The optimization algorithm used to train neural networks. The "gradient" is the direction and magnitude of the steepest ascent in the loss landscape. "Descent" means we move in the opposite direction to reduce loss. Stochastic gradient descent (SGD) uses random subsets of data (mini-batches) for efficiency.
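Here is a minimal sketch of stochastic gradient descent fitting a single weight `w` so that `w * x` matches `y`. The data, learning rate, batch size, and seed are assumptions chosen so the example converges quickly; the true answer is `w = 2`.

```python
import random

# Hypothetical training data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def sgd(steps=200, learning_rate=0.01, batch_size=2, seed=0):
    """Fit w by repeatedly stepping against the gradient of a mini-batch loss."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), batch_size)  # random mini-batch
        # Gradient of the mean squared error (w*x - y)^2 with respect to w.
        gradient = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * gradient  # "descent": move opposite the gradient
    return w

learned_w = sgd()
```

`learned_w` ends up very close to 2.0 even though no single step ever sees the full dataset.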
H
Hallucination. When an LLM produces plausible-sounding but factually incorrect output. The model "knows" how to produce confident-sounding text; it does not always know whether that text is true. Hallucinations occur because LLMs are trained to predict likely text, not to verify facts. Mitigation approaches: RAG, tool use, grounding in external data, and explicit uncertainty prompting.
RLHF (Reinforcement Learning from Human Feedback). A training methodology where human raters rank model outputs, a reward model is trained on those rankings, and the LLM is then optimized to produce outputs that the reward model scores highly. Critical to aligning LLMs with human preferences. Used by OpenAI, Anthropic, and Google to make models more helpful, harmless, and honest. The "H" in the name stands for "human."
Hyperparameter. A configuration value set before training begins, not learned during training. Learning rate, batch size, number of layers, and dropout rate are hyperparameters. Choosing good hyperparameters is part of the model development process.
I
In-context learning. The ability of LLMs to adapt to new tasks based on examples or instructions in the prompt, without any weight updates. Few-shot prompting is its most common form. Emerged at scale: smaller models cannot do it reliably. One of the key properties that makes LLMs general-purpose tools.
Inference. Running a trained model on new inputs to produce outputs. Distinct from training (which updates weights). Inference is what happens when you send a message to Claude or call the OpenAI API. Inference cost is separate from training cost and is the primary ongoing expense of deploying LLMs in production.
Instruction tuning. Fine-tuning a model on examples of instructions and their correct responses. Teaches the model to follow directions rather than just continue text. Most modern LLMs are instruction-tuned on top of their pre-training. InstructGPT was an early influential instruction-tuned model.
L
Large language model (LLM). A neural network trained on large amounts of text data to understand and generate human language. "Large" refers to the number of parameters (weights) — modern models have tens of billions to trillions. LLMs are the foundation of ChatGPT, Claude, Gemini, and Copilot. They are statistical models of language, not databases of facts.
Latency. The time from sending a request to receiving a response. For LLMs: time-to-first-token (TTFT) and tokens-per-second (TPS) are the key metrics. Latency is affected by model size, hardware, batching, and serving infrastructure. High latency blocks real-time applications.
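The two metrics combine into a simple back-of-the-envelope estimate of how long a user waits for a full response. The function name and the numbers are hypothetical, and the formula assumes a steady generation rate.

```python
def estimate_response_seconds(ttft_seconds, output_tokens, tokens_per_second):
    """Total wait = time to first token + time to stream the remaining output."""
    return ttft_seconds + output_tokens / tokens_per_second

# Hypothetical deployment: 0.5 s to first token, 50 tokens/s, 500-token answer.
total_seconds = estimate_response_seconds(0.5, 500, 50)  # 10.5
```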
LoRA (Low-Rank Adaptation). A parameter-efficient fine-tuning method that trains a small number of additional weights rather than updating the full model. Dramatically reduces the compute and memory cost of fine-tuning. Common for adapting large models to specific tasks without full retraining.
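The savings are easy to see with arithmetic. LoRA represents the update to a `d x k` weight matrix as the product of two thin matrices of rank `r`. The dimensions below are hypothetical, picked to resemble one attention projection in a mid-sized model.

```python
def full_update_params(d, k):
    """Trainable values when updating the whole d x k weight matrix."""
    return d * k

def lora_update_params(d, k, r):
    """Trainable values for a rank-r update: a d x r matrix times an r x k matrix."""
    return r * (d + k)

# Hypothetical layer size and rank.
d, k, r = 4096, 4096, 8
full = full_update_params(d, k)     # 16,777,216
lora = lora_update_params(d, k, r)  # 65,536
fraction = lora / full              # 0.00390625, i.e. about 0.4%
```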
M
Machine learning (ML). A subset of AI where systems learn from data rather than being explicitly programmed. The model finds patterns in training data and applies them to new inputs. Deep learning is a subset of ML. Most modern "AI" is machine learning.
Model. In ML, a mathematical function that maps inputs to outputs, defined by its architecture and learned parameters (weights). "The model" is the artifact that results from training. A deployed model is called in production to make predictions or generate content.
Multi-modal model. A model that processes multiple types of input (text, images, audio, video). GPT-4V, Claude 3, and Gemini Ultra are multi-modal. Enables use cases like analyzing charts, describing images, and processing documents with mixed content.
N
Natural language processing (NLP). The field of AI focused on enabling computers to understand, generate, and manipulate human language. Classification, summarization, translation, question answering, and sentiment analysis are NLP tasks. Modern NLP is dominated by transformer-based LLMs.
Neural network. See Artificial neural network.
Normalization. Rescaling values (input features, or the activations inside a network) to a consistent range to stabilize training. Layer normalization is used within transformers. Without normalization, gradient magnitudes can vary wildly across layers, making training unstable.
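Here is a minimal sketch of layer normalization on a single vector. It omits the learnable scale and shift parameters that real implementations include, and the `eps` value is a common default used here as an assumption.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    # eps guards against division by zero when all values are equal.
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Whatever the scale of the input, the output has the same predictable spread, which is what keeps gradients well-behaved from layer to layer.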
O
Overfitting. When a model performs well on training data but poorly on new data. The model has memorized the training examples rather than learning the underlying pattern. Signs: high training accuracy, low validation accuracy. Mitigations: more data, regularization (dropout, weight decay), early stopping.
P
Parameters. The weights and biases of a neural network that are learned during training. GPT-3 has 175 billion parameters. More parameters generally means more capacity — but not always more accuracy, and always more compute cost. "Parameter count" is a rough proxy for model capability.
Prompt. The input provided to an LLM. Everything the model sees before it generates a response. Prompt design (prompting) is the practice of structuring inputs to elicit better outputs. System prompts, few-shot examples, and chain-of-thought instructions are all parts of prompting.
Prompt engineering. The practice of designing inputs to language models to improve output quality. Part craft, part systematic experimentation. Best practices: be specific, show examples, specify output format, decompose complex tasks, use chain-of-thought for reasoning tasks.
R
RAG (Retrieval-Augmented Generation). A pattern where an LLM is paired with a retrieval system. When answering a question, the system first retrieves relevant documents from a database (using embeddings and vector search), then passes those documents to the LLM as context. RAG reduces hallucination, allows models to use up-to-date information, and is the foundation of most enterprise AI applications.
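The retrieve-then-prompt flow can be sketched in a few lines. To stay self-contained, this sketch swaps embeddings and vector search for plain word overlap, which is a much cruder relevance signal, and it stops at building the prompt rather than calling a model. The documents, function names, and prompt wording are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by shared words with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & tokenize(doc)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Put the retrieved documents in front of the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Hypothetical knowledge base.
DOCUMENTS = [
    "Line 3 was down for maintenance on Tuesday.",
    "The cafeteria menu changes every Monday.",
    "Scrap rate on line 3 rose after the tooling change.",
]
question = "Why did the scrap rate rise on line 3?"
retrieved = retrieve(question, DOCUMENTS)
rag_prompt = build_prompt(question, DOCUMENTS)
```

The cafeteria document never reaches the model, so the answer is grounded in the two relevant records rather than in whatever the model remembers from training.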
Reinforcement learning (RL). A learning paradigm where an agent learns by taking actions in an environment and receiving rewards or penalties. Used in game-playing AI (AlphaGo, OpenAI Five) and, through RLHF, in LLM alignment. The agent optimizes for cumulative reward over time.
Regularization. Techniques that reduce overfitting by constraining the model during training. Dropout, L1/L2 weight penalties, and early stopping are regularization methods. They prevent the model from fitting noise in the training data.
S
Semantic search. Search based on meaning rather than exact keyword matching. Uses embeddings to represent both queries and documents in the same vector space. "What caused the 2008 financial crisis" returns documents about mortgage-backed securities and bank failures, even if they never contain those exact words.
Supervised learning. Training a model on labeled examples — input/output pairs. The model learns to predict the correct output for new inputs. Classification and regression are supervised tasks. Requires curated labeled data, which is expensive to produce at scale.
System prompt. Instructions provided to an LLM before the user's first message that configure its behavior for the entire conversation. Used to set persona, restrict topics, specify output format, and provide context. System prompts are the primary mechanism for deploying LLMs in production applications.
T
Temperature. A parameter that controls the randomness of LLM outputs. Temperature 0 makes the model always pick the most likely next token (deterministic in principle, often repetitive). Higher temperatures introduce more randomness (creative, diverse, sometimes incoherent). Temperature 0.7-1.0 is typical for conversational use; temperature 0, or close to it, is typical for code generation and structured output.
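Under the hood, temperature divides the model's raw scores (logits) before they are converted to probabilities. The sketch below uses made-up logits for three candidate tokens and treats temperature 0 as plain "pick the top token," which is how it is commonly handled.

```python
import math

def token_probabilities(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("use greedy_token() for temperature 0")
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def greedy_token(logits):
    """Temperature 0: always return the highest-scoring token."""
    return max(logits, key=logits.get)

# Hypothetical raw scores for the next token.
logits = {"the": 2.0, "a": 1.0, "zebra": 0.1}
focused = token_probabilities(logits, temperature=0.5)  # mass piles onto "the"
diffuse = token_probabilities(logits, temperature=2.0)  # flatter distribution
```

At low temperature the top token dominates; at high temperature unlikely tokens such as "zebra" get a real chance of being sampled.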
Token. The basic unit of text that LLMs process. Roughly 0.75 words on average in English, though tokenization varies by model. "Tokenization" splits input text into tokens before the model processes it. Token count determines context window usage and API pricing.
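The 0.75 words-per-token rule of thumb is enough for rough budgeting. In the sketch below the price is a placeholder, not any vendor's actual rate, and exact counts require the model's own tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English average; varies by model and text

def estimate_tokens(word_count):
    """Approximate token count for a given number of English words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def estimate_cost(tokens, price_per_million_tokens):
    """Approximate cost; the price is whatever your provider charges."""
    return tokens / 1_000_000 * price_per_million_tokens

tokens_needed = estimate_tokens(3000)  # 4000
# Hypothetical price of 2.00 per million tokens.
cost = estimate_cost(tokens_needed, price_per_million_tokens=2.0)
```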
Transformer. The neural network architecture that underlies virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need." Key innovation: the self-attention mechanism, which allows the model to consider the full context of the input when processing each token. GPT, BERT, Claude, and Gemini are all transformer-based.
U
Unsupervised learning. Training a model on unlabeled data. The model finds structure, patterns, and relationships without explicit guidance. Clustering and dimensionality reduction are unsupervised techniques. LLM pre-training is a form of self-supervised learning (a variant of unsupervised).
V
Vector database. A database optimized for storing and querying high-dimensional vectors (embeddings). Enables fast similarity search — "find the 10 embeddings most similar to this query embedding." Pinecone, Weaviate, Chroma, and pgvector (Postgres extension) are common options. Essential infrastructure for RAG-based applications.
Vision transformer (ViT). A transformer architecture applied to images. Images are split into patches, each patch is embedded as a vector, and the transformer processes the sequence of patches. ViTs have largely replaced CNNs for high-performance image classification and vision-language tasks.
W
Weights. The numerical parameters of a neural network that are adjusted during training. A model "knowing" something means those facts are encoded (imperfectly) across billions of weights. Weight updates during training are small incremental adjustments, not discrete storage of facts.
Z
Zero-shot learning. Asking a model to perform a task without providing any examples. "Translate this sentence to Spanish." Zero-shot performance is a measure of how well the model's pre-training generalizes. Modern LLMs perform well zero-shot on a wide range of tasks, which is one of the core properties that makes them useful.
Questions on any of these concepts as they apply to manufacturing or PE portfolio operations: hello@coldironlabs.com