{"id":5347,"date":"2026-03-04T20:10:43","date_gmt":"2026-03-05T01:10:43","guid":{"rendered":"https:\/\/dft.wiki\/?p=5347"},"modified":"2026-06-08T09:41:08","modified_gmt":"2026-06-08T13:41:08","slug":"acronyms-jargon-and-architecture-of-llm-and-generative-ai","status":"publish","type":"post","link":"https:\/\/dft.wiki\/?p=5347","title":{"rendered":"Acronyms, Jargon, and Architecture of LLM and Generative AI"},"content":{"rendered":"<p>This glossary covers the foundational concepts, file structures, and technical operations needed to work with Large Language Models (LLMs).<\/p>\n<h3>Core Concepts &amp; Definitions<\/h3>\n<ul>\n<li><strong>LLM (Large Language Model)<\/strong>\n<ul>\n<li>A neural network trained on large text datasets to predict the next token in a sequence.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Transformer<\/strong>\n<ul>\n<li>The neural network architecture powering modern LLMs (used for Chat, Code, Speech, and more).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Attention<\/strong>\n<ul>\n<li>A mechanism that lets the model weigh how relevant each token in a sequence is to the others.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Multi-Head Attention<\/strong>\n<ul>\n<li>Multiple attention heads running in parallel within the same layer, each capturing different aspects of the input.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Inference<\/strong>\n<ul>\n<li>The process of running a trained model to generate output, as opposed to training or fine-tuning it.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Parameters (e.g., 7B, 13B, 70B)<\/strong>\n<ul>\n<li>The &#8220;size&#8221; of a model: the number of learned weights, counted in billions. 
More parameters generally mean stronger reasoning but require more hardware resources.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Embedding<\/strong>\n<ul>\n<li>A numerical vector representation of text that allows a model to process semantic meaning.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Tokenization &amp; Context<\/h3>\n<ul>\n<li><strong>Tokens<\/strong>\n<ul>\n<li>The smallest unit of data a model processes. In English text, one token is roughly 3\u20134 characters or 0.75 words.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Context Window<\/strong>\n<ul>\n<li>The maximum number of tokens a model can consider at once, counting both the prompt and the generated output (e.g., 64k or 128k tokens).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Temperature<\/strong>\n<ul>\n<li>A setting that controls output randomness. Higher values produce more varied responses; lower values make the model more predictable.<\/li>\n<\/ul>\n<\/li>\n<li><strong>System Prompt<\/strong>\n<ul>\n<li>A hidden set of instructions that defines the assistant&#8217;s persona, constraints, and behavior.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Hardware &amp; Performance<\/h3>\n<ul>\n<li><strong>VRAM (Video RAM)<\/strong>\n<ul>\n<li>Dedicated memory on a GPU, required to load and run model weights.<\/li>\n<\/ul>\n<\/li>\n<li><strong>CUDA<\/strong>\n<ul>\n<li>NVIDIA&#8217;s parallel computing platform and API for GPU acceleration.<\/li>\n<\/ul>\n<\/li>\n<li><strong>ROCm<\/strong>\n<ul>\n<li>AMD&#8217;s open-source software stack for GPU-accelerated computing.<\/li>\n<\/ul>\n<\/li>\n<li><strong>CPU Offloading<\/strong>\n<ul>\n<li>A technique that moves parts of the model to system RAM and the CPU when VRAM is insufficient.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Accelerate<\/strong>\n<ul>\n<li>A Hugging Face library that handles device mapping, weight streaming, and CPU offloading.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Model Training &amp; Adaptation<\/h3>\n<ul>\n<li><strong>Pretrained Model<\/strong>\n<ul>\n<li>A base model trained on a general dataset before any task-specific 
tuning.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Fine-tuning<\/strong>\n<ul>\n<li>Further training a pretrained model on a smaller, task-specific dataset to specialize its behavior.<\/li>\n<\/ul>\n<\/li>\n<li><strong>LoRA (Low-Rank Adaptation)<\/strong>\n<ul>\n<li>A lightweight fine-tuning method that trains small low-rank adapter matrices added to the model while the original parameters stay frozen.<\/li>\n<\/ul>\n<\/li>\n<li><strong>QLoRA<\/strong>\n<ul>\n<li>A memory-efficient version of LoRA that keeps the frozen base model in 4-bit quantized form, enabling fine-tuning on consumer hardware.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Quantization<\/strong>\n<ul>\n<li>The process of storing model weights at lower numeric precision to reduce memory usage and increase speed, usually with a minor quality trade-off. Models are commonly reduced from 16-bit to 4-bit or 8-bit precision.<\/li>\n<li>Common formats: Q4_K_M, Q8_0, 4-bit GPTQ, AWQ.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>File Formats<\/h3>\n<ul>\n<li><strong>GGUF<\/strong>\n<ul>\n<li>The standard format for llama.cpp, optimized for CPU and GPU inference.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Safetensors<\/strong>\n<ul>\n<li>A secure, fast tensor format that prevents arbitrary code execution, replacing the older pickle-based <code>.bin<\/code> format.<\/li>\n<\/ul>\n<\/li>\n<li><strong>GPTQ \/ AWQ<\/strong>\n<ul>\n<li>Quantized formats optimized for fast, GPU-based inference.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Checkpoint<\/strong>\n<ul>\n<li>A file containing a model&#8217;s trained weights (e.g., <code>.safetensors<\/code>, <code>.gguf<\/code>).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>The Anatomy of a Model Directory<\/h3>\n<p>When downloading a model from Hugging Face, you will typically find these files:<\/p>\n<ul>\n<li><code>config.json<\/code>\n<ul>\n<li>Defines the model architecture: layers, attention heads, vocabulary size, and model type.<\/li>\n<\/ul>\n<\/li>\n<li><code>.safetensors<\/code> (preferred) or <code>.bin<\/code>\n<ul>\n<li>The trained weights, the model&#8217;s 
&#8220;knowledge.&#8221;<\/li>\n<\/ul>\n<\/li>\n<li><code>tokenizer.json<\/code>\n<ul>\n<li>Defines the vocabulary and rules for converting text into tokens.<\/li>\n<\/ul>\n<\/li>\n<li><code>tokenizer_config.json<\/code>\n<ul>\n<li>Tokenizer settings such as padding and maximum length.<\/li>\n<\/ul>\n<\/li>\n<li><code>special_tokens_map.json<\/code>\n<ul>\n<li>Maps functional tokens like &lt;bos&gt; (start), &lt;eos&gt; (end), and &lt;pad&gt; (padding).<\/li>\n<\/ul>\n<\/li>\n<li><code>generation_config.json<\/code>\n<ul>\n<li>Optional default inference settings (temperature, top_p, repetition_penalty).<\/li>\n<\/ul>\n<\/li>\n<li><code>preprocessor_config.json<\/code>\n<ul>\n<li>Required for multimodal models (vision\/speech) to resize or normalize inputs.<\/li>\n<\/ul>\n<\/li>\n<li><code>model_index.json<\/code>\n<ul>\n<li>Found in Diffusers pipeline repositories rather than plain language-model repositories; it lists the pipeline&#8217;s components.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Diffusers (Multimodal Pipelines)<\/h3>\n<p>For image, video, or audio generation, models use a modular directory structure called a <strong>Pipeline<\/strong>:<\/p>\n<ul>\n<li><code>model_index.json<\/code>\n<ul>\n<li>The blueprint for the entire pipeline: it names each component and the class used to load it.<\/li>\n<\/ul>\n<\/li>\n<li><code>vae\/<\/code>\n<ul>\n<li>The Variational Autoencoder, which encodes images into a latent representation and decodes them back to pixels.<\/li>\n<\/ul>\n<\/li>\n<li><code>unet\/<\/code> \/ <code>text_encoder\/<\/code>\n<ul>\n<li>The text encoder converts the prompt into embeddings; the UNet is the denoising network that generates the image under the guidance of those embeddings.<\/li>\n<\/ul>\n<\/li>\n<li><code>scheduler\/<\/code>\n<ul>\n<li>Defines the noise schedule and how each denoising step is computed during inference.<\/li>\n<\/ul>\n<\/li>\n<li><code>safety_checker\/<\/code> \/ <code>feature_extractor\/<\/code>\n<ul>\n<li>Optional components for detecting and filtering NSFW content.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This glossary covers the foundational concepts, file structures, and technical operations needed to work with 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[],"class_list":["post-5347","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5347"}],"version-history":[{"count":4,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347\/revisions"}],"predecessor-version":[{"id":5606,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347\/revisions\/5606"}],"wp:attachment":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5347"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5347"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5347"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}