{"id":5347,"date":"2026-03-04T20:10:43","date_gmt":"2026-03-05T01:10:43","guid":{"rendered":"https:\/\/dft.wiki\/?p=5347"},"modified":"2026-03-04T20:10:43","modified_gmt":"2026-03-05T01:10:43","slug":"acronyms-jargon-and-architecture-of-llm-and-generative-ai","status":"publish","type":"post","link":"https:\/\/dft.wiki\/?p=5347","title":{"rendered":"Acronyms, Jargon, and Architecture of LLM and Generative AI"},"content":{"rendered":"<p>This glossary organizes the foundational concepts, file structures, and technical operations required to navigate the ecosystem of Large Language Models (LLMs).<\/p>\n<h3>Core Concepts &amp; Definitions<\/h3>\n<ul>\n<li><strong>LLM (Large Language Model)<\/strong>\n<ul>\n<li>A neural network trained on massive text datasets to predict the next token in a sequence.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Transformer<\/strong>\n<ul>\n<li>The underlying neural network architecture powering modern LLMs (supporting Chat, Code, Speech, etc.).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Attention<\/strong>\n<ul>\n<li>A mechanism that allows the model to weigh the relative importance of different tokens in a sequence.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Multi-Head Attention<\/strong>\n<ul>\n<li>Multiple attention heads running in parallel within a layer, each capturing different aspects of the relationships between tokens.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Inference<\/strong>\n<ul>\n<li>The process of running a trained model to generate output (as opposed to training or fine-tuning).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Parameters (e.g., 7B, 13B, 70B)<\/strong>\n<ul>\n<li>The &#8220;size&#8221; of the model: its number of learned weights, in billions. 
More parameters generally mean greater capability but require more memory and compute.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Embedding<\/strong>\n<ul>\n<li>The mathematical vector representation of text that allows models to process semantic meaning.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Tokenization &amp; Context<\/h3>\n<ul>\n<li><strong>Tokens<\/strong>\n<ul>\n<li>The smallest unit of data the model processes. One token is roughly 3\u20134 characters, or about 0.75 words, of English text.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Context Window<\/strong>\n<ul>\n<li>The maximum number of tokens a model can &#8220;remember&#8221; or consider at once, counting both the prompt and the generated output (e.g., 64k, 128k).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Temperature<\/strong>\n<ul>\n<li>A sampling parameter that controls the randomness of output. Higher values increase &#8220;creativity,&#8221; while lower values make the output more deterministic.<\/li>\n<\/ul>\n<\/li>\n<li><strong>System Prompt<\/strong>\n<ul>\n<li>A hidden instruction set that defines the assistant\u2019s persona, constraints, and behavior.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Hardware &amp; Performance<\/h3>\n<ul>\n<li><strong>VRAM (Video RAM)<\/strong>\n<ul>\n<li>Dedicated memory on a GPU, required to load and run model weights.<\/li>\n<\/ul>\n<\/li>\n<li><strong>CUDA<\/strong>\n<ul>\n<li>NVIDIA\u2019s parallel computing platform and API for GPU acceleration.<\/li>\n<\/ul>\n<\/li>\n<li><strong>ROCm<\/strong>\n<ul>\n<li>AMD\u2019s open-source software stack for GPU-accelerated computing.<\/li>\n<\/ul>\n<\/li>\n<li><strong>CPU Offloading<\/strong>\n<ul>\n<li>A technique that lets some of the model\u2019s layers run on the CPU from system RAM when VRAM is insufficient.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Accelerate<\/strong>\n<ul>\n<li>A Hugging Face library that manages <code>device_map<\/code> calculations, weight streaming, and efficient CPU offloading.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Model Training &amp; Adaptation<\/h3>\n<ul>\n<li><strong>Pretrained Model<\/strong>\n<ul>\n<li>A 
&#8220;base&#8221; model trained on a general corpus of data before any task-specific tuning.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Fine-tuning<\/strong>\n<ul>\n<li>The process of further training a pretrained model on a specific, smaller dataset to specialize its behavior.<\/li>\n<\/ul>\n<\/li>\n<li><strong>LoRA (Low-Rank Adaptation)<\/strong>\n<ul>\n<li>A lightweight fine-tuning method that injects small, trainable low-rank adapter matrices into the model instead of updating all parameters.<\/li>\n<\/ul>\n<\/li>\n<li><strong>QLoRA<\/strong>\n<ul>\n<li>A variant of LoRA that keeps the base model in 4-bit quantized form while training the adapters, allowing fine-tuning on consumer-grade hardware.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Quantization<\/strong>\n<ul>\n<li>The process of storing model weights at lower numeric precision to reduce memory usage and increase speed, typically with a slight trade-off in quality.<\/li>\n<li>For example, models are frequently quantized from 16-bit down to 8-bit or 4-bit.<\/li>\n<li>Common quantization types: Q4_K_M, Q8_0, 4-bit GPTQ, AWQ.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>File Formats<\/h3>\n<ul>\n<li><strong>GGUF<\/strong>\n<ul>\n<li>The standard format for llama.cpp; optimized for CPU+GPU inference.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Safetensors<\/strong>\n<ul>\n<li>A secure, fast tensor format that prevents arbitrary code execution (replacing the older, &#8220;unsafe&#8221; pickle-based <code>.bin<\/code> format).<\/li>\n<\/ul>\n<\/li>\n<li><strong>GPTQ \/ AWQ<\/strong>\n<ul>\n<li>Quantization methods, and the weight formats they produce, optimized for high-speed GPU-centric inference.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Checkpoint<\/strong>\n<ul>\n<li>The actual file containing the trained weights (e.g., <code>.safetensors<\/code>, <code>.gguf<\/code>).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>The Anatomy of a Model Directory<\/h3>\n<p>When downloading a model (e.g., from Hugging Face), you will encounter these standard files:<\/p>\n<ul>\n<li><code>config.json<\/code>\n<ul>\n<li>Defines the architecture: layers, heads, vocab size, and model type.<\/li>\n<\/ul>\n<\/li>\n<li>Weight files, found as 
<code>.safetensors<\/code> (preferred) or <code>.bin<\/code>\n<ul>\n<li>Trained weights, the &#8220;knowledge&#8221; of the model.<\/li>\n<\/ul>\n<\/li>\n<li><code>tokenizer.json<\/code>\n<ul>\n<li>Defines the vocabulary and rules for turning text into tokens.<\/li>\n<\/ul>\n<\/li>\n<li><code>tokenizer_config.json<\/code>\n<ul>\n<li>Settings for tokenizer behavior (padding, max length).<\/li>\n<\/ul>\n<\/li>\n<li><code>special_tokens_map.json<\/code>\n<ul>\n<li>Maps functional tokens like <code>&lt;bos&gt;<\/code> (start), <code>&lt;eos&gt;<\/code> (end), and <code>&lt;pad&gt;<\/code>.<\/li>\n<\/ul>\n<\/li>\n<li><code>generation_config.json<\/code>\n<ul>\n<li>Optional defaults for inference (temperature, top_p, repetition penalty).<\/li>\n<\/ul>\n<\/li>\n<li><code>preprocessor_config.json<\/code>\n<ul>\n<li>Required for multimodal models (vision\/speech) to resize or normalize input.<\/li>\n<\/ul>\n<\/li>\n<li><code>model_index.json<\/code>\n<ul>\n<li>Metadata used by the Hugging Face Hub and pipelines.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Diffusers (Multimodal Pipelines)<\/h3>\n<p>For image, video, or audio generation, models use a modular directory structure called a <strong>Pipeline<\/strong>:<\/p>\n<ul>\n<li><code>model_index.json<\/code>\n<ul>\n<li>The blueprint for the entire pipeline.<\/li>\n<\/ul>\n<\/li>\n<li><code>vae\/<\/code>\n<ul>\n<li>The Variational Autoencoder; encodes images into latent space and decodes them back to pixels.<\/li>\n<\/ul>\n<\/li>\n<li><code>unet<\/code> \/ <code>text_encoder<\/code>\n<ul>\n<li>Components that guide the generation process based on prompts: the text encoder embeds the prompt, and the U-Net denoises the latents conditioned on it.<\/li>\n<\/ul>\n<\/li>\n<li><code>scheduler\/<\/code>\n<ul>\n<li>Defines the noise schedule and controls how many steps are taken during inference.<\/li>\n<\/ul>\n<\/li>\n<li><code>safety_checker<\/code> \/ <code>feature_extractor<\/code>\n<ul>\n<li>Optional tools to detect and block NSFW content.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This glossary organizes the foundational concepts, file structures, and technical 
operations required to navigate the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[],"class_list":["post-5347","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5347"}],"version-history":[{"count":1,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347\/revisions"}],"predecessor-version":[{"id":5348,"href":"https:\/\/dft.wiki\/index.php?rest_route=\/wp\/v2\/posts\/5347\/revisions\/5348"}],"wp:attachment":[{"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5347"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5347"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dft.wiki\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5347"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}