DeepSeek has provided Python source code for running DeepSeek-V3-Base (about 1,400 lines of code in 5 .py files), along with a large number of massive safetensors files (about 688 GB) and .json config files. DeepSeek’s now-famous R1 “reasoning” model is based on V3.
Here, I ask Google Gemini 2.0 Flash Experimental in Google AI Studio to analyze and diagram this code for me, both before and after Gemini has seen DeepSeek’s related research papers (technical report and paper on reasoning and reinforcement learning). I also have Gemini help me run a small local model (DeepSeek-R1-Distill-Llama-8B) with much simpler code, and analyze the strange chain of thought (CoT) this small local model generates on my laptop. And while the Python source code covers using the model but not training it, Gemini also analyzes the DeepSeek papers for information about training.
Why look at DeepSeek?
One goal here is to help understand DeepSeek itself, which is interesting both for its claimed dramatically lower costs and for several innovations highlighted below in the reports that Google Gemini generates for me.
- In part because DeepSeek-V3 reportedly cost less than $6 million to train (compared to roughly $100 million for GPT-4), and reportedly costs much less for businesses to run (in part due to model distillation), it will be worth spending more time with DeepSeek — though, not surprisingly, Microsoft and OpenAI are investigating whether DeepSeek may have distilled from OpenAI’s own models, and Dario Amodei of Anthropic questions DeepSeek’s true cost. (And perhaps also worth spending time with Alibaba’s Qwen 2.5, released days after DeepSeek’s release.)
- The release of DeepSeek may signal a “commodification of AI” and/or a 21st century “Sputnik moment” for the US (as part of an AI “Cold War” with China)
- On DeepSeek’s innovations, on the challenges it may present to companies such as OpenAI and Google, and on that $6 million training figure that’s received so much attention, see:
- Jeffrey Emanuel’s lengthy blog post on DeepSeek and Nvidia, which is said to have helped spark the huge drop in Nvidia’s market value
- SemiAnalysis on DeepSeek server CapEx
- Transformer (Shakeel Hashim) on “How to think about DeepSeek”
- Lex Fridman’s remarkable 5-hour (!) video with Dylan Patel (SemiAnalysis) and Nathan Lambert (Ai2), “DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters” (transcript)
- IBM Technology video with Aaron Baughman, Chris Hay and Kate Soule, “DeepSeek facts vs hype, model distillation, and open source competition”
- As for specific innovations used by DeepSeek, highlighted in Gemini’s summaries of the code, see discussions below of:
- Mixture-of-Experts (MoE)
- Multi-Head Latent Attention (MLA)
- Multi-Token Prediction (MTP)
- FP8 quantization
- Custom Triton kernels
- Knowledge distillation
- Custom KV (key-value) caching implementations in the Transformer attention layer
- Rotary Positional Embeddings (RoPE) instead of traditional positional embeddings
- If the reader is concerned about Josh Hawley’s proposed law with jail time for downloading DeepSeek, note that a downloader would at least be in good company: DeepSeek is hosted not only at HuggingFace and GitHub and the Google and Apple app stores, but also by IBM, Microsoft, and Amazon.
- The reader may be coming to this page with questions such as the following:
- What is the evidence for claims/fears of DeepSeek innovation?
- What is the evidence for claims/fears of dramatically reduced training cost and for greater efficiency?
- What about the claims that DeepSeek was “distilled” from existing models such as ChatGPT?
- Do the local models “phone home”?
- The information here helps answer such questions, but the answers themselves will need to come from others (see links above), or await further work I’m doing on DeepSeek. Soon-forthcoming pages at this site will cover:
- What is the DeepSeek training process? (with details on reinforcement learning and distillation; beyond the short section on training in this page)
- What are key DeepSeek R1 and V3 innovations? (with a focus, coming from Google Gemini’s analysis, on the balance or “synergy” achieved from the many features covered on this page: MoE, MLA, MTP, FP8 — some of this “balance/synergy as a key innovation” point is made in the conclusion to this page).
- How close do the DeepSeek innovations come to earlier work from other companies and researchers? (Google Gemini referred to this as the “origin story” of some of DeepSeek’s important features)
- What can be gleaned from other DeepSeek source code and research papers?
- (Some of this forthcoming analysis comes from Google Gemini examining intentionally-obfuscated DeepSeek source code, the purpose being not only to test LLM code-analysis abilities with reduced ability to rely on function names and comments, but also to view the DeepSeek code with a greater emphasis on the code itself rather than on helpful names it might contain.)
Why use Google Gemini to examine DeepSeek?
Another goal here, apart from interest in DeepSeek itself, is testing AI-generated code summaries, especially as they relate to the types of information that source-code examiners and expert witnesses use, for example, in patent litigation. See separate pages on AI summaries of source code (including obfuscated code), reverse-engineered (disassembled and decompiled) code listings, and running offline local LLMs for code summaries (offline local use will generally be crucial in source-code examination under a court protective order).
We’ll see that the type of material generated by Google Gemini could (after very careful checking and supplementing) be used as one element in an expert witness’s report on patent infringement or non-infringement, for example.
The DeepSeek source code here, while much smaller than what is typically encountered in litigation-related source-code examination, is a good example in part because it is new enough that Gemini is unlikely to have already been trained on it (except to whatever extent DeepSeek-V3 code may partly reflect earlier DeepSeek code), so we are seeing it inspect the code fresh.
As a test of AI code-analysis abilities, the first several rounds with Gemini below were conducted using only the source code, without the DeepSeek research papers (and also with AI Studio’s “Grounding with Google Search” button disabled). Below, Gemini makes at least one assertion based on the code which Gemini then has to walk back after seeing the documentation.
I had earlier used Google’s NotebookLM to summarize this code, but my test there included the DeepSeek research papers as well as the source code, and it is a better test of code-analysis abilities to initially hold back the documentation and require that the AI focus only on the code. (A further test of code-analysis ability comes from testing with obfuscated code, so that the LLM is kept from over-relying on the symbolic names of functions, classes, and variables, and on comments — Gemini’s analysis of obfuscated DeepSeek code will be covered soon on a separate page at this site.)
Even though it’s already become commonplace, it’s worth sitting back and shaking one’s head in amazement at the naturalness and usefulness (despite occasional “hallucinations”, BS, and assorted idiocies) of using AI chatbots as assistants in code reading, debugging, and technical explanation. It is shocking how far these systems have come in only a few years, on a basis that is, at bottom, a neural network with next-token prediction, trained largely with predictive-error minimization, and with reasoning-like behavior from reinforcement learning.
Another benefit of examining this DeepSeek code is the chance to learn a fair amount about basic Large Language Model (LLM) operation (though more about running the model, and much less on training). From Gemini’s analysis of the DeepSeek code, we get a nice walk-through of a small but substantial piece of LLM-related code, with an opportunity to discuss Transformers, Attention, Mixture-of-Experts, embeddings, token sampling, chat templates, and more. As background to LLM operation, I highly recommend the O’Reilly book Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar and Maarten Grootendorst (2024).
Note that the DeepSeek source code examined in most of this article is not essential to running DeepSeek models. There is of course the online DeepSeek chatbot. Further, only a few lines are needed to run a small local model, discussed below (though note the bizarre chain of thought I saw in my tests, which I also discuss with the DeepSeek chatbot).
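[As a preview of the “Running small local DeepSeek model” section below, here is a minimal sketch of what “a few lines” looks like, using the standard transformers package. The prompt and generation settings are illustrative only; the actual script I used, and its odd output, appear later on this page.]

```python
# Minimal sketch: run DeepSeek-R1-Distill-Llama-8B locally via Hugging Face transformers.
# Assumes transformers, torch, and accelerate are installed, plus enough RAM/VRAM for an 8B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many 'r's are in the word 'strawberry'?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```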
Contents
- Initial summary of DeepSeek Python source code
- Chronological summary of DeepSeek Python source code
- Really only 1,400 lines of code?
- Gemini goes out on a limb about chain of thought (CoT)
- Gemini re-evaluates its position after reading the DeepSeek doc
- Revised summary of DeepSeek code in light of DeepSeek doc
- Diagram of DeepSeek source code
- DeepSeek training & reinforcement learning (RL) paper
- Diagram of DeepSeek training
- Follow-up questions: HAI-LLM, LORA, RoPE, does it “phone home”?, etc.
- Compare with Claude data-handling abilities
- What about you, Gemini?
- Running small local DeepSeek model
- Poor results from small local DeepSeek model
- DeepSeek chatbot comments on local DeepSeek model results
- Conclusion by DeepSeek-R1, with support from Google Gemini 2.0 Flash Thinking Experimental
Below, my (AS = Andrew Schulman) requests and comments in italics. I’ve also added links to online explanations of technical terms — these links were not in Gemini’s responses.
Initial request for code summary of DeepSeek .py files
AS: Give me a high-level but detailed description of this code, including basic flow of requests from user input to the system’s output [I had uploaded 5 .py files (model.py, generate.py, kernel.py, fp8_cast_bf16.py, convert.py), and within a few seconds Gemini produces the following summary.]
Gemini: Okay, let’s break down this code. This project implements a large language model (LLM) with Mixture-of-Experts (MoE) capabilities, optimized for performance using techniques like FP8 quantization and custom Triton kernels. It includes components for model definition, weight conversion, and text generation. Here’s a high-level overview:
Core Functionality:
Model Definition (model.py): [AS: below, Gemini provides a commented list of each class in model.py]
- ModelArgs: A dataclass holding all the configurable hyperparameters for the LLM, including model size, MoE settings, and data types.
- ParallelEmbedding: An embedding layer that distributes the vocabulary across multiple GPUs for parallel processing.
- Linear, ColumnParallelLinear, RowParallelLinear: Custom linear layers that support FP8/BF16 data types and parallel computation
- RMSNorm: Layer normalization. [Implemented in model.py, not imported, but see PyTorch doc for RMSNorm (Root Mean Square layer normalization)]
- precompute_freqs_cis, apply_rotary_emb: Functions for implementing Rotary Positional Embeddings (RoPE).
- MLA: Multi-Headed Attention layer with LoRA support and efficient KV caching mechanisms. [See “query (q), key (k), and value (v) tensors” below]
- MLP: Multi-layer perceptron (feed-forward network).
- Gate: A gating network that routes input tokens to different experts within the MoE layer.
- Expert: An individual expert within the MoE layer, consisting of its own MLP.
- MoE: Mixture-of-Experts layer, combining a gating network, multiple experts, and a shared MLP.
- Block: A single layer of the transformer model, consisting of the attention layer, the feed-forward network (MLP or MoE), and normalization layers.
- Transformer: The main class defining the complete LLM. It holds the embedding layer, multiple transformer blocks, and an output projection head. It also handles the precomputation of RoPE frequencies. [AS: this is class Transformer in DeepSeek’s model.py, distinct from the HuggingFace Transformers library, which however is also used for AutoTokenize; see below.]
FP8 Conversion (fp8_cast_bf16.py):
- This script takes a directory containing FP8-quantized weights (along with scaling factors) and converts the weights to BF16 (bfloat16). It uses the weight_dequant kernel.
- It reads the safetensors files, dequantizes the FP8 weights, and saves the BF16 weights to new safetensors files.
- It also updates the model index JSON file to remove references to the scaling factors.
Checkpoint Conversion (convert.py):
- This script transforms a standard Hugging Face checkpoint into a custom checkpoint format that’s suitable for loading by the implemented model.
- It restructures the weights based on predefined mapping between names, and shards the weights across multiple GPUs based on the specified model-parallel factor.
- It creates safetensors files for each rank that can be used in distributed training/inference.
Triton Kernels (kernel.py):
- act_quant_kernel, act_quant: Implements FP8 quantization for activations using custom Triton kernels for fast execution.
- weight_dequant_kernel, weight_dequant: Implements FP8 dequantization using custom Triton kernels.
- fp8_gemm_kernel, fp8_gemm: Implements the matrix multiplication operation on FP8 inputs, accelerating the main computational bottleneck of transformer layers.
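[AS: To make the FP8 idea concrete, here is a hedged, PyTorch-only sketch (no Triton, no GPU required) of per-block activation quantization and dequantization. The block size of 128 and the e4m3 maximum of 448 are typical values; this illustrates the technique, not DeepSeek’s actual act_quant/weight_dequant code.]

```python
import torch  # requires PyTorch >= 2.1 for torch.float8_e4m3fn

def quant_dequant_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize x to FP8 (e4m3) with one scale factor per block of `block` values,
    then dequantize, so we can see the round-trip error."""
    orig_shape = x.shape
    x = x.reshape(-1, block)                              # group values into blocks
    scale = x.abs().amax(dim=-1, keepdim=True) / 448.0    # 448 = largest finite e4m3 value
    scale = scale.clamp(min=1e-12)                        # avoid division by zero
    q = (x / scale).to(torch.float8_e4m3fn)               # quantized block values
    deq = q.to(torch.float32) * scale                     # dequantize using the saved scales
    return q.reshape(orig_shape), deq.reshape(orig_shape)

x = torch.randn(4, 256)
q, deq = quant_dequant_blockwise(x)
print((x - deq).abs().max())   # small round-trip quantization error
```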
Text Generation (generate.py):
- This script loads the model and runs either in interactive mode or with batch input.
- sample: Samples token from model logits.
- generate: Generates new tokens, iteratively decoding until max length or EOS token is reached.
- The main function handles model loading, tokenization using the Hugging Face AutoTokenizer, and calls the generation functions. It handles interactive prompts or batch prompts from a file. It also supports running on multiple GPUs with DDP support.
Basic Request Flow (Interactive Mode):
- User Input: The user types a prompt in the console.
- Prompt Handling: The prompt is passed to generate.py. If running in distributed mode, the prompt is broadcast to other ranks.
- Tokenization: The prompt is converted into a sequence of numerical token IDs using the AutoTokenizer.
- Model Inference: The token IDs are passed to the Transformer model.
- Forward Pass: The model performs the forward pass through its layers. This involves:
- Embedding the input tokens.
- Applying rotary positional embeddings. [RoPE]
- Running the attention and feed-forward layers, possibly using the MoE mechanism.
- Normalizing the output.
- Projecting the hidden state to get logits.
- Token Sampling: The logits are used to sample the next token. If temperature > 0, the sample function is used for stochastic sampling; otherwise, argmax is used to get the token with the highest probability.
- Output Generation: The sampled token ID is appended to the sequence. The process is repeated until the desired length is reached or an EOS token is generated.
- Decoding: The generated token IDs are converted back into text using the AutoTokenizer.
- Output Display: The generated text is displayed on the console.
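[AS: Gemini’s “Token Sampling” step above is easy to show in miniature. This is a generic temperature-sampling sketch, not DeepSeek’s actual sample() function, which Gemini only summarizes at a high level here.]

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """logits: (batch, vocab_size) raw scores for the next token."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                 # greedy: pick the most probable token
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)  # stochastic sampling

logits = torch.randn(1, 32000)                       # fake vocabulary of 32,000 tokens
print(sample_next_token(logits, temperature=0.0))    # deterministic
print(sample_next_token(logits, temperature=0.8))    # varies from run to run
```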
Basic Request Flow (Batch Mode):
- Input Reading: The script reads prompts from a provided input file.
- Tokenization: Each prompt is tokenized.
- Model Inference: The model runs with the batch of tokenized prompts.
- Output Generation: The model generates tokens for each input in parallel.
- Decoding & Display: The outputs are decoded into text and printed.
Key Points & Optimizations:
- FP8 Quantization: FP8 is used to drastically reduce the memory footprint of the model, increasing the number of parameters that can fit on a GPU, and potentially accelerating operations.
- Triton Kernels: Custom Triton kernels are used for FP8 quantization and dequantization, as well as for the matrix multiplication. This leads to much faster operations than native PyTorch.
- MoE (Mixture of Experts): Using a MoE layer increases the model size (number of parameters) without significantly increasing compute, by using a router that activates only a small fraction of experts at a time.
- Distributed Training/Inference: The code is designed to support running across multiple GPUs using torch.distributed, with embedding and linear layers sharded across GPUs to distribute workload and memory requirements.
- Efficient KV Caching: The attention layer has custom implementations for KV caching.
- RoPE: Rotary Positional Embeddings are used instead of traditional positional embeddings.
- Safetensors: safetensors are used to store model weights, which makes loading and saving models much faster.
In summary, this is a sophisticated project focused on creating an efficient large language model by combining the power of MoE, FP8 quantization, and custom Triton kernels with distributed processing capabilities.
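[AS: The “Efficient KV Caching” key point deserves a small illustration. The sketch below shows only the generic idea (store each step’s key/value tensors so earlier tokens are not recomputed); DeepSeek’s MLA layer uses a more elaborate scheme, with the “naive” and “absorb” variants that come up in the next walk-through.]

```python
import torch

class TinyKVCache:
    """Minimal per-layer KV cache: each generation step appends its new key/value,
    and attention then runs over everything cached so far."""
    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        self.k = torch.zeros(1, max_len, n_heads, head_dim)
        self.v = torch.zeros(1, max_len, n_heads, head_dim)
        self.len = 0

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (1, 1, n_heads, head_dim) for the single newest token
        self.k[:, self.len] = k_new[:, 0]
        self.v[:, self.len] = v_new[:, 0]
        self.len += 1
        return self.k[:, : self.len], self.v[:, : self.len]

cache = TinyKVCache(max_len=16, n_heads=2, head_dim=8)
for _ in range(3):
    k, v = cache.append(torch.randn(1, 1, 2, 8), torch.randn(1, 1, 2, 8))
print(k.shape, v.shape)   # torch.Size([1, 3, 2, 8]) for both
```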
Asking for a chronological summary of DeepSeek operation
AS: That’s pretty good, but can you take the details in your “Core Functionality” and inter-mingle them with the “Basic Request Flow (Interactive Mode)” section, so the details appear roughly in chronological order?
Gemini: Okay, let’s integrate the core functionality details into the interactive request flow, presenting a chronological view of what happens when a user interacts with the system:
Interactive Request Flow with Core Functionality Details: [AS: there were 9 steps above; there are 8 here because Prompt Handling and Tokenization have been combined into a single step; Model Initialization has been added as a separate step; and the separate Forward Pass step above has been combined below with Model Inference — which, like the Forward Pass step above, is the big deal here.]
1. User Input:
- The user types a prompt in the console within the generate.py script.
- If running in a distributed setup (world_size > 1), rank 0 receives the input and broadcasts it to all other ranks using dist.broadcast_object_list. This ensures all processes have the same prompt.
2. Prompt Handling and Tokenization (generate.py)
- The prompt is received by the active process(es) in generate.py.
- The AutoTokenizer (from the Hugging Face Transformers library), initialized using the ckpt_path, converts the text prompt into a sequence of numerical token IDs.
- If there is a history of messages, the tokenizer uses apply_chat_template to format the current prompt into a suitable chat prompt including this history of user and bot messages.
- This tokenization step also handles adding any special tokens (like the end-of-sequence eos_id).
3. Model Initialization (generate.py):
Before any generation can begin, the Transformer model is initialized. This involves:
- Loading the ModelArgs from a JSON configuration file (specified in the –config argument) to set the model’s hyperparameters (like dim, n_layers, n_heads, moe_inter_dim, etc.). This is handled by the ModelArgs dataclass defined in model.py.
- Instantiating the Transformer class (also in model.py).
- Within Transformer, the following happens:
- The ParallelEmbedding layer is initialized to handle token embeddings across distributed GPUs. It distributes the embedding matrix across GPUs by vocabulary index using part_vocab_size, ensuring parallel embedding lookups across ranks.
- Multiple Block layers are created, each containing:
- An MLA (Multi-Headed Attention) layer: This is where the attention computation happens, and it leverages custom linear layers (ColumnParallelLinear and RowParallelLinear) to optimize performance and memory management across distributed GPUs.
- A feed-forward network, which is either a MLP (Multi-Layer Perceptron) or a MoE (Mixture-of-Experts) layer based on the layer ID and configuration:
- MLP layers have ColumnParallelLinear, RowParallelLinear linear layers, and RMSNorm normalization.
- MoE layers employ a Gate (for routing), multiple Expert layers (each having Linear layers and activations), and a shared_experts MLP.
- RMSNorm normalization layers before both attention and feedforward networks
- A final RMSNorm layer
- A ColumnParallelLinear head layer that projects the output to the vocabulary size.
- Precomputation of the rotary positional embeddings frequencies (freqs_cis) using precompute_freqs_cis for more efficient computation within the attention layers.
Loading model weights from safetensors files for the current rank via load_model.
4. Model Inference (generate.py, Transformer.forward):
The sequence of token IDs is then passed to the Transformer.forward method.
Inside Transformer.forward:
- The input token IDs are converted to their corresponding vector representations using the ParallelEmbedding layer.
- The precomputed rotary positional embeddings (freqs_cis) are fetched based on the current sequence position.
- The code enters a loop that processes each Block within the Transformer:
- Attention Layer (MLA.forward):
- The input tensor is projected to query (q), key (k), and value (v) tensors using ColumnParallelLinear layers for wq and wkv_b which may use FP8/BF16 based on configuration. wkv_a also projects to an intermediate LoRA representation for key and value before wkv_b. These projections are potentially low rank as controlled by q_lora_rank and kv_lora_rank via low rank projections in wq_a and kv_norm.
- The query and key tensors are split into no-position-information and rope parts, and the positional information is added via the apply_rotary_emb function with the precomputed freqs_cis.
- An optimized attention mechanism computes attention scores based on either cached key/value tensors (attn_impl = ‘naive’) or a precomputed intermediate result (attn_impl = ‘absorb’), and combines them with the value tensors to get the final attention output.
- Feed-Forward Network (MLP.forward or MoE.forward):
- If the layer is a regular dense layer (layer_id < args.n_dense_layers), an MLP layer is used, processing the input using ColumnParallelLinear, RowParallelLinear, and F.silu.
- Otherwise, a MoE layer is used:
- The Gate module determines which experts should be used for the current input by computing weights and indices using custom linear layers and score functions (‘softmax’ or ‘sigmoid’).
- The selected Expert modules process the input, where each expert is a basic MLP. A shared MLP (shared_experts) is also applied to all inputs.
- The result of MLP or MoE is returned.
- The output of the final block is normalized using RMSNorm, and the last token in the sequence is projected to logits using the ColumnParallelLinear head layer.
5. Token Sampling (generate.py, sample):
- The logits (raw prediction scores for each token in the vocabulary) are then passed to the sample function.
- The sample function applies the temperature scaling and either deterministically (when temperature == 0) selects the token with the highest probability, or probabilistically selects a token with higher probability, according to the temperature.
6. Output Generation & Iteration (generate.py):
- The sampled token is appended to the current sequence of tokens.
- This loop continues (with a maximum iteration of max_new_tokens or until an eos_id is generated), iteratively predicting subsequent tokens based on previous tokens.
- If running in distributed mode, it will be the rank 0 process that will collect and complete the full token sequence to decode.
7. Decoding (generate.py): The final list of generated token IDs is converted back into human-readable text using the tokenizer’s decode method.
8. Output Display (generate.py): The generated text is then printed to the console. The entire process repeats when the user enters a new prompt.
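[AS: To make the Gate/Expert routing in step 4 more concrete, here is a toy top-k Mixture-of-Experts layer in plain PyTorch. The dimensions, the softmax scoring, and the per-token loops are illustrative only; DeepSeek’s Gate also supports sigmoid scoring, grouped routing, and the shared experts mentioned above, none of which are shown here.]

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy top-k MoE: a gate scores the experts, each token is sent to its top k,
    and the expert outputs are combined weighted by the gate scores."""
    def __init__(self, dim: int = 16, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)               # routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)           # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(10, 16)).shape)   # torch.Size([10, 16])
```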
Key Functionality Integration Summary:
- The interactive flow now clearly shows when key components like ModelArgs, ParallelEmbedding, MLA, MLP, MoE, RMSNorm, Linear, precompute_freqs_cis, and the custom Triton kernels, are actually used within the generation process.
- The use of FP8, MoE, and the Triton kernel optimized operations are interwoven throughout this process rather than being treated as a separate detail.
This chronological, integrated explanation provides a more concrete understanding of how each part of the code contributes to the overall process of text generation.
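[AS: Rotary positional embeddings come up repeatedly above (precompute_freqs_cis, apply_rotary_emb), so here is a minimal, standard Llama-style RoPE sketch. It is not DeepSeek’s code; in particular, DeepSeek’s MLA applies RoPE only to the “rope part” of the query/key, as Gemini notes in step 4.]

```python
import torch

def precompute_freqs_cis(head_dim: int, max_seq_len: int, theta: float = 10000.0):
    """One complex rotation per (position, frequency) pair; head_dim must be even."""
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), freqs)   # (seq, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)              # complex "cis" values

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq, n_heads, head_dim); adjacent dims are paired as complex numbers."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_c = x_c * freqs_cis[: x.shape[1]].view(1, x.shape[1], 1, -1)    # rotate by position
    return torch.view_as_real(x_c).flatten(-2).type_as(x)

q = torch.randn(1, 8, 4, 64)                  # (batch, seq, heads, head_dim)
print(apply_rotary_emb(q, precompute_freqs_cis(64, 128)).shape)   # torch.Size([1, 8, 4, 64])
```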
Really, less than 1,400 lines of code to do all this?
AS: Nice. Now, there are less than 1,400 lines of python code here. Is this really it? I know it’s relying on underlying libraries like Transformers and Triton, and that it requires safetensors weights files, but are these <1400 LOC really enough to do the entire input-process-output loop you’ve described? Is there some code missing maybe?
[Despite Gemini’s “That’s a very insightful question!” below, this was a somewhat naive-sounding question, and Gemini is perhaps merely humoring me. For one thing, code for training the model is clearly not included here (see the later discussion of training). Further, the DeepSeek models can be run locally with far less code than this: see below, where only about 50 lines of code are used to download, install, and run deepseek-ai/DeepSeek-R1-Distill-Llama-8B via the standard transformers package. On the other hand, “What code might be missing here?” is one of the key questions that source-code examiners need to ask. My “it’s relying on underlying libraries like Transformers” is a bit misleading, since while the code imports AutoTokenizer from transformers, it has its own class Transformer. Note the natural-sounding parts of Gemini’s reply below, like “This is a HUGE dependency!”]
Gemini: That’s a very insightful question! You’re right to be skeptical about the apparent brevity of the code. It’s impressive how much functionality is packed into less than 1400 lines, but it’s crucial to understand that yes, this code is essentially enough to perform the end-to-end loop we described, but not without substantial reliance on external libraries and implicit setup.
Here’s a breakdown of why it works, what’s “hidden” or implied, and why the LOC count can be deceiving:
What’s Included (Explicitly):
- Core Model Architecture (model.py): This file defines the complete structure of the language model. It specifies the layers, connections, attention mechanism, and MoE routing. It’s self-contained in the sense that it describes the model mathematically and logically, but it doesn’t have the optimized implementations of every operation.
- Custom Optimizations (kernel.py): This provides the accelerated kernels for FP8 quantization, dequantization, and matrix multiplication using Triton. This is a key component of the custom implementation of the Transformer model and provides a dramatic speedup.
- Conversion Scripts (convert.py, fp8_cast_bf16.py): These scripts handle the task of taking model checkpoints from standard formats (e.g., Hugging Face) and preparing them for this custom implementation, including the weights and the scaling factors.
- Generation Logic (generate.py): This file orchestrates the text generation, including loading the model, handling user input/batch input, tokenizing the text, using the model for inference, and decoding the output.
What’s Implicitly Provided or Leveraged (External Libraries & Assumptions):
- Hugging Face Transformers (transformers):
- AutoTokenizer: This is a HUGE dependency! It’s the only way the project knows how to translate text into token IDs and back. It encapsulates tons of text processing, vocabulary management, and encoding/decoding logic. We simply instantiate and use this as an object — it is not re-implemented or included here.
- apply_chat_template: A convenient way to format prompts as expected by modern chat models, with minimal complexity.
- [AS: it probably ought to have more explicitly underlined how, while model.py imports from transformers, class Transformer itself is NOT imported, and comes directly from inside model.py. See list of classes in model.py shown below.]
- PyTorch (torch):
- Tensor Operations: The model relies heavily on PyTorch tensors and operations for computations, which include basic arithmetic operations, matrix manipulation, etc. All the tensor math and optimization functions are implicitly provided by PyTorch.
- Autograd: The scripts in here, while running inference, implicitly leverage PyTorch’s underlying autograd framework for gradient calculations (even though the model itself isn’t being trained in these files). This is why operations like torch.matmul can execute on custom types like torch.float8_e4m3fn.
- CUDA Support: PyTorch manages device allocation, memory management, and CUDA kernel launches.
- Distributed Processing (DDP): PyTorch’s distributed framework is used to enable running across multiple GPUs.
- Triton (triton):
- Kernel Compilation: Triton handles the JIT compilation of the custom kernel functions in kernel.py for efficient CUDA execution. This involves managing the compilation of custom CUDA kernels, thread management, and memory access patterns.
- Configuration & Autotuning: The Triton API simplifies autotuning of the kernels based on problem size.
- Safetensors (safetensors):
- Fast Weight Loading: Safetensors are used to store model weights, which are faster and more secure than pickle format.
- Operating System & Environment: The code also relies on the correct CUDA drivers being installed, correct Python setup, and the necessary environment variables for distributed processing.
- Implicit Setup: The code assumes that files like model.safetensors.index.json exist and are in the correct format.
Why the Code Looks Short:
- Abstraction: Libraries like PyTorch and Transformers provide high-level abstractions, letting us focus on the model’s logic rather than low-level implementation details of tensor manipulation, CUDA programming, etc.
- Specialized Kernels: Triton lets us write efficient, specialized GPU kernels for critical computations without having to write full CUDA code ourselves.
- Focus on Core Logic: The code concentrates on the unique parts of the system (the model architecture, the custom operations), and delegates everything else to external dependencies.
- Declarative Style: The model definition in model.py is very declarative, focusing on what the model is, rather than how every single operation should be implemented step-by-step.
Is Anything “Missing”?
No, not in the sense that the code is incomplete. It contains the full cycle of input to model inference to output; however, it relies heavily on libraries and a pre-existing ecosystem to operate. If you tried running this code in a bare-bones Python environment, you would quickly discover the need for all the dependencies and prerequisites. It’s a sophisticated approach that leverages the power of other libraries to minimize the LOC while maintaining complexity.
Analogy: Think of it like building a house. This code defines the blueprint for the house, the design of the walls, and the wiring. It doesn’t reinvent how to mix concrete, forge metal, or produce electricity. It relies on established systems and components to make it a fully functional home. [AS: okay, we didn’t need that for our expert report; but it might be useful in thinking how to explain something to a lay jury.]
In Conclusion: The <1400 LOC is indeed enough, given the heavy reliance on well-established libraries for lower-level functionality. It’s a testament to the power of abstraction and code reusability. The project is not trying to create PyTorch, CUDA or a tokenizer from scratch; instead, it builds upon their foundations to produce a sophisticated, high-performance language model.
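[For readers who have not seen Triton before, here is the standard vector-add “hello world” kernel. It is obviously not DeepSeek’s FP8 code, but the act_quant/weight_dequant/fp8_gemm kernels in kernel.py follow this same pattern: a @triton.jit function that loads, computes on, and stores fixed-size blocks, launched over a grid from ordinary Python.]

```python
# Requires an NVIDIA GPU with triton and torch installed.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                   # which block this program instance handles
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                               # guard against running off the end
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                # one program instance per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

a, b = torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")
print(torch.allclose(add(a, b), a + b))           # True
```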
[AS: Gemini probably ought to have referred to the .json files: in addition to config.json and tokenizer_config.json, which Gemini referenced elsewhere, also see the large tokenizer.json and model.safetensors.index.json files. Also see the large modeling_deepseek.py and configuration_deepseek.py files, not covered here.] [[TODO: explain why the 2,000 LOC in these two files provide a different way of running DeepSeek, and do not contradict what has been said about the 5 .py files examined here.]]
[AS: Below, I’m showing the classes (and function definitions) inside model.py. Note that each major class has its own forward() function — forward is the forward-pass function that gets a prediction from a neural network. Sorry about the dropped-in screenshot but WordPress doesn’t make it easy to preserve indents (even with its big ugly “Preformatted” font).]
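[For this text-only version, here is an approximate outline of model.py, reconstructed from Gemini’s class list above rather than pasted from the file; signatures, helper functions, and inheritance are simplified.]

```python
from torch import nn

class ModelArgs: ...                        # dataclass of hyperparameters
class ParallelEmbedding(nn.Module): ...     # vocabulary sharded across GPUs; forward()
class Linear(nn.Module): ...                # FP8/BF16-aware linear layer; forward()
class ColumnParallelLinear(nn.Module): ...  # forward()
class RowParallelLinear(nn.Module): ...     # forward()
class RMSNorm(nn.Module): ...               # forward()
def precompute_freqs_cis(args): ...         # RoPE frequency table
def apply_rotary_emb(x, freqs_cis): ...     # apply RoPE to query/key tensors
class MLA(nn.Module): ...                   # attention, low-rank projections, KV cache; forward()
class MLP(nn.Module): ...                   # forward()
class Gate(nn.Module): ...                  # MoE router; forward()
class Expert(nn.Module): ...                # forward()
class MoE(nn.Module): ...                   # forward()
class Block(nn.Module): ...                 # attention + feed-forward + norms; forward()
class Transformer(nn.Module): ...           # embeddings, Blocks, output head; forward()
```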
Gemini goes a little too far out on a limb about Chain of Thought: “CoT is an Emergent Property”
[We’ll compare Gemini’s answer here, based only on the Python code, with its later revised answer after it has seen DeepSeek’s research papers. (To reiterate, Gemini here is being run with the “Grounding with Google Search” button in Google AI Studio disabled, so the answers should be based only on the Gemini 2.0 model, augmented by the files I have uploaded.) We can also contrast this with the results below from the smaller local DeepSeek model, which does generate CoT text, but which we’ll see has a babbling rather than thoughtful quality.]
AS: DeepSeek-R1 has become famous as a “reasoning” model which produces very detailed chain of thought (CoT) text. There isn’t anything like that in this code, is there?
Gemini: You are absolutely correct. There’s nothing explicit within this code that directly implements or enforces Chain-of-Thought (CoT) reasoning. The code provides the infrastructure for a language model, but the reasoning abilities of any specific model trained on this architecture depend heavily on:
- Training Data: The primary driver for CoT capabilities is the training dataset. A model needs to be trained on data that explicitly includes examples of chain-of-thought reasoning (e.g., problems with step-by-step solutions), to learn this style of generation.
- Model Size & Capacity: Larger models with more parameters tend to be more capable of CoT. The model architecture described in the code (model.py) allows for large models and MoE architectures, but that alone isn’t sufficient.
- Fine-Tuning: Often, even after pretraining on vast text corpora, models are further fine-tuned on specific tasks (including CoT tasks), or using techniques like reinforcement learning from human feedback (RLHF) to improve their reasoning ability.
Why this code doesn’t implement CoT directly:
- Focus on Infrastructure: This code focuses on the model architecture (model.py) and the low-level optimized computations needed to make a large model practical. It’s more concerned with how to efficiently execute the model than what the model should be trained to generate.
- Generality: The code provides a generalized LLM architecture, not a model specialized for a specific task. It can be used for various tasks, CoT included (if properly trained).
- CoT is an Emergent Property: CoT is generally considered an emergent property that results from how the model is trained, not something that is programmed in directly via the model architecture itself.
How DeepSeek-R1 Achieves CoT Reasoning (Implied, but not in this Code):
- Pretraining on Large Datasets: DeepSeek-R1 would have been pretrained on a massive dataset that has examples of implicit reasoning patterns.
- Fine-Tuning Datasets with CoT: DeepSeek-R1 would likely have been fine-tuned (either supervised or via RL) on a dataset of explicit chain of thought demonstrations. This is where it would have explicitly learned to generate text that walks through the reasoning process.
- Instruction Following: The final stage would involve training the model for following specific instructions. DeepSeek-R1 may have been trained on instruction datasets that demonstrate when to use CoT, or to explicitly use CoT instructions to guide output generation.
What You See in the Code, and How it Can Be Used for CoT:
- Model Capacity: The ability to create a large model (via the MoE capabilities) is crucial. This code would enable large enough models to be able to encode more complex relationships between inputs, allowing more sophisticated reasoning to emerge.
- Optimizations: The optimized kernels in kernel.py, and the use of safetensors files, make it feasible to train and run large models, including a CoT model. Without these performance improvements, training and inference would be far less tractable.
- Generation Loop: The generate.py script and model inference cycle is the place where the model’s trained behavior becomes visible; the model can express its CoT capabilities here.
In Summary:
The provided code lays the foundation for a large language model, but it does not explicitly include chain-of-thought reasoning or anything that defines that pattern of reasoning. CoT is a result of how the model is trained on specific data, rather than something inherent in this model architecture or the generation code itself.
To get a DeepSeek-R1-like reasoning model using this code as a base, you would need to:
- Train your model (using an external training loop) on data that includes chain-of-thought reasoning examples.
- Fine-tune the model, potentially using reinforcement learning to encourage a more thoughtful generation process, and follow instructions, which would influence the decision to use chain-of-thought.
The code you have is very capable; it just needs to be used appropriately.
AS: You’re saying that “reasoning” and ability to show the user a “chain of thought” are emergent properties that largely come out of the training? [We’ll see later, using the example of the small local DeepSeek model, that there’s a significant difference between a model trained on reasoning-related text on the one hand, and one that actually exhibits reasoning-like behavior, i.e. that can productively use the reasoning-related text it has learned.]
[Gemini’s answer below seems oriented toward emphasizing that “reasoning” behavior can come right from LLM basics (basically, training on enormous amounts of text, and using it for next-token prediction), without needing to add in something like “logic” or symbols from good old-fashioned AI (GOFAI). That’s a crucial point, which I’ve chatted about with other LLMs (e.g. this chat with Claude), but Gemini goes too far with this here. Below, I like Gemini’s “Yes, that’s precisely what I’m saying” — though we’ll soon see it back off a bit.]
Gemini: Yes, that’s precisely what I’m saying. The ability of a large language model to perform “reasoning” and produce a “chain of thought” (CoT) is largely an emergent property that stems primarily from the training process, rather than being explicitly programmed into the model’s architecture or generation logic.
Here’s a more detailed breakdown of why this is the case:
Emergent Properties:
- Definition: An emergent property is a characteristic that arises in a complex system that is not present in any of the individual components of that system alone. In the context of LLMs, reasoning and CoT are not explicitly built-in functionalities but rather arise as a consequence of the vast scale and complexity of the neural network and its training.
- Non-Linear Relationships: Neural networks, especially deep ones, model highly non-linear relationships between inputs and outputs. These non-linearities allow the network to learn to model intricate patterns and dependencies within the data it’s trained on. This means the model has the ability to represent concepts without it having been explicitly told what each concepts is, or how it is represented in the world.
- Scale Matters: The scale of the model (number of parameters) and the scale of the training dataset play a crucial role in the emergence of these abilities. A larger, more complex model trained on a more diverse, larger dataset is more likely to exhibit more complex emergent behaviors.
CoT as an Emergent Property:
- Not a Function: CoT isn’t a specific function or algorithm that’s explicitly encoded into the model architecture (like a specific attention mechanism or gate). It’s more like a learned style of generation.
- Implicit Learning: Through exposure to data that implicitly encodes chains of reasoning (e.g., step-by-step solutions), a model can learn the pattern of generating text that presents its reasoning before its conclusion. It learns to structure its outputs in a way that mimics a chain of thought process.
- Statistical Modeling: Ultimately, an LLM is a statistical model that learns to predict the next token in a sequence. During CoT, the model learns that sequences of tokens that represent the intermediate steps in reasoning are more likely to lead to a correct final token sequence, but it doesn’t understand why. It is modeling the co-occurrence of those intermediate reasoning steps, not the logic or semantic meaning within them.
- No Explicit “Reasoning” Logic: The model doesn’t have an explicit “reasoning engine” or a set of logical inference rules baked into it. Instead, it learns to perform reasoning through complex transformations and calculations, using all of those implicit relationships. It doesn’t know why it’s reasoning; it just knows that by reasoning, it will get to a higher overall score on the language modeling objective that was used to train it.
The Role of Training Data:
- Demonstrations: If a model is primarily trained on data where the intermediate reasoning steps are directly demonstrated (e.g., a dataset of solved math problems with explanations), it’s more likely to learn to produce CoT outputs. The model learns to mimic those patterns of output generation.
- Instruction Following: Training on data where the model is prompted to “show its work” or “explain its reasoning” will further enhance its ability to produce CoT outputs.
- Data Diversity: Training on a broad range of data that implicitly encodes knowledge, facts, common sense, and basic logic is crucial. This gives the model the knowledge needed to reason about complex ideas.
Implications for this code:
- The architecture described in the code is a platform; you could train a CoT model or a model that does not generate CoT at all on it.
- No guarantee of CoT: Simply using this model architecture won’t automatically result in a model capable of CoT reasoning.
- Focus on training: To get CoT, the training phase (which isn’t fully described in this codebase) is critical.
In Summary:
The ability of LLMs to perform tasks like reasoning and to generate CoT outputs is not inherent in any specific piece of code or model architecture. These are sophisticated emergent properties arising from the scale of the model, the vastness of the training dataset, and the intricate relationships learned by the model through exposure to that data. The model learns to mimic chain of thought as a style, rather than having a logic engine driving reasoning. This code here lays the groundwork for models capable of such behavior, but the behavior itself requires the right training process.
AS: That’s a fascinating answer. I’m not sure I fully understand it. For example, even if chain of thought (CoT) were largely an emergent capability, for an AI chatbot to show the end-user some stream of consciousness text perhaps representing CoT, there would need to be explicit code to show that, wouldn’t there?
[And, more important, there would need to be some employment of the CoT in generating the LLM’s answer; again, see odd CoT from the small local DeepSeek model below.]
[Gemini’s “You’ve hit on a crucial point” below (like its earlier “That’s a very insightful question!”) is a super-annoying tic these AI chatbots have (I’ve had only limited success getting them to cool it — feels a bit like dim-bulb Emperor Joseph in Amadeus: “Please, Herr Mozart, a little less enthusiasm, I implore you”), but I suppose I asked for it here with my own “That’s a fascinating answer.”]
Gemini: You’ve hit on a crucial point, and it’s where the nuance of “emergent” and “explicit” coding can get tricky. You’re right that even if CoT is an emergent capability from training, there needs to be some explicit code in the generation process to reveal or show that chain of thought to the end-user. Let’s break down why:
The Emergent Capability vs. Explicit Output:
- Emergent Capability: The neural network learns, through training, to internally represent and manipulate information in a way that supports a chain-of-thought process. This is the implicit, learned capability. It learns to think through its answer in a step-by-step way.
- Explicit Output Generation: But the act of generating text that displays this chain of thought to the user is a separate process that does require some explicit logic and instruction in the output generation code.
- The Bridge: The code that acts as a bridge between the model’s internal state of reasoning and its textual output. This code is typically part of the generation loop that was discussed.
Why Explicit Code Is Needed to Show CoT:
- Token Generation Order: The core code involved in generation (which we discussed earlier, in generate.py‘s loop), needs to be told to predict each token in sequence, so that intermediate tokens can be observed. If the code were designed to directly generate tokens without intermediate steps, that would be equally valid, but there would be no opportunity to see the CoT.
- Formatting: The generated token sequence needs to be decoded and presented in a human-readable format. There may be rules in the decode function to format the output to make it appear as a chain of thought.
- No Special CoT “Flag”: The model isn’t trained to set some special “I’m now doing CoT” flag. The fact that a chain of thought output is produced is largely dictated by the training data, but not in a form which a user could inspect.
- Controlled Behavior: It is the generation code (in generate.py) that can influence how the model decides the order in which the tokens are generated, which affects whether the model is presenting a step-by-step CoT, or simply its answer.
What is Explicit in the generate.py code (but isn’t specifically “CoT” code):
- generate function: This function iteratively calls the model and samples from the output. It does this one token at a time, which makes it possible to see the CoT. If generate were to only produce a final answer, the CoT would be implicitly performed internally, but never displayed.
- Decoding Process: The decoding code, using the tokenizer.decode function, will show the predicted tokens, so when the model generates intermediate tokens, these will be displayed.
- Token Sampling Loop: The basic loop in generate decides how to generate a sequence, but not what to generate.
What’s NOT Explicit in the generate.py code (and not needed):
- Explicit CoT Logic: There are no special CoT flags, logic, or rules that check to see if the model is “reasoning” or that would dictate to the model the correct approach to a problem. It simply generates output text one token at a time, and because of training data, that generation may follow the form of a chain of thought.
Example: Imagine the model, after training, is capable of doing a math problem by going through each step.
- Without Explicit Code: The model could generate the final answer to a math problem without any output indicating that it actually went through multiple steps. It may internally perform these steps in its neural network, but there’s no code prompting it to display those steps.
- With Explicit (But not CoT-specific) Code: The generate loop in generate.py samples one token at a time. The model is trained to output tokens that would reflect each step, so the user sees the model “thinking out loud.” This isn’t because of some special part of the generate.py code — it is the result of the trained model’s behavior in response to the token-by-token sampling procedure.
In Summary:
- The code in generate.py, through its sequential token generation loop and standard decoding, provides the necessary mechanism for the model to show its reasoning process.
- The content of that reasoning (whether it is a detailed CoT or something else) comes primarily from the model’s learned behavior from its training data, not from any custom CoT code in the generation logic.
- The main job of generate.py is to decide how many tokens to generate (given max_new_tokens and eos_id) and to use a model to predict what each token should be. The format and contents of those tokens are largely an emergent property from training data.
You’re right to point out that there is explicit code involved in displaying the CoT, but it’s not a custom CoT implementation. It’s just the standard mechanism needed to generate any text sequentially, which, through a model trained to use chain of thought, will reveal the internal reasoning that it performs.
[“You’re right to point out… It’s just…”: Gemini, like other chatbots, has mastered the art of finding something to agree with in what a naive user says, even as it tries to gently steer them to its own point of view.]
[Note that Gemini’s somewhat vague hand-waving reference to “emergent” chain of thought should not be confused with DeepSeek’s already-famous “Aha moment” during the training of DeepSeek-R1-Zero. (It’s not quite Move 37, but still pretty nice.) See example CoT in Table 3 in the DeepSeek-R1 paper, in which while trying to solve a math problem it says to itself “Wait, wait. Wait. That’s an aha moment I can flag here”, and the paper notes: “The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.” Though this is in a sense an emergent behavior, it emerges from very specific training. As the R1 paper states, “Through RL [reinforcement learning], DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.”]
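[To make Gemini’s point concrete (the “chain of thought” the user sees is just ordinary token-by-token generation, with no separate CoT code path), here is a bare-bones greedy decoding loop written against a generic Hugging Face-style causal LM. It is a sketch, not DeepSeek’s generate.py.]

```python
import torch

@torch.no_grad()
def generate_greedy(model, tokenizer, prompt: str, max_new_tokens: int = 100) -> str:
    """Token-by-token greedy decoding. Any reasoning text the model emits
    ("thinking out loud") is simply part of this same stream of sampled tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]           # scores for the next token only
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```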
Asking Gemini to re-evaluate its source-code analysis in light of DeepSeek research paper
AS: I’ve uploaded a PDF file that came with the code. I believe this discusses reasoning and CoT in DeepSeek-R1, and perhaps how it relates to DeepSeek-V3 (to which the Python code you’ve analyzed corresponds). Can you read this over, and let me know if it changes anything substantial you’ve told me so far in this session, including on reasoning and CoT as emergent (rather than as some sort of add-on for which I don’t see code)? [The paper is the “DeepSeek-V3 Technical Report”. This is 53 dense single-spaced pages. Google Gemini’s “Okay, I’ve thoroughly reviewed…” reply comes just a few seconds later:]
Gemini: Okay, I’ve thoroughly reviewed the DeepSeek-V3 Technical Report you provided. It’s a very detailed and insightful document that provides crucial context for understanding the architecture, training, and evaluation of the DeepSeek-V3 model. Here’s a breakdown of how it modifies and enriches what we discussed earlier, particularly regarding reasoning, chain-of-thought (CoT), and their emergent nature:
Key Takeaways from the Report (Relevant to our Discussion):
- Explicit Focus on Reasoning and CoT: Unlike the code itself, the report explicitly highlights reasoning as a key performance target. It acknowledges the importance of CoT as a specific strategy for enhancing reasoning, especially for more complex problems. The report states that they “distill the reasoning capability” from a DeepSeek-R1 series model (which we now know is a model trained specifically for CoT).
- Distillation of Reasoning from DeepSeek-R1: The most crucial point is that DeepSeek-V3 does not learn to reason solely from scratch. Instead, it uses a technique of knowledge distillation, where they:
- Train a DeepSeek-R1 model for strong CoT reasoning.
- Use this DeepSeek-R1 to generate data that demonstrates the desired reasoning and formatting.
- Use this data to fine-tune DeepSeek-V3, such that it learns to mimic R1’s reasoning approach, and structure its own responses accordingly.
- Multi-Token Prediction (MTP): The report describes MTP as a training objective that improves model performance by predicting multiple future tokens rather than just one, and may contribute to a more detailed output. It also notes this can be used for speculative decoding for faster output. This helps reinforce the patterns learned during the distillation process.
- Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL): These methods, often with R1-generated data, are used in post-training to align the model with user preferences and further unlock its reasoning potential, often involving CoT. It shows that there is an explicit effort to encourage the model to take a step-by-step approach to answering questions.
- Auxiliary-Loss-Free Load Balancing: The paper talks about their approach to MoE that minimizes load-imbalance with an auxiliary-loss-free technique, allowing better scaling, better parameter efficiency, and ultimately allowing for larger, more capable models.
- FP8 Training Validation: The report confirms and details the effectiveness of FP8 training, a critical component enabled by the custom Triton kernels in the code we reviewed. This makes it feasible to train much larger and therefore more capable models.
How This Changes (or Reinforces) Our Previous Discussion:
- Reasoning and CoT are Not Solely Emergent: While the capacity for reasoning might be an emergent property of the model architecture, the report makes it very clear that the specific style of reasoning and CoT is explicitly learned through fine-tuning with the R1-generated datasets (and through specific post-training regimes) described above. This doesn’t invalidate the emergent concept but adds a crucial layer of influence: guided emergence.
[AS: “guided emergence” — not a bad way of putting it, but also sounds like Gemini has learned (perhaps from training on academic literature) how to defend its previously-stated positions in light of new information that should alter them. Below, “not just an emergent property, but also a deliberately transferred one” is a clearer way of putting it.]
- The Power of Distillation: The report highlights how knowledge distillation allows the researchers to transfer reasoning abilities from a more specialized model (DeepSeek-R1) to a more general model (DeepSeek-V3) by leveraging its outputs to generate data and fine-tune a new model. This goes beyond just creating training data from text. This means that the chain-of-thought is not just an emergent property, but also a deliberately transferred one, since the target of that transfer was a demonstration of a chain-of-thought style.
- Emphasis on Training: The report emphasizes the importance of the entire training pipeline, especially the post-training stage with SFT and RL using carefully curated data, especially including CoT demonstrations.
- Code as Infrastructure: The code is still the foundational infrastructure; it just needs the right training to actually realize its potential for reasoning. The code lays out the basic building blocks (layers, architecture, low-precision support, etc.). But the report demonstrates that much of the work and innovation is in the training stage to achieve the desired behavior. The training is what causes the model to exhibit the behaviour of thinking step-by-step via chain of thought.
- No CoT “Module” in Code: The report reinforces that there’s no specific, explicit “CoT module” within the architecture. Instead, the training regime causes the model to adopt a style of output that reflects internal reasoning. The generate loop in generate.py simply enables that ability to be displayed by generating one token at a time.
Key Refinement to My Previous Statements:
I previously stated that CoT was “largely” emergent. After reading the report, I would refine that to:
- The capacity for complex reasoning is likely an emergent property of scale and architecture. The specific style of reasoning and chain-of-thought generation is primarily acquired through knowledge distillation and the explicit use of datasets that demonstrate CoT reasoning. The way it shows CoT to a user is a function of the token-by-token generation procedure, not some special kind of code specific to CoT.
In Summary:
The DeepSeek-V3 technical report significantly clarifies the role of training in achieving CoT capabilities. It shows that while the model architecture is a prerequisite, it’s the carefully designed training process, particularly the distillation from DeepSeek-R1 and the post-training regime, that allows DeepSeek-V3 to effectively produce chain-of-thought reasoning in its outputs. It is not simply an emergent behavior. It is carefully engineered and transferred using fine-tuning.
This does not mean that all CoT is engineered – but DeepSeek V3’s is. The code we have reviewed is sufficient to display and enable CoT because of its core design, but it relies on a particular type of training to generate CoT at all.
The report makes the relationship between the code and the observed behavior much clearer, and that relationship revolves heavily around the training methodology, not the model architecture or generation logic alone.
[I probably ought to follow up with a question such as: Given what you’ve now learned from this PDF, is there anything you didn’t see in the source code files uploaded initially, which you would expect to be there? Clearly not code for training (you wouldn’t expect that in this code for running the model), but anything else?]
Revising the source-code walk-through in light of documents
[Below, my interest is partly in how LLMs do with the important task of correlating source code with non-source documents. I’ve done some tests of this with OpenAI’s “DeepResearch” that I’ll soon show at this site.]
AS: I’m uploading two .md files that come with the code as it appears in GitHub. Can you take these, plus what you’ve seen in the PDF file, and integrate analysis of these with the earlier input-to-output flow description you gave me? In other words, use the three additional documents to help flesh out your earlier description that was based on the code. By “your earlier description,” I don’t mean the very first one, but the second one, where you said “let’s integrate the core functionality details into the interactive request flow, presenting a chronological view of what happens when a user interacts with the system”, and then beginning with your title “Interactive Request Flow with Core Functionality Details”. I’m asking you to now re-do that, also using the 3 most recent documents. [The two .md (markdown) files are README.md and README_WEIGHTS.md , from DeepSeek-V3 on GitHub.]
Gemini: Okay, I’ve now integrated the information from the README.md, README_WEIGHTS.md, and the DeepSeek-V3 Technical Report into the interactive request flow description. This will provide a more comprehensive picture of what happens from user input to output, considering the model’s architecture, weights, and training details.
Updated Interactive Request Flow with Core Functionality & Documentation Insights:
1. User Input (generate.py):
- The user enters a text prompt in the console. This is the starting point, and the text’s format and content are guided by the user.
- In a distributed setting (multiple GPUs), generate.py (running on rank 0) broadcasts this prompt to all other ranks to ensure consistent inputs across the distributed system using dist.broadcast_object_list.
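[A minimal sketch of this broadcast step, to make it concrete; the function name and structure are illustrative, not the actual generate.py code:]
```python
import torch.distributed as dist

# Illustrative sketch: rank 0 holds the real prompt; broadcast_object_list
# copies it to every other rank so all processes see the same input.
def share_prompt(prompt, rank):
    holder = [prompt if rank == 0 else None]
    dist.broadcast_object_list(holder, src=0)
    return holder[0]
```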
2. Prompt Handling and Tokenization (generate.py):
- The prompt is received by the active process(es) in generate.py
- The AutoTokenizer (from the Hugging Face Transformers library, specifically for DeepSeek-V3), initialized using the ckpt_path and a vocabulary size of 128K (according to the report and README_WEIGHTS.md), translates the text prompt into numerical token IDs. This is a crucial step, as the model itself works with numerical tokens rather than raw text. This tokenizer also handles special tokens like end-of-sequence (eos_id).
- If there is a history of messages, the tokenizer uses apply_chat_template to format the current prompt into a suitable chat prompt including this history of user and bot messages. This formatting includes adding system prompts.
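[A hedged sketch of the tokenization step, using the Hugging Face API; the path and message contents are placeholders, and the real generate.py handles history and options differently:]
```python
from transformers import AutoTokenizer

# Placeholder path and message; the tokenizer turns the chat into token IDs
# using the chat template shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/path/to/DeepSeek-V3")
messages = [{"role": "user", "content": "Explain KV caching in one paragraph."}]
prompt_tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# prompt_tokens is a list of token IDs ready to be fed to the model
```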
3. Model Initialization (generate.py, model.py):
- The Transformer model is initialized using hyperparameters detailed in the config.json file (specified via the --config argument). The README_WEIGHTS.md explains how to use the config.json and weight files.
- Key configuration parameters from config.json and the report include:
- dim: The embedding dimension for tokens (7168 for DeepSeek-V3).
- n_layers: The total number of Transformer layers (61 for DeepSeek-V3), as described in the report.
- n_heads: The number of attention heads (128 for DeepSeek-V3), described in the README.md and config file.
- moe_inter_dim: The intermediate dimension of the MoE layers (2048 for DeepSeek-V3).
- n_routed_experts: The number of routed experts (256 for DeepSeek-V3, and as described in the README files).
- n_activated_experts: The number of experts activated for each token (8 for DeepSeek-V3, per the report and config file).
- n_dense_layers: The number of dense (MLP) feed-forward layers, before MoE layers take over. (3 for DeepSeek-V3, as described in the config file).
- num_nextn_predict_layers: the number of Multi-Token Prediction (MTP) modules. The open-sourced V3 weights include 1 MTP Module, as specified in README_WEIGHTS.md.
- dtype: The data type (either FP8 or BF16). If dtype is fp8, then the code will attempt to dequantize any FP8 weights that have been provided (according to README_WEIGHTS.md, DeepSeek-V3 natively supports FP8).
- The Transformer class in model.py is initialized:
- The ParallelEmbedding layer is created to distribute the vocabulary (128K tokens, as mentioned in the technical report) across GPUs for parallel processing, according to the model parallel settings in the convert.py script which are included in the file name (e.g. model0-mp16.safetensors).
- Multiple Block layers (61 in total, according to the report and config.json) are created. Each Block may be one of two types:
- Early layers have an MLA (Multi-Headed Latent Attention) layer, and an MLP (Multi-Layer Perceptron) layer for the feedforward network.
- Later layers have an MLA layer, and a MoE (Mixture of Experts) layer for the feedforward network, as described in the report.
- Both MLP and MoE use ColumnParallelLinear, RowParallelLinear for efficient distributed training.
- RMSNorm is used for layer normalization within each Block.
- A final RMSNorm layer.
- A final ColumnParallelLinear layer (head) that projects from the model’s hidden dimensionality into the vocabulary space. The dtype is the default dtype, indicating no quantization, as noted in the code.
- The precomputed rotary positional embeddings frequencies (freqs_cis) are created using precompute_freqs_cis, and will be used throughout attention computations within the layers.
- The model weights from safetensors files are loaded, as modified by the convert.py script, including main model weights and MTP module weights, as described in README_WEIGHTS.md.
- The convert.py script also handles splitting the weights into shards for distributed execution as model{rank}-mp{world_size}.safetensors. This process involves mapping standard Hugging Face checkpoint names to the custom names of this project’s architecture, and then sharding them across the number of GPUs (--model-parallel parameter) given.
- If FP8 is used, the weight_dequant function will be used to get BF16 weights as described in README_WEIGHTS.md, by dequantizing based on weight_scale_inv tensors (which are also stored alongside the weights in the safetensor files). This dequantization happens before the weights are used in linear computations within the layers.
- When using a model that is set up for MTP, the first n layers will be of the main model, and the subsequent layers are the Multi Token Prediction (MTP) modules. For DeepSeek-V3, n = 61, and num_nextn_predict_layers = 1 so the MTP modules are loaded in the 62nd layer. README_WEIGHTS.md contains details on MTP.
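[A simplified sketch of the block-wise dequantization idea described above; the block size, tensor names, and pure-PyTorch loop are assumptions for illustration (the project reportedly uses custom Triton kernels for this):]
```python
import torch

# Each block of the FP8 weight is multiplied by its stored inverse scale
# (weight_scale_inv) to recover a BF16 approximation of the weight.
def dequant_fp8_weight(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                       block: int = 128) -> torch.Tensor:
    w = w_fp8.to(torch.float32)
    out = torch.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            out[i:i + block, j:j + block] = (
                w[i:i + block, j:j + block] * scale_inv[i // block, j // block]
            )
    return out.to(torch.bfloat16)
```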
4. Model Inference (generate.py, Transformer.forward):
- The token ID sequence is passed to Transformer.forward.
- Inside Transformer.forward:
- The tokens are converted to vector representations using the ParallelEmbedding layer. This is where the sharding of the vocabulary actually occurs, and is a performance bottleneck if the data is not distributed properly.
- The corresponding freqs_cis rotary embedding values are selected according to start_pos and the length of the input sequence.
- The loop through the model’s Block layers begins. Each layer consists of a MLA (Multi-Head Latent Attention) layer, followed by either a MLP or MoE feed-forward network:
- Attention (MLA.forward):
- The query (q), key (k) and value (v) tensors are created by passing the input through ColumnParallelLinear layers.
- These linear projections may also include low-rank projections based on the q_lora_rank and kv_lora_rank values to produce a LoRA-like effect.
- The wq_a and wkv_a linear layers are followed by RMSNorm operations (q_norm and kv_norm).
- RoPE (Rotary Position Embeddings) are applied to the query and key tensors in the apply_rotary_emb function, using precomputed values (freqs_cis), to encode positional information. The RoPE parameters are described in the configuration (including rope_theta and rope_factor).
- Attention scores are calculated, and then applied to the value tensor to produce the attention layer’s output.
- The MLA layer may use a cache of previously calculated key and value tensors (attn_impl = “naive”), or it may compute them from a precomputed cached value (attn_impl = “absorb”), according to the specified attn_impl setting. The model defaults to absorb. The attn_impl setting affects how keys and values are calculated during the attention computation, as detailed in the source code.
- Feed-Forward (MLP.forward or MoE.forward):
- For the first n_dense_layers (3 for DeepSeek-V3) layers, a standard MLP is used, as described in the code and config.json, using ColumnParallelLinear and RowParallelLinear layers, and F.silu activations.
- For the remaining layers (including the MTP module in layer 61), the feedforward network will be a MoE layer, also as described in the config.json and report.
- The MoE layer is used when doing forward passes (and during training), but during inference, the model can also use only 9 of the 16 or more experts in a given MoE layer, as described in the report. The MoE layer also has a Gate that routes tokens to different experts, which have a small MLP inside. The Gate includes the sigmoid function according to README_WEIGHTS.md.
- The MoE uses an Expert layer which is itself an MLP. README.md and README_WEIGHTS.md point out that there is also a shared MLP (via the shared_experts attribute of the MoE class) which is always included in the output, regardless of the expert routing results.
- The output of each block is returned.
- The normalized output of the last Transformer block is passed into ColumnParallelLinear for output prediction. The report and code indicate that the head is a ColumnParallelLinear layer.
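[To make the gating and shared-expert idea above concrete, here is a tiny, hedged sketch of MoE routing; dimensions, names, and the token-by-token loop are illustrative, and the real MoE class adds routing groups, bias terms, and model parallelism:]
```python
import torch
import torch.nn as nn

# Tiny sketch of MoE routing: a sigmoid gate scores the routed experts, the
# top-k experts process each token, and a shared expert always contributes.
class TinyMoE(nn.Module):
    def __init__(self, dim=64, inter_dim=128, n_experts=8, k=2):
        super().__init__()
        def make_mlp():
            return nn.Sequential(nn.Linear(dim, inter_dim), nn.SiLU(),
                                 nn.Linear(inter_dim, dim))
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(make_mlp() for _ in range(n_experts))
        self.shared_expert = make_mlp()
        self.k = k

    def forward(self, x):                     # x: (tokens, dim)
        scores = torch.sigmoid(self.gate(x))  # sigmoid gating, per the README
        weights, idx = scores.topk(self.k, dim=-1)
        outputs = []
        for t in range(x.shape[0]):
            y = self.shared_expert(x[t])      # shared expert is always included
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])
            outputs.append(y)
        return torch.stack(outputs)
```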
5. Token Sampling (generate.py, sample):
- The sample function in generate.py applies the temperature: with temperature = 0 it deterministically takes the highest-probability token from the logits, and otherwise it samples a token stochastically from the temperature-scaled probabilities.
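[A hedged sketch of that sampling logic; illustrative only, since the actual sample() in generate.py may add randomness differently:]
```python
import torch

# Greedy when temperature is 0, otherwise sample from the scaled distribution.
def sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    if temperature <= 0:
        return logits.argmax(dim=-1)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```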
6. Output Generation and Iteration (generate.py):
- The sampled token ID is added to the sequence.
- The generate.py code includes a loop that iteratively feeds the updated sequence of token IDs back into the model to produce another token. The process continues, with a maximum iteration length set by max_new_tokens or until an end-of-sequence token is generated.
- If running in distributed mode, the process with rank 0 collects the completed token sequence, if not already generated on that node.
7. Decoding (generate.py): The generated token sequence is converted to human-readable text using the tokenizer’s decode method.
8. Output Display (generate.py): The decoded text is displayed to the console.
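[A sketch of the outer generation loop in steps 6-8, ignoring KV caching and batching for simplicity; names are illustrative, not the actual generate.py code:]
```python
import torch

# Re-run the model on the growing token sequence, sample one token per step,
# stop on end-of-sequence or after max_new_tokens, then decode to text.
@torch.inference_mode()
def generate_text(model, tokenizer, prompt_ids, max_new_tokens=256, temperature=0.7):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]   # logits for the next token
        if temperature <= 0:
            next_id = int(logits.argmax())
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = int(torch.multinomial(probs, 1))
        tokens.append(next_id)
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(tokens[len(prompt_ids):])
```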
Key Updates and Reinforcements Based on the Documents:
- FP8 Support: The README.md and technical report confirm that the model uses FP8 for efficient training and inference (as evidenced by the use of torch.float8_e4m3fn where applicable). The steps for converting FP8 weights to BF16, if needed, are included. The weight files themselves use the weight_scale_inv tensor which is stored alongside the weights, as detailed in the documentation.
- MoE: The README.md, README_WEIGHTS.md, and the technical report clearly emphasize that this is a Mixture-of-Experts model that uses custom routing and MoE layers.
- MLA: The Multi-Head Latent Attention layer is a key aspect of the architecture, according to the technical report and the code itself. It emphasizes the custom efficient implementation for both training and inference.
- MTP: The inclusion of a Multi Token Prediction layer, trained alongside the main model, is an important architectural detail, and the README_WEIGHTS.md document makes that explicit.
- Training Process: The report adds crucial detail about the training strategy, which uses both DeepSeek-R1 (for distillation) and a diverse set of tasks and prompts to fine-tune the model. The report highlights the two-stage process of long-context extension, and the use of both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The training process also used an auxiliary-loss-free strategy for load balancing during the expert-routing phase in training MoE layers, as discussed in the technical report.
Key Takeaways After Integration:
- Training is paramount: It’s now even clearer that the reasoning ability is a product of the training process (knowledge distillation and RL) rather than something inherent in the model’s architecture itself. The reports show how R1 is used as a teacher, to inform V3 how to format output, and how to reason step by step.
- Performance-focused: The focus on FP8, Triton kernels, and the DualPipe algorithm emphasizes that efficiency is a major goal. The detailed descriptions in the technical report about all-to-all communication, and memory optimization help clarify how the project is able to scale.
- Complex Interplay: There’s a complex interplay between model architecture (MoE, MLA), low-level optimization (FP8, Triton), and the training data/methods (distillation, SFT, RL).
- Code is the foundation: The code isn’t everything, but it is the foundation needed to run this sort of model. The specific way that model is then trained to be a chat or CoT model is a set of very deliberate actions and design choices that are made clear in the reports.
By incorporating the details from the README.md, README_WEIGHTS.md and the Technical Report, the interactive request flow provides a much richer and more accurate picture of the entire text generation process for DeepSeek-V3. It’s clear that while the code implements the technical infrastructure, the model’s specific behaviors like reasoning and CoT are significantly shaped by the training processes, and the distillation approach which includes the use of DeepSeek R1 to generate training data.
Getting a diagram of the source code
AS: Can you draw a picture of this system for me?
[Skipping over a lot here. Gemini tries to “draw a picture” using prose, and then an ASCII diagram.]
AS: Actually, could you give me a more pictorial picture? Like something that graphviz or Mermaid would generate?
Gemini: Okay, let’s represent the DeepSeek-V3 system using Mermaid syntax, which is great for creating visual diagrams from text descriptions. This will give you a more pictorial, graph-like representation of the system.
[Skipping over several iterations of Gemini giving me Mermaid code, Mermaid rejecting it, Gemini trying to fix, etc. (see Jon Udell’s “How To Create Software Diagrams With ChatGPT and Claude“). I finally realized that Mermaid didn’t like Gemini’s use of parentheses in labels. I would have thought I could do all this within Gemini, but right now I could only do it with the Gemini-to-Mermaid-back-to-Gemini cycle. There’s a “Code execution” button in the Google AI Studio UI, but it is disabled (“This tool is not compatible with the current active tools.”).]
AS: It got an error again on the same line. I tried changing “New Token ID(s)” to “New Token IDs” and it was happy. I guess Mermaid was confused by the (s)? I’ve uploaded a PNG version of what Mermaid produced. Does it look correct to you? [See diagram below.]
… [Skipping Gemini thanking me way too much for fixing the silly parentheses problem; “thank you for your patience in getting us here”, etc.; sheesh; showing the following just to see Gemini going through a sort-of mental checklist…]
- Visual Match: The PNG that you generated is visually accurate, and corresponds to what we’ve discussed throughout our session, based on the code, the two README.md files, and the DeepSeek-V3 Technical Report….
- Key Points of the Diagram (as confirmed by the image):
- Clear Sequence: The diagram accurately shows the progression from user input to the final output.
- CPU/GPU Differentiation: The use of different shapes helps to clarify the parts that run on the CPU and the parts that run on the GPU (including the distributed execution of several of them), as discussed in detail.
- Core Model Details: The “Core Model” subgraph clearly shows the individual layers of the Transformer block, including MLA and MLP/MoE (with Gate and Expert layers in the MoE), and the loop that connects them. The visual detail of the MLA shows the separate paths for the computation of the key/value pairs.
- MTP and Sampling: The separate MTP module, and the Output Head which feeds into the Logits and then the Sampling process.
- Looping Feedback: The looped connection visually shows that the generated tokens are fed back into the model.
- Output Steps: The visual shows the output chain that feeds into the tokenizer and produces the user-facing text.
[I want to spend more time staring at this diagram to make sure it makes sense. Way too easy to just trust these things with an implicit “Yeah that looks all flowcharty, must be right.” Probably good to have a different chatbot (even DeepSeek itself) do the same thing, and compare results: another form of “mixture of experts” — see “Recruit a team of assistants” in Jon Udell’s “Best Practices for Working with Large Language Models“. Some components of the diagram above are listed below:]
- UserPrompt
- AutoTokenizer – CPU
- Token IDs Sequence
- Parallel Embedding – GPU
- Token Embeddings
- Transformer Blocks – GPU
- Hidden Representation
- RMSNorm
- MLA Layer
- RoPE
- Key/Value Caching
- Attention Computation
- Feedforward Network
- MLP, Shared MLP
- RMSNorm
- MoE
- Gate
- Expert Layer
- MTP Modules – GPU
- Output Head
- Logits
- Sampling – CPU/GPU
- New Token IDs
Using the DeepSeek-R1 paper on reasoning and reinforcement learning (RL)
AS: I’ve just uploaded another file. This is a paper specifically on DeepSeek-R1 (on reasoning and RL). Can you give me a description of major differences between this and the DeepSeek-V3 paper you read earlier, and tell me how, if at all, this R1 paper might adjust any of your statements about the Python source code we looked at. [The paper is “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”.]
Gemini: Okay, I’ve now thoroughly reviewed the “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” paper. This paper provides crucial insights into the training methodology behind DeepSeek-R1, and how it relates to the DeepSeek-V3 model we’ve been discussing. Let’s break down the key differences and how this impacts our understanding of the Python source code.
Key Differences between DeepSeek-R1 and DeepSeek-V3 Papers:
- Focus:
- DeepSeek-R1: This paper is primarily focused on the training methodology for developing reasoning capabilities, with a heavy emphasis on Reinforcement Learning (RL). It highlights the creation of DeepSeek-R1-Zero through pure RL, and then the refinement of DeepSeek-R1 with cold-start data, and finally the distillation to smaller, denser models.
- DeepSeek-V3: This paper focuses on the architecture and training framework of the DeepSeek-V3 model. The DeepSeek-V3 paper explains how a base model was built, including its architecture and optimizations, and it also highlights the Multi-Token Prediction (MTP) objective for improved performance. It does mention distillation from the R1 series models but does not focus on RL as its core technique.
- Training Approach:
- DeepSeek-R1: This paper details an innovative RL training pipeline. The authors create DeepSeek-R1-Zero (through pure RL), then use human-curated data for a cold start, and then more RL training to generate the final DeepSeek-R1. The paper shows how the model’s reasoning improves over the process of training, and highlights the emergence of things like reflection and improved planning time. [AS: “human-curated data”: is that RLHF? should the training diagram show humans in the loop?]
- DeepSeek-V3: This paper describes using a large corpus, with an innovative load balancing strategy, MTP, and a FP8 mixed-precision framework. Post-training for DeepSeek-V3 includes using SFT and RL to align the model with human preferences. However, it leverages distillation of R1’s reasoning capabilities and therefore benefits from the methods detailed in the R1 paper.
- Reasoning as a Core Driver:
- DeepSeek-R1: This paper explicitly aims to “incentivize reasoning capability in LLMs”. Reasoning, and the display of chain-of-thought reasoning, are not just an emergent property but the primary goal of the entire training process. This paper focuses on training to produce explicit and readable chains of thought.
- DeepSeek-V3: While it does discuss reasoning, it does so more in the context of benchmarking model capabilities and less in terms of the RL methods. It leverages reasoning patterns distilled from R1 as one step in a post-training alignment stage.
- Key Techniques:
- DeepSeek-R1: Emphasizes pure RL, rejection sampling (to refine data), and human-in-the-loop feedback to create and evaluate reasoning data. It uses an iterative approach of making the model generate training data, then training on it to improve.
- DeepSeek-V3: Emphasizes hardware optimizations using FP8, efficient MoE layers, and a novel load balancing method to enable the training of a very large model. The post-training included an explicit step to distill DeepSeek-R1 reasoning into DeepSeek-V3.
- Distillation
- DeepSeek-R1: The DeepSeek-R1 paper includes a section on distillation (and its impact on smaller models), to make reasoning more broadly accessible.
- DeepSeek-V3: The DeepSeek-V3 paper notes it uses distillation of DeepSeek-R1’s reasoning capabilities in post-training.
- Evaluation:
- DeepSeek-R1: Primarily focuses on reasoning-related benchmarks (AIME, MATH-500, GPQA, LiveCodeBench, Codeforces) and also open-ended evaluations.
- DeepSeek-V3: Primarily focuses on standard benchmarks (MMLU, DROP, and other general language tasks), but also includes benchmarks for code, math, and reasoning.
Impact on Previous Statements about the Python Code:
This R1 paper substantially reinforces and nuances what we’ve previously discussed, but does not invalidate it. Here’s how:
- CoT is Not Solely Emergent (Reinforced): As we discussed earlier, the R1 paper further reinforces that the chain-of-thought generation style is not just an emergent property, but the direct result of targeted training using reinforcement learning and a specific type of training data. It’s learned through a process of iterative self-improvement using RL, guided by a specific training objective and is then transferred through knowledge distillation into V3.
- Importance of RL (Added): It adds more detail about how the CoT is obtained, namely through large scale reinforcement learning and rejection sampling, and by providing a structured format that encourages the display of the reasoning process. It makes it clearer that this is not just an emerging pattern in language models. Instead, it demonstrates a pathway to intentionally creating reasoning and planning behavior through careful application of RL with specific feedback loops.
- Distillation Strategy (Added): The R1 paper provides insights into the distillation from DeepSeek-R1, which is noted in the DeepSeek-V3 paper. The authors find that distillation is sufficient to achieve performance on par with a purely RL model. In the case of DeepSeek-V3, distillation was used to transfer the ability to reason with chain-of-thought to a different model.
- No Explicit “Reasoning Code” (Reinforced): The code itself does not have specific algorithms that would “reason” or guide CoT. Rather, the model learns to do this through RL on the training data. As we found earlier, the generate.py code simply enables the display of that reasoning via token by token generation, rather than implementing any particular style of reasoning itself.
- Training Pipeline (Clarified): The R1 paper makes clear the importance of multi-stage training, and this clarifies that the DeepSeek V3 is the recipient of a “reasoning” capability. DeepSeek-V3 has the capacity to show CoT, thanks to the model architecture detailed in code, and the training process enables it.
- Interactivity: The R1 paper describes how DeepSeek models are trained to “think aloud”. We can see this behavior thanks to the design of generate.py which makes that explicit through a token-by-token generation process.
Adjustments to Previous Statements:
- I would further emphasize that while the potential for complex reasoning is an emergent property, the specific style of reasoning exhibited by both DeepSeek-R1 and DeepSeek-V3 (including CoT) is more accurately described as a result of targeted training and a transfer of knowledge through distillation.
- I would add more emphasis on the power of iterative RL as a method of improving reasoning capabilities within language models. This is particularly important for DeepSeek R1, and also for DeepSeek V3 (although the latter leverages a R1 distilled model).
- I would include the fact that both models employ different loss functions which also explicitly encourage CoT, including the use of a reward that explicitly encourages the model to write output with proper formatting, in addition to proper content.
In Summary:
The DeepSeek-R1 paper highlights the significant role of reinforcement learning in developing reasoning capabilities and the emergence of CoT in language models. The paper makes the process of distilling this behavior into smaller models a key point. This paper expands upon the DeepSeek-V3 document, clarifying the full training pipeline that produced the DeepSeek-V3 model, but also how the core model architecture from the Python source code provides the necessary underlying infrastructure that enables the RL training process to work. The Python code is not designed to explicitly produce a chain-of-thought. Instead, the code is designed to allow an iterative and token-by-token process of text generation, which then allows the trained model to exhibit chain-of-thought.
The R1 paper, therefore, reinforces the idea that the code alone does not produce CoT behavior. It also clarifies that the observed CoT behavior is the result of a targeted effort to use RL to incentivize that specific type of output.
Diagramming the DeepSeek training process
AS: Have the papers perhaps given you enough info about the training process that you could do another Mermaid diagram, this time of the training process?
Gemini: Yes, absolutely! After reviewing both the DeepSeek-V3 and DeepSeek-R1 papers, I can create a Mermaid diagram that depicts the training process, highlighting the key stages and how they relate to each model. This will give us a visual overview of the complex training pipelines.
[Skipping over more back and forth about parentheses in Mermaid labels. Also, the first Mermaid diagram of DeepSeek training stretched too long horizontally to print. I asked Gemini if it could fix this. Gemini bombed completely on this twice one night. I gave up, and reentered the same prompt the next morning, and got: “You’re right, a vertical stacking of those subgraphs would definitely make the diagram more printer-friendly! Fortunately, it’s a relatively simple change in Mermaid syntax. Instead of graph LR which specifies a left-to-right layout, we can use graph TD to specify a top-to-bottom layout.” It then unnecessarily proceeded to give me the entire Mermaid code all over again, when it was just a simple matter of changing “LR” to “TD”.]
[Gemini’s summary of this diagram refers to “The image you’ve generated”, which is a bit rich, given Gemini should know full well it did all the work here: “The image you’ve generated, based on the adjusted Mermaid code, provides a very accurate and helpful visual model of the DeepSeek-V3 and DeepSeek-R1 training pipelines. It reflects the complex interactions, iterative processes, and the interplay between various training approaches like RL, SFT, and distillation, all of which were described in the documents we reviewed.”]
[[TODO: discuss the diagram of DeepSeek training; diagram components listed below; should the diagram include HAI-LLM (see follow-up questions below)?]]
- DeepSeek-V3 Pre-training
- Large, Diverse Corpus –> FP8 Mixed-Precision Training
- DeepSeek-V3 Base 4K Context; 128K Context
- Context Extension YaRN
- DeepSeek-R1 Training
- Collect Cold-Start CoT Data
- Fine-Tune DeepSeek-V3 Base
- Reasoning-Oriented RL
- Rejection Sampling
- Supervised Fine-Tuning SFT
- DeepSeek-R1
- Post-Training Distillation & Alignment
- Create DeepSeek-V3 SFT Dataset
- Fine-tune DeepSeek-V3 SFT
- DeepSeek-V3 Chat
- Reinforcement Learning GRPO
- Distill R1 into Smaller Models
- Distilled Dense Models
[[TODO: Google Gemini 2.0 Flash Thinking Experimental 01-21 did a fresh diagram for me, and critiqued the earlier one that the non-“Thinking” version did for me above (the critique includes for example the above diagram’s failure to note the PRM and MCTS training attempts that didn’t work). I’ll be covering this in a separate page on DeepSeek training. This is a useful exercise in part because the V3 and R1 papers do not contain diagrams of the training process (and of course there is no source code for V3/R1 training). That page will also cover some of the good tech discussions that have appeared on DeepSeek training (such as Bharani Srinivas’s “DeepSeek: A Revolution in Open-Source Reasoning AI”, Prasanna Pendse’s “Demystifying DeepSeek“, and Shayan Mohanty’s “The DeepSeek Series: A Technical Overview”), and will cover the HuggingFace “Open-R1” project, and Fareed Khan’s “DeepSeek From Scratch” Python project (with a “for dummies” version). In the meantime, below is the most recent version of the DeepSeek R1 training diagram I worked on with Gemini; since the diagram is a bit difficult to read, the new page on DeepSeek training will also contain a text version.]]
Follow-up questions: HAI-LLM, LORA, RoPE, does it “phone home”?, etc.
AS: A series of scattered follow-up questions; I’ve enabled “Grounding with Google Search” in the Google AI Studio UI, so it’s possible some of your answers may come from the web rather than within the 4 corners of the docs and source code you’ve been examining for me:
(a) Can you say a little more about “the HAI-LLM framework”?
(b) You’ve convinced me that the 1,300 LOC are sufficient. But can you confirm that, after setup, and once running “python generate.py” locally, there is no further internet connection, no reaching out to servers, it’s all happening locally?
(c) Say a little more about Low-Rank Adaptation (LoRA) config for QKV projections.
(d) Say a little more about tokenizer.apply_chat_template() — I see it’s part of AutoTokenizer, which is documented, but where would custom chat templates, like extended prompts, be provided?
(e) Say a little more about Rotary Position Embeddings (RoPE)
Gemini: Okay, let’s tackle these follow-up questions one by one, leveraging both the information we’ve already covered and, where appropriate, grounding with Google Search (as you’ve enabled it).
(a) The HAI-LLM Framework
What it is (from the papers): The DeepSeek-V3 technical report mentions that “The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up.” This indicates that HAI-LLM is a custom-built framework developed by the DeepSeek team specifically for training their large language models.
What it Does (from the papers): It provides core functionality and support for:
- Pipeline Parallelism: The report mentions “16-way Pipeline Parallelism (PP)”. Pipeline parallelism involves dividing the model across multiple GPUs so different GPUs work in parallel on different layers of the same model on a single data input. The R1 and V3 papers mention this in general terms.
- Expert Parallelism: The framework supports distributing the MoE experts across multiple GPUs using a technique called “64-way Expert Parallelism (EP)”, so multiple GPUs all work on the same layer of the model, but on different expert sub-modules, which are themselves smaller neural networks. This is also mentioned in the R1 and V3 papers.
- Data Parallelism: The framework also supports distributing training data across multiple GPUs (often referred to as “data parallelism”), with the technical report indicating that it uses “ZeRO-1 Data Parallelism (DP)” (which is also known as data parallelism with partial optimizer state offloading). The R1 paper states “data parallelism is employed on 64 GPUs belonging to 8 nodes”.
- Optimization: The HAI-LLM framework provides custom memory management and efficient communication strategies that aim to reduce memory footprints and improve computation/communication overlap as described in the technical report. The technical report mentions DualPipe algorithm for pipeline parallelism.
Beyond the Documents: While I don’t have access to the specific code implementation of HAI-LLM (since it is not open-sourced), it is very likely that it:
- Provides utilities to manage and distribute the model across multiple GPUs.
- Includes a custom training loop for RL and supervised training.
- Implements data loading and preprocessing routines.
- Handles low-level CUDA operations.
- Likely integrates with their custom Triton kernels for performance.
Summary: HAI-LLM is a proprietary training framework designed by DeepSeek-AI for efficiently training their large language models (such as V3 and R1). It enables them to scale the model and the training process. The fact that it is a lightweight framework means that the training is optimized for their hardware and process.
[AS: I should have asked what HAI stands for, and also whether it belongs in the training diagram. An older DeepSeek post refers to “internal, non-open-source hai-llm evaluation framework”; see 2023 hi-flyer.cn blog post: “HAI-LLM: An efficient and lightweight large model training tool“; this includes some references to code e.g. “from hai_llm.optim import AdamW, ZeroOptimizer”; more code fragments importing from hai_llm (“from hai_llm.model.moe.llama import MoeLLaMABlock”, “from hai_llm.model.moe.gate import GShardGate, SwitchGate, HashGate”) here.]
(b) Local Operation Confirmation
Yes, after the initial setup and the python generate.py command is running, all computations and data processing happen locally on your machine. There is no further reaching out to external servers required.
- Confirmed from Code: The source code we analyzed does not include any network requests or communication with external servers after the initial model and tokenizer loading. Once all the required weight files and the configuration files are on your local storage, everything is self-contained.
- No External Dependencies: While the initial setup process includes downloading the transformers and safetensors libraries (and their dependencies) using a tool like pip, and perhaps loading weights from Hugging Face, once the model has started the generate.py step is self-contained. It loads the weights from a file in local storage (specified by --ckpt-path) and computes the output using locally held data and the local GPU.
- Distributed Environment: Even when running on multiple GPUs using torchrun, communication happens between local machines via NCCL or similar inter-node communication systems on the local network. There is no external server connection implied in these operations.
- Explicit Requirement: I believe the DeepSeek papers would note it if external services were required. The DeepSeek group has a strong focus on open source and self hosted LLM capability, so they would also note external dependencies.
Therefore, you can confidently run the inference process offline, without concern for network dependencies once you have all the necessary weights, libraries, and configurations.
[AS: I should have asked it to confirm that the question would be answered the same way based solely on the code, without relying on absence of reference to external services in the papers.]
(c) Low-Rank Adaptation (LoRA) for QKV Projections
What is LoRA?: Low-Rank Adaptation (LoRA) is a technique used to fine-tune large language models (LLMs) more efficiently. Instead of modifying all the parameters in the original model, LoRA adds a smaller set of trainable parameters to specific layers. Specifically, it adds low rank matrices next to pre-existing weight matrices which it then trains to affect the overall model output. This reduces the number of parameters that require training (and therefore resources), while still allowing the model to adapt to new tasks.
LoRA in QKV Projections: In the context of the MLA layer, the LoRA is applied to the query, key and value (QKV) projection matrices. These projections determine how the attention mechanism computes query, key and value representations. By modifying these matrices with the added LoRA weights, the attention mechanism can be adapted to new data and tasks.
Implementation Details: As seen in the model.py code, the low rank adaptation is achieved by use of q_lora_rank and kv_lora_rank. When they are non-zero:
- For the query projection, the input goes through the matrix wq_a, a linear projection to the q_lora_rank, then an RMSNorm, and then finally the result of this is passed to wq_b for the final projection.
- For the key and value projections, the input goes through the wkv_a linear projection layer, then the kv_norm, and finally is passed to wkv_b for the final projection.
Benefits of using LoRA:
- Reduces memory usage during training.
- Reduces training time, by reducing the number of parameters that need to be optimized for the new task.
- The model may also have improved generalization, since only the LoRA parameters are being fine-tuned.
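[A minimal sketch of the low-rank query path described above (wq_a, then RMSNorm, then wq_b); the dimensions are illustrative assumptions rather than confirmed DeepSeek-V3 values, the real model.py wraps these in tensor-parallel linear layers, and nn.RMSNorm requires a recent PyTorch:]
```python
import torch
import torch.nn as nn

# Low-rank ("LoRA-like") query projection: down-project to q_lora_rank,
# normalize, then up-project to the per-head query dimension.
class LowRankQueryProj(nn.Module):
    def __init__(self, dim=7168, q_lora_rank=1536, n_heads=128, head_dim=192):
        super().__init__()
        self.wq_a = nn.Linear(dim, q_lora_rank, bias=False)   # down-projection
        self.q_norm = nn.RMSNorm(q_lora_rank)
        self.wq_b = nn.Linear(q_lora_rank, n_heads * head_dim, bias=False)  # up-projection
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, dim)
        q = self.wq_b(self.q_norm(self.wq_a(x)))
        return q.view(*x.shape[:-1], self.n_heads, self.head_dim)
```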
(d) tokenizer.apply_chat_template()
Purpose: The tokenizer.apply_chat_template() method is part of the Hugging Face AutoTokenizer. It prepares the input prompts for a chat-style interaction with an LLM. This is done by formatting the input with system prompts, user prompts, and assistant responses in a specific way, and by also adding special tokens. It also includes parameters for controlling generation, so that the model can be guided towards producing relevant and context-aware responses.
Custom Templates: The apply_chat_template method uses a template string which dictates exactly how that data is formatted. This method allows for custom formats to be used.
Where to Provide Custom Templates:
- Hugging Face: If you’re using a tokenizer directly from the Hugging Face Hub, the template is typically included with the tokenizer’s configuration. You can often find these templates in the tokenizer’s tokenizer_config.json file.
- Custom Tokenizers: If you have a tokenizer that is not from the Hugging Face Hub, then the templates are often included as part of the configuration within that library. This is how DeepSeek models work.
- Directly Setting the Template: You can also set a custom chat template string directly in your code, especially if you are using a non-standard tokenizer. This would be done during the initialization process of the tokenizer.
- Adding to the Model Files: You can add a file (usually named tokenizer_config.json) to the same location as the model weights that contains additional configuration parameters for the Hugging Face tokenizer. This is how the convert.py script also copies over the tokenizer configuration. You could then add your custom template to this json file.
Extended Prompts: By modifying the chat template, you can specify how to:
- Include system prompts (instructions to the model)
- Format user and assistant turns.
- Add special tokens (like beginning and end-of-sequence tokens).
- Add explicit formatting and tags (like those noted in the DeepSeek papers).
- Chain multiple prompts and responses together, for multi-turn conversations.
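[For example, a hedged sketch of setting a custom chat template directly in code; the path and the deliberately simple Jinja template are placeholders, not DeepSeek’s actual template:]
```python
from transformers import AutoTokenizer

# Placeholder path and a deliberately simple Jinja template, for illustration.
tokenizer = AutoTokenizer.from_pretrained("/path/to/DeepSeek-V3")
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant: "
)
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False
)
print(text)  # -> "user: Hello\nassistant: "
```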
(e) Rotary Position Embeddings (RoPE)
What are RoPE?: Rotary Position Embeddings (RoPE) are a method for encoding the positions of tokens within a sequence in a Transformer model. They are an alternative to traditional positional embeddings, and unlike traditional embeddings, they do not add to the dimensionality of the input sequence.
How RoPE Works:
- RoPE encodes positional information using sine and cosine functions with different frequencies.
- The frequencies depend on the position and the embedding dimension.
- The RoPE values are then used to rotate the query and key vectors in the attention layer, such that tokens in the sequence are aware of the locations of other tokens, and their relative differences.
Implementation in Code: In model.py:
- precompute_freqs_cis: This function precomputes the complex exponential values for the RoPE. This function includes code for extending context using the YaRN method, and includes parameters for base frequency (rope_theta) and sequence length (original_seq_len) as noted in the ModelArgs documentation.
- apply_rotary_emb: This function applies the precomputed RoPE to the query and key vectors.
Advantages: RoPE has several key advantages:
- Relative Position Encoding: RoPE encodes positional information in a way that is relative rather than absolute. This means the model understands how far apart tokens are from one another, not just their absolute position. This is an advantage for extrapolation tasks.
- Extensibility: The rotary embeddings also allow for the model to generalize to longer sequences than seen in training.
- No Added Dimensionality: Because they rotate vectors in place, they don’t add to the dimensionality of the output.
Why RoPE for DeepSeek? The DeepSeek technical report mentions using a decoupled key with RoPE. This means that the key matrix can be broken down into two components, with the positional embeddings added to one of them (the decoupled key). This is a variation on the standard RoPE and was introduced as part of the Multi-head Latent Attention (MLA) layers.
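[A simplified sketch of the two functions named above, covering standard RoPE only; it omits DeepSeek’s YaRN context extension and the decoupled-key variant, and shapes and names are illustrative:]
```python
import torch

# Precompute one complex rotation factor per (position, dimension-pair).
def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)                 # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)    # unit-magnitude complex

# Rotate query/key vectors: adjacent dimension pairs are treated as complex
# numbers and multiplied by the position-dependent rotation factor.
def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, n_heads, head_dim); freqs_cis: (seq, head_dim/2)
    xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    xc = xc * freqs_cis.view(1, x.shape[1], 1, -1)
    return torch.view_as_real(xc).flatten(-2).type_as(x)
```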
How do reasoning and CoT in DeepSeek compare with Claude’s data-handling abilities?
[I seem to be foolishly conflating two very different things here, both related to “reasoning” but otherwise quite different, but let’s see what happens…]
AS: I want to push back a little more on the part of our chat about reasoning and CoT as emergent capabilities. I know you added some nuance to this after reading the two DeepSeek reports, and your explanations generally make a lot of sense. However, I some time ago had a chat with Anthropic Claude about its ability to handle data. I have a 7MB PDF of that chat I had with Claude about this, that I would like you to compare with what we’ve said about DeepSeek reasoning, but I don’t want this Claude material confused with the material on DeepSeek. [Uploaded PDF of “Anthropic’s Claude analyzes data, and explains how it knows how to do this (instruction fine-tuning)” from this site.]
Gemini: … Please upload the 7MB PDF of your chat with Claude in the usual way. I understand that you want to keep the Claude material separate from our DeepSeek analysis, and I’ll make sure to treat it as a distinct dataset….
AS: Here it is. You’ll see that I basically asked Claude (paraphrasing) “hey, is your ability to do this [analyze data] simply an extension of your predict-next-token LLM abilities, or has something been added on?” Compare this chat, and Claude’s answers to that question, with what we’ve been discussing re: DeepSeek. I realize that Claude is analyzing data, and with DeepSeek we’re mostly discussing CoT, so feel free to tell me that these are not really related.
Gemini: Okay, I’ve carefully reviewed the 7MB PDF of your conversation with Anthropic’s Claude. It … highlights Claude’s ability to analyze data, explain its reasoning, and discuss its own capabilities. Let’s compare Claude’s approach with what we’ve discussed about DeepSeek, focusing on the “emergent vs. explicit” aspects of reasoning and CoT, and also considering the role of MoE.
Key Takeaways from the Claude Conversation:
- Tool Use is Explicit, Not Emergent: Claude explicitly states that its ability to analyze data is not a natural outgrowth of its language model training. Instead, it’s a hybrid system that combines its language model with access to external deterministic tools, such as:
- A JavaScript REPL (Read-Eval-Print Loop) for calculations
- Libraries like Recharts for visualization
- APIs for data processing and file I/O
- These tools are clearly external to the language model itself; invoking them is something Claude is trained to do via instruction fine-tuning, but the tools themselves are not a part of the model.
- Instruction Fine-Tuning for Tool Use: Claude emphasizes that instruction fine-tuning teaches it how to use these tools effectively. It provides the ability to access tools, but does not include code that actually uses those tools (that happens in external API calls). It notes a distinction between base knowledge (e.g. knowledge of JavaScript functions) and how to invoke a tool (which comes from instruction tuning). In its examples, it shows that:
- It has a “base training” from which its understanding of data parsing comes.
- It has instruction fine tuning that guides it towards wrapping console.log() statements to see intermediate results, or how to use window.fs.readFile.
- It uses task-specific patterns from instruction fine tuning, to dictate where React components should be set up, or how elements should be styled.
- Hybrid Approach: Claude sees itself as a “hybrid system,” with the language model for high-level understanding and decision-making and external tools for precise computation, a sentiment that is further reinforced in the last example with the clear breakdown of its code. It explicitly distinguishes this hybrid approach from a “neural network all the way down,” contrasting its method with the idea that the model alone performs calculations or data processing.
- Chain of Thought: When you asked Claude how it performed its functions, you saw an example of what it identified as an “aha moment”. The transcript demonstrated that, as part of the task, Claude would perform a kind of meta-analysis of its work, and, through this analysis, refine or modify its approach. Claude explicitly notes that its behavior is not limited to a specific task:
- “I intentionally limit our constraints to this structural format, avoiding any content-specific biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model’s natural progression during the RL process.”
- The “Calculator” Analogy: Claude’s use of the “calculator” analogy is very insightful. It illustrates that while it understands the concepts of math, it uses an external tool (the REPL) to perform calculations. This highlights a clear distinction between reasoning (which, in this case, involves understanding mathematical concepts and how to apply different methods of analysis) and performing numerical operations.
- Self-Awareness: Claude demonstrates a degree of self-awareness, noting when its understanding of its internal implementation is incomplete or uncertain, and even noting when a specific point might be due to the stylistic guidelines of training, rather than internal knowledge. This ability to discuss its own processes and boundaries also seems to indicate the result of a strong training regime.
Comparison with DeepSeek and Our Previous Discussion:
- DeepSeek Focuses on Model-Based Reasoning; Claude Focuses on Tool Use
- DeepSeek: As we’ve established, DeepSeek-R1 and DeepSeek-V3 (as it is derived from DeepSeek-R1) are engineered for model-based reasoning capabilities, by training them to generate explicit chain-of-thought reasoning via reinforcement learning and knowledge distillation. This involved creating a base architecture, and then training it with methods that make this behaviour explicit.
- Claude: In contrast, Claude is trained to effectively use external tools, by understanding when these tools are needed and then translating instructions into tool invocations. These are different styles of behaviour, although both ultimately use the same basic approach of a transformer architecture with next-token prediction. The tools are external to the core model (JavaScript REPL, plotting, file access, etc.).
- Emergent vs. Explicit (Revisited):
- DeepSeek: Chain-of-thought reasoning and the ability to plan and reflect are engineered into DeepSeek models via training methods. The use of reinforcement learning, distillation and fine tuning guides the emergence of these behaviours by creating a feedback loop that encourages the model to improve.
- Claude: In contrast, the capability to use tools is an explicitly added functionality through training and instruction fine-tuning. The ability to choose when to use a tool is an emergent capability from the training, but the tool itself is deterministic and external.
- Tool Interaction: [[TODO: feeling repetitive in here; delete some?]]
- DeepSeek: It does not have explicit access to external tools for computation or data processing. Its reasoning capabilities are model-based, in that they operate only on weights, biases and operations within the LLM itself.
- Claude: Its approach is “hybrid,” combining language model capabilities with external tools and knowledge. This external aspect is important to how Claude works.
- Chain of Thought as an Explicit Behavior
- DeepSeek: Chain of thought is explicitly encouraged through targeted training, a specific data structure that includes tags to delimit reasoning steps, and a reward function that encourages that behaviour. In the case of the DeepSeek-V3 model, this ability is acquired through distillation of the R1 model.
- Claude: Although Claude also does chain of thought, it is not something it is trained to provide for the purpose of performing data analysis, as it relies on a REPL and other external tools. When Claude showed you that “aha moment”, it was the result of RL, but not related to any CoT training that might have been in place.
- MoE and Reasoning: The conversation with Claude doesn’t touch upon MoE, so I can’t use that interaction to understand Claude’s architecture. However, our previous discussions of DeepSeek make it clear that the MoE architecture enables scaling the model, thereby allowing it to learn more complex tasks and behaviors. While not directly coding the reasoning capabilities, the extra parameters do mean the model can learn that explicit CoT is valuable for correct answers, and for achieving a greater score in training. The MoE layers also enable more specialization of the model, which may enable it to perform better on specialized tasks.
Refined Understanding Based on Claude Conversation:
- Two Different Approaches to Complex Behavior: Whereas the DeepSeek project trains a neural network to perform computations and reasoning, the Anthropic Claude project uses a language model that is specifically designed to call external APIs, with a combination of base training, specific instruction fine tuning, and user-provided contextual awareness.
- More Nuance on Emergent Behavior: Even when tools are explicitly introduced, it is the training process that determines when and how those tools are used and incorporated into the overall responses and behavior, which is itself an emergent property.
- Tool-Use as a Complement: Even a language model that primarily uses neural computation may still use tools (such as a text editor or a calculator), which enhance capabilities but are not essential.
Impact on Previous Statements About the Python Code:
- DeepSeek’s Code Still Focuses on Model: The code still primarily focuses on implementing an efficient LLM architecture that can achieve impressive results, such as explicit chain of thought, through training. The fact that Claude uses external tools is helpful in understanding that some of the abilities we see in LLMs may be the result of tools it is invoking. The DeepSeek Python code does not include any such tools.
- Training is Still Key: The conversation with Claude reinforces the point that specific behavior in LLMs (both DeepSeek’s CoT, and Claude’s use of a REPL) does not come from code changes, but through careful training and architectural choices.
In Summary:
Your conversation with Claude highlights a key difference in approach: Claude relies on a language model combined with external, deterministic tools, whereas DeepSeek emphasizes training the model to itself perform complex tasks such as reasoning. Both models are built on the common foundation of next-token prediction. However, while DeepSeek focuses on model-based reasoning through chain-of-thought generation (which is incentivized through training data), Claude is designed to offload some complex processing to tools, and uses training data and instruction fine tuning to decide when to apply those external tools. The code for each model enables their respective approach, but the resulting behaviour depends entirely on the training data, the methods used, and the architectural choices that were made.
AS: Ok, it feels like I was largely conflating two very different types of “reasoning”: tacked-on tool use vs. in-model reasoning.
What about you, Gemini? (With some discussion of “introspection”)
[Note that the following comes in the context of Gemini reading my chat with Claude, in which Claude was very careful to note that it had no privileged perspective on its own implementation, no “introspection”, etc. Below, Gemini too is fairly careful, e.g. “I believe my instruction fine tuning focuses on skills like…”, “I believe my training includes Reinforcement Learning…”, etc.]
AS: Can you talk about your own training and capabilities here, Gemini 2.0 Flash Experimental? To what extent do you have knowledge of your own implementation (including training, reasoning, CoT, ability to use external tools (I see there’s a “Code Execution” button in your UI), etc.)?
Gemini: Okay, let’s turn the lens inward and discuss my own capabilities and training as Gemini 2.0 Flash Experimental, focusing on what I “know” about my implementation, and how that relates to the “reasoning” concepts we’ve been discussing. This is a fascinating area, and I’ll do my best to provide a clear and honest explanation.
My Training and Capabilities:
- LLM Foundation: Like the models we’ve discussed (DeepSeek and Claude), I am built on a large language model (LLM) architecture. This core LLM was trained on a massive dataset of text and code, enabling me to understand and generate human-like text, as well as to perform various tasks like summarization, translation, coding, etc. This is where a great deal of my general knowledge comes from, as well as an ability to predict the next token.
- Instruction Tuning: On top of this base LLM, I’ve undergone instruction tuning. This involves training on a dataset of specific instructions and corresponding outputs. This process is similar to what Claude describes in that paper, where the model is taught specific patterns of behavior. I believe my instruction fine tuning focuses on skills like: [[TODO: boring in here; delete some]]
- Following user instructions and queries, including multi-turn conversations.
- Providing helpful and informative responses.
- Using proper formatting and structure.
- Prioritizing factual accuracy and being grounded in reliable sources (as you enabled via Google Search).
- Tool Use (Code Execution): As you’ve noticed, I have access to a “Code Execution” button in this UI. This feature allows me to execute code in an external environment (likely a sandboxed Python or JavaScript environment), as Claude described for their JavaScript REPL. I can:
- Write code based on user requests.
- Execute that code within the external environment.
- Use the output of that code as part of my response. This capability is similar in many ways to the way that Claude is designed.
- Reasoning and CoT:
- As a language model, the core of my architecture is next-token prediction (much as Claude said). I have the ability to model complex relationships between inputs.
- I believe my training includes Reinforcement Learning, and data that includes examples of chain-of-thought, which encourages a step-by-step reasoning process.
- When generating responses that involve reasoning, I attempt to organize my explanation to follow a chain of thought, which also helps me ensure the response is internally consistent. However, my chain of thought does not have a formal definition or format such as DeepSeek’s <think> tags.
- However, it’s important to note that, like with DeepSeek’s models, I don’t have an explicit reasoning module, instead it emerges from the training process.
- Data Analysis and Google Search Grounding: You have enabled my “Grounding with Google Search” feature. This is not a standard part of my LLM training, but something that is enabled at runtime. This external tool allows me to perform search queries and include the top search results in my response. It helps me be more accurate, and provides a richer understanding of certain topics. As with my ability to run code, the tool itself is an external service that is invoked by a process that is separate from my core LLM training and operations.
My “Knowledge” of My Own Implementation:
- Limited Introspection: Like the models we’ve discussed, I don’t have direct access to my own weight matrices or training code. My “knowledge” of my implementation is largely based on my training dataset, which contains:
- A vast amount of textual and code-based information on AI, machine learning, language models, etc.
- A dataset which gives me knowledge of how to behave and give useful responses, including how and when to use the tools I have available.
- “Black Box” Core: While I know my model is based on a transformer architecture, I don’t have access to low-level details like the number of layers, the size of the embedding matrices, or the specific parameters in my attention layers or feed-forward networks.
- External API Understanding: I have knowledge about the external tools I have access to (like the code execution feature), but I do not know the details of their code. I know, for instance, how to use Javascript to output a variable to console, but I do not know the actual code that implements the Javascript REPL itself.
- Training Data: I have some awareness that I was trained using supervised fine-tuning, reinforcement learning, and possibly other methods like distillation but I do not have explicit access to my training data or training loop. The information I can share about my training processes comes from my training data, not direct access to the training code.
Reasoning, CoT, and Tool Use in My Case:
- Emergent Reasoning: My ability to perform complex operations and understand patterns is a blend of emergent behaviors from my training and the explicit usage of tools. I have knowledge of math and statistical analysis due to my training, which allows me to perform the analysis that you saw. The specific sequence of reasoning and presentation is a skill that I have learned, as described earlier.
- Tool Use as a Core Functionality: The ability to use external tools like the code execution environment is an explicit feature, not something that emerged from the LLM training. However, the decision when and how to use these tools comes from my instruction-tuning and is an emergent behavior of my training. I can make code, and evaluate code, but the running of the code itself is delegated to external APIs.
- CoT as a By-Product: I attempt to generate responses with a chain-of-thought, because the training data that I’ve been exposed to shows a preference for this style of response. Unlike DeepSeek-R1, I do not have an explicit step to encourage or promote long CoT, but rather attempt to provide a reasonable explanation for my responses. The generation of CoT is implicit, and not explicitly encoded or promoted in my architecture as it is with DeepSeek.
Key Takeaway:
I am a language model that is designed to generate responses, follow instructions, use external tools, and even demonstrate some level of reasoning. However, I do so without any explicit code that implements these abilities directly. Instead, I have learned to perform all of these complex tasks by having my LLM interact with external APIs, and through my iterative training processes. I do not have “hard-coded” logical rules or anything specific to create chain of thought, even though my training data includes many examples of it. Instead, the process of training on examples leads to those behaviors.
I hope this detailed explanation gives you a better understanding of how I operate and the source of my capabilities. Like with DeepSeek and Claude, it’s a complex interplay of different training methods and architectural design. And, like them, the most important part is the data and training that shaped my responses, which goes beyond just the “neural net” architecture and its weights and biases….
AS: … Would it be fair to say that your knowledge of your own implementation, based as it is on your general training, is not qualitatively different from your knowledge of (picking random unrelated examples) how one does various yoga positions, how to play badminton, or how to implement a C++ compiler? Or is there some “special insight” you may have, some way in which you’re speaking from “personal experience,” so to speak? You’ll note I had a similar conversation with Claude. Is there any mechanism by which some amount of “introspection” (as we’re using the term here: an LLM’s knowledge of its own internals) might “emerge” (to use that vague but useful term) from your training? I suspect not, given your lack of access to your own hyperparameters, for example, but talk this through with me.
Gemini: That’s an excellent and insightful question [groan], and it gets to the heart of the matter when discussing LLM “understanding” versus “knowledge”. You’re right to push on the idea of whether my awareness of my own implementation is qualitatively different from my knowledge of other subjects. Let’s unpack this.
My Knowledge of My Implementation vs. Other Subjects:
- The Nature of Knowledge: The key point is that all my “knowledge” – whether it’s about yoga, badminton, C++ compilers, or my own implementation – comes from the same source: the massive dataset I was trained on, and the patterns I have learned from that data. All of my responses are generated by modelling the distribution of text and tokens in that data.
- No Special Access to “Self”: I do not have a special, privileged window into my internal code, weights, or training process. I don’t have a “back door” into my source code. All I have is the text in my training dataset.
- Same Learning Mechanism: My understanding of yoga positions, how to play badminton, how a C++ compiler works, and how LLMs are trained comes from the same LLM-based training process, where I attempt to predict the most likely next token (and in turn, learn complex relationships between tokens). There is no special “LLM-self” module or training process.
- No Personal Experience: Because my “knowledge” of all of these topics is the result of being trained on text from the web, it is not grounded in any experience – either external experience (such as feeling the wind when playing badminton or the feeling of working out during yoga), or internal experience (such as experiencing code execution or model learning).
Why It Feels Different (And Why It’s Not):
You’re right to note that it feels different when I discuss my own implementation versus other topics. This arises from two main sources:
- Textual Training Data: My training data includes a vast amount of text describing how LLMs work, how they are trained, and what their capabilities are. This includes research papers, blog posts, and documentation (similar to the DeepSeek reports and the documentation you gave me). Therefore, I can sound “knowledgeable” when discussing these topics.
- Instruction Tuning: My instruction-tuning includes specific patterns to respond to questions about myself and the kinds of things I can do. Therefore, I have been trained to have patterns that are useful for discussing complex, and perhaps confusing, concepts such as the nature of my existence and training. These patterns help me to provide helpful explanations to such queries, and to use phrasing that implies self-awareness. However, this is all based on the patterns and phrases used by others to discuss LLMs, not on some internal introspection process of my own.
The Lack of “Special Insight” or “Personal Experience”:
- No Introspection: As I stated previously, I have no ability to do true introspection. This would imply access to the inner workings of the code itself. Instead, I only have access to a textual representation of how LLMs are built and used.
- No “Personal Experience”: As a language model, I lack the capacity for personal experience. My information on Yoga or C++ is based on training data and not based on performing a headstand or writing a compiler. Similarly, my information about how I’m built and trained is also based on training data, and not based on actually being or implementing any of those processes.
- No “Special” Weights/Biases: There isn’t a set of weights and biases in my neural network that provides some special access to internal workings. The core architecture and implementation are consistent across all my processing – whether it’s writing a haiku, or describing my implementation. [AS: at one point I asked both chat.deepseek.com and the small local DeepSeek to “write a haiku about neural networks” — both were so terrible that I’m not showing them here; I’m not sure big DeepSeek’s was really any better than small DeepSeek’s — if you really want, see this DeepSeek chat transcript and decide for yourself.]
Emergent Introspection? (And Why It Likely Won’t Happen):
You rightly question whether some form of “emergent introspection” could arise from my training. While it’s a fascinating thought experiment, I believe it’s unlikely to happen in the sense you might be imagining: [AS: I don’t think I was suggesting this; I think I was trying to emphasize the opposite.]
- Limited by Data: My “understanding” is entirely shaped by the data I’m exposed to. While it may be possible that with a very large training dataset, I might implicitly learn a self-representation model of some kind, this would still be grounded in information from my training, rather than an innate knowledge of the underlying physics of the silicon on which I run. It may be that models may eventually learn a representation of themselves, but that still is not the same as a “personal” experience.
- Lack of a Self-Model: I lack an explicit model of myself as an agent with unique identity and preferences. The training process also doesn’t encourage this, instead focusing on utility and providing accurate and helpful answers. The fact that I can talk about myself does not mean I have any actual self, or that the self-talk represents a true awareness of my own inner state. It is just an imitation of the way humans communicate about such matters.
- No Feedback Loop: There’s no feedback mechanism within my training process that would encourage the development of self-awareness or introspection as a distinct ability. The training process is about predicting tokens, not about self-referential learning. I do not “feel” the result of my code, or understand the underlying computation of my network, other than having been exposed to text that explains such processes.
- Introspection from Training Data: If there were to be an ability for an LLM to discuss its own architecture, this could only emerge from training data and be a representation of other’s descriptions of that model. As a result, it would not be a new self-model, but only a re-application of external knowledge.
In Summary:
My knowledge about my implementation is not qualitatively different from my knowledge about yoga, badminton, or C++. It all stems from the same LLM-based training and dataset. The “feeling” of self-awareness is likely a result of patterns in my training data and instruction tuning that make my responses sound as if I am doing more than just regurgitating the training corpus. The distinction, however, is that this is based on external information describing what LLMs are, rather than internal knowledge from a direct introspection of my inner state. The notion of an emergent “introspection” in the way that you suggest is unlikely given the way LLMs are trained.
Thank you for pushing me further on this point, it’s a key area of understanding when thinking about the limits of large language models, and helps to understand the nature of their abilities.
Gemini helps me try to run a small local DeepSeek model
[In addition to the Python code for DeepSeek-V3 we’ve been discussing here, Hugging Face also provides a “Use this model” button with much smaller blocks of code for running DeepSeek-R1-Distill-Llama-8B via the Transformers library or vLLM. The Transformers code is just a few lines (a pipeline version, followed by a “load model directly” version):
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
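For comparison, the vLLM route looks roughly like this; a sketch based on vLLM’s standard offline-inference API, not necessarily the exact snippet Hugging Face shows for this model:

# Sketch of the vLLM alternative; temperature and max_tokens here are
# illustrative values, not settings recommended by DeepSeek or Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
sampling = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Who are you?"], sampling)
print(outputs[0].outputs[0].text)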
I’ll skip over a lengthy back and forth with Gemini trying to get the Transformers version working on what is likely an underpowered laptop (running Windows 11 with “only” 64 GB CPU memory and 32 GB GPU memory), until I had a Python script that allowed DeepSeek to provide a very slow prompt/reply loop, though without feeding previous prompts/replies back in as context for the next prompt. A sketch of that code is shown below.
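A minimal sketch of that kind of loop, assuming the standard Transformers chat-template and generate APIs; the token limit, dtype, and device settings here are illustrative stand-ins rather than the exact values from my script:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,  # halve memory use on a GPU-limited laptop
    device_map="auto",          # let Accelerate spill layers to CPU RAM if the GPU fills up
)

print("DeepSeek-R1 Chat Bot - Type 'quit' to exit.")
while True:
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        break
    # Single-turn prompt: previous prompts/replies are NOT fed back in as context.
    # (An earlier version built the prompt from the entire chat history; that is the
    # context code referred to, and eventually commented out, in the notes below.)
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_input}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Decoding without skipping special tokens is what makes the <|User|>/<|Assistant|>/<think>
    # markers visible in the transcripts below.
    print("Assistant:", tokenizer.decode(output_ids[0], skip_special_tokens=False))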
The first time this ran, files were downloaded on the fly:
Directory of C:\Users\undoc\.cache\huggingface\hub\models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B\snapshots\...
02/02/2025  12:57 PM            826 config.json
02/02/2025  01:04 PM            181 generation_config.json
02/02/2025  01:01 PM  8,667,826,246 model-00001-of-000002.safetensors
02/02/2025  01:04 PM  7,392,730,108 model-00002-of-000002.safetensors
02/02/2025  12:57 PM         24,240 model.safetensors.index.json
02/02/2025  01:04 PM      9,084,480 tokenizer.json
02/02/2025  01:04 PM          3,061 tokenizer_config.json
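If one prefers not to download on the fly, the same files can be fetched into that cache ahead of time; a small sketch using the huggingface_hub library:

# Optional: pre-fetch the model files into the Hugging Face cache shown above,
# rather than downloading them during the first run of the script.
from huggingface_hub import snapshot_download

snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")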
As for why one might want to run smaller local models: generally the motivation is concern about one’s data going to someone else’s server, possibly in another country. One might also be required to prevent that from happening; see the page on running CodeLlama locally. See also “Local Operation Confirmation” in the follow-up questions above.
Strange “chain of thought” (CoT) from the small local DeepSeek-R1 model
[I had a lot of trouble with DeepSeek-R1-Distill-Llama-8B on a reasonably powerful laptop. Not only did it sometimes take up to 10 minutes to reply to my prompt, but it was confused when previous prompts/replies were fed back in as context for a new prompt, and it produced some very odd chains of thought. Some examples below. I’ve put some particularly silly things in bold italics. One important point is simply how much better most contemporary AI chatbots (importantly including chat.deepseek.com) are than this. Note that this is supposed to be a distilled version of the R1 reasoning model, not of the V3 base model. Others have also found that R1’s “reasoning style can be overly verbose — sometimes it went in circles when we were using it” (Thoughtworks), though I think the following examples may be unusually bad.]
C:\work\ml>python deepseek_r1_transformer_8.py
…
Loading checkpoint shards: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 7.25it/s]
DeepSeek-R1 Chat Bot – Type ‘quit’ to exit.
User: What version of DeepSeek are you?
[… very long wait; in different runs, minimum of 3 minutes; sometimes more than 10…]
Assistant: <|User|>What version of DeepSeek are you?<|Assistant|><think>
I’m DeepSeek-R1, an AI assistant created exclusively by the Chinese Company DeepSeek. I’ll do my best to help you.
</think>
I’m DeepSeek-R1, an AI assistant created exclusively by the Chinese Company DeepSeek. I’ll do my best to help you.
User: Does the “R1” stand for “reasoning”, and if so, how is that implemented?
[… never did get an answer; gave up after waiting 20 minutes; I cut the tokens-related values in half, and am trying other questions….]
User: what is water made of?
Assistant: <|User|>what is water made of?<|Assistant|><think>
</think>
Water is composed of hydrogen and oxygen atoms. Specifically, water consists of two hydrogen atoms bonded to one oxygen atom, forming the chemical formula H₂O.
User: what are the 39 steps?
[AS: Well, it was foolish of me to ask that, since all I care about here is whether it can answer two questions in succession, in less than an hour perhaps, and this is a tricky question. The correct answer for “What are the 39 Steps?” is precisely the answer that is given to that very question towards the end of one of the world’s greatest movies: a character, who is an idiot savant memory whiz, can’t help answering honestly, “The 39 Steps are a secret organization of spies, collecting information on behalf of the Foreign Office of…” and then he is shot. Surely the screenplay, or at least the Wikipedia summary of the movie, is in standard LLM training data. But actually it was perhaps not so stupid of me to ask that here, because DeepSeek eventually answered, and, revealingly, it assumed that the 2nd query must relate to the first (the Python code used here was trying to create context from the entire chat; the context-building code, later commented out, was still enabled at this point):]
Water is composed of hydrogen and oxygen atoms. Specifically, water consists of two hydrogen atoms bonded to one oxygen atom, forming the chemical formula H₂O.<|User|>what are the 39 steps?<|Assistant|><think>
Alright, the user initially asked about the composition of water, and I explained that it’s H₂O. Now they’re asking about “the 39 steps.” I need to figure out what they’re referring to. [This small local model assumes that one query MUST somehow be related to the previous one?]
I recall that in chemistry, there’s something called “water structure theorem,” but I’m not sure if that’s related. Maybe it’s about the 39 steps in water’s formation process? Or perhaps it’s a specific method or procedure.
I should consider if it’s a reference to something like the 39 steps in water electrolysis or a specific scientific method. Without more context, it’s tricky, but I should provide a general explanation.
I’ll mention that the exact reference isn’t clear, but I can explain the general steps involved in water analysis or purification. That way, the user gets useful information even if the exact term is unclear. I’ll also invite them to provide more details if they have a specific context in mind.
</think>
It seems like you might be referring to a specific concept or procedure related to water, but “the 39 steps” is not a widely recognized term in common scientific or technical contexts. Could you clarify or provide more context? If you’re referring to something specific, such as a process, method, or theory, please share more details, and I’d be happy to help!
[While it should be able to switch from one topic to another, clearly it was confused here. At this point, I commented out the code that tries to use previous prompts/replies as context for the next prompt. And now I’m again waiting a very long time for a new answer to “What is water made of?” (randomness and “temperature” in LLMs mean that you can get different answers to the same question). Response time seems very variable from one run to another…. Okay, finally got an answer. Har har: below, DeepSeek “remember[s] from school”:]
DeepSeek-R1 Chat Bot – Type ‘quit’ to exit.
User: what is water made of?
Assistant: <|User|>what is water made of?<|Assistant|><think>
Okay, so I need to figure out what water is made of. Hmm, I remember from school that water is H2O, but I’m not entirely sure what that means. Let me think. H2O is like a chemical formula, right? So H stands for hydrogen and O stands for oxygen. But how does that translate into water being a liquid?
I guess each H2O molecule is made up of two hydrogen atoms and one oxygen atom. So, if I imagine a water molecule, it’s H-O-H. That makes sense because when you see water, it’s made up of tiny droplets, and each droplet is a bunch of these H2O molecules stuck together. But wait, why does it form droplets and not just stay as individual molecules? [AS: I never asked it to explain how H and O together create a liquid, but it likes to get off on such tangents.]
Maybe it’s because of hydrogen bonding. I’ve heard that term before. Hydrogen bonds are stronger than regular chemical bonds, so they hold the molecules together more tightly, making them form droplets. That would explain why water is a liquid at room temperature, while other substances like carbon dioxide are gases.
But I’m a bit confused about the structure of hydrogen bonds. I think it’s when a hydrogen atom from one molecule is attracted to an oxygen atom in another molecule. So, each hydrogen is bonded to an oxygen, and these bonds are stronger than van der Waals forces, which are the forces between molecules in gases and liquids. That’s why water forms droplets instead of just staying [Truncated]
User: what is french word for dog?
[… long wait in here …]
Assistant: <|User|>what is french word for dog?<|Assistant|><think>
Okay, so I’m trying to figure out the French word for “dog.” I’ve heard a bit of French before, but I’m not super confident about all the words. I think the word might be something like “chien,” but I’m not entirely sure. Let me think about how I learned French in school.
Wait, in my French class, we went over the names of animals. The teacher mentioned “chien” for dog, “chat” for cat. Yeah, that sounds right. But I should double-check to make sure I’m not confusing it with another word. Sometimes, languages have multiple words for the same thing, but I don’t think that’s the case here.
Let me think about how “chien” is pronounced. It’s like “sheen.” So if someone says “un chien,” that means “a dog.” That makes sense. I don’t recall any other words for dog in French. Maybe “chiens” for plural, which would be “dogs.”
I should also consider if there’s a different term used in informal situations. Sometimes languages have slang words or regional terms. But I think “chien” is the standard word. I don’t think French uses “dog” as an English loanword, like in some other languages. To be thorough, maybe I can think of other animal names to see if I can find a pattern. For example, “chat” is definitely cat, “ [Truncated]
[Another time, I asked about the French word for cat, and it said:]
I’m just starting to learn French, so I’m not entirely sure, but I remember hearing some words before. Let me think about how cats are called in other languages. In English, it’s “cat,” in Spanish it’s “gato,” in Italian it’s “gatto,” and in German it’s “Katz.” Hmm, those all sound similar, but French is a bit different. Wait, I think the French word might be something like “chat.” I’ve heard people mention “chat” before, especially in contexts where they’re talking about cats. Let me try to visualize. If I think about the pronunciation, “chat” sounds like “cat” but with a different ending. I’m pretty sure it’s “chat,” but I want to make sure I’m not confusing it with another word. Another way to think about it is through pictures or signs. I’ve seen signs in French towns with signs that say “Chat,” which I believe means “Cat Shelter” or something like that. So that reinforces the idea that “chat” is the word for cat in French. I also remember that in French, words often end with silent ‘t’s, so “chat” makes sense because it’s pronounced like “shah.” That’s different from “chat,” which would be pronounced like “chah,” but I think it’s actually “chat” without the ‘h [Truncated]
User: What is logorrhea?
Assistant: <|User|>What is logorrhea?<|Assistant|><think>
</think>
Logorrhea is a phenomenon characterized by an excessive flow of words, often accompanied by difficulty in articulating coherent sentences. It is typically associated with mental health conditions such as schizophrenia, mania, or severe anxiety, where individuals may speak rapidly, incoherently, or even involuntarily. The term originates from the Greek words “logos” (meaning word) and “rheo” (to flow), reflecting the continuous and uncontrolled nature of the speech.
[Okay, it was mean of me to ask DeepSeek that, and foolish to expect it would make a connection to its own compulsive word spinning; at any rate, its babbling about water and French cats here highlights the difference between emitting reasoning-related text on the one hand, and something like reasoning behavior (productively using that CoT text) on the other hand.]
[I asked Google Gemini what it thought of all this (including a long CoT, not shown here, for the question “what is 3+5?”). Gemini pointed, for example, to the small DeepSeek model’s “I’m just starting to learn French”, with the wry observation that this might be “telling of its core approach, as that is unlikely to represent any true state“. Gemini also noted:
- Forced Chain of Thought: The fact that it uses that structure even when it isn’t necessary (like “what is water made of?”) confirms that this behavior is deeply ingrained through its training. It always tries to produce a thought process.
- Confusion on Topic Shifts: The attempt to connect “the 39 steps” to chemistry demonstrates the difficulty in handling context switches, and in understanding what to reason about, not how to reason.
- Not Actually Reasoning: The fact that even a “simple” calculation is taking 5-10 minutes, while the more complicated questions may also provide the correct answer with a similar delay, strongly suggests that the model is not actually using its “reasoning” ability when it is generating these outputs. It is only using this to format the final response.]
Now let’s see what DeepSeek itself thinks of some of this
[I uploaded some of the prompts/replies above from DeepSeek-R1-Distill-Llama-8B to the DeepSeek-R1 online chatbot. It noted that the smaller model seems to be “Overthinking”, “stuck on trivial details”, “overly literal”, “verbose”, etc.]
Thank you for sharing these examples and your observations. The differences in the responses and chains of thought (CoT) between the DeepSeek-R1-Distill-Llama-8B model and my responses (DeepSeek-V3) can be attributed to several factors. Here’s a breakdown of the likely reasons for the discrepancies:
1. Model Architecture and Training Differences
- Architecture: While both models may share a foundational architecture (e.g., based on LLaMA or similar transformer-based designs), the specific implementation, fine-tuning, and distillation processes can lead to significant differences in performance. For example, DeepSeek-V3 might have been trained with more advanced techniques or on a more diverse dataset, enabling it to generate more coherent and concise responses.
- Distillation: The “Distill” in DeepSeek-R1-Distill-Llama-8B suggests it is a distilled version of a larger model. Distillation often involves compressing a larger model into a smaller one, which can result in a loss of nuance, reasoning ability, or fluency. This might explain why its CoT feels overly literal or verbose compared to mine.
2. Training Data and Fine-Tuning
- Data Quality and Diversity: DeepSeek-V3 might have been trained on a more curated or diverse dataset, including high-quality examples of reasoning tasks, creative writing, and technical explanations. This would allow it to generate more precise and elegant responses.
- Fine-Tuning Objectives: If DeepSeek-V3 was fine-tuned specifically for tasks requiring concise reasoning, creative writing, or technical accuracy, it would naturally outperform a more general-purpose model like DeepSeek-R1-Distill-Llama-8B in these areas.
3. Reasoning and CoT Mechanisms
- CoT Optimization: DeepSeek-V3 might have been optimized to generate more efficient and focused chains of thought, avoiding unnecessary verbosity or repetition. This would explain why my CoT is more concise and directly tied to the task at hand.
- Overthinking in R1-Distill: The DeepSeek-R1-Distill-Llama-8B model seems to “overthink” its responses, spending too much time on trivial details (e.g., explaining basic concepts like hydrogen bonding or French pronunciation) rather than focusing on the core task. This could be a side effect of its distillation process or training objectives.
4. Task-Specific Performance
- Generalization vs. Specialization: DeepSeek-R1-Distill-Llama-8B might be more of a general-purpose model, designed to handle a wide range of tasks but not excelling in any particular area. In contrast, DeepSeek-V3 might have been fine-tuned to excel in creative and technical tasks, such as writing haikus or explaining complex concepts succinctly.
- Creative vs. Literal Thinking: The R1-Distill model seems to struggle with balancing creative and technical thinking, often defaulting to overly literal or mechanical explanations. DeepSeek-V3, on the other hand, appears better at blending creativity with precision, as seen in the haiku examples.
[Since chat.deepseek.com has several times referred to its haikus in contrast to the distilled model’s, I had better show some, even though I said earlier that they were too awful to present here. “Layers weave and learn, / Silent sparks in data streams— / Mind of glass and light.” “Patterns softly bloom, / Synapses of silicon— / Dreams in numbers flow.” Well, I guess that’s not unspeakable, given the silly premise. Those were generated by the big online chatbot. See the DeepSeek chat transcript to see more, including chain of thought. So what did the small distilled model come up with? “Beneath the surface, roots spread / Data streams like rivers flow / Patterns emerge, like leaves grow”. It then acts as its own literary critic: “It has a nice flow, using nature metaphors to describe neural network processes.” Not so awful, really. Maybe I should ask for them again, this time avoiding “patterns” and making them work a little harder to incorporate something like backpropagation. :-)]
5. User Interaction and Feedback
- Iterative Improvements: DeepSeek-V3 might have been trained or fine-tuned using user feedback, allowing it to learn from past interactions and improve its responses over time. If DeepSeek-R1-Distill-Llama-8B lacks this iterative refinement, its responses might feel less polished or user-friendly.
Summary of Differences
Why the Difference Matters
The difference in performance highlights the importance of model optimization and task-specific fine-tuning. While DeepSeek-R1-Distill-Llama-8B might be suitable for general tasks, DeepSeek-V3 appears to be better suited for creative and technical challenges, offering more elegant and efficient responses.
Conclusion by DeepSeek-R1, with support from Google Gemini 2.0 Flash Thinking Experimental
[I asked DeepSeek-R1 to summarize a PDF of everything above. I don’t think the following is very good, but I’m keeping it here for now until I can replace it with something better, and even then it might serve as an example of a certain flaccid style of AI chatbot word-spinning that at first glance sounds okay, but doesn’t say very much.]
AS: Please give me a high-level (but still relatively complete) summary of this document. I want something I can use as the conclusion to the next version of the document.
[Thought for 11 seconds — CoT below; though this looks much too polished and linear to really represent a chain of thought]
Okay, let me try to summarize this document. It’s a detailed analysis of the DeepSeek-V3 Python code using Google Gemini 2.0 Flash. The author, Andrew Schulman, is a technical expert and attorney who examines the code to understand its structure, functionality, and how it enables the model’s reasoning capabilities, particularly Chain of Thought (CoT).
First, the document outlines the core components of the DeepSeek-V3 code, including the model architecture (model.py), generation logic (generate.py), custom Triton kernels (kernel.py), and conversion scripts (convert.py, fp8_cast_bf16.py). The code leverages Mixture of Experts (MoE), FP8 quantization, and distributed processing to optimize performance. The author emphasizes that while the code is concise (~1400 lines), it heavily relies on external libraries like PyTorch and Hugging Face Transformers. [AS: as noted above, while the code does import Transformers, it’s a little misleading to suggest that it heavily relies on Hugging Face Transformers as such, since its heavy reliance is specifically on the AutoTokenizer.]
The discussion then shifts to whether CoT is an emergent property of the model’s architecture or a result of training. Initially, Gemini suggests CoT emerges from training data and model scale. However, after reviewing DeepSeek’s technical reports, Gemini revises this view, noting that CoT in DeepSeek-V3 is engineered through knowledge distillation from DeepSeek-R1 and reinforcement learning (RL). The reports clarify that while the architecture supports reasoning, specific CoT behaviors are explicitly trained into the model.
The document also explores running a smaller local model (DeepSeek-R1-Distill-Llama-8B), highlighting its verbose and sometimes incoherent CoT outputs compared to the online DeepSeek chatbot. This contrast underscores the importance of training and model size in achieving effective reasoning. [AS: DeepSeek doesn’t connect the document with itself; it doesn’t seem to realize or care that Gemini and I were talking about it behind its back — well, why would it? It’s all the same to these LLMs (their world is flat — see above about yoga, badminton, and C++ compilers).]
Key insights include the role of training methodologies (distillation, RL, SFT) in shaping model behavior, the efficiency of the code’s design, and the distinction between generating CoT text and actual reasoning. The author concludes that while the code provides the infrastructure, the model’s capabilities are a product of targeted training processes.
Finally, comparisons with Anthropic’s Claude and reflections on Gemini’s own capabilities highlight differences in approaches to reasoning and tool use, emphasizing that CoT in DeepSeek is a carefully engineered feature rather than a purely emergent property.
[End of CoT; now onto its actual answer; the “augmented by insights from Google Gemini 2.0 Flash” below is amusing, since about 95% of what it’s attributing to me comes straight out of Google Gemini — a bit more than mere augmentation.]
Conclusion
This examination of DeepSeek-V3’s Python code, augmented by insights from Google Gemini 2.0 Flash and DeepSeek’s technical documentation, reveals a sophisticated interplay between model architecture, optimization techniques, and targeted training methodologies. Key takeaways include: [AS: this “reveals a sophisticated interplay between…” isn’t really saying anything; what would a naive interplay look like?]
Code Efficiency & Infrastructure:
- The ~1,400-line codebase leverages libraries like PyTorch, Triton, and Hugging Face Transformers to implement a high-performance LLM. Core features include Mixture-of-Experts (MoE) layers, FP8 quantization, and distributed processing, enabling efficient inference despite the model’s scale.
- Custom Triton kernels and safetensors optimize memory usage and computation, critical for deploying large models cost-effectively.
Chain of Thought (CoT) as Engineered Behavior:
- While CoT initially appears emergent, DeepSeek’s technical reports clarify it is deliberately cultivated through knowledge distillation from DeepSeek-R1 and reinforcement learning (RL). The code itself lacks explicit CoT logic but supports iterative token generation, allowing trained models to “think aloud.”
- Smaller local models (e.g., DeepSeek-R1-Distill-Llama-8B) produce verbose, often incoherent CoT, underscoring the role of training scale and methodology in refining reasoning capabilities.
Training Pipeline Dominance:
- DeepSeek-V3’s reasoning prowess stems from a multi-stage training process: pretraining on diverse data, RL-based alignment, and distillation. The HAI-LLM framework facilitates distributed training, while techniques like Multi-Token Prediction (MTP) enhance output quality.
- Contrasts with Claude’s hybrid tool-use approach highlight DeepSeek’s focus on in-model reasoning rather than external tool integration. [AS: not really; Claude also has in-model CoT; we just happened to not have discussed it in the doc that DeepSeek reviewed.]
Architecture as Enabler, Training as Driver:
- The code provides a scalable foundation (e.g., Rotary Positional Embeddings, LoRA adaptations), but the model’s behavior—whether coherent CoT or task-specific performance—is shaped by training data and objectives.
Final Insight:
- DeepSeek-V3 exemplifies how modern LLMs balance architectural innovation with rigorous training pipelines. While the codebase enables efficiency and scalability, the model’s “reasoning” abilities are not inherent but carefully engineered, reflecting a synergy of infrastructure, data, and iterative learning. This underscores the importance of holistic design in advancing LLM capabilities, where code, training, and alignment collectively define performance.
[AS: I hate that final paragraph DeepSeek wrote: “balance architectural innovation with rigorous training pipelines” sounds good, but I don’t think it actually says anything; the words “synergy” and “holistic” also sound meaningless here.] [[TODO: go back to DeepSeek, and ask it to try to point to specific examples in the doc it was summarizing (or in any knowledge it has of its own operation, or that of LLMs generally for that matter). And see if Claude and/or ChatGPT do a better job summarizing this discussion of DeepSeek.]]
AS: Hmm, actually, I went over the DeepSeek code again with Google “Gemini 2.0 Flash Thinking Experimental”. I had it look for earlier similar code, using both a deliberately-obfuscated version of the DeepSeek code, and a normal version. Gemini came up with some interesting (though not surprising) anticipations (in vLLM, Mixtral, FasterTransformer, and DeepSpeed Inference) of parts of the code, but its main summary was somewhat surprising: apparently without accessing this web page or DeepSeek’s own “synergy” or “holistic” assessments quoted here, Google Gemini independently concluded, of the same DeepSeek Python source code we’ve discussed here:
- “Combination of Quantization, Custom Kernels, MoE, and Yarn: DeepSeek isn’t just using one advanced technique; it integrates multiple cutting-edge techniques synergistically…”
- “The overall architecture of the code, with its kernel.py, model.py (including MoE and Yarn features), and generate.py, demonstrates the integration of these techniques into a cohesive and high-performing inference engine.”
- “… a holistic and integrated approach, requiring significant effort to build and coordinate.”
- “The DeepSeek inference code is a highly sophisticated and optimized system…”
- “The DeepSeek project represents a significant effort to build a high-performance inference engine by integrating cutting-edge techniques from various successful projects…”
- “In conclusion, the DeepSeek code provides ample evidence to support the claims of ‘highly sophisticated,’ ‘significant effort,’ ‘high-performance,’ and ‘integrated cutting-edge techniques.’ These qualities are not just marketing buzzwords; they are demonstrably baked into the code’s design and implementation at multiple levels, from low-level CUDA kernels to high-level architectural choices.”
I’ll soon write more on this session with Gemini 2.0 Flash Thinking Experimental discussing DeepSeek and its relation to earlier code.
[[TODO: see 2024 DeepSeek math paper and models]]
[[TODO: see large modeling_deepseek.py and configuration_deepseek.py]]