Core AI Fundamentals Explained: From Tokens to RAG, A Complete Breakdown of Large Language Model Logic

2025-12-29 14 mins read

This article provides an in-depth analysis of the core underlying logic of large AI models, covering the entire workflow from Tokenization and Embedding semantic vectors to Retrieval-Augmented Generation (RAG). It demystifies how large models "understand language" in plain language, helping AI beginners and enterprise developers master key technologies such as Tokens, vectors, retrieval and generation, and lay a solid foundation for AI application development.

Figure: Large model technology map

In an era where large models and RAG technology are widely deployed, many developers stop at calling APIs and tweaking Prompts: they know large models can converse and RAG can reduce hallucinations, but they do not understand how models perceive language, why vectors can represent semantics, or how retrieval and generation work together.

This article avoids fancy frameworks and complex formulas, and only starts from the most fundamental logic: it thoroughly explains the entire process of large models "hearing—thinking—responding" from Tokenization and Embedding vectors to semantic representation and RAG. It is suitable for AI beginners, backend developers switching to AI, and those who want to truly understand the principles rather than just using the tools.

1. Foundation: AI "Understands" Language Starting with Tokens

Human text is just a string of meaningless symbols to models. The first step for large models to process language is to convert natural language into the smallest model-recognizable units—Tokens.

1.1 Tokens: The "Atoms of Text" for Large Models

A Token is neither a character nor a word, but a minimal processing unit defined by the model:

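For English: Usually split into whole words and subword pieces (common words stay whole; rarer words break into fragments such as prefixes, stems and suffixes)
For Chinese: Usually split into single characters and common words

Example: The Chinese sentence "RAG的主要作用是什么" ("What is the main function of RAG?") may be split into: RAG / 的 / 主要 / 作用 / 是 / 什么. The exact split depends on the tokenizer.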

Each segment is mapped to a unique numeric ID, like an "ID card" for language. Models do not process text directly, only these numeric sequences.
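The mapping from segments to IDs can be pictured with a tiny lookup table. This is only a sketch: the vocabulary and the ID values below are invented for illustration and do not come from any real tokenizer.

```python
# Toy vocabulary; the IDs are invented for illustration and do not
# correspond to any real tokenizer.
VOCAB = {"RAG": 0, "的": 1, "主要": 2, "作用": 3, "是": 4, "什么": 5, "<unk>": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(tokens):
    """Map token strings to IDs; unknown tokens fall back to <unk>."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def decode(ids):
    """Map IDs back to token strings."""
    return [ID_TO_TOKEN[token_id] for token_id in ids]


tokens = ["RAG", "的", "主要", "作用", "是", "什么"]
ids = encode(tokens)
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # ['RAG', '的', '主要', '作用', '是', '什么']
```

The model only ever sees the list of integers; the strings exist purely for humans.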

1.2 Why Not Use Direct Characters or Words? — The Significance of Vocabulary and BPE

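Storing every word in a vocabulary would lead to:

A bloated vocabulary (hundreds of thousands or even millions of entries)
Massive GPU memory usage and low computing efficiency

Splitting text into single characters avoids the vocabulary problem, but it makes sequences much longer, and each character carries little meaning on its own.

This is why many modern models adopt subword tokenization algorithms such as Byte Pair Encoding (BPE):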

Retain high-frequency words in full
Split low-frequency words into smaller units
Further split rare and new words until matching entries in the vocabulary

Core Objective: Cover almost all language scenarios with a limited vocabulary (usually 30,000 to 300,000 entries), while keeping semantic units largely intact and computing costs affordable.

In a nutshell: Tokens = the gateway to language digitization; BPE = a practical way to control vocabulary size.
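The merge idea behind BPE can be sketched in a few lines. This is a simplified teaching version, assuming a tiny hand-made word-frequency table; real tokenizers work on bytes, handle word boundaries and are far more optimized. The function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair wins; ties go to the first one seen.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def encode_word(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=4)
print(merges)                         # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode_word("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is still covered by two learned subwords. That is the property the limited vocabulary relies on.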

2. Core Breakthrough: How Tokens Become "Semantic" Vectors

Tokens only convert text into numeric IDs with no inherent semantics. The key step that enables AI to understand "meaning" is Embedding (word/sentence vectors).

2.1 Embeddings: The "Digital Coordinate System" of Semantics

The essence of Embeddings: Mapping a piece of text to a point (vector) in a high-dimensional space.

Text with similar semantics is closer in this space, while irrelevant text is farther apart. This forms the underlying mathematical foundation for all semantic search, recommendation systems and RAG technology.
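"Closer in this space" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumed values, not produced by any model).
vectors = {
    "apple": [0.9, 0.8, 0.1],
    "banana": [0.8, 0.9, 0.2],
    "keyboard": [0.1, 0.2, 0.9],
}

sim_fruit = cosine_similarity(vectors["apple"], vectors["banana"])
sim_other = cosine_similarity(vectors["apple"], vectors["keyboard"])
print(f"apple vs banana:   {sim_fruit:.2f}")   # about 0.99
print(f"apple vs keyboard: {sim_other:.2f}")   # about 0.30
```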

2.2 Initial Vectors: Starting from Random Numbers, Not Predefined Meanings

A common misconception among beginners: Thinking Embeddings are pre-written "dictionaries of meaning"—this is entirely incorrect.

When a model is initialized, it creates a huge matrix of size vocabulary size × vector dimension, filled with small random values (the exact distribution depends on the model; a normal distribution with a small standard deviation is a common choice). Each Token ID corresponds to a row in this matrix, which is the initial vector. At this stage, the vector has no semantic meaning and is just a string of random numbers.
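A minimal sketch of such a matrix and the row lookup follows. The sizes, the normal distribution and the 0.02 scale are assumptions chosen for illustration; real models pick their own initialization scheme.

```python
import random


def init_embedding_matrix(vocab_size, dim, seed=0, scale=0.02):
    """Create a vocab_size x dim matrix of small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(vocab_size)]


embedding = init_embedding_matrix(vocab_size=7, dim=4)

# Looking up a token's vector is nothing more than indexing a row.
token_ids = [0, 1, 2]
token_vectors = [embedding[token_id] for token_id in token_ids]

print(len(embedding), len(embedding[0]))  # 7 4
print(token_vectors[0])                   # four small random numbers, no meaning yet
```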

2.3 Vector Dimensions: Why 768 and 1024 Are Commonly Used?

Vector dimension can be understood as the number of features used to describe semantics. A 768-dimensional vector, for example, describes a word or sentence with 768 "hidden features", all of which are learned automatically by the model without manual annotation. Individual dimensions rarely map cleanly to a human-readable property such as part of speech, sentiment or theme; such properties are usually spread across many dimensions.

The reason why dimensions are mostly powers of two, or multiples of them, is not a mathematical mystery, but a result of hardware engineering adaptation:

GPUs/NPUs are based on parallel computing at the underlying level
Memory reading requires address alignment
Power-of-two dimensions (128, 256, 1024) map cleanly onto hardware block sizes and help maximize bandwidth utilization

Although 768 is not a power of two, 768 = 3 × 256 = 12 × 64, so it still aligns well with hardware block sizes. In BERT-base, for instance, the 768 dimensions are divided across 12 attention heads of 64 dimensions each. Dimensions differ by model: BERT-base uses 768, BERT-large uses 1024, and LLaMA-7B uses 4096.

2.4 Where Does Semantics Come From? — Learned Through Training

Initial vectors are random; semantics are "forged" through prediction tasks.

Typical process (taking autoregressive models as an example):

Input text → Retrieve Token vectors from the matrix
The model performs the task: Predict the next Token
Compare the predicted result with the ground truth and calculate the Loss
Backpropagate and adjust the vectors (along with the other model weights) by a tiny margin
Repeat over an enormous number of training steps

A stable pattern eventually forms:

Words in similar contexts (e.g., apple, banana, orange) → their vectors move closer
Irrelevant words (e.g., apple, keyboard, universe) → their vectors move farther apart

The core truth in one sentence: Embeddings are not designed, but forged. Vector distance ≈ semantic similarity: the closer two vectors are, the more similar their meanings tend to be.
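The loop above can be imitated with a deliberately tiny next-token predictor. This is a sketch under strong assumptions: a seven-token vocabulary, hand-written (current, next) pairs, a single embedding lookup followed by a softmax, and plain SGD. Real models use Transformers and vastly more data; all names here are hypothetical.

```python
import math
import random

TOKENS = ["apple", "banana", "keyboard", "sweet", "ripe", "plastic", "wired"]
ID = {t: i for i, t in enumerate(TOKENS)}
# (current token, next token) pairs: apple and banana share the same contexts.
PAIRS = [(ID[a], ID[b]) for a, b in [
    ("apple", "sweet"), ("apple", "ripe"),
    ("banana", "sweet"), ("banana", "ripe"),
    ("keyboard", "plastic"), ("keyboard", "wired"),
]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def train(pairs, vocab_size, dim=8, lr=0.2, epochs=300, seed=0):
    rng = random.Random(seed)
    # E: input embeddings (random start). W: output weights used for prediction.
    E = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
    W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for cur, nxt in pairs:
            e = E[cur]
            logits = [sum(w * x for w, x in zip(W[j], e)) for j in range(vocab_size)]
            probs = softmax(logits)
            total += -math.log(probs[nxt])  # cross-entropy loss
            # Gradient of the loss with respect to the logits: probs - one_hot.
            d = [probs[j] - (1.0 if j == nxt else 0.0) for j in range(vocab_size)]
            grad_e = [sum(d[j] * W[j][k] for j in range(vocab_size)) for k in range(dim)]
            for j in range(vocab_size):
                for k in range(dim):
                    W[j][k] -= lr * d[j] * e[k]
            for k in range(dim):
                e[k] -= lr * grad_e[k]  # small nudge to the embedding row
        losses.append(total / len(pairs))
    return E, losses


E, losses = train(PAIRS, vocab_size=len(TOKENS))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("apple vs banana:  ", round(cosine(E[ID["apple"]], E[ID["banana"]]), 2))
print("apple vs keyboard:", round(cosine(E[ID["apple"]], E[ID["keyboard"]]), 2))
```

Nothing in the code says that apple and banana are related. Their vectors drift together only because they are followed by the same tokens in the training pairs.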

3. Key Application: How RAG Reduces Hallucinations in Large Models

Large models have two inherent flaws:

Knowledge cutoff: No updates after training
Hallucinations: Confidently generating false content for unseen information

Retrieval-Augmented Generation (RAG) is an engineering architecture (not a new model) created to solve these two problems.

3.1 Core Logic of RAG: Open-Book Exam Before Answering

Standard RAG consists of two major phases:

(1) Offline Knowledge Base Construction

Document Chunking: Split long documents into small segments to ensure semantic integrity, compliance with Embedding model length limits, and compatibility with the large model's context window.
Embedding: Convert each segment into a vector.
Storage in Vector Database: Establish an index (e.g., HNSW) for fast Approximate Nearest Neighbor (ANN) search.

(2) Online Retrieval and Generation

Convert the user's question into a vector
Vector Retrieval: Find the most similar segments in the database
Prompt Assembly: Combine the question and the retrieved segments into a single prompt and feed it to the large model
Answer Generation: The model is instructed to answer based only on the provided materials

The essence of RAG: Transforming large models from a "closed-book exam" to an open-book exam.
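Both phases can be sketched end to end in a few dozen lines. Important assumptions: the embed function below is a bag-of-words stand-in, not a real semantic Embedding model, the index is a plain Python list rather than a vector database, and the call to the large model is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in 'embedding': sparse word counts (a real system calls a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Offline phase: embed every chunk once and store (chunk, vector)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, top_k=1):
    """Online phase, steps 1-2: embed the question and rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question, contexts):
    """Online phase, step 3: combine the question with the retrieved segments."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the reference material below.\n"
        f"Reference material:\n{context_block}\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    "RAG retrieves relevant document chunks and passes them to the model as context.",
    "BPE is a subword tokenization algorithm that merges frequent symbol pairs.",
    "HNSW is a graph index used for approximate nearest neighbor search.",
]
index = build_index(chunks)
question = "What does RAG pass to the model?"
top = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top)
print(prompt)  # step 4 would send this prompt to the large model
```

Swapping embed for a real Embedding model and the list for a vector database changes the quality and scale of retrieval, but not the shape of the pipeline.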

3.2 Why Document Chunking Is Mandatory?

Many beginners ask: Why not embed entire documents directly? This is not feasible for purely engineering reasons:

Embedding models have a maximum input length; overlong text will be truncated or cause errors
Excessively long text leads to "semantic averaging" (one vector has to represent many topics at once) and a sharp drop in retrieval accuracy
The limited context window prevents large models from processing entire books
Whole-document retrieval is too coarse to locate the exact sentences containing answers

Document chunking is a critical engineering step for effective RAG.
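A minimal chunking sketch follows. It assumes fixed-size character windows with a small overlap so that a sentence cut at a boundary still appears whole in a neighboring chunk; the overlap and the parameter values are illustrative choices, and production systems often split on sentences, paragraphs or Token counts instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```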

3.3 ANN: How to Search Millions of Vectors in Seconds?

Brute-force vector retrieval (calculating the distance to every vector in the database) becomes too slow for real-time use on large datasets. The industry commonly uses Approximate Nearest Neighbor (ANN) algorithms instead.

Core Idea: Sacrifice a tiny amount of accuracy for an order-of-magnitude speed boost.

Through index structures such as trees and graphs (e.g., HNSW):

No full database comparison required
Exact distances are computed only within a small candidate set
Recall is usually maintained at over 95% with suitable index parameters
Speed can increase by tens or even hundreds of times

This is the underlying support for RAG's real-time online response.
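The "narrow down to a candidate set, then compare precisely" idea can be illustrated with a much simpler technique than HNSW. The sketch below uses random-hyperplane locality-sensitive hashing (LSH), chosen only because it fits in a few lines; it is not how HNSW works internally, and the class name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random plane share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _hash(self, vec):
        # One bit per plane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, vec):
        self.vectors.append(vec)
        self.buckets[self._hash(vec)].append(len(self.vectors) - 1)

    def query(self, vec, top_k=1):
        # Only the query's own bucket is scanned, not the whole database.
        candidates = self.buckets.get(self._hash(vec), [])
        ranked = sorted(candidates, key=lambda i: squared_distance(vec, self.vectors[i]))
        return ranked[:top_k], len(candidates)


rng = random.Random(1)
dim = 16
data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
index = HyperplaneLSH(dim)
for vec in data:
    index.add(vec)

ids, scanned = index.query(data[42], top_k=1)
print(ids, f"scanned {scanned} of {len(data)} vectors")
```

The trade-off is visible in the design: a true neighbor that lands in a different bucket is missed, which is the accuracy given up in exchange for scanning only a fraction of the data.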

4. Core Differences: Large Models vs. RAG

This is the most commonly confused point in interviews and architecture design.

Comparison Dimension	Large Models	RAG
Essence	Generative models responsible for language understanding and generation	A retrieval+generation engineering architecture, not a model
Role	The "brain"	External reference book + retriever
Knowledge Source	Fixed training data, no real-time updates	External knowledge base, supports real-time addition/deletion/modification
Main Weakness / What It Fixes	Hallucinations, outdated knowledge, inability to access private data	Mitigates hallucinations and outdated knowledge by grounding responses in retrieved information; answer quality depends on retrieval quality
Core Capability	Probabilistic word prediction for fluent and natural language generation	Semantic retrieval for precise matching of questions and evidence

Industry Standard Answer: Large models are responsible for "speaking like a human"; RAG is responsible for "speaking the truth and the latest information". Practical business implementation = Large model + RAG + Vector database + Chunking strategy + Prompt engineering.

5. Frequently Asked Questions (Essential for Interviews & Engineering)

Here are clear answers to the 8 most confusing and commonly misunderstood questions.

Q1: Must BPE be used for Tokenization?

No.

BPE: Mainstream for GPT, LLaMA series
WordPiece: Used by BERT and its derivatives
Unigram: Used for T5, etc.

All share the same objective: Achieve the optimal balance between vocabulary size, tokenization granularity and semantic integrity.

Q2: Initial vectors are random—why are results reproducible?

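Because training pulls different random starting points toward a similar structure:

Random initialization only sets the starting point (other random factors, such as data shuffling, can be pinned with a fixed random seed)
The training objective (predicting the next Token) is fixed
Training data, model structure and learning rate stay consistent
Vectors therefore converge to a similar semantic space

Different random starting points produce different raw numbers, but the relative relationships between vectors (which words end up close together) remain largely the same. For exactly repeatable numbers, fix the random seed.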

Q3: ANN is "approximate"—will it cause inaccurate retrieval?

There is a small accuracy loss, but it is acceptable for most applications.

Brute-force retrieval: 100% accurate but too slow for production at scale
ANN: Real-time speed with only a minor drop in recall

Engineering is always about trade-offs.
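The accuracy loss is usually quantified as recall@k: the share of the true top-k neighbors that the ANN index also returned. A small sketch, with made-up result IDs for illustration:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Hypothetical top-10 results for one query (IDs are invented).
exact = [7, 3, 9, 1, 4, 12, 5, 8, 2, 6]     # from brute-force search
approx = [7, 3, 9, 1, 4, 12, 5, 8, 2, 11]   # from an ANN index
print(recall_at_k(exact, approx))  # 0.9
```

Averaging this number over a sample of real queries is a practical way to check whether an index configuration is accurate enough for a given business scenario.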

Q4: Are higher vector dimensions always better?

No, there is a diminishing marginal return.

768/1024 dimensions meet the needs of most business scenarios
Increasing to 2048/4096 dimensions usually brings limited accuracy improvement, while GPU memory, storage and computing costs grow in proportion to the dimension

Practical selection priority: 768 (balance) and 1024 (high accuracy).
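The cost side of this trade-off is simple arithmetic. The sketch below assumes one million stored vectors and 4-byte float32 values, and it ignores index overhead; both figures are illustrative assumptions.

```python
def vector_store_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default, no index overhead)."""
    return num_vectors * dim * bytes_per_value


for dim in (768, 1024, 2048, 4096):
    gib = vector_store_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>4} dims: {gib:.2f} GiB")
#  768 dims: 2.86 GiB
# 1024 dims: 3.81 GiB
# 2048 dims: 7.63 GiB
# 4096 dims: 15.26 GiB
```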

Q5: Are Embedding models the same as large models?

They serve different purposes.

Embedding models: Specialized, typically much smaller models with a single function: outputting vectors
Large models: General generative models for understanding, reasoning, generation and conversation

Large models have an internal Embedding layer, but its raw Token vectors are not trained for sentence-level similarity and are usually a poor fit for direct retrieval. RAG systems typically use specialized semantic vector models.

Q6: Must RAG use a vector database? Can regular databases work?

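A dedicated vector database is not strictly required, but semantic (vector) retrieval is.

Keyword search in regular databases (e.g., MySQL full-text search, classic Elasticsearch queries): Matches literal terms only, with no semantic understanding
Vector retrieval: Based on semantic similarity (e.g., searching for "apple" can retrieve "fruit, banana, Red Fuji"), a capability that keyword matching alone cannot provide. Some general-purpose systems now offer vector search as well (for example, Elasticsearch kNN search on dense_vector fields, or PostgreSQL with the pgvector extension).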

Q7: Why are vector updates extremely small during training?

To keep training stable. A large single update would destroy the semantic relationships already learned. Small incremental updates (a low learning rate) repeated over a very large number of training steps enable the model to learn stable and generalizable language patterns.
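The effect of step size can be seen on the simplest possible problem. The sketch below minimizes f(x) = x² with gradient descent, a stand-in for real training chosen only because the behavior is easy to verify by hand; the two learning rates are illustrative.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 starting from x0; the gradient is 2 * x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x
        x -= lr * grad
    return x


small_lr = gradient_descent(lr=0.1)   # each step multiplies x by 0.8
large_lr = gradient_descent(lr=1.1)   # each step multiplies x by -1.2
print(f"lr=0.1 -> x = {small_lr:.4f}")  # about 0.0115, close to the minimum at 0
print(f"lr=1.1 -> x = {large_lr:.4f}")  # about 38.34, moving away from the minimum
```

With the small step, every update moves closer to the minimum. With the large step, every update overshoots by more than it corrects, so the value grows instead of shrinking.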

Q8: Can large models answer accurately without RAG?

It depends on the scenario:

Common sense questions (e.g., The Earth is round): Yes
Latest data, private documents and niche professional knowledge: Likely to hallucinate or fail to answer

RAG is not an added bonus, but a necessary reliability guarantee for enterprise-level AI applications.

6. Ultimate Summary: The Full AI Underlying Workflow (One Sentence)

You may forget the details, but remember this core thread:

Text → Tokenization → Random Embeddings → Iterative training adjustments → Formation of semantic vector space → Large model generates language via probabilistic prediction → RAG attaches an external knowledge base through vector retrieval → Final realization: Semantic understanding, fewer hallucinations, updatable, deployable.

AI is not mysterious at all. All complex designs solve four core problems:

How to digitize language?
How to make numbers represent semantics?
How to make generation more authentic?
How to make the system fast and stable?

Tokens, Embeddings, RAG and ANN retrieval are the standard answers to these four questions.

7. Epilogue

This article intentionally avoids formulas and obscure derivations, retaining only the underlying logic that programmers can immediately understand and apply to architecture design.

Mastering this framework allows you to:

Grasp the core of any large model architecture paper
Design, optimize and troubleshoot RAG systems
Distinguish between problems solvable by large models and those requiring retrieval
Avoid being misled by "AI mysticism" and "AGI myths"

The essence of AI is always: Language digitization + Semantic representation + Probabilistic generation + Engineering architecture.

May this article be your first step from being an "AI user" to being someone who understands AI.

