The core of PHP character processing lies in "matching requirements while balancing efficiency and readability." Regular expressions are the first choice for complex rules, covering most scenarios with their flexible pattern-matching capabilities; strtr/str_replace deliver optimal efficiency for simple fixed rules; and traversal filtering is suitable for customized requirements integrated with business logic...

In web development, the core goal of character processing is to "retain valid information and eliminate invalid content." Common scenarios include:
Filtering malicious characters (such as XSS attack payloads) from user input;
Extracting key information from text (such as pure Chinese characters, letter-number combinations);
Unifying content formats (such as removing redundant symbols, standardizing punctuation).
The core challenges in achieving these requirements are:
Multi-character type compatibility: Need to handle Chinese characters (multi-byte), letters, numbers, symbols (full-width/half-width), etc., simultaneously;
Balance between efficiency and accuracy: Especially when processing large texts, it is necessary to achieve precise filtering while avoiding performance loss;
Unicode standard adaptation: The Unicode encoding range of Chinese characters and symbols is complex, making "misjudgment" likely (e.g., \p{Han} matching non-Chinese symbols).
Regular expressions (Regex) implement character filtering through pattern matching, making them efficient tools for handling complex rules. In PHP, combined with the PCRE engine and Unicode support, they can easily meet multi-scenario requirements.
(1) Chinese Character Matching: From \p{Han} to Pure Chinese Character Ranges
\p{Han} matches on the Unicode Han script property, so it can also match characters that are not ideographs, including Chinese punctuation (such as 【】、。). To match pure Chinese characters only, match the Unicode code point ranges directly:
// Pure Chinese character regex (covers basic set, extended set, compatibility characters)
$pureChinesePattern = '/[\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2B820}-\x{2CEAF}\x{F900}-\x{FAFF}\x{2F800}-\x{2FA1F}]/u';
The /u modifier is required: it enables UTF-8 mode. Without it, the pattern and subject are treated as single bytes, and code points above \x{FF} cannot be used in the pattern;
The ranges cover the CJK Unified Ideographs block, Extensions A through E, and the two compatibility ideograph blocks. Later extensions are not included, and neither are special symbols such as 々 and 〇 (add them separately if they need to be retained).
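As a rough illustration of the difference between the explicit ranges and \p{Han}, the sketch below extracts characters with both. The constant and function names are hypothetical; the ranges are copied from the pattern above.

```php
<?php
// Character-class body copied from the pure Chinese pattern in the text.
const PURE_CHINESE_CLASS = '\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}\x{20000}-\x{2A6DF}'
    . '\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2B820}-\x{2CEAF}'
    . '\x{F900}-\x{FAFF}\x{2F800}-\x{2FA1F}';

/** Return only the characters of $text that fall inside the explicit ranges. */
function extractPureChinese(string $text): string
{
    preg_match_all('/[' . PURE_CHINESE_CLASS . ']/u', $text, $matches);
    return implode('', $matches[0]);
}

/** Same extraction, but based on the Han script property, for comparison. */
function extractHanScript(string $text): string
{
    preg_match_all('/\p{Han}/u', $text, $matches);
    return implode('', $matches[0]);
}

$sample = '人々 hello 〇 世界!';
echo extractPureChinese($sample), PHP_EOL; // 人世界
echo extractHanScript($sample), PHP_EOL;   // 人々〇世界
```

The iteration mark 々 and the numeral 〇 are picked up by \p{Han} but not by the explicit ranges, which is the behavior the note about special symbols describes.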
(2) Compound Rules: Retain Specified Character Sets
To retain "pure Chinese characters, letters, numbers, and specified symbols (- _ . , space)" at the same time, combine them in one negated character set:
// Retain target characters and filter all other content
$pattern = '/[^\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}...a-zA-Z0-9_., -]/u';
$filtered = preg_replace($pattern, '', $text);
[^...] matches any character outside the set, so replacing each match with an empty string removes everything except the listed characters;
A literal - must be placed at the beginning or end of the character set (e.g., [-_. ,]) or escaped as \-, so that it is not parsed as a range operator.
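A minimal, self-contained version of this compound rule might look as follows. It is only a sketch: the function name is hypothetical, and it assumes just the two BMP ranges rather than the full list that the pattern in the text abbreviates.

```php
<?php
/**
 * Keep CJK ideographs (two BMP ranges only, an assumption of this sketch),
 * ASCII letters and digits, and the symbols - _ . , and space.
 */
function keepAllowedChars(string $text): string
{
    // The hyphen is the last character of the class, so it is read literally.
    $pattern = '/[^\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}a-zA-Z0-9_., -]/u';
    // preg_replace() returns null on failure, for example on invalid UTF-8.
    return preg_replace($pattern, '', $text) ?? '';
}

echo keepAllowedChars('订单#A-42【已付款】, total: ¥99.50!'), PHP_EOL;
// 订单A-42已付款, total 99.50
```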
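Reuse regex patterns: PHP caches compiled patterns, so calling the preg_* functions repeatedly with the same pattern string avoids recompilation (30%+ performance improvement);
Simplify character sets: For example, use \d instead of 0-9, and \w instead of a-zA-Z0-9_ (note that \w includes the underscore, and that under the /u modifier PHP makes \d and \w Unicode-aware, so \w also matches Chinese characters; keep the explicit ASCII ranges when only ASCII is wanted);
Limit backtracking: In complex rules, a non-greedy quantifier (such as *?) can reduce backtracking when the match is expected to end early; verify with representative input, because it does not help in every pattern.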
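Whether \w is a safe replacement for a-zA-Z0-9_ depends on the modifier. The sketch below compares the two; it assumes that PHP's /u modifier also enables PCRE's Unicode character properties, as current PHP builds do. The function names are hypothetical.

```php
<?php
/** True if $s consists only of ASCII letters, digits, and underscores. */
function isAsciiWord(string $s): bool
{
    return preg_match('/^[a-zA-Z0-9_]+$/', $s) === 1;
}

/** True if $s consists only of \w characters, with /u applied. */
function isWordUnderU(string $s): bool
{
    return preg_match('/^\w+$/u', $s) === 1;
}

var_dump(isAsciiWord('user_01'));  // bool(true)
var_dump(isWordUnderU('user_01')); // bool(true)
var_dump(isAsciiWord('用户'));      // bool(false)
var_dump(isWordUnderU('用户'));     // bool(true): \w is Unicode-aware under /u
```

If a rule is meant to admit only ASCII letters and digits, the explicit class is therefore the safer choice in patterns that carry /u.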
When rules are simple or higher readability is required, non-regex methods are more advantageous. Common solutions include string replacement and traversal filtering.
str_replace and strtr
(1) str_replace: Batch delete known characters
Suitable for a clear list of characters to be filtered (such as fixed symbols):
$invalidChars = ['@', '#', '¥', '!', '【', '】'];
$filtered = str_replace($invalidChars, '', $text);
(2) strtr: Multi-to-multi mapping replacement
Suitable for batch replacement (such as "symbol → empty" or "symbol → standardized symbol"), with slightly higher efficiency than str_replace:
// Keys are characters to be filtered, values are replacement results (empty means deletion)
$replaceMap = ['@' => '', '¥' => '', '【' => '[', '】' => ']'];
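$filtered = strtr($text, $replaceMap);
Advantages: With an array map, strtr tries the longest keys first and never rescans text it has already replaced, so the result does not depend on the order of the entries (this avoids the chained-replacement problem of str_replace). The underlying C implementation keeps it efficient.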
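The order dependence can be seen with a small, made-up mapping in which one replacement result is itself a search key. This is only an illustration; the mapping is not taken from the article.

```php
<?php
$text = '【a】[b]';

// str_replace applies the pairs one after another, so the '[' produced by
// the first pair is replaced again by the second pair.
$viaStrReplace = str_replace(['【', '['], ['[', '('], $text);

// strtr with an array replaces each position once and does not rescan.
$viaStrtr = strtr($text, ['【' => '[', '[' => '(']);

echo $viaStrReplace, PHP_EOL; // (a】(b]
echo $viaStrtr, PHP_EOL;      // [a】(b]
```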
Judge and retain/eliminate characters one by one, suitable for rules involving business logic (such as excluding specific Chinese characters, dynamic validation):
function filterByLoop(string $text): string {
    $allowedSymbols = ['-', '_', '.', ',', ' '];
    $filtered = '';
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $char = mb_substr($text, $i, 1, 'UTF-8');
        // Determine if it is an allowed character (pure Chinese, letter, number, specified symbol)
        // isPureChinese() is the pure Chinese character judgment function, assumed to be defined elsewhere
        $isAllowed = isPureChinese($char)
            || ctype_alnum($char)                      // ASCII letter/number
            || in_array($char, $allowedSymbols, true); // Specified symbol
        if ($isAllowed) {
            $filtered .= $char;
        }
    }
    return $filtered;
}
Wrap the rule in a callback and invoke it through filter_var with FILTER_CALLBACK, suitable for framework-level reuse:
// Filter callback: keep Chinese characters, letters, numbers, and - _ . , space
function filter_allowed_chars($value) {
    return preg_replace('/[^\x{4E00}-\x{9FFF}a-zA-Z0-9_., -]/u', '', $value);
}
// Call: Filter user input (FILTER_CALLBACK accepts any callable in 'options')
$username = filter_var($_POST['username'] ?? '', FILTER_CALLBACK, ['options' => 'filter_allowed_chars']);
The theoretical time complexity of all methods is O(n) (where n is the string length), but actual efficiency is significantly affected by underlying implementations.
| Method | Time Consumed (ms) | Core Advantages | Bottlenecks |
|---|---|---|---|
| strtr/str_replace | 12-15 | Underlying C implementation, no PHP loop overhead | Need to manually list filtered characters, fixed rules |
| Regular Expressions | 15-20 | Supports complex rules, concise code | First call requires regex compilation (optimized after caching) |
| Traversal Filtering | 70-90 | Intuitive logic, supports complex business rules | PHP loops + multiple function calls (e.g., mb_substr) |
| Filter Functions | 80-100 | Reusable, compliant with filter specifications | Additional overhead of callback functions |
Simple fixed rules (e.g., filtering known symbols): Choose strtr/str_replace (highest efficiency);
Complex multi-character types (e.g., Chinese characters + letters + symbols): Choose regular expressions (balance efficiency and conciseness);
With business logic (e.g., excluding sensitive words): Choose traversal filtering (priority to flexibility);
Framework-level reuse: Choose filter functions (unified call entry).
Misusing strlen/substr: For multi-byte characters such as Chinese characters, use mb_strlen/mb_substr (specify UTF-8 encoding);
Ignoring the /u modifier: When processing Unicode with regex, omitting /u leads to incorrect matches (e.g., Chinese characters are split into individual bytes).
Differences between full-width and half-width symbols: The full-width comma ， (U+FF0C) and the half-width comma , (U+002C) are different code points; state explicitly which ones a filter should keep;
Classification of special symbols: Although 々 (U+3005) and 〇 (U+3007) belong to the Unicode "Han script," they are not pure Chinese characters and need to be processed as needed.
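Two of these pitfalls are easy to check directly. The sketch below shows the byte versus character count and one possible way to normalize full-width ASCII variants before filtering. The function name is hypothetical, and the approach relies on the full-width forms U+FF01 to U+FF5E sitting exactly 0xFEE0 above their ASCII counterparts.

```php
<?php
/** Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width. */
function toHalfWidth(string $text): string
{
    $result = preg_replace_callback(
        '/[\x{FF01}-\x{FF5E}\x{3000}]/u',
        static function (array $m): string {
            $code = mb_ord($m[0], 'UTF-8');
            // U+3000 is the ideographic space; the others are offset by 0xFEE0.
            return $code === 0x3000 ? ' ' : mb_chr($code - 0xFEE0, 'UTF-8');
        },
        $text
    );
    return $result ?? $text; // fall back to the input if the regex fails
}

var_dump(strlen('中文'));             // int(6): bytes
var_dump(mb_strlen('中文', 'UTF-8')); // int(2): characters
echo toHalfWidth('ＡＢＣ，１２３'), PHP_EOL; // ABC,123
```

Normalizing first means the filter only needs to list the half-width symbols, at the cost of changing the original punctuation.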
Chunked processing: For texts with more than 1 million characters, filter in chunks of 4096 bytes to reduce memory usage, and make sure a chunk boundary never splits a multi-byte character;
Avoid global replacement: Using preg_match_all to extract valid content can be more efficient than using preg_replace to delete invalid content (especially when invalid characters account for a high proportion).