HTML Guides for Unicode
Learn how to identify and fix common HTML validation errors flagged by the W3C Validator, so your pages are standards-compliant and render correctly across every browser.
What Are Control Characters?
Control characters occupy code points U+0000 through U+001F and U+007F through U+009F in Unicode. They were originally designed for controlling hardware devices (e.g., U+0002 is “Start of Text,” U+0007 is “Bell,” U+001B is “Escape”). These characters have no visual representation and carry no semantic meaning in a web document.
The HTML specification explicitly forbids character references that resolve to most control characters. Even though a reference such as `&#2;` is syntactically well formed, the character it points to is not permitted as content. The W3C validator raises this error to flag references such as `&#0;`, `&#1;`, `&#2;`, `&#7;`, and others that fall within the control character ranges.
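As a rough illustration of the ranges described above, the following Python sketch tests whether a code point is a control character and cross-checks the result against Unicode's general category `Cc` via the standard `unicodedata` module. The names `is_control` and `control_code_points` are illustrative only and do not come from any validator.

```python
import unicodedata


def is_control(code_point: int) -> bool:
    """True for U+0000-U+001F and U+007F-U+009F, the ranges named in this guide."""
    return 0x00 <= code_point <= 0x1F or 0x7F <= code_point <= 0x9F


def control_code_points() -> list[int]:
    """Every code point whose Unicode general category is Cc (control)."""
    return [cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"]


# The hand-written ranges and the Unicode database should agree exactly.
assert control_code_points() == [cp for cp in range(0x110000) if is_control(cp)]
```

Note that this only classifies code points. Which of them a validator accepts as character references (for example, tab and line feed) is a separate question.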
Why This Is a Problem
- Standards compliance: The WHATWG HTML Living Standard defines a specific set of “noncharacter” and “control character” code points that must not be referenced. Using them produces a parse error.
- Unpredictable rendering: Browsers handle illegal control characters inconsistently. Some may silently discard them, others may render a replacement character (�), and others may exhibit unexpected behavior.
- Accessibility: Screen readers and other assistive technologies may choke on or misinterpret control characters, degrading the experience for users who rely on these tools.
- Data integrity: Control characters in your markup often indicate a copy-paste error, a corrupted data source, or a templating bug that inserts raw binary data into HTML output.
How to Fix It
- Identify the offending reference: look for character references such as `&#1;`, `&#2;`, `&#0;`, `&#127;`, or similar that point to control character code points.
- Determine intent: figure out what character or content was actually intended. Often, a control character reference is the result of a bug in a data pipeline or template engine.
- Remove or replace: either delete the reference entirely or replace it with the correct printable character or HTML entity.
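The first step above can be partly automated. The Python sketch below scans markup for decimal and hexadecimal numeric character references and reports those that resolve to a control character. The names `find_control_references` and `ALLOWED` are hypothetical. Treating tab and line feed as the only acceptable control characters is an assumption, so adjust the set to match what your validator actually reports.

```python
import re

# Matches decimal (&#2;) and hexadecimal (&#x02;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

# Assumption: references to tab and line feed are considered harmless here.
ALLOWED = {0x09, 0x0A}


def find_control_references(markup: str) -> list[tuple[int, str, int]]:
    """Return (line_number, reference_text, code_point) for each flagged reference."""
    hits = []
    for line_number, line in enumerate(markup.split("\n"), start=1):
        for match in NUMERIC_REF.finditer(line):
            hex_digits, dec_digits = match.groups()
            if hex_digits is not None:
                code_point = int(hex_digits, 16)
            else:
                code_point = int(dec_digits)
            in_control_range = code_point <= 0x1F or 0x7F <= code_point <= 0x9F
            if in_control_range and code_point not in ALLOWED:
                hits.append((line_number, match.group(0), code_point))
    return hits


sample = "<p>Some text &#2; more text</p>\n<p>Data: &#x02;</p>\n<p>Copyright &copy; 2024</p>"
print(find_control_references(sample))
# [(1, '&#2;', 2), (2, '&#x02;', 2)]
```

The sketch only inspects numeric references. Raw control bytes embedded directly in the file would need a separate character-level check.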
Examples
Incorrect: Control character reference
This markup contains `&#2;`, which expands to the control character U+0002 (Start of Text) and triggers the validation error:
<p>Some text &#2; more text</p>
Incorrect: Hexadecimal form of a control character
The same problem occurs with the hexadecimal syntax:
<p>Data: &#x02;</p>
Correct: Remove the control character reference
If the control character was unintentional, simply remove it:
<p>Some text more text</p>
Correct: Use a valid character reference instead
If you intended to display a special character, use the correct printable code point or named entity. For example, to display a bullet (•), copyright sign (©), or ampersand (&):
<p>Item &bull; Details</p>
<p>Copyright &copy; 2024</p>
<p>Tom &amp; Jerry</p>
Correct: Full document without control characters
<!DOCTYPE html>
<html lang="en">
<head>
<title>Example Page</title>
</head>
<body>
<p>This paragraph uses only valid character references: &amp; &lt; &gt; &copy;</p>
</body>
</html>
Common Control Character Code Points to Avoid
| Reference | Code Point | Name |
|---|---|---|
| `&#0;` | U+0000 | Null |
| `&#1;` | U+0001 | Start of Heading |
| `&#2;` | U+0002 | Start of Text |
| `&#7;` | U+0007 | Bell |
| `&#8;` | U+0008 | Backspace |
| `&#11;` | U+000B | Vertical Tab |
| `&#12;` | U+000C | Form Feed |
| `&#127;` | U+007F | Delete |
If your content is generated dynamically (from a database, API, or user input), sanitize the data before inserting it into HTML to strip out control characters. Most server-side languages and templating engines provide utilities for this purpose.
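A minimal sketch of such a sanitizing step in Python, using only the standard library, might look like the following. The function name `sanitize_for_html` and the `KEEP` set are illustrative. Keeping tab, line feed, and carriage return is an assumption about what counts as harmless whitespace in your data.

```python
import html

# Assumption: tab, line feed, and carriage return are ordinary whitespace worth keeping.
KEEP = {0x09, 0x0A, 0x0D}

# Translation table that deletes every other character in U+0000-U+001F and U+007F-U+009F.
_STRIP_TABLE = {
    cp: None
    for cp in [*range(0x00, 0x20), *range(0x7F, 0xA0)]
    if cp not in KEEP
}


def sanitize_for_html(value: str) -> str:
    """Drop control characters, then escape &, <, > and quotes for use in HTML."""
    return html.escape(value.translate(_STRIP_TABLE))


print(sanitize_for_html("Tom & Jerry\x02\x07"))
# Tom &amp; Jerry
```

Stripping before escaping keeps the two concerns separate: the translation table removes characters that should never reach the page, and `html.escape` handles the characters that are legitimate but need encoding.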
Private Use Area (PUA) characters are reserved ranges in Unicode whose interpretation is not specified by any encoding standard. Their meaning is determined entirely by private agreement between cooperating parties—such as a font vendor and its users. This means that a PUA character that renders as a custom icon in one font may appear as a blank square, a question mark, or a completely different glyph when that specific font is unavailable.
The validator's warning about PUA characters commonly appears when using icon fonts such as older versions of Font Awesome, Material Icons, or custom symbol fonts. These fonts map their icons to PUA code points. While this approach works visually when the font loads correctly, it creates several problems:
- Accessibility: Screen readers cannot interpret PUA characters meaningfully. A visually impaired user may hear nothing, hear “private use area character,” or hear an unrelated description depending on their assistive technology.
- Portability: If the associated font fails to load (due to network issues, content security policies, or user preferences), the characters become meaningless boxes or blank spaces.
- Interoperability: Copy-pasting text containing PUA characters into another application, email client, or document will likely produce garbled or missing content since the receiving system won’t know how to interpret those code points.
- Standards compliance: The W3C and Unicode Consortium both recommend against using PUA characters in publicly exchanged documents for exactly these reasons.
Sometimes PUA characters sneak into your HTML unintentionally—through copy-pasting from word processors, PDFs, or design tools that use custom encodings. Other times, they are inserted deliberately via CSS content properties or HTML entities by icon font libraries.
To fix this, identify where the PUA characters appear and replace them with standard alternatives. Use inline SVG for icons, standard Unicode symbols where appropriate (e.g., ✓ U+2713 instead of a PUA checkmark), or CSS background images. If you must use an icon font, hide the PUA character from assistive technology using aria-hidden="true" and provide an accessible label separately.
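To help locate PUA characters in a file, a small script can report where they occur. The Python sketch below checks each character against the three Private Use ranges defined by Unicode (U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD). The names `is_private_use` and `find_private_use` are hypothetical.

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_private_use(ch: str) -> bool:
    """True if the single character ch lies in one of the PUA ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in PUA_RANGES)


def find_private_use(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, 'U+XXXX') for each PUA character; line and column are 1-based."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if is_private_use(ch):
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


print(find_private_use("<p>Status: \ue001</p>"))
# [(1, 12, 'U+E001')]
```

This only detects literal PUA characters in the text. It does not look inside CSS escapes such as `"\e001"` or numeric character references, which would need to be decoded first.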
Examples
Problematic: PUA character used directly in HTML
<p>Status: &#xE001;</p>
The character `&#xE001;` (U+E001) is a PUA code point. Without the specific icon font loaded, this renders as a missing glyph.
Fixed: Using inline SVG with accessible label
```html
<p>
  Status:
  <svg aria-hidden="true" width="16" height="16" viewBox="0 0 16 16">
    <path d="M6 10.8L2.5 7.3 1.1 8.7 6 13.6 14.9 4.7 13.5 3.3z"/>
  </svg>
  <span>Complete</span>
</p>
```
Problematic: Icon font via CSS content property
```html
<style>
  .icon-check::before {
    font-family: "MyIcons";
    content: "\e001"; /* PUA character */
  }
</style>
<span class="icon-check"></span>
```
Fixed: Icon font with accessibility safeguards
If you must continue using an icon font, hide the PUA character from assistive technology and provide an accessible alternative:
```html
<style>
  .icon-check::before {
    font-family: "MyIcons";
    content: "\e001";
  }
</style>
<span class="icon-check" aria-hidden="true"></span>
<span class="sr-only">Checkmark</span>
```
Note that this approach still triggers the validator warning if the PUA character is detectable in the markup. The most robust fix is to avoid PUA characters entirely.
Fixed: Using a standard Unicode character
```html
<p>Status: ✓ Complete</p>
```
The character `✓` (U+2713, CHECK MARK) is a standard Unicode character that is universally understood and renders consistently across platforms.
Problematic: PUA character from copy-paste
```html
<p>Click here to download</p>
```
Invisible or unexpected PUA characters sometimes hide in text pasted from external sources. Inspect your source code carefully; many code editors can highlight non-ASCII characters or reveal their code points.
Fixed: Cleaned-up text
```html
<p>Click here to download</p>
```
If you’ve audited your document and determined that the PUA characters are intentional and rendering correctly in your target environments, you may choose to accept this warning. However, for publicly accessible web pages, replacing PUA characters with standard alternatives is always the safer and more accessible choice.