Characters allowed in a URL
Understanding Characters Allowed in a URL
Explore the rules and conventions governing characters permissible within Uniform Resource Locators (URLs), including reserved and unreserved characters, and the importance of URL encoding.
URLs are fundamental to how we navigate the web, but their structure is governed by strict rules regarding which characters are allowed and how they should be represented. Not every character on your keyboard can be directly inserted into a URL. This article delves into the specifications that define valid URL characters, the distinction between reserved and unreserved characters, and the necessity of URL encoding.
The Basics: RFC 3986 and URL Components
The standard for Uniform Resource Identifiers (URIs), which URLs are a subset of, is defined by RFC 3986. This specification categorizes characters into two main groups: reserved and unreserved. Understanding these categories is crucial for constructing valid and interoperable URLs.
URLs are composed of several parts, each with specific character requirements:
- Scheme: (e.g., http, https, ftp)
- Authority: (e.g., www.example.com:8080)
  - Userinfo (optional)
  - Host
  - Port (optional)
- Path: (e.g., /path/to/resource)
- Query: (e.g., ?key=value&another=param)
- Fragment: (e.g., #section)
Each component has its own set of rules, but the general principle of reserved and unreserved characters applies throughout.
Components of a URL
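To make the component breakdown concrete, here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. The URL itself is a hypothetical example assembled from the sample values listed above, with a `user@` userinfo part added for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen so that every component is present.
url = "https://user@www.example.com:8080/path/to/resource?key=value&another=param#section"

parts = urlsplit(url)

components = {
    "scheme": parts.scheme,      # "https"
    "userinfo": parts.username,  # "user"
    "host": parts.hostname,      # "www.example.com"
    "port": parts.port,          # 8080, returned as an int
    "path": parts.path,          # "/path/to/resource"
    "query": parts.query,        # "key=value&another=param" (no leading "?")
    "fragment": parts.fragment,  # "section" (no leading "#")
}

for name, value in components.items():
    print(f"{name:9} {value!r}")
```

Note that the delimiters themselves (`?`, `#`, `:`, `@`) are consumed by the parser and do not appear in the returned values.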
Unreserved Characters: Safe to Use
Unreserved characters are those that can be safely used in a URL without any special encoding. They are defined as characters that do not have a reserved purpose within the URI syntax. Using these characters directly is generally recommended for readability and simplicity.
According to RFC 3986, the unreserved characters consist of:
- Uppercase letters: A-Z
- Lowercase letters: a-z
- Digits: 0-9
- Hyphen: -
- Period: .
- Underscore: _
- Tilde: ~
Any character in a URL's data that is neither unreserved nor a reserved character acting as a delimiter must be percent-encoded.
https://www.example.com/my-article_123~data
A URL demonstrating the use of various unreserved characters.
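As a rough illustration, the unreserved set can be written out directly and checked against a string. The helper name `is_unreserved_only` is hypothetical, and the comparison with `urllib.parse.quote` assumes Python 3.7 or later, where `~` is treated as unreserved.

```python
import string
from urllib.parse import quote

# The unreserved set from RFC 3986: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def is_unreserved_only(text: str) -> bool:
    """Return True if every character in `text` is an unreserved character."""
    return all(ch in UNRESERVED for ch in text)

segment = "my-article_123~data"

print(is_unreserved_only(segment))          # True
print(quote(segment, safe="") == segment)   # True: nothing needs encoding
print(is_unreserved_only("my article"))     # False: a space is not unreserved
```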
Reserved Characters: Special Meaning and Encoding
Reserved characters are those that sometimes have a special meaning within a URI. For example, the slash / is used to separate path segments, and the question mark ? indicates the start of a query component. If a reserved character needs to be used for its data value rather than its delimiter role, it must be percent-encoded.
RFC 3986 divides the reserved characters into two groups:
- General delimiters (gen-delims): : / ? # [ ] @
- Sub-delimiters (sub-delims): ! $ & ' ( ) * + , ; =
When these characters appear in a context where they are not acting as delimiters but as part of the data, they must be encoded. For instance, if a query parameter needs to contain an & symbol literally, it would be encoded as %26.
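The ampersand case can be sketched with Python's standard `urllib.parse` module. The parameter names `q` and `page` and the value `Tom & Jerry` are made up for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode

value = "Tom & Jerry"  # hypothetical parameter value containing a literal "&"

# quote() with safe="" percent-encodes every reserved character.
encoded = quote(value, safe="")
print(encoded)  # Tom%20%26%20Jerry

# urlencode() builds a whole query string; by default it writes spaces as "+".
query = urlencode({"q": value, "page": "2"})
print(query)  # q=Tom+%26+Jerry&page=2

# Parsing the query string recovers the original value, "&" included.
decoded = parse_qs(query)["q"][0]
print(decoded)  # Tom & Jerry
```

Because the literal `&` is encoded as `%26`, the parser sees exactly two parameters rather than splitting the value at the ampersand.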
Percent-Encoding: The Solution for Special Characters
Percent-encoding (also known as URL encoding) is the mechanism used to represent reserved characters, non-ASCII characters, and other characters not allowed in a URL's unencoded form. It replaces each byte of the character with a percent sign (%) followed by that byte's two-digit hexadecimal value. An ASCII character occupies a single byte; a non-ASCII character is first converted to bytes, typically using UTF-8, so it can expand to several %XX triplets.
For example:
- Space ( ) becomes %20
- Ampersand (&) becomes %26
- Plus sign (+) becomes %2B
- Equals sign (=) becomes %3D
Many programming languages provide built-in functions for URL encoding and decoding. It's crucial to apply encoding correctly, especially for user-generated content or dynamic parameters.
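The mechanism can be sketched by hand in a few lines of Python and compared against the standard library's `urllib.parse.quote`. The `percent_encode` helper below is a hypothetical illustration that assumes UTF-8 and encodes everything outside the unreserved set; real code would normally call `quote` directly.

```python
from urllib.parse import quote, unquote

UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def percent_encode(text: str) -> str:
    """Percent-encode every character that is not unreserved (illustrative helper)."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Convert the character to UTF-8 bytes, then write each byte as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

for ch in [" ", "&", "+", "=", "é"]:
    print(repr(ch), "->", percent_encode(ch))

# The hand-written helper agrees with the standard library on this input.
print(percent_encode("a b&c") == quote("a b&c", safe=""))  # True

# Decoding reverses the process.
print(unquote("%C3%A9"))  # é
```

The last character in the loop shows the multi-byte case: `é` is two bytes in UTF-8, so it becomes two percent-encoded triplets.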
In JavaScript, encodeURI() encodes fewer characters (leaving reserved characters like / and ? unencoded), while encodeURIComponent() encodes every character except letters, digits, and - _ . ! ~ * ' ( ), which makes it suitable for use within a URL component (like a query parameter). Always use encodeURIComponent() for encoding individual URL components.
Internationalized Domain Names (IDNs) and Punycode
For domain names that contain non-ASCII characters (e.g., characters from Cyrillic, Arabic, or Chinese scripts), a special encoding scheme called Punycode is used. Punycode converts these Unicode domain names into an ASCII-compatible encoding that can be used in the DNS system. Browsers typically handle this conversion automatically for the user, but it's an important underlying mechanism for global web accessibility.
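As a small sketch of that conversion, Python's built-in `idna` codec can turn a Unicode hostname into its ASCII-compatible form and back. The hostname `münchen.example` is a hypothetical example. The built-in codec follows the older IDNA 2003 rules, so treat this as an illustration of the idea rather than a reference for what current browsers do.

```python
hostname = "münchen.example"  # hypothetical internationalized domain name

# Each label containing non-ASCII characters is converted to Punycode
# and given the "xn--" prefix; plain ASCII labels are left unchanged.
ascii_host = hostname.encode("idna")
print(ascii_host)  # b'xn--mnchen-3ya.example'

# Decoding restores the Unicode form.
unicode_host = ascii_host.decode("idna")
print(unicode_host)  # münchen.example
```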
Understanding the rules for characters in URLs is essential for web developers, system administrators, and anyone dealing with web resources. Adhering to RFC 3986 and properly using percent-encoding ensures that URLs are correctly interpreted across different systems and prevents unexpected behavior. Always err on the side of encoding when in doubt, especially for dynamic content.