These patterns aren't necessarily simple, but here's what I think works best in most situations. Keep in mind that Internationalized Domain Names (IDNs) are available too. With those, far more characters are allowed in URLs than can practically be tested: plenty of characters are still disallowed in domain names, but the list of allowed ones is so large, and varies so much between Top-Level Domains, that it's not practical to keep up with it. If you want to support internationalized domain names, use the second URL pattern; otherwise, use the first.
##TL;DR:
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?
https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@[-`{-~]{1,63}\.)+([^ !-/:-@[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
##URLs, without IDN support
https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?
Explanation:
- Domain names
  - URLs should always start with `http://` or `https://`, since we don't want links to other protocols.
  - Domain names should not start or end with `-`.
  - Each part of a domain name (between two dots) can be at most 63 characters long, and the total length (including dots) cannot exceed 253 characters (some sources say 255; 253 is the safe bet) [1].
  - Non-IDNs only support the letters of the Latin alphabet, the digits 0 through 9, and the dash.
  - Top-level domains of non-IDNs contain only letters of the Latin alphabet, and at least 2 of them [2].
  - I've set an arbitrary limit of 15 letters, since at the time of writing there were no top-level domains longer than 13 characters (`.international`), and I didn't expect that to change any time soon.
- IPs
  - Special cases such as `0.0.0.0`, `127.0.0.1`, etc. are not checked for.
  - IPs with zero-padded numbers are not allowed (for example `01.1.1.1`) [4].
  - Each number in an IP can only range from 0 through 255; 256 is not allowed.
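The 0 through 255 rule for a single IP number can be checked in isolation. This is a minimal Python sketch, not part of the original answer: `OCTET` is the sub-pattern the URL pattern uses for each number, `is_octet` is a hypothetical helper name, and `re.fullmatch` stands in for the whole-value matching that the HTML `pattern` attribute performs.

```python
import re

# Sub-pattern for one IP number, as used four times in the URL pattern.
OCTET = re.compile(r"(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))")

def is_octet(text: str) -> bool:
    """Return True if text is 0 through 255 written without padded zeroes."""
    return OCTET.fullmatch(text) is not None

# Every value from 0 through 255 should be accepted.
all_in_range_accepted = all(is_octet(str(n)) for n in range(256))

# Out-of-range, zero-padded and empty values should be rejected.
rejected = [is_octet(s) for s in ("256", "300", "01", "00", "")]

print(all_in_range_accepted, rejected)
```

The three alternatives cover 100-199, 0-99 and 200-255 respectively; the optional `[1-9]` in the middle alternative is what keeps `01` out.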
Note that the default `http:.*` pattern built into modern browsers for `type="url"` inputs will always be enforced, so even if you remove the `https?://` at the start of this pattern, the protocol is still required. Use `type="text"` to avoid that.
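To try the rules above outside a browser, here is a rough Python sketch. It assumes the literal dots in the pattern are escaped as `\.`, and it uses `re.fullmatch` to approximate the way an HTML `pattern` attribute has to match the whole value. The name `is_valid_url` is made up for this example.

```python
import re

# Non-IDN URL pattern from this section, split over several lines for readability.
URL = re.compile(
    r"https?://(?![^/]{253}[^/])"
    r"((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}"
    r"|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}"
    r"(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))"
    r"(/.*)?"
)

def is_valid_url(value: str) -> bool:
    """Approximate the browser check: the whole value has to match."""
    return URL.fullmatch(value) is not None

samples = [
    "http://example.com",                 # plain domain
    "https://sub.example.co.uk/path?q=1", # subdomains and a path
    "http://my-site.com",                 # dash inside a label is fine
    "http://192.168.0.1/admin",           # IP address with a path
    "ftp://example.com",                  # other protocol
    "http://-example.com",                # leading dash
    "http://example-.com",                # dash right before a dot
    "http://256.1.1.1",                   # number out of range
    "http://01.1.1.1",                    # padded zero
    "http://" + "a" * 64 + ".com",        # label longer than 63 characters
]
for sample in samples:
    print(is_valid_url(sample), sample[:40])
```

The first four samples should be accepted and the rest rejected.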
##URLs, with IDN support
https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@[-`{-~]{1,63}\.)+([^ !-/:-@[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?
Explanation:
Since a huge number of characters are allowed in IDNs, it's not practical to list every allowed character in an HTML attribute (you'd get a huge pattern, and in that case it's much better to test it by some other method than regex) [5]. Instead, this pattern excludes the characters that are not allowed:
- Disallowed characters in domain names are: `` !"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~ ``, with the exception of a period as domain separator.
- These are matched by the ranges `[!-,]`, `[./]`, `[:-@]`, `` [[-`] `` and `[{-~]`.
- All other characters are allowed in this input field.
- TLDs may contain the same characters, up to an arbitrary limit of 15 characters (like with the non-IDN URLs).
- Alternatively, TLDs can be of the format `xn--*`, with `*` being an encoded version of the actual TLD. This encoding uses roughly 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
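The exclusion-based approach can be tried out with a short Python sketch. Treat it as an approximation: it assumes the literal dots are escaped as `\.`, it escapes the `[` inside the character classes (which does not change the meaning), and `re.fullmatch` only mimics the whole-value matching of the HTML `pattern` attribute. The helper name `is_valid_idn_url` is hypothetical.

```python
import re

# IDN-aware URL pattern from this section, split over several lines.
IDN_URL = re.compile(
    r"https?://(?!.{253}.+$)"
    r"((?!-.*|.*-\.)([^ !-,./:-@\[-`{-~]{1,63}\.)+"
    r"([^ !-/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})"
    r"|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}"
    r"([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))"
    r"(/.*)?"
)

def is_valid_idn_url(value: str) -> bool:
    """Approximate the browser check: the whole value has to match."""
    return IDN_URL.fullmatch(value) is not None

samples = [
    "http://bücher.de",          # accented Latin letter in a label
    "http://例え.テスト",          # non-Latin label and TLD
    "http://example.xn--p1ai",   # encoded TLD in the xn-- format
    "http://exa_mple.com",       # underscore is in the [-` range
    "http://exa mple.com",       # space is excluded
    "http://example.c",          # TLD shorter than 2 characters
]
for sample in samples:
    print(is_valid_idn_url(sample), sample)
```

The first three samples should be accepted and the last three rejected.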
##Email addresses
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Explanation:
A 100% foolproof check for email addresses requires a whole lot more than this pattern, but this one covers very nearly all of them. A 100% complete pattern does exist, but it relies on PCRE (PHP)-only regex features, so it won't work in HTML forms.
- Email addresses can only contain letters of the Latin alphabet, the numbers 0-9, and the characters in `` !#$%&'*+/=?^_`{|}~.- `` [6].
- Accents are not universally supported [7], but if needed, post a comment, and I could perhaps write a version that meets the RFC 6530 standard.
- The local part (before the `@`) can only be 64 characters long (this pattern does not check that separately), and the total address can only be 254 characters long [8].
- Addresses may not start or end with a `-` or `.`, and no two dots may appear consecutively [8].
- The domain may not be an IP address [9].
- Other than that, I only included the non-IDN part of the pattern. IDNs are allowed too though, so those will result in false negatives.
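Here is a small Python sketch for trying the email rules above. It assumes the literal dots in the pattern are escaped as `\.` and uses `re.fullmatch` as a stand-in for the whole-value matching of the HTML `pattern` attribute. `is_valid_email` is a made-up helper name.

```python
import re

# Email pattern from this section, split over several lines.
EMAIL = re.compile(
    r"(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)"
    r"([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)"
    r"(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}"
)

def is_valid_email(value: str) -> bool:
    """Approximate the browser check: the whole value has to match."""
    return EMAIL.fullmatch(value) is not None

samples = [
    "john.doe@example.com",      # ordinary address
    "user+tag@sub.example.org",  # special character in the local part
    ".john@example.com",         # starts with a dot
    "john.@example.com",         # local part ends with a dot
    "jo..hn@example.com",        # two consecutive dots
    "john@example",              # no top-level domain
    "john@-example.com",         # domain starts with a dash
    "john@192.168.0.1",          # IP address as domain
    "a" * 250 + "@example.com",  # longer than 254 characters in total
]
for sample in samples:
    print(is_valid_email(sample), sample[:40])
```

The first two samples should be accepted and the rest rejected. Note that the local part is only limited through the total length, so something like `"a" * 100 + "@example.com"` still passes.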
##Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
- Phone numbers must start with one of the following, where `[CTRY]` stands for the country code, and `X` stands for the first non-zero digit (such as `6` in mobile numbers):
  - `00[CTRY]X`
  - `+[CTRY]X`
  - `0X`
  - `[CTRY]X` (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
- Spaces are allowed between the digits (see the second pattern for the space-less version), except before the non-zero `X` as defined above.
- Phone numbers must be exactly 9 digits long, not counting the part before the first non-zero `X` as defined above.
These patterns only cover 10-digit phone numbers (a leading 0 plus 9 digits), or 9 digits after a two-digit country code. Since phone number lengths vary between countries, it's best to use a less strict version of this pattern, or to modify it to work for the desired countries. So, treat this as a template rather than a finished pattern.
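The two phone patterns can be compared side by side with a few lines of Python. This is only a sketch: it assumes the `+` in the pattern is escaped as `\+` (an unescaped `+` right after `(` is not valid regex), and the names `PHONE_SPACED`, `PHONE_STRICT` and `matches` are invented for the example.

```python
import re

# First pattern: optional single spaces between the digits after X.
PHONE_SPACED = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}")
# Second pattern: no spaces allowed at all.
PHONE_STRICT = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}")

def matches(pattern: re.Pattern, value: str) -> bool:
    """Approximate the HTML pattern attribute: the whole value has to match."""
    return pattern.fullmatch(value) is not None

samples = [
    "0612345678",        # 0X form
    "+31612345678",      # +[CTRY]X form
    "0031612345678",     # 00[CTRY]X form
    "31612345678",       # [CTRY]X form
    "+316 12 34 56 78",  # spaces after X: spaced pattern only
    "+31 612345678",     # space before X: rejected by both
    "061234567",         # one digit short
    "+12025550123",      # one-digit country code: does not fit this template
]
for sample in samples:
    print(matches(PHONE_SPACED, sample), matches(PHONE_STRICT, sample), sample)
```

The last sample shows why the pattern is best seen as a template: it assumes a two-digit country code followed by a non-zero digit, which does not hold for every country.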
##Extra: Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, this is very western-centric, but it may be useful too, since a pattern like this can be tricky to write. If you're making a site for a western audience, this will generally work (Asian names have a romanized representation in this format too).
- All names must start with an uppercase letter
- Uppercase letters may occur in the middle of names (such as John McDoe)
- Names must be at least 2 letters long
- I've set an arbitrary maximum of 10 names (these people probably won't mind), each of which can be at most 20 letters long (the length of "Werbenjagermanjensen", who happens to be #1).
- Latin and Greek letters are allowed, including all accented Latin and Greek letters and Icelandic letters (ÐÞ ðþ):
  - `A-Z` matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
  - `Ά-Ϋ` matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊΌΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ ΪΫ.
  - `À-ÖØ-Þ` matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
  - `a-z` matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
  - `ά-ώ` matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
  - `ß-öø-ÿ` matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
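The name rules can be checked with the following Python sketch. It rests on an assumption about the character ranges: going by the descriptions above, they are taken to be U+0386 to U+03AB (uppercase Greek), U+00C0 to U+00D6 and U+00D8 to U+00DE (uppercase accented Latin), U+03AC to U+03CE (lowercase Greek), and U+00DF to U+00F6 and U+00F8 to U+00FF (lowercase accented Latin). The names `UPPER`, `LOWER`, `NAME` and `is_valid_name` are invented for this example.

```python
import re

# Assumed ranges, written as code point escapes to avoid encoding problems.
UPPER = r"A-Z\u0386-\u03AB\u00C0-\u00D6\u00D8-\u00DE"
LOWER = r"a-z\u03AC-\u03CE\u00DF-\u00F6\u00F8-\u00FF"

# One uppercase letter, 1 to 19 further letters, an optional space; up to 10 names.
NAME = re.compile("([" + UPPER + "][" + UPPER + LOWER + "]{1,19} ?){1,10}")

def is_valid_name(value: str) -> bool:
    """Approximate the HTML pattern attribute: the whole value has to match."""
    return NAME.fullmatch(value) is not None

samples = [
    "John McDoe",             # uppercase letter in the middle of a name
    "Émile Zola",             # accented Latin capital
    "Åsa Þórsdóttir",         # Icelandic letters
    "Αλέξανδρος",             # Greek, including an accented letter
    "Werbenjagermanjensen",   # exactly 20 letters
    "john doe",               # does not start with an uppercase letter
    "J",                      # shorter than 2 letters
    "Werbenjagermanjensenx",  # 21 letters
    "Mary-Jane",              # dash is not in any of the ranges
    "O'Brien",                # neither is the apostrophe
]
for sample in samples:
    print(is_valid_name(sample), sample)
```

The first five samples should be accepted and the rest rejected. The last two are a reminder that hyphenated names and names with apostrophes fall outside this pattern.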
##References
- https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax → https://www.rfc-editor.org/rfc/rfc1034#section-3.1
- https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains / https://www.icann.org/resources/pages/tlds-2012-02-25-en
- https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process / What are the allowed characters in a subdomain?
- Based on the fact neither browsers nor the Windows cmd line allow the padded format.
- <a href="https://stackoverflow.com/q/7111881/1256925#2231