I need help building a regular expression that can properly match a URL inside free text. It should handle the following parts:
- scheme: one of ftp, http, https (is ftps a protocol?)
- optional user (and optional pass)
- host (with support for IDNs)
- support for www and sub-domain(s) (with support for IDNs)
- basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
- optional port number
- path (optional, with support for Unicode chars)
- query (optional, with support for Unicode chars)
- fragment (optional, with support for Unicode chars)
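To make the list above concrete, here is a rough Python sketch of how the parts might be composed into one pattern. It is only a starting point under stated assumptions, not a complete solution: the names URL_RE and find_urls are made up for illustration, Python's Unicode-aware \w stands in for "IDN support", and trailing punctuation in free text (such as a full stop right after the URL) is not stripped.

```python
import re

# Hypothetical sketch: each named group mirrors one bullet in the list above.
# Python 3 str patterns are Unicode-aware, so \w also matches non-ASCII letters.
URL_RE = re.compile(
    r"""
    (?P<scheme>https?|ftp)://                # scheme: ftp, http or https
    (?:(?P<user>[^\s:@/]+)                   # optional user
       (?::(?P<password>[^\s@/]*))?@)?       # optional password
    (?P<host>
        (?:[^\W_](?:[\w-]{0,61}[^\W_])?\.)+  # www / sub-domain / domain labels
        [a-zA-Z]{2,6}                        # basic TLD filter from the list
    )
    (?::(?P<port>\d{1,5}))?                  # optional port number
    (?P<path>/[^\s?#]*)?                     # optional path
    (?:\?(?P<query>[^\s#]*))?                # optional query
    (?:\#(?P<fragment>\S*))?                 # optional fragment
    """,
    re.VERBOSE,
)


def find_urls(text):
    """Return every substring of text that the sketch pattern matches."""
    return [m.group(0) for m in URL_RE.finditer(text)]


sample = "see http://user:pass@www.example.com:8080/path/ü?q=1#frag for details"
match = URL_RE.search(sample)
print(match.groupdict())
```

The verbose flag keeps one line per URL component, which makes it easier to tighten or loosen a single part (for example the TLD filter) without touching the rest.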
Here is what I could find out about sub-domains:
A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.
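The length limits in the quote are awkward to express inside a single regular expression, so one option is to check them separately after a candidate host has been matched. A minimal sketch, assuming a hypothetical helper named within_dns_limits; it counts characters exactly as given and does not convert IDNs to their ASCII form first, which a real check would probably need to do:

```python
def within_dns_limits(hostname: str) -> bool:
    """Check the limits from the quote: 127 levels, 63 chars per label, 255 total."""
    # Ignore a single trailing dot (fully qualified form).
    name = hostname[:-1] if hostname.endswith(".") else hostname
    labels = name.split(".")
    if len(name) > 255 or len(labels) > 127:
        return False
    # Every label must be non-empty and at most 63 characters long.
    return all(1 <= len(label) <= 63 for label in labels)


print(within_dns_limits("en.wikipedia.org"))
```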
Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:
[0-9a-zA-Z][0-9a-zA-Z-]{2,62}
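To see how that pattern behaves on whole labels, here is a small Python sketch that compares it with a hypothetical variant. The variant assumes the usual hostname rule that a label is 1 to 63 letters, digits or hyphens and neither starts nor ends with a hyphen. The names ORIGINAL, VARIANT and label_report are made up for illustration.

```python
import re

# The pattern from the text above, applied to a whole label via fullmatch().
ORIGINAL = re.compile(r"[0-9a-zA-Z][0-9a-zA-Z-]{2,62}")
# Hypothetical variant: 1 to 63 characters, no leading or trailing hyphen.
VARIANT = re.compile(r"[0-9a-zA-Z](?:[0-9a-zA-Z-]{0,61}[0-9a-zA-Z])?")


def label_report(label):
    """Return (matches ORIGINAL, matches VARIANT) for one non-IDN label."""
    return (bool(ORIGINAL.fullmatch(label)), bool(VARIANT.fullmatch(label)))


for label in ["wikipedia", "en", "foo-", "-foo"]:
    print(label, label_report(label))
```

The two differ on short labels such as "en", which the original pattern rejects because it requires at least three characters, and on labels with a trailing hyphen, which the original pattern accepts.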
Can someone help me out with this regular expression or point me in the right direction?