Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Extracting URLs from a text document using Java + Regular Expressions

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

1. URLs that start with http://
2. URLs that start with www. (the protocol is missing from the front)

Both cases should be captured along with their query string parameters.

Thanks! I wish I knew regular expressions better.

Cheers,


1 Reply


If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK earlier. I modified it slightly and wrote a small code snippet as a starting point for you:

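import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindUrls
{
    // Compiled once and reused. Backslashes are doubled because this is a
    // Java string literal. CASE_INSENSITIVE lets upper-case schemes, hosts
    // and percent-escapes (e.g. %2F) match as well.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" + 
        "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
        "|mil|biz|info|mobi|name|aero|jobs|museum" + 
        "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
        "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
        "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
        "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
        "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
        "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
        "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every URL-like substring of the input, in order of appearance.
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<>();

        Matcher matcher = URL_PATTERN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}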
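
The expression above is strict about the host name and its top-level domain. If that level of validation is not needed, a much looser pattern may be enough for the two cases in the question (http:// and www. prefixes, query string included). The sketch below is an alternative, not the pattern from this answer: the class name SimpleUrlFinder, the pattern itself, and the trailing-punctuation rule are all assumptions made for illustration, and it will accept strings that are not valid URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class SimpleUrlFinder {
    // "http://", "https://" or "www." followed by any run of characters
    // that are not whitespace, angle brackets, or quotes.
    private static final Pattern SIMPLE_URL = Pattern.compile(
        "\\b(?:https?://|www\\.)[^\\s<>\"']+",
        Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SIMPLE_URL.matcher(input);
        while (matcher.find()) {
            // Assumption: trailing punctuation belongs to the sentence, not the URL.
            result.add(matcher.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "See http://example.com/search?q=java&page=2, or www.example.org.";
        // prints [http://example.com/search?q=java&page=2, www.example.org]
        System.out.println(extractUrls(text));
    }
}
```

The trade-off is precision: this version keeps the query string without listing every allowed character, but it does no checking of the host or TLD, so a word such as www.foo is returned as-is.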


...