Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Python: match all URLs in a file and list each on a new line in a file

I'm trying to write a script that opens a file, finds all the URLs in it, and writes a new file containing just the matches. With the code below, I only get the first match. The file I'm parsing is basically one line with multiple URLs: "This is a a random string of urls http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here , http://www.wordpress.com:8080,"

import re

# Raw strings keep the backslashes in the Windows paths and in the regex.
with open(r"C:\Users\username\Desktop\test.txt") as f:
    lines = f.readlines()

pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
matches = []
for line in lines:
    m = re.search(pattern, line)  # returns only the first match in the line
    if m:
        matches.append(m.group(0))
print(matches)

with open(r"C:\Users\username\Desktop\output.txt", "w") as file_to_write:
    file_to_write.write("\n".join(matches))

Now, if I replace the regex with something simpler like '(https?://.*):(\d*)', I get all the matches, but they are not split onto separate lines; they are all joined together on one line.

I'm not sure how to modify the script or the regex to capture every URL's base:port and write each one on a new line.

Current output with the regex '(https?://.*):(\d*)':

http://www.yandex.ru:8080, http://www.hao123.com:8080, antoher bit here , http://www.wordpress.com:8080,http://www.gmw.cn:8080, http://www.tumblr.com:8080/test/etete/eete, http://www.paypal.com:8080

Desired Output:

http://www.yandex.ru:8080
http://www.hao123.com:8080
http://www.wordpress.com:8080
http://www.gmw.cn:8080
http://www.tumblr.com:8080
http://www.paypal.com:8080


1 Reply


You can use re.findall with the pattern you already have. Unlike re.search, which stops at the first match, it returns every non-overlapping match:

>>> import re
>>>
>>> s = 'This is a a random string of urls http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here, http://www.wordpress.com:8080,'
>>> pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
>>> urls = re.findall(pattern, s)
>>> urls
['http://www.yandex.ru:8080', 'http://www.hao123.com:8080', 'http://www.wordpress.com:8080']
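
To make the difference between the two calls concrete, here is a minimal sketch that runs both on the same sample string. The variable names first and every are only illustrative; the pattern is the one from the question, written as a raw string so that \s, \d and \w reach the regex engine unchanged.

```python
import re

# Sample text and pattern taken from the question.
s = ('This is a a random string of urls http://www.yandex.ru:8080, '
     'http://www.hao123.com:8080, another bit here, http://www.wordpress.com:8080,')
pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'

first = re.search(pattern, s)    # one match object (or None): the first hit only
every = re.findall(pattern, s)   # list of all non-overlapping matches

print(first.group(0))
print(every)
```

Because the pattern contains no capturing groups, re.findall returns the full matched strings rather than tuples of groups.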

You can then use the list named urls as you see fit. For example, to write the URLs to a file, one per line, you can use (much as you already do) file_to_write.write('\n'.join(urls)). For illustration:

>>> print('\n'.join(urls))
http://www.yandex.ru:8080
http://www.hao123.com:8080
http://www.wordpress.com:8080
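
Putting it together for the file-to-file case in the question, a rough sketch could look like the following. The helper name extract_urls and the temporary directory are assumptions made for the example (they stand in for the Desktop paths), and the sample input line is shortened from the question's data.

```python
import re
import tempfile
from pathlib import Path

# Pattern from the question, compiled once; raw string preserves the backslashes.
URL_PATTERN = re.compile(
    r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
)

def extract_urls(src_path, dst_path):
    """Hypothetical helper: write every URL found in src_path to dst_path, one per line."""
    text = Path(src_path).read_text()
    urls = URL_PATTERN.findall(text)          # all matches in the whole file
    Path(dst_path).write_text("\n".join(urls))
    return urls

# Demo: a temporary directory stands in for the real input and output paths.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("urls http://www.yandex.ru:8080, "
                   "http://www.tumblr.com:8080/test/etete/eete, end")
    found = extract_urls(src, dst)
    written = dst.read_text().splitlines()

print(written)
```

Reading the whole file at once avoids the per-line loop; if the input were very large, iterating line by line and extending a list with re.findall would be a reasonable alternative.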


...