Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extracting variables from Javascript inside HTML

I need all the lines that contain the text '.mp4'. The values are not inside any HTML tag of their own; they sit in JavaScript within the page.

My code:

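import urllib.request
import demjson

url = 'https://myurl'
# read() returns bytes; decode (assuming UTF-8) so the page can be searched as text
content = urllib.request.urlopen(url).read().decode('utf-8')

The relevant part of the page source:
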
<script type="text/javascript">
/* <![CDATA[ */
function getEmbed(width, height) {
    if (width && height) {
        return '<iframe width="' + width + '" height="' + height + '" src="https://www.ptrex.com/embed/33247" frameborder="0" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>';
    }
    return '<iframe width="768" height="432" src="https://www.ptrex.com/embed/33247" frameborder="0" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>';
}

var flashvars = {
    video_id: '33247',
    license_code: '$535555517668914',
    rnd: '1537972655',
    video_url: 'https://www.ptrex.com/get_file/4/996a9088fdf801992d24457cd51469f3f7aaaee6a0/33000/33247/33247.mp4/',
    postfix: '.mp4',
    video_url_text: '480p',
    video_alt_url: 'https://www.ptrex.com/get_file/4/774833c428771edee2cf401ef2264e746a06f9f370/33000/33247/33247_720p.mp4/',
    video_alt_url_text: '720p HD',
    video_alt_url_hd: '1',
    timeline_screens_url: '//di-iu49il1z.leasewebultracdn.com/contents/videos_screenshots/33000/33247/timelines/timeline_mp4/200x116/{time}.jpg',
    timeline_screens_interval: '10',
    preview_url: '//di-iu49il1z.leasewebultracdn.com/contents/videos_screenshots/33000/33247/preview.mp4.jpg',
    skin: 'youtube.css',
    bt: '1',
    volume: '1',
    hide_controlbar: '1',
    hide_style: 'fade',
    related_src: 'https://www.ptrex.com/related_videos_html/33247/',
    adv_pre_vast: 'https://pt.ptawe.com/vast/v3?psid=ed_pntrexvb1&utm_source=bf1&utm_medium=network&ms_notrack=1',
    lrcv: '1556867449254522707330811',
    adv_pre_replay_after: '2',
    embed: '1'
};
var player_obj = kt_player('kt_player', 'https://www.ptrex.com/player/kt_player.swf?v=4.0.2', '100%', '100%', flashvars);
/* ]]> */
</script>

1 Reply

You could use BeautifulSoup to extract the <script> tag, but you would still need an alternative approach to extract the information inside.
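
For that first step, the standard library can stand in for BeautifulSoup. The following is only a minimal sketch, assuming a placeholder page with an example.com URL and a hypothetical helper class named ScriptCollector (neither appears in the question). It collects the text of each <script> element with html.parser and keeps the ones that mention flashvars:

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished script bodies
        self._chunks = None  # pieces of the script currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._chunks = []

    def handle_data(self, data):
        # Script text can arrive in several pieces, so accumulate it
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append("".join(self._chunks))
            self._chunks = None


# Placeholder page, not the real one from the question
page = """<html><body><p>intro</p>
<script type="text/javascript">
var flashvars = { video_url: 'https://example.com/a.mp4/', embed: '1' };
</script></body></html>"""

collector = ScriptCollector()
collector.feed(page)
collector.close()

# Keep only the script bodies that define flashvars
flash_scripts = [s for s in collector.scripts if "var flashvars" in s]
print(flash_scripts[0].strip())
```

The resulting string still has to be sliced or parsed to get at the individual values.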

A little string slicing can first isolate the flashvars object literal, which is then passed to demjson to convert the JavaScript object into a Python dictionary. For example:

import demjson

content = """<script type="text/javascript">/* <![CDATA[ */ 
... 
...
</script>"""

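# Take everything after the assignment, starting at the opening brace
script_var = content.split('var flashvars = ')[1]
# Cut at the first '};' but keep the closing brace, leaving only the object literal
script_var = script_var[:script_var.find('};') + 1]
# demjson tolerates the unquoted keys and single-quoted strings used here
data = demjson.decode(script_var)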

print(data['video_url'])
print(data['video_alt_url'])

This would then display:

https://www.ptrex.com/get_file/4/996a9088fdf801992d24457cd51469f3f7aaaee6a0/33000/33247/33247.mp4/
https://www.ptrex.com/get_file/4/774833c428771edee2cf401ef2264e746a06f9f370/33000/33247/33247_720p.mp4/

demjson is an alternative JSON decoder that can be installed with pip:

pip install demjson
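
If installing demjson is not an option, a regular expression may be enough for this particular object, since every value in it is a plain single-quoted string. This is a rough sketch and not a general JavaScript parser: the content string is a shortened placeholder with example.com URLs, the names pair, flashvars and mp4_urls are made up for illustration, and values containing escaped quotes or nested objects would not be handled.

```python
import re

# Shortened placeholder for the page source (example.com URLs, not the real ones)
content = """<script type="text/javascript">
var flashvars = {
    video_id: '1',
    video_url: 'https://example.com/files/1.mp4/',
    postfix: '.mp4',
    video_alt_url: 'https://example.com/files/1_720p.mp4/',
    embed: '1'
};
var player_obj = kt_player('kt_player', 'player.swf', '100%', '100%', flashvars);
</script>"""

# Same slicing as in the answer's example: keep only the {...} object literal
literal = content.split('var flashvars = ')[1]
literal = literal[:literal.find('};') + 1]

# Match pairs of the form  key: 'value'  (unquoted key, single-quoted value)
pair = re.compile(r"(\w+)\s*:\s*'([^']*)'")
flashvars = dict(pair.findall(literal))

# Keep only the entries whose value is an http(s) URL containing '.mp4'
mp4_urls = {
    key: value
    for key, value in flashvars.items()
    if value.startswith('http') and '.mp4' in value
}

for key, value in mp4_urls.items():
    print(key, value)
```

Filtering on the value instead of the key name means additional quality variants would be picked up without listing their keys, at the cost of relying on the URLs always containing '.mp4'.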
