Here are a few different possible approaches; use whichever works for you. All my code examples below use requests
for HTTP requests to the API; you can install requests
with pip install requests
if you have Pip. They also all use the Mediawiki API, and two use the query endpoint; follow those links if you want documentation.
1. Get a plain text representation of either the entire page or the page "extract" straight from the API with the extracts
prop
Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller Mediawiki sites like, say, http://www.wikia.com/
You want to hit a URL like
https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext
Breaking that down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):
action=query
, format=json
, and title=Bla_Bla_Bla
are all standard MediaWiki API parameters
prop=extracts
makes us use the TextExtracts extension
exintro
limits the response to content before the first section heading
explaintext
makes the extract in the response be plain text instead of HTML
Then parse the JSON response and extract the extract:
>>> import requests
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'extracts',
... 'exintro': True,
... 'explaintext': True,
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
2. Get the full HTML of the page using the parse
endpoint, parse it, and extract the first paragraph
MediaWiki has a parse
endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml
) to extract the first paragraph.
For example:
>>> import requests
>>> from lxml import html
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'parse',
... 'page': 'Bla Bla Bla',
... 'format': 'json',
... }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
3. Parse wikitext yourself
You can use the query
API to get the page's wikitext, parse it using mwparserfromhell
(install it first using pip install mwparserfromhell
), then reduce it down to human-readable text using strip_code
. strip_code
doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.
>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'revisions',
... 'rvprop': 'content',
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.
Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.
Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23
References
External links
Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino