I am working to move nexted xml data into a hierarchical data frame. I was able to get all of the data out of the xml thanks to help on SO. However, now, I am working to clean up the data that I extract and shape it before output because I will be doing this thousands of times.
UPDATED: THIS IS WHAT I EVENTUALLY WANT OUT. I cannot seem to fetch just the Time
and value
for channel
dynamically. The channel names will change for each file.
When channel = txt1[0]
(for this file, txt1[0]="blah"
) through when channel = txt1[8]
(for this file, txt1[8]="lir"
)
channel Time value
0 blah 2013-05-01 00:00:00 258
1 blah 2013-05-01 00:01:00 259
...
n-2 lir 2013-05-01 23:57:00 58
n-1 lir 2013-05-01 23:58:00 37
n lir 2013-05-01 23:59:00 32
Here is how my xml file is fetched and structured:
import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData? security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
<channel channel="97925" name="blah">
<Time Time="2013-05-01 00:00:00">
<value>258</value>
</Time>
<Time Time="2013-05-01 00:01:00">
<value>259</value>
</Time>
<Time Time="2013-05-01 00:02:00">
<value>258</value>
</Time>
<Time Time="2013-05-01 00:03:00">
<value>257</value>
</Time>
Yesterday, I asked here on SO and solved the problem of getting the time
and value
values into a data frame: Parsing xml to pandas data frame throws memory error
dTime=[]
dvalue=[]
for df in root.xpath('//channel/Time'):
## Iterate over attributes of Time but Time only has one attrib [@Time]
for attrib in df.attrib:
dTime.append(df.attrib[attrib])
## value is a child of time, and iterate
subfields = df.getchildren()
for subfield in subfields:
dvalue.append(subfield.text)
pef=DataFrame({'Time':dTime,'values':dvalue})
pef
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time 12960 non-null values
value 12960 non-null values
dtypes: object(2)
pef[:5]
Time value
0 2013-05-01 00:00:00 258
1 2013-05-01 00:01:00 259
2 2013-05-01 00:02:00 258
3 2013-05-01 00:03:00 257
4 2013-05-01 00:04:00 257
Now, I am working to this data out for each of the channels (structure is channel
-> Time
-> value
) separately, so that I can insert the channel as a column of the data set.
So, I decided to get the channel names dynamically, and search through the data.For this file, there are nine separate valid channel names, but it is not the same for all of the files (number or names).
txt1 = root.xpath('//channel/@name') #this prints all channel names!
len(txt1)
Out[67]: 9
print txt1
['blah', 'b', 'c', 'd', 'vd', 'ef', 'fg', 'kc', 'lir']
I thought I could dynamically fetch the data (using the earlier solution but adding @name=txt1[0]
) and eventually doing a for i = 0 to len(txt1), ...
to go through all of them. But I get an empty data frame:
dTime=[]
dchannel = txt1[0] # can hardcode, but need to be able to get all
dvalue=[]
for df in root.xpath('//channel[@name=txt1[0]]/Time'):
#CODE NEEDED: to get dchannel to dynamically = channel[@name]
## Iterate over attributes of time for specific channel
for attrib in df.attrib:
dTime.append(df.attrib[attrib])
## value is a child of time, and iterate
subfields = df.getchildren()
for subfield in subfields:
dvalue.append(subfield.text)
perf=DataFrame({'Channel': dchannel,'Time':dTime,'values':dvalue})
perf
Int64Index([], dtype=int64)
Empty DataFrame
If I hard code the desired attribute, like for df in root.xpath('/*/*/*/channel[@name="blah"]/Time'):
it will print it for one attribute, but I cannot get it to work referencing txt1[]
.
I tried with reference to {0}..., txt1[]
but then it spits out a tuple for the dchannel attribute (because it is getting all of txt1 instead of getting the txt1 attribute name that is the parent of the time node.
I looked over the XPath documentation, and I have been through the lxml tutorial, and I cannot figure out why my dynamic search does not work. Do I need to fall back to .findall()
? How can I use this dynamic search to get the data for each value in txt1
?
There is probably a more pythonic way to approach this such as setting up a function that gets the attribute [@name]
of the parent, the attribute [@Time]
of the child, and then the text of the grandchild value
, but I have not figured out how to do that yet.
See Question&Answers more detail:
os