xml parsing - How to extract XML attributes using XPath in Pig?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to extract attributes from an XML file using Pig Latin.

This is a sample of the XML file:

<CATALOG>
<BOOK>
<TITLE test="test1">Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>

I used this script but it didn't work:

REGISTER ./piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD './books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);

B = FOREACH A GENERATE XPath(x, 'BOOK/TITLE/@test'), XPath(x, 'BOOK/PRICE');
DUMP B;

The output was:

(,24.90)

I hope someone can help me with this. Thanks.

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:29:53+0000

There are two bugs in piggybank's XPath class:

1. The ignoreNamespace logic breaks searching for XML attributes: https://issues.apache.org/jira/browse/PIG-4751
2. The ignoreNamespace parameter defaults to true and cannot be overridden: https://issues.apache.org/jira/browse/PIG-4752

Here is my workaround using XPathAll, passing false for ignoreNamespace as the fourth argument:

XPathAll(x, 'BOOK/TITLE/@test', true, false).$0 as (test:chararray)
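
For context, here is a rough sketch of how that expression might slot into the script from the question. It assumes the piggybank.jar in use ships the XPathAll UDF in the same package as XPath; the relation names A and B come from the question, and the price alias is only illustrative.

```pig
REGISTER ./piggybank.jar;
-- Assumption: this piggybank build includes XPathAll alongside XPath.
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

-- One chararray per <BOOK>...</BOOK> element, as in the question.
A = LOAD './books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);

-- XPathAll returns a tuple of matches; .$0 picks the first one.
-- Third argument: cache flag. Fourth argument: ignoreNamespace, set to false here.
B = FOREACH A GENERATE
    XPathAll(x, 'BOOK/TITLE/@test', true, false).$0 AS (test:chararray),
    XPathAll(x, 'BOOK/PRICE', true, false).$0 AS (price:chararray);
DUMP B;
```

Using XPathAll for both fields keeps the script on a single UDF; the original XPath call for BOOK/PRICE already worked in the question, so keeping it for element lookups should also be fine.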

If you still need to ignore namespaces, match on local-name() instead:

XPathAll(x, '//*[local-name()=\'BOOK\']//*[local-name()=\'TITLE\']/@test', true, false).$0 as (test:chararray)
