Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Apache Pig - Load a file delimited by a double colon (::) in Pig

The following is a sample record from a dataset delimited by a double colon (::).

1::Toy Story (1995)::Animation|Children's|Comedy    

I want to extract three fields from this dataset: movieID, title and genre. I have written the following code for that:

movies = LOAD 'location/of/dataset/on/hdfs ' 
using PigStorage('::')
as 
(MovieID:int,title:chararray,genre:chararray);  

But I am getting the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to  parse:  
 <file script.pig, line 1, column 9> pig script failed to validate:
 java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[::]' 

1 Reply


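PigStorage accepts only a single-character delimiter, which is why PigStorage('::') cannot be instantiated. Use MyRegExLoader instead. It ships in piggybank.jar, so you will need to register that jar before loading.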

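-- Register piggybank.jar so that MyRegExLoader is available.
REGISTER '/path/to/piggybank.jar';
-- Each capture group in the regex becomes one field of the tuple.
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('([^:]+)::([^:]+)::([^:]+)')
      AS (movieid:int, title:chararray, genre:chararray);
DUMP A;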

Output:

(1,Toy Story (1995),Animation|Children's|Comedy)
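To see what the regular expression does to a record, it can help to try it outside Pig. The sketch below is Python, not Pig, and is only an illustration: the pattern is the one from the answer, written without backslashes before the colons (a colon needs no escaping inside a character class). The helper `parse`, the `LOOSE` pattern and the second record are hypothetical and not part of the original answer, and using `fullmatch` is an assumption made for this sketch rather than a statement about how MyRegExLoader applies the pattern.

```python
import re

# The pattern from the answer: three fields, none of which may contain a colon.
STRICT = re.compile(r"([^:]+)::([^:]+)::([^:]+)")

# Hypothetical looser variant: lazy groups allow a single colon inside a field.
LOOSE = re.compile(r"^(.+?)::(.+?)::(.+)$")


def parse(pattern, line):
    """Return (movieid, title, genre), or None if the line does not match."""
    m = pattern.fullmatch(line.strip())
    if m is None:
        return None
    movie_id, title, genre = m.groups()
    return int(movie_id), title, genre


# Sample record from the question (trailing spaces included).
sample = "1::Toy Story (1995)::Animation|Children's|Comedy    "
# Made-up record whose title contains a single colon.
tricky = "2::Title: Subtitle (2000)::Drama"

print(parse(STRICT, sample))
print(parse(STRICT, tricky))
print(parse(LOOSE, tricky))
```

The sample record splits into the three expected fields. The made-up record is rejected by the strict pattern because `[^:]+` cannot cross the single colon in the title, while the looser pattern accepts it. If your titles can contain a colon, a pattern along the lines of `LOOSE` may be worth testing against your own data before relying on it in Pig.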

