Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - Process unstructured, multi-line CSV in Hadoop

I would like to process data in Hadoop MapReduce. The data, shown below, is unstructured: records span multiple lines and some quotation marks are unterminated.

    2/1/2013 5:16,Edward Felton,2,8/1/2012 3:57,Working on all the digital elements for our big event in Sydney in a couple of weeks... for more visit http://www.xy.com/au/geworks/,324005862,2,18200695
    12/28/2012 19:28,Laura McCullum,2,7/26/2012 18:03,"The Day You Give Them Jive  <br>
<a href="http://youtu.be/qfq9LVD2Qr4" > http://youtu.be/qfq9LVD2Qr4 <br>
 <br>
'Like' if you have always wanted to destroy a cube!",502114904,2,18400313
    11/21/2012 13:35,Timothy Widdowson,4,8/17/2012 12:38,"Can a table really replace a laptop...

With the new Windows tablets on the horizon and the Apple / Android devices out there I have been wondering if it is possible to really work with just and tablet. 

My mission:
-For one whole week I will be working with just my iPad. 

Hardware:
-Apple iPad
-Apple keyboard.
-Apple to HDMI connector.
-HDMI capable monitor.
- InCase iPad stand.

:-)",105001439,1,19301609
    3/15/2013 13:43,Mary Romeo,3,8/16/2012 22:23,"HOW TO SHORTEN LONG LINKS YOU'RE POSTING <br>
The attached image describes how to shorten a long url before posting it.  In 4 easy steps the 3-4 line urls can become a tiny link to post.",213022329,1,19901561
    11/30/2012 2:17,Lu Yin Zhong,3,8/29/2012 1:29,working on 2013 comms plan...need big ideas!!,302014449,2,20300666
    3/5/2013 22:15,Tim Steigert,12,8/29/2012 15:36,"Looking up 1024 email addresses. Manually? Probably a day! Doing it with SSOget, the add-in for  #[&quot;excel&quot;]? 5 minutes! Effort saved and  #[&quot;productivity&quot;] gained? Priceless! Now go get it and enjoy it for yourself! :)<br>http://sc.xy.com/*SSOget @@@data@@@{&quot;image&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;}",100011871,11,20400713
    11/1/2012 20:46,Pranay Jain,2,8/30/2012 14:26,Do people agree with the iCloud restrictions that Airwatch will put on Personal iOS devices that have email?,212065316,0,20700913
    11/9/2012 18:32,Monica Sharma,5,9/7/2012 11:42,hhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh,502000192,5,21400516

Could you provide a code snippet showing how to handle this data? Thanks in advance!



1 Reply


Because you're dealing with multi-line data, you cannot use a plain TextInputFormat, which treats every physical line as a separate record. You need a custom InputFormat for CSV files instead.
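To illustrate what such a custom InputFormat has to do differently, here is a minimal sketch in plain Java, without any Hadoop classes. It glues physical lines together until the number of double quotes seen so far is even. The class and method names (QuoteAwareRecords, assemble) are made up for this example, and it assumes the quotes inside each record are balanced, which the raw data in the question does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of any CSV library.
final class QuoteAwareRecords {

    /**
     * Groups physical lines into logical CSV records.
     * A record is only closed once every opened double quote has been closed.
     * Assumption: quotes are balanced within each record.
     */
    static List<String> assemble(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;

        for (String line : lines) {
            // Re-insert the line break that the line splitting removed.
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);

            // Every double quote toggles the "inside a quoted field" state.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    insideQuotes = !insideQuotes;
                }
            }

            if (!insideQuotes) {
                // Blank lines outside of quoted fields are skipped.
                if (current.length() > 0) {
                    records.add(current.toString());
                }
                current.setLength(0);
            }
        }

        // Flush a trailing record whose quote was never closed.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

In a real job this logic would have to live inside a RecordReader, which also has to cope with input splits that begin in the middle of a record. The sketch ignores that part entirely.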

Currently there is no built-in way of processing multi-line CSV files in Hadoop (see https://issues.apache.org/jira/browse/MAPREDUCE-2208), but luckily there's some code on GitHub you can try: https://github.com/mvallebr/CSVInputFormat.

As for the unterminated quotations, it may be necessary to pre-process the data and clean it up first. One simple rule is to escape a quotation mark (") when there is a separator neither directly before nor directly after it:

  • escape: a"b => a\"b
  • leave unchanged: a;"b and a";b
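The rule above could be sketched roughly like this in plain Java. The class name StrayQuoteEscaper and its method are invented for the example. It assumes a backslash as the escape character, which has to match whatever your CSV parser expects (doubling the quote is another common convention), and it additionally treats the start and end of the input and line breaks like separators, which is my own assumption and not part of the rule.

```java
// Hypothetical pre-processing helper, not taken from any library.
final class StrayQuoteEscaper {

    /**
     * Prefixes a double quote with a backslash when neither of its
     * neighbours is a field boundary. Quotes that sit next to a
     * separator are assumed to delimit a field and are left alone.
     */
    static String escapeStrayQuotes(String text, char separator) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"'
                    && !isBoundary(text, i - 1, separator)
                    && !isBoundary(text, i + 1, separator)) {
                out.append('\\'); // assumed escape character
            }
            out.append(c);
        }
        return out.toString();
    }

    // Assumption: start/end of input and line breaks count as boundaries too.
    private static boolean isBoundary(String text, int index, char separator) {
        if (index < 0 || index >= text.length()) {
            return true;
        }
        char c = text.charAt(index);
        return c == separator || c == '\n' || c == '\r';
    }
}
```

For the data in the question you would call it with ',' as the separator, for example StrayQuoteEscaper.escapeStrayQuotes(line, ','). This is only a heuristic: input that is already escaped would be escaped a second time, and a stray quote that happens to sit next to a comma is not detected.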

Another option is to fix the application that produces the invalid CSV so that it escapes the data properly.

