Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
512 views
in Technique[技术] by (71.8m points)

scala - How to use regex to include/exclude some input files in sc.textFile?

I have attempted to filter out dates for specific files using Apache spark inside the file to RDD function sc.textFile().

I have attempted to do the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps, SparkContext also uses those APIs to set the path.

The syntax of the glob includes:

  • * (match 0 or more character)
  • ? (match single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • c (escape character)

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")

It's not clear how alternation syntax can be used here, since comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...