Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - BASH regexp matching - including brackets in a bracketed list of characters to match against?

I'm trying to do a tiny bash script that'll clean up the file and folder names of downloaded episodes of some tv shows I like. They often look like "[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE", and I basically just want to strip out that speedcd advertising bit.

It's easy enough to remove www.Speed.Cd, spaces, and dashes using regexp matching in BASH, but for the life of me, I cannot figure out how to include the brackets in the list of characters to be matched against. [- [] doesn't work, and neither do variants that escape the bracket, such as [- \[], with any number of escape characters preceding the bracket I want to remove.

Here's what I've got so far:

[[ "$newfile" =~ ^(.*)([- []*(www.torrenting.com|spastikustv|www.speed.cd|moviesp2p.com)[- ]]*)(.*)$ ]] &&
    newfile="${BASH_REMATCH[1]}${BASH_REMATCH[4]}"

But it breaks on the brackets.

Any ideas?

TIA, Daniel :)

EDIT: I should probably note that I'm using "shopt -s nocasematch" to ensure case insensitive matching, just in case you're wondering :)

EDIT 2: Thanks to all who contributed. I'm not 100% sure which answer was to be the "correct" one, as I had several problems with my statement. Actually, the most accurate answer was just a comment to my question posted by jw013, but I didn't get it at the time because I hadn't understood yet that spaces should be escaped. I've opted for aefxx's as that one basically says the same, but with explanations :) Would've liked to put a correct answer mark on ormaaj's answer, too, as he spotted more grave issues with my expression.

Anyway, the approach I was using above, matching and extracting the parts to keep while leaving the unwanted ones behind, is really not very elegant and won't catch all cases, not even something as simple as "Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]". I've instead rewritten it to match and extract just the unwanted parts and then do string replacement of those on the original string, like so (the loop is there in case there are multiple brandings):

# Remove common torrent site brandings, including surrounding spaces, brackets, etc.:
while [[ "$newfile" =~ ([[ {(-]*(www.)?(torrentday.com|torrenting.com|spastikustv|speed.cd|moviesp2p.com|publichd.org|publichd|scenetime.com|kingdom-release)[] })-]*) ]]; do
    newfile=${newfile//"${BASH_REMATCH[1]}"/}
done
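
For anyone adapting the loop above, here is a minimal sketch of the same idea with the pattern kept in a variable, so that bash does not have to parse the spaces and brackets itself. The function name strip_branding and the shortened site list are made up for illustration, and the dots in the host names are escaped so that they only match a literal dot:

```bash
#!/usr/bin/env bash
# Sketch only: strip_branding and the shortened site list are illustrative,
# not part of the original script.
shopt -s nocasematch   # case-insensitive matching, as noted in the question

strip_branding() {
    local name=$1
    # Host names with the dots escaped so "." only matches a literal dot.
    local sites='torrentday\.com|torrenting\.com|speed\.cd|publichd\.org'
    # Leading list: "[" is literal anywhere, "-" goes last.
    # Trailing list: "]" must come first to be literal, "-" goes last.
    local re="([[ {(-]*(www\\.)?(${sites})[] })-]*)"
    while [[ $name =~ $re ]]; do
        # Quoting the expansion removes the match as a literal string,
        # not as a glob pattern.
        name=${name//"${BASH_REMATCH[1]}"/}
    done
    printf '%s\n' "$name"
}

strip_branding '[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
strip_branding 'Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]'
```

Both calls should print Some.Show.S07E14.720p.HDTV.X264-SOMEONE. Keeping the regex in a variable and expanding it unquoted on the right-hand side of =~ is a common way to sidestep shell quoting issues in patterns like this one.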


1 Reply


Ok, this is the first time I've heard of the =~ operator but nevertheless here's what I found by trial and error:

if [[ $newfile =~ ^(.*)([-[:space:][]*(what|ever)[][:space:]-]*)(.*)$ ]] 
                          ^^^^^^^^^^              ^^^^^^^^^^

Looks strange but actually does work (just tested it).

EDIT
Quote from the Linux man pages regex(7):

To include a literal ']' in the list, make it the first character (following a possible '^'). To include a literal '-', make it the first or last character, or the second endpoint of a range. To use a literal '-' as the first endpoint of a range, enclose it in "[." and ".]" to make it a collating element (see below). With the exception of these and some combinations using '[' (see next paragraphs), all other special characters, including '\', lose their special significance within a bracket expression.
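
To make those rules concrete, here is a small sketch that applies them to the file name from the question. The variable names (lead, trail, re, cleaned) are only illustrative, and the pattern is stored in a variable so the shell does not interpret the spaces and brackets before the regex engine sees them:

```bash
#!/usr/bin/env bash
# Sketch: literal brackets, spaces and dashes inside a POSIX bracket expression.
# The variable names are illustrative only.

# Leading junk: "[" can go anywhere in the list; "-" goes last so it is
# not read as a range operator.
lead='[[ -]*'
# Trailing junk: "]" must be the first character to be literal; "-" is last.
trail='[] -]*'

# Build the full pattern; "\\." becomes "\." so the dots match literally.
re="^${lead}www\\.speed\\.cd${trail}(.*)\$"

shopt -s nocasematch   # case-insensitive matching, as in the question

name='[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
if [[ $name =~ $re ]]; then
    cleaned=${BASH_REMATCH[1]}
    printf '%s\n' "$cleaned"
fi
```

This should print Some.Show.S07E14.720p.HDTV.X264-SOMEONE. Note that no backslash is needed inside the lists: a backslash loses its special meaning within a bracket expression, which is why escaping the bracket does not help there.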

