Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


linux - What character sequence should I not allow in a filename?

I found out after testing that Linux allows any character in a filename except / and null (\0). So which sequences should I not allow in a filename? I have heard that a leading - may confuse some command-line programs. That doesn't matter to me, but it may bother other people if they collect a bunch of files and filter them with some GNU programs.

It was suggested to me to remove leading and trailing spaces, and I plan to, only because the user typically doesn't mean to have leading or trailing spaces.

What problematic sequences might there be, and which ones should I consider not allowing? I am also considering not allowing characters that are illegal on Windows, just for convenience. I think I may disallow dashes at the beginning (a dash is a legal Windows character).



1 Reply


Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.

For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.

The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.

The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-

The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)

So a filename that is up to 14 characters long and contains only the 65 characters listed above is safe to use on all POSIX-compliant systems. Counting every length from 1 to 14 (65 + 65^2 + ... + 65^14), that gives you 24407335764928225040435790 combinations (or roughly 84 bits).
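The figure above can be reproduced with a few lines of Python. This is only a sketch; the names PORTABLE_CHARS, NAME_MAX and count_names are made up for illustration and do not come from any specification.

```python
import math

# The POSIX portable filename character set, as listed above.
PORTABLE_CHARS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789._-"
)
NAME_MAX = 14  # the minimum length every POSIX system must support


def count_names(alphabet_size: int, max_len: int) -> int:
    """Number of non-empty names of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_len + 1))


total = count_names(len(PORTABLE_CHARS), NAME_MAX)
bits = math.log2(total)
print(total, bits)
```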

If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)
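As a rough sketch, the rules so far (portable character set, at most 14 characters, no leading dash or dot) fit into a single regular expression. The function name is_safe_posix_name is hypothetical, and the check covers only the restrictions discussed up to this point.

```python
import re

# First character: letter, digit or underscore (no leading '.' or '-').
# Then up to 13 more characters from the portable set, 14 in total.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9._-]{0,13}")


def is_safe_posix_name(name: str) -> bool:
    """True if the whole name matches the restricted portable pattern."""
    return _SAFE_NAME.fullmatch(name) is not None
```

For example, is_safe_posix_name("notes.txt") is true, while "-rf", ".bashrc" and any name longer than 14 characters are rejected.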

This leaves you at 23656340818315048885345458 combinations (still 84 bits).

Windows adds a couple of new restrictions to this: filenames cannot end with a dot, and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (26 letters, 10 digits and the three punctuation characters), of which 37 are allowed as the first character and 38 as the last. It doesn't add any length restrictions; Windows can deal with 14 characters just fine.

This reduces the possible combinations to 17866587696996781449603 (73 bits).

Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.

You still have 13090925539866773438463 combinations (73 bits).
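The two Windows figures can be checked by counting position by position. The sketch below assumes the reading used in this answer: 39 case-insensitive characters, no leading dash or dot, no trailing dot, and then the same with dots removed altogether. The helper count_names and its parameter names are invented for this example.

```python
import math


def count_names(first, middle, last, single, max_len=14):
    """Count names of length 1..max_len.

    first/middle/last: choices for the first, inner and last character.
    single: choices for a one-character name (it is both first and last).
    """
    total = single
    for length in range(2, max_len + 1):
        total += first * middle ** (length - 2) * last
    return total


# 39 case-insensitive characters; no leading '-' or '.', no trailing '.'.
with_dots = count_names(first=37, middle=39, last=38, single=37)

# Dots dropped entirely: 38 characters, only a leading '-' is excluded.
without_dots = count_names(first=37, middle=38, last=38, single=37)

for n in (with_dots, without_dots):
    print(n, math.floor(math.log2(n)))
```

The same helper reproduces the earlier POSIX figure with count_names(63, 65, 65, 63).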

If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), and neither part can contain a dot. The first part has a maximum length of 8 characters, the second of 3. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.

Now you have 4347792138495 possible filenames or 41 bits.

The good news is that you can use the 3 character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).
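A minimal sketch of an 8.3 check under the assumptions of this answer: the reduced character set from above (letters, digits, underscore, dash), no leading dash, a base name of 1 to 8 characters and an optional extension of 1 to 3 characters. Real DOS accepts some further characters; the function name is_dos_83_name is hypothetical.

```python
import re

# Base name: 1-8 characters, not starting with '-'.
# Optional extension: a single dot followed by 1-3 characters.
_DOS_83 = re.compile(
    r"[A-Za-z0-9_][A-Za-z0-9_-]{0,7}"
    r"(?:\.[A-Za-z0-9_-]{1,3})?"
)


def is_dos_83_name(name: str) -> bool:
    """True if name fits the 8.3 pattern described above."""
    return _DOS_83.fullmatch(name) is not None
```

Any name accepted here is at most 12 characters long, so it also stays within the 14-character POSIX limit.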

If you want your users to be able to burn the files onto a CD-R formatted with ISO 9660 Level 1, then you have to disallow hyphens anywhere, not just as the first character. Now the remaining character set looks like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_

With 37 characters and names of 1 to 8 characters (37 + 37^2 + ... + 37^8), that gives you 3610048327640 combinations (41 bits).
