Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
468 views
in Technique[技术] by (71.8m points)

filesystems - Check whether a path is valid in Python without creating a file at the path's target

I have a path (including directory and file name).
I need to test if the file-name is a valid, e.g. if the file-system will allow me to create a file with such a name.
The file-name has some unicode characters in it.

It's safe to assume the directory segment of the path is valid and accessible (I was trying to make the question more gnerally applicable, and apparently I wen too far).

I very much do not want to have to escape anything unless I have to.

I'd post some of the example characters I am dealing with, but apparently they get automatically removed by the stack-exchange system. Anyways, I want to keep standard unicode entities like ?, and only escape things which are invalid in a filename.


Here is the catch. There may (or may not) already be a file at the target of the path. I need to keep that file if it does exist, and not create a file if it does not.

Basically I want to check if I could write to a path without actually opening the path for writing (and the automatic file creation/file clobbering that typically entails).

As such:

try:
    open(filename, 'w')
except OSError:
    # handle error here

from here

Is not acceptable, because it will overwrite the existent file, which I do not want to touch (if it's there), or create said file if it's not.

I know I can do:

if not os.access(filePath, os.W_OK):
    try:
        open(filePath, 'w').close()
        os.unlink(filePath)
    except OSError:
        # handle error here

But that will create the file at the filePath, which I would then have to os.unlink.

In the end, it seems like it's spending 6 or 7 lines to do something that should be as simple as os.isvalidpath(filePath) or similar.


As an aside, I need this to run on (at least) Windows and MacOS, so I'd like to avoid platform-specific stuff.

``

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

tl;dr

Call the is_path_exists_or_creatable() function defined below.

Strictly Python 3. That's just how we roll.

A Tale of Two Questions

The question of "How do I test pathname validity and, for valid pathnames, the existence or writability of those paths?" is clearly two separate questions. Both are interesting, and neither have received a genuinely satisfactory answer here... or, well, anywhere that I could grep.

vikki's answer probably hews the closest, but has the remarkable disadvantages of:

  • Needlessly opening (...and then failing to reliably close) file handles.
  • Needlessly writing (...and then failing to reliable close or delete) 0-byte files.
  • Ignoring OS-specific errors differentiating between non-ignorable invalid pathnames and ignorable filesystem issues. Unsurprisingly, this is critical under Windows. (See below.)
  • Ignoring race conditions resulting from external processes concurrently (re)moving parent directories of the pathname to be tested. (See below.)
  • Ignoring connection timeouts resulting from this pathname residing on stale, slow, or otherwise temporarily inaccessible filesystems. This could expose public-facing services to potential DoS-driven attacks. (See below.)

We're gonna fix all that.

Question #0: What's Pathname Validity Again?

Before hurling our fragile meat suits into the python-riddled moshpits of pain, we should probably define what we mean by "pathname validity." What defines validity, exactly?

By "pathname validity," we mean the syntactic correctness of a pathname with respect to the root filesystem of the current system – regardless of whether that path or parent directories thereof physically exist. A pathname is syntactically correct under this definition if it complies with all syntactic requirements of the root filesystem.

By "root filesystem," we mean:

  • On POSIX-compatible systems, the filesystem mounted to the root directory (/).
  • On Windows, the filesystem mounted to %HOMEDRIVE%, the colon-suffixed drive letter containing the current Windows installation (typically but not necessarily C:).

The meaning of "syntactic correctness," in turn, depends on the type of root filesystem. For ext4 (and most but not all POSIX-compatible) filesystems, a pathname is syntactically correct if and only if that pathname:

  • Contains no null bytes (i.e., x00 in Python). This is a hard requirement for all POSIX-compatible filesystems.
  • Contains no path components longer than 255 bytes (e.g., 'a'*256 in Python). A path component is a longest substring of a pathname containing no / character (e.g., bergtatt, ind, i, and fjeldkamrene in the pathname /bergtatt/ind/i/fjeldkamrene).

Syntactic correctness. Root filesystem. That's it.

Question #1: How Now Shall We Do Pathname Validity?

Validating pathnames in Python is surprisingly non-intuitive. I'm in firm agreement with Fake Name here: the official os.path package should provide an out-of-the-box solution for this. For unknown (and probably uncompelling) reasons, it doesn't. Fortunately, unrolling your own ad-hoc solution isn't that gut-wrenching...

O.K., it actually is. It's hairy; it's nasty; it probably chortles as it burbles and giggles as it glows. But what you gonna do? Nuthin'.

We'll soon descend into the radioactive abyss of low-level code. But first, let's talk high-level shop. The standard os.stat() and os.lstat() functions raise the following exceptions when passed invalid pathnames:

  • For pathnames residing in non-existing directories, instances of FileNotFoundError.
  • For pathnames residing in existing directories:
    • Under Windows, instances of WindowsError whose winerror attribute is 123 (i.e., ERROR_INVALID_NAME).
    • Under all other OSes:
    • For pathnames containing null bytes (i.e., 'x00'), instances of TypeError.
    • For pathnames containing path components longer than 255 bytes, instances of OSError whose errcode attribute is:
      • Under SunOS and the *BSD family of OSes, errno.ERANGE. (This appears to be an OS-level bug, otherwise referred to as "selective interpretation" of the POSIX standard.)
      • Under all other OSes, errno.ENAMETOOLONG.

Crucially, this implies that only pathnames residing in existing directories are validatable. The os.stat() and os.lstat() functions raise generic FileNotFoundError exceptions when passed pathnames residing in non-existing directories, regardless of whether those pathnames are invalid or not. Directory existence takes precedence over pathname invalidity.

Does this mean that pathnames residing in non-existing directories are not validatable? Yes – unless we modify those pathnames to reside in existing directories. Is that even safely feasible, however? Shouldn't modifying a pathname prevent us from validating the original pathname?

To answer this question, recall from above that syntactically correct pathnames on the ext4 filesystem contain no path components (A) containing null bytes or (B) over 255 bytes in length. Hence, an ext4 pathname is valid if and only if all path components in that pathname are valid. This is true of most real-world filesystems of interest.

Does that pedantic insight actually help us? Yes. It reduces the larger problem of validating the full pathname in one fell swoop to the smaller problem of only validating all path components in that pathname. Any arbitrary pathname is validatable (regardless of whether that pathname resides in an existing directory or not) in a cross-platform manner by following the following algorithm:

  1. Split that pathname into path components (e.g., the pathname /troldskog/faren/vild into the list ['', 'troldskog', 'faren', 'vild']).
  2. For each such component:
    1. Join the pathname of a directory guaranteed to exist with that component into a new temporary pathname (e.g., /troldskog) .
    2. Pass that pathname to os.stat() or os.lstat(). If that pathname and hence that component is invalid, this call is guaranteed to raise an exception exposing the type of invalidity rather than a generic FileNotFoundError exception. Why? Because that pathname resides in an existing directory. (Circular logic is circular.)

Is there a directory guaranteed to exist? Yes, but typically only one: the topmost directory of the root filesystem (as defined above).

Passing pathnames residing in any other directory (and hence not guaranteed to exist) to os.stat() or os.lstat() invites race conditions, even if that directory was previously tested to exist. Why? Because external processes cannot be prevented from concurrently removing that directory after that test has been performed but before that pathname is passed to os.stat() or os.lstat(). Unleash the dogs of mind-fellating insanity!

There exists a substantial side benefit to the above approach as well: security. (Isn't that nice?) Specifically:

Front-facing applications validating arbitrary pathnames from untrusted sources by simply passing such pathnames to os.stat() or os.lstat() are susceptible to Denial of Service (DoS) attacks and other black-hat shenanigans. Malicious users may attempt to repeatedly validate pathnames residing on filesystems known to be stale or otherwise slow (e.g., NFS Samba shares); in that case, blindly statting incoming pathnames is liable to either eventually fail with connection timeouts or consume more time and resources than your feeble capacity to withstand unemployment.

The above approach obviates this by only validating the path components of a pathname against the root directory of the root filesystem. (If even that's stale, slow, or inaccessible, you've got larger problems than pathname validation.)

Lost? Great. Let's begin. (Python 3 assumed. See "What Is Fragile Hope for 300, leycec?")

import errno, os

# Sadly, Python fails to provide the following magic number for us.
ERROR_INVALID_NAME = 123
'''
Windows-specific error code indicating an invalid pathname.

See Also
----------
https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
    Official listing of all such codes.
'''

def is_pathname_valid(pathname: str) -> bool:
    '''
    `True` if the passed pathname is a valid pathname for the current OS;
    `False` otherwise.
    '''
    # If this pathname is either not a string or is but is empty, this pathname
    # is invalid.
    try:
        if not isinstance(pathname, str) or not pathname:
            return False

        # Strip this pathname's Windows-specific drive specifier (e.g., `C:`)
        # if any. Since Windows prohibits path components from containing `:`
        # characters, failing to strip this `:`-suffixed prefix would
        # erroneously invalidate all valid absolute Windows pathnames.
        _, pathname = os.path.splitdrive(pathname)

        # Directory guaranteed to exist. If the current OS is Windows, this is
        # the drive to which Windows was installed (e.g., the "%HOMEDRIVE%"
        # environment variable); else, the typical root directory.
        root_dirname = os.environ.get('HOMEDRIVE', 'C:') 
            if sys.platform == 'win32' else os.path.sep
        assert os.path.isdir(root_dirname)   # ...Murphy and her ironclad Law

        # Append a path separator to this directory if needed.
        root_dirname = root_dirname.rstrip(os.path.sep) + os.path.sep

        # Test whether each path component split from this pathname is valid or
        # not, ignoring non-existent and non-readable path components.
        for pathname_part in pathname.split(os.path.sep):
            try:
                os.lstat(root_dirname + pathname_part)
            # If an OS-specific exception is raised, its error code
            # indicates whether this pathname is valid or not. Unless this
            # is the case, this exception implies an ignorable kernel or
            # filesystem complaint (e.g., path not found or inaccessible).
            #
            # Only the following exceptions indicate invalid pathnames:
            #
            # * Instances of the Windows-specific "WindowsError" class
            #   defining the "winerror" attrib

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...