Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

PowerShell regex: replace unescaped double quote followed by line break

I am processing a large CSV file whose fields are enclosed in double quotes. Some of the text descriptions contain unescaped double quotes, which I need to replace with escaped double quotes. I have tried the following regex: (?<!^|",)("(?:$[^"])|"(?!,"|$)) which finds the unescaped quotes except when they are followed by a line break. Any help in resolving this issue would be gratefully received.

I know the CSV is incorrectly formatted, but unfortunately I have no control over that, so I need to be able to correct the formatting for further processing.

Example:

"Field 1","Field 2","Field 3 "with unescaped quote"
followed by line break","Field 4"

Needs to become:

"Field 1","Field 2","Field 3 ""with unescaped quote""
followed by line break","Field 4"

The PowerShell script I'm using is as follows:

    [string]$path = 'C: ...'
    [string]$directory = [System.IO.Path]::GetDirectoryName($Path);
    [string]$strippedFileName = [System.IO.Path]::GetFileNameWithoutExtension($Path);
    [string]$extension = [System.IO.Path]::GetExtension($Path);
    [string]$newFileName = $strippedFileName + [DateTime]::Now.ToString("yyyyMMdd-HHmmss") + $extension;
    [string]$newFilePath = [System.IO.Path]::Combine($directory, $newFileName);

    $reader = New-Object 'System.IO.StreamReader'($path, $true);
    $regex = [regex] '(?<!^|",)("(?:$[^"])|"(?!,"|$))'
    $writer = [System.IO.StreamWriter] $newFilePath;  

    try{
        while (($line = $reader.ReadLine()) -ne $null ){
            $newline = $line -replace $regex, '""';
            $writer.WriteLine($newline);            
        }
    }
    finally{
        $reader.Close();
        $writer.Close();
    }

Question from: https://stackoverflow.com/questions/65951075/powershell-regex-replace-unescaped-double-quote-followed-by-line-break


1 Reply


Next time, try to build a Minimal, Reproducible Example (also for yourself), as it might help you understand the problem better.
A common pitfall here is that the standard cmdlet Get-Content reads a stream of lines (string[]): no line contains a line break itself, and line breaks are only used as the default delimiter between the items of the array when they are output to the display or a file. The same holds for StreamReader.ReadLine() in your script, which returns each line without its terminator, so a regex applied per line never sees a line break. You might work around this by using the -Raw parameter, but that reads everything into memory and would probably make things more complex than they need to be.
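To make that pitfall concrete, here is a minimal sketch. It assumes the two physical lines from the question's example and a hypothetical temporary file named quote-demo.csv (neither the file name nor the variable names come from the original script). It only shows that the per-line strings carry no line break, while the -Raw string does:

```powershell
# Hypothetical sample: the two physical lines from the question's example.
$sample = @(
    '"Field 1","Field 2","Field 3 "with unescaped quote"'
    'followed by line break","Field 4"'
)

# Hypothetical temporary file, used only for this demonstration.
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) 'quote-demo.csv'
Set-Content -Path $demoPath -Value $sample

# Line by line: every item is a separate string without any line break,
# so a regex applied to one item cannot match "quote followed by line break".
$lines = @(Get-Content -Path $demoPath)
$lines.Count
$lines | ForEach-Object { $_ -match "`n" }   # False for every line

# -Raw: one single string that still contains the line breaks.
$raw = Get-Content -Path $demoPath -Raw
$raw -match "`n"                             # True

Remove-Item -Path $demoPath
```
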
I suspect that you actually want to look for lines that do not start with a double quote, which means that the previous CSV line was probably truncated. In that case, you want to concatenate the previous line with an extra double quote, reinsert the line break, and append the current line:

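    Get-Content .\Input.csv | ForEach-Object { $Previous = $Null } {
        if ($_.StartsWith('"')) {
            # A new record starts: emit the completed previous record
            $Previous
            $Previous = $_
        } else {
            # Continuation line: add the extra quote, restore the line break, append
            $Previous += '"' + [Environment]::NewLine + $_
        }
    } { $Previous } | Set-Content .\Output.csv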
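
The snippet above only adds the quote in front of the line break. If the other embedded quotes must be doubled as well (as in the "Needs to become" example of the question), one possible variation is sketched below. It is not part of the original answer: the function name Repair-CsvQuote is hypothetical, the pattern is a simplified form of the regex from the question, and it assumes that every record starts with a double quote and that no field text contains the sequence `","` itself. The idea is to join the continuation lines first and then run the regex on the whole record, where the line break is visible:

```powershell
# Hypothetical helper (not from the original answer): joins continuation lines
# into one record, then doubles every quote that is not a field delimiter.
function Repair-CsvQuote {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        [string]$Line
    )
    begin {
        $record = $null
        # A quote counts as a delimiter when it starts or ends the record or is
        # part of a "," separator. Without the Multiline option, ^ and $ anchor
        # to the whole record, so a quote before an embedded line break matches.
        $inner = [regex]'(?<!^|",)"(?!,"|$)'
    }
    process {
        if ($Line.StartsWith('"')) {
            # A new record starts: emit the finished previous record first.
            if ($null -ne $record) { $inner.Replace($record, '""') }
            $record = $Line
        } else {
            # Continuation line: put the line break back and append the line.
            $record += [Environment]::NewLine + $Line
        }
    }
    end {
        # Emit the last record, if there was any input at all.
        if ($null -ne $record) { $inner.Replace($record, '""') }
    }
}

# Usage with the two physical lines from the question's example.
$fixed = @(
    '"Field 1","Field 2","Field 3 "with unescaped quote"'
    'followed by line break","Field 4"'
) | Repair-CsvQuote
$fixed
```

For the sample input this returns a single record in which both quotes around `with unescaped quote` are doubled and the line break is kept. For a file, the same function could be used as `Get-Content .\Input.csv | Repair-CsvQuote | Set-Content .\Output.csv`, which still streams line by line instead of loading everything with -Raw.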
