Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


powershell - Optimize Word document keyword search

I'm trying to search for keywords across a large number of MS Word documents and write the results to a file. I've got a working script, but I wasn't aware of the scale when I wrote it, and it isn't nearly efficient enough: it would take days to plod through everything.

The script as it stands takes keywords from CompareData.txt, searches for each one in all the files in a specific folder, and appends the result to a file.

So when I'm done I will know how many files have each specific keyword.

[CmdletBinding()]
Param(
    $Path = "C:\will\scratch"
) #end param
$findTexts = (Get-Content C:\scratch\CompareData.txt)
Foreach ($findText in $findTexts)
{
    $matchCase = $false
    $matchWholeWord = $true
    $matchWildCards = $false
    $matchSoundsLike = $false
    $matchAllWordForms = $false
    $forward = $true
    $wrap = 1
    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    $docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
    $i = 1
    $totaldocs = 0
    Foreach ($doc in $docs)
    {
        Write-Progress -Activity "Processing files" -Status "Processing $($doc.FullName)" -PercentComplete ($i / $docs.Count * 100)
        $document = $application.Documents.Open($doc.FullName)
        $range = $document.Content
        $null = $range.MoveStart()
        $wordFound = $range.Find.Execute($findText, $matchCase,
            $matchWholeWord, $matchWildCards, $matchSoundsLike,
            $matchAllWordForms, $forward, $wrap)
        if ($wordFound)
        {
            $doc.FullName
            $document.Words.Count
            $totaldocs++
        } #end if $wordFound
        $document.Close()
        $i++
    } #end foreach $doc
    $application.Quit()
    "There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt

    #clean up stuff
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
    Remove-Variable -Name application
    [gc]::Collect()
    [gc]::WaitForPendingFinalizers()
}

What I'd like to do is search each file once for everything in CompareData.txt, rather than iterate through the files over and over. If I were dealing with a small set of data, the approach I've got would get the job done, but I've since found out that both CompareData.txt and the source Word file directory will be very large.

Any ideas on how to optimize this?


1 Reply


Right now you're doing this (pseudocode):

foreach $Keyword {
    create Word Application
    foreach $File {
        load Word Document from $File
        find $Keyword
    }
}

That means that if you have 100 keywords and 10 documents, you launch and quit Word 100 times and load a Word document 1,000 times before you're done.

Do this instead:

create Word Application
foreach $File {
    load Word Document from $File
    foreach $Keyword {
        find $Keyword
    }
}

So you only launch one instance of Word and only load each document once.
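As a rough sketch of what that restructuring could look like with the script from the question: the function name Search-WordDocuments and its parameters are made up for illustration, and the Find.Execute arguments are simply carried over from the original script. It has to run on a machine with Word installed.

```powershell
# Hypothetical helper: one Word instance, each document opened once,
# every keyword searched while the document is open.
function Search-WordDocuments {
    [CmdletBinding()]
    param(
        [string]$Path,        # folder containing the .docx files
        [string[]]$Keywords   # e.g. the lines of CompareData.txt
    )

    # Same search options as the original script
    $matchCase = $false; $matchWholeWord = $true; $matchWildCards = $false
    $matchSoundsLike = $false; $matchAllWordForms = $false
    $forward = $true; $wrap = 1

    # keyword -> number of files containing it
    $counts = @{}
    foreach ($keyword in $Keywords) { $counts[$keyword] = 0 }

    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    try {
        foreach ($doc in Get-ChildItem -Path $Path -Recurse -Include *.docx) {
            $document = $application.Documents.Open($doc.FullName)
            try {
                foreach ($keyword in $Keywords) {
                    # Take a fresh range per keyword: a successful Find
                    # shrinks the range to the text it found.
                    $range = $document.Content
                    $found = $range.Find.Execute($keyword, $matchCase,
                        $matchWholeWord, $matchWildCards, $matchSoundsLike,
                        $matchAllWordForms, $forward, $wrap)
                    if ($found) { $counts[$keyword]++ }
                }
            }
            finally { $document.Close() }
        }
    }
    finally {
        $application.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($application)
    }
    $counts
}

# Example call (commented out because it needs Word and real files):
# $counts = Search-WordDocuments -Path $Path -Keywords (Get-Content C:\scratch\CompareData.txt)
# foreach ($keyword in $counts.Keys) {
#     "There are $($counts[$keyword]) total files with $keyword" |
#         Out-File -Append C:\scratch\output.txt
# }
```

The try/finally blocks are there so that a document that fails to load does not leave a hidden Word process behind.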


As noted in the comments, you can speed up the whole process further by using the Open XML SDK rather than launching Word (this assumes you've installed the Open XML SDK in its default location):

# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'

# Grab the keywords and file names ($Path is the folder parameter from the question's script)
$Keywords  = Get-Content C:\scratch\CompareData.txt
$Documents = Get-ChildItem -Path $Path -Recurse -Include *.docx

# hashtable to store results per document
$KeywordMatches = @{}

# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

foreach($Docx in $Documents)
{
    # create array to hold matched keywords
    $KeywordMatches[$Docx.FullName] = @()

    # open document read-only, wrap content stream in streamreader
    $Document       = $WordDoc::Open($Docx.FullName, $false)
    $DocumentStream = $Document.MainDocumentPart.GetStream()
    $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

    # read the entire main document part (raw XML, markup included)
    $DocumentContent = $DocumentReader.ReadToEnd()

    # test for each keyword (-match is case-insensitive)
    foreach($Keyword in $Keywords)
    {
        $Pattern   = [regex]::Escape($Keyword)
        $WordFound = $DocumentContent -match $Pattern
        if($WordFound)
        {
            $KeywordMatches[$Docx.FullName] += $Keyword
        }
    }

    $DocumentReader.Dispose()
    $Document.Dispose()
}
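One caveat with matching against the raw XML: Word can split a single word across several runs, so the keyword may not appear as one contiguous string in the markup, and the plain -match test does not enforce whole-word matching the way the original Find call did. A possible workaround, sketched here on a made-up sample (Get-DocxPlainText, Test-WholeWord and $SampleXml are illustrative names, not part of the SDK), is to join the text of the w:t elements per paragraph and test with word boundaries:

```powershell
# Hypothetical sample of main document XML with "Keyword" split across two runs
$SampleXml = @'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Key</w:t></w:r><w:r><w:t>word search</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
'@

# Concatenate the w:t text inside each paragraph, one paragraph per line
function Get-DocxPlainText {
    param([string]$DocumentXml)
    $Xml = [xml]$DocumentXml
    $Ns  = New-Object System.Xml.XmlNamespaceManager $Xml.NameTable
    $Ns.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')
    $Paragraphs = foreach ($P in $Xml.SelectNodes('//w:p', $Ns)) {
        ($P.SelectNodes('.//w:t', $Ns) | ForEach-Object { $_.InnerText }) -join ''
    }
    $Paragraphs -join "`n"
}

# Whole-word, case-insensitive test; \b assumes the keyword starts and ends
# with a letter, digit or underscore
function Test-WholeWord {
    param([string]$Text, [string]$Keyword)
    $Text -match ('\b' + [regex]::Escape($Keyword) + '\b')
}

$Text = Get-DocxPlainText $SampleXml
$SampleXml -match [regex]::Escape('Keyword')   # False: the word is split by markup
Test-WholeWord $Text 'Keyword'                 # True
Test-WholeWord $Text 'Key'                     # False: only part of a word
```

In the loop above, $DocumentContent could be passed through such a function once per document before the keyword loop, so the extra parsing cost is paid once per file rather than once per keyword.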

Now you can show how many keywords matched in each document:

$KeywordMatches.GetEnumerator() | Select-Object @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
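Since the question asks for the number of files per keyword, the per-document hashtable can also be turned around. The following is a minimal sketch on hypothetical sample data in the same shape as $KeywordMatches (file path mapped to an array of matched keywords); $FilesPerKeyword and $Report are names introduced here for illustration.

```powershell
# Hypothetical sample data: file path -> keywords found in that file
$KeywordMatches = @{
    'C:\docs\a.docx' = @('alpha', 'beta')
    'C:\docs\b.docx' = @('alpha')
    'C:\docs\c.docx' = @()
}
$Keywords = 'alpha', 'beta', 'gamma'

# Start every keyword at zero so keywords without any hit are still reported
$FilesPerKeyword = [ordered]@{}
foreach ($Keyword in $Keywords) { $FilesPerKeyword[$Keyword] = 0 }

# Each document adds one to every keyword it matched
foreach ($Entry in $KeywordMatches.GetEnumerator()) {
    foreach ($Keyword in $Entry.Value) { $FilesPerKeyword[$Keyword]++ }
}

# Same report line as the original script
$Report = foreach ($Keyword in $FilesPerKeyword.Keys) {
    "There are $($FilesPerKeyword[$Keyword]) total files with $Keyword"
}
$Report
```

Piping $Report to Out-File, as the original script does, writes the whole report with a single file operation instead of one append per keyword.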
