Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


xml - Recursive call to a duplicated Bash script, making it unable to access the assets

Edit: This question is now addressed in a new post, as the problem has to be presented slightly differently. It's here: How can I efficiently run XSLT transformations for a large number of files in parallel?

I'm stuck in my attempts at parallelizing a process, and after spending a decent amount of time on it I'd like to ask for some help...

Basically, I have a lot of XML files to transform with a specific XSLT sheet. But the sheet calls a (very slow) API to fetch additional data, and processing the whole batch of XML files in one go would take (very) long.

Therefore I split the files from the original "input" folder into subfolders of around 5000 XML files each, and I also copied the following Bash script into each subfolder:

for f in *.xml
do
  java -jar ../../saxon9he.jar -xsl:../../some-xslt-sheet.xsl -s:"$f"
done

I then run the script for each subfolder from the "root" folder, which contains the "input" folder, the Saxon library and the XSLT sheet:

find input -type d -exec sh {}/script.sh \;

But I get this error:

Unable to access jarfile ../../saxon9he.jar

I suppose it comes from the fact that I'm operating from the "root" folder, while the scripts being called sit lower in the directory tree. I could solve the problem (if I'm correct) by copying all the assets into each subfolder, but that would make my current approach even clumsier.

Thanks to anyone who might have an idea and can help me understand this!


1 Reply

Firstly, you really don't want to initialize a new Java VM to run each transformation: this is typically going to take much longer than running the actual transformation. To put this in perspective, for "typical" transformations you will often see Java initialization time 3 seconds, stylesheet compilation time 300ms, transformation time 10ms. So if you can find a way to do it that only initializes Java and compiles the stylesheet once, your total time for 10K documents is going to be 2 minutes rather than 10 hours.
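The estimate above can be checked with a few lines of shell arithmetic. This is only a sketch: the three timings are the "typical" figures quoted above, and the function names are made up for illustration.

```shell
# All times in milliseconds, taken from the "typical" figures quoted above.
JVM_START_MS=3000
COMPILE_MS=300
TRANSFORM_MS=10

# One JVM per document: every document pays start-up and compilation again.
per_file_total_ms() {
  echo $(( $1 * (JVM_START_MS + COMPILE_MS + TRANSFORM_MS) ))
}

# One JVM for the whole batch: start-up and compilation are paid only once.
single_jvm_total_ms() {
  echo $(( JVM_START_MS + COMPILE_MS + $1 * TRANSFORM_MS ))
}

docs=10000
echo "per-file JVM: $(( $(per_file_total_ms "$docs") / 1000 )) s"
echo "single JVM:   $(( $(single_jvm_total_ms "$docs") / 1000 )) s"
```

For 10K documents this gives 33100 s (a bit over 9 hours) against 103 s (under 2 minutes), which matches the rough "10 hours versus 2 minutes" above.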

There are various ways to achieve this but they all involve using something other than a shell-script to control the process. The simplest, in my view, is to control it from XSLT itself, by using the collection() function to access all the files in the directory. This has an added bonus, if you're using Saxon-EE, that the files will be processed (parsed) in parallel using all the cores on your machine, which can speed things up by another factor of 4 or so. You just need to add an entry point to the stylesheet something like:

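<!-- Assumes xmlns:saxon="http://saxon.sf.net/" is declared on xsl:stylesheet -->
<xsl:template name="main">
  <xsl:for-each select="collection('file:///my/dir?select=*.xml;recurse=yes')!saxon:discard-document(.)">
    <!-- href is left elided here; each input document needs its own output URI -->
    <xsl:result-document href="....">
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>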

The saxon:discard-document call is optional, but because it makes each document eligible for garbage collection once it has been processed, you are less likely to run out of memory.
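To start the transformation at the named template rather than at a source document, Saxon's command line takes -it: with the template name, so no -s: option is needed. A minimal sketch, assuming the jar and stylesheet names from the question are in the current directory; run_saxon_batch and DRY_RUN are hypothetical names used only for illustration. Note that the question uses saxon9he.jar, and as far as I know the saxon: extension functions are only available in Saxon-PE and Saxon-EE, so with HE the discard-document call would have to be dropped.

```shell
# File names taken from the question; adjust to the real layout.
SAXON_JAR=saxon9he.jar
STYLESHEET=some-xslt-sheet.xsl

run_saxon_batch() {
  # -it:main starts at the named template "main" (no -s: source file).
  set -- java -jar "$SAXON_JAR" "-xsl:$STYLESHEET" -it:main
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"   # only print the command line
  else
    "$@"        # one JVM, one stylesheet compilation, all files
  fi
}
```

Setting DRY_RUN=1 before calling run_saxon_batch prints the command without starting Java, which is handy for checking the paths first.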

Another approach to writing the control loop is to use a specialized shell such as xmlsh.
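As for the "Unable to access jarfile" error itself, should the shell-script approach be kept: a relative path such as ../../saxon9he.jar is resolved against the current working directory of the calling process, not against the directory the script lives in. The sketch below reproduces this in a throwaway directory; fake-saxon.jar and batch1 are stand-in names, not files from the question.

```shell
# Build a small tree: <root>/fake-saxon.jar and <root>/input/batch1/script.sh
demo_root=$(mktemp -d)
mkdir -p "$demo_root/input/batch1"
: > "$demo_root/fake-saxon.jar"   # stand-in for saxon9he.jar

cat > "$demo_root/input/batch1/script.sh" <<'EOF'
if [ -e ../../fake-saxon.jar ]; then echo found; else echo missing; fi
EOF

# Called from the root folder, as in the question: ../../ points above the root.
from_root=$(cd "$demo_root" && sh input/batch1/script.sh)

# Called after changing into the subfolder: ../../ now points at the root.
from_subdir=$(cd "$demo_root/input/batch1" && sh script.sh)

echo "from root: $from_root, from subfolder: $from_subdir"
rm -rf "$demo_root"
```

Applied to the question, one option would be to change directory before running each script, for instance find input -type d -exec sh -c 'cd "$1" && [ -f script.sh ] && sh script.sh' _ {} \; although this still starts one JVM per file.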

