Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


xslt - Serialize XML file on the basis of Character Count during an XSL transformation

I have an XML document (A.xml) that is transformed into another XML document (B.xml). B.xml is simply a replica of A.xml with a unique @id added to each element. This part is done.

Now I would like to implement a mechanism that tracks the character count of every text node within B.xml (held in a temporary tree) and, based on a maximum character count, splits and serializes B.xml into one or several parts.

Source XML Document (A.xml):

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <!--
    Rules for splitting:
    1. "head/text()" is common to all splits.
    2. Split files can have at most 600 characters each.
    3. A "title" element must not be the last element of any result document.
    -->
    <head><!-- 8 characters -->Kinesics</head>
    <section>
        <para><!-- 37 characters -->From Wikipedia, the free encyclopedia</para>
        <para><!-- 204 characters [space normalized]-->Kinesics is the interpretation of body
            language such as facial expressions and gestures — or, more formally, non-verbal
            behavior related to movement, either of any part of the body or the body as a
            whole. </para>
        <section>
            <title><!-- 19 characters -->Birdwhistell's work</title>
            <para><!-- 432 characters [space normalized]-->The term was first used (in 1952) by Ray
                Birdwhistell, an anthropologist who wished to study how people communicate through
                posture, gesture, stance, and movement. Part of Birdwhistell's work involved making
                film of people in social situations and analyzing them to show different levels of
                communication not clearly seen otherwise. The study was joined by several other
                anthropologists, including Margaret Mead and Gregory Bateson.</para>
            <para><!-- 453 characters [space normalized]--> Drawing heavily on descriptive
                linguistics, Birdwhistell argued that all movements of the body have meaning (i.e.
                are not accidental), and that these non-verbal forms of language (or paralanguage)
                have a grammar that can be analyzed in similar terms to spoken language. Thus, a
                "kineme" is "similar to a phoneme because it consists of a group of movements which
                are not identical, but which may be used interchangeably without affecting social
                meaning".</para>
        </section>
        <section>
            <title><!-- 19 characters -->Modern applications</title>
            <para><!-- 390 characters [space normalized]-->Kinesics are an important part of
                non-verbal communication behavior. The movement of the body, or separate parts,
                conveys many specific meanings and the interpretations may be culture bound. As many
                movements are carried out at a subconscious or at least a low-awareness level,
                kinesic movements carry a significant risk of being misinterpreted in an
                intercultural communications situation.</para>
        </section>
    </section>
</root>

XSL File

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="xml" encoding="UTF-8" indent="no"/>

    <!--update 1-->
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
        <xsl:variable name="root-replica">
            <xsl:call-template name="create-root-replica">
                <xsl:with-param name="context" select="*"/>
            </xsl:call-template>
        </xsl:variable>
        <xsl:copy-of select="$root-replica"/>
        <!--
            <xsl:call-template name="split-n-serialize">
            <xsl:with-param name="context" select="$root-replica"/>
            </xsl:call-template>
        -->
    </xsl:template>

    <xsl:template name="split-n-serialize">
        <xsl:param name="context"/>
        <xsl:for-each select="$context">
            <xsl:result-document encoding="utf-8" href="{concat('split_',position(),'.xml')}" method="xml" indent="no">
                <xsl:sequence select="."/>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="create-root-replica">
        <xsl:param name="context"/>
        <root>
            <head>
                <xsl:value-of select="$context/head"/>
            </head>
            <xsl:apply-templates select="$context/*[not(self::head)]"/>
        </root>
    </xsl:template>

    <xsl:template match="element()">
        <xsl:element name="{local-name()}">
            <xsl:attribute name="id">
                <xsl:value-of select="generate-id()"/>
            </xsl:attribute>
            <xsl:apply-templates/>
        </xsl:element>
    </xsl:template>

    <!--update 2-->
    <xsl:template match="text()">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>

</xsl:transform>

My input XML contains 1562 characters (assuming \s+ is equal to a single space), and I would like to split A.xml into 4 parts using the rules mentioned within the source XML document.
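
A total like this can be checked with a short XSLT 2.0 stylesheet. What follows is only a minimal sketch, assuming that every run of whitespace counts as one character and that leading and trailing whitespace in a text node does not count. It prints the sum of the space-normalized lengths of all text nodes in A.xml (comments are not text nodes, so they are ignored).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- For each text node, take its space-normalized length, then add them up.
             Whitespace-only text nodes normalize to "" and contribute 0. -->
        <xsl:value-of select="sum(//text()/string-length(normalize-space(.)))"/>
    </xsl:template>
</xsl:transform>
```

If the per-node counts noted in the comments of A.xml are accurate, the result should agree with the figure above.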

Does anyone have any idea how to do this? Any ideas or comments are greatly appreciated.

Update 3

Details of split files

1st File
       8
      37
     204  =  249
2nd File
       8
      19
     432  =  459
3rd File
       8
     453  =  461
4th File
       8
      19
     390  =  417

Details on Split procedure:

  1. The contents of the "head" element should be part of each and every XML file.

  2. Files can be split in the middle of a section, but not in the middle of a paragraph.

  3. No "title" element should come at the end of a split.

  4. The maximum number of characters (excluding opening and closing tags) in a split file is 600.

Sample output files (indents are used for better readability)

1st file

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <head>Kinesics</head>
    <section id="d1e6">
        <para id="d1e7">From Wikipedia, the free encyclopedia</para>
        <para id="d1e10">Kinesics is the interpretation of body language such as facial expressions and gestures — or, more formally, non-verbal behavior related to movement, either of any part of the body or the body as a whole.</para>
    </section>
</root>

2nd file

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <head>Kinesics</head>
    <section id="d1e6">
        <section id="d1e13">
            <title id="d1e14">Birdwhistell's work</title>
            <para id="d1e17">The term was first used (in 1952) by Ray Birdwhistell, an anthropologist who wished to study how people communicate through posture, gesture, stance, and movement. Part of Birdwhistell's work involved making film of people in social situations and analyzing them to show different levels of communication not clearly seen otherwise. The study was joined by several other anthropologists, including Margaret Mead and Gregory Bateson.</para>
        </section>
    </section>
</root>

3rd file

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <head>Kinesics</head>
    <section id="d1e6">
        <section id="d1e13">
            <para id="d1e20">Drawing heavily on descriptive linguistics, Birdwhistell argued that all movements of the body have meaning (i.e. are not accidental), and that these non-verbal forms of language (or paralanguage) have a grammar that can be analyzed in similar terms to spoken language. Thus, a "kineme" is "similar to a phoneme because it consists of a group of movements which are not identical, but which may be used interchangeably without affecting social meaning".</para>
        </section>
    </section>
</root>

4th file

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <head>Kinesics</head>
    <section id="d1e6">
        <section id="d1e23">
            <title id="d1e24">Modern applications</title>
            <para id="d1e27">Kinesics are an important part of non-verbal communication behavior. The movement of the body, or separate parts, conveys many specific meanings and the interpretations may be culture bound. As many movements are carried out at a subconscious or at least a low-awareness level, kinesic movements carry a significant risk of being misinterpreted in an intercultural communications situation.</para>
        </section>
    </section>
</root>

1 Reply


You would use string-length() to get the "character count" and then xsl:result-document to split your result tree into parts.
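
One way this could be coded up is sketched below; treat it as a rough, untested outline rather than a finished solution. It works directly on A.xml and assigns each title/para a file number greedily, then writes one result document per number. The names my:len, my:assign, the urn:example:split namespace and the $max parameter are made up for illustration. It assumes that title and para are the only units that must stay whole, that the head text counts toward every file's limit, and that each title is directly followed by a para that fits in a file together with it. Consecutive titles and paragraphs longer than the limit are not handled, and the recursion depth grows with the number of units.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:split"
    exclude-result-prefixes="xs my" version="2.0">

    <xsl:strip-space elements="*"/>

    <!-- Assumed limit per file, counting the shared head text -->
    <xsl:param name="max" as="xs:integer" select="600"/>

    <!-- Space-normalized text length of an element (0 for an empty sequence) -->
    <xsl:function name="my:len" as="xs:integer">
        <xsl:param name="e" as="element()?"/>
        <xsl:sequence select="string-length(normalize-space($e))"/>
    </xsl:function>

    <!-- Greedy assignment: returns one file number per unit, in document order -->
    <xsl:function name="my:assign" as="xs:integer*">
        <xsl:param name="units" as="element()*"/>
        <xsl:param name="chunk" as="xs:integer"/>
        <xsl:param name="used" as="xs:integer"/>
        <xsl:param name="base" as="xs:integer"/>
        <xsl:if test="exists($units)">
            <xsl:variable name="u" select="$units[1]"/>
            <!-- A title also reserves room for the unit that follows it -->
            <xsl:variable name="need" as="xs:integer"
                select="my:len($u) + (if ($u/self::title) then my:len($units[2]) else 0)"/>
            <!-- Start a new file if the unit does not fit, unless the file is still empty -->
            <xsl:variable name="new" as="xs:boolean"
                select="$used gt $base and $used + $need gt $max"/>
            <xsl:variable name="c" as="xs:integer"
                select="if ($new) then $chunk + 1 else $chunk"/>
            <xsl:variable name="total" as="xs:integer"
                select="(if ($new) then $base else $used) + my:len($u)"/>
            <xsl:sequence select="$c, my:assign(subsequence($units, 2), $c, $total, $base)"/>
        </xsl:if>
    </xsl:function>

    <xsl:template match="/">
        <xsl:variable name="top" select="root"/>
        <xsl:variable name="units" select="$top//(title | para)"/>
        <xsl:variable name="base" select="my:len($top/head)"/>
        <xsl:variable name="chunks" select="my:assign($units, 1, $base, $base)"/>
        <xsl:for-each select="distinct-values($chunks)">
            <xsl:variable name="n" select="."/>
            <xsl:result-document href="split_{$n}.xml" method="xml" encoding="UTF-8" indent="no">
                <root>
                    <head><xsl:value-of select="$top/head"/></head>
                    <xsl:apply-templates select="$top/*[not(self::head)]" mode="split">
                        <!-- The units assigned to file number $n -->
                        <xsl:with-param name="keep" tunnel="yes"
                            select="$units[position() = index-of($chunks, $n)]"/>
                    </xsl:apply-templates>
                </root>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <!-- Copy an element only if it is a kept unit, lies inside one, or contains one -->
    <xsl:template match="*" mode="split">
        <xsl:param name="keep" as="element()*" tunnel="yes"/>
        <xsl:if test="exists((ancestor-or-self::* | descendant::*) intersect $keep)">
            <xsl:element name="{local-name()}">
                <xsl:attribute name="id" select="generate-id()"/>
                <xsl:apply-templates mode="split"/>
            </xsl:element>
        </xsl:if>
    </xsl:template>

    <xsl:template match="text()" mode="split">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
</xsl:transform>
```

Tracing my:assign by hand over the lengths given in your comments (37, 204, 19, 432, 453, 19, 390, with 8 for the head) yields the file numbers 1, 1, 2, 2, 3, 4, 4, which is the four-way split you describe. The generated @id values depend on the processor, so they will not necessarily match the ones in your sample files.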

Do you need further help coding it up?

