c# - How can I extract subscript / superscript properly from a PDF using iTextSharp?

posted Oct 17, 2021 in Technique by 深蓝

iTextSharp works well extracting plain text from PDF documents, but I'm having trouble with subscript/superscript text, common in technical documents.

TextChunk.SameLine() requires two chunks to have identical vertical positioning to be "on" the same line, which isn't the case for superscript or subscript text. For example, on page 11 of this document, under "COMBUSTION EFFICIENCY":

http://www.mass.gov/courts/docs/lawlib/300-399cmr/310cmr7.pdf

Expected text:

monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO2 /(CO + CO2)]

Result text:

monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )] 
2 2

I moved SameLine() to LocationTextExtractionStrategy and made public getters for the private TextChunk properties it reads. This allowed me to adjust the tolerance on the fly in my own subclass, shown here:

public class SubSuperStrategy : LocationTextExtractionStrategy {
  public int SameLineOrientationTolerance { get; set; }
  public int SameLineDistanceTolerance { get; set; }

  public override bool SameLine(TextChunk chunk1, TextChunk chunk2) {
    var orientationDelta = Math.Abs(chunk1.OrientationMagnitude
       - chunk2.OrientationMagnitude);
    if(orientationDelta > SameLineOrientationTolerance) return false;
    var distDelta = Math.Abs(chunk1.DistPerpendicular
       - chunk2.DistPerpendicular);
    return (distDelta <= SameLineDistanceTolerance);
  }
}

Using a SameLineDistanceTolerance of 3, this corrects which line the sub/super chunks are assigned to, but the relative position of the text is way off:

monoxide (CO) in flue gas in accordance with the following formula:   C.E. = [CO /(CO + CO )] 2 2

Sometimes the chunks get inserted somewhere in the middle of the text, and sometimes (as with this example) at the end. Either way, they don't end up in the right place. I suspect this might have something to do with font sizes, but I'm at my limits of understanding the bowels of this code.

Has anyone found another way to deal with this?

(I'm happy to submit a pull request with my changes if that helps.)

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:08:39+0000

To properly extract these subscripts and superscripts in line, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.
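
To see why loosening SameLine() alone is not enough, it helps to recall that the strategy first sorts all chunks and only afterwards asks whether neighbouring chunks share a line. As far as I can tell from the iText 5.5.x sources, the sort key puts the perpendicular (vertical) distance before the horizontal start, so a subscript with a slightly lower baseline sorts after the whole main line. The following is a minimal, self-contained sketch of that effect. SketchChunk and SortOrderSketch are hypothetical stand-ins rather than iText classes, the coordinates are made up, and chunks are simply joined with single spaces:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified stand-in for a text chunk; not an iText class. */
final class SketchChunk implements Comparable<SketchChunk> {
    final String text;
    final int baselineY; // plays the role of the perpendicular distance
    final int startX;    // plays the role of the parallel start

    SketchChunk(String text, int baselineY, int startX) {
        this.text = text;
        this.baselineY = baselineY;
        this.startX = startX;
    }

    // Assumed ordering: higher baseline first, then left to right.
    @Override
    public int compareTo(SketchChunk other) {
        int byLine = Integer.compare(other.baselineY, baselineY);
        if (byLine != 0) return byLine;
        return Integer.compare(startX, other.startX);
    }
}

final class SortOrderSketch {
    /** Sorts the chunks, then joins neighbours with a space (same line) or a newline. */
    static String join(List<SketchChunk> chunks, int tolerance) {
        List<SketchChunk> sorted = new ArrayList<>(chunks);
        Collections.sort(sorted);
        StringBuilder sb = new StringBuilder();
        SketchChunk last = null;
        for (SketchChunk c : sorted) {
            if (last != null) {
                boolean sameLine = Math.abs(c.baselineY - last.baselineY) <= tolerance;
                sb.append(sameLine ? " " : "\n");
            }
            sb.append(c.text);
            last = c;
        }
        return sb.toString();
    }

    /** Invented layout: main text on baseline 700, subscripts on baseline 697. */
    static List<SketchChunk> sample() {
        List<SketchChunk> chunks = new ArrayList<>();
        chunks.add(new SketchChunk("[CO", 700, 100));
        chunks.add(new SketchChunk("2", 697, 120));
        chunks.add(new SketchChunk("/(CO + CO", 700, 126));
        chunks.add(new SketchChunk("2", 697, 180));
        chunks.add(new SketchChunk(")]", 700, 186));
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(join(sample(), 0)); // strict comparison
        System.out.println(join(sample(), 3)); // tolerant comparison
    }
}
```

With a tolerance of 0 this model reproduces the two-line result from the question; with a tolerance of 3 the subscripts join the line but still land at its end, because the sort order has not changed.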

I'm more at home in Java/iText; thus, I implemented this approach in Java first and only afterwards translated it to C#/iTextSharp.

An approach using Java & iText

I'm using the current development branch iText 5.5.8-SNAPSHOT.

A way to identify lines

Assuming that text lines are horizontal and that the vertical extents of the glyph bounding boxes on different lines do not overlap, one can try to identify lines using a RenderListener like this:

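import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class TextLineFinder implements RenderListener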
{
    @Override
    public void beginTextBlock() { }
    @Override
    public void endTextBlock() { }
    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    /*
     * @see RenderListener#renderText(TextRenderInfo)
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.getAscentLine();
        LineSegment descentLine = renderInfo.getDescentLine();
        float[] yCoords = new float[]{
                ascentLine.getStartPoint().get(Vector.I2),
                ascentLine.getEndPoint().get(Vector.I2),
                descentLine.getStartPoint().get(Vector.I2),
                descentLine.getEndPoint().get(Vector.I2)
        };
        Arrays.sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    /**
     * This method marks the given interval as used.
     */
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

(TextLineFinder.java)

This RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even when subscripts and superscripts are present.
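
The projection idea can be tried without iText. The sketch below re-implements the interval merging on plain floats. VerticalProjectionSketch is a hypothetical name, and the y ranges (a main line, a subscript overlapping it, and a second line further down) are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone version of the interval bookkeeping; not an iText class. */
final class VerticalProjectionSketch {
    // Sorted boundaries: even indices open a used interval, odd indices close it.
    final List<Float> flips = new ArrayList<>();

    /** Marks [from, to] as used, merging it with any intervals it overlaps. */
    void addInterval(float from, float to) {
        if (to < from) { float t = to; to = from; from = t; }

        // i: first boundary >= from; j: first boundary >= to (searched from i onwards).
        int i = 0, j = 0;
        for (; i < flips.size(); i++) {
            if (flips.get(i) < from) continue;
            for (j = i; j < flips.size(); j++) {
                if (flips.get(j) >= to) break;
            }
            break;
        }
        boolean fromOutside = i % 2 == 0; // 'from' lies outside every existing interval
        boolean toOutside = j % 2 == 0;   // 'to' lies outside every existing interval

        // Drop boundaries swallowed by the new interval, then add the new ends if needed.
        while (j-- > i) flips.remove(j);
        if (toOutside) flips.add(i, to);
        if (fromOutside) flips.add(i, from);
    }

    /** Main line 695..710, overlapping subscript 692..700, next line 670..685. */
    static List<Float> sample() {
        VerticalProjectionSketch s = new VerticalProjectionSketch();
        s.addInterval(695f, 710f);
        s.addInterval(692f, 700f);
        s.addInterval(670f, 685f);
        return s.flips;
    }

    public static void main(String[] args) {
        System.out.println(sample()); // prints [670.0, 685.0, 692.0, 710.0]
    }
}
```

The subscript's range is absorbed into the main line's interval, so only two lines remain: 670 to 685 and 692 to 710.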

This class essentially is a reduced form of the PageVerticalAnalyzer used in this answer.

Sorting text chunks by those lines

Having identified the lines as above, one can tweak iText's LocationTextExtractionStrategy to sort along those lines like this:

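import java.lang.reflect.Field;
import java.util.List;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy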
{
    public class HorizontalTextChunk extends TextChunk
    {
        public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth)
        {
            super(string, startLocation, endLocation, charSpaceWidth);
        }

        @Override
        public int compareTo(TextChunk rhs)
        {
            if (rhs instanceof HorizontalTextChunk)
            {
                HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                if (rslt != 0) return rslt;
                return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
            }
            else
                return super.compareTo(rhs);
        }

        @Override
        public boolean sameLine(TextChunk as)
        {
            if (as instanceof HorizontalTextChunk)
            {
                HorizontalTextChunk horAs = (HorizontalTextChunk) as;
                return getLineNumber() == horAs.getLineNumber();
            }
            else
                return super.sameLine(as);
        }

        public int getLineNumber()
        {
            Vector startLocation = getStartLocation();
            float y = startLocation.get(Vector.I2);
            List<Float> flips = textLineFinder.verticalFlips;
            if (flips == null || flips.isEmpty())
                return 0;
            if (y < flips.get(0))
                return flips.size() / 2 + 1;
            for (int i = 1; i < flips.size(); i+=2)
            {
                if (y < flips.get(i))
                {
                    return (1 + flips.size() - i) / 2;
                }
            }
            return 0;
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        textLineFinder.renderText(renderInfo);

        LineSegment segment = renderInfo.getBaseline();
        // Remove the rise from the baseline: text from a super/subscript render operation
        // should probably be considered part of the baseline of the text it is relative to.
        if (renderInfo.getRise() != 0){
            Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth());
        getLocationalResult().add(location);        
    }

    public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException
    {
        locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
        locationalResultField.setAccessible(true);

        textLineFinder = new TextLineFinder();
    }

    @SuppressWarnings("unchecked")
    List<TextChunk> getLocationalResult()
    {
        try
        {
            return (List<TextChunk>) locationalResultField.get(this);
        }
        catch (IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    final Field locationalResultField;
    final TextLineFinder textLineFinder;
}

(HorizontalTextExtractionStrategy.java)

This TextExtractionStrategy uses a TextLineFinder to identify horizontal text lines and then uses this information to sort the text chunks.
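
The line lookup itself can be illustrated in isolation. The following sketch mirrors the getLineNumber() logic on a plain list of boundaries. LineNumberSketch is a hypothetical name, and the boundaries and baseline heights are invented (two lines, 670 to 685 and 692 to 710):

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone version of the line lookup; not an iText class. */
final class LineNumberSketch {
    /**
     * Maps a baseline height to a line number counted from the top of the page.
     * flips holds sorted boundaries: even indices open a line, odd indices close it.
     */
    static int lineNumber(List<Float> flips, float y) {
        if (flips == null || flips.isEmpty()) return 0;
        if (y < flips.get(0)) return flips.size() / 2 + 1; // below the lowest line
        for (int i = 1; i < flips.size(); i += 2) {
            // First upper boundary above y decides the line.
            if (y < flips.get(i)) return (1 + flips.size() - i) / 2;
        }
        return 0; // above the highest line
    }

    public static void main(String[] args) {
        List<Float> flips = Arrays.asList(670f, 685f, 692f, 710f);
        System.out.println(lineNumber(flips, 700f)); // main text of the upper line: 1
        System.out.println(lineNumber(flips, 697f)); // lowered subscript baseline: 1
        System.out.println(lineNumber(flips, 673f)); // lower line: 2
    }
}
```

Because the main text and the subscript map to the same line number, the comparison falls through to the horizontal start position, which is what puts the subscript back in place. Note that the boundaries keep changing while the page is rendered, so the numbers are only meaningful once all text of the page has been processed.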

Beware, this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy the LocationTextExtractionStrategy source and insert this code into the copy directly.
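
For readers unfamiliar with the reflection trick, here is a minimal sketch of the pattern on its own. ParentWithPrivateList and ReflectiveChild are hypothetical classes standing in for the iText parent class and the custom strategy, and the field name "items" is likewise invented:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parent that hides its list, much like a library class might. */
class ParentWithPrivateList {
    private final List<String> items = new ArrayList<>();

    int size() {
        return items.size();
    }
}

/** Hypothetical subclass that reaches the parent's private field via reflection. */
final class ReflectiveChild extends ParentWithPrivateList {
    private final Field itemsField;

    ReflectiveChild() {
        try {
            // Look the field up once and lift the access check.
            itemsField = ParentWithPrivateList.class.getDeclaredField("items");
            itemsField.setAccessible(true);
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    @SuppressWarnings("unchecked")
    List<String> items() {
        try {
            return (List<String>) itemsField.get(this);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ReflectiveChild child = new ReflectiveChild();
        child.items().add("chunk");
        System.out.println(child.size()); // the parent sees the added element
    }
}
```

A security manager, or on newer Java versions a module that does not open the package, can make setAccessible(true) fail, which is the kind of environment the warning above refers to.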

Extracting the text

Now one can use this text extraction strategy to extract the text with inline superscripts and subscripts like this:

String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
    return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
}

(from ExtractSuperAndSubInLine.java)

The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", now is extracted like this:

monoxide (CO) in flue gas in accordance with the following formula:   C.E. = [CO 2/(CO + CO 2 )]

The same approach using C# & iTextSharp

Explanations, warnings, and sample results from the Java-centric section still apply; here is the code:

I'm using iTextSharp 5.5.7.

A way to identify lines

using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class TextLineFinder : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.GetAscentLine();
        LineSegment descentLine = renderInfo.GetDescentLine();
        float[] yCoords = new float[]{
            ascentLine.GetStartPoint()[Vector.I2],
            ascentLine.GetEndPoint()[Vector.I2],
            descentLine.GetStartPoint()[Vector.I2],
            descentLine.GetEndPoint()[Vector.I2]
        };
        Array.Sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.Count; i++)
        {
            float flip = verticalFlips[i];
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.Count; j++)
            {
                flip = verticalFlips[j];
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        bool fromOutsideInterval = i%2==0;
        bool toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.RemoveAt(j);
        if (toOutsideInterval)
            verticalFlips.Insert(i, to);
        if (fromOutsideInterval)
            verticalFlips.Insert(i, from);
    }

    public List<float> verticalFlips = new List<float>();
}

Sorting text chunks by those lines

public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy
{
    public class HorizontalTextChunk : TextChunk
    {
        public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder)
            : base(stringValue, startLocation, endLocation, charSpaceWidth)
        {
            this.textLineFinder = textLineFinder;
        }

        override public int CompareTo(TextChunk rhs)
        {
            if (rhs is HorizontalTextChunk)
            {
                HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber());
                if (rslt != 0) return rslt;
                return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]);
            }
            else
                return base.CompareTo(rhs);
        }

        public override bool SameLine(TextChunk a)
        {
            if
