To properly extract these subscripts and superscripts in line, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.
I'm more at home in Java/iText; thus, I implemented this approach in Java first and only afterwards translated it to C#/iTextSharp.
An approach using Java & iText
I'm using the current development branch, iText 5.5.8-SNAPSHOT.
A way to identify lines
Assuming that text lines are horizontal and that the vertical extents of the glyph bounding boxes on different lines do not overlap, one can try to identify lines using a RenderListener like this:
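import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class TextLineFinder implements RenderListener
{
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    /*
     * @see RenderListener#renderText(TextRenderInfo)
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.getAscentLine();
        LineSegment descentLine = renderInfo.getDescentLine();
        float[] yCoords = new float[]{
                ascentLine.getStartPoint().get(Vector.I2),
                ascentLine.getEndPoint().get(Vector.I2),
                descentLine.getStartPoint().get(Vector.I2),
                descentLine.getEndPoint().get(Vector.I2)
        };
        // the lowest and highest y coordinate span the vertical extent of this chunk
        Arrays.sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    /**
     * This method marks the given interval as used.
     */
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        // i: index of the first flip not below from; j: index of the first flip not below to
        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }

        // an even index means the coordinate lies outside all intervals recorded so far
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        // drop the flips covered by the new interval, then insert its own borders where needed
        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    // sorted y coordinates; flips at even indices start a used interval, flips at odd indices end it
    final List<Float> verticalFlips = new ArrayList<Float>();
}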
(TextLineFinder.java)
This RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even in case of subscripts and superscripts.
This class is essentially a reduced form of the PageVerticalAnalyzer used in this answer.
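To look at the interval bookkeeping in isolation, here is a minimal, iText-free sketch of the same idea. The class name VerticalFlipsSketch and the method name addUse are hypothetical and chosen only for illustration. The logic is meant to mirror addVerticalUseSection above, with plain floats standing in for the y extents of the glyph bounding boxes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, iText-free sketch of the interval merging done by TextLineFinder. */
final class VerticalFlipsSketch {
    /** Sorted y coordinates; even indices start a used interval, odd indices end it. */
    final List<Float> flips = new ArrayList<>();

    /** Marks the interval between the two values as used, merging it with intervals it touches. */
    void addUse(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        // i: index of the first flip not below from
        int i = 0;
        while (i < flips.size() && flips.get(i) < from) {
            i++;
        }
        // j: index of the first flip not below to
        int j = i;
        while (j < flips.size() && flips.get(j) < to) {
            j++;
        }

        // an even index means the coordinate lies outside every interval recorded so far
        boolean fromOutside = i % 2 == 0;
        boolean toOutside = j % 2 == 0;

        // remove the flips swallowed by the new interval, then add its borders where needed
        flips.subList(i, j).clear();
        if (toOutside) {
            flips.add(i, to);
        }
        if (fromOutside) {
            flips.add(i, from);
        }
    }
}
```

Feeding this sketch the ranges 700 to 712, 698 to 705 (for example a subscript reaching slightly below its line) and 650 to 662 leaves the flips 650, 662, 698, 712, which corresponds to two text lines.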
Sorting text chunks by those lines
Having identified the lines as above, one can tweak iText's LocationTextExtractionStrategy to sort along those lines like this:
import java.lang.reflect.Field;
import java.util.List;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy
{
    public class HorizontalTextChunk extends TextChunk
    {
        public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth)
        {
            super(string, startLocation, endLocation, charSpaceWidth);
        }

        @Override
        public int compareTo(TextChunk rhs)
        {
            if (rhs instanceof HorizontalTextChunk)
            {
                // sort by line first, then by x coordinate within the line
                HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                if (rslt != 0) return rslt;
                return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
            }
            else
                return super.compareTo(rhs);
        }

        @Override
        public boolean sameLine(TextChunk as)
        {
            if (as instanceof HorizontalTextChunk)
            {
                HorizontalTextChunk horAs = (HorizontalTextChunk) as;
                return getLineNumber() == horAs.getLineNumber();
            }
            else
                return super.sameLine(as);
        }

        public int getLineNumber()
        {
            Vector startLocation = getStartLocation();
            float y = startLocation.get(Vector.I2);
            List<Float> flips = textLineFinder.verticalFlips;
            if (flips == null || flips.isEmpty())
                return 0;
            // below the lowest line found
            if (y < flips.get(0))
                return flips.size() / 2 + 1;
            // lines are numbered from the top of the page downwards
            for (int i = 1; i < flips.size(); i+=2)
            {
                if (y < flips.get(i))
                {
                    return (1 + flips.size() - i) / 2;
                }
            }
            return 0;
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        textLineFinder.renderText(renderInfo);

        LineSegment segment = renderInfo.getBaseline();
        // remove the rise from the baseline - we do this because the text from a super/subscript render operation
        // should probably be considered as part of the baseline of the text the super/sub is relative to
        if (renderInfo.getRise() != 0)
        {
            Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth());
        getLocationalResult().add(location);
    }

    public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException
    {
        // the chunk list of the parent class is private, so it is accessed via reflection
        locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
        locationalResultField.setAccessible(true);

        textLineFinder = new TextLineFinder();
    }

    @SuppressWarnings("unchecked")
    List<TextChunk> getLocationalResult()
    {
        try
        {
            return (List<TextChunk>) locationalResultField.get(this);
        }
        catch (IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    final Field locationalResultField;
    final TextLineFinder textLineFinder;
}
(HorizontalTextExtractionStrategy.java)
This TextExtractionStrategy uses a TextLineFinder to identify horizontal text lines and then uses this information to sort the text chunks.
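The line lookup and the resulting sort order can also be illustrated without iText. In the following sketch, LineOrderSketch, Chunk and inReadingOrder are hypothetical names introduced for this example. lineNumber is intended to follow the logic of getLineNumber above, and the comparator follows that of compareTo (line number first, then x coordinate).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, iText-free sketch of sorting text chunks by detected line. */
final class LineOrderSketch {
    /** Stand-in for a text chunk: its text and the start point of its baseline. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns 1 for the topmost line, larger numbers further down, 0 if y is above all lines. */
    static int lineNumber(float y, List<Float> flips) {
        if (flips == null || flips.isEmpty()) {
            return 0;
        }
        if (y < flips.get(0)) {
            return flips.size() / 2 + 1; // below the lowest line
        }
        for (int i = 1; i < flips.size(); i += 2) {
            if (y < flips.get(i)) {
                return (1 + flips.size() - i) / 2;
            }
        }
        return 0;
    }

    /** Sorts by line number, then by x, and joins the chunks with a newline between lines. */
    static String inReadingOrder(List<Chunk> chunks, List<Float> flips) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> lineNumber(c.y, flips))
                .thenComparingDouble(c -> c.x));

        StringBuilder result = new StringBuilder();
        int previousLine = -1;
        for (Chunk chunk : sorted) {
            int line = lineNumber(chunk.y, flips);
            if (previousLine != -1 && line != previousLine) {
                result.append('\n');
            }
            result.append(chunk.text);
            previousLine = line;
        }
        return result.toString();
    }
}
```

With the flips 650, 662, 698, 712, a chunk whose baseline starts at y = 699 ends up on the same line as one at y = 702, so a subscript is kept in line with the text around it, provided its y coordinate falls inside the same interval.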
Beware, this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy the source of LocationTextExtractionStrategy into a class of your own and insert the code there directly.
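Reduced to its core, the reflection pattern looks like the following self-contained sketch. LibraryStrategy stands in for a library class with a private field and PatchedStrategy for the subclass that needs that field. Both names are made up for this example and are not part of iText.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a library class that keeps its state private. */
class LibraryStrategy {
    private final List<String> locationalResult = new ArrayList<>();

    int size() {
        return locationalResult.size();
    }
}

/** Hypothetical subclass that reaches the private list of its parent via reflection. */
class PatchedStrategy extends LibraryStrategy {
    private final Field resultField = openField(LibraryStrategy.class, "locationalResult");

    /** Looks up a declared field and makes it accessible; this may be refused in restricted environments. */
    private static Field openField(Class<?> owner, String name) {
        try {
            Field field = owner.getDeclaredField(name);
            field.setAccessible(true);
            return field;
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException("Field not found: " + name, e);
        }
    }

    /** Returns the parent's private list itself, not a copy. */
    @SuppressWarnings("unchecked")
    List<String> result() {
        try {
            return (List<String>) resultField.get(this);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The call to setAccessible is the step that can fail, for instance with a SecurityException under a restrictive security manager. That is the situation in which copying the parent class is the safer route.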
Extracting the text
Now one can use this text extraction strategy to extract the text with inline superscripts and subscripts like this:
String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
    return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
}
(from ExtractSuperAndSubInLine.java)
The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", is now extracted like this:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO 2/(CO + CO 2 )]
The same approach using C# & iTextSharp
The explanations, warnings, and sample results from the Java-centric section still apply. I'm using iTextSharp 5.5.7; here is the code:
A way to identify lines
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class TextLineFinder : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.GetAscentLine();
        LineSegment descentLine = renderInfo.GetDescentLine();
        float[] yCoords = new float[]{
            ascentLine.GetStartPoint()[Vector.I2],
            ascentLine.GetEndPoint()[Vector.I2],
            descentLine.GetStartPoint()[Vector.I2],
            descentLine.GetEndPoint()[Vector.I2]
        };
        // the lowest and highest y coordinate span the vertical extent of this chunk
        Array.Sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        // i: index of the first flip not below from; j: index of the first flip not below to
        int i=0, j=0;
        for (; i<verticalFlips.Count; i++)
        {
            float flip = verticalFlips[i];
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.Count; j++)
            {
                flip = verticalFlips[j];
                if (flip < to)
                    continue;
                break;
            }
            break;
        }

        // an even index means the coordinate lies outside all intervals recorded so far
        bool fromOutsideInterval = i%2==0;
        bool toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.RemoveAt(j);
        if (toOutsideInterval)
            verticalFlips.Insert(i, to);
        if (fromOutsideInterval)
            verticalFlips.Insert(i, from);
    }

    // sorted y coordinates; flips at even indices start a used interval, flips at odd indices end it
    public List<float> verticalFlips = new List<float>();
}
Sorting text chunks by those lines
public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy
{
public class HorizontalTextChunk : TextChunk
{
public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder)
: base(stringValue, startLocation, endLocation, charSpaceWidth)
{
this.textLineFinder = textLineFinder;
}
override public int CompareTo(TextChunk rhs)
{
if (rhs is HorizontalTextChunk)
{
HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber());
if (rslt != 0) return rslt;
return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]);
}
else
return base.CompareTo(rhs);
}
public override bool SameLine(TextChunk a)
{
if