The background of why spaces between words sometimes are not properly recognized by iText(Sharp) or other PDF text extractors has been explained in this answer to "itext java pdf to text creation": These 'spaces' are not necessarily created using a space character but instead using an operation creating a small gap. These operations are also used for other purposes (which do not break words), though, and so a text extractor must use heuristics to decide whether such a gap is a word break or not...
In particular, this implies that you will never get 100% reliable word break detection.
What you can do, though, is to improve the heuristics used.
The standard iText and iTextSharp text extraction strategies, for example, assume a word break in a line if
a) there is a space character, or
b) there is a gap at least as wide as half a space character.
Item a is a sure hit, but item b can often fail in densely set text. The OP of the question whose answer is referenced above got quite good results using a quarter of the width of a space character instead.
You can tweak these criteria by copying and changing the text extraction strategy of your choice.
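For context, this is how a strategy is plugged into an extraction call in iTextSharp 5.x; the file name is a placeholder, and a copied and tweaked strategy would simply be substituted for the stock one:

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class ExtractionSample {
    static void Main() {
        // "input.pdf" is a placeholder path.
        PdfReader reader = new PdfReader("input.pdf");
        try {
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++) {
                // A strategy keeps per-page state, so create a fresh one per page.
                // Substitute your copied or subclassed strategy here.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
            Console.WriteLine(text.ToString());
        } finally {
            reader.Close();
        }
    }
}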
In the SimpleTextExtractionStrategy you find this criterion embedded in the RenderText method:
if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}
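If, as suggested above, you copy the SimpleTextExtractionStrategy into a class of your own (the class name here is merely an example), the quarter-space variant that worked for the OP amounts to changing the divisor of that test:

// Inside your copy of SimpleTextExtractionStrategy.RenderText, e.g. in a
// hypothetical class QuarterSpaceSimpleTextExtractionStrategy:
// require only a quarter of the space character width for a word break.
if (spacing > renderInfo.GetSingleSpaceWidth()/4f){
    AppendTextChunk(' ');
}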
In the case of the LocationTextExtractionStrategy, this criterion has meanwhile been put into a method of its own:
/**
 * Determines if a space character should be inserted between a previous chunk and the current chunk.
 * This method is exposed as a callback so subclasses can fine tune the algorithm for determining whether a space should be inserted or not.
 * By default, this method will insert a space if there is a gap of more than half the font space character width between the end of the
 * previous chunk and the beginning of the current chunk. It will also indicate that a space is needed if the starting point of the new chunk
 * appears *before* the end of the previous chunk (i.e. overlapping text).
 * @param chunk the new chunk being evaluated
 * @param previousChunk the chunk that appeared immediately before the current chunk
 * @return true if the two chunks represent different words (i.e. should have a space between them). False otherwise.
 */
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}
The intention behind putting this into a method of its own was to require merely a simple subclass of the strategy overriding this method to adjust the heuristic criteria. This works fine for the equivalent iText Java class, but during the port to iTextSharp unfortunately no virtual keyword was added to the declaration (as of version 5.4.4). Thus, copying the whole strategy currently is still necessary for iTextSharp.
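For illustration, this is what such a subclass would look like; against iTextSharp 5.4.4 it does not compile as an override because of the missing virtual, so this sketch assumes a build where the keyword has been added (the Java class already allows this). The quarter-space factor again is merely the value that worked for the OP of the referenced question, not a universal constant:

using iTextSharp.text.pdf.parser;

// Hedged sketch: assumes IsChunkAtWordBoundary has been declared virtual,
// which is not yet the case in iTextSharp 5.4.4 (see above).
public class QuarterSpaceLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        // Keep the overlapping-text test of the base implementation but
        // require only a quarter of the space character width for a gap.
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 4.0f;
    }
}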
@Bruno You might want to tell the iText -> iTextSharp porting team about this.
While you can fine-tune text extraction at these code locations, you should be aware that you will not find a 100% reliable criterion here. Some reasons are:
- Gaps between words in densely set text can be smaller than kerning or other gaps applied for some optical effect inside words. Thus, there is no one-size-fits-all factor here.
- In PDFs that do not use the space character at all (which is possible, as gaps can always be used instead), the "width of a space character" might be some random value or not determinable at all!
- There are funny PDFs abusing the space character width (which can individually be stretched at any time for the operations to follow) to do some tabular formatting while using gaps for word breaking. In such a PDF the value of the current width of a space character cannot seriously be used to determine word breaks.
- Sometimes you find s i n g l e words in a line printed spaced out for emphasis. These will likely be parsed as a collection of one-letter words by most heuristics.
You can do better than the iText heuristics, and those derived from them by merely changing constants, by taking into account the actual visual free space between all characters (using PDF rendering or font information analysis mechanisms), but for a perceivable improvement you have to invest much time.
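To give an idea of what "taking into account the actual visual free space" means, here is a minimal sketch of a render listener that measures the distance between consecutive glyphs via their baselines. The names follow iTextSharp 5.x (GetCharacterRenderInfos, GetBaseline), but treat the details as an assumption; a real implementation must additionally handle line changes, rotated text, and per-font metrics:

using System;
using iTextSharp.text.pdf.parser;

// Minimal sketch: logs the visual gap before each glyph by comparing the
// end of the previous baseline with the start of the current one.
public class GapMeasuringListener : IRenderListener {
    private Vector lastEnd = null;

    public void RenderText(TextRenderInfo renderInfo) {
        foreach (TextRenderInfo glyph in renderInfo.GetCharacterRenderInfos()) {
            LineSegment baseline = glyph.GetBaseline();
            if (lastEnd != null) {
                // Distance from the end of the previous glyph to the start of
                // the current one; ignores line breaks for simplicity.
                float gap = baseline.GetStartPoint().Subtract(lastEnd).Length;
                Console.WriteLine("Gap before '{0}': {1}", glyph.GetText(), gap);
            }
            lastEnd = baseline.GetEndPoint();
        }
    }

    public void BeginTextBlock() { lastEnd = null; }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

Such a listener can then be run for a page via new PdfReaderContentParser(reader).ProcessContent(page, new GapMeasuringListener()).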