The y-coordinates I get back for the lines in a table appear stretched relative to the coordinates of the text. Some transformation seems to be applied, but I cannot find where. If possible I would like to fix the problem within the scope of the PDFGraphicsStreamEngine as extended below, and not have to go back to the drawing board with the other input streams available in PDFBox.
I have extended PDFTextStripper to acquire the location of every text glyph on the page:
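
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class MyPDFTextStripper extends PDFTextStripper {
        // Every glyph position seen while the document is processed.
        private List<TextPosition> tps;

        public MyPDFTextStripper() throws IOException {
            tps = new ArrayList<>();
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions)
                throws IOException {
            // Collect the positions instead of writing the text out.
            tps.addAll(textPositions);
        }

        List<TextPosition> getTps() {
            return tps;
        }
    }
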
I have also extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:
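
    import java.awt.geom.GeneralPath;
    import java.awt.geom.Line2D;
    import java.awt.geom.Point2D;
    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.pdmodel.PDPage;

    public class LineCatcher extends PDFGraphicsStreamEngine {
        // Path currently being built by moveTo/lineTo.
        private final GeneralPath linePath = new GeneralPath();
        private List<Line2D> lines;

        LineCatcher(PDPage page) {
            super(page);
            lines = new ArrayList<>();
        }

        List<Line2D> getLines() {
            return lines;
        }

        @Override
        public void strokePath() throws IOException {
            // Reduce the stroked path to the diagonal of its bounding box.
            Rectangle2D rect = linePath.getBounds2D();
            Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                    rect.getX() + rect.getWidth(),
                    rect.getY() + rect.getHeight());
            lines.add(line);
            linePath.reset();
        }

        @Override
        public void moveTo(float x, float y) throws IOException {
            linePath.moveTo(x, y);
        }

        @Override
        public void lineTo(float x, float y) throws IOException {
            linePath.lineTo(x, y);
        }

        @Override
        public Point2D getCurrentPoint() throws IOException {
            return linePath.getCurrentPoint();
        }

        // All other abstract methods of PDFGraphicsStreamEngine can be left
        // empty for the purposes of this problem.
    }
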
I have written a simple program to demonstrate the problem:

    import java.awt.geom.Line2D;
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.TextPosition;

    public class PageAnalysis {
        public static void main(String[] args) {
            try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
                PDPage page = doc.getPage(0);

                // Distinct (truncated) y-coordinates of all text glyphs.
                MyPDFTextStripper ts = new MyPDFTextStripper();
                ts.getText(doc);
                List<TextPosition> tps = ts.getTps();
                System.out.println("Y coordinates in text:");
                Set<Integer> ySet = new HashSet<>();
                for (TextPosition tp : tps) {
                    ySet.add((int) tp.getY());
                }
                List<Integer> yList = new ArrayList<>(ySet);
                Collections.sort(yList);
                for (int y : yList) {
                    System.out.print(y + " ");
                }
                System.out.println();

                // Distinct (truncated) y-coordinates of all stroked lines.
                System.out.println("Y coordinates in lines:");
                LineCatcher lineCatcher = new LineCatcher(page);
                lineCatcher.processPage(page);
                List<Line2D> lines = lineCatcher.getLines();
                ySet = new HashSet<>();
                for (Line2D line : lines) {
                    ySet.add((int) line.getY2());
                }
                yList = new ArrayList<>(ySet);
                Collections.sort(yList);
                for (int y : yList) {
                    System.out.print(y + " ");
                }
                System.out.println();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The output from this is:

    Y coordinates in text:
    66 79 106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780
    Y coordinates in lines:
    322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713

The last number in the text list (780) is the y-coordinate of the page number at the bottom of the page. I cannot work out what is happening to the y-coordinates of the lines, although they appear to be the ones that have been transformed: the media box is the same as the one used for the text, and it is consistent with the text positions. The current transformation matrix also reports a y-scaling of 1.0.
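
One possibility worth ruling out before looking for a scaling factor is a difference in origin rather than in scale. The PDFBox Javadoc describes TextPosition.getY() as adjusted so that the upper-left corner of the page is (0, 0), while the coordinates handed to moveTo and lineTo are in PDF user space, whose origin is normally at the lower left. The page number at the bottom of the page having the largest text y (780) fits that reading. Below is a minimal sketch of the conversion, assuming the only difference is the direction of the y-axis, that the page is not rotated, and that the media box starts at y = 0. The names YAxisCheck, flipY, toTopDown and pageHeight are hypothetical; in the program above the height would presumably come from page.getMediaBox().getHeight().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class YAxisCheck {

    // Converts a y measured upward from the bottom edge of the page into a y
    // measured downward from the top edge (the same formula also converts back).
    // pageHeight is an assumed input, e.g. the height of the media box.
    static double flipY(double pageHeight, double y) {
        return pageHeight - y;
    }

    // Flips every y in the list and returns the results in ascending order,
    // so they can be compared side by side with the sorted text coordinates.
    static List<Double> toTopDown(double pageHeight, List<Double> bottomUpYs) {
        List<Double> result = new ArrayList<>();
        for (double y : bottomUpYs) {
            result.add(flipY(pageHeight, y));
        }
        Collections.sort(result);
        return result;
    }
}
```

As an illustration only, with a hypothetical page height of 792 (the real height is not stated above), a line y of 713 would map to 79 and a line y of 322 to 470. Whether the flipped values line up with the text coordinates depends on the actual media box of the page.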