Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
232 views
in Technique[技术] by (71.8m points)

c# - iTextSharp: Split pages size equals file size

Here is how I split a large PDF (144 mb):

public int SplitAndSave(string inputPath, string outputPath)
{
    FileInfo file = new FileInfo(inputPath);
    string name = file.Name.Substring(0, file.Name.LastIndexOf("."));

    using (PdfReader reader = new PdfReader(inputPath))
    {
        for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
        {
            string filename = pagenumber.ToString() + ".pdf";

            Document document = new Document();
            PdfCopy copy = new PdfCopy(document, new FileStream(outputPath + "" + filename, FileMode.Create));

            document.Open();

            copy.AddPage(copy.GetImportedPage(reader, pagenumber));

            document.Close();
        }
        return reader.NumberOfPages;
    }
}

For most PDFs (small size, and I guess old format), all works fine. But for a bigger one (that perhaps are using something like refstreams for better compression), the split pages open as one page, but its size is equal to the original PDF's size. What can I do?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In case of your document Top_Gear_Magazine_2012_09.pdf the reason is indeed the one I mentioned: All pages refer to object 2 0 R as their /Resources, and the dictionary in 2 0 obj in turn references all images in the PDF.

To split that document into partial documents containing only the images required, you should preprocess the document by first finding out which images belong to which pages and then creating individual /Resources dictionaries for all pages.

As you already use iText in this context, you can also use it to find out which images are required for which pages. Use the iText parser package to initially parse the PDF page by page using a RenderListener implementation whose RenderImage method simply remembers which image objects are used on the current page. (As a special twist, iText hides the name of the image XObject in question; you get the indirect object, though, and can query its object and generation number which suffices for the next step.)

In a second step, you open the document in a PdfStamperand iterate over the pages. For each page you retrieve the /Resources dictionary and copy it, but only copy those XObjects references referencing one of the image objects whose object number and generation you remembered for the respective page during the first step. Finally you set the diminished copy as the /Resources dictionary of the page in question.

The resulting PDF should split just fine.

PS A very similar issue recently came up on the iText mailing list. In that thread the solution recipe given here has been improved, to get around the difficulties caused by iText hiding the xobject name, I now would propose to intervene before the name is lost by using a different ContentOperator for "Do", here the Java version:

class Do implements ContentOperator 
{ 
    public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws IOException 
    { 
        PdfName xobjectName = (PdfName)operands.get(0); 
        names.add(xobjectName); 
    } 

    final List<PdfName> names = new ArrayList<PdfName>(); 
} 

This content operator simply collects the names of the used xobjects, i.e. the xobject resources to keep for the given page.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...