The folder structure the OP tries to replicate while extracting portfolio files is specified in the Adobe® Supplement to ISO 32000, BaseVersion: 1.7, ExtensionLevel: 3. It is therefore not part of the current PDF standard, and PDF processing software is not required to understand this kind of information. It appears to be scheduled for addition to the upcoming PDF 2.0 standard (ISO 32000-2), though.
To extract portfolio files into the associated folder structure, therefore, we have to retrieve the folder information as specified in the Adobe® Supplement:
Beginning with extension level 3, a portable collection can contain a Folders object for the purpose of organizing files into a hierarchical structure. The structure is represented by a tree with a single root folder acting as the common ancestor for all other folders and files in the collection. The single root folder is referenced in the Folders entry of Table 8.6 on page 29.
Table 8.6c describes the entries in a folder dictionary

Key: ID
Type: integer
Value: (Required; ExtensionLevel 3) A non-negative integer value representing the unique folder identification number. Two folders shall not share the same ID value. The folder ID value appears as part of the name tree key of any file associated with this folder. A detailed description of the association between folder and files can be found after this table.

Key: Name
Type: text string
Value: (Required; ExtensionLevel 3) A file name representing the name of the folder. Two sibling folders shall not share the same name following case normalization.

Key: Child
Type: dictionary
Value: (Required if the folder has any descendents; ExtensionLevel 3) An indirect reference to the first child folder of this folder.

Key: Next
Type: dictionary
Value: (Required for all but the last item at each level; ExtensionLevel 3) An indirect reference to the next sibling folder at this level.
(section 8.2.4 Collections)
E.g. like this:
static Map<Integer, File> retrieveFolders(PdfReader reader, File baseDir) throws DocumentException
{
    Map<Integer, File> result = new HashMap<Integer, File>();

    // The root folder is referenced from the Folders entry of the Collection dictionary
    PdfDictionary root = reader.getCatalog();
    PdfDictionary collection = root.getAsDict(PdfName.COLLECTION);
    if (collection == null)
        throw new DocumentException("Document has no Collection dictionary");
    PdfDictionary folders = collection.getAsDict(FOLDERS);
    if (folders == null)
        throw new DocumentException("Document collection has no folders dictionary");

    collectFolders(result, folders, baseDir);

    return result;
}

static void collectFolders(Map<Integer, File> collection, PdfDictionary folder, File baseDir)
{
    // Create the directory for this folder; Name is a PDF text string, so decode it as such
    PdfString name = folder.getAsString(PdfName.NAME);
    File folderDir = new File(baseDir, name.toUnicodeString());
    folderDir.mkdirs();

    // Remember the directory under the folder ID used in the name tree keys
    PdfNumber id = folder.getAsNumber(PdfName.ID);
    collection.put(id.intValue(), folderDir);

    // Siblings live in the same parent directory
    PdfDictionary next = folder.getAsDict(PdfName.NEXT);
    if (next != null)
        collectFolders(collection, next, baseDir);

    // Children live inside this folder's directory
    PdfDictionary child = folder.getAsDict(CHILD);
    if (child != null)
        collectFolders(collection, child, folderDir);
}

final static PdfName FOLDERS = new PdfName("Folders");
final static PdfName CHILD = new PdfName("Child");
(excerpt from PortfolioFileExtraction.java)
and use this retrieved folder information when writing the files.
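The first-child/next-sibling layout of the folder dictionaries can also be looked at in isolation, without iText. The following is only a minimal sketch under assumptions: the class FolderTreeSketch, its nested Folder class, and the folder IDs in sampleTree are hypothetical stand-ins for the PDF dictionaries, not part of iText or of PortfolioFileExtraction.java. It iterates over siblings via Next, recurses into Child, and rejects a repeated ID (which also stops on a cyclic Next chain).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FolderTreeSketch {
    /** Hypothetical in-memory stand-in for a folder dictionary (ID, Name, Child, Next). */
    static final class Folder {
        final int id;
        final String name;
        Folder child; // first child folder, or null
        Folder next;  // next sibling folder, or null

        Folder(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Maps each folder ID to its slash-separated path, starting at the root folder. */
    static Map<Integer, String> collectPaths(Folder root) {
        Map<Integer, String> paths = new LinkedHashMap<>();
        collect(root, "", paths);
        return paths;
    }

    private static void collect(Folder first, String parentPath, Map<Integer, String> paths) {
        // Walk the sibling chain with a loop, recurse only into children.
        for (Folder f = first; f != null; f = f.next) {
            String path = parentPath.isEmpty() ? f.name : parentPath + "/" + f.name;
            // Folder IDs must be unique; a repeated ID also signals a cyclic chain.
            if (paths.putIfAbsent(f.id, path) != null)
                throw new IllegalStateException("Duplicate folder ID " + f.id);
            collect(f.child, path, paths);
        }
    }

    /** Builds a small tree; the names follow the sample output, the IDs are made up. */
    static Folder sampleTree() {
        Folder root = new Folder(0, "Root");
        Folder f1 = new Folder(1, "Folder 1");
        Folder f2 = new Folder(2, "Folder 2");
        Folder f11 = new Folder(3, "Folder 11");
        Folder f12 = new Folder(4, "Folder 12");
        root.child = f1;
        f1.next = f2;
        f1.child = f11;
        f11.next = f12;
        return root;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String> e : collectPaths(sampleTree()).entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

Looping over the Next chain instead of recursing on it keeps the recursion depth bounded by the nesting depth of the folders rather than by the number of siblings.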
The association of files and folders is specified in the Adobe® Supplement like this:
As previously mentioned, files in the EmbeddedFiles name tree are associated with folders by a special naming convention applied to the name tree key strings. Strings that conform to the following rules serve to associate the corresponding file with a folder:
- The name tree keys are PDF text strings.
- The first character, excluding any byte order marker, is U+003C, the LESS-THAN SIGN (<).
- The following characters shall one or more digits (0 to 9) followed by the closing U+003E, the GREATER-THAN SIGN (>)
- The remainder of the string is a file name.
The section of the string enclosed by LESS-THAN SIGN GREATER-THAN SIGN(<>) is interpreted as a numeric value that specifies the ID value of the folder with which the file is associated. The value shall correspond to a folder ID. The section of the string following the folder ID tag represents the file name of the embedded file.
Files in the EmbeddedFiles name tree that do not conform to these rules shall be treated as associated with the root folder.
(section 8.2.4 Collections)
Your methods can be extended to do so like this:
public static void extractAttachmentsWithFolders(PdfReader reader, String dir) throws IOException, DocumentException
{
    File folder = new File(dir);
    folder.mkdirs();

    // Map of folder ID to target directory, built from the Folders tree
    Map<Integer, File> folders = retrieveFolders(reader, folder);

    PdfDictionary root = reader.getCatalog();
    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println("" + names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println("" + embedded.toString());

    // The Names array alternates between key strings and file specification dictionaries
    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);
    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folders, folder, filespecs.getAsString(i++), filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, Map<Integer, File> dirs, File dir, PdfString name, PdfDictionary filespec) throws IOException
{
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);

    // Default to the base directory; switch to a folder directory if the key starts with an "<ID>" tag
    File dirHere = dir;
    String nameString = name.toUnicodeString();
    if (nameString.startsWith("<"))
    {
        int closing = nameString.indexOf('>');
        if (closing > 0)
        {
            try
            {
                int folderId = Integer.parseInt(nameString.substring(1, closing));
                File folderFile = dirs.get(folderId);
                if (folderFile != null)
                    dirHere = folderFile;
            }
            catch (NumberFormatException e)
            {
                // Not a valid folder ID tag (e.g. "<>" or non-digits): keep the base directory
            }
        }
    }

    for (PdfName key : refs.getKeys())
    {
        PRStream stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));
        String filename = filespec.getAsString(key).toString();
        // try-with-resources closes the stream even if writing fails
        try (FileOutputStream fos = new FileOutputStream(new File(dirHere, filename)))
        {
            fos.write(PdfReader.getStreamBytes(stream));
        }
    }
}
(excerpt from PortfolioFileExtraction.java)
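The key naming rules quoted from the Supplement can also be checked in isolation. Below is a minimal sketch, assuming the key has already been decoded to a Java String (e.g. via toUnicodeString); the class FolderKeySketch and its ParsedKey type are hypothetical names introduced only for illustration. A key that does not match the "<digits>" prefix, or whose digits do not fit into an int, is reported with a null folder ID, meaning the file belongs to the root folder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FolderKeySketch {
    /** Parsed name tree key; folderId is null for keys that do not carry a folder ID tag. */
    static final class ParsedKey {
        final Integer folderId;
        final String fileName;

        ParsedKey(Integer folderId, String fileName) {
            this.folderId = folderId;
            this.fileName = fileName;
        }
    }

    // '<', one or more digits 0 to 9, '>', then the file name (DOTALL: any character).
    private static final Pattern KEY = Pattern.compile("<([0-9]+)>(.*)", Pattern.DOTALL);

    static ParsedKey parse(String key) {
        // Skip a leading byte order mark in case the decoder left one in place.
        String s = key.startsWith("\uFEFF") ? key.substring(1) : key;
        Matcher m = KEY.matcher(s);
        if (m.matches()) {
            try {
                return new ParsedKey(Integer.valueOf(m.group(1)), m.group(2));
            } catch (NumberFormatException e) {
                // Digits too large for an int: fall through and treat as non-conforming.
            }
        }
        // Non-conforming keys are associated with the root folder.
        return new ParsedKey(null, s);
    }

    public static void main(String[] args) {
        for (String key : new String[] { "<3>Test.pdf", "ThumbImpression.pdf", "<>odd.pdf" }) {
            ParsedKey parsed = parse(key);
            System.out.println(key + " -> folder " + parsed.folderId + ", file " + parsed.fileName);
        }
    }
}
```

A regular expression is used here instead of startsWith/indexOf so that keys such as "<>name" or "<abc>name" fall back to the root folder instead of raising a NumberFormatException.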
Applying extractAttachmentsWithFolders to your sample PDF (e.g. using the test method testSamplePortfolio11Folders in PortfolioFileExtraction.java), one gets
Root
│   ThumbImpression.pdf
│
├───Folder 1
│   │   EStampPdf.pdf
│   │   Presentation.pdf
│   │
│   ├───Folder 11
│   │   │   Test.pdf
│   │   │
│   │   └───Folder 111
│   └───Folder 12
└───Folder 2
        SealDeed.pdf