I'm trying to support as much Unicode as I can in the PDF files I'm writing. I want to be able to output utf8 strings and have them display correctly in the PDF.
I see in the libharu encodings documentation (https://github.com/libharu/libharu/wiki/Encodings) that there are many single-byte code pages I can access, and special functions for accessing multi-byte code pages if I want Chinese, Japanese, and Korean. But my understanding is that if I wanted to use all of those pages and functions to write arbitrary utf8 strings, I'd have to write a bunch of code to break each utf8 string into segments that each fit a single code page, reverse-map each segment from utf8 into that code page, and swap code pages (and fonts) as I output the segments. That seems like a lot of error-prone work compared to just being able to say "write this utf8 string".
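Just to make concrete what I mean, here is roughly what I picture that per-code-page approach looking like (only a sketch: the encoder and CJK font names are the ones listed in the libharu wiki, and the segment-splitting/transcoding step is the part I'd have to write myself):
HPDF_Doc pdf = HPDF_New( PdfErrorHandler, NULL );
/* the same TTF requested once per single-byte code page */
const char *name = HPDF_LoadTTFontFromFile( pdf, "path/to/verdana.ttf", HPDF_FALSE );
HPDF_Font latin = HPDF_GetFont( pdf, name, "WinAnsiEncoding" );
HPDF_Font cyr   = HPDF_GetFont( pdf, name, "CP1251" );
HPDF_Font greek = HPDF_GetFont( pdf, name, "ISO8859-7" );
/* plus a separate pair of calls per CJK language, e.g. Japanese */
HPDF_UseJPFonts( pdf );
HPDF_UseJPEncodings( pdf );
HPDF_Font jp = HPDF_GetFont( pdf, "MS-Gothic", "90ms-RKSJ-H" );
/* ...and then, for every utf8 string, split it into runs that fit a
   single code page, transcode each run from utf8 to that code page,
   switch fonts, and output the runs one by one. */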
To be able to write utf8 strings I'm using this code:
HPDF_Doc myPdf = HPDF_New( PdfErrorHandler, NULL );
HPDF_UseUTFEncodings( myPdf );               /* register the UTF-8 encoder */
HPDF_SetCurrentEncoder( myPdf, "UTF-8" );
/* HPDF_TRUE = embed the font data in the output file */
const char *f = HPDF_LoadTTFontFromFile( myPdf, "path/to/verdana.ttf", HPDF_TRUE );
HPDF_Font myFont = HPDF_GetFont( myPdf, f, "UTF-8" );
... go on to use myFont to write various text strings
That works, and I can write utf8 strings with accented Latin characters, and Cyrillic and Greek characters, and they show correctly in the PDF.
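(For reference, the writing itself is just the ordinary libharu page calls, along these lines; the coordinates, font size, sample string, and output filename are only placeholders:)
HPDF_Page page = HPDF_AddPage( myPdf );
HPDF_Page_SetFontAndSize( page, myFont, 11 );
HPDF_Page_BeginText( page );
HPDF_Page_TextOut( page, 50, 750, "Grüße, Καλημέρα, Привет" );   /* plain utf8 bytes */
HPDF_Page_EndText( page );
HPDF_SaveToFile( myPdf, "out.pdf" );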
However, because I passed HPDF_TRUE to embed the font in my file, the file size increases significantly. I am in fact using four fonts (verdana.ttf, verdanab.ttf, verdanai.ttf, and verdanaz.ttf), and they add over 600k to my file size, compared to when I was using the "built-in" libharu fonts (which leave the file tiny, just a few k).
(I did try passing HPDF_FALSE so the fonts wouldn't be embedded, but then my files open with random Latin characters.)
I'm trying to understand conceptually why it's necessary to embed fonts in my PDF, if I'm using a font like verdana that is going to be on the end user's system anyway. (I don't even care if it's verdana -- any standard sans serif font would do.) I've certainly created lots of PDF files by other means (e.g., exporting from Word) containing Greek, Cyrillic, Chinese, and other characters, and yet they are small. So is this embedding-to-use-utf8 requirement just a quirk of libharu?
Plus, even with that 600k of bulk, my files made with libharu show Chinese characters as blocks. I read on a libharu documentation page that libharu only supports one- and two-byte utf8 sequences, which covers almost everything except Chinese, Japanese, and Korean. So does this mean I'm embedding verdana.ttf, much of which (I assume) is Chinese, Japanese, and Korean glyphs, and I can't even access them?
In any case, Chinese, Japanese, and Korean aren't important for my current application; but just for the one- and two-byte utf8 sequences, I'm trying to understand whether there's a way to use them in libharu without having to embed big fonts in my file.