PDFs are one of the most popular and convenient ways to disseminate information in the increasingly digital world. They can be opened in practically any desktop or mobile environment, and support almost all modern languages.
At SimulTrans, we receive dozens of requests every month to localize PDFs, whether for technical translation or other purposes. Some of them are scanned documents whose original sources are not available to our clients.
However, in some cases those PDFs are generated from desktop publishing applications like InDesign, Quark, or FrameMaker, or are the output of Content Management Systems.
While we are able to provide quotes based exclusively on PDFs, there are several advantages to providing the actual source files for analysis:
Quotes based on source files are more accurate than those based on PDFs
PDFs are an output rather than an actual source, so their quality can vary hugely. Very few Computer Assisted Translation (CAT) tools can translate them in their native format. For those that can, the resulting translated document can be hit or miss because:
- any non-selectable section may not be included in the scope
- graphics are usually excluded
- small print cannot be rendered correctly
To provide a more precise project scope on which to base a quote, it is possible to convert the PDF into a localization-friendly format like Word.
Using advanced Optical Character Recognition (OCR) converters, SimulTrans can extract the localizable content (including graphics and screenshots) and provide industry-standard logs and Desktop Publishing (DTP) times.
However, even though these log files can provide a fair idea of the scope (word count, for example), they are not entirely accurate. The longer and more complex the PDF, the more variation we can expect.
Why does this happen?
- Readable Content
Not even the most advanced OCR can correctly convert low-resolution text in a PDF created with less-than-optimal quality, so the word count will inevitably vary from an analysis of the actual sources.
For instance, a simple heading like "User Manual" could be rendered as "Us er Man ua l", increasing the word count from 2 to 5. The impact could be bigger if the source language of your files is not English.
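The heading example above can be reproduced with a few lines of Python. This is only a rough sketch: the `word_count` function is a hypothetical stand-in that counts whitespace-separated tokens, whereas real CAT tools apply their own, more elaborate counting rules.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a simplified, assumed counting rule)."""
    return len(text.split())


source_heading = "User Manual"    # text as it exists in the source file
ocr_heading = "Us er Man ua l"    # the same heading after a poor OCR conversion

source_words = word_count(source_heading)
ocr_words = word_count(ocr_heading)

# Ratio between the OCR-based count and the source-based count
inflation = ocr_words / source_words

print(f"Source: {source_words} words, OCR: {ocr_words} words, x{inflation} inflation")
```

Under this simple rule, the fragmented heading counts as 5 words instead of 2. The same effect, spread over a long document, is what makes a PDF-based quote drift away from a source-based one.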
- Graphic Content Ambiguity
It is not always possible to tell which graphics in a PDF are editable. In a low-resolution PDF, most of the text in graphics will not be selectable and will have to be treated as non-editable. It may also be impossible to determine whether source graphics are available, which could unnecessarily increase the estimated cost.
- Reusable Content
Repeated text that is used on several pages of a document can be marked as footers, headers and/or cross-references in most new-generation desktop publishing applications.
This means that analyzing the sources will count this text only once; when analyzing a PDF, however, it could be counted repeatedly, inflating the word count.
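To illustrate the difference, here is a minimal sketch that compares the two counts for a three-page document. The header, footer, and page text are invented sample data, and `word_count` is again a hypothetical whitespace-based counter rather than the rule any particular CAT tool uses.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a simplified, assumed counting rule)."""
    return len(text.split())


# Hypothetical sample document: one header and one footer shared by all pages
header = "Product X User Manual"
footer = "Confidential"
page_bodies = [
    "Install the device.",
    "Connect the cable.",
    "Turn the power on.",
]

# PDF analysis: every page carries its own copy of the header and footer
pdf_count = sum(word_count(f"{header} {body} {footer}") for body in page_bodies)

# Source analysis: the header and footer are defined once and reused
source_count = (
    word_count(header)
    + word_count(footer)
    + sum(word_count(body) for body in page_bodies)
)

print(f"PDF-based count: {pdf_count}, source-based count: {source_count}")
```

With this sample data the PDF-based count is 25 words against 15 for the sources, and the gap grows with every additional page that repeats the header and footer.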
Why do sources matter?
Providing the full set of source files created in your publishing application allows SimulTrans to detect any potential issues with them, such as translation-unfriendly formatting (for example, line breaks or indentations that result in unneeded segmentation) or missing graphics or fonts.
This way we can resolve potential localization issues upfront, at the analysis stage, and avoid surprises towards the end of a project that could affect the timeframe and cost.
So remember: sending your full set of source files, fonts, and graphics for your documentation translation project is the best way to get an accurate proposal and schedule.
Of course, we can still provide a quote and translated files if you need to localize a set of PDFs and don't have or cannot find the source files. That particular process will be covered in a future blog post, so stay tuned!