Reduce the size of a PDF file consisting of scanned images
Ghostscript
Platforms: Linux, Unix, MacOS X, Windows
Requires: Ghostscript
Ghostscript delivers, in my opinion, better results than ImageMagick when downsampling PDF documents composed entirely of images. The arguments are rather lengthy, so if you use it often it makes sense to write a script for it.
The assumption is that the documents to be converted were scanned at a high resolution. Ghostscript's pdfwrite device has a switch called PDFSETTINGS that selects a predefined set of output settings. The names are aligned with Adobe Distiller and yield similar results.
The lowest-resolution preset is /screen. It results in a very small PDF file that can be viewed on screen but is not fit for printing.
gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -sOutputFile=OUTPUT.pdf INPUT.pdf
A PDF that should at least print without too many artifacts should be converted with /printer. Additionally, we use a device-independent color conversion strategy.
gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -sColorConversionStrategy=/UseDeviceIndependentColor -sOutputFile=OUTPUT.pdf INPUT.pdf
At the time of writing, the following PDFSETTINGS were available, listed in order of quality, lowest first.
- /screen
- /ebook
- /default
- /printer
- /prepress
For more information, consult the ps2pdf documentation for your Ghostscript version.
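Since the arguments are lengthy, a small wrapper saves typing. The following is a minimal sketch, not a definitive script: the function name shrink_pdf and the DRY_RUN variable are my own inventions for illustration, and only the gs options already shown above are used.

```shell
#!/bin/sh
# shrink_pdf PRESET INPUT.pdf OUTPUT.pdf
# Hypothetical helper (not part of Ghostscript) wrapping the gs call above.
# Set DRY_RUN=1 to print the command instead of running it.
shrink_pdf() {
    if [ $# -ne 3 ]; then
        echo "usage: shrink_pdf PRESET INPUT.pdf OUTPUT.pdf" >&2
        return 2
    fi
    preset=$1 input=$2 output=$3

    # Accept only the presets listed in this article.
    case "$preset" in
        screen|ebook|default|printer|prepress) ;;
        *) echo "unknown preset: $preset" >&2; return 2 ;;
    esac

    # Build the argument list once so both modes use the same command.
    set -- gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        "-dPDFSETTINGS=/$preset" "-sOutputFile=$output" "$input"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

With the function loaded, shrink_pdf ebook INPUT.pdf OUTPUT.pdf runs the conversion, and prefixing the call with DRY_RUN=1 shows the command that would be executed.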
ImageMagick
Platforms: Linux, Unix, MacOS X, Windows
Requires: ImageMagick, Ghostscript
It is wise to scan documents at the highest resolution possible, as downsampling can be done at any point. The fastest ways I have found so far use the tools of the ImageMagick suite or Ghostscript.
Downsampling to 150 dpi.
Note: Not setting the -compress option will set the internal image type to TIFF. For full control of the outcome it is advisable to set the -compress option.
convert -density 150 INPUT.pdf OUTPUT.pdf
Downsampling a PDF with images scanned at a high resolution to 150 dpi, converting the internal images to JPEG at a quality setting of 80. Useful for sending by mail.
convert -density 150 -compress jpeg -quality 80 INPUT.pdf OUTPUT.pdf
Convert images to PDF documents
ImageMagick
Platforms: Linux, Unix, MacOS X, Windows
Requires: ImageMagick, Ghostscript
PDFs from scans are a very common occurrence these days. Depending on the purpose, conversion is sometimes required. It helps to understand how PDFs store raster data internally in order to choose the best option for the task at hand.
An overview can be found on Wikipedia. In short, the examples below produce either an embedded JPEG or TIFF.
Assume you have a few images lying around that need to be converted to a PDF file.
convert [-repage <format>] -compress <algorithm> [-quality <quality in %>] INPUT.tif OUTPUT.pdf
Creating an A4 PDF with lossy JPEG compression at a quality setting of 80. A higher quality number yields a clearer image but requires more disk space.
Note: If the page is already the correct size, -repage is not required. The -quality option is optional, but if you want to retain full control over the outcome I would suggest you use it.
convert -repage a4 -compress jpeg -quality 80 INPUT01.tif INPUT02.tif INPUT03.tif OUTPUT.pdf
For lossless A4 PDFs using the TIFF format for storage there are two options: LZW compression or ZIP. ZIP seems to be a bit more efficient. Note that the -quality option is not required.
convert -repage a4 -compress lzw INPUT01.tif INPUT02.tif INPUT03.tif OUTPUT.pdf
or
convert -repage a4 -compress zip INPUT01.tif INPUT02.tif INPUT03.tif OUTPUT.pdf
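When there are many scans, listing every file by hand gets tedious. Below is a minimal sketch of a wrapper around the ZIP variant above; the name images_to_pdf and the DRY_RUN variable are hypothetical. It relies on the shell expanding a glob in sorted order, so page files should have zero-padded names. On ImageMagick 7 installations the command is usually invoked as magick rather than convert.

```shell
#!/bin/sh
# images_to_pdf OUTPUT.pdf IMAGE [IMAGE ...]
# Hypothetical helper: combine the given images into one lossless A4 PDF.
# Set DRY_RUN=1 to print the command instead of running it.
images_to_pdf() {
    if [ $# -lt 2 ]; then
        echo "usage: images_to_pdf OUTPUT.pdf IMAGE [IMAGE ...]" >&2
        return 2
    fi
    output=$1
    shift   # the remaining arguments are the input images, in page order

    set -- convert -repage a4 -compress zip "$@" "$output"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

A call such as images_to_pdf OUTPUT.pdf scans/*.tif then picks up every TIFF in the (assumed) scans directory.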
GUI Tools
There are some Windows GUI tools that can do the task as well:
- Images2PDF - Windows - FOSS
- i2pdf - Windows - Freeware
Extract images from PDF files
pdfimages
Platforms: Linux, MacOS X [not confirmed]
Requirements: Poppler
Getting all the images out of a PDF file can be quite a task. The Poppler library comes with some handy tools that can be of tremendous help. For extracting images from a PDF, pdfimages is easy to use.
pdfimages INPUT.pdf PREFIX
will result in files named after the given prefix
PREFIX-000.ppm PREFIX-001.ppm . . . PREFIX-999.ppm
A better way is to preserve embedded JPEG images as such. Assuming a PDF with only embedded JPEGs we use the -j option.
pdfimages -j INPUT.pdf PREFIX
will yield the following files
PREFIX-000.jpg PREFIX-001.jpg . . . PREFIX-999.jpg
ImageMagick
Platforms: Linux, Unix, MacOS X, Windows
Requires: ImageMagick, Ghostscript
ImageMagick can help here as well and is particularly handy when you want to convert to a format that the pdfimages command does not offer. Always set the -density option; otherwise ImageMagick renders the pages at its default of 72 dpi. When converting to JPEG, set the -quality option to get the picture quality you desire.
In the examples below we set the resolution to 300 dpi.
convert -density 300x300 MULTIPAGE.pdf OUTPUT-%d.png
or
convert -density 300x300 -quality 100 MULTIPAGE.pdf OUTPUT-%d.jpg
Ghostscript
Platforms: Linux, Unix, MacOS X, Windows
Requires: Ghostscript
Ghostscript can do the job here as well, but it is rather verbose in its syntax. To get the right image type, the output device has to be specified.
To convert to a PNG with a resolution of 300 dpi, the following will do. Be aware that the png16 device produces a 16-color (4-bit) image; use png16m instead for 24-bit color. Note: Not setting the -r option will most likely result in a low quality image.
gs -q -dBATCH -dNOPAUSE -sDEVICE=png16 -r300 -sOutputFile=OUTPUT-%03d.png INPUT.pdf
To do the same but produce a colored TIFF
gs -q -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -r300 -sOutputFile=OUTPUT-%03d.tif INPUT.pdf
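Because the device has to match the desired file type, it can be derived from the extension of the output pattern. The sketch below assumes only three of Ghostscript's devices (png16m, tiff24nc and jpeg); the function names gs_device_for and pdf_to_images are hypothetical and the mapping is my own choice, not something Ghostscript provides.

```shell
#!/bin/sh
# gs_device_for OUTPUT_PATTERN
# Hypothetical helper: print a Ghostscript device name matching the
# file extension. Only the three devices below are handled.
gs_device_for() {
    case "$1" in
        *.png)        echo png16m ;;    # 24-bit RGB PNG
        *.tif|*.tiff) echo tiff24nc ;;  # 24-bit RGB TIFF, no compression
        *.jpg|*.jpeg) echo jpeg ;;      # JPEG
        *) echo "no device known for: $1" >&2; return 2 ;;
    esac
}

# pdf_to_images INPUT.pdf OUTPUT_PATTERN [DPI]
# Renders every page of INPUT.pdf; DPI defaults to 300.
pdf_to_images() {
    device=$(gs_device_for "$2") || return
    gs -q -dBATCH -dNOPAUSE "-sDEVICE=$device" "-r${3:-300}" \
        "-sOutputFile=$2" "$1"
}
```

For example, pdf_to_images INPUT.pdf OUTPUT-%03d.png renders each page to a numbered PNG at 300 dpi.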
GUI Tools
- PDFMod - Linux/Unix - FOSS - utilizing Poppler.
Merge multiple PDF files
pdfunite
Platforms: Linux, MacOS X [not confirmed]
Requirements: Poppler
Merging more than one PDF into a single document can be done with pdfunite, which comes with Poppler.
pdfunite INPUT01.pdf INPUT02.pdf OUTPUT.pdf
Note: pdfunite does not alert you if the output pdf already exists and will happily clobber it!
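To guard against accidental overwrites, the call can be wrapped in a check. This is a minimal sketch; safe_unite is a hypothetical name of my own, and it simply refuses to run when the last argument (the output file) already exists.

```shell
#!/bin/sh
# safe_unite INPUT1.pdf INPUT2.pdf [...] OUTPUT.pdf
# Hypothetical wrapper: same arguments as pdfunite, but refuses to
# overwrite an existing output file.
safe_unite() {
    if [ $# -lt 3 ]; then
        echo "usage: safe_unite INPUT1.pdf INPUT2.pdf [...] OUTPUT.pdf" >&2
        return 2
    fi

    # Walk the argument list; afterwards $output holds the last argument.
    for output in "$@"; do :; done

    if [ -e "$output" ]; then
        echo "refusing to overwrite existing file: $output" >&2
        return 1
    fi

    pdfunite "$@"
}
```

The check and the merge are two separate steps, so this only protects against mistakes, not against another process creating the file in between.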
pdftk
Platforms: Linux, MacOS X, Windows
Requirements: pdftk
pdftk will do the job as well, but its syntax is a bit more verbose.
pdftk INPUT01.pdf INPUT02.pdf cat output OUTPUT.pdf
ImageMagick
Platforms: Linux, Unix, MacOS X, Windows
Requires: ImageMagick, Ghostscript
Disclaimer: Only use ImageMagick for PDFs consisting entirely of images. Compound documents will end up being converted to one image per page.
ImageMagick is not really efficient at joining PDFs, but it allows images to be downsampled on the fly.
Note: According to my tests, not using -compress and -density will create a PDF with embedded TIFF images. If the original PDF was compressed in a JPEG format, this may result in a file larger than the original it was converted from.
convert -compress jpeg -quality 90 -density 300x300 -adjoin INPUT01.pdf INPUT02.pdf OUTPUT.pdf
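Since a conversion can end up larger than its source, it is worth comparing file sizes afterwards. A minimal sketch, using only wc from the standard tools; the function name is_smaller is hypothetical.

```shell
#!/bin/sh
# is_smaller ORIGINAL CONVERTED
# Hypothetical helper: prints both sizes and succeeds (exit status 0)
# only if CONVERTED is strictly smaller than ORIGINAL.
is_smaller() {
    # $(( ... )) strips the padding some wc implementations add.
    orig_size=$(($(wc -c < "$1")))
    new_size=$(($(wc -c < "$2")))
    echo "original: $orig_size bytes, converted: $new_size bytes"
    [ "$new_size" -lt "$orig_size" ]
}
```

It can be chained after a conversion, for example is_smaller INPUT01.pdf OUTPUT.pdf || echo "no size gain".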
GUI Tools
- PDF Chain - Linux/Unix - FOSS - Frontend for pdftk
- PDFTK Builder - Windows - FOSS - Frontend for pdftk
- GUIPDFTK - Linux, Windows - FOSS - Frontend for pdftk
- PDF Split and Merge - Linux, MacOS, Windows - FOSS
- PDFMod - Linux/Unix - FOSS - utilizing Poppler
Rotate pages
pdftk
Platforms: Linux, MacOS X, Windows
Requirements: pdftk
pdftk is the easiest tool I have found for this purpose. Not only can it rotate all pages in the PDF, it can rotate individual pages as well.
Rotation in pdftk is either in absolute values or relative to the current page orientation.
- N 0° (North; absolute)
- E 90° (East; absolute)
- S 180° (South; absolute)
- W 270° (West; absolute)
- L -90° (Left; relative)
- R +90° (Right; relative)
- D +180° (Down; relative)
Set all pages to a rotation of 90°. The cat operation expects a page range, here from the first page (1) to the last page (end), followed by the rotation instruction, in this case E for East.
pdftk INPUT01.pdf cat 1-endE output OUTPUT.pdf
Alternatively, L or R can be used to rotate relative to the current orientation.
pdftk INPUT01.pdf cat 1-endL output OUTPUT.pdf
Assuming a PDF document with 4 pages where every page requires a different rotation, the following command will do the trick.
pdftk INPUT01.pdf cat 1E 2S 3W 4N output OUTPUT.pdf
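For longer documents the per-page list can be generated instead of typed. The following sketch builds the page/rotation pairs from a list of rotation codes, one per page starting at page 1. The function name rotation_spec is hypothetical, and it assumes the single-letter codes listed above; check which codes your pdftk version accepts.

```shell
#!/bin/sh
# rotation_spec ROT [ROT ...]
# Hypothetical helper: turn one rotation code per page into the
# page/rotation pairs that pdftk's cat operation expects.
rotation_spec() {
    page=1
    spec=
    for rot in "$@"; do
        # Accept only the single-letter codes listed in this article.
        case "$rot" in
            N|E|S|W|L|R|D) ;;
            *) echo "invalid rotation: $rot" >&2; return 2 ;;
        esac
        # Append "<page><rotation>", separated by a space after the first.
        spec="$spec${spec:+ }$page$rot"
        page=$((page + 1))
    done
    echo "$spec"
}
```

Used as pdftk INPUT01.pdf cat $(rotation_spec E S W N) output OUTPUT.pdf, with the substitution left unquoted so that each pair becomes its own argument.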
GUI Tools
- PDF Chain - Linux/Unix - FOSS - Frontend for pdftk
- PDFTK Builder - Windows - FOSS - Frontend for pdftk
- GUIPDFTK - Linux, Windows - FOSS - Frontend for pdftk
- PDF Split and Merge - Linux, MacOS, Windows - FOSS