Extract pages from a PDF file in Ubuntu 10.10

Yesterday I was handed an odd task. One of the senior members of my team, and a really amazing person I must say, e-mailed me a few PDFs of Linux Journal from the past months and asked if I could extract the troubleshooting articles from them and compile them into a single PDF that we can keep for future reference. It was also needed because he had promised the other team to do this in return for the PDFs he gets from their subscription :)

So I began my search for tools and ways to do this in Ubuntu 10.10. Yes, for the past 3-4 weeks Ubuntu has been my main operating system, unlike earlier, when I always kept one Windows machine with me. This time I thought, let's move completely to Linux without taking any help whatsoever from Windows, and let me tell you, it's been going really great, even better than Windows. But more on this later.

A quick Google search revealed that there are a number of ways to extract a range of pages from PDF files. The main article, which explained three methods and a handy little script, was on Linux Journal, and it prompted me to look at all three methods in detail.

There are PDF-related toolkits for extracting pages from a PDF, you can use Ghostscript directly from the command line, and there are graphical applications as well. So I decided to put them all together here.

First: Using poppler-tools (packaged as poppler-utils on Ubuntu) and psutils. You can extract a range of pages from a larger PDF file using these tools. For example, to extract pages 18-22 of the PDF file one_big_file.pdf, you could use the following command:

$ pdftops one_big_file.pdf - | psselect -p18-22 | ps2pdf - new_file_name.pdf

The pdftops command converts the PDF file to PostScript, psselect selects the relevant pages from the PostScript, and ps2pdf converts the selected PostScript into a new PDF file.
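To make the three stages easier to follow, here is a rough sketch that runs them one at a time with intermediate files instead of a pipeline. The helper name extract_via_ps and the temporary file names are my own placeholders, not part of poppler or psutils, and the sketch assumes pdftops, psselect and ps2pdf are installed.

```shell
# extract_via_ps FIRST LAST INPUT.pdf OUTPUT.pdf
# Hypothetical helper: same stages as the one-line pipeline, one per step.
# The body is wrapped in ( ) so it runs in a subshell and its variables
# do not leak into your interactive shell.
extract_via_ps() (
    first=$1
    last=$2
    input=$3
    output=$4

    tmpdir=$(mktemp -d) || exit 1

    # Stage 1: PDF -> PostScript
    pdftops "$input" "$tmpdir/all.ps" &&
    # Stage 2: keep only the requested page range
    psselect -p"$first-$last" "$tmpdir/all.ps" "$tmpdir/selected.ps" &&
    # Stage 3: PostScript -> PDF
    ps2pdf "$tmpdir/selected.ps" "$output"
    status=$?

    rm -rf "$tmpdir"
    exit "$status"
)
```

With this defined, extract_via_ps 18 22 one_big_file.pdf new_file_name.pdf should do the same job as the pipeline, just more slowly because it writes the intermediate PostScript to disk.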

Second: Using the pdftk toolkit. For example, to extract pages 18-22 from a big PDF file:

Splitting pages from one big file:

 $ pdftk A=one_big_file.pdf cat A18-22 output new_file_name.pdf

Joining pages into one big file:

$ pdftk file1.pdf file2.pdf cat output single_big_file.pdf
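Splitting and joining can also be done in one pdftk call by giving each input file its own handle, which is close to what my task needed: one page range from each of several issues, compiled into one file. The sketch below is only an illustration; the function name compile_articles, the file names and the page ranges are all made up, so replace them with your own.

```shell
# compile_articles OUTPUT.pdf
# Hypothetical helper: takes one page range from each of three input
# files (placeholder names and ranges) and writes them to OUTPUT.pdf.
compile_articles() {
    pdftk A=issue_jan.pdf B=issue_feb.pdf C=issue_mar.pdf \
          cat A18-22 B40-44 C31-36 \
          output "$1"
}
```

The pages land in the output in the order the ranges are listed after cat, so you can reorder the articles simply by reordering the ranges.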

For more options, such as attaching files and filling forms, see the pdftk documentation.

Third: Using Ghostscript. Unlike pdftk, Ghostscript is installed nearly everywhere, and the first method already relies on it anyway through ps2pdf. The command looks like this:

 $ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
      -dFirstPage=18 -dLastPage=22 \
      -sOutputFile=new_file_name.pdf one_big_file.pdf

Merging files with Ghostscript

 $ gs -q -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
      -sOutputFile=one_big_file.pdf file1.pdf file2.pdf file3.pdf

When using Ghostscript to combine PDF files, you can add any PDF-related option to the command line. For example, you can compress the file, target it to an eBook reader, or encrypt it. See the Ghostscript documentation for more information.
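As one example of such an option, the sketch below merges files while asking pdfwrite for its /ebook preset through -dPDFSETTINGS, which trades some image quality for a smaller file. The function name merge_for_ebook is just something I made up for illustration, and how much space you actually save depends on what is in your PDFs.

```shell
# merge_for_ebook OUTPUT.pdf INPUT1.pdf [INPUT2.pdf ...]
# Hypothetical helper: merges the inputs in the given order and applies
# Ghostscript's /ebook preset to reduce the size of the result.
merge_for_ebook() {
    output=$1
    shift
    gs -q -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dPDFSETTINGS=/ebook \
       -sOutputFile="$output" "$@"
}
```

For instance, merge_for_ebook small.pdf file1.pdf file2.pdf would write the merged result to small.pdf. The other presets (/screen, /printer, /prepress) can be swapped in the same place.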

Conclusion: In terms of processing speed and efficiency and, more importantly, the quality of the output file, the first method above is certainly the worst of the three. Converting the original PDF to PostScript and back to PDF (known as "refrying") is very unlikely to completely preserve advanced PDF features (such as transparency information, font hinting, overprinting information, color profiles, trapping instructions, etc.).

The third method uses only Ghostscript (which the first one uses anyway, because ps2pdf is nothing more than a wrapper script around a more or less complicated Ghostscript command line). The third method also preserves all the important PDF objects on your pages as they are, without any "roundtrip" conversions.

Little extra: The only drawback of the third method is that it's a longer and more complicated command line to type. But you can overcome that drawback if you save it as a bash function. Just put these lines in your ~/.bashrc file:

 function pdf-extract()
 {
     # this function uses 3 arguments:
     #   $1 is the first page of the range to extract
     #   $2 is the last page of the range to extract
     #   $3 is the input file
     # output file will be named "inputfile_pXX-pYY.pdf"
     gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
        -dFirstPage="${1}" \
        -dLastPage="${2}" \
        -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
        "${3}"
 }

Now, after starting a new bash session or sourcing ~/.bashrc, you only need to type the following:

 $ pdf-extract 22 36 inputfile.pdf

This will produce the file inputfile_p22-p36.pdf in the same directory as the input file.
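If you mistype the arguments, Ghostscript's own error messages are not always obvious, so a slightly more defensive variant may be worth having. The sketch below is my own assumption of what that could look like: the name pdf_extract_checked and the error messages are invented, and it only checks that the page numbers are positive integers in the right order and that the input file exists.

```shell
# pdf_extract_checked FIRST LAST INPUT.pdf
# Hypothetical variant of the function above with basic argument checks.
pdf_extract_checked() {
    # Both page numbers must consist of digits only and be at least 1.
    for n in "$1" "$2"; do
        case $n in
            ''|*[!0-9]*)
                echo "usage: pdf_extract_checked FIRST LAST INPUT.pdf" >&2
                return 2 ;;
        esac
        if [ "$n" -lt 1 ]; then
            echo "page numbers start at 1" >&2
            return 2
        fi
    done

    # The range must not be reversed.
    if [ "$1" -gt "$2" ]; then
        echo "first page ($1) is after last page ($2)" >&2
        return 2
    fi

    # The input file must exist.
    if [ ! -f "$3" ]; then
        echo "no such file: $3" >&2
        return 1
    fi

    # Same Ghostscript call as before, with the same output naming scheme.
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage="$1" -dLastPage="$2" \
       -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
       "$3"
}
```

It is called the same way, for example pdf_extract_checked 22 36 inputfile.pdf, and it does not check whether the last page actually exists in the document.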

For a graphic option

– Sandeep Sidhu