The installation process creates a program group for PDF2TXT on the Windows Start menu, containing choices to launch PDF2TXT, read the documentation for PDF2TXT, and uninstall PDF2TXT. It also creates a desktop shortcut with an associated hot key, so that PDF2TXT can be conveniently launched by pressing Control+Alt+Shift+P. Another shortcut is placed in the Send To folder so that a PDF may be viewed in PDF2TXT via the context menu in Windows Explorer.
By default, the PDF source is the folder c:\pdf2TXT\pdf. Any source may be chosen, however, and the program remembers the last one used.
Similarly, an edit box and associated button let you specify the target folder for converted files. These will have the same base name, but an extension of .txt instead of .pdf. The default target folder is c:\pdf2TXT\txt. Note that the PDF source may be either a file or folder, but the TXT target is always a folder.
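As an illustration of the naming rule just described, here is a minimal Python sketch. It is not the program's own code (PDF2TXT is written in PowerBASIC), and the function name target_path is hypothetical. It only shows how a target name can be derived from a source name: same base name, .txt extension, placed in the target folder.

```python
from pathlib import PureWindowsPath

def target_path(source_pdf: str, target_folder: str) -> str:
    """Return the .txt path in target_folder that corresponds to source_pdf."""
    # Base name of the PDF without its extension, e.g. "report"
    base = PureWindowsPath(source_pdf).stem
    # Same base name, .txt extension, inside the target folder
    return str(PureWindowsPath(target_folder) / (base + ".txt"))

print(target_path(r"c:\pdf2TXT\pdf\report.pdf", r"c:\pdf2TXT\txt"))
# prints c:\pdf2TXT\txt\report.txt
```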
OCR is a much slower and more error-prone process, but it may be the best option when the usual methods do not work. This technique uses Google Tesseract, the best open source OCR available, though it is not as good as commercial OCR packages. Due to technical issues, there is no simple way to abort an OCR process that has already started. You can do so, however, by launching another copy of PDF2TXT, which clears the deck during its startup phase.
Another checkbox lets you additionally produce a .htm target file in HTML format. This uses a different conversion technique, originally posted at
http://EmpowermentZone.com/pdf2htm.zip
This may be worth trying if the .txt result is unsatisfactory. It may also be useful for webmasters who want to post an HTML alternative to a PDF. Unfortunately, the conversion translates visual aspects of the PDF such as fonts, but not structural elements such as headings. To further increase conversion options, a different technique is also used for producing the .txt file with this checkbox, using the PDFToText.exe utility that is also separately available at
http://www.foolabs.com/xpdf/home.html
You can navigate the viewing area with standard Windows keystrokes, e.g., Control+Home or Control+End to go to the top or bottom of text. Control+F lets you search forward for a string of characters, and Control+Shift+F lets you search backward. F3 searches for the same string again in the forward direction, and Shift+F3 searches again backward. Control+G lets you go to a percent completion point through the file being viewed. Control+K sets a bookmark for the file, Control+Shift+K clears it, and Alt+K goes to it.
You can press Shift with arrow keys to select text or Control+A to select all. Alternatively, you can press F8 to set the starting point of a selection, navigate to the ending point desired, and then press Shift+F8 to select the text between these points.
Press Control+C to copy selected text to the clipboard. Alternatively, press Control+Shift+C, or Alt+F8, to copy and append to the clipboard, adding to rather than replacing its existing text. A form feed or page break character (ANSI code 12) will separate each clip copied there. Control+F8 is a shortcut that copies all text in the viewing area without having to select it first, equivalent to Control+A followed by Control+C.
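If you later want to process appended clips in a script, the form feed separator makes them easy to split apart. The following Python sketch illustrates the idea only. The function name append_clip is hypothetical, and it is an assumption that no separator is added when the clipboard starts out empty.

```python
FORM_FEED = chr(12)  # ANSI code 12, the page break character

def append_clip(existing: str, new_clip: str) -> str:
    """Return clipboard text after appending new_clip to existing text."""
    if not existing:
        # Assumption: an empty clipboard gets no leading separator
        return new_clip
    return existing + FORM_FEED + new_clip

clipboard = ""
for clip in ["first passage", "second passage", "third passage"]:
    clipboard = append_clip(clipboard, clip)

# Splitting on the form feed recovers the individual clips
print(clipboard.split(FORM_FEED))
```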
If you invoke the Open button and choose a PDF from its sub dialog, the text of the PDF will be placed in the viewing area, and keyboard focus will go there. If you invoke the Select button to choose a PDF folder instead of a file, its list of PDFs will be shown. A status bar at the bottom of the dialog indicates the current position in the viewing area.
This feature lets you easily explore the PDFs in a folder, one after another. Initially, you might display a list of files by pressing Alt+L when the PDF source is a folder. You can then arrow down through the list until you find a PDF you want to view. At that point, press Alt+L to view the file. When you want to continue exploring the folder list again, press Alt+L to return to it at the position of the file you last viewed.
The check box labeled "Move PDF when done" will transfer a PDF to a subfolder called "Done" after a successful conversion. This is a subfolder of the PDF2TXT program folder, with a default location of c:\pdf2TXT\done. The benefit of this check box is that PDF files are stored away for backup after they have been converted to text. This setting is unchecked by default.
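The effect of this check box can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the program's actual code: the function name move_when_done and its parameters are hypothetical, and the sketch simply moves the file only when the conversion succeeded.

```python
import shutil
from pathlib import Path

def move_when_done(pdf_path: Path, done_folder: Path, converted_ok: bool) -> Path:
    """Move pdf_path into done_folder after a successful conversion.

    Returns the path where the PDF ends up.
    """
    if not converted_ok:
        # A failed conversion leaves the PDF in the source folder
        return pdf_path
    # Create the Done folder if it does not exist yet
    done_folder.mkdir(parents=True, exist_ok=True)
    destination = done_folder / pdf_path.name
    shutil.move(str(pdf_path), str(destination))
    return destination
```

With the default folders, a call might look like move_when_done(Path(r"c:\pdf2TXT\pdf\report.pdf"), Path(r"c:\pdf2TXT\done"), True).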
The checkbox labeled "Replace TXT if found" determines whether to skip a conversion if a corresponding target file already exists. If you do not check the setting to move source files when done, you may want to check this setting so that unnecessary time is not spent on repeatedly converting PDF files left in the source folder, since they then will be skipped if corresponding target files already exist. This setting is checked by default.
The Append check box determines whether conversion details are appended to the existing log file or a new log file is created each time a conversion is run. This setting is checked by default so that previous information is not lost. A section below further describes the log file.
Press Escape if you need to abort a batch conversion of many files that is taking too long! Note that this program is relatively quick, however, compared to other available methods of converting PDF files to text. Moreover, its batch mode feature lets you run conversions unattended.
The source for a conversion is treated differently if the viewing area has focus. If viewing a list of PDFs in a folder or on a web page, then PDF2TXT regards the source as the file name on the current line (the one containing the caret). Thus, you can cursor to a PDF of interest and press Enter to convert it to text. If successfully converted, PDF2TXT assumes you may also want to examine its content in the viewing area, so a Look command is automatically performed as well (see below). If there is a conversion error, however, PDF2TXT leaves the error message in the viewing area. If you have been examining a list of PDFs and decide you want to convert them all rather than a single file, navigate to the top line of the viewing area that lists the number of PDFs in the list, and then press Enter.
If the source edit box already specifies what you want to view, or a path is easy to type into it, then the Look button is quicker to use than the Open or Select sub dialog. Activating the Look button takes the current source specification and goes to a view of either the text of a source file or the list of a source folder, putting focus in the view area so you can read the information.
The Defaults button restores the default configuration settings of PDF2TXT. You can use it to return to the initial folders and checkbox settings.
The Explorer button lets you browse the source, target, or done folder with Windows Explorer. It allows you to examine files that either have been converted or would not convert--thus needing other approaches to access their content.
The Quit button closes PDF2TXT. Alt+F4 does the same thing.
The Help button displays this complete documentation in the default web browser. For context-sensitive help on a particular control, press F1 when it has focus. Hence, you can tab through the dialog and press F1 on each control to learn how to use it.
The Look button works with a URL source similarly to a local file or folder. For example, you can press Alt+L to view a list of PDFs on a web page. The toggling feature, described above, is also supported, allowing you to consecutively examine the PDFs linked to a web page. If you view a PDF on the Internet, PDF2TXT will automatically download a copy to the PDF subfolder of the program folder, e.g., to
c:\pdf2txt\pdf
The Convert button also works with a URL source. Thus, you can easily convert all PDFs on a web page with a single batch operation!
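How PDF2TXT finds the PDFs on a web page is not documented here. As a rough illustration of the idea, the following Python sketch collects links whose path ends in .pdf from the HTML of a page and resolves them against the page address. The names PdfLinkParser and list_pdf_links, and the example.com addresses, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of links that point to .pdf files."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the path part, so a query string does not hide the extension
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))

def list_pdf_links(html: str, base_url: str) -> list:
    """Return the PDF links found in html, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="docs/a.pdf">A</a> <a href="index.html">Home</a>'
print(list_pdf_links(page, "http://example.com/page/"))
```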
There is a choice to view the log file in the PDF2TXT program group off the Start Menu. You can also get to the file via the Explore button of the PDF2TXT program, choosing the Done folder to navigate with Windows Explorer. Additionally, you can open the file in another application through its direct path (default settings):
c:\pdf2txt\done\log.txt
If the log file grows larger than you want, simply delete it or uncheck the setting that configures PDF2TXT to append to an existing log file. Each use of the Convert button would then generate a new log file. This information is more detailed than the results placed in the viewing area.
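The difference between appending to the log and starting a new one can be sketched in Python as follows. This only illustrates the two behaviors. The function name write_log is hypothetical, and the actual format of log entries is not described in this document.

```python
from pathlib import Path

def write_log(log_path: Path, lines: list, append: bool) -> None:
    """Write log lines, either appending to or replacing the existing file."""
    # "a" keeps previous information; "w" starts a new log file
    mode = "a" if append else "w"
    with open(log_path, mode, encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")
```

With append set to True, each run adds to the end of the file; with append set to False, each run leaves only its own lines in the file.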
When the .pdf extension is associated with the PDF2TXT program (explained in another section), Windows Explorer or Internet Explorer will open a PDF file by launching PDF2TXT with the name of the PDF passed as a parameter on the command line. If PDF2TXT is launched with more than one command line parameter, however, the program will assume you want to run it in console rather than GUI mode. The syntax for parameters is described as follows. If a PDF source file, folder, or URL is specified, it must be the first parameter. If a TXT target folder is specified, it must be the second parameter. The source or target must be enclosed in quotes if its name contains spaces.
All parameters besides source and target names begin with a space and forward slash (/), followed by the hot key letter in the dialog corresponding to the setting affected. A trailing plus (+) sign in the parameter indicates a status of On, and a minus (-) sign indicates Off. The plus sign can also be omitted to indicate On. Capitalization does not matter. Here is a list of parameters:
a = Automatic, console mode (use /a- to force GUI mode with multiple parameters)
i = Include subfolders
m = Move PDF when done
r = Replace TXT if found
d = Default settings (no /d- is defined)
g = Grab URL as source from Internet Explorer (no /g- is defined)
For example, to convert all files using default settings except for the Move setting, you could enter:
pdf2txt /d /m
To use current settings except grab a URL as source, enter:
pdf2txt /a /g
To convert files from a temporary folder to the current folder, enter:
pdf2txt "c:\temp files" .
To do the same, but in GUI rather than console mode, enter:
pdf2txt "c:\temp files" . /a-
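To make the parameter syntax concrete, here is a Python sketch of how such a command line could be interpreted. It is an illustration under assumptions, not the program's own parser: the names KNOWN_SWITCHES, parse_args, and console_mode are hypothetical, the parameters are assumed to be already split with quotes removed, and the sketch does not reject the undefined forms /d- and /g-.

```python
KNOWN_SWITCHES = {"a", "i", "m", "r", "d", "g"}

def parse_args(args: list):
    """Split parameters into (source, target, switches).

    switches maps a lowercase letter to True (On) or False (Off).
    """
    positional = []
    switches = {}
    for arg in args:
        if not arg.startswith("/"):
            positional.append(arg)
            continue
        body = arg[1:].lower()  # capitalization does not matter
        if body.endswith("-"):
            letter, state = body[:-1], False
        elif body.endswith("+"):
            letter, state = body[:-1], True
        else:
            letter, state = body, True  # an omitted plus sign means On
        if letter not in KNOWN_SWITCHES:
            raise ValueError("unknown parameter: " + arg)
        switches[letter] = state
    # Source is the first name given, target the second
    source = positional[0] if positional else None
    target = positional[1] if len(positional) > 1 else None
    return source, target, switches

def console_mode(args: list) -> bool:
    """Console mode applies with more than one parameter unless /a says otherwise."""
    _, _, switches = parse_args(args)
    return switches.get("a", len(args) > 1)

print(parse_args(["c:\\temp files", ".", "/a-"]))
print(console_mode(["c:\\temp files", ".", "/a-"]))
```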
For greater console mode convenience, another version of PDF2TXT, having the abbreviated name p2t.exe, is also available in the program folder. This version only runs in console mode, whether zero, one, or more parameters are specified. It uses "standard output" to display conversion results. The shorter executable name means fewer characters to type on the command line. For example, to run a batch conversion in console mode using the current settings of PDF2TXT, you could simply enter:
p2t
Like DOS commands generally, the above assumes that you have either made c:\pdf2txt the current directory or included it in a PATH statement.
When the .pdf extension is associated with PDF2TXT, an application such as Windows Explorer when opening a file, or Internet Explorer after downloading a file, will pass the name of the PDF as a command-line parameter to pdf2txt.exe. When the program is launched in this way, it automatically invokes the Look button, placing text of the PDF in the viewing area and putting keyboard focus there.
An alternate text extraction technique is tried if the first one fails, using the GetText.exe utility that is also available separately at
http://www.kryltech.com
The file GetText.txt in the PDF2TXT program folder contains the license for this utility.
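The idea of trying one extraction technique and falling back to another can be sketched in Python as shown below. This is a generic illustration, not how PDF2TXT calls its utilities: the function name extract_with_fallback is hypothetical, each extractor is assumed to be a function that takes a PDF path and returns text, and treating an empty result as a failure is also an assumption.

```python
def extract_with_fallback(pdf_path: str, extractors: list):
    """Try each (name, function) pair in order; return (name, text) of the first success."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:
            # Remember why this technique failed, then try the next one
            errors.append(name + ": " + str(exc))
            continue
        if text.strip():
            return name, text
        # Assumption: a result with no visible text counts as a failure
        errors.append(name + ": no text returned")
    raise RuntimeError("all techniques failed: " + "; ".join(errors))
```

A caller would pass the techniques in order of preference, for example a list of two pairs where the second is only used when the first raises an error or returns nothing.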
The OCR is done by incorporating the open source PDF2OCR package, available at
http://EmpowermentZone.com/pdf2ocr.zip
Some status messages are spoken with the JAWS, System Access, or Window-Eyes screen reader if currently active. These direct speech messages are produced with APIs via a component of the SayTools library, which is also available separately at
http://EmpowermentZone.com/saysetup.exe
The PowerBASIC code of PDF2TXT itself (but not the commercial libraries used) is open source under the GNU Lesser General Public License (LGPL), documented at
http://gnu.org
This Windows program is the successor to my first version of PDF2TXT, developed several years ago as a DOS-based, command-line only utility. Ideas and feedback from the discussion list
ProgrammingBlind@FreeLists.org
have aided the design and testing of PDF2TXT. The latest version is available at the same address,
http://EmpowermentZone.com/p2tsetup.exe
You can download it with the Elevate Version hotkey, F11. This checks whether a newer version is available, and offers to install it.
Jamal Mazrui
jamal@EmpowermentZone.com