This page is obsolete!

Please visit the new wiki

Procmail can be used to analyze the content of PDF files, and organize them based on the content. Although procmail is designed to filter email, its method for email filtering can be adapted to offer powerful automated PDF organization, similar to what iTunes does for music.

It works by injecting phony emails generated from a list of PDF files. The procmail script runs pdfinfo to extract each file's metadata and adds it to the email header. It then runs pdftotext to extract the text of the PDF, which becomes the body of the phony email. The result continues to masquerade as an email message while procmail rules score it and act on it.
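The shape of such a phony message can be sketched outside procmail. The function below is only an illustration, not part of the original script: the name pdf_to_mail is hypothetical, and the PDFINFO and PDF2TEXT variables are assumed to point at the poppler tools (they default to whatever is on the PATH).

```shell
#!/bin/sh
# Sketch: print the phony message for one PDF on stdout.
# PDFINFO and PDF2TEXT are assumptions; override them if the tools live elsewhere.
PDFINFO=${PDFINFO:-pdfinfo}
PDF2TEXT=${PDF2TEXT:-pdftotext}

pdf_to_mail() {
    printf 'From document\n'          # marker line the filter rules look for
    printf 'Filename: %s\n' "$1"      # path of the PDF being filed
    "$PDFINFO" "$1"                   # "Key: value" metadata lines join the header
    printf '\n'                       # a blank line separates header from body
    "$PDF2TEXT" "$1" -                # "-" makes pdftotext write to stdout
}
```

Running `pdf_to_mail statement.pdf | less` shows roughly what the scoring rules get to see for that file.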

This Linux command finds every PDF file under the current directory and passes each one, wrapped in a stub message, to a procmail script:

find . -type f -iname '*.pdf' \
   | while IFS= read -r filename; do printf 'From document\nFilename: %s\n\n' "$filename"; done \
   | formail -s procmail -m docfiler.prc
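Before handing anything to formail, it can help to look at the stub messages on their own. The following is a minimal sketch of the first two stages of the pipeline above wrapped in a function; the name emit_stubs is made up for this example.

```shell
#!/bin/sh
# Sketch: print one stub message per PDF found under the directory given as $1.
emit_stubs() {
    find "$1" -type f -iname '*.pdf' | sort | while IFS= read -r filename; do
        # "From document" opens each message; the trailing blank line ends
        # the header, so formail -s can split the stream into messages.
        printf 'From document\nFilename: %s\n\n' "$filename"
    done
}

# Dry run: inspect the stubs without running procmail.
# emit_stubs . | less
```

Note that filenames containing newlines are not handled by this line-based loop.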

This is an example of the docfiler.prc script, which is written to detect and file a bank statement.

# docfiler.prc


PDFINFO  = /usr/bin/pdfinfo
PDF2TEXT = /usr/bin/pdftotext
SED      = /bin/sed

# Only handle the phony messages generated for PDF files.  The \/
# token stores the filename in $MATCH.
:0
* ^From document
* ^Filename: \/.*\.pdf
{
  NAME=$MATCH
  INFO=`$PDFINFO "$NAME" | tr '\n' '+'`
  CONTENT=`$PDF2TEXT "$NAME" /dev/stdout`

  # extract PDF metadata, and make it part of an email header.
  :0 hfw
  | $SED -e "/Filename:.*pdf/a\
$INFO" | tr '+' '\n'

  # extract PDF text, and make it part of an email body.
  :0 bfw
  | echo $CONTENT
}

# Detect bank statements and copy them to a folder named
# "bank_statement".  8 points if "credit card" or "bank" is found, 6
# points each if "account" or "balance" appears in the text, etc.  The
# score starts at -15, so once more than 15 points are collected the
# total is positive, the PDF passes as a bank statement, and it is
# copied.  The i flag ignores write errors, since cp never reads the
# message piped to it.
:0 i
* -15^0
*   1^0 H  ?? ^Filename: \/.*\.pdf
*   3^0 H  ?? ^Producer: *bank
*   8^0 HB ?? (credit card|bank)
*   6^0 HB ?? account
*   6^0 HB ?? balance
*   3^0 HB ?? min
*   9^0 HB ?? due date
*   3^0 HB ?? closing date
| cp "$MATCH" $HOME/bank_statement/

:0 :
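To tune the weights without running procmail, the bank-statement score can be approximated with grep on a saved copy of a phony message. This is a rough sketch under stated assumptions, not procmail's own algorithm: the function names add_if and bank_score are invented here, header-only rules are simply checked against the whole file, and grep's extended regular expressions stand in for procmail's.

```shell
#!/bin/sh
# add_if WEIGHT PATTERN FILE: print WEIGHT if PATTERN occurs in FILE, else 0.
# Matching is case-insensitive, like procmail's default.  With the ^0
# exponent only the first match of a rule counts, so one test is enough.
add_if() {
    if grep -Eiq -- "$2" "$3"; then echo "$1"; else echo 0; fi
}

# bank_score FILE: print the approximate score of a saved phony message.
bank_score() {
    f=$1
    score=-15                          # same starting offset as the recipe
    for rule in \
        '1:^Filename: .*\.pdf' \
        '3:^Producer: *bank' \
        '8:(credit card|bank)' \
        '6:account' \
        '6:balance' \
        '3:min' \
        '9:due date' \
        '3:closing date'
    do
        weight=${rule%%:*}             # text before the first colon
        pattern=${rule#*:}             # text after the first colon
        score=$((score + $(add_if "$weight" "$pattern" "$f")))
    done
    echo "$score"
}
```

A positive result corresponds to the recipe firing. For example, a message that matches only the Filename, "account", "balance" and "due date" rules scores -15 + 1 + 6 + 6 + 9 = 7, so it would be filed as a bank statement.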