Odd as it may seem, we have only just set up a proper document management system at the office. That is not really a fair thing to say - we have management for all the documents we ever generate, and a way of sensibly recording, filing and referencing the paper documents we get. We already scanned some things (mainly cheques). What we have added now is a generic scan and store system that covers letters we get, scans of cheques, and purchase invoices/receipts.
The key part of any system to manage paper documents is a scanner, and this is where things get interesting. There are many cheap USB flatbed scanners, and they work. But they are not really up to the job. So I went for a sheet feeding scanner.
First mistake was getting an Epson GT-S85N. It claims to be a network scanner, but is in fact a USB scanner with an Edimax network-attached USB host adapter. It is no different to a Windows-based USB scanner with a different cable. That has gone back.
What we ended up getting, based on advice from a company called Response Technical Services, was the Canon ScanFront 300. This really is a network scanner.
It is a fast, duplex, sheet feeding scanner that will email, FTP, or write to a shared Windows folder. With a bit of tweaking you can make it just sit there with a number of "job buttons" on its full colour touch screen. You drop in a document, press the button, job done.
To be honest, the screen is overkill. It would be much cheaper if they had not used windows internally (yuck) or had the big screen. A few simple buttons for pre-set jobs would have done. Also, I see no IPv6 address, which is odd, and annoying, and against our usual policy on purchasing equipment. I'll ask Canon about that :-)
I am, however, impressed with how simple it is to use. I am also impressed that it has a half decent OCR built in, embedding searchable text in the PDF it sends.
I am also impressed with some of the details, like the way it will de-skew an image, and crop cleanly, and skip blank pages, and so on. I can scan business cards, and cheques, and even plastic cards if I want. I can scan horrid thermal receipts. And it has no trouble with 30 page contracts. It pretty much "just works", which is all you can ask for really.
Initially I made our systems use the built-in OCR (pdftotext is the command on Linux to extract the text from the PDF). This was pretty good, but some tests showed that tesseract was actually better. The trick was to use pdfimages to extract the scans from the PDF and run them through tesseract without any rescaling (rescaling is what gs would have done, and that is what I tried first). The resulting details, including the OCR text, are stored in a MySQL database and linked in to our accounts system, which is what tracks documents we create. One annoying small detail was finding how many pages a PDF has, and eventually I used pdftk.
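For anyone wanting to try the same pipeline, here is a rough sketch of how it might look in Python. This is not our actual back end code, just an illustration: the function names are made up, it assumes pdfimages (a poppler version with the -png option), tesseract and pdftk are on the PATH, and it leaves out the database side entirely. Note that pdfimages extracts every embedded image, so the number of images need not match the number of pages.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def parse_page_count(dump_data: str) -> int:
    """Pull the page count out of `pdftk FILE dump_data` output."""
    match = re.search(r"^NumberOfPages:\s*(\d+)", dump_data, re.MULTILINE)
    if match is None:
        raise ValueError("no NumberOfPages line in pdftk output")
    return int(match.group(1))


def page_count(pdf: Path) -> int:
    """Ask pdftk how many pages the PDF has."""
    result = subprocess.run(
        ["pdftk", str(pdf), "dump_data"],
        check=True, capture_output=True, text=True,
    )
    return parse_page_count(result.stdout)


def ocr_pdf(pdf: Path) -> list:
    """Extract the embedded scans untouched and OCR each one with tesseract."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "scan"
        # pdfimages writes scan-000.png, scan-001.png, ... with no rescaling.
        subprocess.run(["pdfimages", "-png", str(pdf), str(prefix)], check=True)
        for image in sorted(Path(tmp).glob("scan-*.png")):
            # Using "stdout" as the output base makes tesseract print the text.
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout)
    return texts
```

Something like `ocr_pdf(Path("letter.pdf"))` would then give one block of text per extracted image, and `page_count(Path("letter.pdf"))` the number of pages (the file name is only an example).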
I also allowed upload of PDFs in our back end systems so we don't have to print stuff, scan it, and shred it! However, for purchase invoices we have few enough that box files and paper copies make sense for the simple logistics of handling a VAT inspection. We should, however, be able to scan and shred lots of other paperwork we get.
The other nice touch that I added to our back end systems was using zbarimg to pick out any barcodes on documents and store those too. Can be useful. It will allow us to put stickers on documents that are keyed in, and use that to automatically tie in to the right record when later scanned, etc. We'll have to work out the details. I also tried dmtxlib to extract datamatrix barcodes, but that is unbearably slow and I may have to find something else or not bother. Shame, as I like IEC16022 Datamatrix barcodes.
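The barcode step can be sketched along the same lines. Again this is only an illustration with invented function names: it assumes zbarimg is installed, that it is given an image file rather than the PDF, and that its default output of one "SYMBOLOGY:data" line per barcode is what we want to store.

```python
import subprocess
from pathlib import Path


def parse_zbarimg(output: str) -> list:
    """Turn zbarimg's 'SYMBOLOGY:data' lines into (symbology, data) pairs."""
    found = []
    for line in output.splitlines():
        # Split on the first colon only, as the data itself may contain colons.
        symbology, sep, data = line.partition(":")
        if sep and data:
            found.append((symbology, data))
    return found


def scan_barcodes(image: Path) -> list:
    """Run zbarimg on one image and return any barcodes it reports."""
    result = subprocess.run(
        ["zbarimg", "--quiet", str(image)],
        capture_output=True, text=True,
    )
    # zbarimg exits with status 4 when no barcode was detected; that is not an error here.
    if result.returncode not in (0, 4):
        raise RuntimeError(result.stderr.strip() or "zbarimg failed")
    return parse_zbarimg(result.stdout)
```

The pairs that come back could then be stored alongside the OCR text and matched against whatever is printed on the stickers.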
I am surprised they did not include any sort of digital signing and timestamp in the scanner. It would have been simple to do and would have provided a way to prove the scan was not later edited. I wonder if there is a web service to do that, and if not, we could make one - a simple API that is passed a SHA1 hash and returns a signature...
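To give a feel for what such a service might do, here is a toy sketch. It is purely hypothetical: the function names are invented, and it uses an HMAC with a secret key as a stand-in for a proper public-key signature, so only whoever holds the key can check it. The idea is simply that the client sends the SHA1 of the scan, and the service binds that hash to a timestamp.

```python
import hashlib
import hmac
import time


def sha1_hex(data: bytes) -> str:
    """What the client would compute over the scanned PDF."""
    return hashlib.sha1(data).hexdigest()


def sign(digest_hex: str, key: bytes, timestamp: int) -> str:
    """What the service would return: a MAC over the timestamp and the hash."""
    message = f"{timestamp}:{digest_hex}".encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(digest_hex: str, key: bytes, timestamp: int, signature: str) -> bool:
    """Recompute the MAC and compare in constant time."""
    return hmac.compare_digest(sign(digest_hex, key, timestamp), signature)


if __name__ == "__main__":
    key = b"service secret (example only)"
    now = int(time.time())
    digest = sha1_hex(b"%PDF-1.4 example scan contents")
    signature = sign(digest, key, now)
    print(now, digest, signature, verify(digest, key, now, signature))
```

If the scan is edited later its SHA1 changes, so the stored timestamp and signature no longer verify against it.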
Anyway, all good fun...
Update: It does IPv6!