Monday, 11 March 2013

Document management

Odd as it may seem, we have only just set up a proper document management system at the office. That is not really a fair thing to say - we already have management for all the documents we generate ourselves, and a way of sensibly recording, filing and referencing the paper documents we get. We also scanned some things (mainly cheques). What we have added now is a generic scan and store system that covers letters we get, scans of cheques, and purchase invoices/receipts.

The key part of any system to manage paper documents is a scanner, and this is where things get interesting. There are many cheap USB flatbed scanners, and they work. But they are not really up to the job. So I went for a sheet feeding scanner.

First mistake was getting an Epson GT-S85N. It claims to be a network scanner, but is in fact a USB scanner with an Edimax network-attached USB host adapter. It is no different to a Windows-based USB scanner with a different cable. That has gone back.

What we ended up getting, based on advice from a company called Response Technical Services was the Canon ScanFront 300. This really is a network scanner.

It is a fast, duplex, sheet feeding scanner that will email or ftp or write to a shared windows folder. With a bit of tweaking you can make it just sit there with a number of "job buttons" on its full colour touch screen. You drop in a document, press the button, job done.

To be honest, the screen is overkill. It would be much cheaper if they had not used windows internally (yuck) or had the big screen. A few simple buttons for pre-set jobs would have done. Also, I see no IPv6 address, which is odd, and annoying, and against our usual policy on purchasing equipment. I'll ask Canon about that :-)

I am, however, impressed with how simple it is to use. I am also impressed that it has a half decent OCR built in, embedding searchable text in the PDF it sends.

I am also impressed with some of the details, like the way it will de-skew an image, and crop cleanly, and skip blank pages, and so on. I can scan business cards, and cheques, and even plastic cards if I want. I can scan horrid thermal receipts. And it has no trouble with 30 page contracts. It pretty much "just works", which is all you can ask for really.

Initially I made our systems use the built in OCR (pdftotext is the command on Linux to extract the text from the PDF). This was pretty good, but some tests showed that tesseract was actually better. The trick was to use pdfimages to extract the scans from the PDF and run them through tesseract without any rescaling (rescaling is what gs would have done, and is what I tried first). The resulting details, including the OCR text, are stored in a MySQL database and linked in to our accounts system, which is what tracks documents we create. One annoying small detail was finding how many pages a PDF has, and eventually I used pdftk.
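As a rough sketch of that pipeline (not the actual code we run): the function names here are made up for illustration, and it assumes pdfimages, tesseract and pdftk are installed and on the PATH. It relies on pdfimages writing its default PPM/PBM files, and on tesseract being able to read them directly.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def page_count_from_dump(dump_text):
    """Pull the page count out of the text printed by `pdftk FILE dump_data`."""
    match = re.search(r"^NumberOfPages:\s*(\d+)\s*$", dump_text, re.MULTILINE)
    if match is None:
        raise ValueError("no NumberOfPages line found")
    return int(match.group(1))


def count_pages(pdf_path):
    """Ask pdftk for the document metadata and read the page count from it."""
    result = subprocess.run(
        ["pdftk", str(pdf_path), "dump_data"],
        check=True, capture_output=True, text=True,
    )
    return page_count_from_dump(result.stdout)


def ocr_pdf(pdf_path):
    """Extract the embedded scans with pdfimages and OCR each one, unscaled."""
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir) / "page"
        # Writes page-000.ppm, page-001.pbm, ... at the original scan resolution.
        subprocess.run(["pdfimages", str(pdf_path), str(root)], check=True)
        texts = []
        # sorted() lists the images before tesseract adds its .txt files.
        for image in sorted(Path(workdir).glob("page-*")):
            out_base = image.with_suffix("")
            # tesseract appends ".txt" to the output base name itself.
            subprocess.run(["tesseract", str(image), str(out_base)], check=True)
            texts.append(Path(str(out_base) + ".txt").read_text())
        return texts
```

Calling count_pages("scan.pdf") and ocr_pdf("scan.pdf") on a file from the scanner would then give the page count and one block of text per extracted image, ready to store in the database.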

I also allowed upload of PDFs in our back end systems so we don't have to print stuff, scan it, and shred it! However, for purchase invoices we have few enough that box files and paper copies make sense for the simple logistics of handling a VAT inspection. We should, however, be able to scan and shred lots of other paperwork we get.

The other nice touch that I added to our back end systems was using zbarimg to pick out any barcodes on documents and store those too. Can be useful. It will allow us to put stickers on documents that are keyed in, and use that to automatically tie in to the right record when later scanned, etc. We'll have to work out the details. I also tried dmtxlib to extract datamatrix barcodes, but that is unbearably slow and I may have to find something else or not bother. Shame, as I like ISO/IEC 16022 Data Matrix barcodes.
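For the curious, the barcode step can be sketched roughly like this. The function names are hypothetical, and it assumes zbarimg prints one "TYPE:data" line per barcode (its default output format) and that each barcode's data fits on a single line.

```python
import subprocess


def parse_zbarimg_output(text):
    """Turn zbarimg's "TYPE:data" lines into (type, data) pairs."""
    found = []
    for line in text.splitlines():
        # Split on the first colon only, as the data may contain colons too.
        symbology, sep, data = line.partition(":")
        if sep and data:
            found.append((symbology, data))
    return found


def scan_barcodes(image_path):
    """Run zbarimg on one image and return the barcodes it reports."""
    result = subprocess.run(
        ["zbarimg", "-q", str(image_path)],
        capture_output=True, text=True,
    )
    # As I read the man page, exit status 4 just means no barcode was found.
    if result.returncode not in (0, 4):
        raise RuntimeError(result.stderr.strip())
    return parse_zbarimg_output(result.stdout)
```

A sticker carrying something like a document reference would then come back as a (type, data) pair that can be matched against the right record.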

I am surprised they did not include any sort of digital signing and timestamp in the scanner. It would have been simple to do and provided a way to prove the scan was not later edited. I wonder if there is a web service to do that, and if not, we could make one - a simple API passed an SHA1 and returning a signature...
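To make the idea concrete, here is a toy sketch of what such a service might do. Everything in it is hypothetical: the key, the function names and the message layout are invented for the example, and it uses an HMAC as a stand-in for a proper public-key signature, which a real service would need so that third parties could verify a scan without holding the secret.

```python
import hashlib
import hmac

SERVICE_KEY = b"hypothetical-secret-key"  # made up for the example


def sha1_of(data):
    """What the client would compute locally and send to the service."""
    return hashlib.sha1(data).hexdigest()


def sign(digest_hex, timestamp, key=SERVICE_KEY):
    """What the service would return: a tag binding the hash to a time."""
    message = f"{digest_hex}|{timestamp}".encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(data, timestamp, signature, key=SERVICE_KEY):
    """Check later that the file still matches what was signed, and when."""
    expected = sign(sha1_of(data), timestamp, key)
    return hmac.compare_digest(expected, signature)
```

Any later edit to the scan changes its SHA1, so the stored signature no longer verifies, and the same goes for claiming a different timestamp.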

Anyway, all good fun...

Update: It does IPv6!


  1. Ah! Glad to see we're not the only victim of the Epson "Network" scanner that isn't.

    We too have got one of those, with the Edimax jobbie, and it is utterly unreliable. Unfortunately we've had it too long to do anything about it now (it was actually connected via USB to the main user's PC for years).

  2. I did some research into this a while ago; the Wikipedia article on it is a good resource. Apparently there's even an RFC, an ISO standard and an X. standard.

  3. Does it come with an insane 'safety guide' like the (otherwise excellent) Fujitsu ScanSnap?
    Highlights of that are:
    - not to scan while driving
    - to be careful handling documents lest you get paper cuts
    - to check the power cord - *once a month*

  4. We have here an Epson Aculaser CX11NF - an all in one Networked Colour Laser, Scanner, Copier & Fax. It only prints and scans Simplex, though (well, it does Manual Duplex printing, but IMO no one wants to stand at a printer shifting paper as it comes out...)

    Despite its shortcomings, it's actually not a terrible printer, but it does have one awful problem: the scanner part, driven by a Windows app, will not communicate with the scanner unless it's on the same subnet / broadcast domain.
    The printer part will, quite happily, as there's a route between the networks, but try and put the scanner on a different subnet from the PC you're trying to scan with, and it all breaks down. Bloody thing.