How we digitise millions of historic newspaper pages and make them searchable online
Our scanning team now works on-site at the British Library's state-of-the-art Newspaper Storage Building in Yorkshire, England
What do we digitise?
Because our scanning team is based within the British Library's newspaper storage facilities, we have extensive access to the collection. One benefit of being able to access the original bound volumes of newspapers and periodicals is that, unlike many other newspaper digitisation projects, we have been able to scan some of the rarest and most fragile newspapers in the collection. We have even scanned single pages more than two feet wide! As well as scanning from the original paper volumes, we also make use of the library's extensive microfilm collection to bring you more pages as quickly as we can.
Investment in processes and equipment = high-quality images
We've made a significant investment in state-of-the-art scanning equipment, including five huge A0 colour scanners, which allow us to scan pages up to a square metre (just over a square yard) in size. This means that whether we are working from the original paper or from microfilm, our digital images are of incredibly high quality. We've also spent a lot of time in the past few years making our scanning process as efficient as possible, with frequent quality checkpoints throughout. This not only makes us faster, but also means that we don't have to go back and rescan pages later.
Creating a searchable index of millions of newspaper pages
Once we have scanned the newspaper pages in Yorkshire, England, the images are moved to our operations centre in Dundee, Scotland. Here we use Optical Character Recognition (OCR) technology to "read" the pages and create a text version of what is on each page. This technology is developing all the time, and although it is far from perfect, we are constantly refining it so that the index we build of what is within the pages is as accurate as possible.
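To give a feel for what "an index of what is within the pages" can mean, here is a minimal Python sketch of an inverted index, a common structure for text search. It is an illustration only, not a description of our actual search system: the page ids, the sample text, and the function names `tokenize`, `build_index` and `search` are all invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the OCR text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical OCR output for three pages (invented for illustration).
pages = {
    "page-1": "Great fire in the city last night",
    "page-2": "The city council met on Tuesday",
    "page-3": "Fire brigade praised by council",
}
index = build_index(pages)
```

Because each word points straight to the pages that contain it, a query such as `search(index, "city fire")` only has to intersect two small sets rather than re-read every page. This is also why OCR accuracy matters: a misread word is filed under the wrong entry and cannot be found under the right one.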
Once we have built a text index for each of the pages, this information is fed into our search system, allowing you to search billions of words within seconds. This is also the point at which the individual pages are assembled into complete newspaper issues, and then into entire year runs of each title, and the extra information about where and when the newspapers were published is added to the search system. We then break the image of each page down into a series of smaller images, so that the website only has to load the part of the page you are looking at. This is what makes our "zooming" facility so fast, letting you zoom in and out of pages almost instantly. After final quality checks, we publish the new pages online.
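The idea of breaking a page image into smaller images can be sketched as a "tile pyramid": the full-size image is cut into a grid of tiles, then repeatedly halved and cut again until the whole page fits in one tile. The Python sketch below only does the arithmetic. The 256-pixel tile size, the halving between levels, and the function name `tile_pyramid` are assumptions chosen for illustration, not details of our production setup.

```python
import math


def tile_pyramid(width, height, tile_size=256):
    """Return a list of (level_width, level_height, columns, rows) tuples,
    from full resolution down to the first level that fits in one tile."""
    levels = []
    w, h = width, height
    while True:
        cols = math.ceil(w / tile_size)
        rows = math.ceil(h / tile_size)
        levels.append((w, h, cols, rows))
        if cols == 1 and rows == 1:
            break
        # Halve the image for the next, more zoomed-out level.
        w = max(1, math.ceil(w / 2))
        h = max(1, math.ceil(h / 2))
    return levels


# A deliberately small, made-up image size to keep the output readable.
for level_width, level_height, cols, rows in tile_pyramid(1000, 600):
    print(level_width, level_height, cols * rows, "tiles")
```

With a layout like this, a viewer needs to fetch only the handful of tiles that cover the visible area at the current zoom level, rather than the whole high-resolution scan.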