Introduction to accessible PDFs
[Narrator:] Welcome. In this chapter you will learn how to create accessible PDFs. But, before we start, we should remember what PDF is and why it exists. 25 years ago many different document editors existed, each using its own file format. Workflows often failed when exchanging documents because of file format incompatibilities.
Let's assume I sent a document to a colleague using the WordPerfect authoring tool. Unfortunately, my colleague was not able to open the document because he was using Microsoft Word, which was not able to read the WordPerfect file format. So we both agreed to use the same word processor. I sent my document to my colleague again, now using the Microsoft Word file format. But again this approach failed, as my colleague was using Word version 4 and I had stored my document using the Word file format version 6. So we both agreed to use the same program and the same program version. I sent my document again, but my colleague claimed that the layout of the document was damaged and unreadable. The reason was the different font sets installed on our machines. The fonts I had used to write the document did not exist on my colleague's machine. So we both agreed to use the same program, the same program version and the same fonts. After another try, my colleague claimed that the layout still did not appear to be correct. The reason was different operating systems. I was using a Mac, he was using Windows.
It is obvious that there was a need for a new file format, which preserved the layout over different operating systems and authoring platforms. The Portable Document Format was developed by Adobe in the 1990s. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it.
PDF combines three technologies: a subset of the PostScript page description programming language, for generating the layout and graphics; a font-embedding system, to allow fonts to travel with the documents; and a data format, which supports compression, to bundle all page elements and any associated content into a single file.
One of the key elements of PDF’s success was the free Adobe PDF Reader program. This allowed an author to send any document to any person using a desktop computer. The recipient just needed to download a recent copy of the reader software and was immediately able to see and read the document presented in the layout intended by the author.
Today PDF has become the de facto standard for the exchange of documents laid out for printing. It is used every day all around the world, for professional and home use.
PDF Standards
Recent PDF versions have been defined using a set of ISO standards. PDF is not one standard using a single file format, but consists of a family of different file formats optimised for different applications. This table shows the different PDF standards. There are PDF versions for different applications; for example, printing, archiving, engineering, variable data printing or universal access. Each of these standards serves different needs and targets different user groups.
PDF was a proprietary format controlled by Adobe until it was standardised by ISO. Today a rich set of ISO standards exists defining the PDF file format.
The first PDF standard including accessibility features was ISO 19005-1:2005, which is commonly known as PDF/A or PDF for archiving. This version is specialised for use in the archiving and the long-term preservation of electronic documents. This first part of the standard specifies two levels of conformance for PDF files. Level B focuses on the reliable reproduction of a document's visual appearance, while Level A conformance includes all Level B requirements in addition to features intended to improve a document's accessibility.
The additional Level A requirements are:
- The definition of the document language.
- The embedding of a hierarchical document structure.
- Tagged text spans and descriptive text for images and symbols.
- Character mappings to Unicode.
Level A conformance was intended to increase the accessibility of conforming files for users with physical impairments by allowing assistive software, such as screen readers, to more precisely extract and interpret a file's content.
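As a rough illustration, each of the four additional requirements usually leaves a recognisable key in the PDF file. The Python sketch below simply searches the raw bytes for those keys. The function name, the labels and the sample input are assumptions made for this example. It is not a PDF/A validator: a key stored inside a compressed object stream will be missed, and finding a key does not prove that the document conforms.

```python
# Rough heuristic, not a PDF/A validator: it only searches the raw bytes
# for dictionary keys that usually accompany each Level A requirement.
LEVEL_A_MARKERS = {
    "document language": b"/Lang",
    "document structure": b"/StructTreeRoot",
    "tagged content": b"/MarkInfo",
    "Unicode character mapping": b"/ToUnicode",
}


def find_level_a_markers(pdf_bytes: bytes) -> dict[str, bool]:
    """Report which marker keys occur anywhere in the file's bytes."""
    return {name: key in pdf_bytes for name, key in LEVEL_A_MARKERS.items()}


# Hypothetical fragment of a document catalog, used only as sample input.
sample = (
    b"<< /Type /Catalog /Lang (en-GB) "
    b"/MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
)
report = find_level_a_markers(sample)
for name, present in report.items():
    print(f"{name}: {'found' if present else 'not found'}")
```

To try it on a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.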
Reading a PDF
What makes a PDF document accessible? Let's look at a few example documents to gain a better understanding. Here we have a simple text document in PDF format. We would like to read this document using a screen reader. For this example, we are using the VoiceOver screen reader on macOS. Please note that the result would be similar if using a different screen reader on a different platform.
[Screen reader:] Preview pdf_read_demo_1 pdf, page 1 of 3, window, document, group.
[Narrator:] The screen reader indicates the current program and the name of the opened document. Let's try to read the content.
[Screen reader:] In document, group, 3 items, image. No visible title to interact. Page 2. In page 2, image. No visible title to interact.
[Narrator:] The screen reader cannot read any text. Let's look a little closer at this document.
[Screen reader:] Page 1, image.
[Narrator:] As we can see, this is not normal text. This is a graphic of text. We can see how the text becomes blurry when zooming into the raster image. The screen reader cannot detect any readable text information in a graphic-only PDF. PDF files like this are often created when scanning a paper-based document. The scanner software embeds the scanned image inside the PDF without any text.
Let's try another PDF. We zoom into the document to make sure that we do not have a scanned image. Looks good... Let's try to read this document.
[Screen reader:] In document, group, 3 items, image. No visible title to interact. Page 2. In page 2 content is empty. In page 2, empty. Content is empty.
[Narrator:] It seems that the screen reader again has problems reading the content. But why? Let's try to select text using the mouse.
[Screen reader:] Image.
[Narrator:] We cannot do this. The document does not contain any text. The source of the problem is that this document uses outlined fonts, which means each single character of this document is not a letter, but a vector graphic of this letter. Again we have a graphic-only document.
The first document used raster images, whereas this document uses vector images. Why would somebody create such a document? In the past, graphic designers often had problems caused by different font versions when delivering documents to a printing company. To avoid these font problems, they created documents with outlined fonts, which needed no fonts at all.
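The difference between these two graphic-only documents and a document with real text can be sketched as a simple check for font and image resources. The Python example below is a heuristic under stated assumptions: the function name, the three labels and the sample fragments are invented for illustration, and the search only sees keys that are stored uncompressed in the file. A scanned document that has been through text recognition would contain both images and fonts and would be reported as "text".

```python
import re

# Heuristic patterns; both assume the keys appear uncompressed in the file.
FONT_KEY = re.compile(rb"/Font\b")
IMAGE_KEY = re.compile(rb"/Subtype\s*/Image\b")


def guess_content_kind(pdf_bytes: bytes) -> str:
    """Guess whether a PDF holds real text, raster images or neither."""
    if FONT_KEY.search(pdf_bytes):
        return "text"          # fonts present, so real characters are possible
    if IMAGE_KEY.search(pdf_bytes):
        return "raster-only"   # images but no fonts, typical of a plain scan
    return "vector-only"       # neither, possibly outlined fonts


# Hypothetical fragments standing in for the three kinds of file.
scanned = b"<< /Type /XObject /Subtype /Image /Width 1240 /Height 1754 >>"
outlined = b"<< /Type /Page /Contents 4 0 R >>"
with_text = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"

for label, data in [("scanned", scanned), ("outlined", outlined), ("with text", with_text)]:
    print(label, "->", guess_content_kind(data))
```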
Let's try another document. We select a few lines with the mouse to make sure that this document contains text. As we can see, this document contains text. Now we try to read it with the screen reader.
[Screen reader:] In document, group. 3 items, page 1. In page 1, 16 items, 1, 1.1. Alice’s Adventures in Wonderland. 1.2. Alice was beginning to get very tired of sitting by her sister on the bank, and on having nothing to do. Once or twice she had peeped into the book her sister was reading, but it has no pictures or conversations…
[Narrator:] Let's speed this up a bit.
[Screen reader:] [Inaudible] … sending presents to the one’s own feet and how odd the directions will look. Down the rabbit hole Image.
[Narrator:] This works much better, but it's still not perfect. There is a graphic here, but the screen reader cannot offer us any information about the content of this graphic. We can read text, but some parts of the document are not read in the correct sequence. The elements of a document need a reading sequence. This becomes even more important if a graphic designer decides that your document needs a more sophisticated layout.
[Screen reader:] In page 1, 16 items. A mad tea party. There was a table set out under a tree in front of the house, and she sat down in a large armchair at one end of the table. The March, the Hatter were having tea at it. A dormouse was sitting between them, fast asleep, and the other… Hare and… Have some wine, the March Hare said in an encouraging tone. Two were using it as a cushion resting their elbows on it…
[Narrator:] Even though we can read all the text, the content is read to us in a useless sequence. This example makes it clear that the reading sequence is essential for reading documents with a screen reader.
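The effect of a missing reading sequence can be illustrated with a small Python sketch. It models a hypothetical two-column page and compares an order derived only from positions on the page with the order the author intended. The `TextBlock` class, its fields and the coordinates are assumptions for this example. It does not describe how any particular screen reader works.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float    # left edge, in points from the left of the page
    y: float    # top edge, in points from the top of the page
    order: int  # position in the author's intended reading sequence


# Hypothetical two-column page: A and B form the left column,
# C and D form the right column.
blocks = [
    TextBlock("A: left column, first paragraph", x=50, y=100, order=1),
    TextBlock("B: left column, second paragraph", x=50, y=300, order=2),
    TextBlock("C: right column, first paragraph", x=320, y=100, order=3),
    TextBlock("D: right column, second paragraph", x=320, y=300, order=4),
]


def visual_order(items):
    """Order by position only: top to bottom, then left to right."""
    return sorted(items, key=lambda b: (b.y, b.x))


def logical_order(items):
    """Order by the reading sequence stored with the content."""
    return sorted(items, key=lambda b: b.order)


print([b.text[0] for b in visual_order(blocks)])   # ['A', 'C', 'B', 'D']
print([b.text[0] for b in logical_order(blocks)])  # ['A', 'B', 'C', 'D']
```

The position-based order jumps between the columns, which is the kind of scrambled result heard in the example above. The stored sequence keeps each column together.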
Text for accessibility
Let's summarise what we have learned so far. Accessibility requires text, not an image of a text. Whether this image is raster based or vector based is of no importance to a screen reader. A screen reader can only detect text with real characters. Text elements require a reading sequence. This becomes even more important if the text content is not presented sequentially, but is broken into several parts, repeatedly interrupted by banners, graphics or any other complementary information blocks.
All non-text information, like images and graphics, requires an alternative text describing the element, so that a non-visual user can get an idea of its content. All this additional information can be added to a PDF using tagging.
Let's talk about what tagging is and what it looks like. For our example we will use LibreOffice. We use this program to demonstrate that creating a tagged PDF is not a feature available only in Microsoft Word or Adobe InDesign.
We see a document that includes chapter headings, graphics, lists and a table. We will create two PDF files following two different procedures. The first approach is to create the PDF using the printer driver. We select File, Print, PDF, Save as PDF, and save the document on our desktop using the file name "demo1.pdf". The printer driver uses the visual information of the document, including text and images, to convert it to a PDF. Please note that a printer driver has no idea of the meaning of a page element. It does not know if a text is a heading or if it is part of a list or a table. It just knows that letters in a defined font and size should be placed at a specific position on the page. LibreOffice also offers a function to export a document to PDF directly.
The second PDF will be saved using File, Export as, Export as PDF. A dialogue box opens and invites us to select several parameters. We will select the Tagged PDF option, which adds the structure of the document, and the Export bookmarks option. We will save this document using the file name "demo2.pdf". We can now see two files on our desktop. We can open the files in Adobe Acrobat Pro to see what information has been added to the second file. Here we can compare the two files. On the left-hand side is the PDF created via the printer driver; on the right-hand side is the second PDF, created via the PDF export function.
At first glance, we notice that the second PDF has bookmarks, which the first one does not have. Additionally, the second PDF offers tagging information, which the first one does not have.
If we look closer at the tagging information, we can see tags for headings, images, lists and tables. We can see that tags can be nested, so a table has table rows, and table rows have table data cells. All this tagging information is evaluated by a screen reader to present an improved reading experience to a user with a visual disability.
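The nesting of tags can be pictured as a tree. The following Python sketch builds a small, hypothetical tag tree and walks it twice: once to print the nested structure, and once to find figures that have no alternative text. The `Tag` class and both helper functions are assumptions for illustration. The names H1, P, Figure, Table, TR and TD follow the standard structure types used in tagged PDF.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tag:
    kind: str                  # structure type, for example "H1" or "Table"
    text: str = ""
    alt: Optional[str] = None  # alternative text, relevant for figures
    children: list["Tag"] = field(default_factory=list)


def outline(tag: Tag, depth: int = 0) -> list[str]:
    """Return one indented line per tag, parents before children."""
    lines = ["  " * depth + tag.kind]
    for child in tag.children:
        lines.extend(outline(child, depth + 1))
    return lines


def figures_without_alt(tag: Tag) -> list[Tag]:
    """Collect every Figure tag in the tree that lacks alternative text."""
    found = [tag] if tag.kind == "Figure" and not tag.alt else []
    for child in tag.children:
        found.extend(figures_without_alt(child))
    return found


# Hypothetical document: a heading, a paragraph, an image and a small table.
doc = Tag("Document", children=[
    Tag("H1", "Down the rabbit hole"),
    Tag("P", "Alice was beginning to get very tired."),
    Tag("Figure"),  # no alternative text yet
    Tag("Table", children=[
        Tag("TR", children=[Tag("TD", "Guest"), Tag("TD", "Seat")]),
    ]),
])

print("\n".join(outline(doc)))
print("Figures without alternative text:", len(figures_without_alt(doc)))
```

In this sample the single figure is reported as missing its alternative text, which matches the image the screen reader could not describe earlier.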
Four ways to create a PDF
In the remaining sections of this chapter we will talk about four different ways to create a PDF.
If the document editor, like Microsoft Word or Adobe InDesign, offers the possibility of exporting a tagged PDF, then this is always the preferred PDF creation procedure. Only the document editor contains the structure of the document. While editing the document, the user defines headings, lists, tables and other structural information. This information can be embedded in the form of tags in the resulting PDF document to support screen readers.
Some document editors do not offer this functionality. In this case, a PDF can only be created via a printer driver. The resulting PDF will have no tagging information to support users with screen readers.
Sometimes we do not have the document in any digital form. In this case, we need to scan it. Using optical character recognition, which is, for example, built into Adobe Acrobat Pro, we can convert images of text into text. Optical character recognition, which is often shortened to OCR, is a technology that converts different types of documents, such as scanned paper documents or images captured by a digital camera, into editable and searchable text data. Please note that the quality of the text recognition is highly dependent on the quality of the scan, the quality of the OCR program and the applied dictionaries. Sometimes we do not have a dictionary matching the language of the text. Once the text recognition has finished, the document can be saved as an untagged PDF.
If we have an untagged PDF, we can improve its accessibility by converting it manually into a tagged PDF using Adobe Acrobat Pro. This is a very time-consuming task, which will almost never produce a perfect result. Every tag has to be created or repaired manually to add the correct document structure information. This should always be the last option to choose when creating an accessible PDF.
Where to continue?
You have learned about PDF and how tagging information improves its accessibility. In the chapters that follow you will learn how to build a document structure in your preferred document editor. You will see how to create a navigation structure, which can be used by a screen reader to create an improved reading experience.
Depending on your personal interests, you could continue with one of the following chapters:
- From Word to PDF
- Adobe InDesign
- Other authoring tools
[Automated voice:] Accessibility. For more information visit: op.europa.eu/en/web/accessibility.