Copyright © 2014 W3C® (MIT, ERCIM, Keio, Beihang), Some Rights Reserved: this document is dual-licensed, CC-BY and W3C Document License. W3C liability, trademark and document use rules apply.
The ruby markup model currently described in the HTML specification is limited in its support for a number of features, notably jukugo and double-sided ruby, as well as inline ruby. This specification addresses these issues by introducing new elements and changing the ruby processing model. Specific care has been taken to ensure that authoring remains as simple as possible.
This document was largely developed to address the shortcomings listed in Use Cases & Exploratory Approaches for Ruby Markup. [ruby-use-cases]
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This specification is an extension specification to HTML.
This document was published by the HTML Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-html@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The following changes are made to HTML by this document:
ruby nested inside ruby now represents its children (in other words, it is no longer meaningful, which reflects implementations);
rb and rtc elements have been introduced to support cases in which explicit bases and containers are needed (explicit bases are relatively common in existing content).
Some further changes will be required to HTML if this document is merged in. Of those, some are for paraphernalia (element index and the such) while others are more involved (changes to parsing to auto-close some elements) and in some cases optional.
Changes to [WEBVTT] will be required in order to match this model. Also, rb needs to be removed from the obsolete elements.
Add rb and rtc to the list of elements in generate implied end tags.
Add rb and rtc to the step A start tag whose tag name is one of: "rp", "rt" in The "in body" insertion mode. Within that step, if the start tag is "rt" then generate implied end tags excepting "rtc", otherwise just generate implied end tags.
Add rb and rtc to the elements that preclude a parse error in both An end tag whose tag name is "body" and An end tag whose tag name is "html" in The "in body" insertion mode.
The term inter-element whitespace is defined in the HTML specification. [HTML5]
The interfaces Text and Element are defined in the DOM specification. [DOM]
The ruby element
DOM interface: uses HTMLElement.
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana. A more complete introduction to ruby can be found in the Use Cases & Exploratory Approaches for Ruby Markup document as well as in CSS Ruby Module Level 1. [ruby-use-cases] [css3-ruby]
The content model of ruby elements consists of one or more of the following sequences:
One or more phrasing content nodes or rb elements.
One or more rt or rtc elements, each of which is either immediately preceded or followed by an rp element.
The ruby, rb, rtc, and rt elements can be used for a variety of kinds of annotations, including in particular (though by no means limited to) those described below. For more details on Japanese Ruby in particular, and how to render Ruby for Japanese, see Requirements for Japanese Text Layout. [JLREQ] The rp element can be used as fallback content when ruby rendering is not supported.
Annotations (the ruby text) are associated individually with each ideographic character (the base text). In Japanese this is typically hiragana or katakana characters used to provide readings of kanji characters.
<ruby>base<rt>annotation</ruby>
When no rb element is used, the base is implied, as above. But you can also make it explicit. This can be useful notably for styling, or when consecutive bases are to be treated as a group, as in the jukugo ruby example further down.
<ruby><rb>base<rt>annotation</ruby>
In the following example, notice how each annotation corresponds to a single base character.
<ruby>日<rt>に</rt></ruby><ruby>本<rt>ほん</rt></ruby> <ruby>語<rt>ご</rt></ruby>で<ruby>書<rt>か</rt></ruby> いた<ruby>作<rt>さく</rt></ruby><ruby>文<rt>ぶん</rt></ruby>です。
Ruby text interspersed in regular text provides structure akin to the following image:
This example can also be written as follows, using a single ruby element that holds several segments of base text, each followed by its own annotation, rather than back-to-back ruby elements each with one base text segment and annotation (as in the markup above):
<ruby>日<rt>に</rt>本<rt>ほん</rt>語<rt>ご</rt></ruby> で<ruby>書<rt>か</rt></ruby> いた<ruby>作<rt>さく</rt>文<rt>ぶん</rt></ruby>です。
Group ruby is often used where phonetic annotations don't map to discrete base characters, or for semantic glosses that span the whole base text. For example, the word "today" is written with the characters 今日, literally "this day". But it's pronounced きょう (kyou), which can't be broken down into a "this" part and a "day" part. In typical rendering, you can't split text that is annotated with group ruby; it has to wrap as a single unit onto the next line. When a ruby text annotation maps to a base that is composed of more than one character, that base is grouped.
The following group ruby:
Can be marked up as follows:
<ruby>今日<rt>きょう</ruby>
Jukugo refers to a Japanese compound noun, i.e. a word made up of more than one kanji character. Jukugo ruby is a term that is used not to describe ruby annotations over jukugo text, but rather to describe ruby with a behaviour slightly different from mono or group ruby. Jukugo ruby is similar to mono ruby, in that there is a strong association between ruby text and individual base characters, but the ruby text is typically rendered as grouped together over multiple ideographs when they are on the same line.
The distinction is captured in this example:
Which can be marked up as follows:
<ruby>法<rb>華<rb>経<rt>ほ<rt>け<rt>きょう</ruby>
In this example, each rt element is paired with its respective rb element, the difference with an interleaved rb/rt approach being that the sequences of both base text and ruby annotations are implicitly placed in common containers so that the grouping information is captured.
For more details on Jukugo Ruby rendering, see Appendix F in the Requirements for Japanese Text Layout and Use Case C: Jukugo ruby in the Use Cases & Exploratory Approaches for Ruby Markup. [JLREQ] [ruby-use-cases]
In some contexts, for instance when the font size or line height are too small for ruby to be readable, it is desirable to inline the ruby annotation such that it appears in parentheses after the text it annotates. This also provides a convenient fallback strategy for user agents that do not support rendering ruby annotations.
Inlining takes grouping into account. For example, Tokyo is written with two kanji characters, 東, which is pronounced とう, and 京, which is pronounced きょう. Each base character should be annotated individually, but the fallback should be 東京(とうきょう) not 東(とう)京(きょう). This can be marked up as follows:
<ruby>東<rb>京<rt>とう<rt>きょう</ruby>
Note that the above markup will enable the usage of parentheses when inlining for browsers that support ruby layout, but for those that don't it will fail to provide parenthetical fallback. This is where the rp element is useful. It can be inserted into the above example to provide the appropriate fallback when ruby layout is not supported:
<ruby>東<rb>京<rp>(<rt>とう<rt>きょう<rp>)</ruby>
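The grouping behaviour of inlining described above can be illustrated with a minimal Python sketch. It is not part of this specification: the function names inline_fallback and per_character_fallback are hypothetical, and bases and annotations are assumed to be plain strings for a single ruby segment.

```python
def inline_fallback(bases, annotations, open_paren="(", close_paren=")"):
    """Inline one ruby segment: all bases first, then all annotations in one pair of parentheses."""
    return "".join(bases) + open_paren + "".join(annotations) + close_paren


def per_character_fallback(bases, annotations):
    """The ungrouped form that the text advises against, shown only for contrast."""
    return "".join(b + "(" + a + ")" for b, a in zip(bases, annotations))


grouped = inline_fallback(["東", "京"], ["とう", "きょう"])
ungrouped = per_character_fallback(["東", "京"], ["とう", "きょう"])
print(grouped)    # 東京(とうきょう)
print(ungrouped)  # 東(とう)京(きょう)
```

The first form keeps the word intact and matches the fallback that the rp elements produce in the markup above.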
Sometimes, ruby can be used to annotate a base twice.
In the following example, the Chinese word for San Francisco (旧金山, i.e. “old gold mountain”) is annotated both using pinyin to give the pronunciation, and with the original English.
Which is marked up as follows:
<ruby><rb>旧<rb>金<rb>山<rt>jiù<rt>jīn<rt>shān<rtc>San Francisco</ruby>
In this example, a single base run of three base characters is annotated with three pinyin ruby text segments in a first (implicit) container, and an rtc element is introduced to provide a second, single ruby text annotation: the city's English name.
We can also apply the jukugo markup shown above to 上手 ("skill") to show how it can be annotated in both kana and romaji phonetics while at the same time maintaining the pairing to bases and the annotation grouping information.
Which is marked up as follows:
<ruby><rb>上<rb>手<rt>じょう<rt>ず<rtc><rt>jou<rt>zu</ruby>
Text that is a direct child of the rtc element implicitly produces a ruby text segment as if it were contained in an rt element. In this contrived example, this is shown with some symbols that are given names in English and French with annotations intended to appear on either side of the base symbol.
<ruby> ♥<rt>Heart<rtc lang=fr>Cœur</rtc> ☘<rt>Shamrock<rtc lang=fr>Trèfle</rtc> ✶<rt>Star<rtc lang=fr>Étoile </ruby>
Similarly, text directly inside a ruby element implicitly produces a ruby base as if it were contained in an rb element, and rt children of ruby are implicitly contained in an rtc container. In effect, the above example is equivalent (in meaning, though not in the DOM it produces) to the following:
<ruby> <rb>♥</rb><rtc><rt>Heart</rt></rtc><rtc lang=fr><rt>Cœur</rt></rtc> <rb>☘</rb><rtc><rt>Shamrock</rt></rtc><rtc lang=fr><rt>Trèfle</rt></rtc> <rb>✶</rb><rtc><rt>Star</rt></rtc><rtc lang=fr><rt>Étoile</rt></rtc> </ruby>
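Within a ruby element, content is parcelled into a series of ruby segments. Each ruby segment is described by:
Zero or more ruby bases, each of which is a DOM range that may contain phrasing content or an rb element.
Zero or more ruby text containers, which may correspond to explicit rtc elements, or to sequences of rt elements implicitly recognised as contained in an anonymous ruby text container.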
Each ruby text container is described by zero or more ruby text annotations, each of which is a DOM range that may contain phrasing content or an rt element, and an annotations range that is a range including all the annotations for that container. A ruby text container is also known (primarily in a CSS context) as a ruby annotation container.
Furthermore, a ruby element contains ignored ruby content. Ignored ruby content does not form part of the document's semantics. It consists of some inter-element whitespace and rp elements, the latter of which are used for legacy user agents that do not support ruby at all.
The process of annotation pairing associates ruby annotations with ruby bases. Within each ruby segment, each ruby base in the ruby base container is paired with one ruby text annotation from the ruby text container, in order. If there are not enough ruby text annotations in a ruby annotation container, the last one is associated with any excess ruby bases. (If there are not any in the ruby annotation container, an anonymous empty one is assumed to exist.) If there are not enough ruby bases, any remaining ruby text annotations are assumed to be associated with empty, anonymous bases inserted at the end of the ruby base container.
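The pairing rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the normative algorithm: the function name pair_annotations is hypothetical, bases and annotations are modelled as plain strings rather than DOM ranges, and an empty string stands in for an anonymous empty base or annotation.

```python
def pair_annotations(bases, annotations):
    """Pair the bases of one ruby segment with the annotations of one ruby text container.

    Returns a list of (base, annotation) tuples. When an annotation is reused
    for excess bases, it appears in several tuples; conceptually it is still
    the same single annotation.
    """
    if not annotations:
        annotations = [""]  # an anonymous empty annotation is assumed to exist
    pairs = []
    for i in range(max(len(bases), len(annotations))):
        # Missing bases become empty, anonymous bases at the end.
        base = bases[i] if i < len(bases) else ""
        # When annotations run out, the last one covers the excess bases.
        annotation = annotations[min(i, len(annotations) - 1)]
        pairs.append((base, annotation))
    return pairs


print(pair_annotations(["旧", "金", "山"], ["jiù", "jīn", "shān"]))
print(pair_annotations(["旧", "金", "山"], ["San Francisco"]))
```

The second call mirrors the double-sided example given earlier: the single English annotation ends up associated with all three base characters.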
Note that the terms ruby segment, ruby base, ruby text annotation, ruby text container, ruby base container, and ruby annotation container have their equivalents in CSS Ruby Module Level 1. [css3-ruby]
Informally, the segmentation and categorisation algorithm below performs a simple set of tasks. First it processes adjacent rb elements, text nodes, and non-ruby elements into a list of bases. Then it processes any number of rtc elements or sequences of rt elements that are considered to automatically map to an anonymous ruby text container. Put together these data items form a ruby segment as detailed in the data model above. It will continue to produce such segments until it reaches the end of the content of a given ruby element. The complexity of the algorithm below compared to this informal description stems from the need to support an author-friendly syntax and being mindful of inter-element white space.
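The informal description above can be sketched in Python as follows. This is a simplification, not the normative algorithm: it assumes that the children of the ruby element are given as (kind, content) tuples, that inter-element whitespace has already been dropped, and that the content of an rtc child is a list of annotation strings. The name segment_ruby and this input shape are hypothetical.

```python
def segment_ruby(children):
    """Split the children of a ruby element into ruby segments.

    Each segment is a dict with "bases" (a list of strings) and
    "containers" (a list of lists of annotation strings).
    """
    segments = []
    bases, containers, auto_rts = [], [], []

    def flush_rts():
        # Consecutive rt elements share one anonymous ruby text container.
        if auto_rts:
            containers.append(list(auto_rts))
            auto_rts.clear()

    def commit_segment():
        flush_rts()
        if bases or containers:
            segments.append({"bases": list(bases), "containers": list(containers)})
        bases.clear()
        containers.clear()

    for kind, content in children:
        if kind == "rp":
            continue  # rp is ignored ruby content
        if kind == "rt":
            auto_rts.append(content)
        elif kind == "rtc":
            flush_rts()
            containers.append(list(content))
        else:
            # "rb", "text", or any non-ruby element acts as a base.
            if containers or auto_rts:
                commit_segment()  # a base after annotations starts a new segment
            bases.append(content)
    commit_segment()
    return segments


san_francisco = [
    ("rb", "旧"), ("rb", "金"), ("rb", "山"),
    ("rt", "jiù"), ("rt", "jīn"), ("rt", "shān"),
    ("rtc", ["San Francisco"]),
]
print(segment_ruby(san_francisco))
```

For the double-sided example this yields a single segment with three bases and two containers, while interleaved base and rt children yield one segment per base.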
At any particular time, the segmentation and categorisation of content of a ruby element is the result that would be obtained from running the following algorithm:
Let root be the ruby element for which the algorithm is being run.
If root has a ruby element ancestor, then abort these steps.
If the current child is an rp element, then increment index by one and jump to the step labelled process a ruby child. (Note that this has the effect of including this element in any range that we are currently processing. This is done intentionally so that misplaced rp can be processed correctly; semantically they are ignored all the same.)
If the current child is an rt element, then run these substeps:
If the current child is an rtc element, then run these substeps:
Create a new ruby annotation container, described by the annotations obtained by processing the rtc element and a DOM range whose start is the boundary point (root, index) and whose end is the boundary point (root, index plus one). Append this new ruby annotation container at the end of current annotation containers.
If the node at position lookahead index in root is an rt element, an rtc element, or an rp element, then set index to the value of lookahead index and jump to the step labelled process a ruby child.
If the current child is an rb element, then run these substeps:
When the steps above say to commit a ruby segment, it means to run the following steps at that point in the algorithm:
When the steps above say to commit the base range, it means to run the following steps at that point in the algorithm:
When the steps above say to commit current annotations, it means to run the following steps at that point in the algorithm:
When the steps above say to commit an automatic base, it means to run the following steps at that point in the algorithm:
The rb element
Contexts in which this element can be used: as a child of a ruby element.
DOM interface: uses HTMLElement.
The rb element marks the base text component of a ruby annotation. When it is the child of a ruby element, it doesn't represent anything itself, but its parent ruby element uses it as part of determining what it represents.
An rb element that is not a child of a ruby element represents the same thing as its children.
The rt element
Contexts in which this element can be used: as a child of a ruby or of an rtc element.
DOM interface: uses HTMLElement.
The rt element marks the ruby text component of a ruby annotation. When it is the child of a ruby element or of an rtc element that is itself the child of a ruby element, it doesn't represent anything itself, but its ancestor ruby element uses it as part of determining what it represents.
An rt element that is not a child of a ruby element or of an rtc element that is itself the child of a ruby element represents the same thing as its children.
The rtc element
Contexts in which this element can be used: as a child of a ruby element.
DOM interface: uses HTMLElement.
The rtc element marks a ruby text container for ruby text components in a ruby annotation. When it is the child of a ruby element, it doesn't represent anything itself, but its parent ruby element uses it as part of determining what it represents.
An rtc element that is not a child of a ruby element represents the same thing as its children.
When an rtc element is processed as part of the segmentation and categorisation of content for a ruby element, the following algorithm defines how to process an rtc element:
Let root be the rtc element for which the algorithm is being run.
If the current child is an rt element, then run these substeps:
When the steps above say to commit an automatic annotation, it means to run the following steps at that point in the algorithm:
The rp element
Contexts in which this element can be used: as a child of a ruby element, either immediately before or immediately after an rt or rtc element.
DOM interface: uses HTMLElement.
The rp element is used to provide fallback text to be shown by user agents that don't support ruby annotations. One widespread convention is to provide parentheses around the ruby text component of a ruby annotation.
The contents of the rp elements are typically not displayed by user agents which do support ruby annotations.
An rp element that is a child of a ruby element represents nothing. An rp element whose parent element is not a ruby element represents its children.
The example shown previously, in which each ideograph in the text 漢字 is annotated with its phonetic reading, could be expanded to use rp so that in legacy user agents the readings are in parentheses (please note that white space has been introduced into this example in order to make it more readable):
...
<ruby>
漢
<rb>字</rb>
<rp> (</rp>
<rt>かん</rt>
<rt>じ</rt>
<rp>) </rp>
</ruby>
...
In conforming user agents the rendering would be as above, but in user agents that do not support ruby, the rendering would be:
... 漢字 (かんじ) ...
When there are multiple annotations for a segment, rp elements can also be placed between the annotations. Here is another copy of an earlier contrived example showing some symbols with names given in English and French using double-sided annotations, but this time with rp elements as well:
<ruby> ♥<rp>: </rp><rt>Heart</rt><rp>, </rp><rtc><rt lang=fr>Cœur</rt></rtc><rp>.</rp> ☘<rp>: </rp><rt>Shamrock</rt><rp>, </rp><rtc><rt lang=fr>Trèfle</rt></rtc><rp>.</rp> ✶<rp>: </rp><rt>Star</rt><rp>, </rp><rtc><rt lang=fr>Étoile</rt></rtc><rp>.</rp> </ruby>
This would make the example render as follows in non-ruby-capable user agents:
♥: Heart, Cœur. ☘: Shamrock, Trèfle. ✶: Star, Étoile.
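As a rough check of the fallback rendering above, the following Python sketch approximates a user agent without ruby support by ignoring every tag and keeping only the character data. It uses the standard library html.parser module. The names TextOnly and fallback_text are hypothetical, and collapsing runs of white space into single spaces is a simplifying assumption of this sketch rather than a rendering rule taken from this document.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores all tags, including rp, rt and rtc."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fallback_text(markup):
    """Return the text a non-ruby-capable user agent would roughly display."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered character data
    return " ".join("".join(parser.parts).split())


markup = (
    "<ruby> ♥<rp>: </rp><rt>Heart</rt><rp>, </rp><rtc><rt lang=fr>Cœur</rt></rtc><rp>.</rp>"
    " ☘<rp>: </rp><rt>Shamrock</rt><rp>, </rp><rtc><rt lang=fr>Trèfle</rt></rtc><rp>.</rp>"
    " ✶<rp>: </rp><rt>Star</rt><rp>, </rp><rtc><rt lang=fr>Étoile</rt></rtc><rp>.</rp> </ruby>"
)
rendered = fallback_text(markup)
print(rendered)  # ♥: Heart, Cœur. ☘: Shamrock, Trèfle. ✶: Star, Étoile.
```

Because the rp elements carry the punctuation, the plain text stays readable even though all ruby structure is discarded.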
Additions to the default style sheet are made by CSS Ruby Module Level 1. [css3-ruby]
Much of this document is deeply indebted to Fantasai and Richard Ishida.