Unicode is the world standard for encoding text and supports all of the world's major writing systems. As of version 15.1 (2023-09-12), it defines 149,813 characters and 161 scripts. This presentation starts with the seemingly simple example of the polar bear emoji and defines the key terms of any such standard. It then asks how a software system can render orthographic characters into (possibly combined) glyphs. It introduces the concept of abstract characters and gives a brief history of encoding standards, from ASCII to Unicode. It shows how the Unicode standard answers this question by adding one level of indirection. Finally, it presents code examples that display text written in Unicode, using HarfBuzz (for shaping) and FreeType (for rendering).
3. Inspired by
Bear plus snowflake equals polar bear (andysalerno.com)
4. Preamble
Languages and scripts
● English, French… use the Latin alphabet
● Greek, Cypriot Greek use the Greek alphabet
● Korean, Cia-Cia… use the Hangul alphabet (alphabetic syllabary)
● Russian, Bulgarian, Tajik… use the Cyrillic script (different alphabets)
● Arabic alphabet, Chinese logograms, Japanese logograms/syllabaries…
and emojis!
6. Question
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?
7. Background
Text and code points
● “A code point is the atomic unit of
information. Text is a sequence of code
points.”
Glyphs
● “[S]pecific shape, design [… ] of a
character”
Typefaces
● Set of glyphs in a common design
● Baumans, Impact
Fonts
● Particular sets of glyphs in typefaces
● 10pt Pacifico, 18pt Pacifico
● ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ (12pt
Pacifico)
● Sans serif font, serif font
8. Background
Phonemes /•/
● /i/, /e/, /a/, /o/, /u/
Graphemes ⟨•⟩
● Correspond to sounds, i.e., phonemes
● “[S]mallest functional unit[s] of a writing system”
● “sequence of one or more code points [...] displayed as a single,
graphical unit [corresponding to] a single element of the writing system”
9. Background
Graphemes depend on writing systems
Writing systems imply orthographies
● ⟨a⟩ in French
● ⟨á⟩ in Northern Sámi…
… But not in Danish!
● ⟨á⟩ is sorted differently from ⟨a⟩ in Sámi
● ⟨á⟩ is a variant of/sorted as ⟨a⟩ in Danish
10. Background
Graphemes ⟨•⟩
● /ʃ/ in English ⟨sh⟩, in French ⟨ch⟩, in Greek, in Korean*, in Russian ⟨Ш⟩
● /a/ in English (≠ /æ/), in French ⟨a⟩, in Korean ⟨ㅏ⟩, in Russian ⟨а⟩
* except in “씨”
11. Background
Orthographic characters “•”
● “[A] distinct unit of writing in some writing system”
○ In French “a” but not “¸” or “^”
○ In Inuktitut “ᐱ” (/pi/)
○ In German “ß” but not in French*
○ In Korean “아” but not “ㅇ” or “ㅏ”
* Guess what the upper case of “ß” is?
12. Background
Phonemes, graphemes, orthographic characters, and glyphs
● /f/ ➝ ⟨f⟩ ➝ “f” (1-to-1*)
● /s/ ➝ ⟨σ⟩ ➝ “σ” ⊻ “ς” (1-to-n** xor)
● /a/ ➝ ⟨ㅇ⟩+⟨ㅏ⟩ ➝ “아”*** (1-to-n** and)
● /f/ + /i/ ➝ ⟨f⟩ + ⟨i⟩ ➝ “fi” ∨ “ﬁ” (n-to-1** ligature)
* in 18pt Lobster
** in 24pt Times New Roman
*** a syllable, really?
13. New Question
From:
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?
To:
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
14. Abstract Characters
Abstract characters ‘•’
● A distinct unit of textual information in some information system
● In French ‘a’ and also, maybe ‘¸’ or ‘^’
● In Korean ‘아’ and also, maybe ‘ㅇ’ or ‘ㅏ’
● Orthographic characters (or graphemes?)
+ Tabulations, spaces, control characters…
16. Very Naive Encoding
Problems with this very naive encoding
● Strict 1-to-1 mapping between abstract characters and glyphs
● Limit of 128 different abstract characters and glyphs
○ Could be solved by using more bits, e.g., 8 bits
● Impossible to handle right-to-left scripts and “stackable” scripts
○ In Tibetan script, ཨམཎིཔ ྨེ ྃ is actually […]
● Impossible to handle language-specific rules
○ Not solvable “as is”
● What about sorting, capitalisation (e.g., “ß” becomes “SS”)...
17. Very Naive Encoding
Impossible to handle any language-specific rule
● In Greek, the grapheme sigma has two different glyphs
○ “κοσμοσ” must be “κόσμος”
● In Arabic, the same grapheme has different glyphs depending on its
position in the text, e.g., ⟨ġayn⟩
○ Isolated ﻍ
○ Beginning ﻏ
○ Middle ﻐ
○ End ﻎ
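The Greek rule above can be sketched as a context-dependent substitution on code points. This is only an illustration under simplifying assumptions: the helper names `is_greek_letter` and `apply_final_sigma` are hypothetical, the letter test covers only a rough range of the Greek block, and real implementations follow the more careful Final_Sigma condition of the Unicode standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGMA       0x03C3u /* σ GREEK SMALL LETTER SIGMA */
#define FINAL_SIGMA 0x03C2u /* ς GREEK SMALL LETTER FINAL SIGMA */

/* Rough approximation (assumption): treat U+0386..U+03CE, minus the
 * punctuation mark U+0387, as "Greek letters". */
static int is_greek_letter(uint32_t cp)
{
    return cp >= 0x0386u && cp <= 0x03CEu && cp != 0x0387u;
}

/* Replace σ by ς in place when it ends a word, i.e., when it follows a
 * Greek letter and is not itself followed by one. */
void apply_final_sigma(uint32_t *text, size_t len)
{
    assert(text != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (text[i] != SIGMA)
            continue;
        int after_letter = i > 0 && is_greek_letter(text[i - 1]);
        int before_letter = i + 1 < len && is_greek_letter(text[i + 1]);
        if (after_letter && !before_letter)
            text[i] = FINAL_SIGMA;
    }
}
```

Applied to the code points of “κοσμοσ”, only the last σ becomes ς. The Arabic case works on the same principle, except that the choice among four forms is usually left to the font and the shaper rather than stored in the text.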
19. Very Smart Encoding
Separate abstract characters and glyphs
● \u0041 ⇄ ??? ⇄ ‘A’
○ 1-to-1 mapping
● \u110B\u1161 ➝ ??? ➝ ‘아’
○ n-to-n mapping
● \u1106\u1161\u11AD ➝ ??? ➝ ‘많’
○ n-to-m mapping*
* \u11AD is ᆭ but \u11AB is ᆫ and \u11C2 is ᇂ, design choice?
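For Hangul, part of the “???” step is purely arithmetic: the Unicode standard defines how a leading consonant, a vowel, and an optional trailing consonant (conjoining jamo) combine into one precomposed syllable. The sketch below follows that published formula; the function name `hangul_compose` is hypothetical, and it assumes the consonant before the vowel is given as a leading jamo (U+1100..U+1112), e.g., U+110B for ㅇ.

```c
#include <assert.h>
#include <stdint.h>

#define S_BASE  0xAC00u /* first precomposed syllable, 가 */
#define L_BASE  0x1100u /* first leading consonant */
#define V_BASE  0x1161u /* first vowel */
#define T_BASE  0x11A7u /* one before the first trailing consonant */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u     /* 27 trailing consonants + "none" */

/* Compose a leading consonant l, a vowel v, and a trailing consonant t
 * into a precomposed syllable. Pass t = 0 when there is no trailing
 * consonant. Returns 0 if any argument is outside its jamo range. */
uint32_t hangul_compose(uint32_t l, uint32_t v, uint32_t t)
{
    if (l < L_BASE || l >= L_BASE + L_COUNT)
        return 0;
    if (v < V_BASE || v >= V_BASE + V_COUNT)
        return 0;
    uint32_t t_index = 0;
    if (t != 0) {
        if (t <= T_BASE || t >= T_BASE + T_COUNT)
            return 0;
        t_index = t - T_BASE;
    }
    return S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT + t_index;
}
```

With these definitions, `hangul_compose(0x110B, 0x1161, 0)` gives 0xC544 (아) and `hangul_compose(0x1106, 0x1161, 0x11AD)` gives 0xB9CE (많). A trailing jamo such as U+11BC is rejected in the leading position.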
20. Encoding and Displaying
Analyse abstract characters and display glyphs
● A font
○ Contains glyphs
○ Contains rules for ligatures, e.g., “fi” vs. “ﬁ”
○ Contains rules for positioning, e.g., “ ﻍ” or “ﻏ”
○ Contains rules for marks and accents, e.g., “é” or “ӂ”
○ May not have glyphs for all abstract characters! The infamous � and
● Shaping
○ Maps codepoints in a text to the corresponding glyphs provided in the font
● Rendering
○ Converts given glyphs into bitmaps to display on a canvas
27. Encoding and Displaying - Other Examples
€
#define TEXT "\u20AC"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "times.ttf"
1 code point (3 bytes)
1 glyph, ID = 188
€
#define TEXT "\u20AC"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "himalaya.ttf"
1 code point (3 bytes)
1 glyph, ID = 201
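The “1 code point (3 bytes)” figure comes from the UTF-8 encoding of U+20AC, independently of the font. A minimal sketch of that encoding step follows; `utf8_encode` is a hypothetical helper written for this illustration, not a HarfBuzz or FreeType function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+20AC the function writes the three bytes E2 82 AC. The glyph IDs (188 and 201) differ only because each font numbers its glyphs in its own way.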
29. Encoding and Displaying - Other Examples
#define TEXT "\U0001F469\u200D\U0001F9B0"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
3 code points (11 bytes)
1 glyph, ID = 10849
#define TEXT "\U0001F469\U0001F3FE\u200D\u2…"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
1 code point (35 bytes)
● Glyph ID = 7452 *
● Glyph ID = 7466
● Glyph ID = 3
● Glyph ID = 7461
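The gap between bytes, code points, and glyphs in these examples can be checked by counting code points directly in the UTF-8 bytes. The sketch below assumes well-formed, NUL-terminated UTF-8; `utf8_count_code_points` is a hypothetical helper, and it says nothing about how many glyphs a font will finally produce.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated, well-formed UTF-8 string.
 * Every code point starts with exactly one byte that is not a
 * continuation byte (10xxxxxx), so counting those bytes is enough. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the first example (woman, zero-width joiner, red hair), the 11 bytes F0 9F 91 A9 E2 80 8D F0 9F A6 B0 contain 3 code points, which the font then maps to a single glyph.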
31. Encoding and Displaying - Other Examples
👩 (woman; U+1F469)
+ 🌾 (sheaf of rice; U+1F33E)
= 👩‍🌾 (woman farmer; U+1F469 U+200D U+1F33E)*
👨 (man; U+1F468)
+ 🏭 (factory; U+1F3ED)
= 👨‍🏭 (man factory worker; U+1F468 U+200D U+1F3ED)
👩 (woman; U+1F469)
+ ✈ (plane; U+2708)
= 👩‍✈️ (woman pilot; U+1F469 U+200D U+2708 U+FE0F)
👩 (woman; U+1F469)
+ 🚀 (rocket; U+1F680)
= 👩‍🚀
* The + is actually U+200D, the zero-width joiner
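The “addition” on this slide amounts to interleaving U+200D between the component code points. The following sketch shows that step only; `zwj_join` is a hypothetical helper, and it does not add variation selectors such as U+FE0F, which some sequences (e.g., the pilot) also need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZWJ 0x200Du /* ZERO WIDTH JOINER */

/* Write parts[0] ZWJ parts[1] ZWJ ... parts[n-1] into out.
 * Returns the number of code points written, or 0 if n is 0 or
 * out (capacity cap) is too small. */
size_t zwj_join(const uint32_t *parts, size_t n, uint32_t *out, size_t cap)
{
    assert(parts != NULL && out != NULL);
    if (n == 0 || 2 * n - 1 > cap)
        return 0;
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            out[k++] = ZWJ;
        out[k++] = parts[i];
    }
    return k;
}
```

Joining U+1F469 and U+1F33E this way yields the three code points of the woman farmer sequence. Whether they are displayed as one glyph or as two separate emojis depends on the font.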
32. New Question - Answer
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
● Complex interactions between scripts, languages, displaying
● Complex, multi-tier process between shaping and rendering
● Add one level of indirection between
○ Abstract characters
○ Glyphs
● Share the responsibility
○ Unicode, fonts, and other pieces of software
33. Discussions
Some numbers (as of version 14.0.0 of Sep. 14th, 2021)
● Total possible number of code points
○ 17 planes × 65,536 code points − 2,048 surrogates − 66 noncharacters = 1,111,998
● Current number of used code points
○ 144,697
● Number of scripts
○ 159
■ Not including Cirth, Tengwar, or Klingon, which are not encoded in Unicode
● Number of emojis
○ 3,633 (not “unique”)
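The total of 1,111,998 can be double-checked by brute force over the whole code space. The sketch below uses the standard definitions of surrogates (U+D800..U+DFFF) and noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* U+D800..U+DFFF: 2,048 code points reserved for UTF-16. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* 66 noncharacters: U+FDD0..U+FDEF (32) and U+xxFFFE / U+xxFFFF
 * in each of the 17 planes (34). */
static int is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Count code points that are neither surrogates nor noncharacters. */
uint32_t count_usable_code_points(void)
{
    uint32_t n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (!is_surrogate(cp) && !is_noncharacter(cp))
            n++;
    }
    return n;
}
```

The loop visits 17 × 65,536 = 1,114,112 code points and returns 1,111,998, matching the subtraction above.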
34. Discussions
Left-to-Right, Right-to-Left…
● Glyphs with “heads”
○ Egyptian hieroglyphs
● Both directions
○ Chinese, Japanese, and Korean: 海南航空 and 空航南海
● Mirrored glyphs every other line (boustrophedon)
○ Ancient Greek and Old Hungarian
● Reversed writing direction every second line
○ Moon’s type
35. Discussions
Unicode code points encoding
● UTF-8 with a variable number of bytes per code point
● UTF-16 and UTF-32
Unicode and glyph manipulation
● Counting
● Sorting
● …
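UTF-16 is also variable-length: code points above U+FFFF take two 16-bit units (a surrogate pair), which is one reason why counting “characters” is harder than it looks. A minimal sketch of the UTF-16 step, with the hypothetical helper `utf16_encode`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 into out (room for 2 units).
 * Returns the number of 16-bit units written, or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000; /* 20 bits remain, split 10 + 10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+20AC (€) fits in one unit, while U+1F469 (woman) becomes the pair D83D DC69. In UTF-32 every code point takes exactly one 32-bit unit, so no such step is needed.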
36. Discussions
Unicode encodes abstract characters, not glyphs
● \u0041 ➝ A called LATIN CAPITAL LETTER A
● \u0410 ➝ А called CYRILLIC CAPITAL LETTER A
● \u0391 ➝ Α called GREEK CAPITAL LETTER ALPHA
IDN Homograph Attack
● wikipedia.org ≠ wikipеdiа.org
○ Using a tool like https://unicode-table.com/en/
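Because look-alike letters have distinct code points, a spoofed name can be flagged by checking whether it mixes scripts. The sketch below is a deliberately crude illustration: it classifies code points by block ranges (an assumption; real checks use the Unicode Script property), and the names `classify` and `mixes_scripts` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SCRIPT_OTHER, SCRIPT_LATIN, SCRIPT_GREEK, SCRIPT_CYRILLIC } rough_script;

/* Crude classification by code point range (assumption: ASCII letters
 * stand in for Latin; whole blocks stand in for Greek and Cyrillic). */
static rough_script classify(uint32_t cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A))
        return SCRIPT_LATIN;
    if (cp >= 0x0370 && cp <= 0x03FF)
        return SCRIPT_GREEK;
    if (cp >= 0x0400 && cp <= 0x04FF)
        return SCRIPT_CYRILLIC;
    return SCRIPT_OTHER;
}

/* Return 1 if the label contains letters from more than one of the
 * scripts above, 0 otherwise. Digits, dots, etc. are ignored. */
int mixes_scripts(const uint32_t *label, size_t len)
{
    assert(label != NULL || len == 0);
    rough_script seen = SCRIPT_OTHER;
    for (size_t i = 0; i < len; i++) {
        rough_script s = classify(label[i]);
        if (s == SCRIPT_OTHER)
            continue;
        if (seen == SCRIPT_OTHER)
            seen = s;
        else if (s != seen)
            return 1;
    }
    return 0;
}
```

The all-Latin “wikipedia” passes, whereas the variant with Cyrillic е (U+0435) and а (U+0430) is reported as mixed. A label written entirely in one foreign script is not caught by this check.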
37. Conclusion
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
● Complex, multi-tier process
● Uses one level of indirection!
● Needs dedicated pieces of software
○ Shaping
○ Rendering
○ Processing
■ Sorting
■ Counting
■ Casing
■ …
38. Thanks
● Bobh and Drowe at SIL
● Behdad Esfahbod, Khaled Hosny, David L. Rowe at HarfBuzz
39. References
● Bear plus snowflake equals polar bear (andysalerno.com)
● Understanding characters, keystrokes, codepoints and glyphs (sil.org)
● Characters, glyphs, code-points, and byte-sequences (tcl-lang.org)
● Why does Unicode have separate codepoints for characters with identical glyphs? (stackexchange.org)
● What's the difference between a character, a code point, a glyph and a grapheme? (stackoverflow.org)
● Know : Languages List and their Writing direction (propelsteps.wordpress.com)
● Item of the Month, June 2010: William Moon and Moon Type (sabrinamessenger.blogspot.com)