An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Implementation

🐻 + ❄️ = 🐻‍❄️
Yann-Gaël Guéhéneuc
Ptidej Team
2022/04/XXX

Inspired by “Bear plus snowflake equals polar bear” (andysalerno.com)

Preamble
Languages and scripts
● English, French… use the Latin alphabet
● Greek, Cypriot Greek use the Greek alphabet
● Korean, Cia-Cia… use the Hangul alphabet (alphabetic syllabary)
● Russian, Bulgarian, Tajik… use the Cyrillic script (different alphabets)
● Arabic alphabet, Chinese logograms, Japanese logograms/syllabaries…
and emojis!

Preamble
Languages and scripts
● Bonjour
● Γειά σου
● 여보세요
● Привет
● مرحبًا, 你好, こんにちは… and emojis

Question
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?

Background
Text and code points
● “A code point is the atomic unit of
information. Text is a sequence of code
points.”
Glyphs
● “[S]pecific shape, design [… ] of a
character”
Typefaces
● Set of glyphs in a common design
● Baumans, Impact
Fonts
● Particular sets of glyphs in typefaces
● 10pt Pacifico, 18pt Pacifico
● ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ (12pt
Pacifico)
● Sans serif font, serif font

Background
Phonemes /•/
● /i/, /e/, /a/, /o/, /u/
Graphemes ⟨•⟩
● Correspond to sounds, i.e., phonemes
● “[S]mallest functional unit[s] of a writing system”
● “sequence of one or more code points [...] displayed as a single,
graphical unit [corresponding to] a single element of the writing system”

Background
Graphemes depend on writing systems
Writing systems imply orthographies
● ⟨a⟩ in French
● ⟨á⟩ in Northern Sámi…
… But not in Danish!
● ⟨á⟩ is sorted differently from ⟨a⟩ in Sámi
● ⟨á⟩ is a variant of/sorted as ⟨a⟩ in Danish

Background
Graphemes ⟨•⟩
● /ʃ/ in English ⟨sh⟩, in French ⟨ch⟩, in Greek, in Korean*, in Russian ⟨Ш⟩
● /a/ in English (≠ /æ/), in French ⟨a⟩, in Korean ⟨ㅏ⟩, in Russian ⟨а⟩
* except in “씨”

Background
Orthographic characters “•”
● “[A] distinct unit of writing in some writing system”
○ In French “a” but not “¸” or “^”
○ In Inuktitut “ᐱ” (/pi/)
○ In German “ß” but not in French*
○ In Korean “아” but not “ㅇ” or “ㅏ”
* Guess what the upper case of “ß” is?

Background
Phonemes, graphemes, orthographic characters, and glyphs
● /f/ ➝ ⟨f⟩ ➝ “f” (1-to-1*)
● /s/ ➝ ⟨σ⟩ ➝ “σ” ⊻ “ς” (1-to-n** xor)
● /a/ ➝ ⟨ㅇ⟩+⟨ㅏ⟩ ➝ “아”*** (1-to-n** and)
● /f/ + /i/ ➝ ⟨f⟩ + ⟨i⟩ ➝ “fi” ∨ “ﬁ” (n-to-1** ligature)
* in 18pt Lobster
** in 24pt Times New Roman
*** a syllable, really?

New Question
From:
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?
To:
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?

Abstract Characters
Abstract characters ‘•’
● A distinct unit of textual information in some information system
● In French ‘a’ and also, maybe ‘¸’ or ‘^’
● In Korean ‘아’ and also, maybe ‘ㅇ’ or ‘ㅏ’
● Orthographic characters (or graphemes?)
+ Tabulations, spaces, control characters…

Encodings
Abstract characters depend on encodings
Very naive, 7-bit encoding, aka ASCII character encoding
● … ⇄ …
● 10 (0x0A) ⇄ LF
● … ⇄ …
● 65 (0x41) ⇄ ‘A’
● … ⇄ …
● 126 (0x7E) ⇄ ‘~’
● 127 (0x7F) ⇄ DEL
https://www.youtube.com/watch?v=MijmeoH9LT4
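
The 7-bit limit of this encoding is easy to check in code. The following is a minimal sketch in C, assuming the text is available as raw bytes; the function name `is_ascii` is illustrative and does not come from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every byte fits in 7 bits, i.e., is a code in 0..127. */
bool is_ascii(const unsigned char *bytes, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        if (bytes[i] > 0x7F) {
            return false; /* the eighth bit is set: outside the ASCII table */
        }
    }
    return true;
}
```

Any byte above 0x7F (for example, the bytes that encode “€” elsewhere) makes the check fail, which is exactly the 128-character limit of the table above.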

Very Naive Encoding
Problems with this very naive encoding
● Strict 1-to-1 mapping between abstract characters and glyphs
● Limit of 128 different abstract characters and glyphs
○ Could be solved by using more bits, e.g., 8 bits
● Impossible to handle right-to-left scripts and “stackable” scripts
○ In Tibetan script, ཨོཾམཎིཔདྨེཧཱུྃ is actually built from letters and marks stacked on one another
● Impossible to handle language-specific rules
○ Not solvable “as is”
● What about sorting, capitalisation (e.g., “ß” becomes “SS”)...
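
The “ß” becomes “SS” case shows why a strict 1-to-1 table cannot express capitalisation: one character turns into two. A minimal sketch in C, assuming the text is an array of code points and handling only a–z plus “ß” (the name `to_upper_sketch` is hypothetical; real case mapping needs the full Unicode tables):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Upper-cases `n` code points from `in` into `out` and returns the number
 * of code points written. `out` must have room for 2 * n entries because
 * U+00DF (ß) expands to two letters.
 */
size_t to_upper_sketch(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp == 0x00DF) {                     /* ß becomes "SS" */
            out[written++] = 0x0053;
            out[written++] = 0x0053;
        } else if (cp >= 0x0061 && cp <= 0x007A) {
            out[written++] = cp - 0x20;         /* a..z become A..Z */
        } else {
            out[written++] = cp;                /* everything else unchanged */
        }
    }
    return written;
}
```

The output can be longer than the input, so callers cannot upper-case “in place” one character at a time.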

Very Naive Encoding
Impossible to handle any language-specific rule
● In Greek, the grapheme sigma has two different glyphs
○ “κοσμοσ” must be “κόσμος”
● In Arabic, the same grapheme has different glyphs depending on its
position in the text, e.g., ⟨ġayn⟩
○ Isolated ﻍ
○ Beginning ﻏ
○ Middle ﻐ
○ End ﻎ
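
The Greek rule can be sketched as a pass over code points. This is a simplified illustration in C, assuming a sigma is “final” when a Greek letter precedes it and no Greek letter follows it; the actual Unicode condition is more detailed, and the names `is_greek_letter` and `apply_final_sigma` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIGMA       0x03C3u /* σ */
#define FINAL_SIGMA 0x03C2u /* ς */

/* Rough test: anything in the Greek and Coptic block counts as a letter. */
static bool is_greek_letter(uint32_t cp)
{
    return cp >= 0x0370 && cp <= 0x03FF;
}

/* Replaces σ with ς, in place, wherever σ ends a run of Greek letters. */
void apply_final_sigma(uint32_t *text, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bool preceded = i > 0 && is_greek_letter(text[i - 1]);
        bool followed = i + 1 < n && is_greek_letter(text[i + 1]);
        if (text[i] == SIGMA && preceded && !followed) {
            text[i] = FINAL_SIGMA;
        }
    }
}
```

Applied to the code points of “κοσμοσ”, only the last sigma changes, giving “κοσμος”; the rule depends on context, which a fixed character-to-glyph table cannot capture.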

Solution
Usual software-engineering “trick”...
Add one level of indirection!

Very Smart Encoding
Separate abstract characters and glyphs
● \u0041 ⇄ ??? ⇄ ‘A’
○ 1-to-1 mapping
● \u110B\u1161 ➝ ??? ➝ ‘아’
○ n-to-n mapping
● \u1106\u1161\u11AD ➝ ??? ➝ ‘많’
○ n-to-m mapping*
* \u11AD is ᆭ but \u11AB is ᆫ and \u11C2 is ᇂ, design choice?
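
For Hangul, part of the “???” step is pure arithmetic: the Unicode standard defines how conjoining jamo combine into a precomposed syllable. A minimal sketch in C, assuming a leading consonant in U+1100..U+1112, a vowel in U+1161..U+1175, and an optional trailing consonant in U+11A8..U+11C2 (the function name `compose_hangul` is illustrative):

```c
#include <stdint.h>

enum {
    S_BASE = 0xAC00, /* first precomposed syllable, 가 */
    L_BASE = 0x1100, /* first leading consonant */
    V_BASE = 0x1161, /* first vowel */
    T_BASE = 0x11A7, /* one before the first trailing consonant */
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28
};

/*
 * Composes a leading consonant, a vowel, and an optional trailing
 * consonant (pass 0 for none) into one syllable code point.
 * Returns 0 when an argument is outside the expected jamo range.
 */
uint32_t compose_hangul(uint32_t l, uint32_t v, uint32_t t)
{
    if (l < L_BASE || l >= L_BASE + L_COUNT) return 0;
    if (v < V_BASE || v >= V_BASE + V_COUNT) return 0;
    if (t != 0 && (t <= T_BASE || t >= T_BASE + T_COUNT)) return 0;

    uint32_t l_index = l - L_BASE;
    uint32_t v_index = v - V_BASE;
    uint32_t t_index = (t != 0) ? t - T_BASE : 0;
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index;
}
```

With these assumptions, U+110B U+1161 composes to U+C544 (아) and U+1106 U+1161 U+11AD composes to U+B9CE (많). A trailing consonant such as U+11BC in the leading position is rejected.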

Encoding and Displaying
Analyse abstract characters and display glyphs
● A font
○ Contains glyphs
○ Contains rules for ligatures, e.g., “fi” vs. “ﬁ”
○ Contains rules for positioning, e.g., “ﻍ” or “ﻏ”
○ Contains rules for marks and accents, e.g., “é” or “ӂ”
○ May not have glyphs for all abstract characters! The infamous � and □
● Shaping
○ Maps code points in a text to the corresponding glyphs provided in the font
● Rendering
○ Converts given glyphs into bitmaps to display on a canvas

Encoding and Displaying - ཨོཾམཎིཔདྨེཧཱུྃ*
\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54
\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
➝ Shaping ➝
{350, 462, 334, 324, 354,
330, 874, 403, 362, 859, 371}
➝ Rendering ➝ bitmaps
* Oṃ maṇi padme hūṃ

Encoding and Displaying - ཨོཾམཎིཔདྨེཧཱུྃ
14 code points (42 bytes in UTF-8)
11 glyphs, each corresponding to one bitmap:
1. Glyph ID = 350
2. Glyph ID = 462
3. Glyph ID = 334
4. Glyph ID = 324
5. Glyph ID = 354
6. Glyph ID = 330
7. Glyph ID = 874
8. Glyph ID = 403
9. Glyph ID = 362
10. Glyph ID = 859
11. Glyph ID = 371
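
The gap between bytes and code points can be measured directly. A minimal sketch in C, assuming the input is well-formed, NUL-terminated UTF-8 (the function name `utf8_count_code_points` is illustrative): every code point starts with exactly one byte that is not a continuation byte of the form 10xxxxxx.

```c
#include <stddef.h>

/* Counts code points in a well-formed, NUL-terminated UTF-8 string. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) { /* not a continuation byte */
            count++;
        }
    }
    return count;
}
```

For the Tibetan text above, `strlen` reports 42 bytes while this function reports 14 code points, since each of these code points takes three bytes in UTF-8.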

Encoding and Displaying - Code!
\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54
\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
➝ HarfBuzz (shaping) ➝
{350, 462, 334, 324, 354,
330, 874, 403, 362, 859, 371}
➝ FreeType (rendering) ➝ bitmaps

Encoding and Displaying - HarfBuzz (Shaping)
#include <stdio.h>
#include <string.h>
#include <hb.h>

/* The 14 code points of the mantra, as listed on the previous slides. */
#define TEXT "\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54" \
             "\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83"
#define FONT "himalaya.ttf"

int main(void)
{
    printf("Text = \"%s\"\n", TEXT);
    printf("Length = %zu bytes\n\n", strlen(TEXT));

    /* Describe the text to shape: content, direction, script, language. */
    hb_buffer_t *buf = hb_buffer_create();
    hb_buffer_add_utf8(buf, TEXT, -1, 0, -1);
    hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
    hb_buffer_set_script(buf, HB_SCRIPT_TIBETAN);
    hb_buffer_set_language(buf, hb_language_from_string("bo", -1));

    /* Load the font that provides the glyphs and the shaping rules. */
    hb_blob_t *blob = hb_blob_create_from_file("Fonts/" FONT);
    hb_face_t *face = hb_face_create(blob, 0);
    hb_font_t *font = hb_font_create(face);

    /* Shape: map the code points to positioned glyphs. */
    hb_shape(font, buf, NULL, 0);

    unsigned int glyph_count;
    hb_glyph_info_t *glyph_info = hb_buffer_get_glyph_infos(buf, &glyph_count);
    hb_glyph_position_t *glyph_pos = hb_buffer_get_glyph_positions(buf, &glyph_count);

    hb_position_t cursor_x = 0;
    hb_position_t cursor_y = 0;
    for (unsigned int i = 0; i < glyph_count; i++) {
        /* After shaping, "codepoint" holds a glyph ID, not a code point. */
        hb_codepoint_t glyphid = glyph_info[i].codepoint;
        hb_position_t x_advance = glyph_pos[i].x_advance;
        hb_position_t y_advance = glyph_pos[i].y_advance;
        printf("Glyph ID = %u\n", glyphid);
        cursor_x += x_advance;
        cursor_y += y_advance;
    }

    hb_buffer_destroy(buf);
    hb_font_destroy(font);
    hb_face_destroy(face);
    hb_blob_destroy(blob);
    return 0;
}

Encoding and Displaying - FreeType (Rendering)
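#include <stdio.h>
#include <ft2build.h>
#include FT_FREETYPE_H

#define WIDTH     340
#define HEIGHT    18
#define FILENAME  "himalaya.ttf"
#define GLYPH_IDS { 350, 462, 334, 324, 354, 330, 874, 403, 362, 859, 371 }
#define NUM_CHARS 11

/* draw_bitmap() and show_image() are not shown on the slide; these are
   minimal stand-ins modelled on the FreeType tutorial example. */
static unsigned char image[HEIGHT][WIDTH];

/* Copies one rendered glyph into the image, clipping at the borders. */
static void draw_bitmap(FT_Bitmap *bitmap, FT_Int x, FT_Int y)
{
    for (unsigned int row = 0; row < bitmap->rows; row++) {
        for (unsigned int col = 0; col < bitmap->width; col++) {
            FT_Int i = x + (FT_Int)col;
            FT_Int j = y + (FT_Int)row;
            if (i < 0 || j < 0 || i >= WIDTH || j >= HEIGHT) continue;
            image[j][i] |= bitmap->buffer[(int)row * bitmap->pitch + (int)col];
        }
    }
}

/* Prints the image as ASCII art. */
static void show_image(void)
{
    for (int j = 0; j < HEIGHT; j++) {
        for (int i = 0; i < WIDTH; i++)
            putchar(image[j][i] == 0 ? ' ' : image[j][i] < 128 ? '+' : '*');
        putchar('\n');
    }
}

int main(void)
{
    FT_Library library;
    FT_Face face;
    FT_UInt glyph_ids[] = GLYPH_IDS;

    if (FT_Init_FreeType(&library) ||
        FT_New_Face(library, "Fonts/" FILENAME, 0, &face)) {
        fprintf(stderr, "Cannot initialise FreeType or load the font\n");
        return 1;
    }
    FT_Set_Char_Size(face, 36 * 64, 12 * 64, 0, 0);

    FT_Vector pen;
    pen.x = 0;
    pen.y = (HEIGHT - 12) * 64;
    FT_GlyphSlot slot = face->glyph;
    for (int n = 0; n < NUM_CHARS; n++) {
        /* Translate the glyph to the pen position, then load and render it. */
        FT_Set_Transform(face, NULL, &pen);
        FT_Error error = FT_Load_Glyph(face, glyph_ids[n], FT_LOAD_RENDER);
        if (error) continue;
        draw_bitmap(&slot->bitmap, slot->bitmap_left, HEIGHT - slot->bitmap_top);
        pen.x += slot->advance.x;
        pen.y += slot->advance.y;
    }
    show_image();

    FT_Done_Face(face);
    FT_Done_FreeType(library);
    return 0;
}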

Encoding and Displaying - FreeType (Rendering)
\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54
\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
➝ {350, 462, 334, 324, 354, 330,
874, 403, 362, 859, 371}

Encoding and Displaying - Other Examples
€
#define TEXT "\u20AC"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "times.ttf"
1 code point (3 bytes)
1 glyph, ID = 188
€
#define TEXT "\u20AC"
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "himalaya.ttf"
1 glyph, ID = 201

€
\u20AC
{ 201 }


#define TEXT "\U0001F469\u200D\U0001F9B0"
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
3 code points (11 bytes)
1 glyph, ID = 10849

#define TEXT "\U0001F469\U0001F3FE\u200D\u2…
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
● Glyph ID = 7452 *
● Glyph ID = 7466
● Glyph ID = 3
● Glyph ID = 7461

👩 (woman; U+1F469)
+ 🌾 (sheaf of rice; U+1F33E)
= 👩‍🌾 (woman farmer; U+1F469 U+200D U+1F33E)*
👨 (man; U+1F468)
+ 🏭 (factory; U+1F3ED)
= 👨‍🏭 (man factory worker; U+1F468 U+200D U+1F3ED)
👩 (woman; U+1F469)
+ ✈ (plane; U+2708)
= 👩‍✈️ (woman pilot; U+1F469 U+200D U+2708 U+FE0F)
👩 (woman; U+1F469)
+ 🚀 (rocket; U+1F680)
= 👩‍🚀
* The + is actually \u200D, the zero-width joiner
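
The role of the zero-width joiner can be made explicit in code. A minimal sketch in C, assuming an emoji sequence is given as an array of code points (the name `count_zwj_elements` is hypothetical): it counts how many elements U+200D glues together, regardless of whether a font then draws them as one glyph.

```c
#include <stddef.h>
#include <stdint.h>

#define ZWJ 0x200Du /* ZERO WIDTH JOINER */

/* Counts the elements joined by U+200D in one sequence of code points. */
size_t count_zwj_elements(const uint32_t *cps, size_t n)
{
    if (n == 0) {
        return 0;
    }
    size_t elements = 1;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] == ZWJ) {
            elements++; /* each joiner introduces one more element */
        }
    }
    return elements;
}
```

The woman pilot sequence U+1F469 U+200D U+2708 U+FE0F has four code points but two joined elements; whether it shows as one glyph or several is decided by the font, as in the glyph listings above.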

New Question - Answer
● Complex interactions between scripts, languages, displaying
● Complex, multi-tier process between shaping and rendering
● Add one level of indirection between
○ Abstract characters
○ Glyphs
● Share the responsibility
○ Unicode, fonts, and other pieces of software

Discussions
Some numbers (as of version 14.0.0 of Sep. 14th, 2021)
● Total possible number of code points
○ 17 planes × 65,536 characters − 2,048 surrogates − 66 non-characters = 1,111,998
● Current number of used code points
○ 144,697
● Number of scripts
○ 159
■ Not including Cirth, Tengwar, or Klingon, which only have unofficial Private Use Area assignments (ConScript Unicode Registry)
● Number of emojis
○ 3,633 (not “unique”)
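
The total of 1,111,998 can be re-derived by brute force. A minimal sketch in C, assuming the 66 non-characters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes (the function names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Surrogates U+D800..U+DFFF are reserved for UTF-16. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* 32 non-characters in U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF per plane. */
bool is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Walks all 17 planes and counts what is neither surrogate nor non-character. */
uint32_t count_usable_code_points(void)
{
    uint32_t count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (!is_surrogate(cp) && !is_noncharacter(cp)) {
            count++;
        }
    }
    return count;
}
```

The loop visits 17 × 65,536 = 1,114,112 values and skips 2,048 + 66 of them, which matches the subtraction on the slide.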

Discussions
Left-to-Right, Right-to-Left…
● Glyphs with “heads”
○ Egyptian hieroglyphs
● Both directions
○ Chinese, Japanese, and Korean: 海南航空 and 空航南海
● Mirrored glyphs every other line (boustrophedon)
○ Ancient Greek and Old Hungarian
● Reversed writing direction every second line
○ Moon type

Discussions
Unicode code points encoding
● UTF-8 with a variable number of bytes per code point
● UTF-16 and UTF-32
Unicode and glyph manipulations
● Counting
● Sorting
● …
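
The variable length of UTF-8 follows from a small set of bit patterns. A minimal sketch in C of an encoder for a single code point (the function name `utf8_encode` is illustrative; it returns 0 for surrogates and for values above U+10FFFF):

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point into out[0..3]; returns the number of bytes used. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                       /* surrogates are not encodable */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the last plane */
}
```

This reproduces the byte counts seen earlier: U+20AC (€) takes 3 bytes, and U+1F469 U+200D U+1F9B0 takes 4 + 3 + 4 = 11 bytes.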

Discussions
Unicode encodes abstract characters, not glyphs
● \u0041 ➝ A called LATIN CAPITAL LETTER A
● \u0410 ➝ А called CYRILLIC CAPITAL LETTER A
● \u0391 ➝ Α called GREEK CAPITAL LETTER ALPHA
IDN Homograph Attack
● wikipedia.org ≠ wikipеdiа.org
○ Using a tool like https://unicode-table.com/en/
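
A first line of defence against such homographs is to flag labels that mix scripts. This is a deliberately simplistic sketch in C, assuming the label is an array of code points, that “Latin” means A–Z and a–z, and that “Cyrillic” means the block U+0400..U+04FF; the names are hypothetical and a real check would rely on the Unicode script data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_basic_latin_letter(uint32_t cp)
{
    return (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A);
}

static bool is_cyrillic(uint32_t cp)
{
    return cp >= 0x0400 && cp <= 0x04FF;
}

/* Returns true when the label contains both Latin and Cyrillic letters. */
bool mixes_latin_and_cyrillic(const uint32_t *label, size_t n)
{
    bool latin = false;
    bool cyrillic = false;
    for (size_t i = 0; i < n; i++) {
        latin = latin || is_basic_latin_letter(label[i]);
        cyrillic = cyrillic || is_cyrillic(label[i]);
    }
    return latin && cyrillic;
}
```

The genuine “wikipedia” passes, while the look-alike with U+0435 and U+0430 in place of “e” and “a” is flagged, even though both render almost identically.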

Conclusion
● Complex, multi-tier process
● Uses one level of indirection!
● Needs dedicated pieces of software
○ Shaping
○ Rendering
○ Processing
■ Sorting
■ Counting
■ Casing
■ …

Thanks
● Bobh and Drowe at SIL
● Behdad Esfahbod, Khaled Hosny, David L. Rowe at HarfBuzz

References
● Bear plus snowflake equals polar bear (andysalerno.com)
● Understanding characters, keystrokes, codepoints and glyphs (sil.org)
● Characters, glyphs, code-points, and byte-sequences (tcl-lang.org)
● Why does Unicode have separate codepoints for characters with identical glyphs? (stackexchange.com)
● What's the difference between a character, a code point, a glyph and a grapheme? (stackoverflow.com)
● Know : Languages List and their Writing direction (propelsteps.wordpress.com)
● Item of the Month, June 2010: William Moon and Moon Type (sabrinamessenger.blogspot.com)