🐻 + ❄ = 🐻‍❄️
Yann-Gaël Guéhéneuc
Ptidej Team
2022/04/XXX
Inspired by “Bear plus snowflake equals polar bear” (andysalerno.com)
3
Preamble
Languages and scripts
● English, French… use the Latin alphabet
● Greek, Cypriot Greek use the Greek alphabet
● Korean, Cia-Cia… use the Hangul alphabet (alphabetic syllabary)
● Russian, Bulgarian, Tajik… use the Cyrillic script (different alphabets)
● Arabic alphabet, Chinese logograms, Japanese logograms/syllabaries…
and emojis!
4
Preamble
Languages and scripts
● Bonjour
● Γειά σου
● 여보세요
● Привет
● مرحبًا, 你好, こんにちは… and emojis
5
Question
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?
6
Background
Text and code points
● “A code point is the atomic unit of
information. Text is a sequence of code
points.”
Glyphs
● “[S]pecific shape, design [… ] of a
character”
Typefaces
● Set of glyphs in a common design
● Baumans, Impact
Fonts
● Particular sets of glyphs in typefaces
● 10pt Pacifico, 18pt Pacifico
● ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ (12pt
Pacifico)
● Sans serif font, serif font
7
Background
Phonemes /•/
● /i/, /e/, /a/, /o/, /u/
Graphemes ⟨•⟩
● Correspond to sounds, i.e., phonemes
● “[S]mallest functional unit[s] of a writing system”
● “sequence of one or more code points [...] displayed as a single,
graphical unit [corresponding to] a single element of the writing system”
8
Background
Graphemes depend on writing systems
Writing systems imply orthographies
● ⟨a⟩ in French
● ⟨á⟩ in Northern Sámi…
… But not in Danish!
● ⟨á⟩ is sorted differently from ⟨a⟩ in Sámi
● ⟨á⟩ is a variant of/sorted as ⟨a⟩ in Danish
9
Background
Graphemes ⟨•⟩
● /ʃ/ in English ⟨sh⟩, in French ⟨ch⟩, in Greek, in Korean*, in Russian ⟨Ш⟩
● /a/ in English (≠ /æ/), in French ⟨a⟩, in Korean ⟨ㅏ⟩, in Russian ⟨а⟩
* except in “씨”
10
Background
Orthographic characters “•”
● “[A] distinct unit of writing in some writing system”
○ In French “a” but not “¸” or “^”
○ In Inuktitut “ᐱ” (/pi/)
○ In German “ß” but not in French*
○ In Korean “아” but not “ㅇ” or “ㅏ”
* guess what is the upper case of “ß”?
11
Background
Phonemes, graphemes, orthographic characters, and glyphs
● /f/ ➝ ⟨f⟩ ➝ “f” (1-to-1*)
● /s/ ➝ ⟨σ⟩ ➝ “σ” ⊻ “ς” (1-to-n** xor)
● /a/ ➝ ⟨ㅇ⟩+⟨ㅏ⟩ ➝ “아”*** (1-to-n** and)
● /f/ + /i/ ➝ ⟨f⟩ + ⟨i⟩ ➝ “fi” ∨ “ﬁ” (n-to-1** ligature)
* in 18pt Lobster
** in 24pt Times New Roman
*** a syllable, really?
12
New Question
From:
How is it possible for a software system to display so many different scripts
and abide by the rules of so many languages?
To:
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
13
Abstract Characters
Abstract characters ‘•’
● A distinct unit of textual information in some information system
● In French ‘a’ and also, maybe ‘¸’ or ‘^’
● In Korean ‘아’ and also, maybe ‘ㅇ’ or ‘ㅏ’
● Orthographic characters (or graphemes?)
+ Tabulations, spaces, control characters…
14
Encodings
Abstract characters depend on encodings
Very naive, 7-bit encoding, aka ASCII character encoding
● … ⇄ …
● 10 (0x0A) ⇄ LF
● … ⇄ …
● 65 (0x41) ⇄ ‘A’
● … ⇄ …
● 126 (0x7E) ⇄ ‘~’
● 127 (0x7F) ⇄ DEL
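As a minimal sketch of what "7-bit" means in practice, the hypothetical helper below (`is_ascii` is not from the slides) checks whether a byte sequence stays within the 128 codes of ASCII, i.e., whether every byte is at most 0x7F.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every byte fits in 7 bits, i.e., is an ASCII code (0..127). */
bool is_ascii(const unsigned char *bytes, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        if (bytes[i] > 0x7F) {
            return false; /* the 8th bit is set: not representable in ASCII */
        }
    }
    return true;
}
```

Any text containing a character outside these 128 codes, such as “é” or “€”, fails this check, which is the limit discussed on the next slide.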
15
https://www.youtube.com/watch?v=MijmeoH9LT4
Very Naive Encoding
Problems with this very naive encoding
● Strict 1-to-1 mapping between abstract characters and glyphs
● Limit of 128 different abstract characters and glyphs
○ Could be solved by using more bits, e.g., 8 bits
● Impossible to handle right-to-left scripts and “stackable” scripts
○ In Tibetan script, ཨོཾམཎིཔདྨེཧཱུྃ is actually a sequence of 14 code points
● Impossible to handle language-specific rules
○ Not solvable “as is”
● What about sorting, capitalisation (e.g., “ß” becomes “SS”)...
16
Very Naive Encoding
Impossible to handle any language-specific rule
● In Greek, the grapheme sigma has two different glyphs
○ “κοσμοσ” must be “κόσμος”
● In Arabic, the same grapheme has different glyphs depending on its
position in the text, e.g., ⟨ġayn⟩
○ Isolated ‫ﻍ‬
○ Beginning ‫ﻏ‬
○ Middle ‫ﻐ‬
○ End ‫ﻎ‬
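The Greek rule can be sketched in a few lines of C. This is an illustration under simplifying assumptions, not how a real text engine does it: the function names are hypothetical, "letter" is approximated by two Greek code-point ranges, and accents (the ό in “κόσμος”) are left untouched. Unicode happens to give the two forms of sigma separate code points, U+03C3 (σ) and U+03C2 (ς), so the rule can be applied directly to a sequence of code points.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIGMA       0x03C3u /* σ GREEK SMALL LETTER SIGMA */
#define FINAL_SIGMA 0x03C2u /* ς GREEK SMALL LETTER FINAL SIGMA */

/* Simplifying assumption: a "letter" is any code point in the
   Greek and Coptic block or in the Greek Extended block. */
static bool is_greek_letter(uint32_t cp)
{
    return (cp >= 0x0370 && cp <= 0x03FF) || (cp >= 0x1F00 && cp <= 0x1FFF);
}

/* Replaces, in place, every σ that ends a word with ς. */
void apply_final_sigma(uint32_t *text, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        bool preceded = i > 0 && is_greek_letter(text[i - 1]);
        bool followed = i + 1 < length && is_greek_letter(text[i + 1]);
        if (text[i] == SIGMA && preceded && !followed) {
            text[i] = FINAL_SIGMA;
        }
    }
}
```

Applied to κ ο σ μ ο σ, only the last sigma changes. Arabic positional forms follow the same idea (the form depends on the neighbours), but they are normally selected by the font during shaping rather than stored in the text.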
17
Solution
Usual software-engineering “trick”...
Add one level of indirection!
18
Very Smart Encoding
Separate abstract characters and glyphs
● \u0041 ⇄ ??? ⇄ ‘A’
○ 1-to-1 mapping
● \u110B\u1161 ➝ ??? ➝ ‘아’
○ n-to-n mapping
● \u1106\u1161\u11AD ➝ ??? ➝ ‘많’
○ n-to-m mapping*
* \u11AD is ᆭ but \u11AB is ᆫ and \u11C2 is ᇂ, design choice?
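For Hangul, the mapping from conjoining jamo to a precomposed syllable is purely arithmetic in the Unicode Standard. The sketch below uses the standard's constants; the function name `compose_hangul` and the convention that `trailing == 0` means "no final consonant" are assumptions made for this example. It expects a leading consonant in U+1100..U+1112, a vowel in U+1161..U+1175, and an optional trailing consonant in U+11A8..U+11C2.

```c
#include <stdint.h>

#define S_BASE  0xAC00u /* first precomposed syllable, 가 */
#define L_BASE  0x1100u /* first leading consonant (choseong) */
#define V_BASE  0x1161u /* first vowel (jungseong) */
#define T_BASE  0x11A7u /* one before the first trailing consonant (jongseong) */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u     /* 27 trailing consonants + "no trailing consonant" */

/* Returns the precomposed syllable, or 0 if an argument is out of range.
   Pass trailing == 0 for a syllable without final consonant. */
uint32_t compose_hangul(uint32_t leading, uint32_t vowel, uint32_t trailing)
{
    if (leading < L_BASE || leading >= L_BASE + L_COUNT) return 0;
    if (vowel < V_BASE || vowel >= V_BASE + V_COUNT) return 0;

    uint32_t t_index = 0;
    if (trailing != 0) {
        if (trailing <= T_BASE || trailing >= T_BASE + T_COUNT) return 0;
        t_index = trailing - T_BASE;
    }
    return S_BASE
         + ((leading - L_BASE) * V_COUNT + (vowel - V_BASE)) * T_COUNT
         + t_index;
}
```

With these inputs, U+110B U+1161 composes to U+C544 (아) and U+1106 U+1161 U+11AD composes to U+B9CE (많). A trailing consonant such as U+11BC passed as the leading consonant is rejected, because leading and trailing forms of the same letter have distinct code points.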
19
Encoding and Displaying
Analyse abstract characters and display glyphs
● A font
○ Contains glyphs
○ Contains rules for ligatures, e.g., “fi” vs. “ﬁ”
○ Contains rules for positioning, e.g., “ ‫ﻍ‬” or “‫ﻏ‬”
○ Contains rules for marks and accents, e.g., “é” or “ӂ”
○ May not have glyphs for all abstract characters! The infamous � and □ (“tofu”)
● Shaping
○ Maps codepoints in a text to the corresponding glyphs provided in the font
● Rendering
○ Converts given glyphs into bitmaps to display on a canvas
20
Encoding and Displaying - ཨོཾམཎིཔདྨེཧཱུྃ*
● Code points: \u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
● Shaping ➝ glyph IDs: {350, 462, 334, 324, 354, 330, 874, 403, 362, 859, 371}
● Rendering ➝ bitmaps
* Oṃ maṇi padme hūṃ
21
Encoding and Displaying - ཨོཾམཎིཔདྨེཧཱུྃ
14 code points (42 bytes)
11 glyphs:
1. Glyph ID = 350
2. Glyph ID = 462
3. Glyph ID = 334
4. Glyph ID = 324
5. Glyph ID = 354
6. Glyph ID = 330
7. Glyph ID = 874
8. Glyph ID = 403
9. Glyph ID = 362
10. Glyph ID = 859
11. Glyph ID = 371
These glyphs correspond to bitmaps:
[Figure: the 11 glyph bitmaps, one per glyph ID]
22
Encoding and Displaying - Code!
● Code points: \u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
● Shaping with HarfBuzz ➝ glyph IDs: {350, 462, 334, 324, 354, 330, 874, 403, 362, 859, 371}
● Rendering with FreeType ➝ bitmaps
23
Encoding and Displaying - HarfBuzz (Shaping)
#include <stdio.h>
#include <string.h>
#include <hb.h>

/* The 14 code points of the mantra, as listed on the previous slides */
#define TEXT "\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83"
#define FONT "himalaya.ttf"

int main(void)
{
    printf("Text = \"%s\"\n", TEXT);
    printf("Length = %zu bytes\n\n", strlen(TEXT));

    /* Describe the text: content, direction, script, and language */
    hb_buffer_t *buf = hb_buffer_create();
    hb_buffer_add_utf8(buf, TEXT, -1, 0, -1);
    hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
    hb_buffer_set_script(buf, HB_SCRIPT_TIBETAN);
    hb_buffer_set_language(buf, hb_language_from_string("bo", -1));

    /* Load the font that provides the glyphs and the shaping rules */
    hb_blob_t *blob = hb_blob_create_from_file("Fonts/" FONT);
    hb_face_t *face = hb_face_create(blob, 0);
    hb_font_t *font = hb_font_create(face);

    /* Shape: map the code points to positioned glyphs */
    hb_shape(font, buf, NULL, 0);

    unsigned int glyph_count;
    hb_glyph_info_t *glyph_info = hb_buffer_get_glyph_infos(buf, &glyph_count);
    hb_glyph_position_t *glyph_pos = hb_buffer_get_glyph_positions(buf, &glyph_count);

    hb_position_t cursor_x = 0;
    hb_position_t cursor_y = 0;
    for (unsigned int i = 0; i < glyph_count; i++) {
        /* After shaping, the "codepoint" field holds a glyph ID */
        hb_codepoint_t glyphid = glyph_info[i].codepoint;
        hb_position_t x_advance = glyph_pos[i].x_advance;
        hb_position_t y_advance = glyph_pos[i].y_advance;
        printf("Glyph ID = %u\n", glyphid);
        cursor_x += x_advance;
        cursor_y += y_advance;
    }

    hb_buffer_destroy(buf);
    hb_font_destroy(font);
    hb_face_destroy(face);
    hb_blob_destroy(blob);
    return 0;
}
24
Encoding and Displaying - FreeType (Rendering)
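#include <stdio.h>
#include <ft2build.h>
#include FT_FREETYPE_H

#define WIDTH 340
#define HEIGHT 18
#define FILENAME "himalaya.ttf"
#define GLYPH_IDS { 350, 462, 334, 324, 354, 330, 874, 403, 362, 859, 371 }
#define NUM_CHARS 11

/* draw_bitmap() and show_image() are not shown on the slide; these are
   assumed, minimal versions modelled on the FreeType tutorial (example1.c) */
static unsigned char image[HEIGHT][WIDTH];

static void draw_bitmap(FT_Bitmap *bitmap, FT_Int x, FT_Int y)
{
    for (FT_Int q = 0; q < (FT_Int)bitmap->rows; q++) {
        for (FT_Int p = 0; p < (FT_Int)bitmap->width; p++) {
            FT_Int i = x + p, j = y + q;
            if (i < 0 || j < 0 || i >= WIDTH || j >= HEIGHT) continue;
            image[j][i] |= bitmap->buffer[q * bitmap->pitch + p];
        }
    }
}

static void show_image(void)
{
    for (int j = 0; j < HEIGHT; j++) {
        for (int i = 0; i < WIDTH; i++)
            putchar(image[j][i] == 0 ? ' ' : image[j][i] < 128 ? '+' : '*');
        putchar('\n');
    }
}

int main(void)
{
    FT_Library library;
    FT_Face face;
    FT_UInt glyph_ids[] = GLYPH_IDS;

    if (FT_Init_FreeType(&library)) return 1;
    if (FT_New_Face(library, "Fonts/" FILENAME, 0, &face)) return 1;
    FT_Set_Char_Size(face, 36 * 64, 12 * 64, 0, 0);

    FT_Vector pen;
    pen.x = 0;
    pen.y = (HEIGHT - 12) * 64;
    FT_GlyphSlot slot = face->glyph;
    for (int n = 0; n < NUM_CHARS; n++) {
        /* Translate the next glyph to the current pen position */
        FT_Set_Transform(face, NULL, &pen);
        /* Load by glyph ID (the output of shaping), not by character code */
        FT_Error error = FT_Load_Glyph(face, glyph_ids[n], FT_LOAD_RENDER);
        if (error) continue;
        draw_bitmap(&slot->bitmap, slot->bitmap_left, HEIGHT - slot->bitmap_top);
        pen.x += slot->advance.x;
        pen.y += slot->advance.y;
    }
    show_image();

    FT_Done_Face(face);
    FT_Done_FreeType(library);
    return 0;
}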
25
Encoding and Displaying - FreeType (Rendering)
\u0F68\u0F7C\u0F7E\u0F58\u0F4E\u0F72\u0F54\u0F51\u0FA8\u0F7A\u0F67\u0F71\u0F74\u0F83
{350, 462, 334, 324, 354, 330, 874, 403, 362, 859, 371}
[Figure: rendered bitmap of the text]
26
Encoding and Displaying - Other Examples
€
#define TEXT "\u20AC"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "times.ttf"
1 code point (3 bytes)
1 glyph, ID = 188
€
#define TEXT "\u20AC"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "himalaya.ttf"
1 code point (3 bytes)
1 glyph, ID = 201
27
Encoding and Displaying - Other Examples
€
\u20AC ➝ { 201 }
28
Encoding and Displaying - Other Examples
👩‍🦰
#define TEXT "\U0001F469\u200D\U0001F9B0"
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
3 code points (11 bytes)
1 glyph, ID = 10849
#define TEXT "\U0001F469\U0001F3FE\u200D\u2…
#define SCRIPT HB_SCRIPT_LATIN
#define LANGUAGE hb_language_from_string("en", -1)
#define FONT "seguiemj.ttf"
Several code points (35 bytes)
● Glyph ID = 7452 *
● Glyph ID = 7466
● Glyph ID = 3
● Glyph ID = 7461
29
30
Encoding and Displaying - Other Examples
👩 (woman; U+1F469)
+ 🌾 (sheaf of rice; U+1F33E)
= 👩‍🌾 (woman farmer; U+1F469 U+200D U+1F33E)*
👨 (man; U+1F468)
+ 🏭 (factory; U+1F3ED)
= 👨‍🏭 (man factory worker; U+1F468 U+200D U+1F3ED)
👩 (woman; U+1F469)
+ ✈ (plane; U+2708)
= 👩‍✈️ (woman pilot; U+1F469 U+200D U+2708 U+FE0F)
👩 (woman; U+1F469)
+ 🚀 (rocket; U+1F680)
= 👩‍🚀
* The + is actually U+200D, the zero-width joiner
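The "+" in these examples can be sketched as code: an emoji ZWJ sequence is simply the component code points with U+200D between them, and it is the font that decides whether the sequence gets a single glyph. The helper below (`zwj_join` is a hypothetical name for this example) builds such a sequence of code points.

```c
#include <stddef.h>
#include <stdint.h>

#define ZWJ 0x200Du /* ZERO WIDTH JOINER */

/* Writes parts[0] ZWJ parts[1] ZWJ ... parts[n-1] into out.
   Returns the number of code points written, or 0 if there is
   nothing to join or out is too small. */
size_t zwj_join(const uint32_t *parts, size_t part_count,
                uint32_t *out, size_t out_capacity)
{
    if (part_count == 0 || 2 * part_count - 1 > out_capacity) {
        return 0;
    }
    size_t written = 0;
    for (size_t i = 0; i < part_count; i++) {
        if (i > 0) {
            out[written++] = ZWJ;
        }
        out[written++] = parts[i];
    }
    return written;
}
```

Joining U+1F469 (woman) and U+1F33E (sheaf of rice) yields the three code points of the woman farmer. A font without a glyph for the sequence typically falls back to showing the components side by side.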
31
New Question - Answer
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
● Complex interactions between scripts, languages, displaying
● Complex, multi-tier process between shaping and rendering
● Add one level of indirection between
○ Abstract characters
○ Glyphs
● Share the responsibility
○ Unicode, fonts, and other pieces of software
32
Discussions
Some numbers (as of version 14.0.0 of Sep. 14th, 2021)
● Total possible number of code points
○ 17 planes × 65,536 characters − 2,048 surrogates − 66 non-characters = 1,111,998
● Current number of used code points
○ 144,697
● Number of scripts
○ 159
■ Not including Cirth, Tengwar, and Klingon, which have ISO 15924 codes but are not encoded in Unicode
● Number of emojis
○ 3,633 (not “unique”)
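The total above can be checked by brute force. This sketch walks the whole code space U+0000..U+10FFFF and skips the surrogates (U+D800..U+DFFF) and the noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane); the function names are chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2,048 code points reserved for UTF-16 surrogate pairs */
static bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* 66 noncharacters: 32 in U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in each plane */
static bool is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Counts the code points that could ever be assigned to characters. */
uint32_t count_assignable_code_points(void)
{
    uint32_t count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (!is_surrogate(cp) && !is_noncharacter(cp)) {
            count++;
        }
    }
    return count;
}
```

The loop visits 17 × 65,536 = 1,114,112 code points and keeps 1,114,112 − 2,048 − 66 = 1,111,998 of them.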
33
Discussions
Left-to-Right, Right-to-Left…
● Glyphs with “heads”
○ Egyptian hieroglyphs: and
● Both directions
○ Chinese, Japanese, and Korean: 海南航空 and 空航南海
● Mirrored glyphs every other line (boustrophedon)
○ Ancient Greek and old Hungarian:
● Reversed writing direction every second line
○ Moon’s type:
34
Discussions
Unicode code points encoding
● UTF-8 with a variable number of bytes per code point
● UTF-16 and UTF-32
Unicode and glyphs manipulations
● Counting
● Sorting
● …
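A minimal sketch of the UTF-8 scheme, and of why "counting" depends on what is counted. The two function names are assumptions for this example; the counter assumes well-formed, NUL-terminated UTF-8 and counts code points, not bytes and not what a reader perceives as one character.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for a surrogate or
   a value beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                            /* surrogates are not encodable */
    }
    if (cp <= 0xFFFF) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Counts code points by skipping continuation bytes (10xxxxxx). */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the euro sign, U+20AC encodes to the 3 bytes E2 82 AC. For the sequence U+1F469 U+200D U+1F9B0, the byte length is 11, the code point count is 3, and a reader sees one emoji.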
35
Discussions
Unicode encodes abstract characters, not glyphs
● \u0041 ➝ A called LATIN CAPITAL LETTER A
● \u0410 ➝ А called CYRILLIC CAPITAL LETTER A
● \u0391 ➝ Α called GREEK CAPITAL LETTER ALPHA
IDN Homograph Attack
● wikipedia.org ≠ wikipеdiа.org
○ Using a tool like https://unicode-table.com/en/
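A very rough way to flag such look-alike domain labels is to check whether a label mixes scripts. The sketch below is a heuristic for illustration only, not a real IDN policy: the function names are hypothetical, and it only knows basic Latin letters and the main Cyrillic block (U+0400..U+04FF).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_basic_latin_letter(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

static bool is_cyrillic(uint32_t cp)
{
    return cp >= 0x0400 && cp <= 0x04FF;
}

/* Returns true when a label (as code points) contains both Latin
   and Cyrillic letters, a common sign of a homograph attack. */
bool mixes_latin_and_cyrillic(const uint32_t *label, size_t length)
{
    bool has_latin = false;
    bool has_cyrillic = false;
    for (size_t i = 0; i < length; i++) {
        has_latin = has_latin || is_basic_latin_letter(label[i]);
        has_cyrillic = has_cyrillic || is_cyrillic(label[i]);
    }
    return has_latin && has_cyrillic;
}
```

The spoofed label above, with U+0435 (е) and U+0430 (а) in place of the Latin letters, is flagged, while the all-Latin label is not. A label written entirely in Cyrillic look-alikes would pass this check, which is one reason real defences are more elaborate.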
36
Conclusion
How is it possible for a software system to render orthographic characters
into glyphs, i.e., to render characters into (combined) glyphs?
● Complex, multi-tier process
● Uses one level of indirection!
● Needs dedicated pieces of software
○ Shaping
○ Rendering
○ Processing
■ Sorting
■ Counting
■ Casing
■ …
37
Thanks
● Bobh and Drowe at SIL
● Behdad Esfahbod, Khaled Hosny, David L. Rowe at HarfBuzz
38
References
● Bear plus snowflake equals polar bear (andysalerno.com)
● Understanding characters, keystrokes, codepoints and glyphs (sil.org)
● Characters, glyphs, code-points, and byte-sequences (tcl-lang.org)
● Why does Unicode have separate codepoints for characters with identical glyphs? (stackexchange.org)
● What's the difference between a character, a code point, a glyph and a grapheme? (stackoverflow.org)
● Know : Languages List and their Writing direction (propelsteps.wordpress.com)
● Item of the Month, June 2010: William Moon and Moon Type (sabrinamessenger.blogspot.com)
39

An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Implementation
