Unicode (and Python)
1. Unicode (and Python)
Juan Manuel Gimeno Illa
jmgimeno@diei.udl.cat
November 2008
J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 1 / 21
2. Outline
1 Before Unicode
2 Unicode
Unicode Concepts
Encodings
3 Python’s Unicode Support
Unicode String Type
Source Code Encoding
4 Bibliography
3. Before Unicode
Before Unicode
In the beginning, computing was mainly centered in North America
and done in English. Characters were stored one-per-byte by using
either
ASCII (7 bits)
EBCDIC (8 bits)
In other parts of the world, other ways of storing the local characters
were invented
Japan: various flavours of JIS encodings
Russia: KOI8
India: the ISCII standard
Also, there were some proprietary encodings defined by operating
system vendors
11. Before Unicode
ISO-8859-*
For the huge number of people in America, Europe, and the Middle
East who use relatively small alphabets, there was ISO-8859, which
left ASCII as ASCII (range 0 to 127)
used the range 128 through 255 for different purposes, depending on the part
Parts 1-4: different accented Latin characters (e.g. Latin-1)
Part 5: Cyrillic
Part 6: Arabic
Part 7: Greek
Part 8: Hebrew
Part 9: Turkish
Part 10: Nordic languages
But only one part could be in use at a time, so one couldn’t easily mix
Greek and Cyrillic in the same file.
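The limitation can be seen in a minimal sketch, assuming Python 3 and its built-in ISO-8859 codecs (the helper name can_encode is made up for this illustration). The same byte value maps to a different character in each part, and no single part can hold both a Greek and a Cyrillic letter:

```python
import unicodedata

# One byte from the upper half (128-255), interpreted under three ISO-8859 parts.
raw = bytes([0xE9])

for part in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    char = raw.decode(part)
    print(f"{part}: U+{ord(char):04X} {unicodedata.name(char)}")


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Greek small alpha followed by Cyrillic small zhe.
mixed = "\u03b1\u0436"
print(can_encode(mixed, "iso-8859-7"))  # the Greek part lacks the Cyrillic letter
print(can_encode(mixed, "iso-8859-5"))  # the Cyrillic part lacks the Greek letter
```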
17. Before Unicode
Houston, Houston, ...
Clearly this was a very unsatisfactory situation
ISO-2022 provided a partial solution by allowing the encoding to be
switched in the middle of a string
but it was difficult to use
so it wasn’t widespread
What was needed was a universal way to refer to all the different
characters in all the alphabets
ISO/IEC 10646
Unicode
24. Unicode Unicode Concepts
Unicode’s Solution
One character set for all scripts of the world
ASCII compatibility (even Latin-1): the first 256 code points match
Includes character metadata
Case mapping information
Character category information
Accounts for scripts using different orientations
Enables sorting and normalization support
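Some of this metadata can be inspected from a program. The following is only a sketch, assuming Python 3 and its standard unicodedata module; the sample characters are chosen for this illustration and are not from the slides:

```python
import unicodedata

# Case mapping is not always one-to-one: the German sharp s uppercases to "SS".
print("stra\u00dfe".upper())

# Character category, given as a two-letter general category code.
print(unicodedata.category("A"))        # Lu = letter, uppercase
print(unicodedata.category("\u0663"))   # Nd = decimal digit (ARABIC-INDIC DIGIT THREE)

# Orientation: the bidirectional class tells renderers the writing direction.
print(unicodedata.bidirectional("a"))       # L = left-to-right
print(unicodedata.bidirectional("\u05d0"))  # R = right-to-left (HEBREW LETTER ALEF)

# Normalization: the "fi" ligature is folded into two plain letters by NFKC.
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)
print(len(ligature), len(folded), folded)
```

Language-aware sorting (collation) is not covered by this module, so it is left out of the sketch.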
31. Unicode Unicode Concepts
Unicode’s Terminology
Grapheme: what users regard as a character
- André (the final é is a single grapheme)
Code points: the Unicode numbers that represent the string
- Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
- Andre + U+0301 (COMBINING ACUTE ACCENT)
Code units: what the implementation stores (e.g. in UTF-8)
- Andre + 0xCC 0x81 (the UTF-8 bytes of U+0301)
This can be explored in Linux using the program gucharmap
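The three levels can also be explored from code. This is a minimal sketch, assuming Python 3, where a str is a sequence of code points and encode() yields the code units as bytes; the helper names code_points and utf8_code_units are made up for this illustration:

```python
import unicodedata

precomposed = "Andr\u00e9"  # e with acute as a single code point
combining = "Andre\u0301"   # plain e followed by COMBINING ACUTE ACCENT


def code_points(text):
    """List the code points of text in U+ notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf8_code_units(text):
    """List the UTF-8 code units (bytes) of text in hexadecimal."""
    return [f"0x{b:02X}" for b in text.encode("utf-8")]


for label, s in (("precomposed", precomposed), ("combining", combining)):
    print(label, len(s), code_points(s)[4:], utf8_code_units(s)[4:])

# Both render as the same graphemes, yet they differ as code point sequences
# until they are normalized to a common form.
print(precomposed == combining)
print(unicodedata.normalize("NFC", combining) == precomposed)
```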
36. Unicode Unicode Concepts
Unicode Organization
Unicode currently defines just under 100,000 code points, but it has
space for up to 1,114,112
They are organized into 17 planes of 2^16 = 65,536 code points each,
numbered 0 to 16
Plane 0 is called the Basic Multilingual Plane (BMP) and contains
almost everything in everyday use
The characters in the BMP are laid out more or less West to East
ASCII characters from 0 to 127
Latin-1 characters from 128 to 255
Then moving East in Europe (Greek, Cyrillic)
Next the Middle East (Hebrew, Arabic)
Then the Indian subcontinent (scripts of India)
Next Southeast Asia (Thai, Lao and so on)
and ending with China, Japan and Korea
Planes 1 to 16 are sometimes called astral planes; they include exotic,
rare and historically important characters (Old Italic, Byzantine
musical symbols, etc.)
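The plane arithmetic is easy to check. The sketch below assumes Python 3.3 or later, where sys.maxunicode is the highest code point; plane_of is a helper invented for this illustration:

```python
import sys

PLANE_SIZE = 2 ** 16   # code points per plane
PLANE_COUNT = 17       # planes 0 to 16


def plane_of(char):
    """Return the plane number (0-16) of a single character."""
    return ord(char) >> 16


# Total size of the code space, and the highest code point Python accepts.
print(PLANE_COUNT * PLANE_SIZE)
print(hex(sys.maxunicode))

# Latin, Cyrillic and Han samples sit in the BMP; OLD ITALIC LETTER A does not.
for ch in ("A", "\u0416", "\u4e2d", "\U00010300"):
    print(f"U+{ord(ch):04X} -> plane {plane_of(ch)}")
```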
48. Unicode Unicode Concepts
Code Points
Each code point (“character”) gets a number and a name
The number is usually given in hexadecimal and prefixed by U+
(Note that it is not a 16-bit number, because of the astral planes!)
Unicode includes tables with useful character properties (metadata)
such as
this is a number
this is uppercase
this is punctuation
The standard also provides
a helpful picture of a reasonably typical rendition
rules for line-breaking
hyphenation
sorting
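The number, the name and some of the properties of a code point can be looked up programmatically. This is a sketch assuming Python 3 and its unicodedata module; describe is a hypothetical helper, and the sample characters are picked for this illustration:

```python
import unicodedata


def describe(char):
    """Return the U+ notation, official name and general category of char."""
    return (f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))


# A digit, an uppercase letter, a punctuation mark and an astral character.
for ch in ("7", "Z", "!", "\U00010346"):
    print(describe(ch))

# Names are unique, so they can be used to find the character again.
print(unicodedata.lookup("GOTHIC LETTER FAIHU") == "\U00010346")

# "This is a number", "this is uppercase", "this is punctuation".
print("7".isdigit(), "Z".isupper(), unicodedata.category("!").startswith("P"))
```

Note that the astral character is written with five hexadecimal digits, which is why a code point cannot be assumed to fit in 16 bits.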
60. Unicode Encodings
Encodings
Along with the code points, Unicode also defines methods for storing
them as byte sequences in a computer
There are three approaches, named UTF-8, UTF-16 and UTF-32
UTF stands for Unicode Transformation Format or UCS
Transformation Format, where UCS stands for Universal Character
Set
The characters we will use in the explanations are:
Number            Name                          Plane
U+0026 (38)       AMPERSAND                     BMP
U+0416 (1046)     CYRILLIC CAPITAL LETTER ZHE   BMP
U+4E2D (20013)    CJK UNIFIED IDEOGRAPH-4E2D    BMP
U+10346 (66374)   GOTHIC LETTER FAIHU           Astral
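As a quick preview of how the three encodings differ, the following sketch (assuming Python 3; SAMPLES and encoded_sizes are names made up for this example) prints the four sample characters with the number of bytes each one needs. The -be codec variants are used so that no byte order mark is counted.

```python
import unicodedata

# The four sample characters from the table above.
SAMPLES = "\u0026\u0416\u4e2d\U00010346"

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in SAMPLES:
    print("U+%04X" % ord(ch), unicodedata.name(ch), encoded_sizes(ch))
```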
64. Unicode Encodings
UTF-32
The simplest way to store characters: use 32 bits (4 bytes) for
each character
So we store 38, 1046, 20013 and 66374 as 32-bit integers
For Latin-1 characters it wastes a lot of space (4 bytes instead of 1)
Problems with C strings because most bytes are zero (use wchar_t)
There is more than one way to lay out a 4-byte integer in 4 bytes
(remember big-endian and little-endian?)
So if you send one of these 4-byte integers to another machine,
problems occur if the two machines use different orderings
Solutions:
Explicitness: the UTF-32BE and UTF-32LE encodings
Byte Order Mark (BOM): character U+FEFF (ZERO WIDTH
NO-BREAK SPACE) at the start of the data, and the guarantee that
U+FFFE will never be a character
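The byte-order issue and both solutions can be seen in a few lines. This is a minimal sketch assuming Python 3 and its standard codecs module. The variable names are illustrative only.

```python
import codecs

ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, code point 1046 = 0x0416

# Solution 1: name the byte order explicitly in the encoding.
big = ch.encode("utf-32-be")     # most significant byte first
little = ch.encode("utf-32-le")  # least significant byte first
print(big.hex(), little.hex())   # 00000416 16040000

# Solution 2: prefix the data with a BOM (U+FEFF) and let the decoder
# work out the byte order from it.
with_bom = codecs.BOM_UTF32_BE + big
print(with_bom.decode("utf-32") == ch)  # True
```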
73. Unicode Encodings
UTF-16
UTF-16 stores Unicode characters in 16-bit chunks
all the BMP characters appear as themselves
some trickery is needed for the astral plane ones
There are two blocks of code points in the BMP called surrogate blocks
High surrogates from U+D800 to U+DBFF
Low surrogates from U+DC00 to U+DFFF
Astral plane characters are split into two surrogates
first, 0x10000 = 2^16 is subtracted from the code point
next, the resulting 20 bits are split: the high surrogate carries the
high ten bits and the low surrogate the low ten bits
This gives 20 bits, or 2^20 characters, which exactly fits the 16 = 2^4
astral planes with 2^16 characters each
So U+10346 is represented as the 16-bit integers 0xD800 0xDF46
It also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the use of
the BOM
Nightmare in C: embedded zeros, and not always the same size as wchar_t
The most compact of the three encodings for most Asian characters
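The surrogate arithmetic above can be checked by hand. The sketch below assumes Python 3. The function name to_surrogates is made up for this example and is not part of any library.

```python
def to_surrogates(code_point):
    """Split an astral code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral-plane code point")
    offset = code_point - 0x10000    # now a 20-bit value
    high = 0xD800 + (offset >> 10)   # top ten bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom ten bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x10346)
print(hex(high), hex(low))  # 0xd800 0xdf46
```

The result agrees with Python's own codec: "\U00010346".encode("utf-16-be") yields the bytes D8 00 DF 46.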
87. Unicode Encodings
UTF-8
UTF-8 was invented by Ken Thompson on September 2, 1992, on a
placemat in a New Jersey diner with Rob Pike.
It works like this:
characters whose value is less than 128 (ASCII) are encoded as
themselves in one byte
the rest have their bits split up and dealt out into several (from
two to four) bytes as follows:
The first byte starts with as many high-order one bits as there are
bytes in the encoding, followed by a zero bit
Each of the remaining bytes begins with a single one bit followed by a
zero bit
The bits of the character are dealt out in the space left over after these
signalling bits
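The rules above are short enough to implement directly. The following is a sketch only, assuming Python 3. The name utf8_encode is hypothetical, and the function does not reject surrogate code points the way a real codec would.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 by hand (surrogates are not rejected here)."""
    if code_point < 0x80:
        return bytes([code_point])               # ASCII: one byte, as itself
    if code_point < 0x800:
        n, lead = 2, 0xC0                        # first byte 110xxxxx
    elif code_point < 0x10000:
        n, lead = 3, 0xE0                        # first byte 1110xxxx
    elif code_point <= 0x10FFFF:
        n, lead = 4, 0xF0                        # first byte 11110xxx
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))  # continuation byte 10xxxxxx
        code_point >>= 6
    # Whatever bits remain fit in the space after the lead byte's marker bits.
    return bytes([lead | code_point] + tail[::-1])

for cp in (0x26, 0x416, 0x4E2D, 0x10346):
    print("U+%04X" % cp, utf8_encode(cp).hex())
```

For the four sample characters the output matches what chr(cp).encode("utf-8") produces.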
100. Unicode Encodings
UTF-8
UTF-8 is biased in favour of Western scripts
anglophones get one byte per character
most people west of the Indus river get away with two bytes
India and points east need three bytes per character
Processing UTF-8 characters sequentially is about as efficient as in
any other encoding
But you can’t easily index into a buffer (the same is true of UTF-16); you must either
count characters
keep an array of positions
UTF-8 has no embedded zero bytes, so some C routines still work
No byte-ordering problems
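To make the indexing point concrete, here is a small sketch (assuming Python 3 and valid UTF-8 input; count_characters is a hypothetical helper). Since every continuation byte has the form 10xxxxxx, counting characters means scanning the whole buffer and skipping those bytes.

```python
def count_characters(data):
    """Count code points in UTF-8 bytes by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if byte & 0xC0 != 0x80)

text = "&\u0416\u4e2d\U00010346"
data = text.encode("utf-8")

print(len(data))               # 10 bytes: 1 + 2 + 3 + 4
print(count_characters(data))  # 4 code points
print(0 in data)               # False: no embedded zero bytes
```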
110. Python’s Unicode Support Unicode String Type
Python’s Unicode type
Python has a built-in Unicode type
Unicode string literals have the same syntax as normal ones, with a
u or U prefixing the quotes (e.g. u"This is Unicode")
Unicode literals can include the escape sequence \uXXXX to denote
code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g.
u"\u0026\u0416\u4e2d\U00010346")
Unicode characters can be named using the escape sequence
\N{name} (e.g. u"\N{Ampersand}")
unichr(i) returns a one-character Unicode string for code point i
(the inverse is ord)
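
The literal forms and functions listed above can be tried out as follows. The slides describe Python 2; this sketch assumes Python 3 instead, where str is already the Unicode type, the u prefix is accepted but optional, and chr takes the place of unichr. The variable names are illustrative only.

```python
import unicodedata

# The u prefix is still accepted in Python 3.3+ and produces an ordinary str.
s = u"This is Unicode"

# \uXXXX takes exactly four hex digits, \UXXXXXXXX exactly eight.
t = "\u0026\u0416\u4e2d\U00010346"
code_points = [f"U+{ord(ch):04X}" for ch in t]

# \N{name} looks a character up by its Unicode name.
amp = "\N{AMPERSAND}"

# chr is the Python 3 counterpart of Python 2's unichr; ord is its inverse.
zhe = chr(0x0416)
round_trip = ord(zhe)

# unicodedata.name goes the other way: from a character to its name.
zhe_name = unicodedata.name(zhe)
```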