Unicode (and Python)

                                      Juan Manuel Gimeno Illa
                                        jmgimeno@diei.udl.cat

                                          November 2008




J.M.Gimeno (jmgimeno@diei.udl.cat)             Unicode          November 2008   1 / 21
Outline

 1   Before Unicode

 2   Unicode
       Unicode Concepts
       Encodings

 3   Python’s Unicode Support
       Unicode String Type
       Source Code Encoding

 4   Bibliography



Before Unicode


Before Unicode

        In the beginning, computing was mainly centered in North America
        and done in English. Characters were stored one-per-byte by using
        either
               ASCII (7 bits)
               EBCDIC (8 bits)
        In other parts of the world, other ways of storing the local
        characters were invented
               Japan: various flavours of JIS encodings
               Russia: KOI8
               India: ISCII standard
        Also, there were some proprietary encodings defined by operating
        system vendors





ISO-8859-*
        For the huge number of people in America, Europe, and the Middle
        East who use relatively small alphabets, there was ISO-8859
               left ASCII as ASCII (range 0 to 127)
               used the range 128 through 255 for different purposes
              1-4 Latin alphabets with different accented characters (e.g. Latin-1)
                 5 Cyrillic
                 6 Arabic
                 7 Greek
                 8 Hebrew
                 9 Turkish
               10 Nordic languages

        But you could only use one part at a time, so one couldn’t easily mix
        Greek and Cyrillic in the same file.
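
        A minimal Python 3 sketch of this limitation (the byte value 0xE1
        and the variable names are chosen here for illustration and are not
        from the slides): the same byte stands for a different character in
        each ISO-8859 part, so a reader must know in advance which part a
        file uses.

```python
import unicodedata

# One byte, three meanings: the ISO-8859 part must be known out of band.
raw = bytes([0xE1])

as_latin1 = raw.decode("iso-8859-1")    # read the byte as Latin-1
as_cyrillic = raw.decode("iso-8859-5")  # read the same byte as Cyrillic
as_greek = raw.decode("iso-8859-7")     # read the same byte as Greek

for part, char in [("8859-1", as_latin1),
                   ("8859-5", as_cyrillic),
                   ("8859-7", as_greek)]:
    # Print only ASCII: the code point and its official Unicode name.
    print(part, "U+%04X" % ord(char), unicodedata.name(char))
```

        Nothing in the byte itself says which reading is intended, which is
        why mixing two parts in one file does not work.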


Houston, Houston, . . .



        Clearly this was a very unsatisfactory situation
        ISO-2022 provided a partial solution by allowing a switch between
        encodings in the middle of a string
               it was difficult to use
               so it wasn’t widespread
        What was needed was a universal way to refer to all the different
        characters in all the alphabets
               ISO/IEC 10646
               Unicode




Unicode   Unicode Concepts


Unicode’s Solution



        One encoding for all scripts of the world
        ASCII compatibility (even Latin-1)
        Includes character metadata
               Case mapping information
               Character category information
        Accounts for scripts using different orientations
        Enables sorting and normalization support
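
        Some of this metadata can be inspected from Python. The following is
        a sketch only, assuming Python 3 and the standard unicodedata module;
        the sample characters and variable names are illustrative choices.

```python
import unicodedata

# Case mapping information: str methods consult the Unicode case tables.
upper_e_acute = "\u00e9".upper()   # U+00E9 maps to U+00C9

# Character category information: two-letter general category codes.
cat_upper = unicodedata.category("A")   # "Lu" = Letter, uppercase
cat_digit = unicodedata.category("7")   # "Nd" = Number, decimal digit

# Orientation: bidirectional class, "L" = left-to-right, "R" = right-to-left.
bidi_latin = unicodedata.bidirectional("A")
bidi_hebrew = unicodedata.bidirectional("\u05d0")   # HEBREW LETTER ALEF

print("U+%04X" % ord(upper_e_acute), cat_upper, cat_digit, bidi_latin, bidi_hebrew)
```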






Unicode’s Terminology



    Grapheme This is what users regard as a character
             - André
  Code points This is a Unicode encoding of the string
              - Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
              - Andre + U+0301 (COMBINING ACUTE ACCENT)
   Code Units This is what the implementation stores (e.g. UTF-8)
              - Andre + 0xCC 0x81
 This can be explored in Linux using the program gucharmap
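
 The three levels can also be told apart in code. A minimal sketch, assuming
 Python 3 (where len() on a str counts code points) and using illustrative
 variable names:

```python
import unicodedata

composed = "Andr\u00e9"       # precomposed e-acute, U+00E9
decomposed = "Andre\u0301"    # plain "e" followed by U+0301

# Code points: the strings render alike but differ in length.
n_composed = len(composed)
n_decomposed = len(decomposed)

# Code units: here the UTF-8 bytes an implementation would store.
utf8_composed = composed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")

# Normalization (NFC) maps the decomposed form onto the composed one.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(n_composed, n_decomposed, utf8_composed, utf8_decomposed, same_after_nfc)
```

 Both strings are one user-perceived word of five graphemes, yet they compare
 unequal until they are normalized to the same form.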






Unicode Organization
        Unicode currently defines just under 100000 code points, but it has
        space for up to 1114112
        They are organized into 17 planes of 2^16 = 65536 characters,
        numbered 0 to 16
        Plane 0 is called the Basic Multilingual Plane (BMP) and contains
        pretty well everything useful
        The characters in BMP are laid out more or less West to East
               ASCII characters from 0 to 127
               Latin-1 characters from 128 to 255
               Then moving East in Europe (Greek, Cyrillic)
               Next Middle East (Arabic, Hebrew)
               Then the Indus (scripts of India)
               Next Southeast Asia (Thai, Laotian and so on)
               and ending with China, Japan and Korea
        Planes 1 to 16 are sometimes called astral planes; they include
        exotic, rare and historically important characters (Old Italic,
        Byzantine musical symbols, etc.)
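
        The plane arithmetic can be checked with a short sketch. It assumes
        Python 3.3 or later, where sys.maxunicode is the last code point of
        the code space; plane_of and the constants are hypothetical names
        introduced for this example.

```python
import sys

PLANE_SIZE = 2 ** 16                    # code points per plane
PLANE_COUNT = 17                        # planes 0 to 16
CODE_SPACE = PLANE_COUNT * PLANE_SIZE   # total size of the code space


def plane_of(char):
    """Return the plane number (0 to 16) of a single character."""
    return ord(char) >> 16              # same as ord(char) // PLANE_SIZE


bmp_plane = plane_of("\u03b1")          # Greek alpha lives in the BMP
astral_plane = plane_of("\U00010300")   # an Old Italic letter, plane 1
last_plane = plane_of(chr(sys.maxunicode))

print(CODE_SPACE, bmp_plane, astral_plane, last_plane)
```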


Code Points

        Each code point (“character”) gets a number and a name
        The number is usually given in hexadecimal and prefixed by U+
        (note that it is not a 16-bit number, because of the astral planes!)
        Unicode includes tables with useful character properties (metadata)
        such as
               this is a number
               this is uppercase
               this is punctuation
        The standard also provides
               a helpful picture of a reasonably typical rendition
               rules for line-breaking
               hyphenation
               sorting
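
The number, name and properties described above can be inspected from Python through the standard `unicodedata` module. The following is a small sketch written for Python 3 (where every `str` is a Unicode string); the helper `describe` is ours and not part of any library:

```python
import unicodedata

def describe(ch):
    """Return (U+ notation, official name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

for ch in ["&", "\u0416", "5"]:
    print(describe(ch))

# Category codes: Po = punctuation, Lu = uppercase letter, Nd = decimal digit
print(unicodedata.digit("5"))   # numeric value attached to the character
```

The general category is only one of the properties in the Unicode tables; `unicodedata` exposes several others, such as `unicodedata.bidirectional` and `unicodedata.decomposition`.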




Encodings

        Along with the code points, Unicode also defines methods for storing
        them in byte sequences in a computer
        There are three approaches named UTF-8, UTF-16 and UTF-32
        UTF stands for Unicode Transformation Format or UCS
        Transformation Format, where UCS stands for Universal Character
        Set
        The characters we will use in the explanations are:
          Number            Name                           Plane
          U+0026  (38)      AMPERSAND                      BMP
          U+0416  (1046)    CYRILLIC CAPITAL LETTER ZHE    BMP
          U+4E2D  (20013)   CJK UNIFIED IDEOGRAPH-4E2D     BMP
          U+10346 (66374)   GOTHIC LETTER FAIHU            Astral
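
To get a first feeling for how the three encodings differ, one can ask Python how many bytes each sample character needs. This is a sketch in Python 3; the helper `encoded_sizes` is ours, and the explicit big-endian codec names are chosen only so that no byte order mark is added to the count:

```python
# The four sample characters from the table above.
samples = ["\u0026", "\u0416", "\u4e2d", "\U00010346"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 always uses 4 bytes, UTF-16 uses 2 or 4, and UTF-8 uses between 1 and 4 depending on the code point.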



UTF-32
        The simplest way to store characters: you use 32 bits (4 bytes) to
        store each character
        So we store 38, 1046, 20013 and 66374 as 32-bit integers
        For Latin-1 characters it wastes too much space
        Problems with C strings because most bytes are zero (use wchar_t)
        There are several ways of ordering the 4 bytes of a 32-bit integer
        (remember big-endian and little-endian?)
        So if you send one of these 4-byte integers to another machine,
        problems occur if the two machines use different orderings
        Solutions:
         Explicitness: UTF-32BE and UTF-32LE encodings
         Byte Order Mark (BOM): character U+FEFF (ZERO WIDTH
                      NO-BREAK SPACE) and the guarantee that U+FFFE
                      will never be a character
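
Both solutions can be observed with Python's built-in codecs. The sketch below assumes Python 3; `utf-32-be` and `utf-32-le` are the explicit variants, while plain `utf-32` writes a BOM followed by the machine's native byte order, so its exact output depends on the platform:

```python
import codecs

ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, code point 1046 = 0x0416

big = ch.encode("utf-32-be")      # most significant byte first
little = ch.encode("utf-32-le")   # least significant byte first
with_bom = ch.encode("utf-32")    # BOM + native byte order

print(big, little, with_bom)

# The decoder reads the BOM to decide the byte order, then drops it.
assert (codecs.BOM_UTF32_BE + big).decode("utf-32") == ch
assert (codecs.BOM_UTF32_LE + little).decode("utf-32") == ch
```

A reader that sees the bytes of U+FEFF in the wrong order would get U+FFFE, which is guaranteed never to be a character, so the byte order is unambiguous.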

UTF-16
        UTF-16 stores Unicode characters in 16-bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are split into two 16-bit units
               first, 0x10000 = 2^16 is subtracted from the code point
               next, the resulting 20 bits are split: the high surrogate carries
               the high ten bits and the low surrogate the low ten bits
        This gives 20 bits, or 2^20 code points, which exactly fits the
        16 = 2^4 astral planes with 2^16 code points each
        So U+10346 is represented as the 16-bit integers 0xD800 0xDF46
        It also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the
        use of the BOM
        Nightmare in C: embedded zeros and not necessarily the same size as
        wchar_t
        Usually the most compact way to store East Asian (BMP) characters:
        2 bytes each, against 3 in UTF-8
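
The surrogate arithmetic described above can be sketched in a few lines of Python 3. The function name `to_surrogates` is ours, for illustration only; in practice the `utf-16` codecs do this work, and the last lines cross-check against them:

```python
def to_surrogates(code_point):
    """Split an astral code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only astral code points need surrogates")
    offset = code_point - 0x10000          # now a 20-bit number
    high = 0xD800 + (offset >> 10)         # top ten bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom ten bits
    return high, low

high, low = to_surrogates(0x10346)
print(hex(high), hex(low))                 # 0xd800 0xdf46

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert "\U00010346".encode("utf-16-be") == expected
```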
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters

UTF-8


        UTF-8 was invented by Ken Thompson on September 2, 1992, on a
        placemat in a New Jersey diner with Rob Pike.
        It works like this:
               characters whose value is less than 128 (ASCII) are encoded as
               themselves in one byte
               the rest have their bits ripped apart and dealt out into several (from
               two to four) bytes as follows:
                     The first byte has a bunch of high-order one bits telling how many
                     bytes are used to encode the character, followed by a zero bit
                     The rest of the bytes each begin with a single one bit followed by a
                     zero bit
                     The bits of the character are dealt out in the space left over after these
                     signalling bits
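
        The rules above can be sketched as a small encoder. This is a minimal
        illustration in Python 3, not how any real codec is implemented; the
        function name utf8_encode is hypothetical, and the sketch does not reject
        surrogate code points as a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point into UTF-8 bytes following the rules above."""
    if code_point < 0x80:
        return bytes([code_point])                 # ASCII: encoded as itself
    if code_point < 0x800:
        n_bytes, lead = 2, 0b11000000              # 110xxxxx
    elif code_point < 0x10000:
        n_bytes, lead = 3, 0b11100000              # 1110xxxx
    elif code_point <= 0x10FFFF:
        n_bytes, lead = 4, 0b11110000              # 11110xxx
    else:
        raise ValueError("not a Unicode code point")

    out = []
    for _ in range(n_bytes - 1):
        # Continuation byte: '10' followed by the next six low-order bits.
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Leading byte: the length marker plus whatever bits are left.
    out.append(lead | code_point)
    return bytes(reversed(out))


for cp in (0x26, 0x416, 0x4E2D, 0x10346):
    encoded = utf8_encode(cp)
    # Compare with the built-in codec to check the sketch.
    assert encoded == chr(cp).encode("utf-8")
    print("U+%04X" % cp, encoded.hex())
```

        The bytes are built from the last one backwards because the low-order bits
        of the character end up in the final byte.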




UTF-8
      The following table summarizes the rules:

   Hex range         Binary                         UTF-8
   000000–00007F     0zzzzzzz                       0zzzzzzz
   000080–0007FF     00000yyy yyzzzzzz              110yyyyy 10zzzzzz
   000800–00FFFF     xxxxyyyy yyzzzzzz              1110xxxx 10yyyyyy 10zzzzzz
   010000–10FFFF     000wwwxx xxxxyyyy yyzzzzzz     11110www 10xxxxxx 10yyyyyy 10zzzzzz


      Our examples result in:

      Character    Binary                         UTF-8
      U+0026       00100110                       00100110
      U+0416       00000100 00010110              11010000 10010110
      U+4E2D       01001110 00101101              11100100 10111000 10101101
      U+10346      00000001 00000011 01000110     11110000 10010000 10001101 10000110


      The same UTF-8 bytes in hexadecimal:

                          Character     UTF-8 bytes (hex)
                          U+0026        0x26
                          U+0416        0xD0 0x96
                          U+4E2D        0xE4 0xB8 0xAD
                          U+10346       0xF0 0x90 0x8D 0x86
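
      The tables can be reproduced with Python's built-in UTF-8 codec. The sketch
      below uses Python 3; the helper names utf8_bits and utf8_hex are invented
      for this example.

```python
examples = ["\u0026", "\u0416", "\u4e2d", "\U00010346"]


def utf8_bits(ch):
    """UTF-8 bytes of ch as space-separated groups of eight bits."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


def utf8_hex(ch):
    """UTF-8 bytes of ch as space-separated hexadecimal values."""
    return " ".join("0x%02X" % b for b in ch.encode("utf-8"))


# One row per example character: code point, binary bytes, hex bytes.
for ch in examples:
    print("U+%04X  %-35s  %s" % (ord(ch), utf8_bits(ch), utf8_hex(ch)))
```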


UTF-8


        UTF-8 favours Latin-script text, English above all
               anglophones get one byte per character
               most people west of the Indus river get away with two bytes
               India and points east need three bytes per character
        Processing UTF-8 characters sequentially is about as efficient as in
        any other encoding
        But you can’t easily index into a buffer (the same is true of UTF-16); to reach
        the n-th character you have to either
               count characters from the start
               keep an array of character positions
        UTF-8 has no embedded zero bytes (other than U+0000 itself), so some C
        string routines still work
        No byte-ordering problems
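
        A short Python 3 sketch of these points: bytes per character for a few
        scripts, and the two ways of finding characters in a UTF-8 buffer. The
        sample characters and the helper names char_count and char_positions are
        choices made for this illustration, and "character" here means code point.

```python
samples = {
    "Latin small a": "a",
    "Cyrillic Zhe": "\u0416",
    "Devanagari Ka": "\u0915",
    "Han U+4E2D": "\u4e2d",
}
for name, ch in samples.items():
    print(name, len(ch.encode("utf-8")), "byte(s)")


def char_count(buf):
    """Count characters: every byte that is not 10xxxxxx starts a new one."""
    return sum(1 for b in buf if (b & 0b11000000) != 0b10000000)


def char_positions(buf):
    """Byte offset at which each character starts (the array of positions)."""
    return [i for i, b in enumerate(buf) if (b & 0b11000000) != 0b10000000]


buf = "a\u0416\u4e2d\U00010346".encode("utf-8")
print(len(buf))               # 10 bytes
print(char_count(buf))        # 4 characters
print(char_positions(buf))    # [0, 1, 3, 6]
print(b"\x00" in buf)         # False: no embedded zero bytes
```

        Both helpers scan the whole buffer, which is the cost of variable-width
        characters: reaching the n-th one takes linear time unless the positions
        are stored.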



Python’s Unicode Support   Unicode String Type


Python’s Unicode type


        Python has a built-in Unicode type
        Unicode string literals have the same syntax as the normal ones, with a
        u or U prefixing the quotes (e.g. u"This is Unicode")
        Unicode literals can include the escape sequence \uXXXX to denote
        code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g.
        u"\u0026\u0416\u4e2d\U00010346")
        Unicode characters can be named using the escape sequence
        \N{name} (e.g. u"\N{AMPERSAND}")
        unichr(i) returns a one-character Unicode string for code point i (the inverse is
        ord)
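
        The same escapes can be tried out as follows. Note the assumption: this
        sketch is written for Python 3, where str is already the Unicode type, the
        u prefix is optional, and chr plays the role that unichr has in the
        Python 2 described on this slide.

```python
import unicodedata

# \uXXXX takes four hex digits, \UXXXXXXXX takes eight.
s = "\u0026\u0416\u4e2d\U00010346"
print(len(s))                                   # 4
print([hex(ord(c)) for c in s])                 # ['0x26', '0x416', '0x4e2d', '0x10346']

# \N{name} selects a character by its Unicode name.
amp = "\N{AMPERSAND}"
print(amp == "&")                               # True
print(unicodedata.name(s[1]))                   # CYRILLIC CAPITAL LETTER ZHE

# chr and ord are inverses of each other.
print(chr(0x4E2D) == s[2], ord(s[2]) == 0x4E2D)  # True True

# The u prefix is accepted but changes nothing in Python 3.
print(u"This is Unicode" == "This is Unicode")   # True
```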




J.M.Gimeno (jmgimeno@diei.udl.cat)                  Unicode                    November 2008   16 / 21
Python’s Unicode Support   Unicode String Type


Python’s Unicode type


        Python has a built-in Unicode type
        Unicode string literals has the same syntax as the normal ones, with a
        u or U prefixing the quotes (e.g. uquot;This is Unicodequot;)
        Unicode literals can include the escape sequence uXXXX to denote
        character point U+XXXX and UXXXXXXXX for U+XXXXXXXX (e.g.
        uquot;u0026u0416u4e2dU00010346quot;)
        Unicode characters can be named using the escape sequence
        N{name} (e.g. uquot;N{Ampersand}quot;)
        unichr(i) returns a Unicode String with character i (the inverse is
        ord)




J.M.Gimeno (jmgimeno@diei.udl.cat)                  Unicode                    November 2008   16 / 21
Python’s Unicode Support   Unicode String Type


Python’s Unicode type


        Python has a built-in Unicode type
        Unicode string literals has the same syntax as the normal ones, with a
        u or U prefixing the quotes (e.g. uquot;This is Unicodequot;)
        Unicode literals can include the escape sequence uXXXX to denote
        character point U+XXXX and UXXXXXXXX for U+XXXXXXXX (e.g.
        uquot;u0026u0416u4e2dU00010346quot;)
        Unicode characters can be named using the escape sequence
        N{name} (e.g. uquot;N{Ampersand}quot;)
        unichr(i) returns a Unicode String with character i (the inverse is
        ord)




J.M.Gimeno (jmgimeno@diei.udl.cat)                  Unicode                    November 2008   16 / 21
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)
Unicode (and Python)

More Related Content

Viewers also liked

Multimedia file formats
Multimedia file formatsMultimedia file formats
Multimedia file formatsShruti Garg
 
Hypertext,hypermedia and multimedia
Hypertext,hypermedia and multimediaHypertext,hypermedia and multimedia
Hypertext,hypermedia and multimediagaflores2
 
Hypertext, hypermedia and multimedia
Hypertext, hypermedia and multimediaHypertext, hypermedia and multimedia
Hypertext, hypermedia and multimediafernandadavalos2566
 
multimedia data and file format
multimedia data and file formatmultimedia data and file format
multimedia data and file formatALOK SAHNI
 
MultiMedia dbms
MultiMedia dbmsMultiMedia dbms
MultiMedia dbmsTech_MX
 
UNIX Operating System
UNIX Operating SystemUNIX Operating System
UNIX Operating SystemUnless Yuriko
 
Multimedia data and file format
Multimedia data and file formatMultimedia data and file format
Multimedia data and file formatNiketa Jain
 
Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Karan Panjwani
 
File formats and its types
File formats and its typesFile formats and its types
File formats and its typesAnu Garg
 
Pulse modulation
Pulse modulationPulse modulation
Pulse modulationstk_gpg
 
Chapter 2 : TEXT
Chapter 2 : TEXTChapter 2 : TEXT
Chapter 2 : TEXTazira96
 
Text-Elements of multimedia
Text-Elements of multimediaText-Elements of multimedia
Text-Elements of multimediaVanitha Chandru
 
Mobile Operating System
Mobile Operating SystemMobile Operating System
Mobile Operating SystemSonal Poddar
 

Unicode (and Python)

  • 1. Unicode (and Python) Juan Manuel Gimeno Illa jmgimeno@diei.udl.cat November 2008 J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 1 / 21
  • 2. Outline
        1 Before Unicode
        2 Unicode
            Unicode Concepts
            Encodings
        3 Python’s Unicode Support
            Unicode String Type
            Source Code Encoding
        4 Bibliography
  • 3. Before Unicode
        In the beginning, computing was mainly centered in North America and done in English. Characters were stored one per byte, using either
            ASCII (7 bits)
            EBCDIC (8 bits)
        In other parts of the world, different ways of storing characters were invented
            Japan: various flavours of JIS encodings
            Russia: KOI8
            India: ISCII standard
        There were also some proprietary encodings defined by operating system vendors
  • 11. Before Unicode: ISO-8859-*
        For the huge number of people in America, Europe, and the Middle East who use relatively small alphabets, there was ISO-8859, which
            left ASCII as ASCII (range 0 to 127)
            used the range 128 through 255 for different purposes, depending on the part
                1-4 Different accented characters (e.g. latin-1)
                5 Cyrillic
                6 Arabic
                7 Greek
                8 Hebrew
                9 Turkish
                10 Nordic languages
        But you could only use one part at a time, so one couldn’t easily mix Greek and Cyrillic in the same file.
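The one-part-at-a-time limitation can be seen with a small sketch. It assumes Python 3 and the codec names shipped in the standard library (`iso8859_1`, `iso8859_5`, `iso8859_7`); the variable names are only illustrative. The same byte from the upper half decodes to a different character in each part, and a string mixing Greek and Cyrillic does not fit in a single part.

```python
# One byte value from the range 128-255, interpreted by three ISO-8859 parts.
raw = bytes([0xE1])

for codec in ("iso8859_1", "iso8859_5", "iso8859_7"):
    print(codec, repr(raw.decode(codec)))

latin = raw.decode("iso8859_1")     # LATIN SMALL LETTER A WITH ACUTE
cyrillic = raw.decode("iso8859_5")  # CYRILLIC SMALL LETTER ES
greek = raw.decode("iso8859_7")     # GREEK SMALL LETTER ALPHA

# A text mixing Greek and Cyrillic cannot be stored with the Greek part alone.
mixed = greek + cyrillic
try:
    mixed.encode("iso8859_7")
    fits_in_one_part = True
except UnicodeEncodeError:
    fits_in_one_part = False

print("mixed text fits in ISO-8859-7:", fits_in_one_part)
```

The byte itself carries no hint of which part was used, so the reader of a file has to know the encoding from some other source.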
  • 17. Before Unicode: Houston, Houston, ...
        Clearly this was a very unsatisfactory situation
        ISO-2022 provided a partial solution
            it allowed shifting between encodings in the middle of a string
            it was difficult to use, so it wasn’t widespread
        What was needed was a universal way to refer to all the different characters in all the alphabets
            ISO/IEC 10646
            Unicode
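As a rough illustration of shifting encodings mid-string, the sketch below uses ISO-2022-JP, a Japanese profile of ISO-2022 that happens to ship with Python 3 as the `iso2022_jp` codec. The slides do not mention this profile, so it is an assumption chosen only because it is readily available. The encoder inserts escape sequences (starting with the ESC byte, 0x1B) whenever the text moves between ASCII and Japanese characters.

```python
# 'A', HIRAGANA LETTER A, 'B': the text switches character sets twice.
text = "A\u3042B"

encoded = text.encode("iso2022_jp")
print(encoded)

# Escape sequences mark each switch between character sets.
has_escape = b"\x1b" in encoded

# Decoding must track the current shift state to recover the text.
round_trip = encoded.decode("iso2022_jp")
print(round_trip == text)
```

Because the meaning of a byte depends on the escape sequences seen before it, operations such as slicing or searching the encoded bytes need extra care, which is one reason such stateful encodings were awkward to use.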
  • 24. Unicode, Unicode Concepts: Unicode’s Solution
        One encoding for all scripts of the world
        ASCII compatibility (even Latin-1)
        Includes character metadata
            Case mapping information
            Character category information
        Accounts for scripts using different orientations
        Enables sorting and normalization support
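The properties listed above can be inspected from Python. This is a minimal sketch assuming Python 3 and its standard `unicodedata` module; the sample characters and variable names are illustrative choices, not taken from the slides.

```python
import unicodedata

e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
hebrew_alef = "\u05d0"  # HEBREW LETTER ALEF

# ASCII and Latin-1 compatibility: the first 256 code points match Latin-1.
latin1_compatible = all(
    bytes([i]).decode("latin-1") == chr(i) for i in range(256)
)

# Character metadata: name, general category and case mapping.
name = unicodedata.name(e_acute)
category_upper = unicodedata.category("A")      # uppercase letter
category_lower = unicodedata.category(e_acute)  # lowercase letter
upper = e_acute.upper()

# Orientation: bidirectional class (left-to-right vs right-to-left).
direction_latin = unicodedata.bidirectional("A")
direction_hebrew = unicodedata.bidirectional(hebrew_alef)

# Normalization: canonical decomposition into base letter + combining accent.
decomposed = unicodedata.normalize("NFD", e_acute)

print(name, category_upper, category_lower, direction_latin, direction_hebrew)
```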
  • 31. Unicode, Unicode Concepts: Unicode’s Terminology
        Grapheme: what users regard as a character
            André
        Code points: a Unicode encoding of the string
            Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
            Andre + U+0301 (COMBINING ACUTE ACCENT)
        Code units: what the implementation stores (e.g. UTF-8)
            Andre + 0xCC 0x81
        This can be explored in Linux using the program gucharmap
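The three levels can be told apart in a short sketch, assuming Python 3 (where `str` is a sequence of code points) and the standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

composed = "Andr\u00e9"     # uses U+00E9, a single precomposed code point
decomposed = "Andre\u0301"  # uses 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Same graphemes for the user, but different code point sequences.
print(len(composed), len(decomposed))  # number of code points in each
same_code_points = composed == decomposed

# Normalization converts between the two spellings.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Code units: what UTF-8 actually stores, one byte per unit.
utf8_composed = composed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
print(utf8_composed, utf8_decomposed)
```

Comparing strings that may come from different sources is usually safer after normalizing both to the same form, since the two spellings above are not equal code point by code point.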
  • 36. Unicode, Unicode Concepts: Unicode Organization
        Unicode currently defines just under 100000 code points, but it has space for up to 1114112
        They are organized into 17 planes of 2^16 = 65536 characters, numbered 0 to 16
        Plane 0 is called the Basic Multilingual Plane (BMP) and contains pretty well everything useful
        The characters in the BMP are laid out more or less West to East
            ASCII characters from 0 to 127
            Latin-1 characters from 128 to 255
            Then moving East in Europe (Greek, Cyrillic)
            Next the Middle East (Arabic, Hebrew)
            Then the Indus (scripts of India)
            Next Southeast Asia (Thai, Laotian and so on)
            and ending with China, Japan and Korea
        Planes 1 to 16 are sometimes called astral planes; they include exotic, rare and historically important characters (Old Italic, Byzantine musical symbols, etc.)
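The plane arithmetic can be checked with a small sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF). The helper `plane` and the sample characters are hypothetical names and choices made for illustration.

```python
import sys

PLANE_SIZE = 2 ** 16  # 65536 code points per plane
PLANE_COUNT = 17      # planes 0 to 16


def plane(ch):
    """Return the plane number of a one-character string."""
    return ord(ch) >> 16


# Total size of the code space: 17 * 65536 = 1114112.
code_space = PLANE_COUNT * PLANE_SIZE
print(code_space, sys.maxunicode + 1)

# One sample per region of the BMP, listed in West-to-East order:
# ASCII, Latin-1, Greek, Cyrillic, Hebrew, Arabic, Devanagari, Thai, CJK.
bmp_samples = ["A", "\u00e9", "\u03b1", "\u044f", "\u05d0",
               "\u0627", "\u0905", "\u0e01", "\u4e2d"]

# Two astral samples: an Old Italic letter and a Byzantine musical symbol.
astral_samples = ["\U00010300", "\U0001d000"]

for ch in bmp_samples + astral_samples:
    print("U+%04X is in plane %d" % (ord(ch), plane(ch)))
```

Sorting `bmp_samples` by code point leaves the list unchanged, which matches the rough West-to-East layout described in the slide.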
  • 37. Unicode Unicode Concepts Unicode Organization Unicode currently defines just under 100000 code points but it has space for upto 1114112 They are organized into 17 planes of 216 = 65536 characters, numbered 0 to 16 Plane 0 is called Basic Multilingual Plane (BMP) and contains pretty well everything useful The characters in BMP are laid out more or less West to East ASCII characters from 0 to 127 Latin-1 characters from 128 to 255 Then moving East in Europe (Greek, Cyrillic) Next Middle East (Arabic, Hebrew) Then the Indus (scripts of India) Next Southeast Asia (Thai, Laotian and so on) and ending with China, Japan and Korea Planes 1 to 16 are sometimes called astral planes that include exotic, rare and historically important characters (old italic, byzantine musical symbols, etc.) J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 8 / 21
  • 48. Unicode Concepts: Code Points
      Each code point (“character”) gets a number and a name.
      The number is usually given in hexadecimal and prefixed by U+. (Note that it is not a 16-bit number, because of the astral planes!)
      Unicode includes tables of useful character properties (metadata), such as
          this is a number
          this is uppercase
          this is punctuation
      The standard also provides
          a helpful picture of a reasonably typical rendition
          rules for line-breaking, hyphenation and sorting
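Python exposes part of this metadata through the standard `unicodedata` module. The sketch below assumes Python 3; `describe` is a hypothetical helper name, and the two-letter general-category codes it reports (for example `Lu` for uppercase letter, `Nd` for decimal digit, `Po` for other punctuation) come from the Unicode Character Database.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME category=XX' for a one-character string."""
    code_point = ord(ch)
    # The second argument to name() is returned for code points without a name.
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)
    return "U+%04X %s category=%s" % (code_point, name, category)


if __name__ == "__main__":
    # Escapes are used so the example does not depend on the source encoding.
    for ch in ("\u0026", "7", "\u0416", "\U00010346"):
        print(describe(ch))
```

The properties listed above map onto these codes: "this is a number" corresponds to the `N*` categories, "this is uppercase" to `Lu`, and "this is punctuation" to the `P*` categories.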
  • 60. Encodings
      Along with the code points, Unicode also defines methods for storing them as byte sequences in a computer.
      There are three approaches, named UTF-8, UTF-16 and UTF-32.
      UTF stands for Unicode Transformation Format or UCS Transformation Format, where UCS stands for Universal Character Set.
      The characters we will use in the explanations are:
          Number             Name                           Plane
          U+0026  (38)       AMPERSAND                      BMP
          U+0416  (1046)     CYRILLIC CAPITAL LETTER ZHE    BMP
          U+4E2D  (20013)    CJK UNIFIED IDEOGRAPH-4E2D     BMP
          U+10346 (66374)    GOTHIC LETTER FAIHU            Astral
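To get a feel for how the three encodings differ, one can encode the four sample characters and compare the byte counts. This is a sketch assuming Python 3; `encoded_sizes` is a hypothetical helper, and the big-endian codec variants are chosen only so that no byte order mark is added to the output.

```python
# The four sample characters, written as escapes to avoid source-encoding issues.
SAMPLES = ["\u0026", "\u0416", "\u4E2D", "\U00010346"]

# Big-endian variants of the 16- and 32-bit codecs do not emit a BOM.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch):
    """Map each encoding name to the number of bytes needed to store ch."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for ch in SAMPLES:
        sizes = encoded_sizes(ch)
        print("U+%04X" % ord(ch), sizes)
```

UTF-32 always needs 4 bytes, UTF-16 needs 2 bytes inside the BMP and 4 outside it, and UTF-8 ranges from 1 to 4 bytes for these samples.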
  • 64. Encodings: UTF-32
      The simplest way to store characters: use 32 bits (4 bytes) for each character.
      So we store 38, 1046, 20013 and 66374 as 32-bit integers.
      For Latin-1 characters it wastes a lot of space (4 bytes per character instead of 1).
      It causes problems with C strings, because most bytes are zero (use wchar_t instead).
      There are several ways of laying out a 4-byte integer in 4 bytes (remember big-endian and little-endian?).
      So if you send one of these 4-byte integers to a machine that uses a different byte order, problems occur.
      Solutions:
          Explicitness: the UTF-32BE and UTF-32LE encodings
          Byte Order Mark (BOM): the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), together with the guarantee that U+FFFE will never be a character
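The byte-order problem and both solutions can be illustrated as follows. This is a minimal sketch assuming Python 3; `utf32_bytes` and `decode_utf32_with_bom` are hypothetical helper names, while `struct.pack` and the `codecs.BOM_UTF32_BE` / `codecs.BOM_UTF32_LE` constants are part of the standard library.

```python
import codecs
import struct


def utf32_bytes(code_point, byteorder):
    """Pack one code point as a 4-byte integer, 'big' or 'little' endian."""
    fmt = ">I" if byteorder == "big" else "<I"
    return struct.pack(fmt, code_point)


def decode_utf32_with_bom(data):
    """Decode UTF-32 data, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF32_BE):      # 00 00 FE FF
        return data[4:].decode("utf-32-be")
    if data.startswith(codecs.BOM_UTF32_LE):      # FF FE 00 00
        return data[4:].decode("utf-32-le")
    raise ValueError("no byte order mark; the byte order must be stated explicitly")


if __name__ == "__main__":
    # The same code point, U+0416, in both byte orders.
    print(utf32_bytes(0x0416, "big").hex())       # 00000416
    print(utf32_bytes(0x0416, "little").hex())    # 16040000
    # A BOM-prefixed little-endian stream decodes without being told the order.
    stream = codecs.BOM_UTF32_LE + "\u0416".encode("utf-32-le")
    print(decode_utf32_with_bom(stream) == "\u0416")
```

A reader that sees FF FE 00 00 at the start knows the stream is little-endian, because read the other way round the first two bytes would be U+FFFE, which is guaranteed never to be a character.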
  • 73. Encodings: UTF-16
      UTF-16 stores Unicode characters in 16-bit chunks:
          all the BMP characters appear as themselves
          some trickery is needed for the astral plane ones
      There are two blocks of code points in the BMP called surrogate blocks:
          High surrogates from U+D800 to U+DBFF
          Low surrogates from U+DC00 to U+DFFF
      Astral plane characters are split into two surrogates:
          first, 0x10000 = 2^16 is subtracted from the code point
          next, the resulting 20 bits are split: the high ten bits are added to 0xD800 to give the high surrogate, and the low ten bits are added to 0xDC00 to give the low surrogate
      This gives 20 bits, or 2^20 characters, which exactly fits the 16 = 2^4 astral planes with 2^16 characters each.
      So U+10346 is represented as the two 16-bit integers 0xD800 0xDF46.
      It also has byte-ordering problems, solved in the same way: UTF-16BE, UTF-16LE or use of the BOM.
      Nightmare in C: embedded zeros, and not necessarily the same size as wchar_t.
      The most efficient way to store most Asian characters (2 bytes each for those in the BMP).
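The surrogate arithmetic can be written out directly. The sketch below assumes Python 3; `to_surrogates` and `from_surrogates` are hypothetical helper names used only to illustrate the steps, since in practice the built-in `utf-16` codecs do this work.

```python
def to_surrogates(code_point):
    """Split an astral code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only astral code points need surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top ten bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom ten bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x10346)
    print("U+10346 -> 0x%04X 0x%04X" % (high, low))
    # Cross-check against Python's own big-endian UTF-16 codec.
    print("\U00010346".encode("utf-16-be").hex())
```

Both lines of output should show the same pair, D800 followed by DF46, which matches the value given for U+10346 above.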
  • 74. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 75. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 76. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 77. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 78. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 79. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 80. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
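The surrogate arithmetic described on the UTF-16 slide can be sketched in a few lines of Python 3. The function names `to_surrogates` and `from_surrogates` are made up for this illustration and are not part of any library; the last line cross-checks the result against Python's built-in `utf-16-be` codec.

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral-plane code point")
    offset = code_point - 0x10000        # subtract 2^16, leaving a 20-bit value
    high = 0xD800 + (offset >> 10)       # high ten bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # low ten bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Inverse operation: rebuild the code point from a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10346)
print(hex(high), hex(low))                      # 0xd800 0xdf46
print(hex(from_surrogates(high, low)))          # 0x10346
print("\U00010346".encode("utf-16-be").hex())   # d800df46
```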
  • 87. Unicode Encodings: UTF-8
        UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.
        It works like this:
               characters whose value is less than 128 (ASCII) are encoded as themselves in one byte
               the rest have their bits ripped apart and dealt out into several (from two to four) bytes as follows:
                      the first byte starts with a run of high-order one bits, as many as there are bytes in the encoding of the character, followed by a zero bit
                      each of the remaining bytes begins with a single one bit followed by a zero bit
                      the bits of the character are dealt out into the space left over after these signalling bits
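As a rough illustration of the rules above, here is a hand-written encoder in Python 3. It is only a sketch: `utf8_encode` is a hypothetical name, and in real code the built-in `str.encode("utf-8")` does the same job. The sketch rejects surrogate code points, which well-formed UTF-8 never contains.

```python
def utf8_encode(cp):
    """Encode a single code point as UTF-8 by dealing out its bits."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                        # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                       # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x0026, 0x0416, 0x4E2D, 0x10346):
    # The hand-written encoder should agree with the built-in codec.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print("U+%04X" % cp, utf8_encode(cp).hex())
```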
  • 94. Unicode Encodings: UTF-8
        The following table summarizes the rules:

        Hex range        Binary                        UTF-8
        000000–00007F    0zzzzzzz                      0zzzzzzz
        000080–0007FF    00000yyy yyzzzzzz             110yyyyy 10zzzzzz
        000800–00FFFF    xxxxyyyy yyzzzzzz             1110xxxx 10yyyyyy 10zzzzzz
        010000–10FFFF    000wwwxx xxxxyyyy yyzzzzzz    11110www 10xxxxxx 10yyyyyy 10zzzzzz

        Our examples result in:

        Character    Binary                        UTF-8
        U+0026       00100110                      00100110
        U+0416       00000100 00010110             11010000 10010110
        U+4E2D       01001110 00101101             11100100 10111000 10101101
        U+10346      00000001 00000011 01000110    11110000 10010000 10001101 10000110

        Using hexadecimal:

        Character    Hexadecimal
        U+0026       0x26
        U+0416       0xD0 0x96
        U+4E2D       0xE4 0xB8 0xAD
        U+10346      0xF0 0x90 0x8D 0x86
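The example rows can be reproduced with Python 3's built-in UTF-8 codec. The helper name `utf8_row` is invented for this sketch; it simply formats the encoded bytes in binary and in hexadecimal.

```python
def utf8_row(cp):
    """Return the UTF-8 bytes of a code point as (binary, hexadecimal) strings."""
    encoded = chr(cp).encode("utf-8")
    bits = " ".join(format(byte, "08b") for byte in encoded)
    hexes = " ".join("0x%02X" % byte for byte in encoded)
    return bits, hexes


for cp in (0x0026, 0x0416, 0x4E2D, 0x10346):
    bits, hexes = utf8_row(cp)
    print("U+%04X  %-35s  %s" % (cp, bits, hexes))
```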
  • 100. Unicode Encodings: UTF-8
        UTF-8 is biased in favour of Latin-script text
               anglophones get one byte per character
               most people west of the Indus river get away with two bytes
               India and points east need three bytes per character
        Processing UTF-8 characters sequentially is about as efficient as in any other encoding
        But you can't easily index into a buffer (this is the same as UTF-16); to reach the n-th character you must either
               count characters from the start, or
               keep an array of positions
        UTF-8 has no embedded zero bytes (apart from U+0000 itself), so some C routines work
        No byte-ordering problems
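A small Python 3 sketch of these properties follows. The sample strings and the helper `count_code_points` are chosen for this illustration only; the helper assumes well-formed UTF-8 and counts characters by skipping continuation bytes (those of the form 10xxxxxx), which is the linear scan that indexing into a UTF-8 buffer requires.

```python
def count_code_points(data):
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


samples = [
    ("Latin", "hello"),
    ("Cyrillic", "\u043f\u0440\u0438\u0432\u0435\u0442"),
    ("Devanagari", "\u0928\u092e"),
    ("CJK", "\u4e2d\u6587"),
]

for script, text in samples:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    # Bytes per character: 1 for ASCII, 2 for Cyrillic, 3 for Devanagari and CJK.
    print(script, len(text), "characters,",
          len(utf8), "bytes in UTF-8,", len(utf16), "bytes in UTF-16")
    assert count_code_points(utf8) == len(text)

# ASCII text has no zero bytes in UTF-8, but it does in UTF-16.
print(b"\x00" in "hello".encode("utf-8"))       # False
print(b"\x00" in "hello".encode("utf-16-le"))   # True
```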
  • 110. Python's Unicode Support: Unicode String Type
        Python's Unicode type
               Python has a built-in Unicode type
               Unicode string literals have the same syntax as normal ones, with a u or U prefixing the quotes (e.g. u"This is Unicode")
               Unicode literals can include the escape sequence \uXXXX to denote code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g. u"\u0026\u0416\u4e2d\U00010346")
               Unicode characters can be named using the escape sequence \N{name} (e.g. u"\N{Ampersand}")
               unichr(i) returns a Unicode string with character i (the inverse is ord)
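The slide describes Python 2, where `unicode` is a separate type and `unichr` exists. The sketch below uses Python 3 instead, as an assumption of this example: there every `str` is a Unicode string, the `u` prefix is optional, and `chr` takes the place of `unichr`. The escape sequences are the same ones named on the slide.

```python
# The four example characters, written with \uXXXX and \UXXXXXXXX escapes.
s = "\u0026\u0416\u4e2d\U00010346"

# A character referred to by its Unicode name.
amp = "\N{AMPERSAND}"

print(len(s))                          # 4 (Python 3.3 or later)
print(amp == "&")                      # True
print([hex(ord(ch)) for ch in s])      # ['0x26', '0x416', '0x4e2d', '0x10346']

# chr is the inverse of ord (unichr in Python 2).
assert all(chr(ord(ch)) == ch for ch in s)
```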