SlideShare a Scribd company logo
1 of 92
THE GOOD, THE BAD,
     AND THE UGLY




What Happened to Unicode and PHP 6


  Andrei Zmievski ❖ PHP Community Conference
ABOUT 1 YEAR AGO…

“Hello PHP 5.4, open for all new stuff.” — Jani
TIME OF DEATH

March 11, 11:09:37 2010 GMT
5 YEARS EARLIER…
       PHP 5.0.0
 released in July 2004
5 YEARS EARLIER…
         Firefox 1.0
released in November 2004
5 YEARS EARLIER…
             Chrome
not even a twinkle in Google’s eye
5 YEARS EARLIER…
      Unicode
    version 4.0.1
WHAT IS UNICODE?
  and why do I need it?
Unicode
    …is a computing industry
   standard for the consistent
  encoding, representation and
handling of text expressed in most
 of the world's writing systems.
Unicode

 provides a unique number
    for every character:

no matter what the platform,
no matter what the program,
no matter what the language.
UNICODE STANDARD

❖   Developed by the Unicode Consortium
❖   Covers all major living scripts
❖   Version 6.0 has 109,000+ characters
❖   Capacity for 1 million+ characters
❖   Widely supported by standards & industry
FEATURES


❖   Rich property set for every character

❖   Standard, unified encodings: UTF-8/16/32

❖   Extensive rules and documents for implementation

❖   Everything works, as long as everyone follows the rules
UNICODE != I18N



❖   Unicode simplifies development

❖   Unicode does not fix all internationalization problems
TIME FORMATS
❖   USA:    4:00  P.M.

❖   France: 16.00

❖   Japan: 1600  

❖   Don’t forget to identify the time zone
CURRENCY

❖   Symbol placement
                                     US  $12.34
❖   Symbol length (1-15)            12.345,67  €
❖   Number width                       12$34€
❖   Number precision:                   ¥123
    ‣   Spain, Japan   –0
    ‣   Mexico, Brazil – 2
    ‣   Egypt, Iraq    –3
SORTING
❖   Swedish:        z<ö
❖   German:         ö<z
❖   Dictionary:     öf < of
❖   Phonebook:      of < öf
❖   Upper-first:     A<a
❖   Lower-First:    a<A
❖   Contractions:   H < Z, but CH > CZ
❖   Expansions:     OE < Œ < OF
CLDR


❖   Hosted by Unicode Consortium

❖   Latest release: December 2010 (CLDR 1.9)

❖   516 locales, with 187 languages and 166 territories
WHY WEB NEEDS UNICODE
MOJIBAKE
MOJIBAKE

noun: phenomenon of incorrect, unreadable
characters shown when computer software
fails to render a text correctly according to
its associated character encoding.
MOJIBAKE
MOJIBAKE
I UNICODE,
YOU UNICODE
I | UNICODE,
YOU | UNICODE
Helgi
Helgi
Helgi
  Þormar
Þorbjörnsson
ISLTHORP
         or

Mr. SECURITY OVERRIDE
Joel
Joël
Joël
Joël
Joël
WEAKEST LINK
PRINCIPLE APPLIES
WHY PHP NEEDS UNICODE
PHP


❖   Essential Web platform

❖   Since Web needs Unicode…

❖   …so does PHP

❖   Do not want to be the weakest link
THE PROJECT
THE PROJECT


❖   Launched in February 2005 by me at Yahoo

❖   Small group from Yahoo, Zend, and PHP development
    community

❖   Design before code
UNICODE SUPPORT


❖   Everywhere:

    ‣   in the engine

    ‣   in the extensions

    ‣   in the API
UNICODE SUPPORT

❖   Native and complete

    ‣   no hacks

    ‣   no mishmash of external libraries

    ‣   no missing locales

    ‣   no language bias
ICU LIBRARY
       International Components for Unicode
✓   Unicode Character Properties       ✓ Formatting: Date/Time/
✓   Unicode String Class & text          Numbers/Currency
    processing                         ✓ Cultural Calendars & Time

✓   Text transformations                 Zones
    (normalization, upper/lowercase,   ✓ (230+) Locale handling
    etc)                               ✓ Resource Bundles
✓   Text Boundary Analysis             ✓ Transliterations (50+ script
    (Character/Word/Sentence Break       pairs)
    Iterators)
                                       ✓ Complex Text Layout for Arabic,
✓   Encoding Conversions for 500+        Hebrew, Indic & Thai
    legacy encodings
                                       ✓ International Domain Names
✓   Language-sensitive collation         and Web addresses
    (sorting) and searching
                                       ✓ Java model for locale-
✓   Unicode regular expressions          hierarchical resource bundles.
✓   Thread-safe                          Multiple locales can be used at a
                                         time
THE PROJECT


❖   Development was in a separate repository

❖   Merged into PHP tree once the basics were working

❖   Initially slated for 5.x

❖   Extensive changes necessitated a major version bump
PHP 6 = PHP 5 + Unicode
PHP 5 = PHP 6 - Unicode
Unicode = PHP 6 - PHP 5
PHP 6
PHP 6


   6
STRING TYPES

❖   Unicode

    ‣   text

    ‣   default for literals, etc

❖   Binary

    ‣   bytes

    ‣   everything ∉ Unicode type
Conversions
  Dataflow                                                  streams




                                                        s c
                                                      ng ifi
                                                    di ec
                                PHP




                                                  co -sp
                                              en am
                                Unicode




                                                 re
                                              st
                                strings
                          runtime encoding
request                                                       response
          HTTP input            binary       HTTP output
           encoding             strings       encoding
                          ng




                                          fil co
                            i




                                            es d
                         od




                                             en

                                              ys in
                       nc




                                                te g
                    te




                                                  m
                rip
               sc




             scripts                         filesystem
STRINGS
❖   String literals are Unicode

❖   String offsets work on code points

        $str  =  "   ";
 
   //  2  code  points
        echo  $str[1];
 
    //  result  is     
        $str[0]  =  ' ';
 //  full  string  is  now  
IDENTIFIERS
❖   Unicode identifiers are allowed

      class                       {  
            function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ()
 
 {  ...  }
            function  !வா$  கேனச)

()    {  ...  }
            function  འ"ག་%ལ།()
 
 
         {  ...  }
      }  

      $              =  array();
      $            ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new       ;
FUNCTIONS
❖ Functions understand Unicode text and apply
  appropriate rules
❖ i.e. case manipulation



    $str  =  strtoupper("fußball");
 //  result  is  FUSSBALL

    $str  =  strtolower("ΣΕΛΛΑΣ");
   //  result  is  σελλάς  
TRANSLITERATION
$names  =  "  
    김,  국삼  
    김,  명희  
         ,         
           ,          
    Горбачев,  Михаил  
    Козырев,  Андрей  
    Καφετζόπουλος,  Θεόφιλος  
    Θεοδωράτου,  Ελένη  
";  
$r  =  strtotitle(str_transliterate($names,  "Any",  "Latin"));

Gim,  Gugsam
Gim,  Myeonghyi
Takeda,  Masayuki
Oohara,  Manabu
Gorbačev,  Mihail
Kozyrev,  Andrej
Kaphetzópoulos,  Theóphilos
Theodōrátou,  Elénē
PECL/INTL
FEATURES
❖   Locales
❖   Collation
❖   Number and Currency Formatters
❖   Date and Time Formatters
❖   Time Zones
❖   Calendars
❖   Message Formatter
❖   Choice Formatter
❖   Resource Handler
❖   Normalization
COLLATION
sorting
$strings  =  array(  
                "cote",  "côte",  "Côte",  "coté",  
                "Coté",  "côté",  "Côté",  "coter");  
$coll  =  new  Collator("fr_FR");  
$coll-­‐>sort($strings);

result
cote
côte
Côte
coté
Coté
côté
Côté
coter
NUMBER FORMATTING
                123456.789  in  en_US
❖   NumberFormatter::DECIMAL
    123456.789

❖   NumberFormatter::CURRENCY
    $123,456.79

❖   NumberFormatter::ORDINAL
    123,457th

❖   NumberFormatter::SPELLOUT
    one hundred and twenty-three thousand, four hundred
    and fifty-six point seven eight nine
MESSAGE FORMATTING

with modifiers
$pattern  =  “On  {0,date,full}  you  received
                        {1,number,#,##0.00}  emails.”;
$args  =  array(time(),  1184);  
$fmt  =  new  MessageFormatter(‘en_US’,  $pattern);
echo  $fmt-­‐>format($args);

result
On  Tuesday,  November  22,  2007  you  received  
1,184.00  emails.
POSTMORTEM
WHAT WENT RIGHT
1. RAISED AWARENESS


❖   Spoke at multiple conferences about the project

    ‣   including Unicode Conference

❖   Shoved Unicode down people’s throats at every
    opportunity
2. CHOSE THE RIGHT TECH


❖   ICU library had everything we needed

❖   Low- and high-level functionality

❖   Good support from its developers
3. UNIT TESTS


❖   Every function handling strings had to be ported

❖   Unit tests showed us where things broke

❖   Also easy to track progress
4. PECL/INTL EXTENSION


❖   A lot of i18n/l10n functionality in a self-contained
    extension

❖   Ensuring that it worked with PHP 5
5. CODE SEGREGATION


❖   Proof-of-concept developed by only a few people

❖   Faster decisions, iteration, development

❖   Things slowed down after merging into the main tree

    ‣   but was necessary to spread the workload
WHAT WENT RONG
1. CHOICE OF UTF-16
Thought to be the best compromise
UTF-8


❖   Backward-compatible with ASCII

❖   Avoids complications of endianness

❖   Dominant UTF encoding for the Web

❖   Supported in a lot of libraries, APIs, etc
UTF-8, BUT…

❖   Variable-length encoding (1-4 bytes)

❖   Uses 3 bytes for BMP code points > U+07FF

❖   Not all byte sequences are valid

❖   ICU did not have many UTF-8 APIs (at the time)

    ‣   on-the-fly conversion is necessary
UTF-32



❖   Uses exactly 4 bytes for each code point

    ‣   directly indexable!
UTF-32, BUT...

❖   Uses exactly 4 bytes for each code point

    ‣   4x the size of UTF-8 for majority of languages

❖   Only affordable by people from rich oil countries

❖   Still needs conversion to UTF-16 when using ICU

❖   Endianness
UTF-16


❖   “65,536 code points should be enough for everyone…”

❖   2 bytes to represent all of BMP (U+0 to U+FFFF)

    ‣   directly indexable in that plane

❖   Internal encoding of ICU
UTF-16, BUT…

❖   Requires surrogate pairs for code points > U+FFFF

    ‣   still variable-length

❖   2x the size of UTF-8 for Latin, Greek, Cyrillic, Armenian,
    Hebrew, Arabic and other scripts

❖   Can’t be manipulated by normal C string handling

❖   Endianness
CHOICE OF UTF-16

❖   Thought that CJK languages would benefit from UTF-16

❖   Primary driver was the ICU APIs

❖   Problems: no direct indexing, many conversions

❖   Would probably choose UTF-8, if started over

    ‣   no need for decoding/encoding on the periphery

    ‣   can be used by C-based libraries
2. CRUCIAL CODE LAGGED
2. CRUCIAL CODE LAGGED


❖   PDO (and native DB extensions)

❖   filter

❖   proper substring search (collation-based)

❖   some ext/standard functionality
3. LACK OF MINDSHARE
3. LACK OF MINDSHARE

❖   Probably <10 people who understood the intricacies of
    the Unicode and ICU

❖   In the end, implementation deemed too technically
    difficult

❖   People were bored converting large chunks of already
    working code
4. DELAYED NEW FEATURES
5. MEA CULPA
END GAME
RE-ORG

❖   PHP 6 trunk was moved to a branch

❖   PHP 5.4 became the trunk

❖   Kick-started development of new features

❖   Some clean-ups and improvements from 6 back-ported
    to 5.4
PEOPLE MATTER


❖   The project ran out of steam

    ‣   PHP development culture means that people work on
        what they’re interested in

    ‣   Clearly, the Unicode/i18n implementation wasn’t
        interesting enough to be viable
INTERNALS

“Because it’s nearly impossible to
participate on Internals if your
poo-throwing arm isn’t strong.”
                       — @coates
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
tend to be the main drivers behind PHP
development.”
                                   — me
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
        and who outlast the rest
tend to be the main drivers behind PHP
development.”
                                  — me
PROGRESS



❖   No development on the Unicode branch

❖   No visible effort to develop alternatives
FUTURE?

❖   Lighter, gentler implementations?

    ‣   mbstring is clunky

    ‣   separate Unicode String class would also be clunky

❖   Open field for someone with a great idea, persistence,
    and people skills
STEPPING AWAY


❖   Invalidation of several man-years of hard work is
    discouraging

❖   Did not feel like pushing the project up the hill again

❖   Working on more fun stuff these days
LESSONS LEARNED

❖   Rewriting large existing code base is hard

❖   Making people do tedious stuff is hard

    ‣   make it interesting for them (game-like)

❖   Waiting for results of long iterations is hard

    ‣   short, results-oriented projects (if possible)

❖   Stay committed
FINITA LA COMEDIA




http://joind.in/3349 ❖ http://zazzle.com/andreiz

More Related Content

What's hot

Introduction to W3C I18N Best Practices
Introduction to W3C I18N Best PracticesIntroduction to W3C I18N Best Practices
Introduction to W3C I18N Best PracticesGopal Venkatesan
 
Unicode - Hacking The International Character System
Unicode - Hacking The International Character SystemUnicode - Hacking The International Character System
Unicode - Hacking The International Character SystemWebsecurify
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Project Student
 
Data encryption and tokenization for international unicode
Data encryption and tokenization for international unicodeData encryption and tokenization for international unicode
Data encryption and tokenization for international unicodeUlf Mattsson
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Ulf Mattsson
 
Internationalisation And Globalisation
Internationalisation And GlobalisationInternationalisation And Globalisation
Internationalisation And GlobalisationAlan Dean
 
4 character encoding-ascii
4 character encoding-ascii4 character encoding-ascii
4 character encoding-asciiirdginfo
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character EncodingsMobisoft Infotech
 
Software Internationalization Crash Course
Software Internationalization Crash CourseSoftware Internationalization Crash Course
Software Internationalization Crash CourseWill Iverson
 
An overview of potential leaks via PDF
An overview of potential leaks via PDFAn overview of potential leaks via PDF
An overview of potential leaks via PDFAnge Albertini
 
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScript
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScriptJS Fest 2019. Ryan Dahl. Deno, a new way to JavaScript
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScriptJSFestUA
 

What's hot (20)

Unicode
UnicodeUnicode
Unicode
 
Notes on a Standard: Unicode
Notes on a Standard: UnicodeNotes on a Standard: Unicode
Notes on a Standard: Unicode
 
Introduction to W3C I18N Best Practices
Introduction to W3C I18N Best PracticesIntroduction to W3C I18N Best Practices
Introduction to W3C I18N Best Practices
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Unicode - Hacking The International Character System
Unicode - Hacking The International Character SystemUnicode - Hacking The International Character System
Unicode - Hacking The International Character System
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)
 
Ascii codes
Ascii codesAscii codes
Ascii codes
 
Data encryption and tokenization for international unicode
Data encryption and tokenization for international unicodeData encryption and tokenization for international unicode
Data encryption and tokenization for international unicode
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...
 
Internationalisation And Globalisation
Internationalisation And GlobalisationInternationalisation And Globalisation
Internationalisation And Globalisation
 
Uncdtalk
UncdtalkUncdtalk
Uncdtalk
 
4 character encoding-ascii
4 character encoding-ascii4 character encoding-ascii
4 character encoding-ascii
 
Io
IoIo
Io
 
mdc_ppt
mdc_pptmdc_ppt
mdc_ppt
 
What character is that
What character is thatWhat character is that
What character is that
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character Encodings
 
Software Internationalization Crash Course
Software Internationalization Crash CourseSoftware Internationalization Crash Course
Software Internationalization Crash Course
 
An overview of potential leaks via PDF
An overview of potential leaks via PDFAn overview of potential leaks via PDF
An overview of potential leaks via PDF
 
Ascii 03
Ascii 03Ascii 03
Ascii 03
 
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScript
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScriptJS Fest 2019. Ryan Dahl. Deno, a new way to JavaScript
JS Fest 2019. Ryan Dahl. Deno, a new way to JavaScript
 

Viewers also liked

PHP 7 – What changed internally?
PHP 7 – What changed internally?PHP 7 – What changed internally?
PHP 7 – What changed internally?Nikita Popov
 
W3 conf hill-html5-security-realities
W3 conf hill-html5-security-realitiesW3 conf hill-html5-security-realities
W3 conf hill-html5-security-realitiesBrad Hill
 
PHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look AheadPHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look Aheadthinkphp
 
Introduction to php 6
Introduction to php   6Introduction to php   6
Introduction to php 6pctechnology
 
Diapositiva reclutamiento y seleccion leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenezDiapositiva reclutamiento y seleccion leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenezLeymar jimenez
 
Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Consórcio realiza consta da relação das administradoras de consórcios autoriz...Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Consórcio realiza consta da relação das administradoras de consórcios autoriz...Jessica R.
 
Prophet's Sunnah, tagalog
Prophet's Sunnah, tagalogProphet's Sunnah, tagalog
Prophet's Sunnah, tagalogArab Muslim
 
Perfil sociodemográfico de los internautas 2013 - ONTSI
Perfil sociodemográfico de los internautas 2013 - ONTSIPerfil sociodemográfico de los internautas 2013 - ONTSI
Perfil sociodemográfico de los internautas 2013 - ONTSIAritz Pérez
 
Einführung ins eCampaigning
Einführung ins eCampaigning Einführung ins eCampaigning
Einführung ins eCampaigning more onion
 
HTML5 on Mobile
HTML5 on MobileHTML5 on Mobile
HTML5 on MobileAdam Lu
 
Vloge (funkcije) umetnosti v družbi (2) 2.del
Vloge (funkcije) umetnosti v družbi (2) 2.delVloge (funkcije) umetnosti v družbi (2) 2.del
Vloge (funkcije) umetnosti v družbi (2) 2.delsKastelic
 
First Beat Media - Rad od kuće #tnt3
First Beat Media - Rad od kuće #tnt3First Beat Media - Rad od kuće #tnt3
First Beat Media - Rad od kuće #tnt3SICEF
 

Viewers also liked (20)

PHP 7 – What changed internally?
PHP 7 – What changed internally?PHP 7 – What changed internally?
PHP 7 – What changed internally?
 
W3 conf hill-html5-security-realities
W3 conf hill-html5-security-realitiesW3 conf hill-html5-security-realities
W3 conf hill-html5-security-realities
 
Php Presentation
Php PresentationPhp Presentation
Php Presentation
 
HTML 5 Accessibility
HTML 5 AccessibilityHTML 5 Accessibility
HTML 5 Accessibility
 
PHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look AheadPHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look Ahead
 
Introduction to php 6
Introduction to php   6Introduction to php   6
Introduction to php 6
 
All The Little Pieces
All The Little PiecesAll The Little Pieces
All The Little Pieces
 
Diapositiva reclutamiento y seleccion leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenezDiapositiva reclutamiento y seleccion leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenez
 
Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Consórcio realiza consta da relação das administradoras de consórcios autoriz...Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Consórcio realiza consta da relação das administradoras de consórcios autoriz...
 
El señor de los milagros
El señor de los milagrosEl señor de los milagros
El señor de los milagros
 
Prophet's Sunnah, tagalog
Prophet's Sunnah, tagalogProphet's Sunnah, tagalog
Prophet's Sunnah, tagalog
 
Roma
RomaRoma
Roma
 
Perfil sociodemográfico de los internautas 2013 - ONTSI
Perfil sociodemográfico de los internautas 2013 - ONTSIPerfil sociodemográfico de los internautas 2013 - ONTSI
Perfil sociodemográfico de los internautas 2013 - ONTSI
 
Einführung ins eCampaigning
Einführung ins eCampaigning Einführung ins eCampaigning
Einführung ins eCampaigning
 
HTML5 on Mobile
HTML5 on MobileHTML5 on Mobile
HTML5 on Mobile
 
Gabriel
GabrielGabriel
Gabriel
 
Air mobility command amc travel contacts
Air mobility command   amc travel contactsAir mobility command   amc travel contacts
Air mobility command amc travel contacts
 
Vloge (funkcije) umetnosti v družbi (2) 2.del
Vloge (funkcije) umetnosti v družbi (2) 2.delVloge (funkcije) umetnosti v družbi (2) 2.del
Vloge (funkcije) umetnosti v družbi (2) 2.del
 
First Beat Media - Rad od kuće #tnt3
First Beat Media - Rad od kuće #tnt3First Beat Media - Rad od kuće #tnt3
First Beat Media - Rad od kuće #tnt3
 
¿Cómo funciona SIGRE?
¿Cómo funciona SIGRE?¿Cómo funciona SIGRE?
¿Cómo funciona SIGRE?
 

Similar to The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...agileware
 
Internationlization
InternationlizationInternationlization
InternationlizationTuan Ngo
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - ITguest6ddfb98
 
Exploring Challenges in Mining Historical Text
Exploring Challenges in Mining Historical Text Exploring Challenges in Mining Historical Text
Exploring Challenges in Mining Historical Text Beatrice Alex
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesMilind Patil
 
COMPUTER LANGUAGES AND THERE DIFFERENCE
COMPUTER LANGUAGES AND THERE DIFFERENCE COMPUTER LANGUAGES AND THERE DIFFERENCE
COMPUTER LANGUAGES AND THERE DIFFERENCE Pavan Kalyan
 
2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekingeProf. Wim Van Criekinge
 
Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Feihong Hsu
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languagesVarun Garg
 
Desert Code Camp 2014: C#, the best programming language
Desert Code Camp 2014: C#, the best programming languageDesert Code Camp 2014: C#, the best programming language
Desert Code Camp 2014: C#, the best programming languageJames Montemagno
 
Learning by hacking - android application hacking tutorial
Learning by hacking - android application hacking tutorialLearning by hacking - android application hacking tutorial
Learning by hacking - android application hacking tutorialLandice Fu
 
There should be a tool for that - GameQALoc Barcelona 2016
There should be a tool for that - GameQALoc Barcelona 2016There should be a tool for that - GameQALoc Barcelona 2016
There should be a tool for that - GameQALoc Barcelona 2016Adolfo Gomez-Urda
 
Affordable and efficient multi platform localisation. case study
Affordable and efficient multi platform localisation. case studyAffordable and efficient multi platform localisation. case study
Affordable and efficient multi platform localisation. case studyVolodymyr Shyrochuk
 
Type हिन्दी in Java
Type हिन्दी in JavaType हिन्दी in Java
Type हिन्दी in Javagagmansa
 
Type हिन्दी in Java
Type हिन्दी in JavaType हिन्दी in Java
Type हिन्दी in Javagagmansa
 

Similar to The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6 (20)

Unicode & PHP6
Unicode & PHP6Unicode & PHP6
Unicode & PHP6
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
 
Perl5 VS JSON
Perl5 VS JSONPerl5 VS JSON
Perl5 VS JSON
 
Internationlization
InternationlizationInternationlization
Internationlization
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
 
PHP for Grown-ups
PHP for Grown-upsPHP for Grown-ups
PHP for Grown-ups
 
Exploring Challenges in Mining Historical Text
Exploring Challenges in Mining Historical Text Exploring Challenges in Mining Historical Text
Exploring Challenges in Mining Historical Text
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
 
COMPUTER LANGUAGES AND THERE DIFFERENCE
COMPUTER LANGUAGES AND THERE DIFFERENCE COMPUTER LANGUAGES AND THERE DIFFERENCE
COMPUTER LANGUAGES AND THERE DIFFERENCE
 
2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge
 
Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)
 
P1 2017 python
P1 2017 pythonP1 2017 python
P1 2017 python
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languages
 
Desert Code Camp 2014: C#, the best programming language
Desert Code Camp 2014: C#, the best programming languageDesert Code Camp 2014: C#, the best programming language
Desert Code Camp 2014: C#, the best programming language
 
The Evolution of Good Code
The Evolution of Good CodeThe Evolution of Good Code
The Evolution of Good Code
 
Learning by hacking - android application hacking tutorial
Learning by hacking - android application hacking tutorialLearning by hacking - android application hacking tutorial
Learning by hacking - android application hacking tutorial
 
There should be a tool for that - GameQALoc Barcelona 2016
There should be a tool for that - GameQALoc Barcelona 2016There should be a tool for that - GameQALoc Barcelona 2016
There should be a tool for that - GameQALoc Barcelona 2016
 
Affordable and efficient multi platform localisation. case study
Affordable and efficient multi platform localisation. case studyAffordable and efficient multi platform localisation. case study
Affordable and efficient multi platform localisation. case study
 
Type हिन्दी in Java
Type हिन्दी in JavaType हिन्दी in Java
Type हिन्दी in Java
 
Type हिन्दी in Java
Type हिन्दी in JavaType हिन्दी in Java
Type हिन्दी in Java
 

Recently uploaded

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 

Recently uploaded (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

  • 1. THE GOOD, THE BAD, AND THE UGLY What Happened to Unicode and PHP 6 Andrei Zmievski ❖ PHP Community Conference
  • 2. ABOUT 1 YEAR AGO… “Hello PHP 5.4, open for all new stuff.” — Jani
  • 4. 5 YEARS EARLIER… PHP 5.0.0 released in July 2004
  • 5. 5 YEARS EARLIER… Firefox 1.0 released in November 2004
  • 6. 5 YEARS EARLIER… Chrome not even a twinkle in Google’s eye
  • 7. 5 YEARS EARLIER… Unicode version 4.0.1
  • 8. WHAT IS UNICODE? and why do I need it?
  • 9. Unicode …is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
  • 10. Unicode provides a unique number for every character: no matter what the platform, no matter what the program, no matter what the language.
  • 11. UNICODE STANDARD ❖ Developed by the Unicode Consortium ❖ Covers all major living scripts ❖ Version 6.0 has 109,000+ characters ❖ Capacity for 1 million+ characters ❖ Widely supported by standards & industry
  • 12. FEATURES ❖ Rich property set for every character ❖ Standard, unified encodings: UTF-8/16/32 ❖ Extensive rules and documents for implementation ❖ Everything works, as long as everyone follows the rules
  • 13. UNICODE != I18N ❖ Unicode simplifies development ❖ Unicode does not fix all internationalization problems
  • 14. TIME FORMATS ❖ USA: 4:00  P.M. ❖ France: 16.00 ❖ Japan: 1600   ❖ Don’t forget to identify the time zone
  • 15. CURRENCY ❖ Symbol placement US  $12.34 ❖ Symbol length (1-15) 12.345,67  € ❖ Number width 12$34€ ❖ Number precision: ¥123 ‣ Spain, Japan –0 ‣ Mexico, Brazil – 2 ‣ Egypt, Iraq –3
  • 16. SORTING ❖ Swedish: z<ö ❖ German: ö<z ❖ Dictionary: öf < of ❖ Phonebook: of < öf ❖ Upper-first: A<a ❖ Lower-First: a<A ❖ Contractions: H < Z, but CH > CZ ❖ Expansions: OE < Œ < OF
  • 17. CLDR ❖ Hosted by Unicode Consortium ❖ Latest release: December 2010 (CLDR 1.9) ❖ 516 locales, with 187 languages and 166 territories
  • 18. WHY WEB NEEDS UNICODE
  • 20. MOJIBAKE noun: phenomenon of incorrect, unreadable characters shown when computer software fails to render a text correctly according to its associated character encoding.
  • 24. I | UNICODE, YOU | UNICODE
  • 25. Helgi
  • 26. Helgi
  • 28. ISLTHORP or Mr. SECURITY OVERRIDE
  • 29.
  • 30. Joel
  • 31. Joël
  • 32. Joël
  • 36. WHY PHP NEEDS UNICODE
  • 37. PHP ❖ Essential Web platform ❖ Since Web needs Unicode… ❖ …so does PHP ❖ Do not want to be the weakest link
  • 39. THE PROJECT ❖ Launched in February 2005 by me at Yahoo ❖ Small group from Yahoo, Zend, and PHP development community ❖ Design before code
  • 40. UNICODE SUPPORT ❖ Everywhere: ‣ in the engine ‣ in the extensions ‣ in the API
  • 41. UNICODE SUPPORT ❖ Native and complete ‣ no hacks ‣ no mishmash of external libraries ‣ no missing locales ‣ no language bias
  • 42. ICU LIBRARY International Components for Unicode ✓ Unicode Character Properties ✓ Formatting: Date/Time/ ✓ Unicode String Class & text Numbers/Currency processing ✓ Cultural Calendars & Time ✓ Text transformations Zones (normalization, upper/lowercase, ✓ (230+) Locale handling etc) ✓ Resource Bundles ✓ Text Boundary Analysis ✓ Transliterations (50+ script (Character/Word/Sentence Break pairs) Iterators) ✓ Complex Text Layout for Arabic, ✓ Encoding Conversions for 500+ Hebrew, Indic & Thai legacy encodings ✓ International Domain Names ✓ Language-sensitive collation and Web addresses (sorting) and searching ✓ Java model for locale- ✓ Unicode regular expressions hierarchical resource bundles. ✓ Thread-safe Multiple locales can be used at a time
  • 43. THE PROJECT ❖ Development was in a separate repository ❖ Merged into PHP tree once the basics were working ❖ Initially slated for 5.x ❖ Extensive changes necessitated a major version bump
  • 44. PHP 6 = PHP 5 + Unicode
  • 45. PHP 5 = PHP 6 - Unicode
  • 46. Unicode = PHP 6 - PHP 5
  • 47. PHP 6
  • 48. PHP 6 6
  • 49. STRING TYPES ❖ Unicode ‣ text ‣ default for literals, etc ❖ Binary ‣ bytes ‣ everything ∉ Unicode type
  • 50. Conversions Dataflow streams s c ng ifi di ec PHP co -sp en am Unicode re st strings runtime encoding request response HTTP input binary HTTP output encoding strings encoding ng fil co i es d od en ys in nc te g te m rip sc scripts filesystem
  • 51. STRINGS ❖ String literals are Unicode ❖ String offsets work on code points $str  =  " "; //  2  code  points echo  $str[1]; //  result  is     $str[0]  =  ' '; //  full  string  is  now  
  • 52. IDENTIFIERS ❖ Unicode identifiers are allowed class    {        function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ() {  ...  }      function  !வா$  கேனச) ()    {  ...  }      function  འ"ག་%ལ།()        {  ...  } }   $  =  array(); $ ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new   ;
  • 53. FUNCTIONS ❖ Functions understand Unicode text and apply appropriate rules ❖ i.e. case manipulation $str  =  strtoupper("fußball"); //  result  is  FUSSBALL $str  =  strtolower("ΣΕΛΛΑΣ"); //  result  is  σελλάς  
  • 54. TRANSLITERATION $names  =  "      김,  국삼      김,  명희       ,         ,        Горбачев,  Михаил      Козырев,  Андрей      Καφετζόπουλος,  Θεόφιλος      Θεοδωράτου,  Ελένη   ";   $r  =  strtotitle(str_transliterate($names,  "Any",  "Latin")); Gim,  Gugsam Gim,  Myeonghyi Takeda,  Masayuki Oohara,  Manabu Gorbačev,  Mihail Kozyrev,  Andrej Kaphetzópoulos,  Theóphilos Theodōrátou,  Elénē
  • 56. FEATURES ❖ Locales ❖ Collation ❖ Number and Currency Formatters ❖ Date and Time Formatters ❖ Time Zones ❖ Calendars ❖ Message Formatter ❖ Choice Formatter ❖ Resource Handler ❖ Normalization
  • 57. COLLATION sorting $strings  =  array(                  "cote",  "côte",  "Côte",  "coté",                  "Coté",  "côté",  "Côté",  "coter");   $coll  =  new  Collator("fr_FR");   $coll-­‐>sort($strings); result cote côte Côte coté Coté côté Côté coter
  • 58. NUMBER FORMATTING 123456.789  in  en_US ❖ NumberFormatter::DECIMAL 123456.789 ❖ NumberFormatter::CURRENCY $123,456.79 ❖ NumberFormatter::ORDINAL 123,457th ❖ NumberFormatter::SPELLOUT one hundred and twenty-three thousand, four hundred and fifty-six point seven eight nine
  • 59. MESSAGE FORMATTING with modifiers $pattern  =  “On  {0,date,full}  you  received                        {1,number,#,##0.00}  emails.”; $args  =  array(time(),  1184);   $fmt  =  new  MessageFormatter(‘en_US’,  $pattern); echo  $fmt-­‐>format($args); result On  Tuesday,  November  22,  2007  you  received   1,184.00  emails.
  • 62. 1. RAISED AWARENESS ❖ Spoke at multiple conferences about the project ‣ including Unicode Conference ❖ Shoved Unicode down people’s throats at every opportunity
  • 63. 2. CHOSE THE RIGHT TECH ❖ ICU library had everything we needed ❖ Low- and high-level functionality ❖ Good support from its developers
  • 64. 3. UNIT TESTS ❖ Every function handling strings had to be ported ❖ Unit tests showed us where things broke ❖ Also easy to track progress
  • 65. 4. PECL/INTL EXTENSION ❖ A lot of i18n/l10n functionality in a self-contained extension ❖ Ensuring that it worked with PHP 5
  • 66. 5. CODE SEGREGATION ❖ Proof-of-concept developed by only a few people ❖ Faster decisions, iteration, development ❖ Things slowed down after merging into the main tree ‣ but was necessary to spread the workload
  • 68. 1. CHOICE OF UTF-16 Thought to be the best compromise
  • 69. UTF-8 ❖ Backward-compatible with ASCII ❖ Avoids complications of endianness ❖ Dominant UTF encoding for the Web ❖ Supported in a lot of libraries, APIs, etc
  • 70. UTF-8, BUT… ❖ Variable-length encoding (1-4 bytes) ❖ Uses 3 bytes for BMP code points > U+07FF ❖ Not all byte sequences are valid ❖ ICU did not have many UTF-8 APIs (at the time) ‣ on-the-fly conversion is necessary
  • 71. UTF-32 ❖ Uses exactly 4 bytes for each code point ‣ directly indexable!
  • 72. UTF-32, BUT... ❖ Uses exactly 4 bytes for each code point ‣ 4x the size of UTF-8 for majority of languages ❖ Only affordable by people from rich oil countries ❖ Still needs conversion to UTF-16 when using ICU ❖ Endianness
  • 73. UTF-16 ❖ “65,536 code points should be enough for everyone…” ❖ 2 bytes to represent all of BMP (U+0 to U+FFFF) ‣ directly indexable in that plane ❖ Internal encoding of ICU
  • 74. UTF-16, BUT… ❖ Requires surrogate pairs for code points > U+FFFF ‣ still variable-length ❖ 2x the size of UTF-8 for Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic and other scripts ❖ Can’t be manipulated by normal C string handling ❖ Endianness
  • 75. CHOICE OF UTF-16 ❖ Thought that CJK languages would benefit from UTF-16 ❖ Primary driver was the ICU APIs ❖ Problems: no direct indexing, many conversions ❖ Would probably choose UTF-8, if started over ‣ no need for decoding/encoding on the periphery ‣ can be used by C-based libraries
  • 76. 2. CRUCIAL CODE LAGGED
  • 77. 2. CRUCIAL CODE LAGGED ❖ PDO (and native DB extensions) ❖ filter ❖ proper substring search (collation-based) ❖ some ext/standard functionality
  • 78. 3. LACK OF MINDSHARE
  • 79. 3. LACK OF MINDSHARE ❖ Probably <10 people who understood the intricacies of the Unicode and ICU ❖ In the end, implementation deemed too technically difficult ❖ People were bored converting large chunks of already working code
  • 80. 4. DELAYED NEW FEATURES
  • 83. RE-ORG ❖ PHP 6 trunk was moved to a branch ❖ PHP 5.4 became the trunk ❖ Kick-started development of new features ❖ Some clean-ups and improvements from 6 back-ported to 5.4
  • 84. PEOPLE MATTER ❖ The project ran out of steam ‣ PHP development culture means that people work on what they’re interested in ‣ Clearly, the Unicode/i18n implementation wasn’t interesting enough to be viable
  • 85. INTERNALS “Because it’s nearly impossible to participate on Internals if your poo-throwing arm isn’t strong.” — @coates
  • 86. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time tend to be the main drivers behind PHP development.” — me
  • 87. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time and who outlast the rest tend to be the main drivers behind PHP development.” — me
  • 88. PROGRESS ❖ No development on the Unicode branch ❖ No visible effort to develop alternatives
  • 89. FUTURE? ❖ Lighter, gentler implementations? ‣ mbstring is clunky ‣ separate Unicode String class would also be clunky ❖ Open field for someone with a great idea, persistence, and people skills
  • 90. STEPPING AWAY ❖ Invalidation of several man-years of hard work is discouraging ❖ Did not feel like pushing the project up the hill again ❖ Working on more fun stuff these days
  • 91. LESSONS LEARNED ❖ Rewriting large existing code base is hard ❖ Making people do tedious stuff is hard ‣ make it interesting for them (game-like) ❖ Waiting for results of long iterations is hard ‣ short, results-oriented projects (if possible) ❖ Stay committed
  • 92. FINITA LA COMEDIA http://joind.in/3349 ❖ http://zazzle.com/andreiz