International Components for Unicode

Developer(s)	Unicode Consortium
Initial release	1999
Stable release	68.2 / 17 December 2020
Repository	github.com/unicode-org/icu
Written in	C/C++ (C++11) and Java
Operating system	Cross-platform
Type	libraries for Unicode and internationalization
License	Unicode License
Website	www.icu-project.org

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies.^[1]

ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU historically provided a complex text layout service for Arabic, Hebrew, Indic, and Thai, but it was deprecated in version 54 and removed completely in version 58 in favor of HarfBuzz.^[2]
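A few of these services can be tried without linking against ICU directly. The sketch below is not ICU's own C/C++ or Java API; it uses the standard ECMAScript `Intl` object and string methods, and it assumes a Node.js 16 or later runtime built with full ICU data, where these features are backed by ICU. The sample strings and variable names are chosen only for illustration.

```javascript
// Assumption: Node.js 16+ with full ICU data (the default in official builds).

// Language-sensitive collation: German sorts "ä" next to "a", Swedish after "z".
const words = ['z', 'ä', 'a'];
const german = [...words].sort(new Intl.Collator('de').compare);
const swedish = [...words].sort(new Intl.Collator('sv').compare);

// Normalization: "e" + combining acute accent composes to U+00E9 under NFC.
const composed = 'e\u0301'.normalize('NFC');

// Locale-sensitive case conversion: Turkish lowercases "I" to dotless "ı" (U+0131).
const turkishLower = 'I'.toLocaleLowerCase('tr');

// Word boundaries: keep only the word-like segments.
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
const tokens = [...segmenter.segment('Hello, world!')]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

// Number formatting with German grouping and decimal separators.
const formatted = new Intl.NumberFormat('de-DE').format(1234567.891);

console.log(german, swedish, composed, turkishLower, tokens, formatted);
```

On a runtime with reduced locale data, the German, Swedish, and Turkish results may fall back to root-locale behavior, so the output should be treated as dependent on the installed ICU data.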

ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 67 supports Unicode 13.0 and handles the removal of Great Britain from EU regions. ICU 64 supports Unicode 12.0, while ICU 64.2 added support for Unicode 12.1, that is, the single new symbol for the current Japanese Reiwa era (support for it has also been backported to older ICU versions down to ICU 4.8.2). ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as Windows XP and Windows Vista. Support for AIX, Solaris, and z/OS may also be limited in later versions (building depends on compiler support).^[3] ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.^[4]

ICU has historically used UTF-16 and still uses only UTF-16 in Java; in C/C++, UTF-8 is also supported,^[5] including the correct handling of "illegal UTF-8".^[6]
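The difference between the two encodings, and what handling ill-formed UTF-8 can look like, is sketched below. This does not call ICU's converters: it uses JavaScript strings (which are UTF-16) and the standard `TextEncoder` and `TextDecoder` classes. It is an assumption here that the replacement behavior shown, one U+FFFD per ill-formed sequence, matches the "best practice" the cited ICU ticket refers to.

```javascript
// One emoji outside the Basic Multilingual Plane.
const emoji = '\u{1F600}';

// UTF-16: the character takes two 16-bit code units (a surrogate pair).
const utf16Units = emoji.length;

// UTF-8: the same character takes four bytes.
const utf8Bytes = new TextEncoder().encode(emoji).length;

// Ill-formed UTF-8: 0xE2 0x82 starts a three-byte sequence that is cut short by 0x42 ("B").
const illFormed = new Uint8Array([0x41, 0xe2, 0x82, 0x42]);

// The truncated sequence is replaced by a single U+FFFD; decoding resumes at "B".
const decoded = new TextDecoder('utf-8').decode(illFormed);

console.log(utf16Units, utf8Bytes, JSON.stringify(decoded));
```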

ICU 68 cannot be built with a C++20 compiler,^[7] but building ICU 69.1 with one should be possible.

Origin and development

After Taligent became part of IBM in early 1996, Sun Microsystems decided that the new Java language should have better support for internationalization. Since Taligent had experience with such technologies and was geographically close, its Text and International group was asked to contribute the international classes to the Java Development Kit as part of the JDK 1.1 internationalization APIs.^[8] A large portion of this code still exists in the java.text and java.util packages. Further internationalization features were added with each later release of Java.

The Java internationalization classes were then ported to C++ and C^[9] as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and Common Locale Data Repository (CLDR).

ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed International Components for Unicode.^[10] In May 2016, the ICU project joined the Unicode Consortium as technical committee ICU-TC, and the library sources are now distributed under the Unicode License.^[11]

MessageFormat

A part of ICU is the MessageFormat class, a formatting system that allows any number of arguments to control the plural form (plural, selectordinal) or more general switch-case-style selection (select) for things like grammatical gender. These statements can be nested.^[12] A JavaScript port of this library is commonly used by Angular developers in combination with ngx-translate, so that the simple key-based library can handle nuanced localization inputs.^[13] An example of this system:

# Using YAML for simplicity of example.
hello: Hello, {user}!
# offset subtracts the specified amount from the value before a plural category is chosen. It does not, however, affect exact matches such as =1.
# The value is quoted because a YAML plain scalar cannot start with "{".
party: '{user} has invited {player_count, plural, offset:1 =1 {nobody} one {a player} other {# players}} to {user_gender, select, male {his} female {her} other {their}} party.'

// Using the simple form of https://messageformat.github.io/messageformat/page-build
import msg from './example.yaml'
// Look up the compiled message function by key and print its result.
function say(messageKey, options) { console.log(msg[messageKey](options)) }

say('hello', {user: 'Jimmy'})  // Hello, Jimmy!
say('party', {user: 'Whales', player_count: 5000, user_gender: 'male'})   // Whales has invited 4999 players to his party.
say('party', {user: 'Dolphin', player_count: 20, user_gender: 'other'})   // Dolphin has invited 19 players to their party.
say('party', {user: 'Elephant', player_count: 1, user_gender: 'female'})  // Elephant has invited nobody to her party.

ICU MessageFormat was created by adding the plural and selection system to an identically named system in Java SE.
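How a plural argument with an offset and exact-match cases is resolved can be sketched in a few lines. The function `pickPluralCase` and the `cases` object are hypothetical names used only for this illustration; they are not part of ICU or of the messageformat library. The sketch relies on the standard `Intl.PluralRules` API to obtain the plural category and assumes English rules.

```javascript
// Hypothetical helper for illustration only; not part of ICU or messageformat.
function pickPluralCase(value, offset, cases, locale = 'en') {
  // Exact matches such as "=1" are tested against the original value.
  const exactKey = `=${value}`;
  if (Object.prototype.hasOwnProperty.call(cases, exactKey)) {
    return cases[exactKey];
  }
  // Otherwise the offset is subtracted before the plural category is chosen.
  const shifted = value - offset;
  const category = new Intl.PluralRules(locale).select(shifted);
  const template = category in cases ? cases[category] : cases.other;
  // "#" stands for the offset-adjusted number. ICU itself would also apply
  // locale-aware number formatting here; this sketch inserts the plain digits.
  return template.replace('#', String(shifted));
}

const cases = { '=1': 'nobody', one: 'a player', other: '# players' };
const results = [1, 2, 20].map((n) => pickPluralCase(n, 1, cases));
console.log(results);
```

With an offset of 1, a count of 1 hits the exact match, a count of 2 becomes 1 and selects the `one` category, and a count of 20 becomes 19 and falls through to `other`.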

References

[1] "ICU - International Components for Unicode". site.icu-project.org.
[2] "Layout Engine - ICU User Guide". userguide.icu-project.org.
[3] "Download ICU 64 - ICU - International Components for Unicode". site.icu-project.org. Retrieved 2019-10-20.
[4] Chen, Raymond. "How can I convert between IANA time zones and Windows registry-based time zones?". The Old New Thing. Microsoft.
[5] "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
[6] "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
[7] "ICU 68 - ICU - International Components for Unicode". site.icu-project.org. Retrieved 2021-02-10.
[8] Werner, Laura (1999). "Getting Java ready for the world: A brief history of IBM and Sun's internationalization efforts".
[9] "ICU User Guide". userguide.icu-project.org.
[10] "ICU Project Management Committee".
[11] "ICU joins the Unicode Consortium". Unicode, Inc. 2016-05-16. Retrieved 2016-08-01.
[12] "Formatting Messages". ICU User Guide.
[13] "messageformat (js)". GitHub Pages.
