/* * Copyright (c) 2009, 2020, Oracle and/or its affiliates. All rights reserved. * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. * * This code is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License version 2 only, as * published by the Free Software Foundation. Oracle designates this * particular file as subject to the "Classpath" exception as provided * by Oracle in the LICENSE file that accompanied this code. * * This code is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License * version 2 for more details (a copy is included in the LICENSE file that * accompanied this code). * * You should have received a copy of the GNU General Public License version * 2 along with this work; if not, write to the Free Software Foundation, * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. * * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA * or visit www.oracle.com if you need additional information or have any * questions. */ /** ******************************************************************************* * Copyright (C) 1996-2014, International Business Machines Corporation and * others. All Rights Reserved. ******************************************************************************* */ package jdk_internal.icu.lang; import jdk_internal.icu.impl.UBiDiProps; import jdk_internal.icu.impl.UCharacterProperty; import jdk_internal.icu.text.Normalizer2; import jdk_internal.icu.text.UTF16; import jdk_internal.icu.util.VersionInfo; /** *
* The UCharacter class provides extensions to the * * java.lang.Character class. These extensions provide support for more * Unicode properties and together with the UTF16 * class, provide support for supplementary characters (those with code points * above U+FFFF). Each ICU release supports the latest version of Unicode * available at that time. * *
* Code points are represented in these API using ints. While it would be more * convenient in Java to have a separate primitive datatype for them, ints * suffice in the meantime. * *
* To use this class please add the jar file name icu4j.jar to the class path,
* since it contains data files which supply the information used by this
* file.
* E.g. In Windows
* set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar
.
* Otherwise, another method would be to copy the files uprops.dat and
* unames.icu from the icu4j source subdirectory
* $ICU4J_SRC/src/com.ibm.icu.impl.data to your class directory
* $ICU4J_CLASS/com.ibm.icu.impl.data.
*
*
* Aside from the additions for UTF-16 support, and the updated Unicode * properties, the main differences between UCharacter and Character are: *
* Further detail on differences can be determined using the program * com.ibm.icu.dev.test.lang.UCharacterCompare *
** In addition to Java compatibility functions, which calculate derived * properties, this API provides low-level access to the Unicode Character * Database. *
** Unicode assigns each code point (not just assigned character) values for many * properties. Most of them are simple boolean flags, or constants from a small * enumerated list. For some properties, values are strings or other relatively * more complex types. *
** For more information see "About the * Unicode Character Database" (http://www.unicode.org/ucd/) and the * ICU User Guide * chapter on Properties * (http://www.icu-project.org/userguide/properties.html). *
** There are also functions that provide easy migration from C/POSIX functions * like isblank(). Their use is generally discouraged because the C/POSIX * standards do not define their semantics beyond the ASCII range, which means * that different implementations exhibit very different behavior. Instead, * Unicode properties should be used directly. *
** There are also only a few, broad C/POSIX character classes, and they tend to * be used for conflicting purposes. For example, the "isalpha()" class is * sometimes used to determine word boundaries, while a more sophisticated * approach would at least distinguish initial letters from continuation * characters (the latter including combining marks). (In ICU, BreakIterator is * the most sophisticated API for word boundaries.) Another example: There is no * "istitle()" class for titlecase characters. *
** ICU 3.4 and later provides API access for all twelve C/POSIX character * classes. ICU implements them according to the Standard Recommendations in * Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions * (http://www.unicode.org/reports/tr18/#Compatibility_Properties). *
** API access for C/POSIX character classes is as follows: * *
{@code * - alpha: isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC) * - lower: isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE) * - upper: isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE) * - punct: ((1<* * * The C/POSIX character classes are also available in UnicodeSet patterns, * using patterns like [:graph:] or \p{graph}. *
* * There are several ICU (and Java) whitespace functions. Comparison: **
* *- isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property; most of * general categories "Z" (separators) + most whitespace ISO controls (including * no-break spaces, but excluding IS1..IS4 and ZWSP) *
- isWhitespace: Java isWhitespace; Z + whitespace ISO controls but * excluding no-break spaces *
- isSpaceChar: just Z (including no-break spaces) *
* This class is not subclassable. *
* * @author Syn Wee Quek * @stable ICU 2.1 * @see com.ibm.icu.lang.UCharacterEnums */ public final class UCharacter { /** * Joining Group constants. * * @see UProperty#JOINING_GROUP * @stable ICU 2.4 */ public static interface JoiningGroup { /** * @stable ICU 2.4 */ public static final int NO_JOINING_GROUP = 0; } /** * Numeric Type constants. * * @see UProperty#NUMERIC_TYPE * @stable ICU 2.4 */ public static interface NumericType { /** * @stable ICU 2.4 */ public static final int NONE = 0; /** * @stable ICU 2.4 */ public static final int DECIMAL = 1; /** * @stable ICU 2.4 */ public static final int DIGIT = 2; /** * @stable ICU 2.4 */ public static final int NUMERIC = 3; /** * @stable ICU 2.4 */ public static final int COUNT = 4; } /** * Hangul Syllable Type constants. * * @see UProperty#HANGUL_SYLLABLE_TYPE * @stable ICU 2.6 */ public static interface HangulSyllableType { /** * @stable ICU 2.6 */ public static final int NOT_APPLICABLE = 0; /* [NA] */ /* See note !! */ /** * @stable ICU 2.6 */ public static final int LEADING_JAMO = 1; /* [L] */ /** * @stable ICU 2.6 */ public static final int VOWEL_JAMO = 2; /* [V] */ /** * @stable ICU 2.6 */ public static final int TRAILING_JAMO = 3; /* [T] */ /** * @stable ICU 2.6 */ public static final int LV_SYLLABLE = 4; /* [LV] */ /** * @stable ICU 2.6 */ public static final int LVT_SYLLABLE = 5; /* [LVT] */ /** * @stable ICU 2.6 */ public static final int COUNT = 6; } // public data members ----------------------------------------------- /** * The lowest Unicode code point value. * * @stable ICU 2.1 */ public static final int MIN_VALUE = UTF16.CODEPOINT_MIN_VALUE; /** * The highest Unicode code point value (scalar value) according to the Unicode * Standard. This is a 21-bit value (21 bits, rounded up).
* Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE * * @stable ICU 2.1 */ public static final int MAX_VALUE = UTF16.CODEPOINT_MAX_VALUE; // public methods ---------------------------------------------------- /** * Returns the numeric value of a decimal digit code point.
* This method observes the semantics of *java.lang.Character.digit()
. Note that this will return positive * values for code points for which isDigit returns false, just like * java.lang.Character.
* Semantic Change: In release 1.3.1 and prior, this did not treat the * European letters as having a digit value, and also treated numeric letters * and other numbers as digits. This has been changed to conform to the java * semantics.
* A code point is a valid digit if and only if: **
* * @param ch the code point to query * @param radix the radix * @return the numeric value represented by the code point in the specified * radix, or -1 if the code point is not a decimal digit or if its value * is too large for the radix * @stable ICU 2.1 */ public static int digit(int ch, int radix) { if (2 <= radix && radix <= 36) { int value = digit(ch); if (value < 0) { // ch is not a decimal digit, try latin letters value = UCharacterProperty.getEuropeanDigit(ch); } return (value < radix) ? value : -1; } else { return -1; // invalid radix } } /** * Returns the numeric value of a decimal digit code point.- ch is a decimal digit or one of the european letters, and *
- the value of ch is less than the specified radix. *
* This is a convenience overload ofdigit(int, int)
that provides * a decimal radix.
* Semantic Change: In release 1.3.1 and prior, this treated numeric * letters and other numbers as digits. This has been changed to conform to the * java semantics. * * @param ch the code point to query * @return the numeric value represented by the code point, or -1 if the code * point is not a decimal digit or if its value is too large for a * decimal radix * @stable ICU 2.1 */ public static int digit(int ch) { return UCharacterProperty.INSTANCE.digit(ch); } /** * Returns a value indicating a code point's Unicode category. Up-to-date * Unicode implementation of java.lang.Character.getType() except for the above * mentioned code points that had their category changed.
* Return results are constants from the interface * UCharacterCategory
* NOTE: the UCharacterCategory values are not compatible with * those returned by java.lang.Character.getType. UCharacterCategory values * match the ones used in ICU4C, while java.lang.Character type values, though * similar, skip the value 17. * * * @param ch code point whose type is to be determined * @return category which is a value of UCharacterCategory * @stable ICU 2.1 */ public static int getType(int ch) { return UCharacterProperty.INSTANCE.getType(ch); } /** * Returns the Bidirection property of a code point. For example, 0x0041 (letter * A) has the LEFT_TO_RIGHT directional property.
* Result returned belongs to the interface * UCharacterDirection * * @param ch the code point to be determined its direction * @return direction constant from UCharacterDirection. * @stable ICU 2.1 */ public static int getDirection(int ch) { return UBiDiProps.INSTANCE.getClass(ch); } /** * Maps the specified code point to a "mirror-image" code point. For code points * with the "mirrored" property, implementations sometimes need a "poor man's" * mapping to another code point such that the default glyph may serve as the * mirror-image of the default glyph of the specified code point.
* This is useful for text conversion to and from codepages with visual order, * and for displays without glyph selection capabilities. * * @param ch code point whose mirror is to be retrieved * @return another code point that may serve as a mirror-image substitute, or ch * itself if there is no such mapping or ch does not have the "mirrored" * property * @stable ICU 2.1 */ public static int getMirror(int ch) { return UBiDiProps.INSTANCE.getMirror(ch); } /** * Maps the specified character to its paired bracket character. For * Bidi_Paired_Bracket_Type!=None, this is the same as getMirror(int). Otherwise * c itself is returned. See http://www.unicode.org/reports/tr9/ * * @param c the code point to be mapped * @return the paired bracket code point, or c itself if there is no such * mapping (Bidi_Paired_Bracket_Type=None) * * @see UProperty#BIDI_PAIRED_BRACKET * @see UProperty#BIDI_PAIRED_BRACKET_TYPE * @see #getMirror(int) * @stable ICU 52 */ public static int getBidiPairedBracket(int c) { return UBiDiProps.INSTANCE.getPairedBracket(c); } /** * Returns the combining class of the argument codepoint * * @param ch code point whose combining is to be retrieved * @return the combining class of the codepoint * @stable ICU 2.1 */ public static int getCombiningClass(int ch) { return Normalizer2.getNFDInstance().getCombiningClass(ch); } /** * Returns the version of Unicode data used. * * @return the unicode version number used * @stable ICU 2.1 */ public static VersionInfo getUnicodeVersion() { return UCharacterProperty.INSTANCE.m_unicodeVersion_; } /** * Returns a code point corresponding to the two UTF16 characters. * * @param lead the lead char * @param trail the trail char * @return code point if surrogate characters are valid. * @exception IllegalArgumentException thrown when argument characters do not * form a valid codepoint * @stable ICU 2.1 */ public static int getCodePoint(char lead, char trail) { if (UTF16.isLeadSurrogate(lead) && UTF16.isTrailSurrogate(trail)) { return UCharacterProperty.getRawSupplementary(lead, trail); } throw new IllegalArgumentException("Illegal surrogate characters"); } /** * Returns the "age" of the code point. * ** The "age" is the Unicode version when the code point was first designated (as * a non-character or for Private Use) or assigned a character. *
* This can be useful to avoid emitting code points to receiving processes that * do not accept newer characters. *
** The data is from the UCD file DerivedAge.txt. *
* * @param ch The code point. * @return the Unicode version number * @stable ICU 2.6 */ public static VersionInfo getAge(int ch) { if (ch < MIN_VALUE || ch > MAX_VALUE) { throw new IllegalArgumentException("Codepoint out of bounds"); } return UCharacterProperty.INSTANCE.getAge(ch); } /** * Returns the property value for an Unicode property type of a code point. Also * returns binary and mask property values. * ** Unicode, especially in version 3.2, defines many more properties than the * original set in UnicodeData.txt. *
** The properties APIs are intended to reflect Unicode properties as defined in * the Unicode Character Database (UCD) and Unicode Technical Reports (UTR). For * details about the properties see http://www.unicode.org/. *
** For names of Unicode properties see the UCD file PropertyAliases.txt. *
* ** Sample usage: * int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH); * int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC); * boolean b = (ideo == 1) ? true : false; ** * @param ch code point to test. * @param type UProperty selector constant, identifies which binary property to * check. Must be UProperty.BINARY_START <= type < * UProperty.BINARY_LIMIT or UProperty.INT_START <= type < * UProperty.INT_LIMIT or UProperty.MASK_START <= type < * UProperty.MASK_LIMIT. * @return numeric value that is directly the property value or, for enumerated * properties, corresponds to the numeric value of the enumerated * constant of the respective property value enumeration type (cast to * enum type if necessary). Returns 0 or 1 (for false / true) for binary * Unicode properties. Returns a bit-mask for mask properties. Returns 0 * if 'type' is out of bounds or if the Unicode version does not have * data for the property at all, or not for this code point. * @see UProperty * @see #hasBinaryProperty * @see #getIntPropertyMinValue * @see #getIntPropertyMaxValue * @see #getUnicodeVersion * @stable ICU 2.4 */ // for BiDiBase.java public static int getIntPropertyValue(int ch, int type) { return UCharacterProperty.INSTANCE.getIntPropertyValue(ch, type); } // private constructor ----------------------------------------------- /** * Private constructor to prevent instantiation */ private UCharacter() { } /* * Copied from UCharacterEnums.java */ /** * Character type Mn * * @stable ICU 2.1 */ public static final byte NON_SPACING_MARK = 6; /** * Character type Me * * @stable ICU 2.1 */ public static final byte ENCLOSING_MARK = 7; /** * Character type Mc * * @stable ICU 2.1 */ public static final byte COMBINING_SPACING_MARK = 8; /** * Character type count * * @stable ICU 2.1 */ public static final byte CHAR_CATEGORY_COUNT = 30; /** * Directional type R * * @stable ICU 2.1 */ public static final int RIGHT_TO_LEFT = 1; /** * Directional type AL * * @stable ICU 2.1 */ public static final int RIGHT_TO_LEFT_ARABIC = 13; }