Overview
This section provides the guidelines for developing C and C++ code, based on the coding conventions used by ICU programmers in the creation of the ICU library.

Details about ICU Error Codes
When calling an ICU API function that takes an error code pointer (C) or reference (C++), the caller passes in a UErrorCode variable. This variable is allocated by the caller and must pass the test U_SUCCESS() before the function call. Otherwise, the function returns immediately without doing any work. Normally, an error code variable is initialized with U_ZERO_ERROR. UErrorCode is passed around and used this way, instead of using C++ exceptions, for the following reasons:
The following code shows the inside of an ICU function implementation:
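A minimal sketch of this pattern follows. It does not include the ICU headers: DemoErrorCode, demoSuccess(), demoFailure() and demo_getLength() are hypothetical stand-ins for UErrorCode, U_SUCCESS(), U_FAILURE() and an ICU API function, chosen only to show the control flow.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for UErrorCode, U_SUCCESS() and U_FAILURE(); the real
// definitions live in unicode/utypes.h.
enum DemoErrorCode {
    DEMO_USING_FALLBACK_WARNING = -128,  // warnings are negative
    DEMO_ZERO_ERROR = 0,                 // no error, no warning
    DEMO_ILLEGAL_ARGUMENT_ERROR = 1      // failures are positive
};

inline bool demoSuccess(DemoErrorCode code) { return code <= DEMO_ZERO_ERROR; }
inline bool demoFailure(DemoErrorCode code) { return code > DEMO_ZERO_ERROR; }

// Hypothetical API function: returns the length of the text, or 0 if a
// failure is already pending or the argument is invalid.
int32_t
demo_getLength(const char *text, DemoErrorCode *pErrorCode) {
    if (demoFailure(*pErrorCode)) {
        return 0;  // an earlier call failed: do nothing
    }
    if (text == NULL) {
        *pErrorCode = DEMO_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }
    int32_t length = 0;
    while (text[length] != 0) {
        ++length;
    }
    return length;
}
```

A caller initializes the error code once, makes one or more calls, and checks the code afterwards; once a failure is set, later calls in the chain do nothing.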
Note: We have decided that we do not want to test for pErrorCode==NULL. Some existing code does this, but new code should not.

Note: Callers (as opposed to implementers) of ICU APIs can simplify their code by defining and using a subclass of icu::ErrorCode. ICU implementers can use the IcuTestErrorCode class in intltest code.

New API Functions
If the API function is non-const, then it should have a UErrorCode parameter. (Not the other way around: Some const functions may need a UErrorCode as well.) Default C++ assignment operators and copy constructors should not be used (they should be declared private and not implemented). Instead, define an assign(Class &other, UErrorCode &errorCode) function. Normal constructors are fine, and should have a UErrorCode parameter.

Warning Codes
Some UErrorCode values do not indicate a failure but an additional informational return value. Their enum constants have the _WARNING suffix and they pass the U_SUCCESS() test. However, experience has shown that they are problematic: They can get lost easily because subsequent function calls may set their own "warning" codes or may reset a UErrorCode to U_ZERO_ERROR. The source of the problem is that the UErrorCode mechanism is designed to mimic C++/Java exceptions. It prevents ICU function execution after a failure code is set, but like exceptions it does not work well for passing non-failure information. Therefore, we recommend using warning codes very carefully:
Future versions of ICU will not introduce new warning codes, and will provide real API replacements for all existing warning codes.

Bogus Objects
Some objects, for example UnicodeString and UnicodeSet, can become "bogus". This state is used when methods that create or modify the object fail (mostly due to an out-of-memory condition) but do not take a UErrorCode parameter and therefore cannot otherwise report the failure.
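The bogus-state mechanism can be sketched as follows. DemoBuffer is a hypothetical class, not an ICU class; it only mimics the isBogus()/setToBogus() idea, and its choice to ignore append() calls on a bogus object is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical class that records failures in a "bogus" flag because its
// modifying method has no error-code parameter.
class DemoBuffer {
public:
    DemoBuffer() : fData(NULL), fLength(0), fBogus(false) {}
    ~DemoBuffer() { std::free(fData); }

    bool isBogus() const { return fBogus; }
    std::size_t size() const { return fLength; }

    // Releases the contents and marks the object as unusable.
    void setToBogus() {
        std::free(fData);
        fData = NULL;
        fLength = 0;
        fBogus = true;
    }

    // Appends a NUL-terminated string; on allocation failure the object
    // becomes bogus instead of reporting an error code.
    DemoBuffer &append(const char *s) {
        if (fBogus) {
            return *this;
        }
        std::size_t n = std::strlen(s);
        char *p = static_cast<char *>(std::realloc(fData, fLength + n + 1));
        if (p == NULL) {
            setToBogus();
            return *this;
        }
        std::memcpy(p + fLength, s, n + 1);
        fData = p;
        fLength += n;
        return *this;
    }

private:
    DemoBuffer(const DemoBuffer &);             // not implemented
    DemoBuffer &operator=(const DemoBuffer &);  // not implemented
    char *fData;
    std::size_t fLength;
    bool fBogus;
};
```

Callers of such a class check isBogus() after a sequence of modifications rather than after every call.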
C and C++ Coding Conventions Overview
The ICU group uses the following coding guidelines to create software using the ICU C++ classes and methods as well as the ICU C methods.

C and C++ Type and Format Convention Guidelines
The following C and C++ type and format conventions are used to maximize portability across platforms and to provide consistency in the code:

Constants (#define, enum items, const)
Use uppercase letters for constants. For example, use UBREAKITERATOR_DONE, UBIDI_DEFAULT_LTR, ULESS. For new enum types (as opposed to new values added to existing types), do not define enum types in C++ style. Instead, define C-style enums with U... type prefix and U_/UMODULE_ constants. Define such enum types outside the ICU namespace and outside any C++ class. Define them in C header files if there are appropriate ones.

Variables and Functions
Use mixed-case names that start with a lowercase letter for variables and functions. For example, use getLength().

Types (class, struct, enum, union)
Use mixed-case names that start with an uppercase letter for types. For example, use class DateFormatSymbols.

Function Style
Use the getProperty() and setProperty() style for functions, where a lowercase letter begins the first word and the second word is capitalized without a space between it and the first word. For example, UnicodeString getSymbol(ENumberFormatSymbol symbol), void setSymbol(ENumberFormatSymbol symbol, UnicodeString value) and getLength(), getSomethingAt(index/offset).

Common Parameter Names
In order to keep function parameter names consistent, the following are recommendations for names or suffixes (usual "Camel case" applies):
Order of Source/Destination ArgumentsMany ICU function signatures list source arguments before destination arguments, as is common in C++ and Java APIs. This is the preferred order for new APIs. (Example: ucol_getSortKey(const UCollator *coll, const UChar *source, int32_t sourceLength, uint8_t *result, int32_t resultLength)) Some ICU function signatures list destination arguments before source arguments, as is common in C standard library functions. This should be limited to functions that closely resemble such C standard library functions or closely related ICU functions. (Example: u_strcpy(UChar *dst, const UChar *src)) Order of Include File IncludesInclude system header files (like <stdio.h>) before ICU headers followed by application-specific ones. This assures that ICU headers can use existing definitions from system headers if both happen to define the same symbols. In ICU files, all used headers should be explicitly included, even if some of them already include others. Pointer ConversionsDo not cast pointers to integers or integers to pointers. Also, do not cast between data pointers and function pointers. This will not work on some compilers, especially with different sizes of such types. Exceptions are only possible in platform-specific code where the behavior is known. Returning a Number of ItemsTo return a number of items, use countItems(), not getItemCount(), even if there is no need to actually count using that member function. Ranges of IndexesSpecify a range of indexes by having start and limit parameters with names or suffix conventions that represent the index. A range should contain indexes from start to limit-1 such as an interval that is left-closed and right-open. Using mathematical notation, this is represented as: [start..limit[. 
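As a small illustration of the start/limit convention, the sketch below sums the elements of a half-open range. The function name sumRange is hypothetical and not part of ICU.

```cpp
#include <cstdint>

// Sums values[start..limit[, that is, the elements at indexes
// start, start+1, ..., limit-1. The element at index limit is excluded,
// so the range contains exactly limit-start elements.
int32_t
sumRange(const int32_t *values, int32_t start, int32_t limit) {
    int32_t sum = 0;
    int32_t i;  // declared outside the for() head
    for (i = start; i < limit; ++i) {
        sum += values[i];
    }
    return sum;
}
```

With this convention an empty range is simply start==limit, and adjacent ranges share a boundary value without overlapping.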
Functions with Buffers
For functions that take a buffer (pointer) and a length argument with a default value, set the default value to -1 so that the function determines the length of the input itself (for text, by calling u_strlen()). Any other negative or undefined value constitutes an error.

Primitive Types
Primitive types are defined by utypes.h or by a header file that it includes. The most common types are uint8_t, uint16_t, uint32_t, int8_t, int16_t, int32_t, UChar (unsigned, 16-bit), UChar32 (signed, 32-bit), and UErrorCode.

File Names (.h, .c, .cpp, data files if possible, etc.)
Use the 8.3 standard with all characters in lowercase for file names.

Language Extensions and Standards
Proprietary
features, language extensions, or library functions must not be used
because they will not work on all C or C++ compilers.

Tabs and Indentation
Save files with spaces instead of tab characters (\x09). The indentation size is 4.

Documentation
Use Java doc-style in-file documentation created with doxygen.

Multiple Statements
Place multiple statements in multiple lines. if() or loop heads must not be followed by their bodies on the same line.

Placements of {} Curly Braces
Place curly braces {} in reasonable and consistent locations. Each of us subscribes to different philosophies. It is recommended to use the style of a file, instead of mixing different styles. It is requested, however, to not have if() and loop bodies without curly braces.

if() {...} and Loop Bodies
Use curly braces for if() and else as well as loop bodies, etc., even if there is only one statement.

Function Declarations
Have one line that has the return type and place all the import declarations, extern declarations, export declarations, the function name, and function signature at the beginning of the next line. For example, use the following convention:
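A sketch of this declaration layout, under stated assumptions: DEMO_CAPI and DEMO_EXPORT2 are empty stand-ins for ICU's export qualifiers, and demo_countUnits() is a hypothetical function. It places the qualifiers and return type on one line and the function name at the start of the next, and it also shows the -1 length convention for buffer arguments.

```cpp
#include <cstdint>

// Stand-ins for ICU's U_CAPI and U_EXPORT2; here they expand to nothing.
#define DEMO_CAPI
#define DEMO_EXPORT2

// Declaration: qualifiers and return type on one line, then the function
// name and signature at the beginning of the next line.
DEMO_CAPI int32_t DEMO_EXPORT2
demo_countUnits(const uint16_t *s, int32_t length, bool *pFailed);

// Hypothetical function: returns the number of 16-bit units in s.
// length==-1 means that s is NUL-terminated and is measured here;
// any other negative length is an error, reported through *pFailed.
DEMO_CAPI int32_t DEMO_EXPORT2
demo_countUnits(const uint16_t *s, int32_t length, bool *pFailed) {
    if (length < -1) {
        *pFailed = true;
        return 0;
    }
    if (length == -1) {
        length = 0;
        while (s[length] != 0) {
            ++length;
        }
    }
    return length;
}
```

The bool flag is a simplification for this sketch; an ICU function would report the error through a UErrorCode parameter.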
Use Static For File Scope
Use static for variables, functions, and constants that are not exported explicitly by a header file. Some platforms are confused if non-static symbols are not explicitly declared extern. These platforms will not be able to build ICU nor link to it.

Using C Callbacks From C++ Code
z/OS and Windows COM wrappers around ICU need __cdecl for callback functions. The reason is that C++ can have a different function calling convention from C. These callback functions also usually need to be private. So the following code
should be changed to look like the following by adding U_CDECL_BEGIN, static, U_CALLCONV and U_CDECL_END.
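The shape of such a callback can be sketched without the ICU headers. DEMO_CDECL_BEGIN, DEMO_CDECL_END and DEMO_CALLCONV are stand-ins for U_CDECL_BEGIN, U_CDECL_END and U_CALLCONV; the calling-convention macro is left empty here, which is an assumption that holds only on platforms where no explicit keyword is needed. The comparator and sortInt32() are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-ins for the ICU macros. In C++ the BEGIN/END pair wraps the
// declarations in extern "C".
#define DEMO_CDECL_BEGIN extern "C" {
#define DEMO_CDECL_END }
#define DEMO_CALLCONV

DEMO_CDECL_BEGIN
// static: private to this file. C linkage and calling convention: safe to
// pass to a C function such as qsort().
static int DEMO_CALLCONV
compareInt32(const void *left, const void *right) {
    int32_t l = *static_cast<const int32_t *>(left);
    int32_t r = *static_cast<const int32_t *>(right);
    return (l > r) - (l < r);
}
DEMO_CDECL_END

// C++ code that hands the C callback to the C library.
void
sortInt32(int32_t *values, int32_t length) {
    std::qsort(values, static_cast<std::size_t>(length),
               sizeof(int32_t), compareInt32);
}
```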
Same Module and Functionality in C and in C++Determine if two headers are needed. If the same functionality is provided with both a C and a C++ API, then there can be two headers, one for each language, even if one uses the other. For example, there can be umsg.h for C and msgfmt.h for C++. Not all functionality has or needs both kinds of API. More and more functionality is available only via C APIs to avoid duplication of API, documentation, and maintenance. C APIs are perfectly usable from C++ code, especially with UnicodeString methods that alias or expose C-style string buffers. Platform DependenciesUse the platform dependencies that are within the header files that utypes.h files include. They are platform.h (which is generated by the configuration script from platform.h.in) and its more specific cousins like pwin32.h for Windows, which define basic types, and putil.h, which defines platform utilities. Short, Unnested Mutex BlocksDo
not call functions from within a mutual-exclusion (mutex) block.
This helps prevent deadlocks from occurring later. There
should be as little code inside a mutex block as possible to minimize
the performance degradation from blocked threads. Names of Internal FunctionsInternal functions that are not declared static (regardless of inlining) must follow the naming conventions for exported functions because many compilers and linkers do not distinguish between library exports and intra-library visible functions. Which Language for the ImplementationWrite implementation code in C++. Use objects very carefully, as always: Implicit constructors, assignments etc. can make simple-looking code surprisingly slow. For every C API, make sure that there is at least one call from a pure C file in the cintltst test suite. Background: We used to prefer C or C-style C++ for implementation code because we used to have users ask for pure C. However, there was never a large, usable subset of ICU that was usable without any C++ dependencies, and C++ can(!) make for much shorter, simpler, less error-prone and easier-to-maintain code, for example via use of "smart pointers" (unicode/localpointer.h and cmemory.h). We still try to expose most functionality via C APIs because of the difficulties of binary compatible C++ APIs exported from DLLs/shared libraries. No Compiler WarningsICU must compile without compiler warnings unless such warnings are verified to be harmless or bogus. Often times a warning on one compiler indicates a breaking error on another. Enum ValuesWhen casting an integer value to an enum type, the enum type should have a constant with this integer value, or at least it must have a constant whose value is at least as large as the integer value being cast, with the same signedness. For example, do not cast a -1 to an enum type that only has non-negative constants. Some compilers choose the internal representation very tightly for the defined enum constants, which may result in the equivalent of a uint8_t representation for an enum type with only small, non-negative constants. Casting a -1 to such a type may result in an actual value of 255. (This has happened!) 
When casting an enum value to an integer type, make sure that the enum value's numeric value is within range of the integer type. Do not check for this!=NULL, do not check for NULL referencesIn public APIs, assume this!=0 and assume that references are not 0. In C code, "this" is the "service object" pointer, such as set in uset_add(USet* set, UChar32 c) — don't check for set!=NULL. We do usually check all other (non-this) pointers for NULL, in those cases when NULL is not valid. (Many functions allow a NULL string or buffer pointer if the length or capacity is 0.) Memory UsageDynamically Allocated MemoryICU4C APIs are designed to allow separate heaps for its libraries vs. the application. This is achieved by providing factory methods and matching destructors for all allocated objects. The C++ API uses a common base class with overridden new/delete operators and/or forms an equivalent pair with createXYZ() factory methods and the delete operator. The C API provides pairs of open/close functions for each service. See the C++ and C guideline sections below for details. Declaring Static DataAll unmodifiable data should be declared const. This includes the pointers and the data itself. Also if you do not need a pointer to a string, declare the string as an array. This reduces the time to load the library and all its pointers. This should be done so that the same library data can be shared across processes automatically. Here is an example:
This should be changed to the following:
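A sketch of the resulting declarations, with hypothetical names and contents (kName and kNames are not taken from ICU). The non-const starting point is shown only as a comment.

```cpp
// Before (for comparison only): writable pointers to string data.
//     char *name = "example";
//     char *names[] = { "first", "second" };

// After: the single string is an array, so no pointer has to be stored
// and fixed up when the library is loaded.
static const char kName[] = "example";

// For the table, both the pointers and the characters are const.
static const char * const kNames[] = { "first", "second" };
```

Because nothing here is writable, the data can be placed in a read-only segment and shared between processes.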
No Static InitializationThe most common reason to have static initialization is to declare a static const UnicodeString, for example (see utypes.h about invariant characters):
The most portable and most efficient way to declare ASCII text as a Unicode string is to do the following instead:
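For illustration, such a declaration might look like the sketch below. uint16_t stands in for ICU's UChar so that the block compiles without the ICU headers, and sizeof(myStr[0]) stands in for U_SIZEOF_UCHAR.

```cpp
#include <cstdint>

// The code units are written as numeric constants; the comment records
// what the string says.
static const uint16_t myStr[] = { 0x61, 0x62, 0x63, 0 };  /* "abc" */

// Number of code units, not counting the terminating NUL.
static const int32_t myStrLength =
    static_cast<int32_t>(sizeof(myStr) / sizeof(myStr[0])) - 1;
```

Both values are compile-time constants, so no code runs at library load time.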
You can easily change a string to hexadecimal values by using simple tools like http://icu-project.org/icu4jweb/escaper.jsp. We do not use character literals for Unicode characters and strings because the execution character set of C/C++ compilers is almost never Unicode and may not be ASCII-compatible (especially on EBCDIC platforms). Depending on the API where the string is to be used, a terminating NUL (0) may or may not be required. The length of the string (number of UChars in the array) can be determined with sizeof(myStr)/U_SIZEOF_UCHAR (subtract 1 for the NUL if present). Always remember to put in a comment at the end of the declaration what the Unicode string says.

Static initialization of C++ objects must not be used in ICU libraries because of the following reasons:
ICU users can use the U_STRING_DECL and U_STRING_INIT macros for C strings. Note that on some platforms this will incur a small initialization cost (simple conversion). Also, ICU users need to make sure that they properly and consistently declare the strings with both macros. See ustring.h for details. C++ Coding GuidelinesThis section describes the C++ specific guidelines or conventions to use. Portable Subset of C++ICU uses only a portable subset of C++ for maximum portability. Also, it does not use features of C++ that are not implemented well in all compilers or are cumbersome. In particular, ICU does not use exceptions, compiler-provided Run-Time Type Information, or the Standard Template Library. We have started to use templates in ICU 4.2 (e.g., StringByteSink) and ICU 4.4 (LocalPointer and some internal uses). We try to limit templates to where they provide a lot of benefit (robust code, avoid duplication) without much or any code bloat. We continue to not use the Standard Template Library because its design causes a lot of code bloat and because it is not supported on all platforms (e.g., Android). ICU uses a limited form of multiple inheritance equivalent to Java's interface mechanism: All but one base classes must be interface/mixin classes, i.e., they must contain only pure virtual member functions. For details see the 'boilerplate' discussion below. This restriction to at most one base class with non-virtual members eliminates problems with the use and implementation of multiple inheritance in C++. ICU does not use virtual base classes. Classes and MembersClasses and their members do not need a 'U' or any other prefix. Global OperatorsGlobal operators (operators that are not class members) can be problematic for library entry point versioning, may confuse users and cannot be easily ported to Java (ICU4J). They should be avoided if possible. 
The issue with library entry point versioning is that on platforms that do not support namespaces, users must rename all classes and global functions via urename.h. This renaming process is not possible with operators. However, a global operator can be used in ICU4C (when necessary) if its function signature contains an ICU C++ class that is versioned. This will result in a mangled linker name that does contain the ICU version number via the versioned name of the class parameter. For example, ICU4C 2.8 added an operator + for UnicodeString, with two UnicodeString reference parameters. NamespacesBeginning with ICU version 2.0, ICU uses namespaces. The actual namespace is icu_M_N with M being the major ICU release number and N being the minor ICU release number. For convenience, the namespace icu is an alias to the current release-specific one. Class declarations, even forward declarations, must be scoped to the ICU namespace. For example:
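The versioned namespace and its alias can be sketched with plain C++ as follows. demo_2_0, demo and DemoString are hypothetical names that mirror icu_M_N, icu and an ICU class; the ICU headers use their own macros for opening and closing the namespace, which are not reproduced here.

```cpp
// Forward declaration, scoped to the versioned namespace.
namespace demo_2_0 {
class DemoString;
}

// Convenience alias for the current release-specific namespace.
namespace demo = demo_2_0;

// The class definition, again inside the versioned namespace.
namespace demo_2_0 {
class DemoString {
public:
    explicit DemoString(int length) : fLength(length) {}
    int length() const { return fLength; }
private:
    int fLength;
};
}

// User code can refer to the class through the alias.
int
lengthOf(const demo::DemoString &s) {
    return s.length();
}
```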
U_NAMESPACE_USE (expands to using namespace icu_M_N; when available) is automatically done when utypes.h is included, so that all ICU classes are immediately usable.

Declare Class APIs
Class APIs need to be declared like either of the following:

Inline-Implemented Member Functions
Class member functions are usually declared but not inline-implemented in the class declaration. A long function implementation in the class declaration makes it hard to read the class declaration. It is ok to inline-implement trivial functions in the class declaration. Pretty much everyone agrees that inline implementations are ok if they fit on the same line as the function signature, even if that means bending the single-statement-per-line rule slightly:

    T *orphan() { T *p=ptr; ptr=NULL; return p; }

Most people also agree that very short multi-line implementations are ok inline in the class declaration. Something like the following is probably the maximum:

    Value *getValue(int index) {
        if(index>=0 && index<fLimit) {
            return fArray[index];
        }
        return NULL;
    }

If the inline implementation is longer than that, then just declare the function inline and put the actual inline implementations after the class declaration in the same file. (See unicode/unistr.h for many examples.) If it's significantly longer than that, then it's probably not a good candidate for inlining anyway.

C++ class layout and 'boilerplate'
There are different sets of requirements for different kinds of C++ classes. In general, all instantiable classes (i.e., all classes except for interface/mixin classes and ones with only static member functions) inherit the UMemory base class. UMemory provides new/delete operators, which make it possible to keep the ICU heap separate from the application heap, or to customize ICU's memory allocation consistently.
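The role of such a base class can be sketched as follows. DemoMemory and DemoThing are hypothetical classes, not the real UMemory; the allocation counter exists only to make the routing observable, and malloc()/free() stand in for ICU's internal allocation functions.

```cpp
#include <cstddef>
#include <cstdlib>

// For illustration only: counts objects currently allocated through
// DemoMemory's operators.
static int gLiveAllocations = 0;

// Hypothetical base class in the spirit of UMemory: objects of every
// subclass are allocated and released through these operators rather
// than through the global new/delete.
class DemoMemory {
public:
    // noexcept: a failed allocation yields a NULL pointer, not an exception.
    static void *operator new(std::size_t size) noexcept {
        void *p = std::malloc(size);
        if (p != NULL) {
            ++gLiveAllocations;
        }
        return p;
    }
    static void operator delete(void *p) noexcept {
        if (p != NULL) {
            --gLiveAllocations;
            std::free(p);
        }
    }
};

class DemoThing : public DemoMemory {
public:
    DemoThing() : value(0) {}
    int value;
};
```

Replacing the two operators in one place changes the allocator for the whole class hierarchy, which is the consistency benefit described above.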
Public ICU C++ classes must inherit the UObject base class and implement the following common set of 'boilerplate' functions:
Interface/mixin classes are equivalent to Java interfaces. They are as much multiple inheritance as ICU uses. They do not decrease performance, and they do not cause problems associated with multiple base classes having data members. Interface/mixin classes contain only pure virtual member functions, and must contain an empty virtual destructor. See for example the UnicodeMatcher class. Interface/mixin classes must not inherit any non-interface/mixin class, especially not UMemory or UObject. Instead, implementation classes must inherit one of these two (or a subclass of them) in addition to the interface/mixin classes they implement. See for example the UnicodeSet class.

Static classes contain only static member functions and are therefore never instantiated. They must not inherit UMemory or UObject. Instead, they must declare a private default constructor (without any implementation) to prevent instantiation. See for example the LESwaps layout engine class.

C++ classes internal to ICU need not (but may) implement the boilerplate functions as mentioned above. They must inherit at least UMemory if they are instantiable.

Make Sure The Compiler Uses C++
Use the XP_CPLUSPLUS macro, not __cplusplus, to check that the code is being compiled as C++.

Adoption of Objects
Some constructors and factory functions take pointers to objects that they adopt. The newly created object contains a pointer to the adoptee and takes over ownership and lifecycle control. If an error occurs while creating the new object (and thus in the code that adopts an object), then the semantics used within ICU must be adopt-on-call (as opposed to, for example, adopt-on-success):
Example:
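A minimal sketch of adopt-on-call semantics, using hypothetical classes (Adoptee, Owner) and a bool flag in place of a UErrorCode parameter. The counter exists only so that leaks would be visible.

```cpp
#include <cstddef>
#include <new>

// For illustration only: counts Adoptee objects that are currently alive.
static int gLiveAdoptees = 0;

class Adoptee {
public:
    Adoptee() { ++gLiveAdoptees; }
    ~Adoptee() { --gLiveAdoptees; }
};

class Owner {
public:
    explicit Owner(Adoptee *adopted) : fAdopted(adopted) {}
    ~Owner() { delete fAdopted; }
private:
    Owner(const Owner &);             // not implemented
    Owner &operator=(const Owner &);  // not implemented
    Adoptee *fAdopted;
};

// Adopt-on-call: from the moment of the call, the caller no longer owns
// the adoptee, whether or not the Owner is created successfully.
Owner *
createOwner(Adoptee *adopted, bool &failed) {
    if (failed) {
        delete adopted;  // a failure is pending: still take ownership
        return NULL;
    }
    Owner *owner = new (std::nothrow) Owner(adopted);
    if (owner == NULL) {
        delete adopted;
        failed = true;
    }
    return owner;
}
```

The caller never needs a cleanup path for the adoptee, which keeps error handling at the call site simple.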
Memory AllocationAll ICU C++ class objects directly or indirectly inherit UMemory (see 'boilerplate' discussion above) which provides new/delete operators, which in turn call the internal functions in cmemory.c. Creating and releasing ICU C++ objects with new/delete automatically uses the ICU allocation functions.
When global new/delete operators are to be used in the application (never inside ICU!), then they should be properly scoped as e.g. ::new, and the application must ensure that matching new/delete operators are used. In some cases where such scoping is missing in non-ICU code, it may be simpler to compile ICU without its own new/delete operators. See source/common/unicode/uobject.h for details. In ICU library code, allocation of non-class data types — simple integer types as well as pointers — must use the functions in cmemory.h/.c (uprv_malloc(), uprv_free(), uprv_realloc()). Such memory objects must be released inside ICU, never by the user; this is achieved either by providing a "close" function for a service or by avoiding to pass ownership of these objects to the user (and instead filling user-provided buffers or returning constant pointers without passing ownership). The cmemory.h/.c functions can be overridden at ICU compile time for custom memory management. By default, UMemory's new/delete operators are implemented by calling these common functions. Overriding the cmemory.h/.c functions changes the memory management for both C and C++. C++ objects that were either allocated with new or returned from a createXYZ() factory method must be deleted by the user/owner. Memory Allocation FailuresAll memory allocations and object creations should be checked for success. In the event of a failure (a NULL returned), a U_MEMORY_ALLOCATION_ERROR status should be returned by the ICU function in question. If the allocation failure leaves the ICU service in an invalid state, such that subsequent ICU operations could also fail, the situation should be flagged so that the subsequent operations will fail cleanly. Under no circumstances should a memory allocation failure result in a crash in ICU code, or cause incorrect results rather than a clean error return from an ICU function. 
Some functions, such as the C++ assignment operator, are unable to return an ICU error status to their caller. In the event of an allocation failure, these functions should mark the object as being in an invalid or bogus state so that subsequent attempts to use the object will fail. Deletion of an invalid object should always succeed. Global Inline FunctionsGlobal functions (non-class member functions) that are declared inline must be made static inline. Some compilers will export symbols that are declared inline but not static. No Declarations in the for() Loop HeadIterations through for() loops must not use declarations in the first part of the loop. There have been two revisions for the scoping of these declarations and some compilers do not comply to the latest scoping. Declarations of loop variables should be outside these loops. Common or I18NDecide whether or not the module is part of the common or the i18n API collection. Use the appropriate macros. For example, use U_COMMON_IMPLEMENTATION, U_I18N_IMPLEMENTATION, U_COMMON_API, U_I18N_API. See utypes.h. Constructor FailureIf there is a reasonable chance that a constructor fails (For example, if the constructor relies on loading data), then either it must use and set a UErrorCode or the class needs to support an isBogus()/setToBogus() mechanism like UnicodeString and UnicodeSet, and the constructor needs to set the object to bogus if it fails. C Coding GuidelinesThis section describes the C-specific guidelines or conventions to use. Declare and define C APIs with both U_CAPI and U_EXPORT2All C APIs need to be both declared and defined using the U_CAPI and U_EXPORT2 qualifiers.
Subdivide the NamespaceUse prefixes to avoid name collisions. Some of those prefixes contain a 3- (or sometimes 4-) letter module identifier. Very general names like u_charDirection() do not have a module identifier in their prefix.
Function Declarations
Function declarations need to be in the form U_CAPI return-type U_EXPORT2 to satisfy all the compilers' requirements.

Functions for Constructors and Destructors
Functions that roughly compare to constructors and destructors are called umod_open() and umod_close(). See the following example:
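The open/close pairing can be sketched as below. UDemo, udemo_open(), udemo_close() and DemoStatus are hypothetical stand-ins (for a service object, its open/close pair, and UErrorCode), and malloc()/free() stand in for the ICU-internal allocation functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-in for UErrorCode with only the values needed here.
enum DemoStatus {
    DEMO_OK = 0,
    DEMO_MEMORY_ALLOCATION_ERROR = 7
};

// Hypothetical service object.
struct UDemo {
    int32_t option;
};

// "Constructor": returns a new object that the caller owns, or NULL on
// failure with *pStatus set.
UDemo *
udemo_open(int32_t option, DemoStatus *pStatus) {
    if (*pStatus > DEMO_OK) {
        return NULL;  // a failure is already pending
    }
    UDemo *demo = static_cast<UDemo *>(std::malloc(sizeof(UDemo)));
    if (demo == NULL) {
        *pStatus = DEMO_MEMORY_ALLOCATION_ERROR;
        return NULL;
    }
    demo->option = option;
    return demo;
}

// "Destructor": releases an object returned by udemo_open().
void
udemo_close(UDemo *demo) {
    std::free(demo);
}
```

Because the library both allocates and frees the object, the application and the library can use separate heaps.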
Each successful call to a umod_open() returns a pointer to an object that must be released by the user/owner by calling the matching umod_close(). Inline Implementation FunctionsSome, but not all, C compilers allow ICU users to declare functions inline (which is a C++ language feature) with various keywords. This has advantages for implementations because inline functions are much safer and more easily debugged than macros. ICU has a portable U_INLINE declaration macro that can be used for inline functions. On C compilers that do not support any form of inline declaration, U_INLINE will result in a static declaration. U_INLINE must only be used in implementation code, not in public C APIs. All functions that are declared inline, or are small enough that an optimizing compiler might inline them even without the inline declaration, should be defined (implemented) – not just declared – before they are first used. This is to enable as much inlining as possible, and also to prevent compiler warnings for functions that are declared inline but whose definition is not available when they are called. C Equivalents for Classes with Multiple ConstructorsIn cases like BreakIterator and NumberFormat, instead of having several different 'open' APIs for each kind of instances, use an enum selector. Source File NamesSource file names for C begin with a 'u'. Memory APIs Inside ICUFor memory allocation in C implementation files for ICU, use the functions and macros in cmemory.h. When allocated memory is returned from a C API function, there must be a corresponding function (like a ucnv_close()) that deallocates that memory. All memory allocations in ICU should be checked for success. In the event of a failure (a NULL returned from uprv_malloc()), a U_MEMORY_ALLOCATION_ERROR status should be returned by the ICU function in question. 
If the allocation failure leaves the ICU service in an invalid state, such that subsequent ICU operations could also fail, the situation should be flagged so that the subsequent operations will fail cleanly. Under no circumstances should a memory allocation failure result in a crash in ICU code, or cause incorrect results rather than a clean error return from an ICU function. // CommentsDo not use C++ style // comments in C files and in headers that will be included in C files. Some of the supported platforms are not compatible with C++ style comments in C files. Source Code Strings with Unicode Characterschar * strings in ICUThe C/C++ languages do not provide a portable way to specify Unicode code point or string literals other than with arrays of numeric constants. For convenience, ICU4C tends to use char * strings in places where only "invariant characters" (a portable subset of the 7-bit ASCII repertoire) are used. This allows locale IDs, charset names, resource bundle item keys and similar items to be easily specified as string literals in the source code. The same types of strings are also stored as "invariant character" char * strings in the ICU data files. ICU has hard coded mapping tables in source/common/putil.c to convert invariant characters to and from Unicode without using a full ICU converter. These tables must match the encoding of string literals in the ICU code as well as in the ICU data files.
Some usage of char * strings in ICU assumes the system charset instead of invariant characters. Such strings are only handled with the default converter (See the following section). The system charset is usually a superset of the invariant characters. The following are the ASCII and EBCDIC byte values for all of the invariant characters (see also unicode/utypes.h):
Rules Strings with Unicode CharactersIn order to include characters in source code strings that are not part of the invariant subset of ASCII, one has to use character escapes. In addition, rules strings for collation, break iteration, etc. need to follow service-specific syntax, which means that spaces and ASCII punctuation must be quoted using the following rules:
Java Coding Conventions OverviewThe ICU group uses the following coding guidelines to create software using the ICU Java classes and methods. Code styleThe standard order for modifier keywords on APIs is:
All if/else/for/while/do loops use braces, even if the controlled statement is a single line. This is for clarity and to avoid mistakes due to bad nesting of control statements, especially during maintenance. Tabs should not be present in source files. Indentation is 4 spaces. Make sure the code is formatted cleanly with regular indentation. Follow Java style code conventions, e.g., don't put multiple statements on a single line, use mixed-case identifiers for classes and methods and upper case for constants, and so on. All public and protected API in the 'API packages' (lang, math, text, util) should be tagged with either @draft, @stable, or @internal. Javadoc should be complete and correct when code is checked in, to avoid playing catch-up later during the throes of the release. Please javadoc all methods, not just external APIs, since this helps with maintenance.

Code organization
Avoid putting more than one top-level class in a single file. Either use separate files or nested classes. Do not mix test, tool, and runtime code in the same file. If you need some access to private or package methods or data, provide public accessors for them and mark them @internal. Test code should be under dev/test, and tools (e.g., code that generates data, source code, or computes constants) under dev/tool. Occasionally for very simple cases you can leave a few lines of tool code in the main source and comment it out, but maintenance is easier if you just comment the location of the tools in the source and put the actual code elsewhere. Avoid creating new interfaces unless you know you need to mix the interface into two or more classes that have separate inheritance. Interfaces are impossible to modify later in a backwards-compatible way. Abstract classes, on the other hand, can add new methods with default behavior. Use interfaces only if it is required by the architecture, not just for expediency.
Current releases of ICU4J are restricted to JDK 1.4 APIs and language features. This unfortunately means no static imports, and no enums. But since we hope eventually to move forward to 1.5, we should avoid the fancy workarounds for these language deficiencies that have been used in the past. So don't use interfaces as a convenience to import static constants into several files. Also, don't use the (rather clumsy) enum idiom based on classes with a fixed number of constant instances, as it's generally not worth the effort. Using static int constants is acceptable.

ICU Packages
Public APIs should be placed in com.ibm.icu.text, com.ibm.icu.util, and com.ibm.icu.lang. For historical reasons and for easier migration from JDK classes, there are also APIs in com.ibm.icu.math, but new APIs should not be added there. APIs used only during development, testing, or tools work should be placed in com.ibm.icu.dev. A class or method which is used by public APIs (listed above) but which is not itself public can be placed in different places:
Error Handling and Exceptions
Errors should be indicated by throwing exceptions, not by returning “bogus” values. If an input parameter is in error, then a new IllegalArgumentException("description") should be thrown. Exceptions should be caught only when something must be done, for example special cleanup or rethrowing a different exception. If the error “should never occur”, then throw a new RuntimeException("description") (rare). In this case, a comment should be added with a justification. Use exception chaining: When an exception is caught and a new one created and thrown (usually with additional information), the original exception should be chained to the new one. A catch expression should not catch Throwable. Catch expressions should specify the most specific subclass of Throwable that applies. If there are two concrete subclasses, both should be specified in separate catch statements.
Binary Data Files
ICU4J uses the same binary data files as ICU4C, in the big-endian/ASCII form. The ICUBinary class should be used to read them. Some data sources (for example, compressed Jar files) do not allow the use of several InputStream and related APIs:
Compiler Warnings
There should be no compiler warnings when building ICU4J. It is recommended to develop using Eclipse, and to fix any problems that are shown in the Eclipse Problems panel (below the main window).
Miscellaneous
Objects should not be cast to a class in the sun.* packages because this would cause a SecurityException when run under a SecurityManager. The exception needs to be caught and default action taken, instead of propagating the exception.
Adding .c, .cpp and .h files to ICU
In order to add compilable files to ICU, add them to the source code control system in the appropriate folder and also to the build environment. To add these files, use the following steps:
Test Suite Notes
The cintltst Test Suite contains all the tests for the International Components for Unicode C API. These tests may be automatically run by typing "cintltst" or "cintltst -all" at the command line. The suite depends on the C Test Services.
C Test Services
The purpose of the test services is to enable the writing of tests entirely in C. The services have been designed to make creating tests or converting old ones as simple as possible with a minimum of services overhead. A sample test file, "demo.c", is included at the end of this document. For more information regarding C test services, please see the \intlwork\source\tools\ctestfw directory.
Writing Test Functions
The following shows the possible format of test functions:
Output from the test is accomplished with three printf-like functions:
To use the tests, link them into a hierarchical structure. The root of the structure will be allocated by default.
Provide addTest() with the function pointer for the function that performs the test as well as the absolute 'path' to the test. Paths may be up to 127 chars in length and may be used to group tests. The calls to addTest must be placed in a function or a hierarchy of functions (perhaps mirroring the paths). See the existing cintltst for more details.
Running the Tests
A subtree may be extracted from another tree of tests for the programmatic running of subtests.
And a tree of tests may be run simply by:
Similarly, showTests() lists out the tests. However, it is easier to use the command prompt with the Usage specified below.
Globals
The command line parser resets the error count and prints a summary of the failed tests. But if runTest is called directly, for instance, it needs to be managed manually. ERROR_COUNT contains the number of times log_err was called. runTests resets the count to zero before running the tests. VERBOSITY must be 1 to display log_verbose() data. Otherwise, VERBOSITY must be set to 0 (default).
Building
To compile this test suite using Microsoft Visual C++ (MSVC), follow the instructions in icu/source/readme.html#HowToInstall for building the allC workspace. This builds the libraries as well as the cintltst executable.
Executing
To run the test suite from the command line, change the directories to icu/source/test/cintltst/Debug for the debug build (or icu/source/test/cintltst/Release for the release build) and then type cintltst.
Usage
Type cintltst -h to view its command line parameters.
IntlTest Test Suite Documentation
The IntlTest suite contains all of the tests for the C++ API of International Components for Unicode. These tests may be automatically run by typing intltest at the command line. Since the verbose option prints out a considerable amount of information, it is recommended that the output be redirected to a file: intltest -v > testOutput.
Building
To compile this test suite using MSVC, follow the instructions for building the allCPP (All C++ interfaces) workspace. This builds the libraries as well as the intltest executable.
Executing
To run the test suite from the command line, change the directories to icu/source/test/intltest/Debug, then type: intltest -v > testOutput. For the release build, the executable will reside in the icu/source/test/intltest/Release directory.
Usage
Type intltest -h to see the usage:
Binary Data Formats
ICU services rely heavily on data to perform their functions. Such data is available in various more or less structured text file formats, which make it easy to update and maintain. For high runtime performance, most data items are pre-built into binary formats, i.e., they are parsed and processed once and then stored in a format that is used directly during processing. Most of the data items are pre-built into binary files that are then installed on a user's machine. Some data can also be built at runtime but is not persistent. In the latter case, a master object should be built once and then cloned to avoid the multiple parsing, processing, and building of the same data.
Binary data formats for ICU must be portable across platforms that share the same endianness and the same charset family (ASCII vs. EBCDIC). It would be possible to handle data from other platform types, but that would require load-time or even runtime conversion.
Data Types
Binary data items are memory-mapped, i.e., they are used as readonly, constant data. Their structures must be portable according to the criteria above and should be efficiently usable at runtime without building additional runtime data structures. Most native C/C++ data types cannot be used as part of binary data formats because their sizes are not fixed across compilers. For example, an int could be 16, 32, 64, or even any other number of bits wide. Only types with absolutely known widths and semantics may be used. Use for example:
Do not use for example:
Each field in a binary/mappable data format must be aligned naturally. This means that a field with a primitive type of size n bytes must be at an n-aligned offset from the start of the data block. UChar must be 2-aligned, int32_t must be 4-aligned, etc. It is possible to use struct types, but one must make sure that each field is naturally aligned, without possible implicit field padding by the compiler — assuming a reasonable compiler.
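The natural-alignment rule can be checked mechanically. The following is a minimal sketch, not code from ICU: the struct SketchRecord, its fields, and both function names are hypothetical and exist only for illustration. It lays out fixed-width fields so that each one starts at a multiple of its own size, then verifies with offsetof and sizeof that the compiler inserted no padding.

```c
#include <assert.h>   /* assert */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* fixed-width integer types */

/* Hypothetical record layout, for illustration only (not an ICU structure). */
typedef struct SketchRecord {
    int32_t  count;     /* offset 0: 4-byte field at a 4-aligned offset */
    uint16_t first;     /* offset 4: 2-byte field at a 2-aligned offset */
    uint16_t second;    /* offset 6: 2-byte field at a 2-aligned offset */
    uint8_t  flags[4];  /* offset 8: byte fields need no alignment */
} SketchRecord;

/* Returns 1 if every field is at its expected offset and the struct has
 * no implicit padding (4 + 2 + 2 + 4 = 12 bytes), otherwise 0. */
int sketchRecordIsPacked(void) {
    return offsetof(SketchRecord, count) == 0 &&
           offsetof(SketchRecord, first) == 4 &&
           offsetof(SketchRecord, second) == 6 &&
           offsetof(SketchRecord, flags) == 8 &&
           sizeof(SketchRecord) == 12;
}

/* Call once at startup or from a test to catch an unexpected layout. */
void sketchRecordCheckLayout(void) {
    assert(sketchRecordIsPacked());
}
```

Because the widest member is an int32_t, a record like this would itself have to start at a 4-aligned offset within the data block. Placing the uint16_t fields before the int32_t would still be naturally aligned, but placing a single uint16_t before it would not, since the compiler would then add two bytes of padding.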
Within the binary data, a struct type field must be aligned according to its widest member field. The struct OKExample must be 4-aligned because it contains an int32_t field. Another potential problem with struct types, especially in C++, is that some compilers provide RTTI for all classes and structs, which inserts a _vtable pointer before the first declared field. When using struct types with binary/mappable data in C++, assert in some place in the code that offsetof the first field is 0. For an example see the genpname tool.
Versioning
ICU data files have a UDataHeader structure preceding the actual data. Among other fields, it contains a formatVersion field with four parts (one uint8_t each). It is best to use only the first (major) or first and second (major/minor) fields in the runtime code to determine binary compatibility, i.e., reject a data item only if its formatVersion contains an unrecognized major (or major/minor) version number. The remaining parts of the version should be used to indicate variations in the format that are backward compatible, or carry other information. For example, the current uprops.icu file's formatVersion (see the genprops tool and uchar.c/uprops.c) is set to indicate backward-incompatible changes with the major version number, backward-compatible additions with the minor version number, and shift width constants for the UTrie data structure in the third and fourth version numbers (these could change independently of the uprops.icu format).
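The versioning rule can be illustrated with a small acceptance check. This is a sketch under stated assumptions, not ICU code: the constant SKETCH_SUPPORTED_MAJOR, its value, and the function names are hypothetical, and the four-part version is modeled as a plain array of uint8_t rather than as part of a real data header. The check looks only at the major version, so data with a newer minor version or different third and fourth parts is still accepted.

```c
#include <assert.h>   /* assert */
#include <stdint.h>   /* uint8_t */

/* Hypothetical: the only major format version this reader understands. */
enum { SKETCH_SUPPORTED_MAJOR = 3 };

/* Returns 1 if the data item may be used, 0 if it must be rejected.
 * Only formatVersion[0] (major) decides binary compatibility here;
 * formatVersion[1..3] are treated as backward-compatible variations. */
int sketchIsAcceptable(const uint8_t formatVersion[4]) {
    return formatVersion[0] == SKETCH_SUPPORTED_MAJOR;
}

/* Usage example with made-up version numbers. */
void sketchVersionDemo(void) {
    const uint8_t sameMajorNewerMinor[4] = { 3, 7, 5, 2 };
    const uint8_t newerMajor[4]          = { 4, 0, 5, 2 };

    assert(sketchIsAcceptable(sameMajorNewerMinor));  /* accepted */
    assert(!sketchIsAcceptable(newerMajor));          /* rejected */
}
```

A reader that also depends on features added in a particular minor version could extend the test to compare formatVersion[1] against a minimum, which corresponds to the major/minor variant described above.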