Internationalization and Localization

Configuring i18n mode, using supported character sets, handling non-ASCII data

SUMMARY

This article explains how to configure a Perforce Server to run in internationalization mode and how to configure Perforce clients to work with different character sets. It also discusses problems you might encounter when handling Unicode or non-ASCII data in Perforce, and how to remedy them.


DETAILS

In Perforce there are two ways to work with multiple character sets depending on your requirements:
  1. If your filenames or Perforce metadata contain non-ASCII characters, then your Perforce administrator might need to consider switching your Perforce Server into unicode mode as described below. When running in unicode mode, all non-file data (identifiers, descriptions, and so on), as well as the content of all files of type “unicode”, are translated between the character set specified by the P4CHARSET variable on the client and UTF8 in the server.

    Before switching to unicode mode, verify that the character set you want to work with is supported. If the goal is to manage files that contain unicode characters, consider standardizing on either UTF8 or UTF16 encoding. Note that, starting with the 2007.2 release, Perforce adds a new UTF16 filetype (see the Release Notes) specifically to support UTF16 files in both non-unicode and unicode modes. To benefit from UTF16 support, all of your Perforce users need to be running 2007.2 versions of Perforce client programs.

    If you need to work on unicode files that contain characters saved in the users directory, syncing/submitting such files to/from a single client machine can become cumbersome, because extra steps (for instance, switching between different P4CHARSET values, installing additional code pages, and so on) are required to complete the task.

    Use the next option to store your unicode files if the above option does not meet your requirements.

  2. If the above option is not appropriate for your situation, then the unicode files can be added as binary files. This does make diffing such files a bit more difficult, because by default Perforce does not support diffing true binary files. However, if your binary files are true UTF8 files, then the default diff/merge tool in P4V correctly diffs them. In addition, P4Win/P4V users can also specify a third-party diff/merge tool for such files. Likewise, command line users can force the diff using the "-t" flag.

Switching the Perforce server into unicode mode

Before you use Perforce in a unicode environment, you must first instruct your Perforce Server to run in unicode mode. To set up your server to run in this mode, run:

p4d -xi

This command verifies that all existing metadata is valid UTF8 and sets a protected unicode counter to make sure that future invocations of p4d operate in unicode mode. Once set on the server, unicode mode cannot be deactivated (that is, you cannot return to non-unicode mode). After p4d -xi switches your server into unicode mode, you can invoke p4d with your usual flags.

Important:

Occasionally, when trying to switch the server to unicode mode with the p4d -xi command, the server responds with:

Table db.user has 14 rows with invalid UTF8.
Table db.domain has 1 rows with invalid UTF8.
...

Perforce server error:
Database has 14 tables with non-UTF8 text and can't be switched to Unicode mode.

To fix this problem, do the following:

  1. Take a checkpoint
  2. Open the new checkpoint in your text editor and save it in UTF8.
  3. Remove all db.* files
  4. Restore from the checkpoint
  5. Verify
  6. Try p4d -xi again

Or, you might also try:

  1. Take a checkpoint
  2. Try to find all of the high-ASCII characters in each table (rows in the checkpoint)
  3. Fix those
  4. Save the modified checkpoint
  5. Restore from the checkpoint
  6. Try p4d -xi again
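
Step 2 of the second procedure (finding the high-ASCII characters) can be partly automated. The following is a minimal Python sketch, not a Perforce tool: it assumes the checkpoint is a line-oriented text file and simply reports every line whose bytes do not decode as UTF8. The function name find_invalid_utf8_lines is made up for this illustration.

```python
def find_invalid_utf8_lines(path):
    """Return a list of (line_number, raw_bytes) for lines that are not valid UTF-8.

    The file is read in binary mode so that no implicit decoding takes place.
    Splitting on newlines is safe here because a multi-byte UTF-8 sequence
    never contains the newline byte.
    """
    bad_lines = []
    with open(path, "rb") as checkpoint:
        for line_number, raw in enumerate(checkpoint, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad_lines.append((line_number, raw.rstrip(b"\r\n")))
    return bad_lines
```

Calling find_invalid_utf8_lines on a copy of the checkpoint gives the line numbers to inspect in an editor. Work on a copy and keep the original checkpoint untouched.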

To convert to proper UTF8, you can also use any available character set conversion tool, such as "iconv" on Unix. Note that "iconv" might miss some German umlaut characters; use it with care and run p4 verify immediately after you use it.
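
As an illustration of what such a conversion does, here is a hedged Python sketch. It assumes that every line that is not already valid UTF8 was written in a single known legacy encoding (cp1252 is used as the default purely as an example); if that assumption is wrong for your data, the result will be wrong too. The names to_utf8 and convert_file are hypothetical.

```python
def to_utf8(raw, legacy_encoding="cp1252"):
    """Return raw as UTF-8 bytes, leaving already-valid UTF-8 unchanged.

    Bytes that are not valid UTF-8 are assumed to be in legacy_encoding.
    A byte that is undefined in that encoding raises UnicodeDecodeError,
    which is preferable to silently writing wrong data.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


def convert_file(source_path, target_path, legacy_encoding="cp1252"):
    """Write a UTF-8 copy of source_path to target_path, line by line."""
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        for raw in source:
            target.write(to_utf8(raw, legacy_encoding))
```

The conversion writes to a new file rather than editing in place, so the original checkpoint remains available if the result does not restore cleanly.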

User Notes

To use Perforce in a unicode environment, you must also set the P4CHARSET environment variable on your client machines. If it is not set, users of P4V or P4SCC.DLL are asked to choose their encoding when they first connect to a Unicode-enabled server, and other users receive a "Unicode server permits only unicode enabled clients" message. Be aware that mixing different encodings and, consequently, P4CHARSET settings on the same computer is likely to cause translation problems.

The following table lists a few of the most used (in the USA) P4CHARSET values:

Language            Platform    Windows code page  Unix locale  P4CHARSET setting
English/High-ASCII  Windows     1252               n/a          winansi
English/High-ASCII  UNIX/Linux  n/a                varies       iso8859-1/utf8
English/High-ASCII  Mac OS X    n/a                n/a          utf8
All/untranslated    All         n/a                n/a          utf8*
All                 All         n/a                n/a          utf16**


For the complete list of supported P4CHARSET values, run p4 help charset or visit: http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

* utf8 is untranslated, but the file content is validated.

** utf16 requires that P4COMMANDCHARSET be set to a different (non-utf16) charset for the p4 command line client to function, for example:

p4 -C utf16 -Q utf8 sync some_files
where "-C" is a command line flag for P4CHARSET and "-Q" is for P4COMMANDCHARSET.
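
For scripts that launch the command line client, the rule above can be encoded when the argument list is built. This is only a sketch: the helper name p4_argv is invented for illustration, and the only flags it uses are the "-C" and "-Q" flags described above.

```python
def p4_argv(command, args, charset, command_charset=None):
    """Build an argument list for the p4 command line client.

    Raises ValueError when charset is utf16 and no different (non-utf16)
    command charset is supplied, mirroring the requirement described above.
    """
    if charset == "utf16" and command_charset in (None, "utf16"):
        raise ValueError("utf16 requires a non-utf16 command charset (-Q)")
    argv = ["p4", "-C", charset]
    if command_charset is not None:
        argv += ["-Q", command_charset]
    return argv + [command] + list(args)
```

The resulting list can be passed to subprocess.run on a machine where the p4 client is installed.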

Note that both P4V and P4Win have a field in the Preferences dialog to reset P4CHARSET.

Setting P4CHARSET on Windows:

  1. Log in to Windows and open an MS-DOS command prompt.
  2. Confirm that you have a True Type (TT) or Open Type font.
  3. Display your active code page on Windows machines by issuing the chcp command. Windows displays a message like the following:
    Active code page: 1252
  4. Select the character set based on the active code page as follows:

    Code page    Set P4CHARSET to
    1252         winansi
    932          shiftjis


    To set P4CHARSET for all users on this workstation, you need Administrator privileges. Issue the following command:

    p4 set -s P4CHARSET=[character_set]
    

    If you do not have Administrator privileges, you can use:

    p4 set P4CHARSET=[character_set]
    

    to set P4CHARSET for the user currently logged in. Other users on the same machine have to set P4CHARSET independently.
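
Steps 3 and 4 above amount to a table lookup. The Python sketch below shows the idea under two assumptions: that the chcp output ends with the code page number, as in the sample message above, and that only the two code pages listed in this article are of interest. The names CODE_PAGE_TO_P4CHARSET and p4charset_from_chcp are hypothetical.

```python
import re

# Only the two mappings given in this article; extend as needed
# after checking the list of supported P4CHARSET values.
CODE_PAGE_TO_P4CHARSET = {
    1252: "winansi",
    932: "shiftjis",
}


def p4charset_from_chcp(chcp_output):
    """Return the P4CHARSET value for chcp output, or None if unknown."""
    match = re.search(r"(\d+)\s*$", chcp_output)
    if match is None:
        return None
    return CODE_PAGE_TO_P4CHARSET.get(int(match.group(1)))
```

A return value of None means the code page is not in the table, in which case the complete list of supported values should be consulted rather than guessing.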

Setting P4CHARSET on UNIX:

Set P4CHARSET to the proper value from a command shell or in a startup script such as .kshrc, .cshrc, or .profile. You can determine the proper value for P4CHARSET by examining the current setting of the LANG or LOCALE environment variable.

Sample $LANG value    Set P4CHARSET to
en_US.UTF-8           utf8
ja_JP.EUC             eucjp
ja_JP.PCK             shiftjis
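
In a startup or wrapper script, the same lookup can be done programmatically. The following sketch covers only the three sample values in the table above; locale names vary between systems, so treat the mapping as an assumption to be adjusted. The names LANG_TO_P4CHARSET and p4charset_from_lang are made up for this example.

```python
import os

# Sample values from the table above; not an exhaustive list.
LANG_TO_P4CHARSET = {
    "en_US.UTF-8": "utf8",
    "ja_JP.EUC": "eucjp",
    "ja_JP.PCK": "shiftjis",
}


def p4charset_from_lang(lang=None):
    """Return the P4CHARSET value for a LANG value, or None if unknown.

    When lang is omitted, the LANG environment variable is used.
    """
    if lang is None:
        lang = os.environ.get("LANG", "")
    return LANG_TO_P4CHARSET.get(lang)
```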

Setting P4CHARSET on Mac OS X:

Set P4CHARSET to the proper value either from a command shell, for example (in bash):

export P4CHARSET=utf8

or in the "environment.plist" file, which resides in the ~/.MacOSX directory.

If P4CHARSET is not set in the environment, P4V users are prompted to select a setting from a drop-down list when they first connect to a Unicode-enabled server.

Possible problems encountered running in unicode mode

“Cannot translate” error message

This message is displayed if your client machine is configured with a character set that does not include characters being sent to it by the Perforce Server. Your client machine cannot display unmapped characters.

For example, if your client machine is configured to use the shift-JIS character set and your depot contains files named using characters from the Japanese EUC character set that do not have mappings in shift-JIS, you see the "Cannot translate..." error message when you execute a p4 files or p4 changes command that lists those files.
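
The underlying cause, a character that has no mapping in the target character set, can be demonstrated outside Perforce. The sketch below uses Python's own codecs as a stand-in; the codec names are Python's, not P4CHARSET values, and the way Python reports the failure is not how the Perforce Server reports it. The function name untranslatable is hypothetical.

```python
def untranslatable(text, codec):
    """Return the characters in text that cannot be encoded with codec."""
    missing = []
    for character in text:
        try:
            character.encode(codec)
        except UnicodeEncodeError:
            missing.append(character)
    return missing


# A Japanese hiragana character (U+3042) has no mapping in Windows code
# page 1252, while the accented Latin letter (U+00E9) does.
print(untranslatable("caf\u00e9 \u3042", "cp1252"))
```

Running a check like this over depot filenames, with the codec that corresponds to a client's character set, indicates which names that client would be unable to display.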

Length limit for Unicode Perforce identifiers

The Perforce Server has internal limits on the lengths of strings used to index job descriptions, specify filenames, control view mappings, and identify client names, label names, and other objects.

The most common limit is 1024 bytes. Because some characters in Unicode can expand to more than one byte, it is possible for certain Unicode entries to exceed Perforce internal limits.

Because no basic Unicode character expands to more than three bytes, dividing the Perforce internal limit by three ensures that no Unicode sequence exceeds the limit.

To ensure that no Unicode sequence exceeds the Perforce limit, do not create client names or view patterns that exceed 341 Unicode characters (1024 divided by 3, rounded down).

Under normal usage conditions, this length limit is not expected to pose a significant limitation.
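
The arithmetic above can be checked directly by measuring the encoded length of an identifier. This is a small sketch that assumes the 1024-byte limit mentioned above applies to the UTF8-encoded form; the names LIMIT_BYTES, utf8_length, and fits_limit are invented for this example.

```python
LIMIT_BYTES = 1024  # the most common internal limit cited above


def utf8_length(identifier):
    """Return the number of bytes identifier occupies when encoded as UTF-8."""
    return len(identifier.encode("utf-8"))


def fits_limit(identifier, limit=LIMIT_BYTES):
    """Return True if the UTF-8 encoding of identifier fits within limit bytes."""
    return utf8_length(identifier) <= limit


# 341 three-byte characters occupy 1023 bytes; one more exceeds the limit.
three_byte_char = "\u3042"
print(fits_limit(three_byte_char * 341))  # True
print(fits_limit(three_byte_char * 342))  # False
```

Measuring the encoded length is more precise than counting characters, because ASCII characters occupy one byte each and characters outside the basic range occupy four bytes in UTF8.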

Possible problems encountered using unicode filetype with a non-unicode server

With a server not running in internationalized mode, the Perforce "unicode" filetype behaves quite differently. The client and server both assume that a file is valid UTF8 and store it as such. The server does not attempt to translate or verify the content of the file in any way. It is therefore imperative that files be saved as UTF8, using an editor that supports it, before they are submitted to Perforce. Apart from this requirement, users can access the Perforce server normally. There is no need to set P4CHARSET on the client.
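
Because the server does not verify the content in this mode, a check on the client side before submitting can catch files that were not saved as UTF8. The following is a minimal sketch of such a check, not a Perforce feature; the function name first_invalid_utf8_offset is hypothetical.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    data must be a bytes object, for example the result of reading a file
    opened in binary mode.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None
```

A result of None means the bytes decode cleanly as UTF8; any other result is the position at which to start looking in the file.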

Newlines are not correctly saved

A user checked the file in as UTF16 instead of UTF8. Roll back to an earlier revision, or resave the file as UTF8.
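
One way to spot such a file before it is submitted is to look for a UTF16 byte order mark at the start of the file. The sketch below assumes the file begins with a byte order mark, which is common for UTF16 files but not guaranteed, so a file without one is not detected. The function names looks_like_utf16 and utf16_to_utf8 are made up for this example.

```python
import codecs


def looks_like_utf16(data):
    """Return True if data starts with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def utf16_to_utf8(data):
    """Convert UTF-16 bytes (with a byte order mark) to UTF-8 bytes.

    The "utf-16" codec reads the byte order mark to determine endianness
    and removes it from the decoded text.
    """
    return data.decode("utf-16").encode("utf-8")
```

Reading the file in binary mode, testing it with looks_like_utf16, and writing the output of utf16_to_utf8 back out corresponds to resaving the file as UTF8 in an editor.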