mirror of
https://we.phorge.it/source/phorge.git
synced 2024-11-10 00:42:41 +01:00
Document character encoding policies and how to fix mangled UTF8 files
Summary: See D431, where I promised to document this like 2 months ago. Document that: - Everything should be UTF-8. - ASCII is recommended. - How to identify and repair files which aren't valid UTF-8. - What to do if you're using some other encoding. Test Plan: Generated and read documentation. Reviewed By: codeblock Reviewers: edward, codeblock, jungejason, tuomaspelkonen, aran CC: aran, codeblock Differential Revision: 776
This commit is contained in:
parent
3b76dd11a9
commit
6eea500bbd
1 changed files with 62 additions and 0 deletions
62
src/docs/userguide/utf8.diviner
Normal file
62
src/docs/userguide/utf8.diviner
Normal file
|
@ -0,0 +1,62 @@
|
|||
@title User Guide: UTF-8 and Character Encoding
|
||||
@group userguide
|
||||
|
||||
How Phabricator handles character encodings.
|
||||
|
||||
= Overview =
|
||||
|
||||
Phabricator stores all internal text data as UTF-8, processes all text data
|
||||
as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally,
|
||||
this means that you should write your source code in UTF-8. In most cases this
|
||||
does not require you to change anything, because ASCII text is a subset of
|
||||
UTF-8.
|
||||
|
||||
= Detecting and Repairing Files =
|
||||
|
||||
It is recommended that you write source files only in ASCII text, but
|
||||
Phabricator fully supports UTF-8 source files. However, it won't currently do
|
||||
encoding transformation, so if you have source files which are not valid UTF-8
|
||||
you may run into issues.
|
||||
|
||||
If you have a project which isn't valid UTF-8 because a few files have random
|
||||
binary nonsense in them, there is a script in libphutil which can help you
|
||||
identify and fix them:
|
||||
|
||||
project/ $ libphutil/scripts/utils/utf8.php
|
||||
|
||||
Generally, run this script on all source files with "-t" to find files with bad
|
||||
byte ranges, and then run it without "-t" on each file to identify where there
|
||||
are problems. For example:
|
||||
|
||||
project/ $ find . -type f -name '*.c' -print0 | xargs -0 -n256 ./utf8 -t
|
||||
./hello_world.c
|
||||
|
||||
If this script exits without output, you're in good shape and all the files that
|
||||
were identified are valid UTF-8. If it found some problems, you need to repair
|
||||
them. You can identify the specific problems by omitting the "-t" flag:
|
||||
|
||||
project/ $ ./utf8.php hello_world.c
|
||||
FAIL hello_world.c
|
||||
|
||||
3 main()
|
||||
4 {
|
||||
5 printf ("Hello World<0xE9><0xD6>!\n");
|
||||
6 }
|
||||
7
|
||||
|
||||
This shows the offending bytes on line 5 (in the actual console display, they'll
|
||||
be highlighted). Often a codebase will mostly be valid UTF-8 but have a few
|
||||
scattered files that have other things in them, like curly quotes which someone
|
||||
copy-pasted from Word into a comment. In these cases, you can just manually
|
||||
identify and fix the problems pretty easily.
|
||||
|
||||
If you have a prohibitively large number of UTF-8 issues in your source code,
|
||||
Phabricator doesn't include any default tools to help you process them in a
|
||||
systematic way. You could hack up ##utf8.php## as a starting point, or use other
|
||||
tools to batch-process your source files.
|
||||
|
||||
NOTE: If you have a project which uses a //different encoding// for source
|
||||
files, there is no easy way to get it working with Phabricator or Arcanist right
|
||||
now. If it's not reasonable to switch to UTF-8, tell us more about your use case
|
||||
and we can evaluate supporting it. Since tools like Git don't work well with
|
||||
other encodings, the prevailing assumption is that this is a rare situation.
|
Loading…
Reference in a new issue