mirror of
https://we.phorge.it/source/phorge.git
synced 2024-12-21 21:10:56 +01:00
Use unicode mode when tokenizing strings like user realnames
Summary:
Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, which can cause patterns like `\s` to work incorrectly.

Use `/u` to use unicode-aware tokenization instead.

Test Plan:
The behavior of "\s" depends upon environmental settings like LC_ALL.

With LC_ALL set to "C", `\xA0` is not considered a whitespace character. With LC_ALL set to "en_US", it is:

```
$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2
```

To reproduce the original issue, I added an explicit:

```
setlocale(LC_ALL, "en_US");
```

...call before the `preg_split()` call. This caused "忠" to be improperly split. I then added "/u", and observed proper tokenization.

Reviewers: chad

Reviewed By: chad

Subscribers: qiu8310

Maniphest Tasks: T9732

Differential Revision: https://secure.phabricator.com/D14441
parent 37df419266
commit 152ddf5709
1 changed file with 1 addition and 1 deletion
@@ -107,7 +107,7 @@ abstract class PhabricatorTypeaheadDatasource extends Phobject {
       return array();
     }
 
-    $tokens = preg_split('/\s+|[-\[\]]/', $string);
+    $tokens = preg_split('/\s+|[-\[\]]/u', $string);
     return array_unique($tokens);
   }
 
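The effect of the `/u` modifier in the change above can be sketched in isolation. The helper below is hypothetical and not part of Phabricator: the name `tokenize_unicode()`, the byte-mode fallback, and the removal of empty tokens are assumptions made for illustration. Only the regular expression is taken from the diff.

```
<?php

// Hypothetical helper (not Phabricator code): split a string on whitespace,
// hyphens, and square brackets using the pattern from the patched line.
function tokenize_unicode($string) {
  // With "/u", PCRE treats the pattern and subject as UTF-8, so "\s" is
  // matched against whole characters. The trailing 0xA0 byte of "忠"
  // (E5 BF A0) can no longer be mistaken for a whitespace character.
  $tokens = preg_split('/\s+|[-\[\]]/u', $string);

  // With "/u", preg_split() returns false when the subject is not valid
  // UTF-8. Falling back to byte mode is an assumption of this sketch.
  if ($tokens === false) {
    $tokens = preg_split('/\s+|[-\[\]]/', $string);
  }

  // Assumption: drop empty tokens left by adjacent delimiters, then
  // deduplicate and reindex.
  $tokens = array_filter($tokens, 'strlen');
  return array_values(array_unique($tokens));
}

// Prints "忠|Smith|Jones".
echo implode('|', tokenize_unicode("\xE5\xBF\xA0 Smith-Jones")) . "\n";
```

One design point worth noting: because `/u` rejects invalid UTF-8 outright, a caller that may receive arbitrary bytes probably needs some handling for the `false` return, whether a fallback as sketched here or an explicit error.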