Use unicode mode when tokenizing strings like user realnames

Summary:
Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, which can cause patterns like `\s` to work incorrectly.

Use `/u` to use unicode-aware tokenization instead.

Test Plan:
The behavior of "\s" depends upon environmental settings like LC_ALL.

With LC_ALL set to "C", `\xA0` is not considered a whitespace character. With LC_ALL set to "en_US", it is:

```
$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2
```

To reproduce the original issue, I added an explicit:

```
setlocale(LC_ALL, "en_US");
```

...call before the `preg_split()` call. This caused "忠" to be improperly split.

I then added "/u", and observed proper tokenization.

Reviewers: chad

Reviewed By: chad

Subscribers: qiu8310

Maniphest Tasks: T9732

Differential Revision: https://secure.phabricator.com/D14441
2025-02-01 09:28:22 +01:00 · 2015-11-08 05:36:42 -08:00
commit 152ddf5709
parent 37df419266
1 changed file with 1 addition and 1 deletion
--- a/src/applications/typeahead/datasource/PhabricatorTypeaheadDatasource.php
+++ b/src/applications/typeahead/datasource/PhabricatorTypeaheadDatasource.php
@@ -107,7 +107,7 @@ abstract class PhabricatorTypeaheadDatasource extends Phobject {
      return array();
    }

-    $tokens = preg_split('/\s+|[-\[\]]/', $string);
+    $tokens = preg_split('/\s+|[-\[\]]/u', $string);
    return array_unique($tokens);
  }
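
For reference, a minimal standalone sketch of the patched call, useful for checking the behavior outside Phabricator. The function name `example_tokenize()` is hypothetical and is not part of `PhabricatorTypeaheadDatasource`; only the regular expression is taken from the diff above. The input uses the same bytes as the test plan ("\xE5\xBF\xA0", the UTF-8 encoding of "忠").

```php
<?php

// Hypothetical helper for illustration only; it mirrors the patched
// preg_split() call but is not the actual Phabricator method.
function example_tokenize($string) {
  if (!strlen($string)) {
    return array();
  }

  // With "/u", the pattern and subject are interpreted as UTF-8, so the
  // regex engine matches against whole characters rather than single bytes.
  // The trailing byte "\xA0" of "忠" can therefore no longer match "\s".
  $tokens = preg_split('/\s+|[-\[\]]/u', $string);

  // array_values() reindexes the result after array_unique().
  return array_values(array_unique($tokens));
}

// "\xE5\xBF\xA0" is "忠"; it should survive as a single token.
echo implode('|', example_tokenize("\xE5\xBF\xA0 alice-smith")) . "\n";
```

One caveat when using `/u`: `preg_split()` returns `false` if the subject is not valid UTF-8, so callers that may receive arbitrary bytes might want to check for that case.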