Mirror of https://we.phorge.it/source/phorge.git (synced 2024-12-24 22:40:55 +01:00)
Split Ferret engine strings for tokenization on any sequence of whitespace
Summary: Ref T12819. Currently, strings are split only on spaces, but newlines (and tabs, if present) should also split strings. Without this, we can fail to generate the proper term boundary tokens for words which begin at the start of a line or end at the end of a line.

Test Plan: Reindexed a document with "xyz\nabc", saw `"yz "` and `" ab"` term boundary tokens generate properly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18579
parent 4cae4a3b76
commit 7ea6de6e9c
1 changed file with 1 addition and 1 deletion
@@ -75,7 +75,7 @@ abstract class PhabricatorFerretEngine extends Phobject {
 
   public function tokenizeString($value) {
     $value = trim($value, ' ');
-    $value = preg_split('/ +/', $value);
+    $value = preg_split('/\s+/u', $value);
     return $value;
   }
 
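
To make the effect of the one-line change concrete, here is a minimal sketch that runs the old and new patterns against the input from the test plan. The function names `split_on_spaces` and `split_on_whitespace` are hypothetical standalone helpers written for illustration; they are not part of `PhabricatorFerretEngine`, and the sketch does not reproduce how the engine later builds term boundary tokens.

```php
<?php
// Hypothetical helpers for illustration only; not part of the Ferret engine.

// Old behavior: split on runs of the space character only.
function split_on_spaces($value) {
  $value = trim($value, ' ');
  return preg_split('/ +/', $value);
}

// New behavior: split on any run of whitespace (spaces, tabs, newlines).
// The "u" modifier makes PCRE treat the pattern and subject as UTF-8.
function split_on_whitespace($value) {
  $value = trim($value, ' ');
  return preg_split('/\s+/u', $value);
}

$input = "xyz\nabc";

// The newline is not a space, so the old pattern leaves one token.
echo json_encode(split_on_spaces($input)), PHP_EOL;     // ["xyz\nabc"]

// The new pattern splits at the newline, yielding two separate words.
echo json_encode(split_on_whitespace($input)), PHP_EOL; // ["xyz","abc"]
```

With the old pattern, "xyz" and "abc" stay glued together as a single string, so neither word has a boundary at the newline. One thing to keep in mind when reading the sketch: as in the diff, `trim($value, ' ')` still strips only spaces, so an input that begins or ends with a newline or tab would produce an empty first or last element from `preg_split`.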