Summary:
Ref T12819. Ferret currently does substring search, but this is not the default mode users expect: when you search for the "RICO" act, you do not expect to find documents containing "apRICOt", even though "RICO" is a substring.
To support term search, index the corpus as a list of terms with punctuation removed and whitespace normalized, so the engine can match against it.
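As a rough illustration of the term-corpus step (a Python sketch with a hypothetical `normalize_term_corpus()` helper, not the actual PHP indexer), the transformation is roughly "strip punctuation except intra-word apostrophes, then collapse whitespace":
```
import re

def normalize_term_corpus(raw_corpus):
    # Replace punctuation with spaces, keeping apostrophes so terms like
    # "Whom'st'dve" survive intact (matching the sample output below).
    stripped = re.sub(r"[^\w\s']", ' ', raw_corpus)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(stripped.split())

# normalize_term_corpus('Hark! Whom\'st\'dve eaten this "food" shall surely ~perish~?? #blessed')
#   -> "Hark Whom'st'dve eaten this food shall surely perish blessed"
```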
Test Plan:
Ran `storage upgrade`, ran `search index`, saw sensible database results:
```
rawCorpus: This is the task description.
Hark! Whom'st'dve eaten this "food" shall surely ~perish~?? #blessed
normalCorpus: thi the task descript hark whom dve eaten food shall sure perish bless
termCorpus: This is the task description Hark Whom'st'dve eaten this food shall surely perish blessed
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819
Differential Revision: https://secure.phabricator.com/D18498
Summary:
Ref T12819. Two minor improvements from live data:
- Tokenize in a UTF8-aware way (see the sketch after this list).
- When one document fails to index, kill the transaction explicitly (rather than leaving it hanging) so we don't cause other failures later.
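A minimal sketch of the UTF8-aware tokenization from the first item (assuming a regex over Unicode word characters is a fair stand-in for the PHP change; the helper name is hypothetical):
```
import re

def tokenize_utf8(corpus):
    # In Python 3, \w matches Unicode letters and digits, so multibyte text
    # like "café" or "日本語" comes back as whole terms rather than fragments.
    return re.findall(r"[\w']+", corpus)

# tokenize_utf8("café über 日本語 naïve")
#   -> ['café', 'über', '日本語', 'naïve']
```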
Test Plan: Created some UTF8 documents locally, indexed them, got clean results.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819
Differential Revision: https://secure.phabricator.com/D18487
Summary:
Ref T12819. I gave this stuff a sweet code name because all the terms related to "fulltext" and "search" already mean 5 different things. It, uh, ferrets out documents for you?
I'm building this to work a lot like the existing ngram index, which seems to work pretty well. If this sticks, it will auto-resolve the join issue (in T12443) by letting us do the entire thing locally in a JOIN and thus dodge a lot of mess.
This index gets built alongside other indexes, but only shows up in the UI if you have prototypes enabled. If you do, it appears under the existing fulltext field in Maniphest. No existing functionality is affected or disrupted.
NOTE: The query engine half of this is still EXTREMELY primitive, and this probably performs worse than the existing field for now. If this doesn't show obvious signs of being awful on `secure` I'll improve that in followup changes.
Test Plan:
Indexed my tasks, ran some simple queries, got the results I wanted, even for queries like "ko", "k", and "v0.1".
{F5147746}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819, T12443
Differential Revision: https://secure.phabricator.com/D18484
Summary:
Ref T9979. This uses ngrams (specifically, trigrams) to build a reasonably efficient index for substring matching. For a package named "Example" with ID 123, we store rows like this:
```
< ex, 123>
<exa, 123>
<xam, 123>
<amp, 123>
<mpl, 123>
<ple, 123>
<le , 123>
```
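To make the trigram construction concrete, here's a small Python sketch (illustrative only, not the real PHP indexer) that reproduces the rows above by padding the name with spaces before slicing:
```
def build_ngrams(value, object_id, n=3):
    # Pad with spaces so word-boundary trigrams like " ex" and "le " are emitted.
    padded = ' %s ' % value.lower()
    return [(padded[i:i + n], object_id) for i in range(len(padded) - n + 1)]

for ngram, object_id in build_ngrams('Example', 123):
    print('<%s, %d>' % (ngram, object_id))
# Prints the seven rows shown above, from "< ex, 123>" through "<le , 123>".
```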
When the user searches for `exam`, we join this table for packages with tokens `exa` and `xam`. MySQL can do this a lot more efficiently than it can process a `LIKE "%exam%"` query against a huge table.
When the user searches for a one-letter or two-letter string, we only search the beginnings of words. This is probably what they want, it is the only thing we can do quickly, and it is reasonable/expected behavior for typeaheads.
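A hedged sketch of the query side, with a hypothetical `query_plan()` helper (the real logic lives in the PHP ngram query construction): long queries join once per trigram, while one- and two-character queries fall back to a word-prefix match:
```
def query_plan(search, n=3):
    search = search.lower()
    if len(search) < n:
        # Short queries only match the beginnings of words; one way to express
        # that is a prefix condition against stored ngrams, e.g. "ex" -> " ex".
        return ('prefix', ' ' + search)
    # Longer queries JOIN the ngram table once per trigram of the search string.
    return ('join', [search[i:i + n] for i in range(len(search) - n + 1)])

# query_plan('exam') -> ('join', ['exa', 'xam'])  -- the two joins described above
# query_plan('ex')   -> ('prefix', ' ex')
```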
Test Plan:
- Ran storage upgrades and search indexer.
- Searched for stuff with "name contains".
- Used the typeahead and got sensible results.
- Searched for `aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz` and saw only 16 joins.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14846