Summary:
Ref T12819. Ferret currently does substring search, but this is not the default mode users expect: when you search for the "RICO" act, you do not expect to find documents containing "apRICOt", even though "RICO" is a substring.
To support term search, index the corpus as a list of terms with punctuation removed and whitespace normalized, so the engine can match against it.
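As a rough illustration of the term-corpus step (a Python sketch with a hypothetical `normalize_term_corpus()` helper, not the actual PHP indexer), the transformation is roughly "strip punctuation except intra-word apostrophes, then collapse whitespace":
```
import re

def normalize_term_corpus(raw_corpus):
    # Replace punctuation with spaces, keeping apostrophes so terms like
    # "Whom'st'dve" survive intact (matching the sample output below).
    stripped = re.sub(r"[^\w\s']", ' ', raw_corpus)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(stripped.split())

# normalize_term_corpus('Hark! Whom\'st\'dve eaten this "food" shall surely ~perish~?? #blessed')
#   -> "Hark Whom'st'dve eaten this food shall surely perish blessed"
```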
Test Plan:
Ran `storage upgrade`, ran `search index`, saw sensible database results:
```
rawCorpus: This is the task description.
Hark! Whom'st'dve eaten this "food" shall surely ~perish~?? #blessed
normalCorpus: thi the task descript hark whom dve eaten food shall sure perish bless
termCorpus: This is the task description Hark Whom'st'dve eaten this food shall surely perish blessed
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819
Differential Revision: https://secure.phabricator.com/D18498
Summary:
Ref T12819. Two minor improvements from live data:
- Tokenize in a UTF8-aware way (see the sketch after this list).
- When one document fails to index, kill the transaction explicitly (rather than leaving it hanging) so we don't cause other failures later.
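A minimal sketch of the UTF8-aware tokenization from the first item (assuming a regex over Unicode word characters is a fair stand-in for the PHP change; the helper name is hypothetical):
```
import re

def tokenize_utf8(corpus):
    # In Python 3, \w matches Unicode letters and digits, so multibyte text
    # like "café" or "日本語" comes back as whole terms rather than fragments.
    return re.findall(r"[\w']+", corpus)

# tokenize_utf8("café über 日本語 naïve")
#   -> ['café', 'über', '日本語', 'naïve']
```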
Test Plan: Created some UTF8 documents locally, indexed them, got clean results.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819
Differential Revision: https://secure.phabricator.com/D18487
Summary:
Ref T12819. I gave this stuff a sweet code name because all the terms related to "fulltext" and "search" already mean 5 different things. It, uh, ferrets out documents for you?
I'm building this to work a lot like the existing ngram index, which seems to work pretty well. If this sticks, it will auto-resolve the join issue (in T12443) by letting us do the entire thing locally in a JOIN and thus dodge a lot of mess.
This index gets built alongside other indexes, but only shows up in the UI if you have prototypes enabled. If you do, it appears under the existing fulltext field in Maniphest. No existing functionality is affected or disrupted.
NOTE: The query engine half of this is still EXTREMELY primitive, and this probably performs worse than the existing field for now. If this doesn't show obvious signs of being awful on `secure` I'll improve that in followup changes.
Test Plan:
Indexed my tasks, ran some simple queries, got the results I wanted, even for queries like "ko", "k", and "v0.1".
{F5147746}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12819, T12443
Differential Revision: https://secure.phabricator.com/D18484
Summary:
Ref T9979. This uses ngrams (specifically, trigrams) to build a reasonably efficient index for substring matching. For a package named "Example" with ID 123, we store rows like this:
```
< ex, 123>
<exa, 123>
<xam, 123>
<amp, 123>
<mpl, 123>
<ple, 123>
<le , 123>
```
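To make the trigram construction concrete, here's a small Python sketch (illustrative only, not the real PHP indexer) that reproduces the rows above by padding the name with spaces before slicing:
```
def build_ngrams(value, object_id, n=3):
    # Pad with spaces so word-boundary trigrams like " ex" and "le " are emitted.
    padded = ' %s ' % value.lower()
    return [(padded[i:i + n], object_id) for i in range(len(padded) - n + 1)]

for ngram, object_id in build_ngrams('Example', 123):
    print('<%s, %d>' % (ngram, object_id))
# Prints the seven rows shown above, from "< ex, 123>" through "<le , 123>".
```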
When the user searches for `exam`, we join this table for packages with tokens `exa` and `xam`. MySQL can do this a lot more efficiently than it can process a `LIKE "%exam%"` query against a huge table.
When the user searches for a one-letter or two-letter string, we only search the beginnings of words. This is probably what they want, it is the only thing we can do quickly, and it is reasonable/expected behavior for typeaheads.
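A hedged sketch of the query side, with a hypothetical `query_plan()` helper (the real logic lives in the PHP ngram query construction): long queries join once per trigram, while one- and two-character queries fall back to a word-prefix match:
```
def query_plan(search, n=3):
    search = search.lower()
    if len(search) < n:
        # Short queries only match the beginnings of words; one way to express
        # that is a prefix condition against stored ngrams, e.g. "ex" -> " ex".
        return ('prefix', ' ' + search)
    # Longer queries JOIN the ngram table once per trigram of the search string.
    return ('join', [search[i:i + n] for i in range(len(search) - n + 1)])

# query_plan('exam') -> ('join', ['exa', 'xam'])  -- the two joins described above
# query_plan('ex')   -> ('prefix', ' ex')
```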
Test Plan:
- Ran storage upgrades and search indexer.
- Searched for stuff with "name contains".
- Used the typeahead and got sensible results.
- Searched for `aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz` and saw only 16 joins.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14846