mirror of https://we.phorge.it/source/phorge.git synced 2025-02-02 01:48:23 +01:00
Commit graph

15 commits

epriestley
1de130c9f5 Allow the Ferret engine to remove "common" ngrams from the index
Summary:
Ref T13000. This adds support for tracking "common" ngrams, which occur in too many documents to be useful as part of the ngram index.

If an ngram is listed in the "common" table, it won't be written when indexing documents, or queried for when searching for them.

In this change, nothing actually writes to the "common" table. I'll start writing to the table in a followup change.

Specifically, I plan to do this:

  - A new GC process updates the "common" table periodically, by writing ngrams which appear in more than X% of documents to it, for some value of X, if there are at least a minimum number of documents (maybe like 4,000).
  - A new GC process deletes ngrams that have been added to the common table from the existing indexes.

Hopefully, this will pare down the ngrams index to something reasonable over time without requiring any manual tuning.
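
A minimal sketch of the planned GC step, assuming hypothetical table names (`ferret_document`, `ferret_ngram`, `ferret_common_ngram`) and an arbitrary 15% threshold rather than the actual schema or tuning:

```
<?php
// Illustrative sketch only: hypothetical table/column names and thresholds,
// not the actual Phabricator schema. Promotes ngrams that appear in more than
// X% of documents into a "common" table, then strips them from the index.

$pdo = new PDO('mysql:host=localhost;dbname=phabricator_search', 'user', 'pass');

$minDocuments   = 4000;  // don't bother until the index is reasonably large
$commonFraction = 0.15;  // hypothetical value of X

$total = (int)$pdo->query('SELECT COUNT(*) FROM ferret_document')->fetchColumn();
if ($total >= $minDocuments) {
  $threshold = (int)($total * $commonFraction);

  // Promote frequent ngrams into the "common" table...
  $stmt = $pdo->prepare(
    'INSERT IGNORE INTO ferret_common_ngram (ngram)
       SELECT ngram FROM ferret_ngram
       GROUP BY ngram
       HAVING COUNT(DISTINCT documentID) > ?');
  $stmt->execute([$threshold]);

  // ...then delete them from the existing per-document index.
  $pdo->exec(
    'DELETE n FROM ferret_ngram n
       JOIN ferret_common_ngram c ON n.ngram = c.ngram');
}
```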

Test Plan:
  - Ran some queries and indexes.
  - Manually inserted ngrams `xxx` and `yyy` into the common ngrams table, searched and indexed, and saw them ignored as viable ngrams for search/index.

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T13000

Differential Revision: https://secure.phabricator.com/D18672
2017-10-03 13:27:42 -07:00
epriestley
086a125ad5 Improve performance of Ferret engine ngram extraction, particularly for large input strings
Summary:
See PHI87. Ref T12974. The `array_slice()` method of splitting the string apart can perform poorly for large input strings. I think this is mostly the sheer number of calls, plus the nontrivial cost of building and returning an array on each one.

We can just use `substr()` instead, as long as we're a little bit careful about keeping track of where we're slicing the string if it has UTF8 characters.
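
A minimal sketch of the `substr()` approach, tracking a byte offset so multibyte characters slice correctly (illustrative only, not the actual engine code):

```
<?php
// Extract trigrams with substr() while keeping a byte offset, instead of
// array_slice() + implode() per ngram.

function extract_trigrams($string) {
  $ngrams = array();

  // Split into UTF8 characters (a stand-in for phutil_utf8v() in libphutil).
  $characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

  // Record the byte length of each character so we can slice by bytes.
  $lengths = array();
  foreach ($characters as $character) {
    $lengths[] = strlen($character);
  }

  $offset = 0;  // current byte offset into the original string
  $count = count($characters);
  for ($ii = 0; $ii + 3 <= $count; $ii++) {
    // Slice three characters' worth of bytes starting at the tracked offset.
    $bytes = $lengths[$ii] + $lengths[$ii + 1] + $lengths[$ii + 2];
    $ngrams[substr($string, $offset, $bytes)] = true;
    $offset += $lengths[$ii];
  }

  return array_keys($ngrams);
}
```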

Test Plan:
  - Created a task with a single, unbroken blob of base64 encoded data as the description, roughly 100KB long.
  - Saw indexing performance improve from ~6s to ~1.5s after patch.
  - Before: https://secure.phabricator.com/xhprof/profile/PHID-FILE-nrxs4lwdvupbve5lhl6u/
  - After: https://secure.phabricator.com/xhprof/profile/PHID-FILE-6vs2akgjj5nbqt7yo7ul/

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T12974

Differential Revision: https://secure.phabricator.com/D18649
2017-09-27 10:41:39 -07:00
epriestley
a1d9a2389d Improve Ferret engine indexing performance for large blocks of text
Summary:
See PHI87. Ref T12974. Currently, we do a lot more work here than we need to: we call `phutil_utf8_strtolower()` on each token, but can do it once at the beginning on the whole block.

Additionally, since ngrams don't care about order, we only need to convert unique tokens into ngrams. This saves us some `phutil_utf8v()` calls, which can be slow for large inputs.
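
An illustrative sketch of this optimization (hypothetical helper names, not the real engine code): lowercase the whole block once, then only build ngrams for unique tokens.

```
<?php
// Byte-based, ASCII-only stand-in for real trigram extraction; pads the token
// so boundary ngrams like " ta" and "sk " are included.
function get_term_ngrams($token) {
  $token = ' '.$token.' ';
  $ngrams = array();
  for ($ii = 0; $ii + 3 <= strlen($token); $ii++) {
    $ngrams[] = substr($token, $ii, 3);
  }
  return $ngrams;
}

function get_block_ngrams($corpus) {
  // One lowercase pass over the entire corpus, instead of one call per token.
  $corpus = mb_strtolower($corpus, 'UTF-8');

  // Ngrams don't care about token order or repetition, so each distinct token
  // only needs to be converted once.
  $tokens = preg_split('/\s+/u', $corpus, -1, PREG_SPLIT_NO_EMPTY);
  $tokens = array_unique($tokens);

  $ngrams = array();
  foreach ($tokens as $token) {
    foreach (get_term_ngrams($token) as $ngram) {
      $ngrams[$ngram] = true;
    }
  }
  return array_keys($ngrams);
}
```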

Test Plan:
  - Created a ~4MB task description.
  - Ran `bin/search index Txxx --profile ...` to profile indexing performance before and after the change.
  - Saw total runtime drop from 38s to 9s.
  - Before: <https://secure.phabricator.com/xhprof/profile/PHID-FILE-wiht5d7lkyazaywwxovw/>
  - After: <https://secure.phabricator.com/xhprof/profile/PHID-FILE-efxv56q2hulr6kjrxbx6/>

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T12974

Differential Revision: https://secure.phabricator.com/D18647
2017-09-27 08:15:40 -07:00
epriestley
7ea6de6e9c Split Ferret engine strings for tokenization on any sequence of whitespace
Summary:
Ref T12819. Currently, strings are split only on spaces, but newlines (and tabs, if present) should also split strings.

Without this, we can fail to get the proper term boundary tokens for words which begin at the start of a line or end at the end of a line.
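
A small sketch of the intended behavior (illustrative, not the literal engine code): split on any run of whitespace, then pad each term with spaces so boundary ngrams like `"yz "` and `" ab"` are generated.

```
<?php
$corpus = "xyz\nabc";

// Splitting only on ' ' would keep "xyz\nabc" as one token; \s+ splits it.
$terms = preg_split('/\s+/u', $corpus, -1, PREG_SPLIT_NO_EMPTY);

foreach ($terms as $term) {
  $padded = ' '.$term.' ';
  for ($ii = 0; $ii + 3 <= strlen($padded); $ii++) {
    echo substr($padded, $ii, 3), "\n";  // " xy", "xyz", "yz ", " ab", ...
  }
}
```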

Test Plan: Reindexed a document with "xyz\nabc", saw `"yz "` and `" ab"` term boundary tokens generate properly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18579
2017-09-08 09:39:57 -07:00
epriestley
2218caee0f Reduce the amount of boilerplate that implementing FerretInterface requires
Summary:
See brief discussion in D18554. All the index tables are the same for every application (and, at this point, seem unlikely to change) and we never actually pass these objects around (they're only used internally).

In some other cases (like Transactions) not every application has the same tables (for example, Differential has an extra field for inline comments), and/or we pass the objects around (lots of stuff uses `$xactions` directly).

However, in this case (as with Edges), most of the code never interacts directly with any representation of the database state, and that representation doesn't change from application to application.

Just automatically define document, field, and ngram tables for anything which implements `FerretInterface`. This makes the query and index logic a tiny bit messier but lets us delete a ton of boilerplate classes.
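
A hypothetical illustration of this idea (the class and method names are assumptions, not the real Phabricator classes; the `_ffield` suffix matches the `maniphest_task_ffield` table shown elsewhere in this log, while `_fdocument` and `_fngrams` are assumed):

```
<?php
// Derive the document/field/ngram table names from the object implementing
// the fulltext interface, instead of declaring one table class per
// application.

interface DemoFerretInterface {
  public function getFerretTablePrefix();
}

final class DemoTask implements DemoFerretInterface {
  public function getFerretTablePrefix() {
    return 'maniphest_task';
  }
}

function get_ferret_table_names(DemoFerretInterface $object) {
  $prefix = $object->getFerretTablePrefix();
  // Every application gets the same three tables, so they can be derived
  // automatically rather than maintained as boilerplate classes.
  return array(
    'document' => $prefix.'_fdocument',
    'field'    => $prefix.'_ffield',
    'ngrams'   => $prefix.'_fngrams',
  );
}

print_r(get_ferret_table_names(new DemoTask()));
```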

Test Plan: Indexed objects, searched for objects. Same results as before with much less code. Ran `bin/storage upgrade`, got a clean bill of health.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18559
2017-09-07 13:23:31 -07:00
epriestley
3ff9d4a4ca Support Ferret engine for searching users
Summary:
Ref T12819. Adds support for indexing user accounts so they appear in global fulltext results.

Also, always rank users ahead of other results.

Test Plan: Indexed users. Searched for a user, got that user.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18552
2017-09-07 13:22:12 -07:00
epriestley
a2a2b3f7f4 Sort global fulltext results by overall relevance
Summary:
Ref T12819. Currently, under the Ferret engine, we query each application's index separately and then aggregate the results.

At the moment, results are aggregated by type first, then by actual rank. For example, all the revisions appear first, then all the tasks.

Instead, surface the internal ranking data from the underlying query and sort by it.
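
A sketch of the new ordering (the result structure is an assumption, not the actual query classes): results from all applications are sorted together by their underlying relevance rank, rather than grouped by type first.

```
<?php
$results = array(
  array('type' => 'TASK', 'title' => 'A B', 'rank' => 2.0),
  array('type' => 'DREV', 'title' => 'A',   'rank' => 1.0),
  array('type' => 'TASK', 'title' => 'B',   'rank' => 0.5),
);

usort($results, function ($u, $v) {
  // Higher rank sorts first; the result type no longer matters for ordering.
  return $v['rank'] <=> $u['rank'];
});

print_r($results);
```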

Test Plan: Searched for "A B" with a task named "A B" and a revision named "A". Saw task first. Broadly, saw mixed task and revision order in result sets.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18551
2017-09-07 13:21:58 -07:00
epriestley
8059db894d Use the Ferret engine fulltext document table to drive auxiliary fulltext constraints
Summary:
Ref T12819. I started trying to get individual engines to drive these constraints (e.g., `ManiphestTaskQuery` can do most of the work) but this is a big pain, especially since most engines don't support "any owner" or "no owner", and not everything has an owner, and so on and so on. Going down this path would have meant a huge pile of stub functions everywhere, I think.

Instead, drive these through the main engine using the fulltext document table, which already has everything we need to apply these constraints in a uniform way.

Also tweak some parts of query construction and result ordering.
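
A hypothetical sketch of driving these constraints from the shared fulltext document table (column names are assumptions): "any owner" and "no owner" work uniformly, with no per-application stub methods.

```
<?php
function build_owner_constraint($owner_mode, array $owner_phids) {
  switch ($owner_mode) {
    case 'none':
      return 'doc.ownerPHID IS NULL';
    case 'any':
      return 'doc.ownerPHID IS NOT NULL';
    default:
      $quoted = array_map(
        function ($phid) { return "'".addslashes($phid)."'"; },
        $owner_phids);
      return 'doc.ownerPHID IN ('.implode(', ', $quoted).')';
  }
}

// All three modes produce a WHERE fragment against the same document table.
echo build_owner_constraint('none', array()), "\n";
echo build_owner_constraint('any', array()), "\n";
echo build_owner_constraint('phids', array('PHID-USER-abcd')), "\n";
```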

Test Plan: Searched for documents by author, owner, unowned, any owner, tags, subscribers, fulltext in global search. Got sensible results without any application-specific code.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18550
2017-09-07 13:21:42 -07:00
epriestley
4ea677ba97 Skeleton support for running global fulltext queries via the Ferret engine
Summary:
Ref T12819. Provides a Ferret-engine-based fulltext engine to ultimately replace the InnoDB fulltext engine.

This is still pretty basic (hard-coded and buggy) but technically sort of works.

To activate this, you must explicitly configure it, so it isn't visible to users yet.

Test Plan: Searched for objects with global fulltext search, got a mixture of matching revisions and tasks back.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18548
2017-09-06 13:15:36 -07:00
epriestley
20aad35e60 Move Ferret engine "title:..." field definitions to the engine itself
Summary: Ref T12819. Move these out of the core engine into the Ferret engine. In the future, different applications can define different functions, like "summary:..." or whatever. This may get more formalization if I do "author:" and similar functions some time down the road.
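
An illustrative sketch of per-engine field functions (assumed names, not the real engine API): the engine declares the functions it supports, and unknown functions produce a useful error instead of being silently ignored. The `titl` field key mirrors the database output shown elsewhere in this log.

```
<?php
function parse_function_query(array $engine_functions, $query) {
  $terms = array();
  foreach (preg_split('/\s+/', trim($query)) as $term) {
    if (preg_match('/^(\w+):(.*)\z/', $term, $matches)) {
      $function = $matches[1];
      if (!isset($engine_functions[$function])) {
        // e.g. searching for "dog:..." raises this error.
        throw new Exception(
          sprintf(
            'Unknown function "%s:". Supported functions are: %s.',
            $function,
            implode(', ', array_keys($engine_functions))));
      }
      $terms[] = array(
        'field' => $engine_functions[$function],
        'value' => $matches[2],
      );
    } else {
      $terms[] = array('field' => null, 'value' => $term);
    }
  }
  return $terms;
}

$functions = array('title' => 'titl');
print_r(parse_function_query($functions, 'title:quack ferret'));
```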

Test Plan: Searched for "title:...". Searched for "dog:...", got a useful error.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18536
2017-09-05 11:57:51 -07:00
epriestley
46abc11114 Reduce the number of magic strings in the Ferret implementation
Summary:
Ref T12819. Push more of the magic `' '` stuff into the engine and simplify calls to ngram construction.

Also fixes a bug where a task with title "apple banana" and description "cherry doughnut" could match the query "banana cherry", by separating term segments with newlines instead of spaces.
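
A small sketch of why the separator matters (illustrative only): joining field segments with a space lets a substring query span the field boundary, while a newline separator does not.

```
<?php
$title = 'apple banana';
$description = 'cherry doughnut';

$joined_with_space   = $title.' '.$description;   // "apple banana cherry doughnut"
$joined_with_newline = $title."\n".$description;

// The query "banana cherry" incorrectly matches across the boundary here...
var_dump(strpos($joined_with_space, 'banana cherry') !== false);    // true
// ...but not when the segments are separated by a newline.
var_dump(strpos($joined_with_newline, 'banana cherry') !== false);  // false
```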

Test Plan:
  - Indexed some objects.
  - Searched (term, substring, quoted terms).
  - Viewed index in database.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18534
2017-09-05 11:57:35 -07:00
epriestley
4a7593f47f Consolidate more Ferret engine code into FerretEngine
Summary: Ref T12819. Earlier I separated some ngram code into an "ngram engine" hoping to share it across the simple Ngrams stuff and the full Ferret stuff, but they actually use slightly different rules. Just pull more of this stuff into FerretEngine to reduce the number of moving pieces and the amount of code duplication.

Test Plan: Searched for terms, rebuilt indexes.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18533
2017-09-05 11:57:18 -07:00
epriestley
0e2e525bb4 Add a "terms" corpus to Ferret fields
Summary:
Ref T12819. Ferret currently does substring search, but this is not the default mode users expect: when you search for the "RICO" act, you do not expect to find documents containing "apRICOt" even though "RICO" is a substring.

To support term search, index the corpus as a list of terms with punctuation removed and whitespace normalized so the engine can match against it.
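
A sketch of building a terms corpus under assumed normalization rules (not the engine's exact rules): strip punctuation, collapse whitespace, and pad with spaces so a term query matches "RICO" but not "apRICOt".

```
<?php
function build_term_corpus($raw) {
  // Keep word characters and apostrophes; drop other punctuation.
  $terms = preg_replace('/[^\w\']+/u', ' ', $raw);
  $terms = trim(preg_replace('/\s+/u', ' ', $terms));
  return ' '.$terms.' ';
}

$doc = 'Fresh apricot jam.';
$term_corpus = build_term_corpus($doc);   // " Fresh apricot jam "

// Substring search matches "RICO" inside "apricot"; a term search for
// " rico " against the lowercased term corpus does not.
var_dump(stripos($doc, 'RICO') !== false);                        // true
var_dump(strpos(strtolower($term_corpus), ' rico ') !== false);   // false
```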

Test Plan:
Ran `storage upgrade`, ran `search index`, saw sensible database results:

```
   rawCorpus: This is the task description.

Hark! Whom'st'dve eaten this "food" shall surely ~perish~?? #blessed
normalCorpus: thi the task descript hark whom dve eaten food shall sure perish bless
  termCorpus:  This is the task description Hark Whom'st'dve eaten this food shall surely perish blessed
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18498
2017-08-30 11:29:14 -07:00
epriestley
77ef38f9a8 Aggregate corpus data in Ferret field rows
Summary:
Ref T12819. This addresses two issues:

  - One practical issue is that right now, if you search for "dog cat" and the terms appear in different fields (for example, "dog" appears ONLY in the title, while "cat" appears ONLY in a comment), we won't find the document. This is somewhat rare -- usually, if "dog" appears in the title, it's also repeated in the description -- but I think it's clearly a bug. To attack this, start automatically creating a virtual "ALL" field with the full document text which we'll use as the primary thing we match against.
  - For fields which may occur more than once -- today, only comments -- aggregate them all into one big "all of the text" row instead of writing one row per comment. This partly addresses the first point ("dog" in one comment and "cat" in a different comment won't be found) and partly makes some of the query gymnastics easier.
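
A minimal sketch of this aggregation. The `full`, `titl`, and `body` field keys match the database output in the Test Plan below; the `cmnt` key and the helper itself are hypothetical.

```
<?php
function build_field_map($title, $description, array $comments) {
  $fields = array(
    'titl' => $title,
    'body' => $description,
  );

  // One aggregate row for all comments instead of one row per comment, so
  // "dog" in one comment and "cat" in another still match the same row.
  if ($comments) {
    $fields['cmnt'] = implode("\n", $comments);
  }

  // The virtual "full" field carries the entire document text, so terms that
  // appear only in different fields can still match together.
  $fields['full'] = implode("\n", $fields);

  return $fields;
}

print_r(build_field_map(
  'This is the task title',
  'This is the task description.',
  array()));
```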

Test Plan:
Ran `bin/storage upgrade`, ran `bin/search index <Txxx>`, saw sensible corpus values in the database:

```
mysql> select * from maniphest_task_ffield\G
*************************** 1. row ***************************
          id: 3
  documentID: 1981
    fieldKey: full
   rawCorpus: This is the task title
This is the task description.
normalCorpus: thi the task titl
thi the task descript
*************************** 2. row ***************************
          id: 4
  documentID: 1981
    fieldKey: titl
   rawCorpus: This is the task title
normalCorpus: thi the task titl
*************************** 3. row ***************************
          id: 5
  documentID: 1981
    fieldKey: body
   rawCorpus: This is the task description.
normalCorpus: thi the task descript
3 rows in set (0.00 sec)
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18497
2017-08-30 11:28:30 -07:00
epriestley
f97157e7ed Build a prototype fulltext engine ("Ferret") using only basic MySQL primitives
Summary:
Ref T12819. I gave this stuff a sweet code name because all the terms related to "fulltext" and "search" already mean 5 different things. It, uh, ferrets out documents for you?

I'm building this to work a lot like the existing ngram index, which seems to work pretty well. If this sticks, it will auto-resolve the join issue (in T12443) by letting us do the entire thing locally in a JOIN and thus dodge a lot of mess.

This index gets built alongside other indexes, but only shows up in the UI if you have prototypes enabled. If you do, it appears under the existing fulltext field in Maniphest. No existing functionality is affected or disrupted.

NOTE: The query engine half of this is still EXTREMELY primitive, and this probably performs worse than the existing field for now. If this doesn't show obvious signs of being awful on `secure` I'll improve that in followup changes.
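
A hypothetical illustration of doing the whole query locally in one JOIN. The table names follow the `_ffield` naming visible elsewhere in this log, but the exact schema and query shape are assumptions.

```
<?php
$ngrams = array('qua', 'uac', 'ack');   // trigrams of the query "quack"

$ngram_list = "'".implode("', '", $ngrams)."'";

$sql = <<<SQL
SELECT doc.documentID
  FROM maniphest_task_fdocument doc
  JOIN maniphest_task_fngrams n
    ON n.documentID = doc.documentID
  JOIN maniphest_task_ffield f
    ON f.documentID = doc.documentID
 WHERE n.ngram IN ({$ngram_list})
   AND f.rawCorpus LIKE '%quack%'
 GROUP BY doc.documentID
HAVING COUNT(DISTINCT n.ngram) = 3
SQL;

// The ngram index narrows candidates cheaply; the LIKE then verifies a real
// substring match against the raw corpus.
echo $sql, "\n";
```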

Test Plan:
Indexed my tasks, ran some simple queries, and got the results I wanted, even for queries "ko", "k", and "v0.1".

{F5147746}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819, T12443

Differential Revision: https://secure.phabricator.com/D18484
2017-08-28 14:52:59 -07:00