phorge-phorge/resources/sql/autopatches/20171002.cngram.10.phriction.sql
epriestley 1de130c9f5 Allow the Ferret engine to remove "common" ngrams from the index
Summary:
Ref T13000. This adds support for tracking "common" ngrams, which occur in too many documents to be useful as part of the ngram index.

If an ngram is listed in the "common" table, it won't be written when indexing documents, or queried for when searching for them.
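
For illustration only, this check is conceptually just a lookup against the "common" table before an ngram is written or queried. The table name below comes from this patch, but the statements themselves are a hypothetical sketch rather than the engine's actual queries; the real filtering happens in the indexing and search code.

  -- Hypothetical sketch: load the current "common" list so the indexer and
  -- the query builder can drop these ngrams up front.
  SELECT ngram
    FROM {$NAMESPACE}_phriction.phriction_document_fngrams_common;

  -- Hypothetical sketch: alternatively, check a single candidate ngram
  -- ('abc') and skip it for both indexing and searching if a row exists.
  SELECT 1
    FROM {$NAMESPACE}_phriction.phriction_document_fngrams_common
    WHERE ngram = 'abc';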

In this change, nothing actually writes to the "common" table. I'll start writing to the table in a followup change.

Specifically, I plan to do this:

  - A new GC process periodically updates the "common" table, writing to it any ngram which appears in more than X% of documents (for some value of X), once there are at least a minimum number of documents (perhaps around 4,000).
  - A new GC process removes, from the existing indexes, any ngrams which have been added to the "common" table.

Hopefully, this will pare down the ngram index to something reasonable over time without requiring any manual tuning (a rough sketch of these GC steps follows below).
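
As a rough sketch, the two planned GC steps might look something like the SQL below. This assumes the main index table is `{$NAMESPACE}_phriction.phriction_document_fngrams` and that `needsCollection` flags ngrams whose existing index rows still need to be purged; the literal threshold (600) is a stand-in for "X% of the current document count" and would be computed by the GC rather than hard-coded.

  -- Hypothetical sketch of step 1: copy ngrams that appear "too often" into
  -- the common table, marking them for later collection. The literal 600
  -- stands in for "X% of the current document count".
  INSERT IGNORE INTO {$NAMESPACE}_phriction.phriction_document_fngrams_common
      (ngram, needsCollection)
    SELECT ngram, 1
      FROM {$NAMESPACE}_phriction.phriction_document_fngrams
      GROUP BY ngram
      HAVING COUNT(*) > 600;

  -- Hypothetical sketch of step 2: purge index rows for ngrams that have
  -- been marked common, then clear the collection flag.
  DELETE f
    FROM {$NAMESPACE}_phriction.phriction_document_fngrams f
    JOIN {$NAMESPACE}_phriction.phriction_document_fngrams_common c
      ON f.ngram = c.ngram
    WHERE c.needsCollection = 1;

  UPDATE {$NAMESPACE}_phriction.phriction_document_fngrams_common
    SET needsCollection = 0
    WHERE needsCollection = 1;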

Test Plan:
  - Ran some search queries and indexing operations.
  - Manually inserted the ngrams `xxx` and `yyy` into the common ngrams table, then searched and indexed, and saw them ignored as viable ngrams for search/indexing.

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T13000

Differential Revision: https://secure.phabricator.com/D18672
2017-10-03 13:27:42 -07:00

CREATE TABLE {$NAMESPACE}_phriction.phriction_document_fngrams_common (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ngram CHAR(3) NOT NULL COLLATE {$COLLATE_TEXT},
  needsCollection BOOL NOT NULL,
  UNIQUE KEY `key_ngram` (ngram),
  KEY `key_collect` (needsCollection)
) ENGINE=InnoDB, COLLATE {$COLLATE_TEXT};
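
For reference, the manual step from the test plan above amounts to something like the following; this is a sketch, and marking the rows as already needing collection is an assumption.

  -- Hypothetical sketch mirroring the test plan: force 'xxx' and 'yyy' to be
  -- treated as common, so they are skipped when indexing and searching.
  INSERT INTO {$NAMESPACE}_phriction.phriction_document_fngrams_common
      (ngram, needsCollection)
    VALUES ('xxx', 1), ('yyy', 1);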