Summary:
Ref T13000. Since other changes have generally made the ngrams table manageable, I'm not planning to enable common ngrams by default at this time.
Instead, make the threshold configurable with "--threshold" so we can guide installs through tuning this if they want (e.g. PHI110), and tune hosted instances.
(This might eventually become automatic, but just smoothing this bit off for now feels reasonable to me.)
Test Plan: Ran with `--reset`, and with various invalid and valid `--threshold` arguments.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13000
Differential Revision: https://secure.phabricator.com/D18710
Summary: Noticed a couple of typos in the docs, and then things got out of hand.
Test Plan:
- Stared at the words until my eyes watered and the letters began to swim on the screen.
- Consulted a dictionary.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley, yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam
Differential Revision: https://secure.phabricator.com/D18693
Summary:
Ref T13000. Currently, queries can only be executed from the web UI, which requires logging in as a user. I really want to avoid doing that wherever we can, but being able to execute queries on an instance (and, particularly, see the ngrams and timings on the underlying lookups) would have been helpful in several cases.
Improve tooling a bit in advance of the "common ngrams" stuff going out since it seems likely that it will be useful if issues arise.
Test Plan: Ran `bin/search query --query ...`, got useful minimal output. Ran with `--trace` to get internals.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13000
Differential Revision: https://secure.phabricator.com/D18690
Summary:
Ref T13000. After an ngram is marked as "common", we can delete it from the storage table.
Currently, the only way to get ngrams marked as "common" is to manually run `bin/search ngrams`, so this has no impact on normal installs.
Test Plan: Ran `bin/garbage collect`, saw it start chewing through my local Maniphest ngrams table and removing common ngrams.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13000
Differential Revision: https://secure.phabricator.com/D18687
Summary:
Depends on D18672. Ref T13000. This does an on-demand build of the common ngrams table.
Plan here is:
- Push to `secure`.
- Build the common ngrams table here.
- See if stuff breaks?
If it looks okay on this dataset, we can build out the GC support and try it in production.
Test Plan:
- Locally, my dataset has a bunch of `bin/lipsum` tasks with similar, common words.
- Verified that ipsum terms now skip ngrams. For "lorem ipsum" search performance actually IMPROVED by skipping the ngrams table (12s to 9s).
- Queried for normal terms, got very fast results using the ngram table, as normal.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13000
Differential Revision: https://secure.phabricator.com/D18673
Summary: Ref T12845. Converts the cluster and project config options to the new stuff; this is mostly just shifting boilerplate around.
Test Plan: Edited, deleted, and mangled these options from the web UI and CLI.
Reviewers: chad, amckinley
Reviewed By: amckinley
Maniphest Tasks: T12845
Differential Revision: https://secure.phabricator.com/D18166
Summary: Ref T8788. See D17702. This allows `bin/search index` to index stuff which only implements `Ngrams`, not `Fulltext`.
Test Plan: Kinda poked around `bin/search index` a bit, yell if you hit more issues deeper down the stack?
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T8788
Differential Revision: https://secure.phabricator.com/D17704
Summary:
Ref T12450. There's currently a bad behavior where inserting a document into one search service marks it as up to date everywhere.
This isn't nearly as obvious as it should be because `bin/search index` doesn't make it terribly clear when a document was skipped because the index version was already up to date.
When running `bin/seach index` without `--force` or `--background`, keep track of updated vs not-updated documents and print out some guidance. In other configurations, try to provide more help too.
Test Plan: {F4452134}
Reviewers: chad, 20after4
Reviewed By: 20after4
Maniphest Tasks: T12450
Differential Revision: https://secure.phabricator.com/D17597
Summary:
Ref T12450. Normally, we validate config when:
- You restart the webserver.
- You edit it with `bin/config set ...`.
- You edit it with the web UI.
However, you can also change config by editing `local.json`, `some_env.conf.php`, a `SiteConfig` class, etc. In these cases, you may miss config warnings.
Explicitly re-run search config checks from `bin/search`, similar to the additional database checks we run from `bin/storage`, to try to produce a better error message if the user has made a configuration error.
Test Plan:
```
$ ./bin/search init
Usage Exception: Setting "cluster.search" is misconfigured: Invalid search engine type: elastic. Valid types are: elasticsearch, mysql.
```
Reviewers: chad, 20after4
Reviewed By: 20after4
Maniphest Tasks: T12450
Differential Revision: https://secure.phabricator.com/D17574
Summary:
Ref T12450. This mostly just smooths out the text a little to improve consistency. Also:
- Use `isWritable()`.
- Make the "skipping because not writable" message more clear and tailored.
- Try not to use the word "index" too much to avoid confusion with `bin/search index` -- instead, talk about "initialize a service".
Test Plan: Ran `bin/search init` with a couple of different (writable / not writable) configs, saw slightly clearer messaging.
Reviewers: chad, 20after4
Reviewed By: 20after4
Maniphest Tasks: T12450
Differential Revision: https://secure.phabricator.com/D17572
Summary:
This implements a simplistic `PhabricatorRepositoryFulltextEngine`
Currently only the repository name, description, timestamps and
status are indexed.
Note: I had to change the `search index` workflow to disambiguate
PhabricatorRepository from PhabricatorRepositoryCommit
Test Plan:
* ran `./bin/search index --type PhabricatorRepository --force`
* searched for some repositories. Saw reasonable results matching on either title or description.
* Edited a repository in the web ui
* Added unique key words to the repo description.
* I was then able to find that repo by searching for the new keywords.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin
Tags: #search, #diffusion
Differential Revision: https://secure.phabricator.com/D17300
Summary:
The goal is to make fulltext search back-ends more extensible, configurable and robust.
When this is finished it will be possible to have multiple search storage back-ends and
potentially multiple instances of each.
Individual instances can be configured with roles such as 'read', 'write' which control
which hosts will receive writes to the index and which hosts will respond to queries.
These two roles make it possible to have any combination of:
* read-only
* write-only
* read-write
* disabled
This 'roles' mechanism is extensible to add new roles should that be needed in the future.
In addition to supporting multiple elasticsearch and mysql search instances, this refactors
the connection health monitoring infrastructure from PhabricatorDatabaseHealthRecord and
utilizes the same system for monitoring the health of elasticsearch nodes. This will
allow Wikimedia's phabricator to be redundant across data centers (mysql already is,
elasticsearch should be as well).
The real-world use-case I have in mind here is writing to two indexes (two elasticsearch clusters
in different data centers) but reading from only one. Then toggling the 'read' property when
we want to migrate to the other data center (and when we migrate from elasticsearch 2.x to 5.x)
Hopefully this is useful in the upstream as well.
Remaining TODO:
* test cases
* documentation
Test Plan:
(WARNING) This will most likely require the elasticsearch index to be deleted and re-created due to schema changes.
Tested with elasticsearch versions 2.4 and 5.2 using the following config:
```lang=json
"cluster.search": [
{
"type": "elasticsearch",
"hosts": [
{
"host": "localhost",
"roles": { "read": true, "write": true }
}
],
"port": 9200,
"protocol": "http",
"path": "/phabricator",
"version": 5
},
{
"type": "mysql",
"roles": { "write": true }
}
]
Also deployed the same changes to Wikimedia's production Phabricator instance without any issues whatsoever.
```
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: Korvin, epriestley
Tags: #elasticsearch, #clusters, #wikimedia
Differential Revision: https://secure.phabricator.com/D17384
Summary:
Ref T9979. I picked this name long before the advent of modern "Engine" architecture and it ended up being pretty confusing.
Rename "SearchEngine" (currently: mysql or elasticsearch, used to store and query fulltext indexes) to "FulltextStorageEngine" to make it more clear what it does and disambituate it from ApplicationSearch, which also has a bunch of stuff called "SearchEngine", "SearchEngineExtension", etc.
Test Plan: Grepped for `phabricatorsearchengine`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14843
Summary: Ref T9979. This simplifies/standardizes the code a bit, but mostly gives us more consistent class names and structure.
Test Plan:
- Used `bin/search index --type ...` to index documents of every indexable type.
- Searched for documents by unique text, found them.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14842
Summary:
Ref T9979. There are currently some hacks around Conpherence indexing: it does not really use the fulltext index, but its own specialized index. However, it's kind of hacked up so it can get reindexed by the normal indexing pipeline.
Lift it up into IndexEngine, instead of FulltextEngine. Specifically, the new stuff is going to look like this:
- IndexEngine: Rebuild all indexes.
- ConpherenceIndexExtension: Rebuild thread indexes.
- ProjectMemberIndexExtension: Rebuild project membership views.
- NgramIndexExtension: Rebuild ngram indexes.
- FulltextIndexExtension / FulltextEngine: Rebuild fulltext indexes, a special type of index.
- FulltextCommentExtension: Rebuild comment fulltext indexes.
- FulltextProjectExtension: Rebuild project fulltext indexes.
- etc.
Most of this is at least sort-of-in-place as of this diff, although some of the part in the middle is still pretty rough.
Test Plan:
- Made a unique comment in a Conpherence thread.
- Used `bin/search index --force` to rebuild the index.
- Searched for the comment.
- Found the thread.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14841
Summary:
Ref T9890. Ref T9979. Several adjacent goals:
- The `SearchEngine` vs `ApplicationSearchEngine` thing is really confusing. There are also a bunch of confusing class names and class relationships within the fulltext indexing. I want to rename these classes to be more standard (`IndexEngine`, `IndexEngineExtension`, etc). Rename `SearchIndexer` to `IndexEngine`. A future change will rename `SearchEngine`.
- Add the index locks described in T9890.
- Structure things a little more normally so future diffs can do the "EngineExtension" thing more cleanly.
Test Plan:
Indexing:
- Renamed a task to have a unique word in the title.
- Ran `bin/search index Txxx`.
- Searched for unique word.
- Found task.
Locking:
- Added a `sleep(10)` after the `lock()` call.
- Ran `bin/search index Txxx` in two windows.
- Saw first one lock, sleep 10 seconds, index.
- Saw second one give up temporarily after failing to grab the lock.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9890, T9979
Differential Revision: https://secure.phabricator.com/D14834
Summary: Fixes T5991. If //all requested documents// failed to index, consider this a catastrophic failure and exit with an error code.
Test Plan:
- Ran `bin/search index --type TASK`, observed successful exit despite a small number of un-indexable documents.
- Ran `bin/search index PHID-TASK-xxx` for an invalid task, observed exception on exit after complete failure.
- Ran normal indexing through daemons.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T5991
Differential Revision: https://secure.phabricator.com/D14174
Summary: Use `PhutilClassMaQuery` instead of `PhutilSymbolLoader`, mostly for consistency. Depends on D13588.
Test Plan: Poked around a bunch of pages.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley, Korvin
Differential Revision: https://secure.phabricator.com/D13589
Summary: Remove the `PhabricatorDefaultSearchEngineSelector` class. This is quite similar to D12053.
Test Plan: Went to `/view/PhabricatorSearchApplication/` and saw the storage engine configuration. Set `search.elastic.host` and saw the highlighted storage engine change.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin, epriestley
Differential Revision: https://secure.phabricator.com/D12670
Summary: Derped this up in D11234.
Test Plan: Ran `bin/search index --all`.
Reviewers: joshuaspence
Reviewed By: joshuaspence
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D11273
Summary:
Ref T3165. Builds a dedicated index for Conpherence to avoid scale/policy filtering concerns.
- This is pretty one-off but I think it's generally OK.
- There's no UI for it.
- `ConpherenceFulltextQuery` is very low-level. You would need to do another query on the PHIDs it returns to actually show anything to the user.
- The `previousTransactionPHID` is so you can load chat context efficiently. Specifically, if you want to show results like this:
> previous line of context
> **line of chat that matches the query**
> next line of context
...you can read the previous lines out of `previousTransactionPHID` directly, and the next lines by issuing one query with `WHERE previousTransactionPHID IN (...)`.
I'm not 100% sure this is useful, but it seemed like a reasonable thing to provide, since there's no way to query this efficiently otherwise and I figure a lot of chat might make way more sense with a couple of lines of context.
Test Plan:
- Indexed a thread manually (whole thing indexed).
- Indexed a thread by updating it (just the new comment indexed).
- Wrote a hacky test script and got reasonable-looking query results.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T3165
Differential Revision: https://secure.phabricator.com/D11234
Summary:
When the index does not exist and auto_create_index isn't
enabled, running ./bin/index results in a failure. That's
T5990
Instead create an index properly. This also allows us to do
nice things like do a proper mapping and analysis like for
substring matching like outlined by @fabe in T6552.
Test Plan:
Deleted and created index multiple times to verify
proper index creation and usage.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: Korvin, manybubbles, chasemp, fabe, epriestley
Differential Revision: https://secure.phabricator.com/D10955
Summary: I'm pretty sure that `@group` annotations are useless now... see D9855. Also fixed various other minor issues.
Test Plan: Eye-ball it.
Reviewers: #blessed_reviewers, epriestley, chad
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley, Korvin, hach-que
Differential Revision: https://secure.phabricator.com/D9859
Summary: Currently, the `./bin/search index` script produces a lot of output (one line for every indexed object). Instead, use a `PhutilConsoleProgressBar` to indicate progress. This is much less verbose and gives a real indication of how long the script should take to complete.
Test Plan: Ran `./bin/search index` and verified that a progress bar was output.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: epriestley, Korvin
Differential Revision: https://secure.phabricator.com/D9364
Summary: Fixes T3857. Earlier work made this trivial and just left product questions, which I've answered by requiring the daemons to run on reasonable installs.
Test Plan: Ran `bin/search index` and `bin/search index --background`. Observed indexes write in the former case and tasks queue in the latter case. Commented with a unique string on a revision and searched for it a moment later, got exactly one result (that revision), verifying that reindexing works correctly.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T3857
Differential Revision: https://secure.phabricator.com/D7966
Summary:
Ref T2015. Not directly related to Drydock, but I've wanted to do this for a bit.
Introduce a common base class for all the workflows in the scripts in `bin/*`. This slightly reduces code duplication by moving `isExecutable()` to the base, but also provides `getViewer()`. This is a little nicer than `PhabricatorUser::getOmnipotentUser()` and gives us a layer of indirection if we ever want to introduce more general viewer mechanisms in scripts.
Test Plan: Lint; ran some of the scripts.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D7838
Summary:
Part one of a large and complicated plot:
- The last filter for Maniphest "pro" queries is "Group By".
- This is currently executed in a convoluted and ridiculous way, loading massive amounts of data.
- The primary reason it works like it does is that we don't have a project name index available in Maniphest, so we can't sort in the DB.
- So, I want to provide a name index to Maniphest and push this work to the DB.
To do that, my plan is:
- Index projects in Search.
- Add a "did update index" event.
- Have Maniphest listen for it.
- When projects are updated, update their indexes in Maniphest.
- Rewrite the giant mess of "group by: project" to be somewhat reasonable.
- This may also extend to some future "group by: assignee".
This is the first small step down this path, which just indexes projects in search.
Test Plan: Ran `bin/search index --type project`, then searched for projects.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Differential Revision: https://secure.phabricator.com/D6955
Summary: Ref T2715. This only ever supported like 10% of object types; get rid of it in favor of the new infra.
Test Plan:
- Ran `bin/search index D12`; `bin/search index <some valid phid>`, `bin/search index derp`.
- Turned off Search jump, searched for `D12`.
- Used `phid.lookup`.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2715
Differential Revision: https://secure.phabricator.com/D6519
Summary: Fixes T2651. This could be futher generalized but it's a bit out of the way.
Test Plan: See chatlog.
Reviewers: chad
Reviewed By: chad
CC: aran
Maniphest Tasks: T2651
Differential Revision: https://secure.phabricator.com/D5236
Summary: Mostly applies a new call spacing rule; also a few things that have slipped through via pull requests and such
Test Plan: `find src/ -type f -name '*.php' | xargs -n16 arc lint --output summary --apply-patches`
Reviewers: chad
Reviewed By: chad
CC: aran
Differential Revision: https://secure.phabricator.com/D5002
Summary:
The search indexing API has several problems right now:
- Always runs in-process.
- It would be nice to push this into the task queue for performance. However, the API currently passses an object all the way through (and some indexers depend on preloaded object attributes), so it can't be dumped into the task queue at any stage since we can't serialize it.
- Being able to use the task queue will also make rebuilding indexes faster.
- Instead, make the API phid-oriented.
- No uniform indexing API.
- Each "Editor" currently calls SomeCustomIndexer::indexThing(). This won't work with AbstractTransactions. The API is also just weird.
- Instead, provide a uniform API.
- No uniform CLI.
- We have `scripts/search/reindex_everything.php`, but it doesn't actually index everything. Each new document type needs to be separately added to it, leading to stuff like D3839. Third-party applications can't provide indexers.
- Instead, let indexers expose documents for indexing.
- Not application-oriented.
- All the indexers live in search/ right now, which isn't the right organization in an application-orietned view of the world.
- Instead, move indexers to applications and load them with SymbolLoader.
Test Plan:
- `bin/search index`
- Indexed one revision, one task.
- Indexed `--type TASK`, `--type DREV`, etc., for all types.
- Indexed `--all`.
- Added the word "saboteur" to a revision, task, wiki page, and question and then searched for it.
- Creating users is a pain; searched for a user after indexing.
- Creating commits is a pain; searched for a commit after indexing.
- Mocks aren't currently loadable in the result view, so their indexing is moot.
Reviewers: btrahan, vrana
Reviewed By: btrahan
CC: 20after4, aran
Maniphest Tasks: T1991, T2104
Differential Revision: https://secure.phabricator.com/D4261