1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2025-02-16 08:48:38 +01:00
Commit graph

28 commits

Author SHA1 Message Date
epriestley
c5e4bd8187 Fix some minor errors (DarkConsole warning, unstable Ferret sort)
Summary:
DarkConsole could warn when "Analyze Query Plans" was not active.

`msort()` is not stable, so Ferret results with similar relevance could be returned out-of-order.

Test Plan: Saw fewer traces and more-stable result ordering.

Differential Revision: https://secure.phabricator.com/D19236
2018-03-18 15:12:25 -07:00
epriestley
da0a08a7e1 Make "mysql" mean "Ferret engine" in Fulltext search
Summary: Ref T12819. Swaps constants so existing configurations that use a "mysql" engine now use the Ferret engine, not an InnoDB/MyISAM FULLTEXT engine.

Test Plan: Swapped my local config back to "mysql" (the default), saw Ferret engine results in the UI.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18590
2017-09-11 18:05:12 -07:00
epriestley
39b74572e6 Return fulltext tokens from the Ferret fulltext engine
Summary:
Ref T12819. These render the little "Searched For: X, Y, U V" hint about how something was parsed.

(This might get a "substring" color or "title only" color or something in the future.)

Test Plan: {F5178807}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18589
2017-09-11 18:04:56 -07:00
epriestley
a2a2b3f7f4 Sort global fulltext results by overall relevance
Summary:
Ref T12819. Currently, under the Ferret engine, we query each application's index separately and then aggregate the results.

At the moment, results are aggregated by type first, then by actual rank. For example, all the revisions appear first, then all the tasks.

Instead, surface the internal ranking data from the underlying query and sort by it.

Test Plan: Searched for "A B" with a task named "A B" and a revision named "A". Saw task first. Broadly, saw mixed task and revision order in result sets.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18551
2017-09-07 13:21:58 -07:00
epriestley
8059db894d Use the Ferret engine fulltext document table to drive auxiliary fulltext constraints
Summary:
Ref T12819. I started trying to get individual engines to drive these constraints (e.g., `ManiphestTaskQuery` can do most of the work) but this is a big pain, especially since most engines don't support "any owner" or "no owner", and not everything has an owner, and so on and so on. Going down this path would have meant a huge pile of stub functions everywhere, I think.

Instead, drive these through the main engine using the fulltext document table, which already has everything we need to apply these constraints in a uniform way.

Also tweak some parts of query construction and result ordering.

Test Plan: Searched for documents by author, owner, unowned, any owner, tags, subscribers, fulltext in global search. Got sensible results without any application-specific code.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18550
2017-09-07 13:21:42 -07:00
epriestley
4ea677ba97 Skeleton support for running global fulltext queries via the Ferret engine
Summary:
Ref T12819. Provides a Ferret-engine-based fulltext engine to ultimately replace the InnoDB fulltext engine.

This is still pretty basic (hard-coded and buggy) but technically sort of works.

To activate this, you must explicitly configure it, so it isn't visible to users yet.

Test Plan: Searched for objects with global fulltext search, got a mixture of matching revisions and tasks back.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18548
2017-09-06 13:15:36 -07:00
epriestley
faca1deea5 Remove the fulltext "reconstructDocument()" method
Summary: Ref T12819. This was originally intended for debugging, but never actually used and not clearly useful. There are no callers and it probably does not work. Just get rid of it.

Test Plan: Grepped for callers; none exist.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18544
2017-09-06 11:48:35 -07:00
epriestley
cfb86dddd2 Warn users that compound terms separated by apostrophes don't work in the MySQL FULLTEXT index either
Summary: Ref T12928. Like `v0.1`, terms in the form `yo's` (sequences of two or fewer characters separated by apostrophes) do not get indexed.

Test Plan: {F5078192}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12928

Differential Revision: https://secure.phabricator.com/D18324
2017-08-02 16:06:08 -07:00
epriestley
018d1b77bf Identify compound short search tokens in the form "xx.yy" as unqueryable in the search UI
Summary:
Ref T12928. The index doesn't work for these, so show the user that there's a problem and drop the terms.

This doesn't fix the problem, but makes the behavior more clear.

Test Plan:
{F5053703}

{F5053704}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12928

Differential Revision: https://secure.phabricator.com/D18254
2017-07-20 14:24:09 -07:00
epriestley
f880000eb0 Stem fulltext tokens before filtering them for stopwords
Summary:
Fixes T12596. A query for a token (like "having") which stems to a stopword (like "have") currently survives filtering. Stem it first so it gets caught.

Also, for InnoDB, a custom stopword table can be configured. If it is, read that instead of the default stopword list (I configured it locally, but the default list is reasonable so we never formally recommended installs configure it).

Test Plan:
Queried for words that stem to stopwords, saw them filtered:

{F4915843}

Queried for the original problem query and saw "having" caught with "have" in the stopword list:

{F4915844}

Fiddled with local InnoDB stopword table config and saw the stopword list get loaded correctly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12596

Differential Revision: https://secure.phabricator.com/D17728
2017-04-19 10:02:21 -07:00
epriestley
ada9046e31 Fix a fulltext search issue where finding token length and stopwords could fail
Summary:
Ref T12137. If a database is missing the InnoDB or MyISAM table engines, the big combined query to get both will fail.

Instead, try InnoDB first and then MyISAM.

(I have both engines locally so this worked until I deployed it.)

Test Plan: Faked an InnoDB error like `secure`, got a MyISAM result.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12137

Differential Revision: https://secure.phabricator.com/D17673
2017-04-12 19:22:46 -07:00
epriestley
3245e74f16 Show users how fulltext search queries are parsed and executed; don't query stopwords or short tokens
Summary:
Depends on D17670. Fixes T12137. Fixes T12003. Ref T2632.

This shows users a readout of which terms were actually searched for.

This also drops those terms from the query we submit to the backend, dodging the weird behaviors / search engine bugs in T12137.

This might need some design tweaking.

Test Plan: {F4899825}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12137, T12003, T2632

Differential Revision: https://secure.phabricator.com/D17672
2017-04-12 19:07:54 -07:00
epriestley
cb49acc2ca Update Phabricator to use intermediate tokens from the query compiler
Summary:
Depends on D17669. Ref T12137. Ref T12003. Ref T2632. Ref T7860.

Converts Phabricator to the new parse + compile workflow with intermediate tokens.

Also fixes a bug where searches for `cat"` or similar (unmatched quotes) wouldn't produce a nice exception.

Test Plan:
  - Fulltext searched.
  - Fulltext searched in Conpherence.
  - Fulltext searched with bad syntax.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12137, T12003, T7860, T2632

Differential Revision: https://secure.phabricator.com/D17670
2017-04-12 19:07:33 -07:00
epriestley
7e6f37fffb Rename "ElasticSearch" filenames to "Elasticsearch" (2/2)
Sometimes git does some odd magic on case-insensitive filesystems, try to
trick it.

Auditors: chad
2017-04-02 14:59:36 -07:00
epriestley
a9e2732a5c Spell "Elasticsearch" correctly, not "ElasticSearch"
Summary: Ref T12450. These are like 95% my fault, but Elastic appears to spell the name "Elasticsearch" consistently in their branding.

Test Plan: `grep ElasticSearch`

Reviewers: chad, 20after4

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17601
2017-04-02 14:58:59 -07:00
epriestley
6d81675032 Remove "url" from Elasticsearch index
Summary:
Ref T12450. This was added a very very long time ago (D2298).

I don't want to put this in the upstream index anymore because I don't want to encourage third parties to develop software which reads the index directly. Reading the index directly is a big skeleton key which bypasses policy checks.

This was added before much of the policy model existed, when that wasn't as much of a concern. On a tecnhnical note, this also doesn't update when `phabricator.base-uri` changes.

This can be written as a search index extension if an install relies on it for some bizarre reason, although none should and I'm unaware of any actual use cases in the wild for it, even at Facebook.

Test Plan: Indexed some random stuff into ElasticSearch.

Reviewers: chad, 20after4

Reviewed By: chad

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17600
2017-04-02 13:26:45 -07:00
Mukunda Modell
cb1d904654 Make sure writes go to the right cluster
Summary:
Two little issues

1. there was an extra call to getHostForWrite,
2. The engine instance was shared between multiple service definitions so it
was overwriting the list of writable hosts from one service with hosts from another.

Test Plan:
tested in wikimedia production with multiple services defined like this:

```language=json
 [
        {
          "hosts": [
            {
              "host": "search.svc.codfw.wmnet",
              "protocol": "https",
              "roles": {
                "read": true,
                "write": true
              },
              "version": 5
            }
          ],
          "path": "/phabricator",
          "port": 9243,
          "type": "elasticsearch"
        },
        {
          "hosts": [
            {
              "host": "search.svc.eqiad.wmnet",
              "protocol": "https",
              "roles": {
                "read": true,
                "write": true
              },
              "version": 5
            }
          ],
          "path": "/phabricator",
          "port": 9243,
          "type": "elasticsearch"
        }
      ]
```

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley

Differential Revision: https://secure.phabricator.com/D17581
2017-03-30 18:08:05 +00:00
Mukunda Modell
67a1c40476 Set content-type to application/json
Summary:
Elasticsearch really wants a raw json body and it fails to accept
the request as of es version 5.3

Test Plan:
Tested with elasticsearch 5.2 and 5.3.

Before this change 5.2 worked but 5.3 failed with
`HTTP/406 "Content-Type header [application/x-www-form-urlencoded] is not supported"` [1]

After this change, both worked.

[1] https://phabricator.wikimedia.org/P5158

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin

Differential Revision: https://secure.phabricator.com/D17580
2017-03-30 18:07:47 +00:00
Mukunda Modell
654f0f6043 Make messages translatable and more sensible.
Summary:
These exception messages & comments didn't quite match reality.
Fixed and added pht() around a couple of them.

Test Plan: I didn't test this :P

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin

Differential Revision: https://secure.phabricator.com/D17578
2017-03-28 23:17:35 +00:00
Mukunda Modell
699228c73b Address some New Search Configuration Errata
Summary:
  [ ] Write an "Upgrading: ..." guidance task with narrow instructions for installs that are upgrading.
  [ ] Do we need to add an indexing activity (T11932) for installs with ElasticSearch?
  [ ] We should more clearly detail exactly which versions of ElasticSearch are supported (for example, is ElasticSearch <2 no longer supported)? From T9893 it seems like we may //only// have supported ElasticSearch <2 before, so are the two regions of support totally nonoverlapping and all ElasticSearch users will need to upgrade?
  [ ] Documentation should provide stronger guidance toward MySQL and away from Elastic for the vast majority of installs, because we've historically seen users choosing Elastic when they aren't actually trying to solve any specific problem.
  [ ] When users search for fulltext results in Maniphest and hit too many documents, the current behavior is approximately silent failure (see T12443). D17384 has also lowered the ceiling for ElasticSearch, although previous changes lowered it for MySQL search. We should not fail silently, and ideally should build toward T12003.
  [ ] D17384 added a new "keywords" field, but MySQL does not search it (I think?). The behavior should be as consistent across MySQL and Elastic as we can make it. Likely cleaner is giving "Project" objects a body, with "slugs" and "description" separated by newlines?
  [ ] `PhabricatorSearchEngineTestCase` is now pointless and only detects local misconfigurations.
  [ ] It would be nice to build a practical test suite instead, where we put specific documents into the index and then search for them. The upstream test could run against MySQL, and some `bin/search test` could run against a configured engine like ElasticSearch. This would make it easier to make sure that behavior was as uniform as possible across engine implementations.
  [ ] Does every assigned task now match "user" in ElasticSearch?
  [x] `PhabricatorElasticFulltextStorageEngine` has a `json_encode()` which should be `phutil_json_encode()`.
  [ ] `PhabricatorSearchService` throws an untranslated exception.
  [ ] When a search cluster is down, we probably don't degrade with much grace (unhandled exception)?
  [ ] I haven't run bin/search init, but bin/search index doesn't warn me that I may want to. This might be worth adding. The UI does warn me.
  [ ] bin/search init warns me that the index is "incorrect". It might be more clear to distinguish between "missing" and "incorrect", since it's more comforting to users to see "everything is as we expect, doing normal first-time setup now" than "something is wrong, fixing it".
  [ ] CLI message "Initializing search service "ElasticSearch"" does not end with a period, which is inconsistent with other UI messages.
  [ ] It might be nice to let bin/search commands like init and index select a specific service (or even service + host) to act on, as bin/storage --ref ... now does. You can generally get the result you want by fiddling with config.
  [ ] When a service isn't writable, bin/search init reports "Search cluster has no hosts for role "write".". This is accurate but does not provide guidance: it might be more useful to the user to explain "This service is not writable, so we're skipping index check for it.".
  [x] Even with write off for MySQL, bin/search index --type task --trace still updates MySQL, I think? I may be misreading the trace output. But this behavior doesn't make sense if it is the actual behavior, and it seems like reindexAbstractDocument() uses "all services", not "writable services", and the MySQL engine doesn't make sure it's writable before indexing.
  [x] Searching or user fails to find task Grant users tokens when a mention is created, suggesting that stemming is not working.
  [x] Searching for users finds that task, but fails to find a task containing "per user per month" in a comment, also suggesting that stemming is not working.
  [x] Searching for maniphest fails to find task maniphest.query elephant, suggesting that tokenization in ElasticSearch is not as good as the MySQL tokenization for these words (see D17330).
  [x] The "index incorrect" warning UI uses inconsistent title case.
  [x] The "index incorrect" warning UI could format the command to be run more cleanly (with addCommand(), I think).

refs T12450

Test Plan:
* Stared blankly at the code.
* Disabled 'write' role on mysql fulltext service.
* Edited a task, ran search indexer, verified that the mysql index wasn't being updated.

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17564
2017-03-28 20:19:38 +00:00
Mukunda Modell
e41c25de50 Support multiple fulltext search clusters with 'cluster.search' config
Summary:
The goal is to make fulltext search back-ends more extensible, configurable and robust.

When this is finished it will be possible to have multiple search storage back-ends and
potentially multiple instances of each.

Individual instances can be configured with roles such as 'read', 'write' which control
which hosts will receive writes to the index and which hosts will respond to queries.

These two roles make it possible to have any combination of:

* read-only
* write-only
* read-write
* disabled

This 'roles' mechanism is extensible to add new roles should that be needed in the future.

In addition to supporting multiple elasticsearch and mysql search instances, this refactors
the connection health monitoring infrastructure from PhabricatorDatabaseHealthRecord and
utilizes the same system for monitoring the health of elasticsearch nodes. This will
allow Wikimedia's phabricator to be redundant across data centers (mysql already is,
elasticsearch should be as well).

The real-world use-case I have in mind here is writing to two indexes (two elasticsearch clusters
in different data centers) but reading from only one. Then toggling the 'read' property when
we want to migrate to the other data center (and when we migrate from elasticsearch 2.x to 5.x)

Hopefully this is useful in the upstream as well.

Remaining TODO:

* test cases
* documentation

Test Plan:
(WARNING) This will most likely require the elasticsearch index to be deleted and re-created due to schema changes.

Tested with elasticsearch versions 2.4 and 5.2 using the following config:

```lang=json
  "cluster.search": [
    {
      "type": "elasticsearch",
      "hosts": [
        {
          "host": "localhost",
          "roles": { "read": true, "write": true }
        }
      ],
      "port": 9200,
      "protocol": "http",
      "path": "/phabricator",
      "version": 5
    },
    {
      "type": "mysql",
      "roles": { "write": true }
     }
  ]

Also deployed the same changes to Wikimedia's production Phabricator instance without any issues whatsoever.
```

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin, epriestley

Tags: #elasticsearch, #clusters, #wikimedia

Differential Revision: https://secure.phabricator.com/D17384
2017-03-26 08:16:47 +00:00
epriestley
23a202866a When running a fulltext query with no query, enforce order by document creation date
Summary:
Fixes T11929. When running with a query, we no longer enforce an order on the subquery join to produce results more quickly when searching for common strings.

However, this means that empty queries (like those issued by "Close as Duplicate") don't order subquery results.

Restore a `dateCreated` order if there is no query text.

Test Plan: Artificially set limit to 10, still saw 10 most recent tasks.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11929

Differential Revision: https://secure.phabricator.com/D16960
2016-11-29 12:14:57 -08:00
epriestley
b2cdebefea Fix two errors from the error logs
Summary: Found these in the `secure` error logs: one bad call, one bad column.

Test Plan: Searched for empty string. Double-checked method name.

Reviewers: chad

Reviewed By: chad

Differential Revision: https://secure.phabricator.com/D16948
2016-11-26 07:50:57 -08:00
epriestley
7c5b5327c8 Use stemming in the MySQL fulltext search engine
Summary:
Ref T6740. When we index a document, also save a copy of the stemmed version.

When querying, search the combined corpus for the terms.

(We may need to tune this a bit later since it's possible for literal, quoted terms to match in the stemmed section, but I think this wil rarely cause issues in practice.)

A downside here is that search sort of breaks if you upgrade into this and don't reindex. I wasn't able to find a way to issue the query that remained compatible with older indexes and didn't have awful performance, so my plan is:

  - Put this on `secure`.
  - Rebuild the index.
  - If things look good after a couple of days, add a way that we can tell people they need to rebuild the search index with a setup warning.

We might get some reports between now and then, but if this is super awful we should know by the end of the weekend.

Test Plan:
WOW AMAZING

{F2021466}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6740

Differential Revision: https://secure.phabricator.com/D16947
2016-11-25 15:30:50 -08:00
epriestley
2b7ec1deea Boost search result title matches
Summary: Ref T6740. When a query matches a document title, boost results.

Test Plan:
  - Searched for `button bar`.
  - Before:

{F2019463}

  - After:

{F2019470}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6740

Differential Revision: https://secure.phabricator.com/D16945
2016-11-25 15:25:49 -08:00
epriestley
54470a12d4 Execute fulltext queries using a subquery instead of by ordering the entire result set
Summary:
Ref T6740. Currently, we issue fulltext queries with an "ORDER BY <score>" on the entire result set.

For very large result sets, this can require MySQL to do a lot of work. However, this work is generally useless: if you search for some common word like "diff" or "internet" or whatever and match 4,000 documents, the chance that we can score whatever thing you were thinking of at the top of the result set is nearly nothing. It's more useful to return quickly, and let the user see that they need to narrow their query to get useful results.

Instead of doing all that work, let MySQL find up to 1,000 results, then pick the best ones out of those.

This actual change is a little flimsy, since our index isn't really big enough to suffer indexing issues. However, searching for common terms on my local install (where I have some large repositories imported and indexed) drops from ~40ms to ~10ms.

My hope is to improve downstream performance for queries like "translatewiki" here, particularly:

<https://phabricator.wikimedia.org/T143863>

That query matches about 300 trillion documents but there's a ~0% chance that the one the user wants is at the top. It takes a couple of seconds to execute, for me. Better to return quickly and let the user refine their results.

I think this will also make some other changes related to stemming easier.

This also removes the "list users first" ordering on the query, which made performance more complicated and seems irrelevant now that we have the typeahead.

Test Plan:
  - Searched for some common terms like "code" locally, saw similar results with better performance.
  - Searched for useful queries (e.g., small result set), got identical results.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6740

Differential Revision: https://secure.phabricator.com/D16944
2016-11-25 15:19:54 -08:00
epriestley
a956047989 Use PhutilQueryCompiler in Phabricator fulltext search
Summary:
Ref T11741. Fixes T10642. Parse and compile user queries with a consistent ruleset, then submit queries to the backend using whatever ruleset MySQL is configured with.

This means that `ft_boolean_syntax` no longer needs to be configured (we'll just do the right thing in all cases).

This should improve behavior with RDS immediately (T10642), and allow us to improve behavior with InnoDB in the future (T11741).

Test Plan:
  - Ran various queries in the UI, saw the expected results.
  - Ran bad queries, got useful errors.
  - Searched threads in Conpherence.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10642, T11741

Differential Revision: https://secure.phabricator.com/D16939
2016-11-25 14:46:10 -08:00
epriestley
23c42486e4 Rename "SearchEngine" to "FulltextStorageEngine"
Summary:
Ref T9979. I picked this name long before the advent of modern "Engine" architecture and it ended up being pretty confusing.

Rename "SearchEngine" (currently: mysql or elasticsearch, used to store and query fulltext indexes) to "FulltextStorageEngine" to make it more clear what it does and disambituate it from ApplicationSearch, which also has a bunch of stuff called "SearchEngine", "SearchEngineExtension", etc.

Test Plan: Grepped for `phabricatorsearchengine`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9979

Differential Revision: https://secure.phabricator.com/D14843
2015-12-21 17:26:19 -08:00