1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2024-11-28 09:42:41 +01:00
Commit graph

36 commits

Author SHA1 Message Date
epriestley
b46e2bb4cc Convert cluster/projects config options to newer modular structure
Summary: Ref T12845. Converts the cluster and project config options to the new stuff; this is mostly just shifting boilerplate around.

Test Plan: Edited, deleted, and mangled these options from the web UI and CLI.

Reviewers: chad, amckinley

Reviewed By: amckinley

Maniphest Tasks: T12845

Differential Revision: https://secure.phabricator.com/D18166
2017-06-27 12:35:54 -07:00
epriestley
c71d9c601f Pass all Throwables to Exception Handlers, not just Exceptions
Summary:
Ref T12855. PHP7 introduced "Throwables", which are sort of like super exceptions. Some errors that PHP raises at runtime have become Throwables instead of old-school errors now.

The major effect this has is blank pages during development under PHP7 for certain classes of errors: they skip all the nice "show a pretty error" handlers and

This isn't a compelete fix, but catches the most common classes of unexpected Throwable and sends them through the normal machinery. Principally, it shows a nice stack trace again instead of a blank page for a larger class of typos and minor mistakes.

Test Plan:
Before: blank page. After:

{F5007979}

Reviewers: chad, amckinley

Reviewed By: chad

Maniphest Tasks: T12855

Differential Revision: https://secure.phabricator.com/D18136
2017-06-20 05:44:51 -07:00
epriestley
3245e74f16 Show users how fulltext search queries are parsed and executed; don't query stopwords or short tokens
Summary:
Depends on D17670. Fixes T12137. Fixes T12003. Ref T2632.

This shows users a readout of which terms were actually searched for.

This also drops those terms from the query we submit to the backend, dodging the weird behaviors / search engine bugs in T12137.

This might need some design tweaking.

Test Plan: {F4899825}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12137, T12003, T2632

Differential Revision: https://secure.phabricator.com/D17672
2017-04-12 19:07:54 -07:00
epriestley
cb49acc2ca Update Phabricator to use intermediate tokens from the query compiler
Summary:
Depends on D17669. Ref T12137. Ref T12003. Ref T2632. Ref T7860.

Converts Phabricator to the new parse + compile workflow with intermediate tokens.

Also fixes a bug where searches for `cat"` or similar (unmatched quotes) wouldn't produce a nice exception.

Test Plan:
  - Fulltext searched.
  - Fulltext searched in Conpherence.
  - Fulltext searched with bad syntax.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12137, T12003, T7860, T2632

Differential Revision: https://secure.phabricator.com/D17670
2017-04-12 19:07:33 -07:00
epriestley
7e6f37fffb Rename "ElasticSearch" filenames to "Elasticsearch" (2/2)
Sometimes git does some odd magic on case-insensitive filesystems, try to
trick it.

Auditors: chad
2017-04-02 14:59:36 -07:00
epriestley
a9e2732a5c Spell "Elasticsearch" correctly, not "ElasticSearch"
Summary: Ref T12450. These are like 95% my fault, but Elastic appears to spell the name "Elasticsearch" consistently in their branding.

Test Plan: `grep ElasticSearch`

Reviewers: chad, 20after4

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17601
2017-04-02 14:58:59 -07:00
epriestley
304d19f92a After a fulltext write to a particular service fails, keep trying writes to other services
Summary:
Ref T12450. Currently, if a write fails, we stop and don't try to write to other index services. There's no technical reason not to keep trying writes, it makes some testing easier, and it would improve behavior in a scenario where engines are configured as "primary" and "backup" and the primary service is having some issues.

Also, make "no writable services are configured" acceptable, rather than an error. This state is probably goofy but if we want to detect it I think it should probably be a config-validation issue, not a write-time check. I also think it's not totally unreasonable to want to just turn off all writes for a while (maybe to reduce load while you're doing a background update).

Test Plan:
  - Configured a bad ElasticSearch engine and a good MySQL engine.
  - Ran `bin/search index ... --force`.
  - Saw MySQL get updated even though ElasticSearch failed.

Reviewers: chad, 20after4

Reviewed By: 20after4

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17599
2017-04-02 13:47:52 -07:00
Mukunda Modell
cb1d904654 Make sure writes go to the right cluster
Summary:
Two little issues

1. there was an extra call to getHostForWrite,
2. The engine instance was shared between multiple service definitions so it
was overwriting the list of writable hosts from one service with hosts from another.

Test Plan:
tested in wikimedia production with multiple services defined like this:

```language=json
 [
        {
          "hosts": [
            {
              "host": "search.svc.codfw.wmnet",
              "protocol": "https",
              "roles": {
                "read": true,
                "write": true
              },
              "version": 5
            }
          ],
          "path": "/phabricator",
          "port": 9243,
          "type": "elasticsearch"
        },
        {
          "hosts": [
            {
              "host": "search.svc.eqiad.wmnet",
              "protocol": "https",
              "roles": {
                "read": true,
                "write": true
              },
              "version": 5
            }
          ],
          "path": "/phabricator",
          "port": 9243,
          "type": "elasticsearch"
        }
      ]
```

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley

Differential Revision: https://secure.phabricator.com/D17581
2017-03-30 18:08:05 +00:00
Mukunda Modell
654f0f6043 Make messages translatable and more sensible.
Summary:
These exception messages & comments didn't quite match reality.
Fixed and added pht() around a couple of them.

Test Plan: I didn't test this :P

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin

Differential Revision: https://secure.phabricator.com/D17578
2017-03-28 23:17:35 +00:00
epriestley
88798354e8 Soften a possible cluster search setup fatal
Summary:
Ref T12450. The way that config repair and setup issues interact is kind of complicated, and if `cluster.search` is invalid we may end up using `cluster.search` before we repair it.

I poked at things for a bit but wasn't confident I could get it to consistently repair before we use it without doing a big messy change.

The only thing that really matters is whether "type" is valid or not, so just put a slightly softer/more-tailored check in for that.

Test Plan:
  - With `"type": "elastic"`, loaded setup issues.
  - Before patch: hard fatal.
  - After patch: softer fatal with more useful messaging.

{F4321048}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17576
2017-03-28 15:28:16 -07:00
epriestley
5f939dcce0 Re-run config validation from bin/search
Summary:
Ref T12450. Normally, we validate config when:

  - You restart the webserver.
  - You edit it with `bin/config set ...`.
  - You edit it with the web UI.

However, you can also change config by editing `local.json`, `some_env.conf.php`, a `SiteConfig` class, etc. In these cases, you may miss config warnings.

Explicitly re-run search config checks from `bin/search`, similar to the additional database checks we run from `bin/storage`, to try to produce a better error message if the user has made a configuration error.

Test Plan:
```
$ ./bin/search init
Usage Exception: Setting "cluster.search" is misconfigured: Invalid search engine type: elastic. Valid types are: elasticsearch, mysql.
```

Reviewers: chad, 20after4

Reviewed By: 20after4

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17574
2017-03-28 14:53:26 -07:00
epriestley
8879118b69 Fix a mid-air collision around SearchService roles
My D17571 didn't interact nicely with D17564, which added callsites for one of
the methods I removed.

Auditors: 20after4
2017-03-28 14:01:45 -07:00
epriestley
c40be811ea Fix isReadable() and isWritable() in SearchService
Summary:
Ref T12450. Minor cleanup:

  - setRoles() has no callers.
  - getRoles() has no callers (these two methods are leftovers from an earlier iteration of the change).
  - The `hasRole()` logic doesn't work since nothing calls `setRole()`.
  - `hasRole()` has only `isreadable/iswritable` as callers.
  - The `isReadable()/isWritable()` logic doesn't work since `hasRole()` doesn't work.

Instead, just check if there are any readable/writable hosts. `Host` already inherits its config from `Service` so this gets the same answer without any fuss.

Also add some read/write constants to make grepping this stuff a little easier.

Test Plan:
  - Grepped for all removed symbols, saw only newer-generation calls in `Host`.
  - See next diff for use of `isWritable()`.

Reviewers: chad, 20after4

Reviewed By: 20after4

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17571
2017-03-28 13:58:46 -07:00
Mukunda Modell
699228c73b Address some New Search Configuration Errata
Summary:
  [ ] Write an "Upgrading: ..." guidance task with narrow instructions for installs that are upgrading.
  [ ] Do we need to add an indexing activity (T11932) for installs with ElasticSearch?
  [ ] We should more clearly detail exactly which versions of ElasticSearch are supported (for example, is ElasticSearch <2 no longer supported)? From T9893 it seems like we may //only// have supported ElasticSearch <2 before, so are the two regions of support totally nonoverlapping and all ElasticSearch users will need to upgrade?
  [ ] Documentation should provide stronger guidance toward MySQL and away from Elastic for the vast majority of installs, because we've historically seen users choosing Elastic when they aren't actually trying to solve any specific problem.
  [ ] When users search for fulltext results in Maniphest and hit too many documents, the current behavior is approximately silent failure (see T12443). D17384 has also lowered the ceiling for ElasticSearch, although previous changes lowered it for MySQL search. We should not fail silently, and ideally should build toward T12003.
  [ ] D17384 added a new "keywords" field, but MySQL does not search it (I think?). The behavior should be as consistent across MySQL and Elastic as we can make it. Likely cleaner is giving "Project" objects a body, with "slugs" and "description" separated by newlines?
  [ ] `PhabricatorSearchEngineTestCase` is now pointless and only detects local misconfigurations.
  [ ] It would be nice to build a practical test suite instead, where we put specific documents into the index and then search for them. The upstream test could run against MySQL, and some `bin/search test` could run against a configured engine like ElasticSearch. This would make it easier to make sure that behavior was as uniform as possible across engine implementations.
  [ ] Does every assigned task now match "user" in ElasticSearch?
  [x] `PhabricatorElasticFulltextStorageEngine` has a `json_encode()` which should be `phutil_json_encode()`.
  [ ] `PhabricatorSearchService` throws an untranslated exception.
  [ ] When a search cluster is down, we probably don't degrade with much grace (unhandled exception)?
  [ ] I haven't run bin/search init, but bin/search index doesn't warn me that I may want to. This might be worth adding. The UI does warn me.
  [ ] bin/search init warns me that the index is "incorrect". It might be more clear to distinguish between "missing" and "incorrect", since it's more comforting to users to see "everything is as we expect, doing normal first-time setup now" than "something is wrong, fixing it".
  [ ] CLI message "Initializing search service "ElasticSearch"" does not end with a period, which is inconsistent with other UI messages.
  [ ] It might be nice to let bin/search commands like init and index select a specific service (or even service + host) to act on, as bin/storage --ref ... now does. You can generally get the result you want by fiddling with config.
  [ ] When a service isn't writable, bin/search init reports "Search cluster has no hosts for role "write".". This is accurate but does not provide guidance: it might be more useful to the user to explain "This service is not writable, so we're skipping index check for it.".
  [x] Even with write off for MySQL, bin/search index --type task --trace still updates MySQL, I think? I may be misreading the trace output. But this behavior doesn't make sense if it is the actual behavior, and it seems like reindexAbstractDocument() uses "all services", not "writable services", and the MySQL engine doesn't make sure it's writable before indexing.
  [x] Searching or user fails to find task Grant users tokens when a mention is created, suggesting that stemming is not working.
  [x] Searching for users finds that task, but fails to find a task containing "per user per month" in a comment, also suggesting that stemming is not working.
  [x] Searching for maniphest fails to find task maniphest.query elephant, suggesting that tokenization in ElasticSearch is not as good as the MySQL tokenization for these words (see D17330).
  [x] The "index incorrect" warning UI uses inconsistent title case.
  [x] The "index incorrect" warning UI could format the command to be run more cleanly (with addCommand(), I think).

refs T12450

Test Plan:
* Stared blankly at the code.
* Disabled 'write' role on mysql fulltext service.
* Edited a task, ran search indexer, verified that the mysql index wasn't being updated.

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin

Maniphest Tasks: T12450

Differential Revision: https://secure.phabricator.com/D17564
2017-03-28 20:19:38 +00:00
Mukunda Modell
e41c25de50 Support multiple fulltext search clusters with 'cluster.search' config
Summary:
The goal is to make fulltext search back-ends more extensible, configurable and robust.

When this is finished it will be possible to have multiple search storage back-ends and
potentially multiple instances of each.

Individual instances can be configured with roles such as 'read', 'write' which control
which hosts will receive writes to the index and which hosts will respond to queries.

These two roles make it possible to have any combination of:

* read-only
* write-only
* read-write
* disabled

This 'roles' mechanism is extensible to add new roles should that be needed in the future.

In addition to supporting multiple elasticsearch and mysql search instances, this refactors
the connection health monitoring infrastructure from PhabricatorDatabaseHealthRecord and
utilizes the same system for monitoring the health of elasticsearch nodes. This will
allow Wikimedia's phabricator to be redundant across data centers (mysql already is,
elasticsearch should be as well).

The real-world use-case I have in mind here is writing to two indexes (two elasticsearch clusters
in different data centers) but reading from only one. Then toggling the 'read' property when
we want to migrate to the other data center (and when we migrate from elasticsearch 2.x to 5.x)

Hopefully this is useful in the upstream as well.

Remaining TODO:

* test cases
* documentation

Test Plan:
(WARNING) This will most likely require the elasticsearch index to be deleted and re-created due to schema changes.

Tested with elasticsearch versions 2.4 and 5.2 using the following config:

```lang=json
  "cluster.search": [
    {
      "type": "elasticsearch",
      "hosts": [
        {
          "host": "localhost",
          "roles": { "read": true, "write": true }
        }
      ],
      "port": 9200,
      "protocol": "http",
      "path": "/phabricator",
      "version": 5
    },
    {
      "type": "mysql",
      "roles": { "write": true }
     }
  ]

Also deployed the same changes to Wikimedia's production Phabricator instance without any issues whatsoever.
```

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin, epriestley

Tags: #elasticsearch, #clusters, #wikimedia

Differential Revision: https://secure.phabricator.com/D17384
2017-03-26 08:16:47 +00:00
epriestley
b2cdebefea Fix two errors from the error logs
Summary: Found these in the `secure` error logs: one bad call, one bad column.

Test Plan: Searched for empty string. Double-checked method name.

Reviewers: chad

Reviewed By: chad

Differential Revision: https://secure.phabricator.com/D16948
2016-11-26 07:50:57 -08:00
epriestley
8c89fc38fc Allow persistent connections to be configured per database host
Summary: Ref T11044. Fixes T11672. In T11672, persistent connections seem to work fine, but they can require `max_connections` and other settings to be raised. Since most users don't need them, make them an advanced option.

Test Plan: Configured persistent connections, loaded some pages, observed persistent connections get used.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044, T11672

Differential Revision: https://secure.phabricator.com/D16913
2016-11-22 10:55:45 -08:00
epriestley
e6bfa1bd23 Remove "mysql.configuration-provider" configuration option
Summary:
Ref T11044. This was old Facebook cruft for reading configuration from SMC (and maybe doing some other questionable things). See D183.

(See also D175 for discussion of this from 2011.)

In modern Phabricator, you can subclass `SiteConfig` to provide dynamic configuration, and we do so in the Phacility cluster. This lets you change any config, and change in response to requests (e.g., for instancing) and is generally more powerful than this mechanism was.

This configuration provider theoretically let you roll your own replication or partitioning, but in practice I believe no one ever did, and no one ever could have anyway without more support in the upstream (for migrations, read-after-write, etc).

Test Plan:
  - Grepped for removed option.
  - Browsed around with clustering off.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16911
2016-11-22 09:24:46 -08:00
epriestley
4da74166fe When storage is partitioned, refuse to serve requests unless web and databases agree on partitioning
Summary:
Ref T11044. One popular tool in a modern operations environment is Puppet. The primary purpose of this tool is to randomly revert hosts to older or different configurations.

Introducing an element of chaotic unpredictability into operations trains staff to be on high alert at all times, rather than lulled into complacency by predictability or consistency.

When Puppet reverts a Phabricator host's configuration to an older version, we might start writing data to a lot of crazy places where it shouldn't go. This will create a big sticky mess that is virtually impossible to undo, mostly because we'll get two files with ID 123 or two tasks with ID 456 or whatever else and good luck with that.

Instead, after changing the partition layout, require `bin/storage partition` to be run. This writes a copy of the config everywhere.

Then, when we start serving web requests, make sure every database has the exact same config. This will foil Puppet by refusing to run requests on hosts it has reverted.

Test Plan:
  - Changed partition configuration.
  - Ran Phabricator.
  - FOILED!
  - Ran `bin/storage partition` to sync config.
  - Things worked again.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16910
2016-11-22 04:15:46 -08:00
epriestley
bac27fb403 Remove "mysql.implementation" configuration
Summary:
Ref T11044. Fixes T10931. This option has essentially never been useful for anything, and we've picked the best implementation for a long time (MySQLi if available, MySQL if not).

I am not aware of any reason to ever set this manually. If someone comes up with some bizarre but legitimate use case that I haven't thought of, we can modularize it.

Test Plan: Browsed around. Grepped for `mysql.implementation`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10931, T11044

Differential Revision: https://secure.phabricator.com/D16909
2016-11-22 04:15:34 -08:00
epriestley
bcfd515b32 Run all minor setup checks on all configured database hosts
Summary:
Fixes T10759. Fixes T11817. This runs all the general sanity/configuration checks on all the active servers.

None of these warnings are very important, and this doesn't change any logical stuff.

Depends on D16904.

Test Plan: Painstakingly triggered each warning, verified that they rendered correctly and that messages told me which host was affected.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10759, T11817

Differential Revision: https://secure.phabricator.com/D16905
2016-11-21 15:55:54 -08:00
epriestley
326d5bf800 Detect replicating masters and fatal (also, warn on nonreplicating replicas)
Summary:
Ref T10759. Check master/replica status during startup.

After D16903, this also means that we check this status after a database comes back online after being unreachable.

If a master is replicating, fatal (since this can do a million kinds of bad things).

If a replica is not replicating, warn (this just means the replica is behind so some data is at risk).

Also: if your masters were actually configured properly (mine weren't until this change detected it), we would throw away patches as we applied them, so they would only apply to the //first// master. Instead, properly apply all migration patches to all masters.

Test Plan:
  - Started Phabricator with a replicating master, got a fatal.
  - Stopped replication on a replica, got a warning.
  - With two non-replicating masters, upgraded storage.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10759

Differential Revision: https://secure.phabricator.com/D16904
2016-11-21 15:55:22 -08:00
epriestley
55e21565b5 Support application partitioning across multiple masters
Summary:
Ref T11044. I'm going to hold this until after the release cut, but I think it's good to go.

This allows installs to configure multiple masters in `cluster.databases` and partition applications across them (for example, put Maniphest on a dedicated database).

When we make a Maniphest connection we go look up which master we should be hitting first, then connect to it.

This has at least approximately been planned for many years, so the actual change is largely just making sure that your config makes sense.

Test Plan:
  - Configured `db001.epriestley.com` and `db002.epriestley.com` as master/master.
  - Partitioned applications between them.
  - Interacted with various applications, saw writes go to the correct host.
  - Viewed "Database Servers" and saw partitioning information.
  - Ran schema upgrades.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16876
2016-11-19 14:14:39 -08:00
epriestley
558d194302 Update bin/storage workflows to accommodate multiple masters
Summary: Depends on D16847. Ref T11044. This updates the remaining storage-related workflows from the CLI to accommodate multiple masters.

Test Plan:
  - Configured multiple masters.
  - Ran all `bin/storage` workflows.
  - Ran `arc unit --everything`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16848
2016-11-12 16:37:47 -08:00
epriestley
bc15eee3f2 Update SchemaQuery and the web UI to accommodate multiple master databases
Summary:
Depends on D16115. Ref T11044. In the brave new world of multiple masters, we need to check the schemata on each master when looking for missing storage patches, keys, schema changes, etc.

This realigns all the "check out what's up with that schema" calls to work for multiple hosts, and updates the web UI to include a "Server" column and allow you to browse per-server.

This doesn't update `bin/storage`, so it breaks things on its own (and unit tests probably won't pass). I'll update that in the next change.

Test Plan: Configured local environment in cluster mode with multiple masters, saw both hosts' status reported in web UI.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16847
2016-11-12 16:36:52 -08:00
epriestley
ecc598f18d Support multiple database masters and convert easy callers
Summary:
Ref T11044. This moves toward partitioned application databases:

  - You can define multiple masters.
  - Convert all the easily-convertible code to become multi-master aware.

This doesn't convert most of `bin/storage` or "Config > Database (Stuff)" yet, as both are quite involved. They still work for now, but only operate on the first master instead of all masters.

Test Plan: Configured multiple masters, browsed around, ran `bin/storage` commands, ran `bin/storage --host ...`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16115
2016-11-12 16:30:20 -08:00
epriestley
d0013d0898 Distinguish between unreachable cluster database hosts and missing MySQL databases
Summary:
Fixes T11577. When we connect to a host and try to select a database which does not exist, we currently treat it as though the host wasn't reachable.

This isn't correct, and prevents storage from being initialized while already in cluster mode, since the "config" database won't exist yet the first time we connect.

Instead, distinguish between `AphrontSchemaQueryException` (thrown on connection if the requested database is not present) and other errors.

Test Plan:
  - Put Phabricator into cluster database mode (`cluster.databases = ...`).
  - Swapped `storage.default-namespace` to force initialization of a new install.
  - Ran `bin/storage upgrade`.
    - Before patch: Immediate fatal about unreachablility.
    - After patch: Database initialized.
  - Also ran initialization steps in tranditional single-host mode (`cluster.databases` empty, `mysql.host` configured).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11577

Differential Revision: https://secure.phabricator.com/D16489
2016-09-02 08:23:21 -07:00
epriestley
01040e4573 Correctly disinguish between "0 seconds behind master" and "not replicating"
Summary: Fixes T11159. We get two different values here (`NULL` and `0`) with different meanings.

Test Plan:
  - Ran `STOP SLAVE;`.
  - Saw this:

{F1710181}

  - Ran `START SLAVE;`.
  - Back to normal.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11159

Differential Revision: https://secure.phabricator.com/D16225
2016-07-03 18:14:07 -07:00
epriestley
ac35246d0d Never sever non-cluster database; write more read-only documentation
Summary:
Ref T4571. Write more of the missing documentation sections and clarify a few things.

Since the "replicating master" check needs a special permission, imposes a performance penalty, is probably very difficult to misconfigure, and likely not a big deal anyway, just drop the idea of trying to automatically detect + prevent it. We still show if it's an issue on the status page, provided we have permission to check.

When you don't have any cluster databases configured, never stop trying to connect to the default master database. We might want to do this eventually as load reduction, but just don't muddy the waters too much for now while things stabilize.

Test Plan:
  - Tested functionality in cluster, non-cluster, and degraded-cluster modes.
  - Used status console to monitor a health check cycle.
  - Read docs.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15679
2016-04-11 08:44:11 -07:00
epriestley
ebff07d019 Automatically sever databases after prolonged unreachability
Summary:
Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We do send a little bit of traffic still, and if the master recovers we'll recover back into normal mode seeing several connections in a row succeed.

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

  - We do at most one health check every 3 seconds.
  - If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
  - If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.

Test Plan:
  - Configured a bad `master`.
  - Browsed around for a bit, initially saw "unrechable master" errors.
  - After about 15 seconds, saw "major interruption" errors instead.
  - Fixed the config for `master`.
  - Browsed around for a while longer.
  - After about 15 seconds, things recovered.
  - Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good:

{F1213397}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15677
2016-04-11 08:43:52 -07:00
epriestley
146fb646f9 Automatically degrade to read-only mode when unable to connect to the master
Summary:
Ref T4571. If we fail to connect to the master, automatically try to degrade into a temporary read-only mode ("UNREACHABLE") for the remainder of the request, if possible.

If the request was something like "load the homepage", that'll work fine. If it was something like "submit a comment", there's nothing we can do and we just have to fail.

Detecting this condition imposes a performance penalty: every request checks the connection and gives the database a long time to respond, since we don't want to drop writes unless we have to. So the degraded mode works, but it's really slow, and may perpetuate the problem if the root issue is load-related.

This lays the groundwork for improving this case by degrading futher into a "SEVERED" mode which will persist across requests. In the future, if several requests in a short period of time fail, we'll sever the database host and refuse to try to connect to it for a little while, connecting directly to replicas instead (basically, we're "health checking" the master, like a load balancer would health check a web application server). This will give us a better (much faster) degraded mode in a major service disruption, and reduce load on the master if the root cause is load-related, giving it a better chance of recovering on its own.

Test Plan:
  - Disabled master in config by changing the host/username, got degraded automatically to UNREACAHBLE mode immediately.
  - Faked full SEVERED mode, requests hit replicas and put me in the mode properly.
  - Made stuff work, hit some good pages.
  - Hit some non-cluster pages.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15674
2016-04-10 12:20:13 -07:00
epriestley
e0a8cac703 When no master database is configured, automatically degrade to read-only mode
Summary: Ref T4571. If `cluster.databases` is configured but only has replicas, implicitly drop to read-only mode and send writes to a replica.

Test Plan:
  - Disabled the `master`, saw Phabricator automatically degrade into read-only mode against replicas.
  - (Also tested: explicit read-only mode, non-cluster mode, properly configured cluster mode).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15672
2016-04-10 12:19:55 -07:00
epriestley
c178f29cdb Use new first-class MySQL timeout support in Phabricator
Summary: Fixes T6710. After D15669, we support a proper timeout parameter, so we don't need this hack anymore.

Test Plan: See D15669: forced a MySQL connector, set a low timeout, set a bad database, saw fast failures.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6710

Differential Revision: https://secure.phabricator.com/D15670
2016-04-10 12:19:00 -07:00
epriestley
6a4a9bb2d2 When cluster.databases is configured, read the master connection from it
Summary:
Ref T4571. Ref T10759. Ref T10758. This isn't complete, but gets most of the job done:

  - When `cluster.databases` is set up, most things ignore `mysql.host` now.
  - You can `bin/storage upgrade` and stuff works.
  - You can browse around in the web UI and stuff works.

There's still a lot of weird tricky stuff to navigate, and this has real no advantages over configuring a single server yet (no automatic failover, etc).

Test Plan:
  - Configured `cluster.databases` to point at my `t1.micro` hosts in EC2 (master + replica).
  - Ran `bin/storage upgrade`, got a new install setup on them properly.
  - Survived setup warnings, browsed around.
  - Switched back to local config, ran `bin/storage upgrade`, browsed around, went through setup checks.
  - Intentionally broke config (bad hosts, no masters) and things seemed to react reasonably well.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571, T10758, T10759

Differential Revision: https://secure.phabricator.com/D15668
2016-04-10 12:18:42 -07:00
epriestley
0439645d5b Add a "Database Cluster Status" console in Config
Summary: Ref T4571. The configuration option still doesn't do anything, but add a status panel for basic setup monitoring.

Test Plan:
Here's what a good version looks like:

{F1212291}

Also faked most of the errors it can detect and got helpful diagnostic messages like this:

{F1212292}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15667
2016-04-09 20:34:13 -07:00
epriestley
3f51b78539 Lay cluster.databases configuration groundwork for database clustering
Summary:
Ref T4571. This adds a new option which allows you to upgrade your one-host configuration to a multi-host configuration by configuring it.

Doing this currently does nothing. I wrote a lot of words about what it is //supposed// to do in the future, though.

Test Plan:
  - Tried to configure the option in all the possible bad ways, got errors.
  - Read documentation.

Reviewers: chad

Reviewed By: chad

Subscribers: eadler

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15663
2016-04-09 13:41:16 -07:00