1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2024-12-04 04:32:43 +01:00
Commit graph

9 commits

Author SHA1 Message Date
epriestley
d0013d0898 Distinguish between unreachable cluster database hosts and missing MySQL databases
Summary:
Fixes T11577. When we connect to a host and try to select a database which does not exist, we currently treat it as though the host wasn't reachable.

This isn't correct, and prevents storage from being initialized while already in cluster mode, since the "config" database won't exist yet the first time we connect.

Instead, distinguish between `AphrontSchemaQueryException` (thrown on connection if the requested database is not present) and other errors.

Test Plan:
  - Put Phabricator into cluster database mode (`cluster.databases = ...`).
  - Swapped `storage.default-namespace` to force initialization of a new install.
  - Ran `bin/storage upgrade`.
    - Before patch: Immediate fatal about unreachablility.
    - After patch: Database initialized.
  - Also ran initialization steps in tranditional single-host mode (`cluster.databases` empty, `mysql.host` configured).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11577

Differential Revision: https://secure.phabricator.com/D16489
2016-09-02 08:23:21 -07:00
epriestley
01040e4573 Correctly disinguish between "0 seconds behind master" and "not replicating"
Summary: Fixes T11159. We get two different values here (`NULL` and `0`) with different meanings.

Test Plan:
  - Ran `STOP SLAVE;`.
  - Saw this:

{F1710181}

  - Ran `START SLAVE;`.
  - Back to normal.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11159

Differential Revision: https://secure.phabricator.com/D16225
2016-07-03 18:14:07 -07:00
epriestley
ac35246d0d Never sever non-cluster database; write more read-only documentation
Summary:
Ref T4571. Write more of the missing documentation sections and clarify a few things.

Since the "replicating master" check needs a special permission, imposes a performance penalty, is probably very difficult to misconfigure, and likely not a big deal anyway, just drop the idea of trying to automatically detect + prevent it. We still show if it's an issue on the status page, provided we have permission to check.

When you don't have any cluster databases configured, never stop trying to connect to the default master database. We might want to do this eventually as load reduction, but just don't muddy the waters too much for now while things stabilize.

Test Plan:
  - Tested functionality in cluster, non-cluster, and degraded-cluster modes.
  - Used status console to monitor a health check cycle.
  - Read docs.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15679
2016-04-11 08:44:11 -07:00
epriestley
ebff07d019 Automatically sever databases after prolonged unreachability
Summary:
Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We do send a little bit of traffic still, and if the master recovers we'll recover back into normal mode seeing several connections in a row succeed.

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

  - We do at most one health check every 3 seconds.
  - If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
  - If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.

Test Plan:
  - Configured a bad `master`.
  - Browsed around for a bit, initially saw "unrechable master" errors.
  - After about 15 seconds, saw "major interruption" errors instead.
  - Fixed the config for `master`.
  - Browsed around for a while longer.
  - After about 15 seconds, things recovered.
  - Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good:

{F1213397}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15677
2016-04-11 08:43:52 -07:00
epriestley
146fb646f9 Automatically degrade to read-only mode when unable to connect to the master
Summary:
Ref T4571. If we fail to connect to the master, automatically try to degrade into a temporary read-only mode ("UNREACHABLE") for the remainder of the request, if possible.

If the request was something like "load the homepage", that'll work fine. If it was something like "submit a comment", there's nothing we can do and we just have to fail.

Detecting this condition imposes a performance penalty: every request checks the connection and gives the database a long time to respond, since we don't want to drop writes unless we have to. So the degraded mode works, but it's really slow, and may perpetuate the problem if the root issue is load-related.

This lays the groundwork for improving this case by degrading futher into a "SEVERED" mode which will persist across requests. In the future, if several requests in a short period of time fail, we'll sever the database host and refuse to try to connect to it for a little while, connecting directly to replicas instead (basically, we're "health checking" the master, like a load balancer would health check a web application server). This will give us a better (much faster) degraded mode in a major service disruption, and reduce load on the master if the root cause is load-related, giving it a better chance of recovering on its own.

Test Plan:
  - Disabled master in config by changing the host/username, got degraded automatically to UNREACAHBLE mode immediately.
  - Faked full SEVERED mode, requests hit replicas and put me in the mode properly.
  - Made stuff work, hit some good pages.
  - Hit some non-cluster pages.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15674
2016-04-10 12:20:13 -07:00
epriestley
e0a8cac703 When no master database is configured, automatically degrade to read-only mode
Summary: Ref T4571. If `cluster.databases` is configured but only has replicas, implicitly drop to read-only mode and send writes to a replica.

Test Plan:
  - Disabled the `master`, saw Phabricator automatically degrade into read-only mode against replicas.
  - (Also tested: explicit read-only mode, non-cluster mode, properly configured cluster mode).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15672
2016-04-10 12:19:55 -07:00
epriestley
c178f29cdb Use new first-class MySQL timeout support in Phabricator
Summary: Fixes T6710. After D15669, we support a proper timeout parameter, so we don't need this hack anymore.

Test Plan: See D15669: forced a MySQL connector, set a low timeout, set a bad database, saw fast failures.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6710

Differential Revision: https://secure.phabricator.com/D15670
2016-04-10 12:19:00 -07:00
epriestley
6a4a9bb2d2 When cluster.databases is configured, read the master connection from it
Summary:
Ref T4571. Ref T10759. Ref T10758. This isn't complete, but gets most of the job done:

  - When `cluster.databases` is set up, most things ignore `mysql.host` now.
  - You can `bin/storage upgrade` and stuff works.
  - You can browse around in the web UI and stuff works.

There's still a lot of weird tricky stuff to navigate, and this has real no advantages over configuring a single server yet (no automatic failover, etc).

Test Plan:
  - Configured `cluster.databases` to point at my `t1.micro` hosts in EC2 (master + replica).
  - Ran `bin/storage upgrade`, got a new install setup on them properly.
  - Survived setup warnings, browsed around.
  - Switched back to local config, ran `bin/storage upgrade`, browsed around, went through setup checks.
  - Intentionally broke config (bad hosts, no masters) and things seemed to react reasonably well.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571, T10758, T10759

Differential Revision: https://secure.phabricator.com/D15668
2016-04-10 12:18:42 -07:00
epriestley
0439645d5b Add a "Database Cluster Status" console in Config
Summary: Ref T4571. The configuration option still doesn't do anything, but add a status panel for basic setup monitoring.

Test Plan:
Here's what a good version looks like:

{F1212291}

Also faked most of the errors it can detect and got helpful diagnostic messages like this:

{F1212292}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15667
2016-04-09 20:34:13 -07:00