1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2025-01-11 23:31:03 +01:00
phorge-phorge/src/applications/cache
epriestley ebff07d019 Automatically sever databases after prolonged unreachability
Summary:
Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We do send a little bit of traffic still, and if the master recovers we'll recover back into normal mode seeing several connections in a row succeed.

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

  - We do at most one health check every 3 seconds.
  - If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
  - If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.

Test Plan:
  - Configured a bad `master`.
  - Browsed around for a bit, initially saw "unrechable master" errors.
  - After about 15 seconds, saw "major interruption" errors instead.
  - Fixed the config for `master`.
  - Browsed around for a while longer.
  - After about 15 seconds, things recovered.
  - Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good:

{F1213397}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15677
2016-04-11 08:43:52 -07:00
..
__tests__ Introduce a request cache mechanism 2015-06-04 17:27:31 -07:00
garbagecollector Provide bin/garbage for interacting with garbage collection 2015-10-02 09:17:24 -07:00
management Add dialog to purge opcode/data caches 2015-09-10 14:19:02 -07:00
spec Write 500 words on how to restart webservers 2015-12-02 09:16:10 -08:00
storage Fix visiblity of LiskDAO::getConfiguration() 2015-01-14 06:54:13 +11:00
PhabricatorCaches.php Automatically sever databases after prolonged unreachability 2016-04-11 08:43:52 -07:00
PhabricatorKeyValueDatabaseCache.php Add a cluster.read-only option 2016-04-09 13:40:47 -07:00