phorge-phorge

mirror of https://we.phorge.it/source/phorge.git synced 2025-03-10 11:24:49 +01:00

Author	SHA1	Message	Date
epriestley	58eef68b7c	Rough cut of repository cluster status panel Summary: Ref T4292. This adds some very basic cluster/device data to the new management view. Nothing interesting yet. Also deal with disabled bindings a little more cleanly. Test Plan: {F1214619} Reviewers: chad Reviewed By: chad Maniphest Tasks: T4292 Differential Revision: https://secure.phabricator.com/D15685	2016-04-12 05:38:10 -07:00
epriestley	ac35246d0d	Never sever non-cluster database; write more read-only documentation Summary: Ref T4571. Write more of the missing documentation sections and clarify a few things. Since the "replicating master" check needs a special permission, imposes a performance penalty, is probably very difficult to misconfigure, and likely not a big deal anyway, just drop the idea of trying to automatically detect + prevent it. We still show if it's an issue on the status page, provided we have permission to check. When you don't have any cluster databases configured, never stop trying to connect to the default master database. We might want to do this eventually as load reduction, but just don't muddy the waters too much for now while things stabilize. Test Plan: - Tested functionality in cluster, non-cluster, and degraded-cluster modes. - Used status console to monitor a health check cycle. - Read docs. Reviewers: chad Reviewed By: chad Maniphest Tasks: T4571 Differential Revision: https://secure.phabricator.com/D15679	2016-04-11 08:44:11 -07:00
epriestley	ebff07d019	Automatically sever databases after prolonged unreachability Summary: Ref T4571. When a database goes down briefly, we fall back to replicas. However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related). Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database). We do send a little bit of traffic still, and if the master recovers we'll recover back into normal mode after seeing several connections in a row succeed. This is similar to what most load balancers do when pulling web servers in and out of pools. For now, the specific numbers are: - We do at most one health check every 3 seconds. - If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes). - If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it. Test Plan: - Configured a bad `master`. - Browsed around for a bit, initially saw "unreachable master" errors. - After about 15 seconds, saw "major interruption" errors instead. - Fixed the config for `master`. - Browsed around for a while longer. - After about 15 seconds, things recovered. - Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good: {F1213397} Reviewers: chad Reviewed By: chad Maniphest Tasks: T4571 Differential Revision: https://secure.phabricator.com/D15677	2016-04-11 08:43:52 -07:00
epriestley	0439645d5b	Add a "Database Cluster Status" console in Config Summary: Ref T4571. The configuration option still doesn't do anything, but add a status panel for basic setup monitoring. Test Plan: Here's what a good version looks like: {F1212291} Also faked most of the errors it can detect and got helpful diagnostic messages like this: {F1212292} Reviewers: chad Reviewed By: chad Maniphest Tasks: T4571 Differential Revision: https://secure.phabricator.com/D15667	2016-04-09 20:34:13 -07:00

4 commits
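
The health-check rule described in ebff07d019 (at most one check every 3 seconds, and a mode switch after 5 consecutive checks that disagree with the current state) can be sketched roughly as follows. This is an illustrative Python sketch, not the code from the commit: the class name `HealthRecord`, its fields, and the `record` method are hypothetical, and the actual implementation in the repository is PHP and may track recent checks differently.

```python
import time
from dataclasses import dataclass
from typing import Optional

CHECK_INTERVAL = 3.0   # seconds between health checks (figure from the commit message)
REQUIRED_STREAK = 5    # consecutive disagreeing checks needed to switch modes


@dataclass
class HealthRecord:
    """Hypothetical tracker for one database's healthy/severed state."""
    healthy: bool = True
    streak: int = 0                      # consecutive checks contradicting `healthy`
    last_check: float = float("-inf")    # time of the last counted check

    def record(self, ok: bool, now: Optional[float] = None) -> bool:
        """Record a connection result and return the current health state."""
        if now is None:
            now = time.monotonic()
        # Rate limit: results arriving within CHECK_INTERVAL are not counted.
        if now - self.last_check < CHECK_INTERVAL:
            return self.healthy
        self.last_check = now
        if ok == self.healthy:
            # The result agrees with the current state, so the streak resets.
            self.streak = 0
        else:
            self.streak += 1
            if self.streak >= REQUIRED_STREAK:
                # Enough consecutive disagreement: sever or un-sever.
                self.healthy = ok
                self.streak = 0
        return self.healthy
```

With one counted check every 3 seconds and 5 checks required, a switch takes on the order of 15 seconds, which matches the figure quoted in the commit message.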