<?php
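/**
 * Reference to a database host in a Phabricator cluster (or the single
 * database in a non-cluster setup).
 *
 * A ref bundles connection details (host, port, credentials), cluster role
 * (master or replica, application partition), and observed runtime state
 * (connection latency, replication status, health check history).
 *
 * Rough usage sketch (the application name "maniphest" and the variable
 * $database_name are illustrative; most callers go through higher-level
 * connection management rather than touching refs directly):
 *
 *   $ref = PhabricatorDatabaseRef::getMasterDatabaseRefForApplication(
 *     'maniphest');
 *   $conn = $ref->newApplicationConnection($database_name);
 */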
final class PhabricatorDatabaseRef
extends Phobject {
const STATUS_OKAY = 'okay';
const STATUS_FAIL = 'fail';
const STATUS_AUTH = 'auth';
const STATUS_REPLICATION_CLIENT = 'replication-client';
const REPLICATION_OKAY = 'okay';
const REPLICATION_MASTER_REPLICA = 'master-replica';
const REPLICATION_REPLICA_NONE = 'replica-none';
const REPLICATION_SLOW = 'replica-slow';
const REPLICATION_NOT_REPLICATING = 'not-replicating';
const KEY_REFS = 'cluster.db.refs';
const KEY_INDIVIDUAL = 'cluster.db.individual';
private $host;
private $port;
private $user;
private $pass;
private $disabled;
private $isMaster;
private $isIndividual;
private $connectionLatency;
private $connectionStatus;
private $connectionMessage;
private $replicaStatus;
private $replicaMessage;
private $replicaDelay;
private $healthRecord;
private $didFailToConnect;
private $isDefaultPartition;
private $applicationMap = array();
private $masterRef;
private $replicaRefs = array();
public function setHost($host) {
$this->host = $host;
return $this;
}
public function getHost() {
return $this->host;
}
public function setPort($port) {
$this->port = $port;
return $this;
}
public function getPort() {
return $this->port;
}
public function setUser($user) {
$this->user = $user;
return $this;
}
public function getUser() {
return $this->user;
}
public function setPass(PhutilOpaqueEnvelope $pass) {
$this->pass = $pass;
return $this;
}
public function getPass() {
return $this->pass;
}
public function setIsMaster($is_master) {
$this->isMaster = $is_master;
return $this;
}
public function getIsMaster() {
return $this->isMaster;
}
public function setDisabled($disabled) {
$this->disabled = $disabled;
return $this;
}
public function getDisabled() {
return $this->disabled;
}
public function setConnectionLatency($connection_latency) {
$this->connectionLatency = $connection_latency;
return $this;
}
public function getConnectionLatency() {
return $this->connectionLatency;
}
public function setConnectionStatus($connection_status) {
$this->connectionStatus = $connection_status;
return $this;
}
public function getConnectionStatus() {
if ($this->connectionStatus === null) {
throw new PhutilInvalidStateException('queryAll');
}
return $this->connectionStatus;
}
public function setConnectionMessage($connection_message) {
$this->connectionMessage = $connection_message;
return $this;
}
public function getConnectionMessage() {
return $this->connectionMessage;
}
public function setReplicaStatus($replica_status) {
$this->replicaStatus = $replica_status;
return $this;
}
public function getReplicaStatus() {
return $this->replicaStatus;
}
public function setReplicaMessage($replica_message) {
$this->replicaMessage = $replica_message;
return $this;
}
public function getReplicaMessage() {
return $this->replicaMessage;
}
public function setReplicaDelay($replica_delay) {
$this->replicaDelay = $replica_delay;
return $this;
}
public function getReplicaDelay() {
return $this->replicaDelay;
}
public function setIsIndividual($is_individual) {
$this->isIndividual = $is_individual;
return $this;
}
public function getIsIndividual() {
return $this->isIndividual;
}
public function setIsDefaultPartition($is_default_partition) {
$this->isDefaultPartition = $is_default_partition;
return $this;
}
public function getIsDefaultPartition() {
return $this->isDefaultPartition;
}
public function setApplicationMap(array $application_map) {
$this->applicationMap = $application_map;
return $this;
}
public function getApplicationMap() {
return $this->applicationMap;
}
public function setMasterRef(PhabricatorDatabaseRef $master_ref) {
$this->masterRef = $master_ref;
return $this;
}
public function getMasterRef() {
return $this->masterRef;
}
public function addReplicaRef(PhabricatorDatabaseRef $replica_ref) {
$this->replicaRefs[] = $replica_ref;
return $this;
}
public function getReplicaRefs() {
return $this->replicaRefs;
}
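  /**
   * Return a key which uniquely identifies this host: "host:port", or just
   * the host if no port is configured.
   */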
public function getRefKey() {
$host = $this->getHost();
$port = $this->getPort();
    if ($port !== null && strlen($port)) {
return "{$host}:{$port}";
}
return $host;
}
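  /**
   * Map of connection status constants to display properties (icon, color,
   * label) for rendering status interfaces.
   */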
public static function getConnectionStatusMap() {
return array(
self::STATUS_OKAY => array(
'icon' => 'fa-exchange',
'color' => 'green',
'label' => pht('Okay'),
),
self::STATUS_FAIL => array(
'icon' => 'fa-times',
'color' => 'red',
'label' => pht('Failed'),
),
self::STATUS_AUTH => array(
'icon' => 'fa-key',
'color' => 'red',
'label' => pht('Invalid Credentials'),
),
self::STATUS_REPLICATION_CLIENT => array(
'icon' => 'fa-eye-slash',
'color' => 'yellow',
'label' => pht('Missing Permission'),
),
);
}
public static function getReplicaStatusMap() {
return array(
self::REPLICATION_OKAY => array(
'icon' => 'fa-download',
'color' => 'green',
'label' => pht('Okay'),
),
self::REPLICATION_MASTER_REPLICA => array(
'icon' => 'fa-database',
'color' => 'red',
'label' => pht('Replicating Master'),
),
self::REPLICATION_REPLICA_NONE => array(
'icon' => 'fa-download',
'color' => 'red',
'label' => pht('Not A Replica'),
),
self::REPLICATION_SLOW => array(
'icon' => 'fa-hourglass',
'color' => 'red',
'label' => pht('Slow Replication'),
),
self::REPLICATION_NOT_REPLICATING => array(
'icon' => 'fa-exclamation-triangle',
'color' => 'red',
'label' => pht('Not Replicating'),
),
);
}
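  /**
   * Load refs for all configured cluster databases, memoized in the request
   * cache so the configuration is parsed at most once per request.
   */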
public static function getClusterRefs() {
$cache = PhabricatorCaches::getRequestCache();
$refs = $cache->getKey(self::KEY_REFS);
if (!$refs) {
$refs = self::newRefs();
$cache->setKey(self::KEY_REFS, $refs);
}
return $refs;
}
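  /**
   * Load the ref for a non-cluster ("individual") database setup, memoized
   * in the request cache.
   */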
public static function getLiveIndividualRef() {
$cache = PhabricatorCaches::getRequestCache();
$ref = $cache->getKey(self::KEY_INDIVIDUAL);
if (!$ref) {
$ref = self::newIndividualRef();
$cache->setKey(self::KEY_INDIVIDUAL, $ref);
}
return $ref;
}
public static function newRefs() {
$default_port = PhabricatorEnv::getEnvConfig('mysql.port');
$default_port = nonempty($default_port, 3306);
$default_user = PhabricatorEnv::getEnvConfig('mysql.user');
$default_pass = PhabricatorEnv::getEnvConfig('mysql.pass');
$default_pass = new PhutilOpaqueEnvelope($default_pass);
$config = PhabricatorEnv::getEnvConfig('cluster.databases');
return id(new PhabricatorDatabaseRefParser())
->setDefaultPort($default_port)
->setDefaultUser($default_user)
->setDefaultPass($default_pass)
->newRefs($config);
}
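  /**
   * Probe every configured database host: open a management connection,
   * measure latency, and classify connection and replication health. Hosts
   * which are disabled in configuration are skipped.
   */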
public static function queryAll() {
$refs = self::newRefs();
foreach ($refs as $ref) {
if ($ref->getDisabled()) {
continue;
}
$conn = $ref->newManagementConnection();
$t_start = microtime(true);
$replica_status = false;
try {
$replica_status = queryfx_one($conn, 'SHOW SLAVE STATUS');
$ref->setConnectionStatus(self::STATUS_OKAY);
} catch (AphrontAccessDeniedQueryException $ex) {
$ref->setConnectionStatus(self::STATUS_REPLICATION_CLIENT);
$ref->setConnectionMessage(
pht(
'No permission to run "SHOW SLAVE STATUS". Grant this user '.
'"REPLICATION CLIENT" permission to allow Phabricator to '.
'monitor replica health.'));
} catch (AphrontInvalidCredentialsQueryException $ex) {
$ref->setConnectionStatus(self::STATUS_AUTH);
$ref->setConnectionMessage($ex->getMessage());
} catch (AphrontQueryException $ex) {
$ref->setConnectionStatus(self::STATUS_FAIL);
        $class = get_class($ex);
        $message = $ex->getMessage();
        $ref->setConnectionMessage(
          pht(
            '%s: %s',
            $class,
            $message));
}
$t_end = microtime(true);
$ref->setConnectionLatency($t_end - $t_start);
if ($replica_status !== false) {
$is_replica = (bool)$replica_status;
if ($ref->getIsMaster() && $is_replica) {
$ref->setReplicaStatus(self::REPLICATION_MASTER_REPLICA);
$ref->setReplicaMessage(
pht(
'This host has a "master" role, but is replicating data from '.
'another host ("%s")!',
idx($replica_status, 'Master_Host')));
} else if (!$ref->getIsMaster() && !$is_replica) {
$ref->setReplicaStatus(self::REPLICATION_REPLICA_NONE);
$ref->setReplicaMessage(
pht(
'This host has a "replica" role, but is not replicating data '.
'from a master (no output from "SHOW SLAVE STATUS").'));
} else {
$ref->setReplicaStatus(self::REPLICATION_OKAY);
}
if ($is_replica) {
$latency = idx($replica_status, 'Seconds_Behind_Master');
          if ($latency === null || !strlen($latency)) {
$ref->setReplicaStatus(self::REPLICATION_NOT_REPLICATING);
} else {
$latency = (int)$latency;
$ref->setReplicaDelay($latency);
if ($latency > 30) {
$ref->setReplicaStatus(self::REPLICATION_SLOW);
$ref->setReplicaMessage(
pht(
'This replica is lagging far behind the master. Data is at '.
'risk!'));
}
}
}
}
}
return $refs;
}
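  /**
   * Open a cheap administrative connection (no retries, short timeout),
   * suitable for health checks and status queries.
   */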
public function newManagementConnection() {
return $this->newConnection(
array(
'retries' => 0,
'timeout' => 2,
));
}
public function newApplicationConnection($database) {
return $this->newConnection(
array(
'database' => $database,
));
}
public function isSevered() {
// If we only have an individual database, never sever our connection to
// it, at least for now. It's possible that using the same severing rules
// might eventually make sense to help alleviate load-related failures,
// but we should wait for all the cluster stuff to stabilize first.
if ($this->getIsIndividual()) {
return false;
}
if ($this->didFailToConnect) {
return true;
}
$record = $this->getHealthRecord();
$is_healthy = $record->getIsHealthy();
if (!$is_healthy) {
return true;
}
return false;
}
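  /**
   * Test whether a connection to this host can actually be opened. If the
   * health record says a check is due, the result also counts as a health
   * check sample, moving the host toward the severed or healthy state.
   */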
public function isReachable(AphrontDatabaseConnection $connection) {
$record = $this->getHealthRecord();
$should_check = $record->getShouldCheck();
if ($this->isSevered() && !$should_check) {
return false;
}
try {
$connection->openConnection();
$reachable = true;
} catch (AphrontSchemaQueryException $ex) {
// We get one of these if the database we're trying to select does not
// exist. In this case, just re-throw the exception. This is expected
// during first-time setup, when databases like "config" will not exist
// yet.
throw $ex;
} catch (Exception $ex) {
$reachable = false;
}
if ($should_check) {
$record->didHealthCheck($reachable);
}
if (!$reachable) {
$this->didFailToConnect = true;
}
return $reachable;
}
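  /**
   * Run a health check against this host if one is currently due. The check
   * itself happens as a side effect of isReachable().
   */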
public function checkHealth() {
$health = $this->getHealthRecord();
$should_check = $health->getShouldCheck();
if ($should_check) {
// This does an implicit health update.
$connection = $this->newManagementConnection();
$this->isReachable($connection);
}
return $this;
}
public function getHealthRecord() {
if (!$this->healthRecord) {
$this->healthRecord = new PhabricatorDatabaseHealthRecord($this);
}
return $this->healthRecord;
}
public static function getActiveDatabaseRefs() {
$refs = array();
foreach (self::getMasterDatabaseRefs() as $ref) {
$refs[] = $ref;
}
foreach (self::getReplicaDatabaseRefs() as $ref) {
$refs[] = $ref;
}
return $refs;
}
public static function getAllMasterDatabaseRefs() {
$refs = self::getClusterRefs();
if (!$refs) {
return array(self::getLiveIndividualRef());
}
$masters = array();
    foreach ($refs as $ref) {
      if ($ref->getIsMaster()) {
        $masters[] = $ref;
      }
    }
return $masters;
}
public static function getMasterDatabaseRefs() {
$refs = self::getAllMasterDatabaseRefs();
return self::getEnabledRefs($refs);
}
public function isApplicationHost($database) {
return isset($this->applicationMap[$database]);
}
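  /**
   * Resolve the master for an application: prefer a master whose partition
   * explicitly includes the application, then fall back to the master for
   * the default partition.
   */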
public static function getMasterDatabaseRefForApplication($application) {
$masters = self::getMasterDatabaseRefs();
$application_master = null;
$default_master = null;
foreach ($masters as $master) {
if ($master->isApplicationHost($application)) {
$application_master = $master;
break;
}
if ($master->getIsDefaultPartition()) {
$default_master = $master;
}
}
if ($application_master) {
$masters = array($application_master);
} else if ($default_master) {
$masters = array($default_master);
} else {
$masters = array();
}
$masters = self::getEnabledRefs($masters);
$master = head($masters);
return $master;
}
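  /**
   * Build the ref for a non-cluster setup from the
   * "mysql.configuration-provider", so single-host installs share the same
   * code paths as cluster configurations.
   */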
public static function newIndividualRef() {
$conf = PhabricatorEnv::newObjectFromConfig(
'mysql.configuration-provider',
array(null, 'w', null));
return id(new self())
->setHost($conf->getHost())
->setPort($conf->getPort())
->setUser($conf->getUser())
->setPass($conf->getPassword())
->setIsIndividual(true)
->setIsMaster(true);
}
public static function getAllReplicaDatabaseRefs() {
$refs = self::getClusterRefs();
if (!$refs) {
return array();
}
$replicas = array();
foreach ($refs as $ref) {
if ($ref->getIsMaster()) {
continue;
}
$replicas[] = $ref;
}
return $replicas;
}
public static function getReplicaDatabaseRefs() {
$refs = self::getAllReplicaDatabaseRefs();
return self::getEnabledRefs($refs);
}
private static function getEnabledRefs(array $refs) {
foreach ($refs as $key => $ref) {
if ($ref->getDisabled()) {
unset($refs[$key]);
}
}
return $refs;
}
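  /**
   * Resolve a replica for an application, mirroring master resolution:
   * prefer replicas of the application's partition master, then fall back
   * to replicas of the default partition master.
   */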
public static function getReplicaDatabaseRefForApplication($application) {
$replicas = self::getReplicaDatabaseRefs();
$application_replicas = array();
$default_replicas = array();
foreach ($replicas as $replica) {
      $master = $replica->getMasterRef();
if ($master->isApplicationHost($application)) {
$application_replicas[] = $replica;
}
if ($master->getIsDefaultPartition()) {
$default_replicas[] = $replica;
}
}
if ($application_replicas) {
$replicas = $application_replicas;
} else {
$replicas = $default_replicas;
}
$replicas = self::getEnabledRefs($replicas);
// TODO: We may have multiple replicas to choose from, and could make
// more of an effort to pick the "best" one here instead of always
// picking the first one. Once we've picked one, we should try to use
// the same replica for the rest of the request, though.
return head($replicas);
}
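  /**
   * Construct a connection for this host. When the host currently looks
   * unhealthy, timeouts and retries are reduced so requests fail fast
   * instead of hammering it.
   */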
private function newConnection(array $options) {
// If we believe the database is unhealthy, don't spend as much time
// trying to connect to it, since it's likely to continue to fail and
// hammering it can only make the problem worse.
$record = $this->getHealthRecord();
if ($record->getIsHealthy()) {
$default_retries = 3;
$default_timeout = 10;
} else {
$default_retries = 0;
$default_timeout = 2;
}
$spec = $options + array(
'user' => $this->getUser(),
'pass' => $this->getPass(),
'host' => $this->getHost(),
'port' => $this->getPort(),
'database' => null,
'retries' => $default_retries,
'timeout' => $default_timeout,
);
return PhabricatorEnv::newObjectFromConfig(
'mysql.implementation',
array(
$spec,
));
}
}