mirror of
https://we.phorge.it/source/phorge.git
synced 2024-11-26 16:52:41 +01:00
Make "bin/repository thaw" workflow more clear when devices are disabled
Summary: Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to: - demote all the devices; - disable the bindings; - bind the new servers; - put whatever working copies you can scrape up back on disk; - promote one of the new servers. However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead: - Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them. - Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order. - Make the "cluster already has leaders" message more clear. - Make the documentation more clear. Test Plan: - Bound a repository to two devices. - Wrote to A to make it a leader, then disabled it (simulating a lightning strike). - Tried to promote B. Got a new, useful error ("demote A first"). - Demoted A (before: error about demoting inactive devices; now: works fine). - Promoted B. This worked. Reviewers: amckinley Reviewed By: amckinley Maniphest Tasks: T13216 Differential Revision: https://secure.phabricator.com/D19793
This commit is contained in:
parent
c32fa06266
commit
2a7ac8e388
2 changed files with 74 additions and 40 deletions
|
@ -120,33 +120,71 @@ final class PhabricatorRepositoryManagementThawWorkflow
|
||||||
$repository->getDisplayName()));
|
$repository->getDisplayName()));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if ($promote) {
|
||||||
|
// You can only promote active devices. (You may demote active or
|
||||||
|
// inactive devices.)
|
||||||
$bindings = $service->getActiveBindings();
|
$bindings = $service->getActiveBindings();
|
||||||
$bindings = mpull($bindings, null, 'getDevicePHID');
|
$bindings = mpull($bindings, null, 'getDevicePHID');
|
||||||
if (empty($bindings[$device->getPHID()])) {
|
if (empty($bindings[$device->getPHID()])) {
|
||||||
throw new PhutilArgumentUsageException(
|
throw new PhutilArgumentUsageException(
|
||||||
pht(
|
pht(
|
||||||
'Repository "%s" has no active binding to device "%s". Only '.
|
'Repository "%s" has no active binding to device "%s". Only '.
|
||||||
'actively bound devices can be promoted or demoted.',
|
'actively bound devices can be promoted.',
|
||||||
$repository->getDisplayName(),
|
$repository->getDisplayName(),
|
||||||
$device->getName()));
|
$device->getName()));
|
||||||
}
|
}
|
||||||
|
|
||||||
$versions = PhabricatorRepositoryWorkingCopyVersion::loadVersions(
|
$versions = PhabricatorRepositoryWorkingCopyVersion::loadVersions(
|
||||||
$repository->getPHID());
|
$repository->getPHID());
|
||||||
|
|
||||||
$versions = mpull($versions, null, 'getDevicePHID');
|
$versions = mpull($versions, null, 'getDevicePHID');
|
||||||
$versions = array_select_keys($versions, array_keys($bindings));
|
|
||||||
|
|
||||||
if ($versions && $promote) {
|
// Before we promote, make sure there are no outstanding versions on
|
||||||
throw new PhutilArgumentUsageException(
|
// devices with inactive bindings. If there are, you need to demote
|
||||||
pht(
|
// these first.
|
||||||
'Unable to promote "%s" for repository "%s": the leaders for '.
|
$inactive = array();
|
||||||
'this cluster are not ambiguous.',
|
foreach ($versions as $device_phid => $version) {
|
||||||
$device->getName(),
|
if (isset($bindings[$device_phid])) {
|
||||||
$repository->getDisplayName()));
|
continue;
|
||||||
|
}
|
||||||
|
$inactive[$device_phid] = $version;
|
||||||
|
}
|
||||||
|
|
||||||
|
if ($inactive) {
|
||||||
|
$handles = $viewer->loadHandles(array_keys($inactive));
|
||||||
|
|
||||||
|
$handle_list = iterator_to_array($handles);
|
||||||
|
$handle_list = mpull($handle_list, 'getName');
|
||||||
|
$handle_list = implode(', ', $handle_list);
|
||||||
|
|
||||||
|
throw new PhutilArgumentUsageException(
|
||||||
|
pht(
|
||||||
|
'Repository "%s" has versions on inactive devices. Demote '.
|
||||||
|
'(or reactivate) these devices before promoting a new '.
|
||||||
|
'leader: %s.',
|
||||||
|
$repository->getDisplayName(),
|
||||||
|
$handle_list));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Now, make sure there are no outstanding versions on devices with
|
||||||
|
// active bindings. These also need to be demoted (or promoting is a
|
||||||
|
// mistake or already happened).
|
||||||
|
$active = array_select_keys($versions, array_keys($bindings));
|
||||||
|
if ($active) {
|
||||||
|
$handles = $viewer->loadHandles(array_keys($active));
|
||||||
|
|
||||||
|
$handle_list = iterator_to_array($handles);
|
||||||
|
$handle_list = mpull($handle_list, 'getName');
|
||||||
|
$handle_list = implode(', ', $handle_list);
|
||||||
|
|
||||||
|
throw new PhutilArgumentUsageException(
|
||||||
|
pht(
|
||||||
|
'Unable to promote "%s" for repository "%s" because this '.
|
||||||
|
'cluster already has one or more unambiguous leaders: %s.',
|
||||||
|
$device->getName(),
|
||||||
|
$repository->getDisplayName(),
|
||||||
|
$handle_list));
|
||||||
}
|
}
|
||||||
|
|
||||||
if ($promote) {
|
|
||||||
PhabricatorRepositoryWorkingCopyVersion::updateVersion(
|
PhabricatorRepositoryWorkingCopyVersion::updateVersion(
|
||||||
$repository->getPHID(),
|
$repository->getPHID(),
|
||||||
$device->getPHID(),
|
$device->getPHID(),
|
||||||
|
|
|
@ -414,28 +414,24 @@ present on the leaders but not present on the followers by examining the
|
||||||
push logs.
|
push logs.
|
||||||
|
|
||||||
If you are comfortable discarding these changes, you can instruct Phabricator
|
If you are comfortable discarding these changes, you can instruct Phabricator
|
||||||
that it can forget about the leaders in two ways: disable the service bindings
|
that it can forget about the leaders by doing this:
|
||||||
to all of the leader devices so they are no longer part of the cluster, or use
|
|
||||||
`bin/repository thaw` to `--demote` the leaders explicitly.
|
|
||||||
|
|
||||||
If you do this, **you will lose data**. Either action will discard any changes
|
- Disable the service bindings to all of the leader devices so they are no
|
||||||
on the affected leaders which have not replicated to other devices in the
|
longer part of the cluster.
|
||||||
cluster.
|
- Then, use `bin/repository thaw` to `--demote` the leaders explicitly.
|
||||||
|
|
||||||
To remove a device from the cluster, disable all of the bindings to it
|
To demote a device, run this command:
|
||||||
in Almanac, using the web UI.
|
|
||||||
|
|
||||||
{icon exclamation-triangle, color="red"} Any data which is only present on
|
|
||||||
the disabled device will be lost.
|
|
||||||
|
|
||||||
To demote a device without removing it from the cluster, run this command:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
|
phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
|
||||||
```
|
```
|
||||||
|
|
||||||
{icon exclamation-triangle, color="red"} Any data which is only present on
|
{icon exclamation-triangle, color="red"} Any data which is only present on
|
||||||
**this** device will be lost.
|
the demoted device will be lost.
|
||||||
|
|
||||||
|
If you do this, **you will lose unreplicated data**. You will discard any
|
||||||
|
changes on the affected leaders which have not replicated to other devices
|
||||||
|
in the cluster.
|
||||||
|
|
||||||
|
|
||||||
Ambiguous Leaders
|
Ambiguous Leaders
|
||||||
|
|
Loading…
Reference in a new issue