1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2024-11-22 23:02:42 +01:00

Make "bin/repository thaw" workflow more clear when devices are disabled

Summary:
Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to:

  - demote all the devices;
  - disable the bindings;
  - bind the new servers;
  - put whatever working copies you can scrape up back on disk;
  - promote one of the new servers.

However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead:

  - Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them.
  - Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order.
  - Make the "cluster already has leaders" message more clear.
  - Make the documentation more clear.

Test Plan:
  - Bound a repository to two devices.
  - Wrote to A to make it a leader, then disabled it (simulating a lightning strike).
  - Tried to promote B. Got a new, useful error ("demote A first").
  - Demoted A (before: error about demoting inactive devices; now: works fine).
  - Promoted B. This worked.

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T13216

Differential Revision: https://secure.phabricator.com/D19793
This commit is contained in:
epriestley 2018-11-07 10:08:51 -08:00
parent c32fa06266
commit 2a7ac8e388
2 changed files with 74 additions and 40 deletions

View file

@ -120,33 +120,71 @@ final class PhabricatorRepositoryManagementThawWorkflow
$repository->getDisplayName()));
}
if ($promote) {
// You can only promote active devices. (You may demote active or
// inactive devices.)
$bindings = $service->getActiveBindings();
$bindings = mpull($bindings, null, 'getDevicePHID');
if (empty($bindings[$device->getPHID()])) {
throw new PhutilArgumentUsageException(
pht(
'Repository "%s" has no active binding to device "%s". Only '.
'actively bound devices can be promoted or demoted.',
'actively bound devices can be promoted.',
$repository->getDisplayName(),
$device->getName()));
}
$versions = PhabricatorRepositoryWorkingCopyVersion::loadVersions(
$repository->getPHID());
$versions = mpull($versions, null, 'getDevicePHID');
$versions = array_select_keys($versions, array_keys($bindings));
if ($versions && $promote) {
throw new PhutilArgumentUsageException(
pht(
'Unable to promote "%s" for repository "%s": the leaders for '.
'this cluster are not ambiguous.',
$device->getName(),
$repository->getDisplayName()));
// Before we promote, make sure there are no outstanding versions on
// devices with inactive bindings. If there are, you need to demote
// these first.
$inactive = array();
foreach ($versions as $device_phid => $version) {
if (isset($bindings[$device_phid])) {
continue;
}
$inactive[$device_phid] = $version;
}
if ($inactive) {
$handles = $viewer->loadHandles(array_keys($inactive));
$handle_list = iterator_to_array($handles);
$handle_list = mpull($handle_list, 'getName');
$handle_list = implode(', ', $handle_list);
throw new PhutilArgumentUsageException(
pht(
'Repository "%s" has versions on inactive devices. Demote '.
'(or reactivate) these devices before promoting a new '.
'leader: %s.',
$repository->getDisplayName(),
$handle_list));
}
// Now, make sure there are no outstanding versions on devices with
// active bindings. These also need to be demoted (or promoting is a
// mistake or already happened).
$active = array_select_keys($versions, array_keys($bindings));
if ($active) {
$handles = $viewer->loadHandles(array_keys($active));
$handle_list = iterator_to_array($handles);
$handle_list = mpull($handle_list, 'getName');
$handle_list = implode(', ', $handle_list);
throw new PhutilArgumentUsageException(
pht(
'Unable to promote "%s" for repository "%s" because this '.
'cluster already has one or more unambiguous leaders: %s.',
$device->getName(),
$repository->getDisplayName(),
$handle_list));
}
if ($promote) {
PhabricatorRepositoryWorkingCopyVersion::updateVersion(
$repository->getPHID(),
$device->getPHID(),

View file

@ -414,28 +414,24 @@ present on the leaders but not present on the followers by examining the
push logs.
If you are comfortable discarding these changes, you can instruct Phabricator
that it can forget about the leaders in two ways: disable the service bindings
to all of the leader devices so they are no longer part of the cluster, or use
`bin/repository thaw` to `--demote` the leaders explicitly.
that it can forget about the leaders by doing this:
If you do this, **you will lose data**. Either action will discard any changes
on the affected leaders which have not replicated to other devices in the
cluster.
- Disable the service bindings to all of the leader devices so they are no
longer part of the cluster.
- Then, use `bin/repository thaw` to `--demote` the leaders explicitly.
To remove a device from the cluster, disable all of the bindings to it
in Almanac, using the web UI.
{icon exclamation-triangle, color="red"} Any data which is only present on
the disabled device will be lost.
To demote a device without removing it from the cluster, run this command:
To demote a device, run this command:
```
phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
```
{icon exclamation-triangle, color="red"} Any data which is only present on
**this** device will be lost.
the demoted device will be lost.
If you do this, **you will lose unreplicated data**. You will discard any
changes on the affected leaders which have not replicated to other devices
in the cluster.
Ambiguous Leaders