Mirror of https://we.phorge.it/source/phorge.git (synced 2024-11-19 13:22:42 +01:00)
Fix unbounded expansion of allocating resource pool
Summary: Ref T9252. I think there's a more complex version of this problem discussed elsewhere, but here's what we hit today:

- 5 commits land at the same time and trigger 5 builds.
- All of them go to acquire a working copy.
- Working copies have a limit of 1 right now, so 1 of them gets the lease on it.
- The other 4 all trigger allocation of //new// working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.
- The 4 pending working copies will never activate without manual intervention, so these 4 builds are stuck forever.

To fix this, prevent WorkingCopies from giving out leases until they activate. Now the leases won't acquire until we know the working copy is good, which solves the first problem. However, this creates a secondary problem:

- As above, all 5 go to acquire a working copy.
- One gets it.
- The other 4 trigger allocations, but no longer acquire leases. This is an improvement.
- Every time the leases update, they trigger another allocation, but never acquire. They trigger, say, a few thousand allocations.
- Eventually the first build finishes up and the second lease acquires the working copy. After some time, all of the builds finish.
- However, they generated an unboundedly large number of pending working copy resources during this time.

This is technically "okay-ish", in that it did work correctly; it just generated a gigantic mess as a side effect.

To solve this, at least for now, provide a mechanism to impose allocation rate limits and put a cap on the number of allocating resources of a given type. As hard-coded, this is the greater of "1" or "25% of the active resources in the pool". So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out. This prevents us from getting runaway queues of limitless size.

This also imposes a total active working copy resource limit of 1, which incidentally also fixes the problem, although I expect to raise this soon.

These mechanisms will need refinement, but the basic idea is:

- Resources which aren't sure if they can actually activate should wait until they do activate before allowing leases to acquire them. I'm fairly confident this rule is a reasonable one.
- Then we limit how many bookkeeping side effects Drydock can generate once it starts encountering limits.

Broadly, some amount of mess is inevitable because Drydock is allowed to try things that might not work. In an extreme case we could prevent this mess by setting all these limits at "1" forever, which would degrade Drydock to effectively be a synchronous, blocking queue. The idea here is to put some amount of slack in the system (more than zero, but less than infinity) so we get the performance benefits of having a parallel, asynchronous system with a finite, manageable amount of mess. Numbers larger than 0 but less than infinity are pretty tricky, but I think rules like "X% of active resources" seem fairly reasonable, at least for resources like working copies.

Test Plan: Ran something like this:

```
for i in `seq 1 5`; do sh -c '(./bin/harbormaster build --plan 10 rX... &) &'; done;
```

Saw 5 plans launch, acquire leases, proceed in an orderly fashion, and eventually finish successfully.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14236
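To make the throttle rule concrete before reading the diff below, here is a minimal standalone PHP sketch of the decision it describes. The function name and signature are illustrative, not the actual Drydock API, and the hard cap on total resources (`$total_limit` in the real change) is omitted for brevity: new allocations are cut off once the number of in-flight allocations reaches the greater of 1 and ceil(0.25 × active).

```
<?php

// Illustrative sketch only, not the Drydock API: decide whether to
// throttle new allocations, given counts of in-flight ("allocating")
// and active resources for one blueprint.
function should_limit_pool_size($n_allocating, $n_active) {
  $min_allowed = 1;      // always allow at least one allocation in flight
  $growth_factor = 0.25; // allow in-flight allocations up to 25% of active

  // Never throttle below the minimum number of in-flight allocations.
  if ($n_allocating < $min_allowed) {
    return false;
  }

  // Throttle once in-flight allocations reach the growth cap.
  $allowed = (int)ceil($n_active * $growth_factor);
  return ($n_allocating >= $allowed);
}

// With 40 active working copies, up to 10 allocations may be in flight:
var_dump(should_limit_pool_size(9, 40));  // bool(false) -- a 10th may start
var_dump(should_limit_pool_size(10, 40)); // bool(true)  -- new ones cut off
```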
Parent: b2e89a9e48
Commit: ee937e99fb
4 changed files with 106 additions and 4 deletions
@@ -19,6 +19,10 @@ abstract class DrydockBlueprintImplementation extends Phobject {
     return array();
   }
 
+  public function getViewer() {
+    return PhabricatorUser::getOmnipotentUser();
+  }
+
 
 /* -(  Lease Acquisition  )-------------------------------------------------- */
 
@@ -310,4 +314,67 @@ abstract class DrydockBlueprintImplementation extends Phobject {
     }
   }
 
+
+  /**
+   * Apply standard limits on resource allocation rate.
+   *
+   * @param DrydockBlueprint The blueprint requesting an allocation.
+   * @return bool True if further allocations should be limited.
+   */
+  protected function shouldLimitAllocatingPoolSize(
+    DrydockBlueprint $blueprint) {
+
+    // TODO: If this mechanism sticks around, these values should be
+    // configurable by the blueprint implementation.
+
+    // Limit on total number of active resources.
+    $total_limit = 1;
+
+    // Always allow at least this many allocations to be in flight at once.
+    $min_allowed = 1;
+
+    // Allow this fraction of allocating resources as a fraction of active
+    // resources.
+    $growth_factor = 0.25;
+
+    $resource = new DrydockResource();
+    $conn_r = $resource->establishConnection('r');
+
+    $counts = queryfx_all(
+      $conn_r,
+      'SELECT status, COUNT(*) N FROM %T WHERE blueprintPHID = %s
+        GROUP BY status',
+      $resource->getTableName(),
+      $blueprint->getPHID());
+    $counts = ipull($counts, 'N', 'status');
+
+    $n_alloc = idx($counts, DrydockResourceStatus::STATUS_PENDING, 0);
+    $n_active = idx($counts, DrydockResourceStatus::STATUS_ACTIVE, 0);
+    $n_broken = idx($counts, DrydockResourceStatus::STATUS_BROKEN, 0);
+    $n_released = idx($counts, DrydockResourceStatus::STATUS_RELEASED, 0);
+
+    // If we're at the limit on total active resources, limit additional
+    // allocations.
+    $n_total = ($n_alloc + $n_active + $n_broken + $n_released);
+    if ($n_total >= $total_limit) {
+      return true;
+    }
+
+    // If the number of in-flight allocations is fewer than the minimum number
+    // of allowed allocations, don't impose a limit.
+    if ($n_alloc < $min_allowed) {
+      return false;
+    }
+
+    $allowed_alloc = (int)ceil($n_active * $growth_factor);
+
+    // If the number of in-flight allocations is fewer than the number of
+    // allowed allocations according to the pool growth factor, don't impose
+    // a limit.
+    if ($n_alloc < $allowed_alloc) {
+      return false;
+    }
+
+    return true;
+  }
+
 }
@@ -29,6 +29,17 @@ final class DrydockWorkingCopyBlueprintImplementation
   public function canAllocateResourceForLease(
     DrydockBlueprint $blueprint,
     DrydockLease $lease) {
+    $viewer = $this->getViewer();
+
+    if ($this->shouldLimitAllocatingPoolSize($blueprint)) {
+      return false;
+    }
+
+    // TODO: If we have a pending resource which is compatible with the
+    // configuration for this lease, prevent a new allocation? Otherwise the
+    // queue can fill up with copies of requests from the same lease. But
+    // maybe we can deal with this with "pre-leasing"?
+
     return true;
   }
 
@@ -37,6 +48,12 @@ final class DrydockWorkingCopyBlueprintImplementation
     DrydockResource $resource,
     DrydockLease $lease) {
 
+    // Don't hand out leases on working copies which have not activated, since
+    // it may take an arbitrarily long time for them to acquire a host.
+    if (!$resource->isActive()) {
+      return false;
+    }
+
     $need_map = $lease->getAttribute('repositories.map');
     if (!is_array($need_map)) {
       return false;
@@ -320,8 +337,10 @@ final class DrydockWorkingCopyBlueprintImplementation
   }
 
   private function loadRepositories(array $phids) {
+    $viewer = $this->getViewer();
+
     $repositories = id(new PhabricatorRepositoryQuery())
-      ->setViewer(PhabricatorUser::getOmnipotentUser())
+      ->setViewer($viewer)
       ->withPHIDs($phids)
       ->execute();
     $repositories = mpull($repositories, null, 'getPHID');
@@ -353,7 +372,7 @@ final class DrydockWorkingCopyBlueprintImplementation
   }
 
   private function loadHostLease(DrydockResource $resource) {
-    $viewer = PhabricatorUser::getOmnipotentUser();
+    $viewer = $this->getViewer();
 
     $lease_phid = $resource->getAttribute('host.leasePHID');
 
@@ -293,6 +293,15 @@ final class DrydockResource extends DrydockDAO
     }
   }
 
+  public function isActive() {
+    switch ($this->getStatus()) {
+      case DrydockResourceStatus::STATUS_ACTIVE:
+        return true;
+    }
+
+    return false;
+  }
+
   public function logEvent($type, array $data = array()) {
     $log = id(new DrydockLog())
       ->setEpoch(PhabricatorTime::getNow())
@@ -535,7 +535,7 @@ final class DrydockLeaseUpdateWorker extends DrydockWorker {
     // If this lease has been acquired but not activated, queue a task to
     // activate it.
     if ($lease->getStatus() == DrydockLeaseStatus::STATUS_ACQUIRED) {
-      PhabricatorWorker::scheduleTask(
+      $this->queueTask(
         __CLASS__,
         array(
           'leasePHID' => $lease->getPHID(),
@@ -691,7 +691,14 @@
       ->setStatus(DrydockLeaseStatus::STATUS_BROKEN)
       ->save();
 
-    $lease->scheduleUpdate();
+    $this->queueTask(
+      __CLASS__,
+      array(
+        'leasePHID' => $lease->getPHID(),
+      ),
+      array(
+        'objectPHID' => $lease->getPHID(),
+      ));
 
     $lease->logEvent(
       DrydockLeaseActivationFailureLogType::LOGCONST,