From ac7edf54afe4d9ba8ac077758d8d11de670c2a66 Mon Sep 17 00:00:00 2001 From: epriestley Date: Wed, 14 Oct 2015 06:18:10 -0700 Subject: [PATCH] Fix bad counting in SQL when enforcing Drydock allocator soft limits Summary: Ref T9252. This fixes a bug from D14236. D14272 discusses the observable effects of the bug, primarily that the window for racing is widened from ~a few milliseconds to several minutes under our configuration. This SQL query is missing a `GROUP BY` clause, so all of the resources get counted as having the same status (specifically, the alphabetically earliest status any resource had, I think). For test cases this often gets the right result since the number of resources may be small and they may all have the same status, but in production this isn't true. In particular, the allocator would sometimes see "35 destroyed resources" (or whatever), when the real counts were "32 destroyed resources + 3 pending resources". Since this allocator behavior is soft/advisory this didn't cause any actual problems, per se (we do expect races here occasionally), it just made the race very very easy to hit. For example, Drydock in production currently has three pending working copy resources. Although we do expect this to be //possible//, getting 4 resources when the configured limit is 1 should be hard (not lightning strike / cosmic radiaion hard, but "happens once a year" hard). Also exclude destroyed resources since we never care about them. Test Plan: Followed the plan from D14272 and restarted two Harbormaster workers at the same time. After this patch was applied, they no longer created two different resources (we expect it to be possible for this to happen, just very hard). We should still be able to force this race by putting something like `sleep(10)` right before the query, then `sleep(10)` right after it. That would prevent the allocators from seeing one another (so they would both think there were no other resources) and push us down the pathway where we exceed the soft limit. Reviewers: chad, hach-que Reviewed By: hach-que Maniphest Tasks: T9252 Differential Revision: https://secure.phabricator.com/D14274 --- .../drydock/blueprint/DrydockBlueprintImplementation.php | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/applications/drydock/blueprint/DrydockBlueprintImplementation.php b/src/applications/drydock/blueprint/DrydockBlueprintImplementation.php index e66b1616e5..0c53b19fcf 100644 --- a/src/applications/drydock/blueprint/DrydockBlueprintImplementation.php +++ b/src/applications/drydock/blueprint/DrydockBlueprintImplementation.php @@ -343,9 +343,12 @@ abstract class DrydockBlueprintImplementation extends Phobject { $counts = queryfx_all( $conn_r, - 'SELECT status, COUNT(*) N FROM %T WHERE blueprintPHID = %s', + 'SELECT status, COUNT(*) N FROM %T + WHERE blueprintPHID = %s AND status != %s + GROUP BY status', $resource->getTableName(), - $blueprint->getPHID()); + $blueprint->getPHID(), + DrydockResourceStatus::STATUS_DESTROYED); $counts = ipull($counts, 'N', 'status'); $n_alloc = idx($counts, DrydockResourceStatus::STATUS_PENDING, 0);