Summary:
Fixes T9994. Currently, when Drydock can't allocate a new resource because some limit has been reached, it waits patiently for a resource to become available.
It is possible that no resource will ever become available. Particularly with "Working Copy" resources, the new lease may want a copy of `rB`, but the blueprint may already be maxed out with an existing `rA` working copy.
Right now, no process exists to automatically reclaim the unused `rA`.
When we encounter this situation, try to reclaim one of the other resources if it is just sitting there unused.
Specifically:
- Add a "reclaim" command which means "release this resource //if// it is completely unused".
- Add a `bin/drydock reclaim` to send this command to every active resource.
- When we try to acquire a resource and can't, but only because of some kind of limit or utilization problem, try to release an unused resource to free up some room (sketched below).
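Roughly, that check might look like the following standalone sketch (a simplified model for illustration only, not the actual Drydock allocator code; the function and field names here are hypothetical):

```
<?php

// Hypothetical, simplified model of the reclaim behavior described above;
// not the real Drydock allocator. A resource may be reclaimed only if it
// is active and completely unused.
function reclaimOneResource(array $resources) {
  foreach ($resources as $resource) {
    if ($resource['status'] !== 'active') {
      continue;
    }
    if ($resource['activeLeases'] > 0) {
      // Still in use; leave it alone.
      continue;
    }
    // Send the "reclaim" command; the resource is released only if it is
    // still unused when the command is actually processed.
    printf("Reclaiming unused resource %s.\n", $resource['name']);
    return true;
  }
  // Nothing to reclaim; the new lease just has to wait.
  return false;
}

// Example: an idle working copy of rA can be reclaimed to make room for rB.
reclaimOneResource(array(
  array('name' => 'Working Copy (rA)', 'status' => 'active', 'activeLeases' => 0),
));
```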
Test Plan:
- Set "Working Copy" resource limit to 1.
- Ran "Test Configuration" in `rA`, which worked.
- Ran "Test Configuration" in `rB`, which hung forever.
- Applied patch.
- Ran "Test Configuration" in `rB`, saw it reclaim the `rA` resource, use the slot, then succeed.
- Ran "Test Configuration" in `rA` again, saw it grab the slot back.
- Ran `bin/drydock reclaim` and saw it reclaim a bunch of old orphaned resources.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9994
Differential Revision: https://secure.phabricator.com/D14819
Summary:
Ref T9994. This fixes the first issue discussed on that task, which is that when a merge fails after "arc land", we would not clean up all the leases properly.
Specifically, when a merge fails, we use `queueTask()` to schedule a followup task. This followup destroys the lease and frees the underlying resource.
However, the default behavior of `queueTask()` is to //not queue tasks// if the parent task fails. This is a reasonable, safe behavior that was originally introduced in D8774, where it kept us from sending too much mail if a task did "send some mail" and then failed a little later on and got retried.
Since I think the default behavior is correct, I just special-cased the behavior for Drydock to make it queue even on failure. These are the only types of followup tasks we currently want to queue on main task failure.
(It's possible that future Blueprints might want some kind of more specialized behavior, where some tasks queue only on success, but we can cross that bridge when we come to it.)
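As a rough standalone model of this queue-on-failure behavior (this is not the real `PhabricatorWorker` code, and the `$queueOnFailure` flag is a hypothetical stand-in for the Drydock special case):

```
<?php

// Simplified model of followup-task queueing; not the real worker code.
// By default, followups queued during a task are scheduled only if that
// task succeeds. Drydock's cleanup followups must be scheduled even when
// the task fails, so the lease still gets destroyed.
final class FollowupQueueModel {

  private $queue = array();

  public function queueTask($name) {
    $this->queue[] = $name;
  }

  public function execute(callable $work, $queueOnFailure) {
    try {
      $work();
    } catch (Exception $ex) {
      if ($queueOnFailure) {
        // The Drydock case: schedule the cleanup followup anyway.
        $this->flushQueue();
      }
      throw $ex;
    }
    $this->flushQueue();
  }

  private function flushQueue() {
    foreach ($this->queue as $name) {
      printf("Scheduling followup task: %s\n", $name);
    }
    $this->queue = array();
  }
}
```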
Test Plan:
- See T9994#149878 for test case setup.
- I ran that test case again with this patch, and saw the followup task queue properly in the `--trace` log, a corresponding update task show up in `/daemon/`, and the lease get destroyed when I ran it a moment later.
{F1029915}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9994
Differential Revision: https://secure.phabricator.com/D14818
Summary:
Ref T9252. Currently, error handling behavior isn't great and a lot of errors aren't dealt with properly. Try to improve this by making default behaviors better:
- Yields, slot lock exceptions, and aggregate or proxy exceptions containing an exception of one of these types turn into yields.
- All other exceptions are considered permanent failures. They break the resource and stop further work on it.
This feels a little bit "magical", but I want to try to get the default behaviors to align reasonably well with expectations so that blueprints mostly don't need a ton of error handling. This will probably need at least some refinement down the road, but it's a reasonable rule for all exception/error conditions we currently have.
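For concreteness, the classification rule amounts to something like this standalone sketch (the exception classes here are stand-ins, not Drydock's real ones):

```
<?php

// Stand-in exception types for illustration; Drydock's real classes differ.
final class YieldExceptionModel extends Exception {}
final class SlotLockExceptionModel extends Exception {}
final class AggregateExceptionModel extends Exception {
  private $children;
  public function __construct(array $children) {
    parent::__construct('aggregate');
    $this->children = $children;
  }
  public function getChildren() {
    return $this->children;
  }
}

// True: turn the failure into a yield and try again later.
// False: treat it as a permanent failure and break the resource.
function shouldYield(Exception $ex) {
  if ($ex instanceof YieldExceptionModel ||
      $ex instanceof SlotLockExceptionModel) {
    return true;
  }
  if ($ex instanceof AggregateExceptionModel) {
    // Aggregate/proxy exceptions yield if any wrapped exception would.
    foreach ($ex->getChildren() as $child) {
      if (shouldYield($child)) {
        return true;
      }
    }
  }
  return false;
}
```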
Test Plan: I did a clean build, but haven't vetted this super thoroughly. Next diff will do the same thing to leases, then I'll work on stabilizing this code better.
Reviewers: chad, hach-que
Reviewed By: hach-que
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14211
Summary:
Ref T9252. Currently, Drydock Leases and Resources have several workers:
- Resources: ResourceWorker, ResourceUpdateWorker, ResourceDestroyWorker
- Leases: AllocatorWorker, LeaseWorker, LeaseUpdateWorker, LeaseDestroyWorker
This is kind of a lot of stuff, and it creates some problems.
In particular, leases and resources in early lifecycle phases (pending/allocating/acquiring) can't process commands yet, because that code is only in the "UpdateWorker" classes. If they aren't able to move forward because of a bug, they also can't be released because they can't react to the release command until later in their lifecycle. This creates a soft hang where I have to go wipe stuff out of the database since there's no other way to get rid of it.
Instead, I want leases and resources to be releasable from any (pre-release / pre-destroy) phase of their lifecycle. To support this, all the workers before the "UpdateWorker" need to be able to process commands.
A second, similar issue is that logging and exception handling behaviors are underpowered right now. Elsewhere I began improving this, but ran into issues where all of the workers needed to share very similar exception code. Merging them will make this future change simpler.
This diff fixes this for resources: it merges the Worker, UpdateWorker and DestroyWorker logic into UpdateWorker and throws away the other two workers.
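Very roughly, the merged flow looks like this standalone model (not the actual worker classes; names are hypothetical), with the command queue drained before any phase-specific work:

```
<?php

// Simplified model of the merged resource worker; not the real classes.
// One worker now covers every pre-destroy phase and always processes
// queued commands first, so a resource stuck in "pending" or "allocating"
// can still be released.
function updateResourceModel(array $resource) {
  foreach ($resource['commands'] as $command) {
    if ($command === 'release' && $resource['status'] !== 'destroyed') {
      $resource['status'] = 'released';
    }
  }
  $resource['commands'] = array();

  switch ($resource['status']) {
    case 'pending':
      // ...continue allocation/activation work...
      break;
    case 'active':
      // ...normal update work...
      break;
    case 'released':
      // ...queue teardown and destruction...
      break;
  }

  return $resource;
}
```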
Test Plan: Nothing substantive yet, see next diff. I'll do the same thing for Leases, then test both more thoroughly.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14201
Summary:
Fixes T6569. This implements an expiry mechanism for Drydock resources which parallels the mechanism for leases.
A few things are missing that we'll probably need in the future:
- An "EXPIRES" command to update the expiration time. This would let resources be permanent while leased, then expire after, say, 24 hours without any leases.
- A callback like `shouldActuallyExpireRightNow()` for resources and leases that lets them decide not to expire at the last second.
- A callback like `didAcquireLease()` for resource blueprints, to parallel `didReleaseLease()`, letting them clear or extend their timer.
However, this stuff would mostly just let us tune behaviors, not really open up new capabilities.
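A minimal standalone model of the mechanism this adds (field and function names are hypothetical, not the real implementation):

```
<?php

// Simplified model of resource expiry, paralleling lease expiry: a resource
// records an expiration timestamp when it is built, and a scheduled check
// releases it once that time has passed.
function checkResourceExpiry(array $resource, $now) {
  $expires = $resource['expires'];
  if ($expires === null || $now < $expires) {
    // No expiration, or not expired yet; check again later.
    return 'keep';
  }
  // Expired: release the resource, which will queue its destruction.
  return 'release';
}

// Example: a host resource configured to expire 60 seconds after creation,
// as in the test plan below.
$resource = array('expires' => time() + 60);
echo checkResourceExpiry($resource, time())."\n";        // "keep"
echo checkResourceExpiry($resource, time() + 120)."\n";  // "release"
```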
Test Plan: Changed host resources to expire after 60 seconds, leased one, saw it vanish 60 seconds later.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T6569
Differential Revision: https://secure.phabricator.com/D14176
Summary:
Ref T9252. Drydock currently uses integer statuses, but there's no reason for this (they don't need to be ordered) and it makes debugging them, working with them, future APIs, etc., more cumbersome.
Switch to string instead.
Also rename `STATUS_OPEN` to `STATUS_ACTIVE` and `STATUS_CLOSED` to `STATUS_RELEASED` for consistency. This makes resources and leases have more similar states, and gives resource states more accurate names.
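For illustration, the shape of the change is roughly this (the constant values shown are illustrative, not copied from the real `DrydockResourceStatus`):

```
<?php

// Illustrative sketch only; see DrydockResourceStatus for the real values.
final class ResourceStatusExample {

  // Before: ordered integer statuses with less accurate names, e.g.
  //   const STATUS_OPEN = 0;
  //   const STATUS_CLOSED = 2;

  // After: unordered string statuses, named consistently with leases.
  const STATUS_PENDING   = 'pending';
  const STATUS_ACTIVE    = 'active';    // formerly STATUS_OPEN
  const STATUS_RELEASED  = 'released';  // formerly STATUS_CLOSED
  const STATUS_DESTROYED = 'destroyed';
}
```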
Test Plan: Browsed web UI, grepped for changed constants, applied patch, inspected database.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14153
Summary:
Ref T9252. Leases currently have a `resourceID`, but this is a bit nonstandard and generally less flexible than giving them a `resourcePHID`.
In particular, a `resourcePHID` is easier to use when rendering interfaces, since you can get handles out of a PHID.
Add a PHID column, copy over all the PHIDs that correspond to existing IDs, then drop the ID column.
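The migration amounts to something like this standalone sketch (plain arrays stand in for the real storage and migration code):

```
<?php

// Simplified model of the migration: look up each lease's resource by ID,
// store that resource's PHID on the lease, then discard the old ID field.
function migrateLeaseResourcePHIDs(array $leases, array $resources) {
  // Map resource ID => resource PHID.
  $map = array();
  foreach ($resources as $resource) {
    $map[$resource['id']] = $resource['phid'];
  }

  foreach ($leases as $key => $lease) {
    $id = $lease['resourceID'];
    if ($id !== null && isset($map[$id])) {
      $leases[$key]['resourcePHID'] = $map[$id];
    }
    unset($leases[$key]['resourceID']);
  }

  return $leases;
}
```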
Test Plan:
- Browsed web UIs.
- Inspected database during/after migration.
- Grepped for `resourceID`.
- Allocated a new lease with `bin/drydock lease`.
Reviewers: chad, hach-que
Reviewed By: hach-que
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14151
Summary: Ref T9252. Some leases or resources may need to remove data, tear down VMs, etc., during cleanup. After they are released, queue a "destroy" phase for performing teardown.
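As a rough standalone sketch of the lifecycle step this adds (function names are hypothetical, not the real workers):

```
<?php

// Simplified model of the added "destroy" phase; not the real workers.
// Release only detaches the lease or resource; the queued destroy phase
// then removes data, tears down VMs, and so on.
function releaseResourceModel(array $resource, array &$taskQueue) {
  $resource['status'] = 'released';
  // Queue a separate destruction task to perform teardown later.
  $taskQueue[] = array('worker' => 'destroy', 'resource' => $resource['name']);
  return $resource;
}

function destroyResourceModel(array $resource) {
  // ...remove working copy data, tear down the VM, etc...
  $resource['status'] = 'destroyed';
  return $resource;
}
```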
Test Plan:
- Used `bin/drydock lease ...` to create a working copy lease.
- Used `bin/drydock release-lease` and `bin/drydock release-resource` to release the lease and then the working copy and host.
- Saw working copy and host get destroyed and cleaned up properly.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T6569, T9252
Differential Revision: https://secure.phabricator.com/D14144
Summary:
Ref T9252. Broadly, Drydock currently races on releasing objects from the "active" state. To reproduce this:
- Scatter some sleep()s pretty much anywhere in the release code.
- Release several times from web UI or CLI in quick succession.
Resources or leases will execute some release code twice or otherwise do inconsistent things.
(I didn't chase down a detailed reproduction scenario for this since inspection of the code makes it clear that there are no meaningful locks or mechanisms preventing this.)
Instead, add a Harbormaster-style command queue to resources and leases. When something wants to do a release, it adds a command to the queue and schedules a worker. The workers acquire a lock, then try to consume commands from the queue.
This guarantees that only one process is responsible for writes to active resources and leases.
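In outline (a standalone model, not the actual Drydock or Harbormaster classes), the queue works like this:

```
<?php

// Simplified model of the command queue; not the real implementation.
// Anything may enqueue a "release" command at any time, but only the
// update worker, holding a lock, applies state changes.
final class CommandQueueModel {

  private $commands = array();
  public $status = 'active';

  // Callable from the web UI, the CLI, the GC, etc.
  public function requestRelease() {
    $this->commands[] = 'release';
    // ...then schedule an update worker to consume the queue.
  }

  // Only the update worker calls this, after acquiring a lock on the
  // lease or resource.
  public function processCommands() {
    while ($this->commands) {
      $command = array_shift($this->commands);
      if ($command === 'release' && $this->status === 'active') {
        $this->status = 'released';
      }
      // Duplicate or late commands are simply no-ops.
    }
  }
}
```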
This is the last major step to giving resources and leases a single writer during all states:
- Resource, Unsaved: AllocatorWorker
- Resource, Pending: ResourceWorker (Possible rename to "Allocated"?)
- Resource, Open: This diff, ResourceUpdateWorker. (Likely rename to "Active").
- Resource, Closed/Broken: Future destruction worker. (Likely rename to "Released" / "Broken"; maybe remove "Broken").
- Resource, Destroyed: No writes.
- Lease, Unsaved: Whatever wants the lease.
- Lease, Pending: AllocatorWorker
- Lease, Acquired: LeaseWorker
- Lease, Active: This diff, LeaseUpdateWorker.
- Lease, Released/Broken: Future destruction worker (Maybe remove "Broken"?)
- Lease, Expired: No writes. (Likely rename to "Destroyed").
In most phases, we can already guarantee that there is a single writer without doing any extra work. This is more complicated in the "Active" case because the release buttons on the web UI, the release tools on the CLI, the lease requestor itself, the garbage collector, and any other release process cleaning up related objects may try to effect a release. All of these could race one another (and, in many cases, race other processes from other phases because all of these get to act immediately) as this code is currently written. Using a queue here lets us make sure there's only a single writer in this phase.
One thing which is notable is that whatever acquires a lease **cannot write to it**! It is never the writer once it queues the lease for activation. It cannot write to any resources, either. And, likewise, Blueprints cannot write to resources while acquiring or releasing leases.
We may need to provide a mechanism so that blueprints and/or resource/lease holders get to attach some storage to resources/leases for bookkeeping. For example, a blueprint might need to keep some kind of cache on a resource to help it manage state. But I think we can cross that bridge when we come to it, and nothing else would need to write to this storage, so it's technically straightforward to introduce such a mechanism if we need one.
Test Plan:
- Viewed buttons in web UI, checked enabled/disabled states.
- Clicked the buttons.
- Saw commands show up in the command queue.
- Saw some daemon stuff get scheduled.
- Ran CLI tools, saw commands get consumed and resources/leases release.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14143