1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2025-01-23 21:18:19 +01:00
Commit graph

48 commits

Author SHA1 Message Date
epriestley
a1baedbd9a Lock resources briefly while acquiring leases on them to prevent acquiring near-death resources
Summary:
Depends on D19078. Ref T13073. Currently, there is a narrow window where we can acquire a resource after a reclaim has started against it.

To prevent this, briefly lock resources before acquiring them and make sure they're still good. If a resource isn't good, throw the lease back in the pool.

Test Plan:
This is tricky. You need:

  - Hoax blueprint with limits and a rule where leases of a given "flavor" can only be satisfied by resources of the same flavor.
  - Reduce the 3-minute "wait before resources can be released" to 3 seconds.
  - Limit Hoaxes to 1.
  - Allocate one "cherry" flavored Hoax and release the lease.
  - Add a `sleep(15)` to `releaseResource()` in `DrydockResourceUpdateWorker`, after the `canReclaimResource()` check, with a `print`.

Now:

  - Run `bin/phd debug task` in two windows.
  - Run `bin/drydock lease --type host --attributes flavor=banana` in a third window.
  - This will start to reclaim the existing "cherry" resource. Once one of the `phd` windows prints the "RECLAIMING" message run `bin/drydock lease --type host --attributes flavor=cherry` in a fourth window.
  - Before patch: the "cherry" lease acquired immediately, then was released and destroyed moments later.
  - After patch: the "cherry" lease yields.

Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam

Maniphest Tasks: T13073

Differential Revision: https://secure.phabricator.com/D19080
2018-02-13 13:22:13 -08:00
epriestley
30a0b103e6 When a lease acquires a resource but the resource fails to activate, throw the lease back in the pool
Summary:
Depends on D19075. Ref T13073. If a lease acquires a resource but finds that the resource builds directly into a dead state (which can happen for any number of reasonable reasons), reset the lease and throw it back in the pool.

This isn't the lease's fault and it hasn't caused any side effects or done anything we can't undo, so we can safely reset it and throw it back in the pool.

Test Plan:
  - Created a blueprint which throws from `allocateResource()` so that resources never activate.
  - Tried to lease it with `bin/drydock lease ...`.
  - Before patch: lease was broken and destroyed after a failed activation.
  - After patch: lease was returned to the pool after a failed activation.

Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam

Maniphest Tasks: T13073

Differential Revision: https://secure.phabricator.com/D19076
2018-02-13 13:17:54 -08:00
epriestley
07028cfc30 When bin/drydock lease is interrupted, release leases
Summary:
Depends on D19072. Ref T13073. Currently, you can leave leases stranded by using `^C` to interrupt the script. Handle signals and release leases on destruction if they haven't activated yet.

Also, print out more useful information before and after activation.

Test Plan: Mashed ^C while runnning `bin/drydock lease ... --trace`, saw the lease release.

Subscribers: yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam

Maniphest Tasks: T13073

Differential Revision: https://secure.phabricator.com/D19073
2018-02-13 13:14:21 -08:00
epriestley
4dd32dca3e When a Drydock Blueprint promises us a resource but can't deliver, continue believing in it
Summary:
Ref T13073. When a Blueprint says it will be able to allocate a resource but then throws an exception while attempting that allocation, we currently fail the lease permanently.

This is excessively harsh. This blueprint may have the best of intentions and have encountered a legitimately unforseeable failure (like a `vm.new` call to build a VM failed) and be able to succeed in the future.

Even if this blueprint is a dirty liar, other blueprints (or existing resources) may be able to satisfy the lease in the future.

Even if every blueprint is implemented incorrectly, leaving the lease alive lets it converge to success after the blueprints are fixed.

Instead of failing, log the issue and yield.

(In the future, it might make sense to distinguish more narrowly between "actually, all the resources are used up" and all other failure types, since the former is likely more routine and less concerning.)

Test Plan:
  - Wrote a broken `Hoax` blueprint which always claims it can allocate but never actually allocates (just `throw` in `allocateResource()`).
  - Used `bin/phd drydock lease` to acquire a Hoax lease.
  - Before patch: lease abruptly failed permanently.
  - After patch: lease yields after allocation fails.

{F5427747}

Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam

Maniphest Tasks: T13073

Differential Revision: https://secure.phabricator.com/D19070
2018-02-13 13:11:55 -08:00
Dmitri Iouchtchenko
9bd6a37055 Fix spelling
Summary: Noticed a couple of typos in the docs, and then things got out of hand.

Test Plan:
  - Stared at the words until my eyes watered and the letters began to swim on the screen.
  - Consulted a dictionary.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley, yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam

Differential Revision: https://secure.phabricator.com/D18693
2017-10-09 10:48:04 -07:00
Nick Zheng
6861af0cbb When the push phase of "Land Revision" fails, show the error in the UI
Summary: T10447

Test Plan: tested on my dev instance

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: Korvin, yelirekim

Differential Revision: https://secure.phabricator.com/D15350
2016-02-26 09:38:39 -08:00
epriestley
7168d8edd9 Make Drydock reclaim unused resources when it reaches a resource limit
Summary:
Fixes T9994. Currently, when Drydock can't allocate a new resource because some limit has been reached, it waits patiently for a resource to become available.

It is possible that no resource will ever become available. Particularly with "Working Copy" resources, the new lease may want a copy of `rB`, but the resource may already be maxed out on `rA`.

Right now, no process exists to automatically reclaim the unused `rA`.

When we encounter this situation, try to reclaim one of the other resources if it is just sitting there unused.

Specifically:

  - Add a "reclaim" command which means "release this resource //if// it is completely unused".
  - Add a `bin/drydock reclaim` to send this command to every active resource.
  - When we try to acquire a resource and can't, but only because of some kind of limit / utilization problem, try to release an unused resource to free up some room.

Test Plan:
  - Set "Working Copy" resource limit to 1.
  - Ran "Test Configuration" in `rA`, which worked.
  - Ran "Test Configuration" in `rB`, which hung forever.
  - Applied patch.
  - Ran "Test Configuration" in `rB`, saw it reclaim the `rA` resource, use the slot, then succeed.
  - Ran "Test Configuration" in `rA` again, saw it grab the slot back.
  - Ran `bin/drydock reclaim` and saw it reclaim a bunch of old orphaned resources.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9994

Differential Revision: https://secure.phabricator.com/D14819
2015-12-18 11:55:51 -08:00
epriestley
e9af4f8970 Fix an issue where Drydock followup tasks would not queue if the main task failed
Summary:
Ref T9994. This fixes the first issue discussed on that task, which is that when a merge fails after "arc land", we would not clean up all the leases properly.

Specifically, when a merge fails, we use `queueTask()` to schedule a followup task. This followup destroys the lease and frees the underlying resource.

However, the default behavior of `queueTask()` is to //not queue tasks// if the parent task fails. This is a reasonable, safe behavior that was originally introduced in D8774, where it kept us from sending too much mail if a task did "send some mail" and then failed a little later on and got retried.

Since I think the default behavior is correct, I just special cased the behavior for Drydock to make it queue even on failure. These are the only types of followup tasks we currently want to queue on main task failure.

(It's possible that future Blueprints might want some kind of more specialized behavior, where some tasks queue only on success, but we can cross that bridge when we come to it.)

Test Plan:
  - See T9994#149878 for test case setup.
  - I ran that test case again with this patch, and saw the followup task queue properly in the `--trace` log, a correspoinding update task show up in `/daemon/`, and the lease get destroyed when I ran it a moment later.

{F1029915}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9994

Differential Revision: https://secure.phabricator.com/D14818
2015-12-18 08:17:04 -08:00
epriestley
f48a833704 Fix an issue with incorrect authorization handling in Working Copy build steps
Summary:
Fixes T9669. Two issues:

  - We were using `repositoryPHIDs` instead of `blueprintPHIDs` for the list of allowed blueprints. Use the correct value.
  - We weren't enforcing `allowedBlueprintPHIDs` fully correctly. We //did// require an authorization, so the net effect was correct in nearly all cases, but we could have selected from too large a pool in the case where the application itself was doing the authorization (e.g., from the command line).

Test Plan: Ran a build through Drydock/Harbormaster locally.

Reviewers: chad, tycho.tatitscheff

Reviewed By: chad, tycho.tatitscheff

Subscribers: tycho.tatitscheff

Maniphest Tasks: T9669

Differential Revision: https://secure.phabricator.com/D14368
2015-10-30 16:02:35 +00:00
epriestley
a763f9510e Add some Drydock documentation plus "Test Configuration" for repository automation
Summary:
Ref T182. Ref T9252.

  - Adds a "Test" repository operation that just runs `git status` to see if things work.
  - Adds a button for it in Edit Repository.
  - Shows operation status on the operation detail view to make this workflow work a little better.
  - Adds a lot of words. Words words words words.

Test Plan:
  - Tested repository operation.
  - Read words.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182, T9252

Differential Revision: https://secure.phabricator.com/D14349
2015-10-27 18:04:02 +00:00
epriestley
0b24a6e200 Make "Land Revision" show merge conflicts more clearly
Summary:
Ref T182. We just show "an error happened" right now. Improve this behavior.

This error handling chain is a bit ad-hoc for now but we can formalize it as we hit other cases.

Test Plan:
{F910247}

{F910248}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14343
2015-10-26 20:11:21 +00:00
epriestley
2326d5f8d0 Show lease on Repository Operation detail view and awaken on failures
Summary:
Ref T182. Couple of minor improvements here:

  - Show the Drydock lease when viewing a Repository Operation detail screen. This just makes it easier to jump around between relevant objects.
  - When tasks are waiting for a lease, awaken them when it breaks or is released, not just when it is acquired. This makes the queue move forward faster when errors occur.

Test Plan:
  - Viewed a repository operation and saw a link to the lease.
  - Did a bad land (intentional merge problem) and got an error in about ~3 seconds instead of ~17.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14341
2015-10-26 20:00:49 +00:00
epriestley
9c39493796 Make WorkingCopyBlueprint responsible for performing merges
Summary:
Ref T182. Currently, the "RepositoryLand" operation is responsible for performing merges when landing a revision.

However, we'd like to be able to perform these merges in a larger set of cases in the future. For example:

  - After Releeph is revamped, when someone says "I want to merge bug fix X into stable branch Y", it would probably be nice to make that a Buildable and let tests run against it without requring that it actually be pushed anywhere.
  - Same deal if we want a merge-from-Diffusion or cherry-pick-from-Diffusion operation.
  - Similar deal if we want a "random web UI edits from Diffusion".

Move the merging part into WorkingCopy so more applications can share/use it in the future.

A big chunk of this is me making stuff up for now (the ol' undocumented dictionary full of arbitrary magic keys), but I anticipate formalizing it as we move along.

Test Plan: Pushed rGITTEST0d58eef3ce0fa5a10732d2efefc56aec126bc219 up from my local install via "Land Revision".

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14337
2015-10-26 12:40:16 -07:00
epriestley
083a321dad Fix an issue where newly created Drydock resources could be improperly acquired
Summary:
Ref T9252. This is mostly a fix for an edge case from D14236. Here's the setup:

  - There are no resources.
  - A request for a new resource arrives.
  - We build a new resource.

Now, if we were leasing an existing resource, we'd call `canAcquireLeaseOnResource()` before acquiring a lease on the new resource.

However, for new resources we don't do that: we just acquire a lease immediately. This is wrong, because we now allow and expect some resources to be unleasable when created.

In a more complex workflow, this can also produce the wrong result and leave the lease acquired sub-optimally (and, today, deadlocked).

Make the "can we acquire?" pathway consistent for new and existing resources, so we always do the same set of checks.

Test Plan:
  - Started daemons.
  - Deleted all working copy resources.
  - Ran two working-copy-using build plans at the same time.
  - Before this change, one would often [1] acquire a lease on a pending resource which never allocated, then deadlock.
  - After this change, the same thing happens except that the lease remains pending and the work completes.

[1] Although the race this implies is allowed (resource pool limits are soft/advisory, and it is expected that we may occasionally run over them), it's MUCH easier to hit right now than I would expect it to be, so I think there's probably at least one more small bug here somewhere. I'll see if I can root it out after this change.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14272
2015-10-14 06:16:21 -07:00
epriestley
43bee4562c If the stars align, make "Land Revision" kind of work
Summary:
Ref T182. If 35 other things are configured completely correctly, make it remotely possible that this button may do something approximating the thing that the user wanted.

This primarily fleshes out the idea that "operations" (like landing, merging or cherry-picking) can have some beahavior, and when we run an operation we do whatever that behavior is instead of just running `git show`.

Broadly, this isn't too terrible because Drydock seems like it actually works properly for the most part (???!?!).

Test Plan: {F876431}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14270
2015-10-13 15:46:30 -07:00
epriestley
b4af57ec51 Rough cut of DrydockRepositoryOperation
Summary:
Ref T182. This doesn't do anything interesting yet and is mostly scaffolding, but here's roughly the workflow. From previous revision, you can configure "Repository Automation" for a repository:

{F875741}

If it's configured, a new "Land Revision" button shows up:

{F875743}

Once you click it you get a big warning dialog that it won't work, and then this shows up at the top of the revision (completely temporary/placeholder UI, some day a nice progress bar or whatever):

{F875747}

If you're lucky, the operation eventually sort of works:

{F875750}

It only runs `git show` right now, doesn't actually do any writes or anything.

Test Plan:
  - Clicked "Land Revision".
  - Watched `phd debug task`.
  - Saw it log `git show` to output.
  - Verified operation success in UI (by fiddling URL, no way to get there normally yet).

Reviewers: chad

Reviewed By: chad

Subscribers: revi

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14266
2015-10-13 15:46:12 -07:00
epriestley
1bdf225354 Use Drydock authorizations when acquiring leases
Summary:
Ref T9519. When acquiring leases on resources:

  - Only consider resources created by authorized blueprints.
  - Only consider authorized blueprints when creating new resources.
  - Fail with a tailored error if no blueprints are allowed.
  - Fail with a tailored error if missing authorizations are causing acquisition failure.

One somewhat-substantial issue with this is that it's pretty hard to figure out from the Harbormaster side. Specifically, the Build step UI does not show field value anywhere, so the presence of unapproved blueprints is not communicated. This is much more clear in Drydock. I'll plan to address this in future changes to Harbormaster, since there are other related/similar issues anyway.

Test Plan: {F872527}

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9519

Differential Revision: https://secure.phabricator.com/D14254
2015-10-12 17:02:35 -07:00
epriestley
ee937e99fb Fix unbounded expansion of allocating resource pool
Summary:
Ref T9252. I think there's a more complex version of this problem discussed elsewhere, but here's what we hit today:

  - 5 commits land at the same time and trigger 5 builds.
  - All of them go to acquire a working copy.
  - Working copies have a limit of 1 right now, so 1 of them gets the lease on it.
  - The other 4 all trigger allocation of //new// working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.
  - The 4 pending working copies will never activate without manual intervention, so these 4 builds are stuck forever.

To fix this, prevent WorkingCopies from giving out leases until they activate. So now the leases won't acquire until we know the working copy is good, which solves the first problem.

However, this creates a secondary problem:

  - As above, all 5 go to acquire a working copy.
  - One gets it.
  - The other 4 trigger allocations, but no longer acquire leases. This is an improvement.
  - Every time the leases update, they trigger another allocation, but never acquire. They trigger, say, a few thousand allocations.
  - Eventually the first build finishes up and the second lease acquires the working copy. After some time, all of the builds finish.
  - However, they generated an unboundedly large number of pending working copy resources during this time.

This is technically "okay-ish", in that it did work correctly, it just generated a gigantic mess as a side effect.

To solve this, at least for now, provide a mechanism to impose allocation rate limits and put a cap on the number of allocating resources of a given type. As hard-coded, this the greater of "1" or "25% of the active resources in the pool".

So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out. This prevents us from getting runaway queues of limitless size.

This also imposes a total active working copy resource limit of 1, which incidentally also fixes the problem, although I expect to raise this soon.

These mechanisms will need refinement, but the basic idea is:

  - Resources which aren't sure if they can actually activate should wait until they do activate before allowing leases to acquire them. I'm fairly confident this rule is a reasonable one.
  - Then we limit how many bookkeeping side effects Drydock can generate once it starts encountering limits.

Broadly, some amount of mess is inevitable because Drydock is allowed to try things that might not work. In an extreme case we could prevent this mess by setting all these limits at "1" forever, which would degrade Drydock to effectively be a synchronous, blocking queue.

The idea here is to put some amount of slack in the system (more than zero, but less than infinity) so we get the performance benefits of having a parallel, asyncronous system without a finite, manageable amount of mess.

Numbers larger than 0 but less than infinity are pretty tricky, but I think rules like "X% of active resources" seem fairly reasonable, at least for resources like working copies.

Test Plan:
Ran something like this:

```
for i in `seq 1 5`; do sh -c '(./bin/harbormaster build --plan 10 rX... &) &'; done;
```

Saw 5 plans launch, acquire leases, proceed in an orderly fashion, and eventually finish successfully.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14236
2015-10-05 15:59:16 -07:00
epriestley
4496176924 Add staging area support to Harbormaster/Drydock + various fixes
Summary:
Ref T9252. This primarily allows Harbormaster to request (and Drydock to fulfill) working copies with a patch from a staging area. Doing this means we can do builds on in-review changes from `arc diff`.

This is a little cobbled-together but should basically work.

Also fix some other issues:

  - Yielded, awakend workers are fine to update but could complain.
  - We can't log slot lock failures to resources if we don't end up saving them.
  - Killing the transaction would wipe out the log.
  - Fix some TODOs, etc.

Test Plan: Ran Harbormaster builds on a local revision.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14214
2015-10-01 16:55:01 -07:00
epriestley
d4a0b1c870 Remove names from Drydock resources
Summary:
Ref T9252. Long ago you sometimes manually created resources, so they had human-enterable names. However, users never make resources manually any more, so this field isn't really useful any more.

In particular, it means we write a lot of untranslatable strings like "Working Copy" to the database in the default locale. Instead, do the call at runtime so resource names are translatable.

Also clean up a few minor things I hit while kicking the tires here.

It's possible we might eventually want to introduce a human-choosable label so you can rename your favorite resources and this would just be a default name. I don't really have much of a use case for that yet, though, and I'm not sure there will ever be one.

Test Plan:
  - Restarted a Harbormaster build, got a clean build.
  - Released all leases/resources, restarted build, got a clean build with proper resource names.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14213
2015-10-01 08:13:43 -07:00
epriestley
b219bcfb3d Improve error and exception handling for Drydock leases
Summary:
Ref T9252. See companion change in D14211. This does the same thing for leases.

Particularly, most of the TODOs about error handling can just be removed because they'll do the right things by default now.

This and D14211 also move slot lock release to after resource destruction. This feels cleaner than trying to release early at release/break.

Test Plan: Restarted a Harbormaster build, got a clean build result. This needs more vetting but I'll clean up any issues as I hit them.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14212
2015-10-01 08:13:20 -07:00
epriestley
e589d15231 Improve error and exception handling for Drydock resources
Summary:
Ref T9252. Currently, error handling behavior isn't great and a lot of errors aren't dealt with properly. Try to improve this by making default behaviors better:

  - Yields, slot lock exceptions, and aggregate or proxy exceptions containing an excpetion of these types turn into yields.
  - All other exceptions are considered permanent failures. They break the resource and

This feels a little bit "magical" but I want to try to get the default behaviors to align reasonably well with expectations so that blueprints mostly don't need to have a ton of error handling. This will probably need at least some refinement down the road, but it's a reasonable rule for all exception/error conditions we currently have.

Test Plan: I did a clean build, but haven't vetted this super thoroughly. Next diff will do the same thing to leases, then I'll work on stabilizing this code better.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14211
2015-10-01 08:12:51 -07:00
epriestley
4ac82be5ed Merge the DrydockLease workers into a single worker
Summary:
Ref T9252. This is the same as D14201, but for lease stuff instead of resource stuff.

This one is a little heavier but still feels pretty reasonable to me at the end of the day (worker is <1K lines and has a ton of comment stuff).

Also fixes a few random bugs I hit in the task queue.

Test Plan:
  - Restarted some Harbormaster builds, saw them go through cleanly.
  - Released pre-activation resources/leases.
  - Probably still kinda buggy but I'll iron the details out over time.

Logs are starting to look somewhat plausible:

{F855747}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14202
2015-10-01 08:11:02 -07:00
epriestley
91e5ca0ee2 Merge the DrydockResource workers into a single worker
Summary:
Ref T9252. Currently, Drydock Leases and Resources have several workers:

  - Resources: ResourceWorker, ResourceUpdateWorker, ResourceDestroyWorker
  - Leases: AllocatorWorker, LeaseWorker, LeaseUpdateWorker, LeaseDestroyWorker

This is kind of a lot of stuff, and it creates some problems.

In particular, leases and resources in early lifecycle phases (pending/allocating/acquiring) can't process commands yet, because that code is only in the "UpdateWorker" classes. If they aren't able to move forward because of a bug, they also can't be released because they can't react to the release command until later in their lifecycle. This creates a soft hang where I have to go wipe stuff out of the database since there's no other way to get rid of it.

Instead, I want leases and resources to be releasable from any (pre-release / pre-destroy) phase of their lifecycle. To support this, all the workers before the "UpdateWorker" need to be able to process commands.

A second, similar issue is that logging and exception handling behaviors are underpowered right now. Elsewhere I began improving this, but ran into issues where all of the workers needed to share very similar exception code. Merging them will make this future change simpler.

This diff fixes this for resources: it merges the Worker, UpdateWorker and DestroyWorker logic into UpdateWorker and throws away the other two workers.

Test Plan: Nothing substantive yet, see next diff. I'll do the same thing for Leases, then test both more thoroughly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14201
2015-10-01 08:10:40 -07:00
epriestley
8bf5905024 Add Drydock log types and more logging
Summary: Ref T9252. Make log types modular so they can be translated and have complicated rendering logic if necessary (currently, none have this).

Test Plan: {F855330}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14198
2015-10-01 08:10:07 -07:00
epriestley
ec6d69e74d Give Drydock resources a proper expiry mechanism
Summary:
Fixes T6569. This implements an expiry mechanism for Drydock resources which parallels the mechanism for leases.

A few things are missing that we'll probably need in the future:

  - An "EXPIRES" command to update the expiration time. This would let resources be permanent while leased, then expire after, say, 24 hours without any leases.
  - A callback like `shouldActuallyExpireRightNow()` for resources and leases that lets them decide not to expire at the last second.
  - A callback like `didAcquireLease()` for resource blueprints, to parallel `didReleaseLease()`, letting them clear or extend their timer.

However, this stuff would mostly just let us tune behaviors, not really open up new capabilities.

Test Plan: Changed host resources to expire after 60 seconds, leased one, saw it vanish 60 seconds later.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14176
2015-09-28 09:35:14 -07:00
epriestley
b441e8b81e Allow Drydock blueprints to be disabled
Summary: Ref T9252. If you have a blueprint and you do not like that blueprint very much, you can disable it.

Test Plan: Disabled / enabled some blueprints.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14156
2015-09-24 10:18:17 -07:00
epriestley
e117ace8c7 Convert Drydock lease and resource constants to strings
Summary:
Ref T9252. Drydock currently uses integer statuses, but there's no reason for this (they don't need to be ordered) and it makes debugging them, working with them, future APIs, etc., more cumbersome.

Switch to string instead.

Also rename `STATUS_OPEN` to `STATUS_ACTIVE` and `STATUS_CLOSED` to `STATUS_RELEASED` for consistency. This makes resources and leases have more similar states, and gives resource states more accurate names.

Test Plan: Browsed web UI, grepped for changed constants, applied patch, inspected database.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14153
2015-09-24 07:57:05 -07:00
epriestley
c6aade4392 Give Drydock leases a resourcePHID instead of a resourceID
Summary:
Ref T9252. Leases currently have a `resourceID`, but this is a bit nonstandard and generally less flexible than giving them a `resourcePHID`.

In particular, a `resourcePHID` is easier to use when rendering interfaces, since you can get handles out of a PHID.

Add a PHID column, copy over all the PHIDs that correspond to existing IDs, then drop the ID column.

Test Plan:
  - Browsed web UIs.
  - Inspected database during/after migration.
  - Grepped for `resourceID`.
  - Allocated a new lease with `bin/drydock lease`.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14151
2015-09-24 04:19:27 -07:00
epriestley
309aadc595 Rename Drydock Lease STATUS_EXPIRED to STATUS_DESTROYED
Summary: Ref T9252. This is now more consistent (same as the equivalent Resource state) and accurate (leases can end up in this state a bunch of ways, including by expiring).

Test Plan: `grep`, browsed around web UI.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14150
2015-09-23 20:48:51 -07:00
epriestley
3379904237 Allow Drydock leases to expire after a time limit
Summary: Ref T6569. If a lease is activated with an expiration date, schedule a task to try to clean it up after that time.

Test Plan:
  - Used `bin/drydock lease ... --until ...` to activate a lease in the near future.
  - Waited for a bit.
  - Saw it expire and get destroyed at the scheduled time.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14148
2015-09-23 13:54:27 -07:00
epriestley
1f311d64c6 Give Drydock resources and leases a real "destroy" lifecycle phase
Summary: Ref T9252. Some leases or resources may need to remove data, tear down VMs, etc., during cleanup. After they are released, queue a "destroy" phase for performing teardown.

Test Plan:
  - Used `bin/drydock lease ...` to create a working copy lease.
  - Used `bin/drydock release-lease` and `bin/drydock release-resource` to release the lease and then the working copy and host.
  - Saw working copy and host get destroyed and cleaned up properly.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569, T9252

Differential Revision: https://secure.phabricator.com/D14144
2015-09-23 11:20:20 -07:00
epriestley
789df89c84 Add a command queue to Drydock to manage lease/resource release
Summary:
Ref T9252. Broadly, Drydock currently races on releasing objects from the "active" state. To reproduce this:

  - Scatter some sleep()s pretty much anywhere in the release code.
  - Release several times from web UI or CLI in quick succession.

Resources or leases will execute some release code twice or otherwise do inconsistent things.

(I didn't chase down a detailed reproduction scenario for this since inspection of the code makes it clear that there are no meaningful locks or mechanisms preventing this.)

Instead, add a Harbormaster-style command queue to resources and leases. When something wants to do a release, it adds a command to the queue and schedules a worker. The workers acquire a lock, then try to consume commands from the queue.

This guarantees that only one process is responsible for writes to active resource/leases.

This is the last major step to giving resources and leases a single writer during all states:

  - Resource, Unsaved: AllocatorWorker
  - Resource, Pending: ResourceWorker (Possible rename to "Allocated?")
  - Resource, Open: This diff, ResourceUpdateWorker. (Likely rename to "Active").
  - Resource, Closed/Broken: Future destruction worker. (Likely rename to "Released" / "Broken"; maybe remove "Broken").
  - Resource, Destroyed: No writes.
  - Lease, Unsaved: Whatever wants the lease.
  - Lease, Pending: AllocatorWorker
  - Lease, Acquired: LeaseWorker
  - Lease, Active: This diff, LeaseUpdateWorker.
  - Lease, Released/Broken: Future destruction worker (Maybe remove "Broken"?)
  - Lease, Expired: No writes. (Likely rename to "Destroyed").

In most phases, we can already guarantee that there is a single writer without doing any extra work. This is more complicated in the "Active" case because the release buttons on the web UI, the release tools on the CLI, the lease requestor itself, the garbage collector, and any other release process cleaning up related objects may try to effect a release. All of these could race one another (and, in many cases, race other processes from other phases because all of these get to act immediately) as this code is currently written. Using a queue here lets us make sure there's only a single writer in this phase.

One thing which is notable is that whatever acquires a lease **can not write to it**! It is never the writer once it queues the lease for activation. It can not write to any resources, either. And, likewise, Blueprints can not write to resources while acquiring or releasing leases.

We may need to provide a mechinism so that blueprints and/or resource/lease holders get to attach some storage to resources/leases for bookkeeping. For example, a blueprint might need to keep some kind of cache on a resource to help it manage state. But I think we can cross that bridge when we come to it, and nothing else would need to write to this storage so it's technically straightforward to introduce such a mechanism if we need one.

Test Plan:
  - Viewed buttons in web UI, checked enabled/disabled states.
  - Clicked the buttons.
  - Saw commands show up in the command queue.
  - Saw some daemon stuff get scheduled.
  - Ran CLI tools, saw commands get consumed and resources/leases release.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14143
2015-09-23 07:42:08 -07:00
epriestley
f1119ffcf5 Support working copies and separate allocate + activate steps for resources/leases in Drydock
Summary:
Ref T9253. For resources and leases that need to do something which takes a lot of time or requires waiting, allow them to allocate/acquire first and then activate later.

When we allocate a resource or acquire a lease, the blueprint can either activate it immediately (if all the work can happen quickly/inline) or activate it later. If the blueprint activates it later, we queue a worker to handle activating it.

Rebuild the "working copy" blueprint to work with this model: it allocates/acquires and activates in a separate step, once it is able to acquire a host.

Test Plan: With some power of imagination, brought up a bunch of working copies with `bin/drydock lease --type working-copy ...`

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14127
2015-09-21 04:46:24 -07:00
epriestley
3ac99006bf Implement optimistic "slot locks" in Drydock
Summary:
See discussion in D10304. There's a lot of context there, but the general idea is:

  - Blueprints should manage locks in a granular way during the actual allocation/acquisition phase.
  - Optimistic "slot locks" might a pretty good primitive to make that easy to implement and reason about in most cases.

The way these locks work is that you just pick some name for the lock (like the PHID of a resource) and say that it needs to be acquired for the allocation/acquisition to work:

```
...
->needSlotLock("mylock(PHID-XYZQ-...)")
...
```

When you fire off the acquisition or allocation, it fails unless it could acquire the slot with that name. This is really simple (no explicit lock management) and a pretty good fit for most of the locking that blueprints and leases need to do.

If you need to do limit-based locks (e.g., maximum of 3 locks) you could acquire a lock like this:

```
mylock(whatever).slot(2)
```

Blueprints generally only contend with themselves, so it's normally OK for them to pick whatever strategy works best for them in naming locks.

This may not work as well if you have a huge number of slots (e.g., 100TB you want to give out in 1MB chunks), or other complex needs for locks (like you have to synchronize access to some external resource), but slot locks don't need to be the only mechanism that blueprints use. If they run into a problem that slot locks aren't a good fit for, they can use something else instead. For now, slot locks seem like a good fit for the problems we currently face and most of the problems I anticipate facing.

(The release workflows have other race issues which I'm not addressing here. They work fine if nothing races, but aren't race-safe.)

Test Plan:
To create a race where the same binding is allocated as a resource twice:

  - Add `sleep(10)` near the beginning of `allocateResource()`, after the free bindings are loaded but before resources are allocated.
  - (Comment out slot lock acquisition if you have this patch.)
  - Run `bin/drydock lease ...` in two windows, within 10 seconds of one another.

This will reliably double-allocate the binding because both blueprints see a view of the world where the binding is free.

To verify the lock works, un-comment it (or apply this patch) and run the same test again. Now, the lock fails in one process and only one resource is allocated.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Differential Revision: https://secure.phabricator.com/D14118
2015-09-21 04:45:25 -07:00
epriestley
6e03419593 Implement a rough AlmanacService blueprint in Drydock
Summary:
Ref T9253. Broadly, this realigns Allocator behavior to be more consistent and straightforward and amenable to intended future changes.

This attempts to make language more consistent: resources are "allocated" and leases are "acquired".

This prepares for (but does not implement) optimistic "slot locking", as discussed in D10304. Although I suspect some blueprints will need to perform other locking eventually, this does feel like a good fit for most of the locking blueprints need to do.

In particular, I've made the blueprint operations on `$resource` and `$lease` objects more purposeful: they need to invoke an activator on the appropriate object to be implemented correctly. Before they invoke this activator method, they configure the object. In a future diff, this configuration will include specifying slot locks that the lease or resource must acquire. So the API will be something like:

  $lease
    ->setActivateWhenAcquired(true)
    ->needSlotLock('x')
    ->needSlotLock('y')
    ->acquireOnResource($resource);

In the common case where slot locks are a good fit, I think this should make correct blueprint implementation very straightforward.

This prepares for (but does not implement) resources and leases which need significant setup steps. I've basically carved out two modes:

  - The "activate immediately" mode, as here, immediately opens the resource or activates the lease. This is appropriate if little or no setup is required. I expect many leases to operate in this mode, although I expect many resources will operate in the other mode.
  - The "allocate now, activate later" mode, which is not fully implemented yet. This will queue setup workers when the allocator exits. Overall, this will work very similarly to Harbormaster.
  - This new structure makes it acceptable for blueprints to sleep as long as they want during resource allocation and lease acquisition, so long as they are not waiting on anything which needs to be completed by the queue. Putting a `sleep(15 * 60)` in your EC2Blueprint to wait for EC2 to bring a machine up will perform worse than using delayed activation, but won't deadlock the queue or block any locks.

Overall, this flow is more similar to Harbormaster's flow. Having consistency between Harbormaster's model and Drydock's model is good, and I think Harbormaster's model is also simply much better than Drydock's (what exists today in Drydock was implemented a long time ago, and we had more support and infrastructure by the time Harbormaster was implemented, as well as a more clearly defined problem).

The particular strength of Harbormaster is that objects always (or almost always, at least) have a single, clearly defined writer. Ensuring objects have only one writer prevents races and makes reasoning about everything easier.

Drydock does not currently have a clearly defined single writer, but this moves us in that direction. We'll probably need more primitives eventually to flesh this out, like Harbormaster's command queue for messaging objects which you can't write to.

This blueprint was originally implemented in D13843. This makes a few changes to the blueprint itself:

  - A bunch of code from that (e.g., interfaces) doesn't exist yet.
  - I let the blueprint have multiple services. This simplifies the code a little and seems like it costs us nothing.

This also removes `bin/drydock create-resource`, which no longer makes sense to expose. It won't get locking, leasing, etc., correct, and can not be made correct.

NOTE: This technically works but doesn't do anything useful yet.

Test Plan: Used `bin/drydock lease --type host` to acquire leases against these blueprints.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Subscribers: Mnkras

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14117
2015-09-21 04:43:53 -07:00
epriestley
bb28f94f9b Reduce garbage-level of Drydock Allocator implementation
Summary:
Ref T9253. The Drydock allocator is very pseudocodey right now. Particularly, it was written before Blueprints were concrete.

Reorganize it to make its responsibilities and error handling behaviors more clear.

In particular, the Allocator does not manage locks. It's primarily trying to reject allocations which can not possibly work. Blueprints are responsible for locks. See some discussion in D10304.

NOTE: This code probably doesn't work as written, see future diffs.

Test Plan: See future diffs.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Subscribers: cburroughs

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14114
2015-09-21 04:43:25 -07:00
Joshua Spence
36e2d02d6e phtize all the things
Summary: `pht`ize a whole bunch of strings in rP.

Test Plan: Intense eyeballing.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: hach-que, Korvin, epriestley

Differential Revision: https://secure.phabricator.com/D12797
2015-05-22 21:16:39 +10:00
James Rhodes
4277123060 Increase Drydock worker time to 24 hours
Summary:
Ref T2015.  This increases the Drydock worker lease time to 24 hours.  We noticed that some leases took longer than 2 hours when leasing from AWS (the actual resource was successfully leased at around 2 hours, 19 minutes).

24 hours should be plenty enough time to actually lease anything from EC2 (or any other leases during builds).

Test Plan: Have not yet tested this.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley, Korvin

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D10544
2014-09-25 09:46:47 +10:00
Joshua Spence
0a62f13464 Change double quotes to single quotes.
Summary: Ran `arc lint --apply-patches --everything` over rP, mainly to change double quotes to single quotes where appropriate. These changes also validate that the `ArcanistXHPASTLinter::LINT_DOUBLE_QUOTE` rule is working as expected.

Test Plan: Eyeballed it.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley, Korvin, hach-que

Differential Revision: https://secure.phabricator.com/D9431
2014-06-09 11:36:50 -07:00
Joshua Spence
6270114767 Various linter fixes.
Summary:
- Removed trailing newlines.
- Added newline at EOF.
- Removed leading newlines.
- Trimmed trailing whitespace.
- Spelling fix.
- Added newline at EOF

Test Plan: N/A

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley

CC: hach-que, chad, Korvin, epriestley, aran

Differential Revision: https://secure.phabricator.com/D8344
2014-02-26 12:44:58 -08:00
epriestley
9b0fa5747b Make Drydock more broadly aware of policies
Summary:
Ref T2015. Moves a bunch of raw object loads into modern policy-aware queries.

Also straightens out the Log and Lease policies a little bit: there are legitimate states where these objects are not attached to a resource (particularly, while a lease is being acquired). Handle these more gracefully.

Test Plan: Lint / browsed stuff.

Reviewers: btrahan

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D7836
2013-12-27 13:15:19 -08:00
James Rhodes
ba16df0fed Restructure Drydock so that blueprints are instances in the DB
Summary:
//(this diff used to be about applying policies to blueprints)//

This restructures Drydock so that blueprints are instances in the DB, with an associated implementation class.  Thus resources now have a `blueprintPHID` instead of `blueprintClass` and DrydockBlueprint becomes a DAO.  The old DrydockBlueprint is renamed to DrydockBlueprintImplementation, and the DrydockBlueprint DAO has a `blueprintClass` column on it.

This now just implements CAN_VIEW and CAN_EDIT policies for blueprints, although they are probably not enforced in all of the places they could be.

Test Plan: Used the `create-resource` and `lease` commands.  Closed resources and leases in the UI.  Clicked around the new and old lists to make sure everything is still working.

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley

CC: Korvin, epriestley, aran

Maniphest Tasks: T4111, T2015

Differential Revision: https://secure.phabricator.com/D7638
2013-12-03 11:09:07 +11:00
epriestley
d4ca508d2b Add PhabricatorWorker->log()
Summary:
Ref T2852. Add a `log()` method to `PhabricatorWorker` to make debugging easier.

I renamed the similar Drydock-specific method.

Test Plan:
Used logging in a future revision:

  ...
  <<< [36] <http> 211,704 us
  Updating main task.
  >>> [37] <http> https://app.asana.com/api/1.0/tasks/6153776820388
  ...

Reviewers: btrahan, chad

Reviewed By: chad

CC: aran

Maniphest Tasks: T2852

Differential Revision: https://secure.phabricator.com/D6296
2013-06-25 16:31:37 -07:00
epriestley
cce5ebebe9 Improve Drydock's ability to allocate leases correctly
Summary:
Right now, Drydock gives out multiple leases to the same working copy and gives out leases to working copies with repository "P" in them when the user requested some other repository.

Add two callbacks:

  - `canAllocateLease()` - allows a blueprint to reject a lease on a resource because of a fundamental incompatibility, like "it's a working copy with Phabricator in it, but the lease wants a working copy with Javelin in it".
  - `shouldAllocateLease()` - allows a blueprint to reject a lease on a resource because of resource limits, like "only one active lease can own a working copy at a time".

Also cleaned up various other things.

Test Plan:
After implementing the callbacks, Drydock has the correct behavior:

  - It gives multiple leases on `localhost`, but only one lease per working-copy resource.
  - It does not grant leases on resources with repository X to requests for repository Y.

Ran `bin/drydock lease --type working-copy --repositoryID 12` and similar repeatedly and verified results in the web console.

Reviewers: btrahan

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D4166
2012-12-12 18:42:12 -08:00
epriestley
7e0ce08154 Make various Drydock CLI/Allocator improvements
Summary:
  - Remove EC2, RemoteHost, Application, etc., blueprints for now. They're very proof-of-concept and Blueprints are getting API changes I don't want to bother propagating for now. Leave the abstract base class and the LocalHost blueprint. I'll restore the more complicated ones once better foundations are in place.
  - Remove the Allocate controller from the web UI. The original vision here was that you'd manually allocate resources in some cases, but it no longer makes sense to do so as all allocations come from leases now. This simplifies allocations and makes the rule for when we can clean up resources clear-cut (if a resource has no more active leases, it can be cleaned up). Instead, we'll build resources like the localhost and remote hosts lazily, when leases come in for them.
  - Add some configuration to manage the localhost blueprint.
  - Refactor `canAllocateResources()` into `isEnabled()` (for config checks) and `canAllocateMoreResources()` (for quota checks, e.g. too many resources are allocated already).
  - Juggle some signatures to align better with a world where blueprints generally do allocate.
  - Add some more logging and error handling.
  - Fix an issue with log ordering.

Test Plan: Allocated some localhost leases.

Reviewers: btrahan

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D3902
2012-11-06 15:30:11 -08:00
vrana
ef85f49adc Delete license headers from files
Summary:
This commit doesn't change license of any file. It just makes the license implicit (inherited from LICENSE file in the root directory).

We are removing the headers for these reasons:

- It wastes space in editors, less code is visible in editor upon opening a file.
- It brings noise to diff of the first change of any file every year.
- It confuses Git file copy detection when creating small files.
- We don't have an explicit license header in other files (JS, CSS, images, documentation).
- Using license header in every file is not obligatory: http://www.apache.org/dev/apply-license.html#new.

This change is approved by Alma Chao (Lead Open Source and IP Counsel at Facebook).

Test Plan: Verified that the license survived only in LICENSE file and that it didn't modify externals.

Reviewers: epriestley, davidrecordon

Reviewed By: epriestley

CC: aran, Korvin

Maniphest Tasks: T2035

Differential Revision: https://secure.phabricator.com/D3886
2012-11-05 11:16:51 -08:00
epriestley
89b37f0357 Make various Drydock improvements
Summary:
Tightens up a bunch of stuff:

  - In `drydock lease`, pull and print logs so the user can see what's happening.
  - Remove `DrydockAllocator`, which was a dumb class that did nothing. Move the tiny amount of logic it held directly to `DrydockLease`.
  - Move `resourceType` from worker task metadata directly to `DrydockLease`. Other things (like the web UI) can be more informative with this information available.
  - Pass leases to `allocateResource()`. We always allocate in response to a lease activation request, and the lease often has vital information. This also allows us to associate logs with leases correctly.

Test Plan: Ran `drydock lease --type host` and saw it perform a host allocation in EC2.

Reviewers: btrahan

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D3870
2012-11-01 16:53:17 -07:00