mirror of https://we.phorge.it/source/phorge.git synced 2024-11-25 16:22:43 +01:00
9 commits

Author SHA1 Message Date
epriestley
4cf1270ecd In Harbormaster, make sure artifacts are destroyed even if a build is aborted
Summary:
Ref T9252. Currently, Harbormaster and Drydock work like this in some cases:

  # Queue a lease for activation.
  # Then, a little later, save the lease PHID somewhere.
  # When the target/resource is destroyed, destroy the lease.

However, something can happen between (1) and (2). In Drydock this window is very short and the "something" would have to be a lightning strike or something similar, but in Harbormaster we wait until the resource activates to do (2), so the window can be many minutes long. In particular, a user can use "Abort Build" during those many minutes.

If they do, the target is destroyed but it doesn't yet have a record of the artifact, so the artifact isn't cleaned up.

Make these things work like this instead:

  # Create a new lease and pre-generate a PHID for it.
  # Save that PHID as something that needs to be cleaned up.
  # Queue the lease for activation.
  # When the target/resource is destroyed, destroy the lease if it exists.

This makes sure there's no step in the process where we might lose track of a lease/resource.
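
The reordered steps can be sketched roughly as follows. This is a minimal Python illustration of the ordering only, not Harbormaster's PHP code: `generate_phid`, `Target`, `start_lease`, the PHID format, and the in-memory `leases`/`queue` containers are all hypothetical names invented for the example.

```python
import uuid


def generate_phid():
    # Hypothetical stand-in for PHID generation; the real format may differ.
    return "PHID-DRYL-" + uuid.uuid4().hex[:20]


class Target:
    """Hypothetical build target that records lease PHIDs it must clean up."""

    def __init__(self):
        self.cleanup_phids = []

    def destroy(self, leases):
        # Destroy each recorded lease, but only if it actually exists.
        destroyed = []
        for phid in self.cleanup_phids:
            if phid in leases:
                del leases[phid]
                destroyed.append(phid)
        return destroyed


def start_lease(target, leases, queue):
    phid = generate_phid()             # 1. pre-generate the PHID
    target.cleanup_phids.append(phid)  # 2. record it for cleanup first
    leases[phid] = "pending"           # 3. only now create the lease...
    queue.append(phid)                 #    ...and queue it for activation
    return phid
```

Because the PHID is recorded before anything is queued, an abort at any later point still finds the lease during `destroy()`, and an abort before the lease exists is simply a no-op.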

Also, clean up and standardize some other stuff I hit.

Test Plan:
  - Stopped daemons.
  - Restarted a build in Harbormaster.
  - Stepped through the build one stage at a time using `bin/worker execute ...`.
  - After the lease was queued, but before it activated, aborted the build.
  - Processed the Harbormaster side of things only.
  - Saw the lease get destroyed properly.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14234
2015-10-05 05:58:53 -07:00
epriestley
d4a0b1c870 Remove names from Drydock resources
Summary:
Ref T9252. Long ago you sometimes manually created resources, so they had human-enterable names. However, users never make resources manually any more, so this field isn't really useful any more.

In particular, it means we write a lot of untranslatable strings like "Working Copy" to the database in the default locale. Instead, do the call at runtime so resource names are translatable.

Also clean up a few minor things I hit while kicking the tires here.

It's possible we might eventually want to introduce a human-choosable label so you can rename your favorite resources and this would just be a default name. I don't really have much of a use case for that yet, though, and I'm not sure there will ever be one.

Test Plan:
  - Restarted a Harbormaster build, got a clean build.
  - Released all leases/resources, restarted build, got a clean build with proper resource names.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14213
2015-10-01 08:13:43 -07:00
epriestley
e589d15231 Improve error and exception handling for Drydock resources
Summary:
Ref T9252. Currently, error handling behavior isn't great and a lot of errors aren't dealt with properly. Try to improve this by making default behaviors better:

  - Yields, slot lock exceptions, and aggregate or proxy exceptions containing an exception of these types turn into yields.
  - All other exceptions are considered permanent failures. They break the resource and

This feels a little bit "magical" but I want to try to get the default behaviors to align reasonably well with expectations so that blueprints mostly don't need to have a ton of error handling. This will probably need at least some refinement down the road, but it's a reasonable rule for all exception/error conditions we currently have.
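
As a rough illustration of the default rule described above, the classification could look something like this. It is a Python sketch under stated assumptions, not the Drydock implementation: the exception classes `Yield`, `SlotLockError`, `AggregateError`, and `ProxyError`, and the functions `should_yield` and `classify`, are hypothetical stand-ins.

```python
class Yield(Exception):
    """Hypothetical: the worker should back off and retry later."""


class SlotLockError(Exception):
    """Hypothetical: a slot lock could not be acquired."""


class AggregateError(Exception):
    """Hypothetical wrapper holding several underlying exceptions."""

    def __init__(self, errors):
        super().__init__("multiple errors")
        self.errors = list(errors)


class ProxyError(Exception):
    """Hypothetical wrapper holding one underlying exception."""

    def __init__(self, inner):
        super().__init__("proxied error")
        self.inner = inner


def should_yield(exc):
    # Yields and slot lock failures are temporary conditions.
    if isinstance(exc, (Yield, SlotLockError)):
        return True
    # Wrappers count as temporary if anything inside them is temporary.
    if isinstance(exc, AggregateError):
        return any(should_yield(e) for e in exc.errors)
    if isinstance(exc, ProxyError):
        return should_yield(exc.inner)
    return False


def classify(exc):
    # Everything that is not a yield is treated as a permanent failure.
    return "yield" if should_yield(exc) else "permanent-failure"
```

With a default like this, a blueprint only needs explicit handling for cases where the default classification is wrong.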

Test Plan: I did a clean build, but haven't vetted this super thoroughly. Next diff will do the same thing to leases, then I'll work on stabilizing this code better.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14211
2015-10-01 08:12:51 -07:00
epriestley
e117ace8c7 Convert Drydock lease and resource constants to strings
Summary:
Ref T9252. Drydock currently uses integer statuses, but there's no reason for this (they don't need to be ordered) and it makes debugging them, working with them, future APIs, etc., more cumbersome.

Switch to strings instead.

Also rename `STATUS_OPEN` to `STATUS_ACTIVE` and `STATUS_CLOSED` to `STATUS_RELEASED` for consistency. This makes resources and leases have more similar states, and gives resource states more accurate names.
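
A conversion like this might be sketched as below. This is a hedged Python illustration, not the actual migration: the legacy integer values and the string spellings `"active"` and `"released"` are assumptions invented for the example, as are `LEGACY_TO_STRING` and `migrate_status`. Only the constant renames come from this change.

```python
# Hypothetical legacy integer values, invented for this example.
LEGACY_STATUS_OPEN = 1
LEGACY_STATUS_CLOSED = 2

# String statuses; the exact spellings are assumptions.
STATUS_ACTIVE = "active"      # formerly STATUS_OPEN
STATUS_RELEASED = "released"  # formerly STATUS_CLOSED

LEGACY_TO_STRING = {
    LEGACY_STATUS_OPEN: STATUS_ACTIVE,
    LEGACY_STATUS_CLOSED: STATUS_RELEASED,
}


def migrate_status(value):
    """Convert a legacy integer status; pass already-migrated strings through."""
    if isinstance(value, str):
        return value
    return LEGACY_TO_STRING[value]
```

String values show up readably in database rows and logs, which is the debugging benefit described above.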

Test Plan: Browsed web UI, grepped for changed constants, applied patch, inspected database.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14153
2015-09-24 07:57:05 -07:00
epriestley
1f311d64c6 Give Drydock resources and leases a real "destroy" lifecycle phase
Summary: Ref T9252. Some leases or resources may need to remove data, tear down VMs, etc., during cleanup. After they are released, queue a "destroy" phase for performing teardown.

Test Plan:
  - Used `bin/drydock lease ...` to create a working copy lease.
  - Used `bin/drydock release-lease` and `bin/drydock release-resource` to release the lease and then the working copy and host.
  - Saw working copy and host get destroyed and cleaned up properly.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569, T9252

Differential Revision: https://secure.phabricator.com/D14144
2015-09-23 11:20:20 -07:00
epriestley
f1119ffcf5 Support working copies and separate allocate + activate steps for resources/leases in Drydock
Summary:
Ref T9253. For resources and leases that need to do something which takes a lot of time or requires waiting, allow them to allocate/acquire first and then activate later.

When we allocate a resource or acquire a lease, the blueprint can either activate it immediately (if all the work can happen quickly/inline) or activate it later. If the blueprint activates it later, we queue a worker to handle activating it.

Rebuild the "working copy" blueprint to work with this model: it allocates/acquires and activates in a separate step, once it is able to acquire a host.
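
The two activation paths can be sketched as follows. This is a minimal Python illustration assuming an in-memory queue; `Resource`, `allocate`, `activate`, `run_workers`, and the status strings are hypothetical names, not Drydock's API.

```python
from collections import deque


class Resource:
    """Hypothetical resource with a simple status field."""

    def __init__(self, name):
        self.name = name
        self.status = "pending"


def activate(resource):
    resource.status = "active"


def allocate(resource, activate_immediately, work_queue):
    resource.status = "allocated"
    if activate_immediately:
        # All the work is quick, so the blueprint activates inline.
        activate(resource)
    else:
        # Slow setup: queue a worker to activate the resource later.
        work_queue.append(resource)
    return resource


def run_workers(work_queue):
    # Stand-in for the daemon queue that handles delayed activation.
    while work_queue:
        activate(work_queue.popleft())
```

For example, `allocate(Resource("working-copy"), False, queue)` leaves the resource in the `allocated` state until `run_workers(queue)` processes it.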

Test Plan: With some power of imagination, brought up a bunch of working copies with `bin/drydock lease --type working-copy ...`

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14127
2015-09-21 04:46:24 -07:00
epriestley
6a0eb9d84b Allow AlmanacHost blueprints to build a meaningful CommandInterface
Summary: Ref T9253. Provide a meaningful command interface for Almanac hosts.

Test Plan:
Configured and leased a real host (`sbuild001.phacility.net`) and ran a command on it.

```
$ ./bin/drydock command --lease 90 -- ls /
bin
boot
core
dev
etc
home
initrd.img
lib
lib64
lost+found
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var
vmlinuz
```

Reviewers: chad, hach-que

Reviewed By: chad, hach-que

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14126
2015-09-21 04:46:02 -07:00
epriestley
3ac99006bf Implement optimistic "slot locks" in Drydock
Summary:
See discussion in D10304. There's a lot of context there, but the general idea is:

  - Blueprints should manage locks in a granular way during the actual allocation/acquisition phase.
  - Optimistic "slot locks" might be a pretty good primitive to make that easy to implement and reason about in most cases.

The way these locks work is that you just pick some name for the lock (like the PHID of a resource) and say that it needs to be acquired for the allocation/acquisition to work:

```
...
->needSlotLock("mylock(PHID-XYZQ-...)")
...
```

When you fire off the acquisition or allocation, it fails unless it could acquire the slot with that name. This is really simple (no explicit lock management) and a pretty good fit for most of the locking that blueprints and leases need to do.
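
One way to get this all-or-nothing behavior is to lean on a uniqueness constraint in the database. The sketch below assumes that approach and uses Python's `sqlite3` module purely for illustration; the `slot_lock` table, `open_lock_table`, `acquire_slot_locks`, and `SlotLockError` are hypothetical names, not the Drydock schema or API.

```python
import sqlite3


class SlotLockError(Exception):
    """Hypothetical: raised when a requested slot is already taken."""


def open_lock_table():
    conn = sqlite3.connect(":memory:")
    # The primary key guarantees at most one owner per lock name.
    conn.execute(
        "CREATE TABLE slot_lock (lock_key TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    return conn


def acquire_slot_locks(conn, owner, keys):
    """Acquire every named slot for `owner`, or none of them."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            for key in keys:
                conn.execute(
                    "INSERT INTO slot_lock (lock_key, owner) VALUES (?, ?)",
                    (key, owner),
                )
    except sqlite3.IntegrityError:
        raise SlotLockError("slot already held") from None
```

There is no explicit lock or unlock call here: the acquisition either inserts all of its rows or fails as a whole.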

If you need to do limit-based locks (e.g., maximum of 3 locks) you could acquire a lock like this:

```
mylock(whatever).slot(2)
```

Blueprints generally only contend with themselves, so it's normally OK for them to pick whatever strategy works best for them in naming locks.
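
A limit-based naming strategy could be sketched like this. It is a Python illustration under assumptions: `acquire_limited_slot` is a hypothetical helper, slots are assumed to be numbered from 0, and the in-memory `held` set stands in for whatever store makes the check-and-claim step atomic in a real system.

```python
def acquire_limited_slot(held, name, limit):
    """Claim the first free slot among `limit` numbered slots for `name`.

    Returns the claimed lock name, or None if every slot is taken.
    """
    for index in range(limit):
        key = "{}.slot({})".format(name, index)
        if key not in held:
            # In a real system this check-and-claim must be atomic.
            held.add(key)
            return key
    return None
```

With `limit=3`, the first three callers receive slots 0, 1, and 2, and a fourth caller gets `None` until a slot is released.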

This may not work as well if you have a huge number of slots (e.g., 100TB you want to give out in 1MB chunks), or other complex needs for locks (like you have to synchronize access to some external resource), but slot locks don't need to be the only mechanism that blueprints use. If they run into a problem that slot locks aren't a good fit for, they can use something else instead. For now, slot locks seem like a good fit for the problems we currently face and most of the problems I anticipate facing.

(The release workflows have other race issues which I'm not addressing here. They work fine if nothing races, but aren't race-safe.)

Test Plan:
To create a race where the same binding is allocated as a resource twice:

  - Add `sleep(10)` near the beginning of `allocateResource()`, after the free bindings are loaded but before resources are allocated.
  - (Comment out slot lock acquisition if you have this patch.)
  - Run `bin/drydock lease ...` in two windows, within 10 seconds of one another.

This will reliably double-allocate the binding because both blueprints see a view of the world where the binding is free.

To verify the lock works, un-comment it (or apply this patch) and run the same test again. Now, the lock fails in one process and only one resource is allocated.
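
The race in this test plan can be modelled deterministically. The Python sketch below is a simplified simulation, not the real allocator: `run_race`, the single `binding-1`, and the in-memory lock set are hypothetical, and the interleaving is fixed by hand rather than produced by real concurrent processes.

```python
def run_race(use_slot_lock):
    """Simulate two allocators that both see the same free binding."""
    free_bindings = ["binding-1"]
    slot_locks = set()
    resources = []

    # Both processes load the free bindings before either one allocates.
    # This is the window that the sleep(10) in the test plan widens.
    view_a = list(free_bindings)
    view_b = list(free_bindings)

    for process, view in (("A", view_a), ("B", view_b)):
        binding = view[0]
        if use_slot_lock:
            if binding in slot_locks:
                continue  # lock acquisition fails; this allocation is abandoned
            slot_locks.add(binding)
        resources.append((process, binding))
    return resources
```

Calling `run_race(False)` allocates the binding twice, while `run_race(True)` allocates it once, matching the two outcomes described above.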

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Differential Revision: https://secure.phabricator.com/D14118
2015-09-21 04:45:25 -07:00
epriestley
6e03419593 Implement a rough AlmanacService blueprint in Drydock
Summary:
Ref T9253. Broadly, this realigns Allocator behavior to be more consistent and straightforward and amenable to intended future changes.

This attempts to make language more consistent: resources are "allocated" and leases are "acquired".

This prepares for (but does not implement) optimistic "slot locking", as discussed in D10304. Although I suspect some blueprints will need to perform other locking eventually, this does feel like a good fit for most of the locking blueprints need to do.

In particular, I've made the blueprint operations on `$resource` and `$lease` objects more purposeful: they need to invoke an activator on the appropriate object to be implemented correctly. Before they invoke this activator method, they configure the object. In a future diff, this configuration will include specifying slot locks that the lease or resource must acquire. So the API will be something like:

  $lease
    ->setActivateWhenAcquired(true)
    ->needSlotLock('x')
    ->needSlotLock('y')
    ->acquireOnResource($resource);

In the common case where slot locks are a good fit, I think this should make correct blueprint implementation very straightforward.
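
To show how such a chained API might behave, here is a small Python sketch. It is not the Drydock implementation: the `Lease` class, its snake_case method names, the status strings, and the `held_locks` set passed to `acquire_on_resource` are assumptions made for the example.

```python
class Lease:
    """Hypothetical lease that is configured first and then acquired."""

    def __init__(self):
        self.activate_when_acquired = False
        self.slot_locks = []
        self.resource = None
        self.status = "pending"

    def set_activate_when_acquired(self, value):
        self.activate_when_acquired = value
        return self  # returning self allows chaining

    def need_slot_lock(self, key):
        self.slot_locks.append(key)
        return self

    def acquire_on_resource(self, resource, held_locks):
        # Fail the whole acquisition if any requested slot is already held.
        conflicts = [k for k in self.slot_locks if k in held_locks]
        if conflicts:
            raise RuntimeError("slot locks already held: " + ", ".join(conflicts))
        held_locks.update(self.slot_locks)
        self.resource = resource
        self.status = "active" if self.activate_when_acquired else "acquired"
        return self
```

The configuration calls only record intent. Nothing is locked or written until the final `acquire_on_resource()` call, which keeps all of the failure handling in one place.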

This prepares for (but does not implement) resources and leases which need significant setup steps. I've basically carved out two modes:

  - The "activate immediately" mode, as here, immediately opens the resource or activates the lease. This is appropriate if little or no setup is required. I expect many leases to operate in this mode, although I expect many resources will operate in the other mode.
  - The "allocate now, activate later" mode, which is not fully implemented yet. This will queue setup workers when the allocator exits. Overall, this will work very similarly to Harbormaster.
  - This new structure makes it acceptable for blueprints to sleep as long as they want during resource allocation and lease acquisition, so long as they are not waiting on anything which needs to be completed by the queue. Putting a `sleep(15 * 60)` in your EC2Blueprint to wait for EC2 to bring a machine up will perform worse than using delayed activation, but won't deadlock the queue or block any locks.

Overall, this flow is more similar to Harbormaster's flow. Having consistency between Harbormaster's model and Drydock's model is good, and I think Harbormaster's model is also simply much better than Drydock's (what exists today in Drydock was implemented a long time ago, and we had more support and infrastructure by the time Harbormaster was implemented, as well as a more clearly defined problem).

The particular strength of Harbormaster is that objects always (or almost always, at least) have a single, clearly defined writer. Ensuring objects have only one writer prevents races and makes reasoning about everything easier.

Drydock does not currently have a clearly defined single writer, but this moves us in that direction. We'll probably need more primitives eventually to flesh this out, like Harbormaster's command queue for messaging objects which you can't write to.

This blueprint was originally implemented in D13843. This makes a few changes to the blueprint itself:

  - A bunch of code from that (e.g., interfaces) doesn't exist yet.
  - I let the blueprint have multiple services. This simplifies the code a little and seems like it costs us nothing.

This also removes `bin/drydock create-resource`, which no longer makes sense to expose. It won't get locking, leasing, etc., correct, and cannot be made correct.

NOTE: This technically works but doesn't do anything useful yet.

Test Plan: Used `bin/drydock lease --type host` to acquire leases against these blueprints.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Subscribers: Mnkras

Maniphest Tasks: T9253

Differential Revision: https://secure.phabricator.com/D14117
2015-09-21 04:43:53 -07:00