mirror of https://we.phorge.it/source/phorge.git synced 2024-12-03 20:22:46 +01:00

14 commits

Author SHA1 Message Date
epriestley
f48a833704 Fix an issue with incorrect authorization handling in Working Copy build steps
Summary:
Fixes T9669. Two issues:

  - We were using `repositoryPHIDs` instead of `blueprintPHIDs` for the list of allowed blueprints. Use the correct value.
  - We weren't enforcing `allowedBlueprintPHIDs` fully correctly. We //did// require an authorization, so the net effect was correct in nearly all cases, but we could have selected from too large a pool in the case where the application itself was doing the authorization (e.g., from the command line).

Test Plan: Ran a build through Drydock/Harbormaster locally.

Reviewers: chad, tycho.tatitscheff

Reviewed By: chad, tycho.tatitscheff

Subscribers: tycho.tatitscheff

Maniphest Tasks: T9669

Differential Revision: https://secure.phabricator.com/D14368
2015-10-30 16:02:35 +00:00
epriestley
2326d5f8d0 Show lease on Repository Operation detail view and awaken on failures
Summary:
Ref T182. Couple of minor improvements here:

  - Show the Drydock lease when viewing a Repository Operation detail screen. This just makes it easier to jump around between relevant objects.
  - When tasks are waiting for a lease, awaken them when it breaks or is released, not just when it is acquired. This makes the queue move forward faster when errors occur.

Test Plan:
  - Viewed a repository operation and saw a link to the lease.
  - Did a bad land (intentional merge problem) and got an error in ~3 seconds instead of ~17.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T182

Differential Revision: https://secure.phabricator.com/D14341
2015-10-26 20:00:49 +00:00
epriestley
083a321dad Fix an issue where newly created Drydock resources could be improperly acquired
Summary:
Ref T9252. This is mostly a fix for an edge case from D14236. Here's the setup:

  - There are no resources.
  - A request for a new resource arrives.
  - We build a new resource.

Now, if we were leasing an existing resource, we'd call `canAcquireLeaseOnResource()` before acquiring a lease on the new resource.

However, for new resources we don't do that: we just acquire a lease immediately. This is wrong, because we now allow and expect some resources to be unleasable when created.

In a more complex workflow, this can also produce the wrong result and leave the lease acquired sub-optimally (and, today, deadlocked).

Make the "can we acquire?" pathway consistent for new and existing resources, so we always do the same set of checks.
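As a rough illustration of the consistent pathway described above, here is a minimal Python sketch (Drydock itself is PHP). Only the name `canAcquireLeaseOnResource()` comes from this change; the `Resource` class, its statuses, the `limit` parameter, and `acquire()` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Resource:
    # Hypothetical statuses: "pending" (still allocating) or "active".
    status: str = "pending"
    leases: List[str] = field(default_factory=list)


def can_acquire_lease_on_resource(resource: Resource, limit: int = 1) -> bool:
    # Assumed rule: only activated resources with spare capacity are leasable.
    return resource.status == "active" and len(resource.leases) < limit


def acquire(lease: str, existing: List[Resource],
            build_resource: Callable[[], Resource]) -> Optional[Resource]:
    # Build a new resource only if no existing one can take the lease.
    candidates = list(existing)
    if not any(can_acquire_lease_on_resource(r) for r in candidates):
        candidates.append(build_resource())

    # New and existing resources go through the same check.
    for resource in candidates:
        if can_acquire_lease_on_resource(resource):
            resource.leases.append(lease)
            return resource

    # Nothing is leasable yet, so the lease stays pending.
    return None
```

In this sketch a freshly built resource that is still pending is never leased immediately; the lease simply waits.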

Test Plan:
  - Started daemons.
  - Deleted all working copy resources.
  - Ran two working-copy-using build plans at the same time.
  - Before this change, one would often [1] acquire a lease on a pending resource which never allocated, then deadlock.
  - After this change, the same thing happens except that the lease remains pending and the work completes.

[1] Although the race this implies is allowed (resource pool limits are soft/advisory, and it is expected that we may occasionally run over them), it's MUCH easier to hit right now than I would expect it to be, so I think there's probably at least one more small bug here somewhere. I'll see if I can root it out after this change.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14272
2015-10-14 06:16:21 -07:00
epriestley
1bdf225354 Use Drydock authorizations when acquiring leases
Summary:
Ref T9519. When acquiring leases on resources:

  - Only consider resources created by authorized blueprints.
  - Only consider authorized blueprints when creating new resources.
  - Fail with a tailored error if no blueprints are allowed.
  - Fail with a tailored error if missing authorizations are causing acquisition failure.
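The four rules above can be sketched roughly as follows. This is an illustrative Python model, not the Drydock implementation: the exception classes, function names, and the dictionary shape used for resources are assumptions made for the example.

```python
class NoBlueprintsAllowed(Exception):
    """Hypothetical error: the lease does not allow any blueprints."""


class MissingAuthorization(Exception):
    """Hypothetical error: allowed blueprints exist, but none are authorized."""


def usable_blueprints(allowed_phids, authorized_phids):
    # Fail with a tailored error if no blueprints are allowed at all.
    if not allowed_phids:
        raise NoBlueprintsAllowed("Lease allows no blueprints.")

    authorized = set(authorized_phids)
    usable = [phid for phid in allowed_phids if phid in authorized]

    # Fail with a different error if authorizations are what is missing.
    if not usable:
        raise MissingAuthorization("No allowed blueprint is authorized.")
    return usable


def leasable_resources(resources, usable_phids):
    # Only consider resources created by authorized blueprints.
    usable = set(usable_phids)
    return [r for r in resources if r["blueprintPHID"] in usable]
```

The same `usable_blueprints()` result would also bound which blueprints may create new resources.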

One somewhat-substantial issue with this is that it's pretty hard to figure out from the Harbormaster side. Specifically, the Build step UI does not show the field value anywhere, so the presence of unapproved blueprints is not communicated. This is much more clear in Drydock. I'll plan to address this in future changes to Harbormaster, since there are other related/similar issues anyway.

Test Plan: {F872527}

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9519

Differential Revision: https://secure.phabricator.com/D14254
2015-10-12 17:02:35 -07:00
epriestley
ee937e99fb Fix unbounded expansion of allocating resource pool
Summary:
Ref T9252. I think there's a more complex version of this problem discussed elsewhere, but here's what we hit today:

  - 5 commits land at the same time and trigger 5 builds.
  - All of them go to acquire a working copy.
  - Working copies have a limit of 1 right now, so 1 of them gets the lease on it.
  - The other 4 all trigger allocation of //new// working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.
  - The 4 pending working copies will never activate without manual intervention, so these 4 builds are stuck forever.

To fix this, prevent WorkingCopies from giving out leases until they activate. So now the leases won't acquire until we know the working copy is good, which solves the first problem.

However, this creates a secondary problem:

  - As above, all 5 go to acquire a working copy.
  - One gets it.
  - The other 4 trigger allocations, but no longer acquire leases. This is an improvement.
  - Every time the leases update, they trigger another allocation, but never acquire. They trigger, say, a few thousand allocations.
  - Eventually the first build finishes up and the second lease acquires the working copy. After some time, all of the builds finish.
  - However, they generated an unboundedly large number of pending working copy resources during this time.

This is technically "okay-ish", in that it did work correctly, it just generated a gigantic mess as a side effect.

To solve this, at least for now, provide a mechanism to impose allocation rate limits and put a cap on the number of allocating resources of a given type. As hard-coded, this is the greater of "1" or "25% of the active resources in the pool".

So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out. This prevents us from getting runaway queues of limitless size.
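The arithmetic of that cap can be written out as a small Python sketch. The function names are hypothetical, and rounding down is an assumption; the message only states "the greater of 1 or 25% of the active resources".

```python
import math


def allocating_limit(active_count: int, fraction: float = 0.25) -> int:
    # Greater of 1 or 25% of the active pool (rounding down is assumed).
    return max(1, math.floor(active_count * fraction))


def can_start_allocation(active_count: int, allocating_count: int) -> bool:
    # Cut off new allocations once the cap on allocating resources is reached.
    return allocating_count < allocating_limit(active_count)
```

With 40 active working copies the cap is 10; with an empty or very small pool it falls back to 1, so at least one allocation can always be in flight.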

This also imposes a total active working copy resource limit of 1, which incidentally also fixes the problem, although I expect to raise this soon.

These mechanisms will need refinement, but the basic idea is:

  - Resources which aren't sure if they can actually activate should wait until they do activate before allowing leases to acquire them. I'm fairly confident this rule is a reasonable one.
  - Then we limit how many bookkeeping side effects Drydock can generate once it starts encountering limits.

Broadly, some amount of mess is inevitable because Drydock is allowed to try things that might not work. In an extreme case we could prevent this mess by setting all these limits at "1" forever, which would degrade Drydock to effectively be a synchronous, blocking queue.

The idea here is to put some amount of slack in the system (more than zero, but less than infinity) so we get the performance benefits of having a parallel, asynchronous system with only a finite, manageable amount of mess.

Numbers larger than 0 but less than infinity are pretty tricky, but I think rules like "X% of active resources" seem fairly reasonable, at least for resources like working copies.

Test Plan:
Ran something like this:

```
for i in `seq 1 5`; do sh -c '(./bin/harbormaster build --plan 10 rX... &) &'; done;
```

Saw 5 plans launch, acquire leases, proceed in an orderly fashion, and eventually finish successfully.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14236
2015-10-05 15:59:16 -07:00
epriestley
4496176924 Add staging area support to Harbormaster/Drydock + various fixes
Summary:
Ref T9252. This primarily allows Harbormaster to request (and Drydock to fulfill) working copies with a patch from a staging area. Doing this means we can do builds on in-review changes from `arc diff`.

This is a little cobbled-together but should basically work.

Also fix some other issues:

  - Yielded, awakened workers are fine to update but could complain.
  - We can't log slot lock failures to resources if we don't end up saving them.
  - Killing the transaction would wipe out the log.
  - Fix some TODOs, etc.

Test Plan: Ran Harbormaster builds on a local revision.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14214
2015-10-01 16:55:01 -07:00
epriestley
d4a0b1c870 Remove names from Drydock resources
Summary:
Ref T9252. Long ago you sometimes manually created resources, so they had human-enterable names. However, users never make resources manually any more, so this field is no longer useful.

In particular, it means we write a lot of untranslatable strings like "Working Copy" to the database in the default locale. Instead, do the call at runtime so resource names are translatable.

Also clean up a few minor things I hit while kicking the tires here.

It's possible we might eventually want to introduce a human-choosable label so you can rename your favorite resources and this would just be a default name. I don't really have much of a use case for that yet, though, and I'm not sure there will ever be one.

Test Plan:
  - Restarted a Harbormaster build, got a clean build.
  - Released all leases/resources, restarted build, got a clean build with proper resource names.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14213
2015-10-01 08:13:43 -07:00
epriestley
b219bcfb3d Improve error and exception handling for Drydock leases
Summary:
Ref T9252. See companion change in D14211. This does the same thing for leases.

Particularly, most of the TODOs about error handling can just be removed because they'll do the right things by default now.

This and D14211 also move slot lock release to after resource destruction. This feels cleaner than trying to release early at release/break.

Test Plan: Restarted a Harbormaster build, got a clean build result. This needs more vetting but I'll clean up any issues as I hit them.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14212
2015-10-01 08:13:20 -07:00
epriestley
4ac82be5ed Merge the DrydockLease workers into a single worker
Summary:
Ref T9252. This is the same as D14201, but for lease stuff instead of resource stuff.

This one is a little heavier but still feels pretty reasonable to me at the end of the day (worker is <1K lines and has a ton of comment stuff).

Also fixes a few random bugs I hit in the task queue.

Test Plan:
  - Restarted some Harbormaster builds, saw them go through cleanly.
  - Released pre-activation resources/leases.
  - Probably still kinda buggy but I'll iron the details out over time.

Logs are starting to look somewhat plausible:

{F855747}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14202
2015-10-01 08:11:02 -07:00
epriestley
8bf5905024 Add Drydock log types and more logging
Summary: Ref T9252. Make log types modular so they can be translated and have complicated rendering logic if necessary (currently, none have this).

Test Plan: {F855330}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14198
2015-10-01 08:10:07 -07:00
epriestley
ec6d69e74d Give Drydock resources a proper expiry mechanism
Summary:
Fixes T6569. This implements an expiry mechanism for Drydock resources which parallels the mechanism for leases.

A few things are missing that we'll probably need in the future:

  - An "EXPIRES" command to update the expiration time. This would let resources be permanent while leased, then expire after, say, 24 hours without any leases.
  - A callback like `shouldActuallyExpireRightNow()` for resources and leases that lets them decide not to expire at the last second.
  - A callback like `didAcquireLease()` for resource blueprints, to parallel `didReleaseLease()`, letting them clear or extend their timer.

However, this stuff would mostly just let us tune behaviors, not really open up new capabilities.

Test Plan: Changed host resources to expire after 60 seconds, leased one, saw it vanish 60 seconds later.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14176
2015-09-28 09:35:14 -07:00
epriestley
3379904237 Allow Drydock leases to expire after a time limit
Summary: Ref T6569. If a lease is activated with an expiration date, schedule a task to try to clean it up after that time.
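A minimal Python sketch of this scheduling idea follows, assuming a simple in-memory task heap. The `ExpiryScheduler` class, the dictionary-based lease, and the `"destroyed"` status are hypothetical; the real work happens in Drydock's PHP task queue.

```python
import heapq
import itertools


class ExpiryScheduler:
    """Hypothetical in-memory stand-in for the task queue."""

    def __init__(self):
        self._tasks = []
        self._seq = itertools.count()  # tie-breaker so leases are never compared

    def activate(self, lease):
        lease["status"] = "active"
        # Only leases activated with an expiration date get a cleanup task.
        if lease.get("until") is not None:
            heapq.heappush(self._tasks, (lease["until"], next(self._seq), lease))

    def run_due(self, now):
        # Run every cleanup task whose scheduled time has passed.
        destroyed = []
        while self._tasks and self._tasks[0][0] <= now:
            _, _, lease = heapq.heappop(self._tasks)
            if lease["status"] == "active":
                lease["status"] = "destroyed"
                destroyed.append(lease)
        return destroyed
```

Passing `now` in explicitly keeps the sketch deterministic; a real worker would read the clock itself.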

Test Plan:
  - Used `bin/drydock lease ... --until ...` to activate a lease in the near future.
  - Waited for a bit.
  - Saw it expire and get destroyed at the scheduled time.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14148
2015-09-23 13:54:27 -07:00
epriestley
1f311d64c6 Give Drydock resources and leases a real "destroy" lifecycle phase
Summary: Ref T9252. Some leases or resources may need to remove data, tear down VMs, etc., during cleanup. After they are released, queue a "destroy" phase for performing teardown.

Test Plan:
  - Used `bin/drydock lease ...` to create a working copy lease.
  - Used `bin/drydock release-lease` and `bin/drydock release-resource` to release the lease and then the working copy and host.
  - Saw working copy and host get destroyed and cleaned up properly.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569, T9252

Differential Revision: https://secure.phabricator.com/D14144
2015-09-23 11:20:20 -07:00
epriestley
789df89c84 Add a command queue to Drydock to manage lease/resource release
Summary:
Ref T9252. Broadly, Drydock currently races on releasing objects from the "active" state. To reproduce this:

  - Scatter some sleep()s pretty much anywhere in the release code.
  - Release several times from web UI or CLI in quick succession.

Resources or leases will execute some release code twice or otherwise do inconsistent things.

(I didn't chase down a detailed reproduction scenario for this since inspection of the code makes it clear that there are no meaningful locks or mechanisms preventing this.)

Instead, add a Harbormaster-style command queue to resources and leases. When something wants to do a release, it adds a command to the queue and schedules a worker. The workers acquire a lock, then try to consume commands from the queue.

This guarantees that only one process is responsible for writes to active resource/leases.
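The queue-plus-lock idea can be sketched in Python as below. This is an illustrative model only: the `Lease` class, `queue_command()`, `update_worker()`, and the `release_count` field are hypothetical names, and a thread lock stands in for whatever lock the workers actually take.

```python
import threading
from collections import deque


class Lease:
    def __init__(self):
        self.status = "active"
        self.commands = deque()       # pending commands, oldest first
        self.lock = threading.Lock()  # stand-in for the worker's lock
        self.release_count = 0        # how many times release code actually ran


def queue_command(lease, command):
    # Callers (web UI, CLI, garbage collector) only enqueue; they never write.
    lease.commands.append(command)


def update_worker(lease):
    # Only the lock holder consumes commands, so there is a single writer.
    with lease.lock:
        while lease.commands:
            command = lease.commands.popleft()
            if command == "release" and lease.status == "active":
                lease.status = "released"
                lease.release_count += 1
```

Queuing "release" several times and running several workers still runs the release code once in this model, because repeated commands find the lease already released.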

This is the last major step to giving resources and leases a single writer during all states:

  - Resource, Unsaved: AllocatorWorker
  - Resource, Pending: ResourceWorker (Possible rename to "Allocated"?)
  - Resource, Open: This diff, ResourceUpdateWorker. (Likely rename to "Active").
  - Resource, Closed/Broken: Future destruction worker. (Likely rename to "Released" / "Broken"; maybe remove "Broken").
  - Resource, Destroyed: No writes.
  - Lease, Unsaved: Whatever wants the lease.
  - Lease, Pending: AllocatorWorker
  - Lease, Acquired: LeaseWorker
  - Lease, Active: This diff, LeaseUpdateWorker.
  - Lease, Released/Broken: Future destruction worker (Maybe remove "Broken"?)
  - Lease, Expired: No writes. (Likely rename to "Destroyed").

In most phases, we can already guarantee that there is a single writer without doing any extra work. This is more complicated in the "Active" case because the release buttons on the web UI, the release tools on the CLI, the lease requestor itself, the garbage collector, and any other release process cleaning up related objects may try to effect a release. All of these could race one another (and, in many cases, race other processes from other phases because all of these get to act immediately) as this code is currently written. Using a queue here lets us make sure there's only a single writer in this phase.

One thing which is notable is that whatever acquires a lease **cannot write to it**! It is never the writer once it queues the lease for activation. It cannot write to any resources, either. And, likewise, Blueprints cannot write to resources while acquiring or releasing leases.

We may need to provide a mechanism so that blueprints and/or resource/lease holders get to attach some storage to resources/leases for bookkeeping. For example, a blueprint might need to keep some kind of cache on a resource to help it manage state. But I think we can cross that bridge when we come to it, and nothing else would need to write to this storage so it's technically straightforward to introduce such a mechanism if we need one.

Test Plan:
  - Viewed buttons in web UI, checked enabled/disabled states.
  - Clicked the buttons.
  - Saw commands show up in the command queue.
  - Saw some daemon stuff get scheduled.
  - Ran CLI tools, saw commands get consumed and resources/leases release.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14143
2015-09-23 07:42:08 -07:00