Summary:
Closes T7829 as wontfix. Closes T7965 as wontfix. Closes T7800 as wontfix. Closes T2731 as wontfix. Closes T1271 as wontfix.
We aren't maintaining this at all (see, e.g., T7829) and a user reported a technically accurate security issue via HackerOne: <https://hackerone.com/reports/222870>
Just throw it away until we get to the eventual Conphernece bot/API update and can do this stuff correctly.
Test Plan: Grepped for `phabricatorbot`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T7965, T7829, T7800, T2731, T1271
Differential Revision: https://secure.phabricator.com/D17756
Under some workloads, the taskmaster may hibernate and launch more rapidly
than it should. Require 15 seconds of inactivity before hibernating. Also
hibernate for longer.
Auditors: chad
The `min()` vs `max()` fix in D17560 meant that the Trigger daemon only
hibernates for 5 seconds, so we do a full GC sweep every 5 seconds. This ends
up eating a fair amount of CPU for no real benefit.
The GC cursors should move to persistent storage, but just bump this default
up in the meantime.
Auditors: chad
Summary: Ref T12298. Like PullLocal daemons, this allows the last daemon in the pool to hibernate if there's no work to be done, and awakens the pool when work arrives.
Test Plan:
- Ran `bin/phd debug task --trace`.
- Saw the pool hibernate and look for tasks.
- Commented on an object.
- Saw the pool wake up and process the queue.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17559
Summary:
Ref T12298. Two minor daemon improvements:
- Make the "waiting" message reflect hibernation.
- Don't trigger a reload right after launching.
Test Plan:
- Read "waiting" message.
- Ran "bin/phd start", didn't see an immediate SIGHUP in the log.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17550
Summary: Ref T12298. This allows the PullLocal daemon to hibernate like the Trigger daemon, but automatically wakes it back up when it needs to do something.
Test Plan:
- Ran `bin/phd debug pulllocal --trace`.
- Saw the daemon hibernate after doing a checkup on repositories.
- Saw periodic queries to look for new update messages.
- After clicking "Update Now" in the web UI to schedule an update, saw the daemon wake up immediately.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17540
Summary: Ref T12331. These changes are intended to make it easier to debug T12331 since I'm having difficulty reproducing the issue locally.
Test Plan:
- Ran `bin/phd debug task --pool 4` and got an autoscaling pool.
- Ran `bin/worker flood --duration 3` and got some 3-second-long tasks to execute with `bin/worker execute ...`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12331
Differential Revision: https://secure.phabricator.com/D17431
Summary:
Ref T12298. The trigger daemon already has routine long-term sleep, and few external events can impact when it should ideally wake up. The relevant events are:
- Someone creates a new Nuance source (ideally, we should wake up right away and start polling it).
- Someone creates a Calendar event about 16 minutes in the future (ideally, we should send them a reminder in about a minute).
- Someone changes GC config to be extremely aggressive (ideally, we should immediately respect the change).
None of these cases are very important. We don't hibernate for more than 3 minutes, so the worst case is that your Nuance source takes 3 minutes to start importing or your Calendar notification comes two minutes too late (13 minutes before the event instead of 15).
This change makes GC sightly more CPU-expensive on average: currently, we do a GC sweep every 4 hours. After this change, we'll end up doing one every 3 minutes, because we lose the fact that we did a sweep recently when the daemon restarts.
We could fix this by keeping track of when the last GC sweep was in the database, instead of in the Daemon process, but the cost of a sweep is normally very small so I don't plan to do this anytime soon.
Test Plan:
- Ran `bin/phd debug trigger`, saw daemon go through 3-minute hibernate + restart cycles.
- Ran `bin/phd debug task`, saw daemon run normally.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17408
Summary: Ref T10390. Simplifies dropdown by rolling out canUseInPanel in useless panels
Test Plan: Add a query panel, see less options.
Reviewers: epriestley
Reviewed By: epriestley
Subscribers: Korvin
Maniphest Tasks: T10390
Differential Revision: https://secure.phabricator.com/D17341
Summary:
I frequently run into a situation where I want to kill tasks that have accumulated a lot of failures regardless of what class they are. Or I'll want to kill every worker of a certain class but only if it has failed at least once. This change allows me to run `./bin/worker cancel --class <MYCLASS> --min-failure-count 5` to only kill tasks with at least 5 failed attempts.
The `--min-failure-count N` argument can be used by itself as well as with `--class CLASSNAME`. I don't think it makes sense for it to work with `--id ID`, but I'm not dead set on that or anything.
Test Plan: I ran the worker management workflow with and without the `--min-failure-count` argument and it worked as expected.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin, epriestley, yelirekim
Differential Revision: https://secure.phabricator.com/D16906
Summary:
This has been replaced by `PolicyCodex` after D16830. Also:
- Rebuild Celerity map to fix grumpy unit test.
- Fix one issue on the policy exception workflow to accommodate the new code.
Test Plan:
- `arc unit --everything`
- Viewed policy explanations.
- Viewed policy errors.
Reviewers: chad
Reviewed By: chad
Subscribers: hach-que, PHID-OPKG-gm6ozazyms6q6i22gyam
Differential Revision: https://secure.phabricator.com/D16831
Summary: Ref T5267. Fix one minor bug (paths were not being resolved properly) and one minor string issue (missing `%d` in a string).
Test Plan: Extracted strings, got a cleaner result.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T5267
Differential Revision: https://secure.phabricator.com/D16808
Summary: Ref T7931. This is still quite rough, but should technically send vaguely-useful email as part of the standard trigger infrastructure.
Test Plan: Ran `bin/phd start`, created an event shortly, saw reminder email send in `bin/mail list-outbound`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T7931
Differential Revision: https://secure.phabricator.com/D16784
Summary:
Fixes T11771. Adds a lock around each GC process so we don't try to, e.g., delete old files on two machines at once just because they're both running trigger daemons.
The other aspects of this daemon (actual triggers; nuance importers) already have separate locks.
Test Plan: Ran `bin/phd debug trigger --trace`, saw daemon acquire locks and collect garbage.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11771
Differential Revision: https://secure.phabricator.com/D16739
Summary:
Fixes T11677. This makes two minor adjustments to the repository import daemons:
- The first step ("Message") now queues at a slightly-lower-than-default (for already-imported repositories) or very-low (for newly importing repositories) priority level.
- The other steps now queue at "default" priority level. This is actually what they already did, but without this change their behavior would be to inherit the priority level of their parents.
This has two effects:
- When adding new repositories to an existing install, they shouldn't block other things from happening anymore.
- The daemons will tend to start one commit and run through all of its steps before starting another commit. This makes progress through the queue more even and predictable.
- Before, they did ALL the message tasks, then ALL the change tasks, etc. This works fine but is confusing/uneven/less-predictable because each type of task takes a different amount of time.
Test Plan:
- Added a new repository.
- Saw all of its "message" steps queue at priority 4000.
- Saw followups queue at priority 2000.
- Saw progress generally "finish what you started" -- go through the queue one commit at a time, instead of one type of task at a time.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11677
Differential Revision: https://secure.phabricator.com/D16585
Summary: Previously, the chatbot docs instructed users to get certificates for the conduit API and put the cert in a `conduit.cert` config key. In order to get the chatbot to work, I needed to instead get an API key and put it in the `conduit.token` config entry.
Test Plan: Doc fix. Tried the new documented way and it worked.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: Korvin, epriestley
Differential Revision: https://secure.phabricator.com/D16443
Summary: Fixes T11508. The config entry `remarkup.ignored-object-names` already contains a blacklist of object names that should be ignored in the web UI. This change makes that blacklist also apply to the chatbot. This makes it possible to have a chatbot ignore things like V1, V2, Q1 and any other phrases the user may not want to generate links to objects.
Test Plan: Create objects (tasks, slowvotes, etc.) then mention the object names in chat (with the bot running). The bot should respond with helpful links to the given objects. Then add the object names to the blacklist through the config web UI. This apparently triggers the bot to restart itself. Then mention the object names in chat again. The bot should no longer respond with links because those object names have been added to the blacklist regex.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: epriestley
Maniphest Tasks: T11508
Differential Revision: https://secure.phabricator.com/D16442
Summary:
Fixes T11490. Currently, this query can not use a key and the table size may be quite large.
Adjust the query so it can use a key for both selection and ordering, and add that key.
Test Plan: Ran `EXPLAIN` on the old query in production, then added the key and ran `EXPLAIN` on the new query. Saw key in use, and "rows" examined drop from 29,273 to 15.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11490
Differential Revision: https://secure.phabricator.com/D16423
Summary:
The cluster synchronization code runs either actively (before returning a response to `git clone`, for example) or passively (routinely, as the daemons update reposiories).
The active sync runs as the web user (if running `git clone http://...`) or the VCS user (if running `git clone ssh://...`). But the passive sync runs as the daemon user.
All of these sync processes need to run actual commands as the daemon user (`git fetch ...`).
For the active ones, we must `sudo`.
For the passive ones, we're already the right user. We run the same code, and end up trying to sudo to ourselves, which `sudo` isn't happy about by default.
Depending on how `sudo` is configured and which users things are running as this might work anyway, but it's silly and if it doesn't work it requires you to go make non-obvious, weird config changes that are unintuitive and somewhat nonsensical. This is probably worse on the balance than adding a bit of complexity to the code.
Instead, test which user we're running as. If it's already the right user, don't sudo.
Test Plan:
- Ran `bin/repository update --trace` as daemon user, saw no more `sudo`.
- Ran a `git clone` to make sure that didn't break.
Reviewers: chad, avivey
Reviewed By: avivey
Differential Revision: https://secure.phabricator.com/D16391
Summary:
Ref T11458. Depends on D16388. Currently, we're very aggressive about closing connections in the taskmaster daemons.
This can end up taking up a lot of resources. In particular, because the outgoing port for outbound connections normally can not be reused for 60 seconds after a connection closes, we may exhaust outbound ports on the host if there's a big queue full of stuff that's being processed very quickly.
At a minimum, we //always// are holding open a `worker` connection, which we always need again right away. So even in the best case we end up opening/closing this about once per second and each daemon takes up about ~60 outbound ports when it should take up ~1.
So, make two adjustments:
- First, only close connections which we haven't issued a query on in the last 60 seconds. This should prevent us from closing connections that we'll need again immediately in most cases. In the worst case, we shouldn't be eating up any extra ports under default TCP behavior.
- Second, explicitly close connections. We were relying on implicit/GC behavior (maybe as a holdover from very long ago, before we got connection wrappers in place?), which probably did about the same thing but isn't as predictable and can't be profiled or instrumented.
Test Plan:
This is somewhat difficult to test completely convincingly in isolation since the problem behavior depends on production scales and the workload, and to some degree on configuration.
I tested that this stuff baiscally works by adding logging to connect/close and running the daemons, verifying that they churned connections a lot before this change (e.g., ~1/s even at no load) and churn rarely afterward (e.g., almost never at no load).
I ran some workload through them to make sure I didn't completely break anything.
The best real test is just seeing how production responds. Current inbound/outbound connections on `secure001` are 1,200:
```
secure001 $ netstat -t | grep :mysql | wc -l
1164
```
Current outbound from `repo001` are 18,600:
```
repo001 $ netstat -t | grep :mysql | wc -l
18663
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11458
Differential Revision: https://secure.phabricator.com/D16389
Summary:
Ref T11309. In that task, a user misunderstood two parts of this error:
- They took "exception" to mean "unexpected failure", when it was intended to mean "rare circumstance".
- They intereted the internal ID number of a commit to mean that Phabricator was malfunctioning.
Make the language of this condition more direct, explaining what the situation means in greater detail.
Additionally, we would previously re-throw this exception, which would make the daemon exit, wait a moment, and restart. This was normal and expected.
When //unexpected// failures occur, it's important do to this: it prevents a daemon failing in a loop from causing too many side effects (e.g., limit of 1 email per 5 seconds instead of thousands per second).
When expected, permanent failures occur, we do not need to do this: the task will not be retried. I just did it because it was slightly more consistent ("failures restart daemons") and we had few permanent failure types at the time.
We have more now, and restarting the daemons generates some additional logs which have the potential to confuse. Cycling the daemon also (intentionally) reduces the rate at which we process tasks, which can be bad for permanent failures like "deleted commit" because users can delete a huge number of commits and possibly clog up the queue with cycle-after-failure actions.
Test Plan:
Tried to process a deleted commit, saw a new message:
```
2016-07-11 9:30:22 AM [STDE] <VERB> PhabricatorTaskmasterDaemon Task 1428658 was cancelled: Commit "R55:6c46b7d0fb82a859ca3f87a95dc8dcceef8088c9" (with internal ID "282161") is no longer reachable from any branch, tag, or ref in this repository, so it will not be imported. This usually means that the branch the commit was on was deleted or overwritten.
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11309
Differential Revision: https://secure.phabricator.com/D16268
Summary:
Ref T3554. Makes `bin/worker cancel --class <classname>` work (cancel all tasks with that type).
This is useful in development if your queue is full of a bunch of gunk, and a need has occasionally arisen in production environments (usually "one option is cancel everything and move on").
Test Plan: Ran `bin/worker cancel` to cancel blocks of tasks by class name.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T3554
Differential Revision: https://secure.phabricator.com/D16267
Summary:
Trigger daemon is trying to find the next event to invoke before sleeping, but the query includes already-elapsed triggers.
It then tries to sleep for 0 seconds.
Test Plan:
On a new instance, schedule a single trigger of type `PhabricatorOneTimeTriggerClock` to a very near time.
Use top to see trigger daemon not going to 100% CPU once the event has elapsed.
Reviewers: #blessed_reviewers, epriestley
Subscribers: Korvin
Differential Revision: https://secure.phabricator.com/D15750
Summary:
Ref T10537. For Nuance, I want to introduce new sources (like "GitHub" or "GitHub via Nuance" or something) but this needs to modularize eventually.
Split ContentSource apart so applications can add new content sources.
Test Plan:
This change has huge surface area, so I'll hold it until post-release. I think it's fairly safe (and if it does break anything, the breaks should be fatals, not anything subtle or difficult to fix), there's just no reason not to hold it for a few hours.
- Viewed new module page.
- Grepped for all removed functions/constants.
- Viewed some transactions.
- Hovered over timestamps to get content source details.
- Added a comment via Conduit.
- Added a comment via web.
- Ran `bin/storage upgrade --namespace XXXXX --no-quickstart -f` to re-run all historic migrations.
- Generated some objects with `bin/lipsum`.
- Ran a bulk job on some tasks.
- Ran unit tests.
{F1190182}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15521
Summary: Ref T10537. Currently, when you have at least two cursors, the daemon can poll too frequently when processing the last source because it never hits the end-of-list condition.
Test Plan:
- Ran `bin/phd debug trigger`.
- Observed huge volumes of output before change as triggers fired as fast as possible.
- Observed reasonable poll frequency after change.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15464
Summary:
Ref T10537. More infrastructure:
- Put a `bin/nuance` in place with `bin/nuance import`. This has no useful behavior yet.
- Allow sources to be searched by substring. This supports `bin/nuance import --source whatever` so you don't have to dig up PHIDs.
Test Plan:
- Applied migrations.
- Ran `bin/nuance import --source ...` (no meaningful effect, but works fine).
- Searched for sources by substring in the UI.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15436
Summary:
Ref T10537. Some sources (like the future "GitHub Repository" source) need to poll remotes.
- Provide a mechanism for sources to emit import cursors.
- Hook them into the trigger daemon so they'll fire periodically.
- Provide some storage.
This diff does nothing useful or interesting, and is pure infrastructure.
Test Plan:
- Ran `bin/storage upgrade -f`, no adjustment issues.
- Poked around Nuance.
- Ran the trigger daemon, verified it didn't crash and checked for Nuance stuff to do.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15435
Summary:
Every caller returns `true`. This was added a long time ago for Projects, but projects are no longer subscribable.
I don't anticipate needing this in the future.
Test Plan: Grepped for this method.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D15409
Summary:
Ref T6183. Ref T10054. Historically, only members could watch projects because there were some weird special cases with policies. These policy issues have been resolved and Herald is generally powerful enough to do equivalent watches on most objects anyway.
Also puts a "Watch Project" button on the feed panel to make the behavior and meaning more obvious.
Test Plan:
- Watched a project I was not a member of.
- Clicked the feed watch/unwatch button.
{F1064909}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T6183, T10054
Differential Revision: https://secure.phabricator.com/D15063
Summary: See Q266.
Test Plan: Created a bulk job, clicked "Details" instead of "Confirm", clicked "Continue" to get back to confirmation dialog.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D14985
Summary: Ref T9979. Convert all DestructionEngine behaviors to extensions.
Test Plan:
{F1033244}
Destroyed an object, verifying:
- Herald transcripts were destroyed;
- edges were destroyed;
- flags were destroyed;
- tokens were destroyed;
- transactions were destroyed;
- worker tasks were cancelled.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14832
Summary:
Ref T9994. This fixes the first issue discussed on that task, which is that when a merge fails after "arc land", we would not clean up all the leases properly.
Specifically, when a merge fails, we use `queueTask()` to schedule a followup task. This followup destroys the lease and frees the underlying resource.
However, the default behavior of `queueTask()` is to //not queue tasks// if the parent task fails. This is a reasonable, safe behavior that was originally introduced in D8774, where it kept us from sending too much mail if a task did "send some mail" and then failed a little later on and got retried.
Since I think the default behavior is correct, I just special cased the behavior for Drydock to make it queue even on failure. These are the only types of followup tasks we currently want to queue on main task failure.
(It's possible that future Blueprints might want some kind of more specialized behavior, where some tasks queue only on success, but we can cross that bridge when we come to it.)
Test Plan:
- See T9994#149878 for test case setup.
- I ran that test case again with this patch, and saw the followup task queue properly in the `--trace` log, a correspoinding update task show up in `/daemon/`, and the lease get destroyed when I ran it a moment later.
{F1029915}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9994
Differential Revision: https://secure.phabricator.com/D14818
Summary: This logic is flipped.
Test Plan:
- Before change: ran `bin/phd debug task`, saw queries to the config table every second.
- After change: ran `bin/phd debug task`, saw queries to the config table every 10 seconds.
Reviewers: chad, joshuaspence
Reviewed By: chad, joshuaspence
Differential Revision: https://secure.phabricator.com/D14542
Summary:
Fixes T9808.
An instance imported a very large repository, generating approximately 4 million tasks over the course of a few days. A week later, these tasks started expiring and became candidates for garbage collection.
The GC works by deleting 100 rows at at time over and over again. It finds the rows it's going to delete by querying for old rows.
Currently, this query generates a `WHERE dateCreated < X ORDER BY id DESC` query. This query can not efficiently execute using a single key, because it relies on `dateCreated` order to find the rows, then on `id` order to sort them. With a table with 4M rows, this is slow.
This would still be OK, except that the query has to execute a lot of times since it only deletes 100 rows each time. Particularly, it needs to execute a total of ~40K times.
Instead, generate `WHERE dateCreated < X ORDER BY dateCreated DESC, id DESC`. This should have the same effect in general and the GC definitely doesn't care about the difference, but it should be more efficient at large scales.
Test Plan:
I had to `TRUNCATE` the problem table so I don't have a perfect repro to completely convincingly test this anymore. Both queries behave fine at small scales, which is why we haven't seen this before.
I was able to run the newer query in production before I nuked the table and have it complete in a reasonable amount of time, while the old query hung longer than I wanted to wait (several minutes?). The query plan for the new query was also a good one, while the query plan for the old query was terrible.
I loaded the daemon console and ran `bin/garbage collect --collector worker.tasks --trace`. I verified the queries looked reasonable and produced reasonable results in production.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9808
Differential Revision: https://secure.phabricator.com/D14505
Summary: Fixes T7053. Depends on D14452.
Test Plan:
Created a custom daemon which dumps out the config hash (by querying `PhabricatorEnv::calculateEnvironmentHash()`). Ran this daemon with `./bin/phd debug PhabricatorDebugDaemon` and saw the config hash update within 30 seconds.
{P1886}
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin
Maniphest Tasks: T7053
Differential Revision: https://secure.phabricator.com/D14458
Summary:
Ref T9252. Currently, `queueTask()` accepts `$priority` as its third argument. Allow it to take a full range of `$options` instead. This API just never got updated after we expanded avialable options.
Arguably this whole API should be some kind of "TaskQueueRequest" object but I'll leave that for another day.
Test Plan:
- Grepped for `queueTask()` and verified no other callsites are affected by this API change.
- Ran some daemons.
- See also next diff.
Reviewers: hach-que, chad
Reviewed By: hach-que, chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14235
Summary:
Ref T9252. Currently, Harbormaster and Drydock work like this in some cases:
# Queue a lease for activation.
# Then, a little later, save the lease PHID somewhere.
# When the target/resource is destroyed, destroy the lease.
However, something can happen between (1) and (2). In Drydock this window is very short and the "something" would have to be a lighting strike or something similar, but in Harbormaster we wait until the resource activates to do (2) so the window can be many minutes long. In particular, a user can use "Abort Build" during those many minutes.
If they do, the target is destroyed but it doesn't yet have a record of the artifact, so the artifact isn't cleaned up.
Make these things work like this instead:
# Create a new lease and pre-generate a PHID for it.
# Save that PHID as something that needs to be cleaned up.
# Queue the lease for activation.
# When the target/resource is destroyed, destroy the lease if it exists.
This makes sure there's no step in the process where we might lose track of a lease/resource.
Also, clean up and standardize some other stuff I hit.
Test Plan:
- Stopped daemons.
- Restarted a build in Harbormaster.
- Stepped through the build one stage at a time using `bin/worker execute ...`.
- After the lease was queued, but before it activated, aborted the build.
- Processed the Harbormaster side of things only.
- Saw the lease get destroyed properly.
Reviewers: chad, hach-que
Reviewed By: hach-que
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14234
Summary:
Fixes T9494. This:
- Removes all the random GC.x.y.z config.
- Puts it all in one place that's locked and which you use `bin/garbage set-policy ...` to adjust.
- Makes every TTL-based GC configurable.
- Simplifies the code in the actual GCs.
Test Plan:
- Ran `bin/garbage collect` to collect some garbage, until it stopped collecting.
- Ran `bin/garbage set-policy ...` to shorten policy. Saw change in web UI. Ran `bin/garbage collect` again and saw it collect more garbage.
- Set policy to indefinite and saw it not collect garabge.
- Set policy to default and saw it reflected in web UI / `collect`.
- Ran `bin/phd debug trigger` and saw all GCs fire with reasonable looking queries.
- Read new docs.
{F857928}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9494
Differential Revision: https://secure.phabricator.com/D14219
Summary:
Ref T9252. This primarily allows Harbormaster to request (and Drydock to fulfill) working copies with a patch from a staging area. Doing this means we can do builds on in-review changes from `arc diff`.
This is a little cobbled-together but should basically work.
Also fix some other issues:
- Yielded, awakend workers are fine to update but could complain.
- We can't log slot lock failures to resources if we don't end up saving them.
- Killing the transaction would wipe out the log.
- Fix some TODOs, etc.
Test Plan: Ran Harbormaster builds on a local revision.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14214
Summary:
Ref T9252. This is the same as D14201, but for lease stuff instead of resource stuff.
This one is a little heavier but still feels pretty reasonable to me at the end of the day (worker is <1K lines and has a ton of comment stuff).
Also fixes a few random bugs I hit in the task queue.
Test Plan:
- Restarted some Harbormaster builds, saw them go through cleanly.
- Released pre-activation resources/leases.
- Probably still kinda buggy but I'll iron the details out over time.
Logs are starting to look somewhat plausible:
{F855747}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14202
Summary:
Ref T9123. To run upstream builds in Harbormaster/Drydock, we need to be able to check out `libphutil`, `arcanist` and `phabricator` next to one another.
This adds an "Also Clone: ..." field to Harbormaster working copy build steps so I can type all three repos into it and get a proper clone with everything we need.
This is somewhat upstream-centric and a bit narrow, but I don't think it's totally unreasonable, and most of the underlying stuff is relatively general.
This adds some more typechecking and improves data/type handling for custom fields, too. In particular, it prevents users from entering an invalid/restricted value in a field (for example, you can't "Also Clone" a repository you don't have permission to see).
Test Plan: Restarted build, got a Drydock resource with multiple repositories in it.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9123
Differential Revision: https://secure.phabricator.com/D14183
Summary:
Ref T9252. Currently, Harbormaster does this when trying to acquire a working copy:
- Ask for a working copy.
- Yield for 15 seconds.
- Check if we have a working copy yet.
That's OK, but Drydock takes ~1s to acquire a working copy lease if a resource is already available, so we end up doing this:
- T+0: Ask for a working copy.
- T+0: Yield for 15 seconds.
- T+1: Working copy lease activates.
- T+15: Working copy lease is used.
- T+16: Build finishes.
So we end up spending about 2 seconds doing work and 14 seconds sleeping.
One way to fix this would be to fiddle with the yield duration, so we yield for 1, 2, 4, ... seconds or something. This probably isn't a bad idea for longer leases (i.e., wait for 15, 30, 45 ... seconds or similar) but it implies a lot of churn for short leases.
Instead, let tasks "awaken" other tasks when they complete. The "awaken" operation means: if a task is in a yielded state (no failures, no owner, explicitly yielded, future expires time), pretend it only yielded until right now instead of whenever it really yielded to.
Basically, this rewrites history so that even though Harbormaster did a `yield(15)`, we pretend it did a `yield(4)` after we activate the lease if lease activation took 4 seconds.
If this misses, it's fine: we fall back to the normal yield behavior and things move forward normally a few seconds later.
If it hits, we get a more nimble process pretty cleanly.
Test Plan:
- Restarted a build plan (lease working copy + run `ls`) with this patch no-op'd, took about 16 seconds.
- Restarted a build plan with this patch active, took about 1 second.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14178
Summary:
Fixes T6569. This implements an expiry mechanism for Drydock resources which parallels the mechanism for leases.
A few things are missing that we'll probably need in the future:
- An "EXPIRES" command to update the expiration time. This would let resources be permanent while leased, then expire after, say, 24 hours without any leases.
- A callback like `shouldActuallyExpireRightNow()` for resources and leases that lets them decide not to expire at the last second.
- A callback like `didAcquireLease()` for resource blueprints, to parallel `didReleaseLease()`, letting them clear or extend their timer.
However, this stuff would mostly just let us tune behaviors, not really open up new capabilities.
Test Plan: Changed host resources to expire after 60 seconds, leased one, saw it vanish 60 seconds later.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T6569
Differential Revision: https://secure.phabricator.com/D14176
Summary: Ref T6569. If a lease is activated with an expiration date, schedule a task to try to clean it up after that time.
Test Plan:
- Used `bin/drydock lease ... --until ...` to activate a lease in the near future.
- Waited for a bit.
- Saw it expire and get destroyed at the scheduled time.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T6569
Differential Revision: https://secure.phabricator.com/D14148
Summary:
Ref T9252. This simplifies some Drydock code.
Most of this code relates to the old notion of Drydock being able to enumerate all the tasks it needs to complete in order to acquire a lease. The code has stepped back from this, since it's unnecessary, the queue is more powerful than it used to be, and it would be a lot of work to keep track of.
The ~only thing that should ever wait for leases in modern code is `bin/drydock lease`, and it's fine for it to just sit there sleeping, so this just does that.
This reduces the granularity of logging, but I'll address that separately in future logging-focused changes.
Test Plan: Used `bin/drydock lease` to acquire a lease, saw it acquire cleanly.
Reviewers: hach-que, chad
Reviewed By: chad
Maniphest Tasks: T9252
Differential Revision: https://secure.phabricator.com/D14147