mirror of https://we.phorge.it/source/phorge.git synced 2024-12-16 10:30:56 +01:00

312 commits

Author SHA1 Message Date
epriestley
5c1e4488de Remove all "Phabricator Bot" code
Summary:
Closes T7829 as wontfix. Closes T7965 as wontfix. Closes T7800 as wontfix. Closes T2731 as wontfix. Closes T1271 as wontfix.

We aren't maintaining this at all (see, e.g., T7829) and a user reported a technically accurate security issue via HackerOne: <https://hackerone.com/reports/222870>

Just throw it away until we get to the eventual Conpherence bot/API update and can do this stuff correctly.

Test Plan: Grepped for `phabricatorbot`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T7965, T7829, T7800, T2731, T1271

Differential Revision: https://secure.phabricator.com/D17756
2017-04-21 12:48:35 -07:00
epriestley
a41d158490 Only hibernate the Taskmaster after 15 seconds of inactivity
Under some workloads, the taskmaster may hibernate and launch more rapidly
than it should. Require 15 seconds of inactivity before hibernating. Also
hibernate for longer.
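
A minimal sketch of the idle-threshold idea, written in Python rather than taken from the daemon code. The `IdleTracker` class, its method names, and the injectable clock are hypothetical; only the 15-second threshold comes from the message above.

```python
import time

# Threshold from the commit message; everything else here is illustrative.
IDLE_SECONDS_BEFORE_HIBERNATE = 15


class IdleTracker:
    """Hypothetical helper that remembers when work was last seen."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_work = clock()

    def saw_work(self):
        # Call whenever a task is processed; this resets the idle window.
        self._last_work = self._clock()

    def should_hibernate(self):
        # Hibernate only after a full idle window has elapsed.
        idle = self._clock() - self._last_work
        return idle >= IDLE_SECONDS_BEFORE_HIBERNATE
```

Injecting the clock keeps the check testable without real sleeps.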

Auditors: chad
2017-03-25 05:01:32 -07:00
epriestley
2cda280cde Make the default Trigger hibernation 3 minutes instead of 5 seconds
The `min()` vs `max()` fix in D17560 meant that the Trigger daemon only
hibernates for 5 seconds, so we do a full GC sweep every 5 seconds. This ends
up eating a fair amount of CPU for no real benefit.

The GC cursors should move to persistent storage, but just bump this default
up in the meantime.

Auditors: chad
2017-03-25 04:14:32 -07:00
epriestley
8b553d2f18 Allow taskmaster daemons to hibernate
Summary: Ref T12298. Like PullLocal daemons, this allows the last daemon in the pool to hibernate if there's no work to be done, and awakens the pool when work arrives.

Test Plan:
  - Ran `bin/phd debug task --trace`.
  - Saw the pool hibernate and look for tasks.
  - Commented on an object.
  - Saw the pool wake up and process the queue.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12298

Differential Revision: https://secure.phabricator.com/D17559
2017-03-24 13:51:37 -07:00
epriestley
f13637627d Improve daemon "waiting" message, config reload behavior
Summary:
Ref T12298. Two minor daemon improvements:

  - Make the "waiting" message reflect hibernation.
  - Don't trigger a reload right after launching.

Test Plan:
- Read "waiting" message.
- Ran "bin/phd start", didn't see an immediate SIGHUP in the log.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12298

Differential Revision: https://secure.phabricator.com/D17550
2017-03-24 08:32:08 -07:00
epriestley
9099485a71 Allow the PullLocal daemon to hibernate, and wake it when repositories need an update
Summary: Ref T12298. This allows the PullLocal daemon to hibernate like the Trigger daemon, but automatically wakes it back up when it needs to do something.

Test Plan:
  - Ran `bin/phd debug pulllocal --trace`.
  - Saw the daemon hibernate after doing a checkup on repositories.
  - Saw periodic queries to look for new update messages.
  - After clicking "Update Now" in the web UI to schedule an update, saw the daemon wake up immediately.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12298

Differential Revision: https://secure.phabricator.com/D17540
2017-03-23 10:52:28 -07:00
epriestley
90ec21f999 Add "--pool" and "--duration" flags to daemon CLI tools
Summary: Ref T12331. These changes are intended to make it easier to debug T12331 since I'm having difficulty reproducing the issue locally.

Test Plan:
  - Ran `bin/phd debug task --pool 4` and got an autoscaling pool.
  - Ran `bin/worker flood --duration 3` and got some 3-second-long tasks to execute with `bin/worker execute ...`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12331

Differential Revision: https://secure.phabricator.com/D17431
2017-02-28 07:43:46 -08:00
epriestley
40cc403d23 Allow the Trigger daemon to hibernate, reducing processes to 0
Summary:
Ref T12298. The trigger daemon already has routine long-term sleep, and few external events can impact when it should ideally wake up. The relevant events are:

  - Someone creates a new Nuance source (ideally, we should wake up right away and start polling it).
  - Someone creates a Calendar event about 16 minutes in the future (ideally, we should send them a reminder in about a minute).
  - Someone changes GC config to be extremely aggressive (ideally, we should immediately respect the change).

None of these cases are very important. We don't hibernate for more than 3 minutes, so the worst case is that your Nuance source takes 3 minutes to start importing or your Calendar notification comes two minutes too late (13 minutes before the event instead of 15).

This change makes GC slightly more CPU-expensive on average: currently, we do a GC sweep every 4 hours. After this change, we'll end up doing one every 3 minutes, because we lose the fact that we did a sweep recently when the daemon restarts.

We could fix this by keeping track of when the last GC sweep was in the database, instead of in the Daemon process, but the cost of a sweep is normally very small so I don't plan to do this anytime soon.
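
For illustration only, the deferred fix could look roughly like this Python sketch. The `store` mapping stands in for a database table, and the key name `gc.last_sweep` and the function name are assumptions; the 4-hour interval is the one mentioned above.

```python
import time

SWEEP_INTERVAL = 4 * 60 * 60  # seconds; "every 4 hours" per the message


def sweep_if_due(store, collect, now=None):
    """Run collect() unless a sweep was recorded within the interval.

    `store` is any dict-like object that outlives the daemon process
    (a stand-in for persistent storage; hypothetical).
    """
    now = time.time() if now is None else now
    last = store.get("gc.last_sweep")
    if last is not None and now - last < SWEEP_INTERVAL:
        return False  # a recent sweep is on record, even across restarts
    collect()
    store["gc.last_sweep"] = now
    return True
```

Because the timestamp lives outside the process, a daemon that restarts every 3 minutes would still sweep only once per interval.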

Test Plan:
  - Ran `bin/phd debug trigger`, saw daemon go through 3-minute hibernate + restart cycles.
  - Ran `bin/phd debug task`, saw daemon run normally.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12298

Differential Revision: https://secure.phabricator.com/D17408
2017-02-24 10:54:05 -08:00
Chad Little
bf44210dc8 Reduce application search engine results list for Dashboards
Summary: Ref T10390. Simplifies dropdown by rolling out canUseInPanel in useless panels

Test Plan: Add a query panel, see fewer options.

Reviewers: epriestley

Reviewed By: epriestley

Subscribers: Korvin

Maniphest Tasks: T10390

Differential Revision: https://secure.phabricator.com/D17341
2017-02-22 12:42:43 -08:00
Jakub Vrana
a778151f28 Fix errors found by PHPStan
Test Plan: Ran `phpstan analyze -a autoload.php phabricator/src`.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: Korvin, hach-que

Differential Revision: https://secure.phabricator.com/D17371
2017-02-17 10:10:15 +00:00
Josh Cox
ac66522c2e Add a flag to ./bin/worker to select tasks based on their failureCount
Summary:
I frequently run into a situation where I want to kill tasks that have accumulated a lot of failures regardless of what class they are. Or I'll want to kill every worker of a certain class but only if it has failed at least once. This change allows me to run `./bin/worker cancel --class <MYCLASS> --min-failure-count 5` to only kill tasks with at least 5 failed attempts.

The `--min-failure-count N` argument can be used by itself as well as with `--class CLASSNAME`. I don't think it makes sense for it to work with `--id ID`, but I'm not dead set on that or anything.
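
The selection rule described above can be sketched in Python as follows. The `Task` record and `select_tasks` function are hypothetical illustrations, not the workflow's actual code; they only show how the two filters combine.

```python
from dataclasses import dataclass


@dataclass
class Task:
    id: int
    task_class: str
    failure_count: int


def select_tasks(tasks, task_class=None, min_failure_count=None):
    """Return tasks matching every filter that was supplied."""
    selected = []
    for task in tasks:
        if task_class is not None and task.task_class != task_class:
            continue
        if (min_failure_count is not None
                and task.failure_count < min_failure_count):
            continue
        selected.append(task)
    return selected
```

Each filter is optional, so the failure-count filter works alone or together with the class filter.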

Test Plan: I ran the worker management workflow with and without the `--min-failure-count` argument and it worked as expected.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: Korvin, epriestley, yelirekim

Differential Revision: https://secure.phabricator.com/D16906
2016-10-12 09:49:29 -04:00
epriestley
706c21375e Remove empty implementations of describeAutomaticCapabilities()
Summary:
This has been replaced by `PolicyCodex` after D16830. Also:

  - Rebuild Celerity map to fix grumpy unit test.
  - Fix one issue on the policy exception workflow to accommodate the new code.

Test Plan:
  - `arc unit --everything`
  - Viewed policy explanations.
  - Viewed policy errors.

Reviewers: chad

Reviewed By: chad

Subscribers: hach-que, PHID-OPKG-gm6ozazyms6q6i22gyam

Differential Revision: https://secure.phabricator.com/D16831
2016-11-09 15:24:22 -08:00
epriestley
960c0be689 Fix some issues with Phabricator i18n string extraction
Summary: Ref T5267. Fix one minor bug (paths were not being resolved properly) and one minor string issue (missing `%d` in a string).

Test Plan: Extracted strings, got a cleaner result.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T5267

Differential Revision: https://secure.phabricator.com/D16808
2016-11-06 11:12:45 -08:00
epriestley
6b16f930c4 Automatically send (not-so-great) email notifications for upcoming events
Summary: Ref T7931. This is still quite rough, but should technically send vaguely-useful email as part of the standard trigger infrastructure.

Test Plan: Ran `bin/phd start`, created an event shortly, saw reminder email send in `bin/mail list-outbound`.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T7931

Differential Revision: https://secure.phabricator.com/D16784
2016-11-01 13:24:40 -07:00
epriestley
7678f412be Hold a lock while collecting garbage
Summary:
Fixes T11771. Adds a lock around each GC process so we don't try to, e.g., delete old files on two machines at once just because they're both running trigger daemons.

The other aspects of this daemon (actual triggers; nuance importers) already have separate locks.
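
As a rough sketch of the pattern, here is a per-collector lock in Python. An in-process `threading.Lock` stands in for whatever cross-host lock the daemon really uses, and skipping (rather than waiting) when the lock is held is an assumption, as are all names.

```python
import threading
from collections import defaultdict

# One lock per collector name. In-process stand-in for a global lock.
_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def collect_with_lock(collector_name, collect):
    """Run collect() only if nobody else holds this collector's lock."""
    with _registry_guard:
        lock = _locks[collector_name]
    if not lock.acquire(blocking=False):
        return False  # another holder is already collecting; skip
    try:
        collect()
        return True
    finally:
        lock.release()
```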

Test Plan: Ran `bin/phd debug trigger --trace`, saw daemon acquire locks and collect garbage.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11771

Differential Revision: https://secure.phabricator.com/D16739
2016-10-20 13:40:00 -07:00
epriestley
db2425b300 Do initial repository imports at a lower priority and finish importing commits before starting new ones
Summary:
Fixes T11677. This makes two minor adjustments to the repository import daemons:

  - The first step ("Message") now queues at a slightly-lower-than-default (for already-imported repositories) or very-low (for newly importing repositories) priority level.
  - The other steps now queue at "default" priority level. This is actually what they already did, but without this change their behavior would be to inherit the priority level of their parents.

This has two effects:

  - When adding new repositories to an existing install, they shouldn't block other things from happening anymore.
  - The daemons will tend to start one commit and run through all of its steps before starting another commit. This makes progress through the queue more even and predictable.
    - Before, they did ALL the message tasks, then ALL the change tasks, etc. This works fine but is confusing/uneven/less-predictable because each type of task takes a different amount of time.
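
The "finish what you started" effect can be reproduced with a small priority queue. This Python sketch assumes that smaller priority numbers run first; the values 4000 and 2000 come from the test plan below, while the "publish" step and the function name are hypothetical.

```python
import heapq
import itertools

PRIORITY_DEFAULT = 2000  # followup steps
PRIORITY_IMPORT = 4000   # "message" steps for a newly importing repository

STEPS = ["message", "change", "publish"]  # "publish" is illustrative


def run_import(commits):
    """Return the order in which (commit, step) pairs execute."""
    counter = itertools.count()  # tie-breaker: FIFO within one priority
    queue = []
    for commit in commits:
        heapq.heappush(queue, (PRIORITY_IMPORT, next(counter), commit, 0))

    order = []
    while queue:
        _, _, commit, step = heapq.heappop(queue)
        order.append((commit, STEPS[step]))
        if step + 1 < len(STEPS):
            # Followups queue at default priority, so they run before
            # any remaining low-priority "message" tasks.
            heapq.heappush(
                queue, (PRIORITY_DEFAULT, next(counter), commit, step + 1))
    return order
```

If followups instead inherited their parent's priority, the same queue would drain one task type at a time.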

Test Plan:
  - Added a new repository.
  - Saw all of its "message" steps queue at priority 4000.
  - Saw followups queue at priority 2000.
  - Saw progress generally "finish what you started" -- go through the queue one commit at a time, instead of one type of task at a time.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11677

Differential Revision: https://secure.phabricator.com/D16585
2016-09-21 16:41:01 -07:00
Josh Cox
8cdf1a890a Updated the docs so chatbots can use the Conduit API
Summary: Previously, the chatbot docs instructed users to get certificates for the conduit API and put the cert in a `conduit.cert` config key. In order to get the chatbot to work, I needed to instead get an API key and put it in the `conduit.token` config entry.

Test Plan: Doc fix. Tried the new documented way and it worked.

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: Korvin, epriestley

Differential Revision: https://secure.phabricator.com/D16443
2016-08-24 19:05:30 -04:00
Josh Cox
605210bc95 Make the chatbot obey the object name blacklist
Summary: Fixes T11508. The config entry `remarkup.ignored-object-names` already contains a blacklist of object names that should be ignored in the web UI. This change makes that blacklist also apply to the chatbot. This makes it possible to have a chatbot ignore things like V1, V2, Q1 and any other phrases the user may not want to generate links to objects.

Test Plan: Create objects (tasks, slowvotes, etc.) then mention the object names in chat (with the bot running). The bot should respond with helpful links to the given objects. Then add the object names to the blacklist through the config web UI. This apparently triggers the bot to restart itself. Then mention the object names in chat again. The bot should no longer respond with links because those object names have been added to the blacklist regex.

Reviewers: epriestley, #blessed_reviewers

Reviewed By: epriestley, #blessed_reviewers

Subscribers: epriestley

Maniphest Tasks: T11508

Differential Revision: https://secure.phabricator.com/D16442
2016-08-23 07:38:27 -05:00
epriestley
3bd0da0ec2 Add a missing table key to improve performance of "Recently Completed Tasks" query
Summary:
Fixes T11490. Currently, this query can not use a key and the table size may be quite large.

Adjust the query so it can use a key for both selection and ordering, and add that key.

Test Plan: Ran `EXPLAIN` on the old query in production, then added the key and ran `EXPLAIN` on the new query. Saw key in use, and "rows" examined drop from 29,273 to 15.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11490

Differential Revision: https://secure.phabricator.com/D16423
2016-08-19 11:53:09 -07:00
epriestley
ca78c1825a When already running as the daemon user, don't "sudo" daemon commands
Summary:
The cluster synchronization code runs either actively (before returning a response to `git clone`, for example) or passively (routinely, as the daemons update repositories).

The active sync runs as the web user (if running `git clone http://...`) or the VCS user (if running `git clone ssh://...`). But the passive sync runs as the daemon user.

All of these sync processes need to run actual commands as the daemon user (`git fetch ...`).

For the active ones, we must `sudo`.

For the passive ones, we're already the right user. We run the same code, and end up trying to sudo to ourselves, which `sudo` isn't happy about by default.

Depending on how `sudo` is configured and which users things are running as, this might work anyway, but it's silly, and if it doesn't work it requires you to go make non-obvious, weird config changes that are unintuitive and somewhat nonsensical. This is probably worse on balance than adding a bit of complexity to the code.

Instead, test which user we're running as. If it's already the right user, don't sudo.
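
A minimal Python sketch of that check, assuming the command is built as an argument list. The function name, the `phd` user name in the usage note, and the exact `sudo` arguments are illustrative and not taken from the actual change.

```python
import getpass


def build_command(argv, daemon_user, current_user=None):
    """Wrap argv in sudo only when we are not already the daemon user."""
    if current_user is None:
        current_user = getpass.getuser()
    if current_user == daemon_user:
        return list(argv)  # already the right user: run directly
    return ["sudo", "-u", daemon_user, "--"] + list(argv)
```

For example, `build_command(["git", "fetch"], "phd", current_user="phd")` returns the command unchanged, while any other current user gets the `sudo -u phd --` prefix.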

Test Plan:
  - Ran `bin/repository update --trace` as daemon user, saw no more `sudo`.
  - Ran a `git clone` to make sure that didn't break.

Reviewers: chad, avivey

Reviewed By: avivey

Differential Revision: https://secure.phabricator.com/D16391
2016-08-11 16:41:19 -07:00
epriestley
5e3efca08a In taskmaster daemons, only close connections which were not used recently
Summary:
Ref T11458. Depends on D16388. Currently, we're very aggressive about closing connections in the taskmaster daemons.

This can end up taking up a lot of resources. In particular, because the outgoing port for outbound connections normally can not be reused for 60 seconds after a connection closes, we may exhaust outbound ports on the host if there's a big queue full of stuff that's being processed very quickly.

At a minimum, we //always// are holding open a `worker` connection, which we always need again right away. So even in the best case we end up opening/closing this about once per second and each daemon takes up ~60 outbound ports when it should take up ~1.

So, make two adjustments:

  - First, only close connections which we haven't issued a query on in the last 60 seconds. This should prevent us from closing connections that we'll need again immediately in most cases. In the worst case, we shouldn't be eating up any extra ports under default TCP behavior.
  - Second, explicitly close connections. We were relying on implicit/GC behavior (maybe as a holdover from very long ago, before we got connection wrappers in place?), which probably did about the same thing but isn't as predictable and can't be profiled or instrumented.
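
The two adjustments can be sketched like this in Python. `TrackedConnection` and `close_idle` are hypothetical names for illustration; the 60-second window is the one described above.

```python
import time

IDLE_CLOSE_SECONDS = 60  # only close connections idle at least this long


class TrackedConnection:
    """Hypothetical wrapper that records when a query was last issued."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self._clock = clock
        self.last_query = clock()
        self.closed = False

    def query(self):
        self.last_query = self._clock()

    def close(self):
        # Explicit close, rather than relying on garbage collection.
        self.closed = True


def close_idle(connections, now):
    """Close connections not used recently; return the ones kept open."""
    kept = []
    for conn in connections:
        if now - conn.last_query >= IDLE_CLOSE_SECONDS:
            conn.close()
        else:
            kept.append(conn)
    return kept
```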

Test Plan:
This is somewhat difficult to test completely convincingly in isolation since the problem behavior depends on production scales and the workload, and to some degree on configuration.

I tested that this stuff basically works by adding logging to connect/close and running the daemons, verifying that they churned connections a lot before this change (e.g., ~1/s even at no load) and churn rarely afterward (e.g., almost never at no load).

I ran some workload through them to make sure I didn't completely break anything.

The best real test is just seeing how production responds. Current inbound/outbound connections on `secure001` are 1,200:

```
secure001 $ netstat -t | grep :mysql | wc -l
1164
```

Current outbound from `repo001` are 18,600:

```
repo001 $ netstat -t | grep :mysql | wc -l
18663
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11458

Differential Revision: https://secure.phabricator.com/D16389
2016-08-11 12:03:56 -07:00
epriestley
4068ee2a75 Make permanent worker failures more user-friendly
Summary:
Ref T11309. In that task, a user misunderstood two parts of this error:

  - They took "exception" to mean "unexpected failure", when it was intended to mean "rare circumstance".
  - They interpreted the internal ID number of a commit to mean that Phabricator was malfunctioning.

Make the language of this condition more direct, explaining what the situation means in greater detail.

Additionally, we would previously re-throw this exception, which would make the daemon exit, wait a moment, and restart. This was normal and expected.

When //unexpected// failures occur, it's important to do this: it prevents a daemon failing in a loop from causing too many side effects (e.g., limit of 1 email per 5 seconds instead of thousands per second).

When expected, permanent failures occur, we do not need to do this: the task will not be retried. I just did it because it was slightly more consistent ("failures restart daemons") and we had few permanent failure types at the time.

We have more now, and restarting the daemons generates some additional logs which have the potential to confuse. Cycling the daemon also (intentionally) reduces the rate at which we process tasks, which can be bad for permanent failures like "deleted commit" because users can delete a huge number of commits and possibly clog up the queue with cycle-after-failure actions.
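
The distinction between the two failure kinds can be sketched as below. This is an illustrative Python outline with hypothetical names (`PermanentFailure`, `run_task`), not the daemon's implementation.

```python
class PermanentFailure(Exception):
    """Expected failure: cancel the task; it will never be retried."""


def run_task(task, log):
    """Run one task. Returns "done" or "cancelled".

    Permanent failures are logged and swallowed so the daemon keeps
    going. Any other exception propagates, which in this sketch is
    what would make the daemon exit and restart after a delay.
    """
    try:
        task()
    except PermanentFailure as ex:
        log.append("cancelled: %s" % ex)
        return "cancelled"
    return "done"
```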

Test Plan:
Tried to process a deleted commit, saw a new message:

```
2016-07-11 9:30:22 AM [STDE] <VERB> PhabricatorTaskmasterDaemon Task 1428658 was cancelled: Commit "R55:6c46b7d0fb82a859ca3f87a95dc8dcceef8088c9" (with internal ID "282161") is no longer reachable from any branch, tag, or ref in this repository, so it will not be imported. This usually means that the branch the commit was on was deleted or overwritten.
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11309

Differential Revision: https://secure.phabricator.com/D16268
2016-07-11 09:21:39 -07:00
epriestley
c510c925cf Allow worker tasks to be cancelled by classname
Summary:
Ref T3554. Makes `bin/worker cancel --class <classname>` work (cancel all tasks with that type).

This is useful in development if your queue is full of a bunch of gunk, and a need has occasionally arisen in production environments (usually "one option is cancel everything and move on").

Test Plan: Ran `bin/worker cancel` to cancel blocks of tasks by class name.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T3554

Differential Revision: https://secure.phabricator.com/D16267
2016-07-11 09:21:16 -07:00
Aviv Eyal
a3bb35e9d2 make Trigger Daemon sleep correctly when one-time triggers exist
Summary:
Trigger daemon is trying to find the next event to invoke before sleeping, but the query includes already-elapsed triggers.
It then tries to sleep for 0 seconds.
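
One way to picture the fix is to ignore already-elapsed triggers when computing the sleep, as in this Python sketch. The function name and the `max_sleep` cap are hypothetical; the actual change may differ.

```python
def seconds_until_next_trigger(next_epochs, now, max_sleep=60):
    """Return how long to sleep before the next future trigger.

    Triggers at or before `now` have already elapsed and are skipped,
    so they can no longer force a 0-second sleep.
    """
    upcoming = [epoch for epoch in next_epochs if epoch > now]
    if not upcoming:
        return max_sleep
    return min(min(upcoming) - now, max_sleep)
```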

Test Plan:
On a new instance, schedule a single trigger of type `PhabricatorOneTimeTriggerClock` to a very near time.

Use top to see trigger daemon not going to 100% CPU once the event has elapsed.

Reviewers: #blessed_reviewers, epriestley

Subscribers: Korvin

Differential Revision: https://secure.phabricator.com/D15750
2016-04-18 14:17:10 -07:00
epriestley
601aaa5a86 Modularize content sources
Summary:
Ref T10537. For Nuance, I want to introduce new sources (like "GitHub" or "GitHub via Nuance" or something) but this needs to modularize eventually.

Split ContentSource apart so applications can add new content sources.

Test Plan:
This change has huge surface area, so I'll hold it until post-release. I think it's fairly safe (and if it does break anything, the breaks should be fatals, not anything subtle or difficult to fix), there's just no reason not to hold it for a few hours.

- Viewed new module page.
- Grepped for all removed functions/constants.
- Viewed some transactions.
- Hovered over timestamps to get content source details.
- Added a comment via Conduit.
- Added a comment via web.
- Ran `bin/storage upgrade --namespace XXXXX --no-quickstart -f` to re-run all historic migrations.
- Generated some objects with `bin/lipsum`.
- Ran a bulk job on some tasks.
- Ran unit tests.

{F1190182}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10537

Differential Revision: https://secure.phabricator.com/D15521
2016-03-26 11:59:45 -07:00
epriestley
de23ba0002 Fix a minor issue in Nuance which could cause the trigger daemon to poll too often
Summary: Ref T10537. Currently, when you have at least two cursors, the daemon can poll too frequently when processing the last source because it never hits the end-of-list condition.

Test Plan:
  - Ran `bin/phd debug trigger`.
  - Observed huge volumes of output before change as triggers fired as fast as possible.
  - Observed reasonable poll frequency after change.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10537

Differential Revision: https://secure.phabricator.com/D15464
2016-03-12 05:04:42 -08:00
epriestley
2a3c3b2b98 Provide bin/nuance import and ngram indexes for sources
Summary:
Ref T10537. More infrastructure:

  - Put a `bin/nuance` in place with `bin/nuance import`. This has no useful behavior yet.
  - Allow sources to be searched by substring. This supports `bin/nuance import --source whatever` so you don't have to dig up PHIDs.

Test Plan:
  - Applied migrations.
  - Ran `bin/nuance import --source ...` (no meaningful effect, but works fine).
  - Searched for sources by substring in the UI.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10537

Differential Revision: https://secure.phabricator.com/D15436
2016-03-08 10:30:24 -08:00
epriestley
3f4cc3ad6e Allow Nuance sources to provide import cursors
Summary:
Ref T10537. Some sources (like the future "GitHub Repository" source) need to poll remotes.

  - Provide a mechanism for sources to emit import cursors.
  - Hook them into the trigger daemon so they'll fire periodically.
  - Provide some storage.

This diff does nothing useful or interesting, and is pure infrastructure.

Test Plan:
  - Ran `bin/storage upgrade -f`, no adjustment issues.
  - Poked around Nuance.
  - Ran the trigger daemon, verified it didn't crash and checked for Nuance stuff to do.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10537

Differential Revision: https://secure.phabricator.com/D15435
2016-03-08 10:30:04 -08:00
epriestley
abb4c03b47 Remove shouldShowSubscribersProperty() from SubscribableInterface
Summary:
Every caller returns `true`. This was added a long time ago for Projects, but projects are no longer subscribable.

I don't anticipate needing this in the future.

Test Plan: Grepped for this method.

Reviewers: chad

Reviewed By: chad

Differential Revision: https://secure.phabricator.com/D15409
2016-03-06 06:01:36 -08:00
Sébastien Santoro
a4db6f387d Fix typo: discsussions → discussions
Test Plan: Read again the sentence.

Reviewers: joshuaspence, #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley

Differential Revision: https://secure.phabricator.com/D15316
2016-02-21 01:51:03 -08:00
epriestley
5c2e49a812 Allow any user to watch any project they can see
Summary:
Ref T6183. Ref T10054. Historically, only members could watch projects because there were some weird special cases with policies. These policy issues have been resolved and Herald is generally powerful enough to do equivalent watches on most objects anyway.

Also puts a "Watch Project" button on the feed panel to make the behavior and meaning more obvious.

Test Plan:
  - Watched a project I was not a member of.
  - Clicked the feed watch/unwatch button.

{F1064909}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T6183, T10054

Differential Revision: https://secure.phabricator.com/D15063
2016-01-19 19:38:30 -08:00
epriestley
96b1665eaa Link "continue" action to confirm dialog in bulk jobs that are unconfirmed
Summary: See Q266.

Test Plan: Created a bulk job, clicked "Details" instead of "Confirm", clicked "Continue" to get back to confirmation dialog.

Reviewers: chad

Reviewed By: chad

Differential Revision: https://secure.phabricator.com/D14985
2016-01-10 10:55:58 -08:00
epriestley
4bba3fd4c1 Fully modularize DestructionEngine
Summary: Ref T9979. Convert all DestructionEngine behaviors to extensions.

Test Plan:
{F1033244}

Destroyed an object, verifying:

  - Herald transcripts were destroyed;
  - edges were destroyed;
  - flags were destroyed;
  - tokens were destroyed;
  - transactions were destroyed;
  - worker tasks were cancelled.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9979

Differential Revision: https://secure.phabricator.com/D14832
2015-12-21 17:03:44 -08:00
epriestley
e9af4f8970 Fix an issue where Drydock followup tasks would not queue if the main task failed
Summary:
Ref T9994. This fixes the first issue discussed on that task, which is that when a merge fails after "arc land", we would not clean up all the leases properly.

Specifically, when a merge fails, we use `queueTask()` to schedule a followup task. This followup destroys the lease and frees the underlying resource.

However, the default behavior of `queueTask()` is to //not queue tasks// if the parent task fails. This is a reasonable, safe behavior that was originally introduced in D8774, where it kept us from sending too much mail if a task did "send some mail" and then failed a little later on and got retried.

Since I think the default behavior is correct, I just special cased the behavior for Drydock to make it queue even on failure. These are the only types of followup tasks we currently want to queue on main task failure.

(It's possible that future Blueprints might want some kind of more specialized behavior, where some tasks queue only on success, but we can cross that bridge when we come to it.)
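
The queueing rule can be sketched as follows. This Python outline uses hypothetical names (`Worker`, `queue_task`, `even_on_failure`) and only mirrors the behavior described above: followups are dropped when the parent fails unless they opt in.

```python
class Worker:
    """Hypothetical worker that collects followups while it runs."""

    def __init__(self):
        self._followups = []

    def queue_task(self, name, even_on_failure=False):
        # Default: the followup is queued only if this task succeeds.
        self._followups.append((name, even_on_failure))

    def flush(self, succeeded):
        """Return the followups that should actually be queued."""
        return [
            name for name, even_on_failure in self._followups
            if succeeded or even_on_failure
        ]
```

A lease-cleanup followup would pass `even_on_failure=True`, while a "send mail" followup would keep the default.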

Test Plan:
  - See T9994#149878 for test case setup.
  - I ran that test case again with this patch, and saw the followup task queue properly in the `--trace` log, a corresponding update task show up in `/daemon/`, and the lease get destroyed when I ran it a moment later.

{F1029915}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9994

Differential Revision: https://secure.phabricator.com/D14818
2015-12-18 08:17:04 -08:00
epriestley
b964f8873b Fix daemon restart behavior to check once every 10 seconds
Summary: This logic is flipped.

Test Plan:
  - Before change: ran `bin/phd debug task`, saw queries to the config table every second.
  - After change: ran `bin/phd debug task`, saw queries to the config table every 10 seconds.

Reviewers: chad, joshuaspence

Reviewed By: chad, joshuaspence

Differential Revision: https://secure.phabricator.com/D14542
2015-11-23 05:59:04 -08:00
epriestley
2e09a93dc1 Improve efficiency of worker task GC for huge loads
Summary:
Fixes T9808.

An instance imported a very large repository, generating approximately 4 million tasks over the course of a few days. A week later, these tasks started expiring and became candidates for garbage collection.

The GC works by deleting 100 rows at a time over and over again. It finds the rows it's going to delete by querying for old rows.

Currently, this query generates a `WHERE dateCreated < X ORDER BY id DESC` query. This query can not efficiently execute using a single key, because it relies on `dateCreated` order to find the rows, then on `id` order to sort them. With a table with 4M rows, this is slow.

This would still be OK, except that the query has to execute a lot of times since it only deletes 100 rows each time. Particularly, it needs to execute a total of ~40K times.

Instead, generate `WHERE dateCreated < X ORDER BY dateCreated DESC, id DESC`. This should have the same effect in general and the GC definitely doesn't care about the difference, but it should be more efficient at large scales.
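
For illustration, the batch loop with the new ordering looks roughly like this in Python. SQLite is used here only as a stand-in for the real database, and the `task` table name is hypothetical; this sketch shows the shape of the loop, not the query plan.

```python
import sqlite3

BATCH = 100  # rows deleted per pass, per the description above


def collect_old_tasks(conn, cutoff):
    """Delete rows older than cutoff in batches; return the total."""
    deleted = 0
    while True:
        # Order by the same column used for selection, then by id.
        rows = conn.execute(
            "SELECT id FROM task WHERE dateCreated < ? "
            "ORDER BY dateCreated DESC, id DESC LIMIT ?",
            (cutoff, BATCH),
        ).fetchall()
        if not rows:
            return deleted
        conn.executemany("DELETE FROM task WHERE id = ?", rows)
        deleted += len(rows)
```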

Test Plan:
I had to `TRUNCATE` the problem table so I don't have a perfect repro to completely convincingly test this anymore. Both queries behave fine at small scales, which is why we haven't seen this before.

I was able to run the newer query in production before I nuked the table and have it complete in a reasonable amount of time, while the old query hung longer than I wanted to wait (several minutes?). The query plan for the new query was also a good one, while the query plan for the old query was terrible.

I loaded the daemon console and ran `bin/garbage collect --collector worker.tasks --trace`. I verified the queries looked reasonable and produced reasonable results in production.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9808

Differential Revision: https://secure.phabricator.com/D14505
2015-11-17 17:05:10 -08:00
Joshua Spence
a07a8aca24 Add a daemon overseer module to restart daemons when config changes
Summary: Fixes T7053. Depends on D14452.

Test Plan:
Created a custom daemon which dumps out the config hash (by querying `PhabricatorEnv::calculateEnvironmentHash()`). Ran this daemon with `./bin/phd debug PhabricatorDebugDaemon` and saw the config hash update within 30 seconds.

{P1886}

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: Korvin

Maniphest Tasks: T7053

Differential Revision: https://secure.phabricator.com/D14458
2015-11-11 08:44:18 +11:00
Joshua Spence
495cb7a2e0 Mark PhabricatorPHIDType::getPHIDTypeApplicationClass() as abstract
Summary: Fixes T9625. As explained in a `TODO` comment, seems reasonable enough.

Test Plan: Unit tests.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: Korvin, hach-que

Maniphest Tasks: T9625

Differential Revision: https://secure.phabricator.com/D14068
2015-11-03 06:47:12 +11:00
epriestley
de2bbfef7d Allow PhabricatorWorker->queueTask() to take full $options
Summary:
Ref T9252. Currently, `queueTask()` accepts `$priority` as its third argument. Allow it to take a full range of `$options` instead. This API just never got updated after we expanded available options.

Arguably this whole API should be some kind of "TaskQueueRequest" object but I'll leave that for another day.

Test Plan:
  - Grepped for `queueTask()` and verified no other callsites are affected by this API change.
  - Ran some daemons.
  - See also next diff.

Reviewers: hach-que, chad

Reviewed By: hach-que, chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14235
2015-10-05 09:46:29 -07:00
epriestley
4cf1270ecd In Harbormaster, make sure artifacts are destroyed even if a build is aborted
Summary:
Ref T9252. Currently, Harbormaster and Drydock work like this in some cases:

  # Queue a lease for activation.
  # Then, a little later, save the lease PHID somewhere.
  # When the target/resource is destroyed, destroy the lease.

However, something can happen between (1) and (2). In Drydock this window is very short and the "something" would have to be a lightning strike or something similar, but in Harbormaster we wait until the resource activates to do (2) so the window can be many minutes long. In particular, a user can use "Abort Build" during those many minutes.

If they do, the target is destroyed but it doesn't yet have a record of the artifact, so the artifact isn't cleaned up.

Make these things work like this instead:

  # Create a new lease and pre-generate a PHID for it.
  # Save that PHID as something that needs to be cleaned up.
  # Queue the lease for activation.
  # When the target/resource is destroyed, destroy the lease if it exists.

This makes sure there's no step in the process where we might lose track of a lease/resource.
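
The reordered steps can be sketched roughly as follows. This is a toy in-memory model, not the Harbormaster/Drydock code: `Target`, `LeaseStore`, and the `LEASE-` identifier format are hypothetical stand-ins for the real objects and PHIDs.

```python
import uuid


class Target:
    """Hypothetical build target that remembers what it must clean up."""

    def __init__(self):
        self.cleanup_ids = []


class LeaseStore:
    """Hypothetical stand-in for lease storage plus the activation queue."""

    def __init__(self):
        self.leases = {}
        self.queue = []

    def queue_activation(self, lease_id):
        self.leases[lease_id] = "pending"
        self.queue.append(lease_id)

    def destroy_if_exists(self, lease_id):
        # Returns True only if there was actually a lease to destroy.
        return self.leases.pop(lease_id, None) is not None


def request_lease(target, store):
    lease_id = "LEASE-" + uuid.uuid4().hex  # 1. pre-generate the identifier
    target.cleanup_ids.append(lease_id)     # 2. record it for cleanup first
    store.queue_activation(lease_id)        # 3. only then queue activation
    return lease_id


def destroy_target(target, store):
    # 4. destroy each recorded lease, tolerating ones that never got created
    destroyed = [i for i in target.cleanup_ids if store.destroy_if_exists(i)]
    target.cleanup_ids.clear()
    return destroyed
```

Because the identifier is recorded before anything is queued, an abort at any later point still finds it; the worst case is a recorded identifier with no lease behind it, which `destroy_if_exists` treats as a no-op.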

Also, clean up and standardize some other stuff I hit.

Test Plan:
  - Stopped daemons.
  - Restarted a build in Harbormaster.
  - Stepped through the build one stage at a time using `bin/worker execute ...`.
  - After the lease was queued, but before it activated, aborted the build.
  - Processed the Harbormaster side of things only.
  - Saw the lease get destroyed properly.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14234
2015-10-05 05:58:53 -07:00
epriestley
9c798e5cca Provide bin/garbage for interacting with garbage collection
Summary:
Fixes T9494. This:

  - Removes all the random GC.x.y.z config.
  - Puts it all in one place that's locked and which you use `bin/garbage set-policy ...` to adjust.
  - Makes every TTL-based GC configurable.
  - Simplifies the code in the actual GCs.
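
A minimal sketch of the idea, assuming a single policy table that every TTL-based collector consults. The collector name, the default TTL, and the `GarbagePolicy` class are hypothetical; this does not reflect how `bin/garbage` stores policies.

```python
# Hypothetical collector name and default retention (30 days, in seconds).
DEFAULT_TTLS = {"example.logs": 30 * 24 * 3600}


class GarbagePolicy:
    """One place holding retention policy for every TTL-based collector."""

    def __init__(self, defaults):
        self.defaults = dict(defaults)
        self.overrides = {}

    def set_policy(self, collector, ttl):
        # ttl is seconds, None for indefinite retention,
        # or the string "default" to drop any override.
        if ttl == "default":
            self.overrides.pop(collector, None)
        else:
            self.overrides[collector] = ttl

    def get_ttl(self, collector):
        if collector in self.overrides:
            return self.overrides[collector]
        return self.defaults[collector]


def collect(records, ttl, now):
    """Return the records that survive collection."""
    if ttl is None:
        # Indefinite retention: nothing is garbage.
        return list(records)
    cutoff = now - ttl
    return [r for r in records if r["created"] >= cutoff]
```

With the policy lookup factored out, each collector only needs to know how to find and delete rows older than a cutoff.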

Test Plan:
  - Ran `bin/garbage collect` to collect some garbage, until it stopped collecting.
  - Ran `bin/garbage set-policy ...` to shorten policy. Saw change in web UI. Ran `bin/garbage collect` again and saw it collect more garbage.
  - Set policy to indefinite and saw it not collect garbage.
  - Set policy to default and saw it reflected in web UI / `collect`.
  - Ran `bin/phd debug trigger` and saw all GCs fire with reasonable looking queries.
  - Read new docs.

{F857928}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9494

Differential Revision: https://secure.phabricator.com/D14219
2015-10-02 09:17:24 -07:00
epriestley
878a493301 Begin standardizing garbage collectors
Summary: Ref T9494. Improve support infrastructure for garbage collectors.

Test Plan:
  - Ran `bin/phd debug trigger`, saw collectors execute.

{F857852}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9494

Differential Revision: https://secure.phabricator.com/D14218
2015-10-01 16:58:43 -07:00
epriestley
4496176924 Add staging area support to Harbormaster/Drydock + various fixes
Summary:
Ref T9252. This primarily allows Harbormaster to request (and Drydock to fulfill) working copies with a patch from a staging area. Doing this means we can do builds on in-review changes from `arc diff`.

This is a little cobbled-together but should basically work.

Also fix some other issues:

  - Yielded, awakened workers are fine to update but could complain.
  - We can't log slot lock failures to resources if we don't end up saving them.
  - Killing the transaction would wipe out the log.
  - Fix some TODOs, etc.

Test Plan: Ran Harbormaster builds on a local revision.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14214
2015-10-01 16:55:01 -07:00
epriestley
4ac82be5ed Merge the DrydockLease workers into a single worker
Summary:
Ref T9252. This is the same as D14201, but for lease stuff instead of resource stuff.

This one is a little heavier but still feels pretty reasonable to me at the end of the day (worker is <1K lines and has a ton of comment stuff).

Also fixes a few random bugs I hit in the task queue.

Test Plan:
  - Restarted some Harbormaster builds, saw them go through cleanly.
  - Released pre-activation resources/leases.
  - Probably still kinda buggy but I'll iron the details out over time.

Logs are starting to look somewhat plausible:

{F855747}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14202
2015-10-01 08:11:02 -07:00
epriestley
55767aac0f Fix an issue where followup tasks could fail to queue with string priorities
Auditors: chad
2015-09-28 19:46:41 -07:00
epriestley
bfaa93aa9b Allow Harbormaster build plans to request additional working copies
Summary:
Ref T9123. To run upstream builds in Harbormaster/Drydock, we need to be able to check out `libphutil`, `arcanist` and `phabricator` next to one another.

This adds an "Also Clone: ..." field to Harbormaster working copy build steps so I can type all three repos into it and get a proper clone with everything we need.

This is somewhat upstream-centric and a bit narrow, but I don't think it's totally unreasonable, and most of the underlying stuff is relatively general.

This adds some more typechecking and improves data/type handling for custom fields, too. In particular, it prevents users from entering an invalid/restricted value in a field (for example, you can't "Also Clone" a repository you don't have permission to see).

Test Plan: Restarted build, got a Drydock resource with multiple repositories in it.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T9123

Differential Revision: https://secure.phabricator.com/D14183
2015-09-28 17:57:41 -07:00
epriestley
9b29d46e60 Make Drydock lease infrastructure more nimble
Summary:
Ref T9252. Currently, Harbormaster does this when trying to acquire a working copy:

  - Ask for a working copy.
  - Yield for 15 seconds.
  - Check if we have a working copy yet.

That's OK, but Drydock takes ~1s to acquire a working copy lease if a resource is already available, so we end up doing this:

  - T+0: Ask for a working copy.
  - T+0: Yield for 15 seconds.
  - T+1: Working copy lease activates.
  - T+15: Working copy lease is used.
  - T+16: Build finishes.

So we end up spending about 2 seconds doing work and 14 seconds sleeping.

One way to fix this would be to fiddle with the yield duration, so we yield for 1, 2, 4, ... seconds or something. This probably isn't a bad idea for longer leases (i.e., wait for 15, 30, 45 ... seconds or similar) but it implies a lot of churn for short leases.

Instead, let tasks "awaken" other tasks when they complete. The "awaken" operation means: if a task is in a yielded state (no failures, no owner, explicitly yielded, future expires time), pretend it only yielded until right now instead of whenever it really yielded to.

Basically, this rewrites history so that even though Harbormaster did a `yield(15)`, we pretend it did a `yield(4)` after we activate the lease if lease activation took 4 seconds.

If this misses, it's fine: we fall back to the normal yield behavior and things move forward normally a few seconds later.

If it hits, we get a more nimble process pretty cleanly.
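
The "awaken" operation described above could look roughly like this. It is a simplified in-memory sketch, not the actual worker queue code: the `Task` fields and function names are hypothetical, chosen to mirror the four conditions listed (no failures, no owner, explicitly yielded, future expiry).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    failure_count: int = 0
    lease_owner: Optional[str] = None
    yielded: bool = False
    lease_expires: float = 0.0  # time at which the task is runnable again


def yield_task(task, now, duration):
    """Give up the task and ask to be retried after `duration` seconds."""
    task.yielded = True
    task.lease_owner = None
    task.lease_expires = now + duration


def awaken(tasks, task_ids, now):
    """Pretend the named yielded tasks only yielded until `now`."""
    awakened = []
    for task in tasks:
        if task.task_id not in task_ids:
            continue
        is_yielded = (
            task.failure_count == 0
            and task.lease_owner is None
            and task.yielded
            and task.lease_expires > now
        )
        if is_yielded:
            task.lease_expires = now  # rewrite history: yield ends right now
            awakened.append(task.task_id)
    return awakened


build = Task(task_id=1)
yield_task(build, now=0, duration=15)  # Harbormaster does a yield(15)
awaken([build], {1}, now=4)            # lease activates at T+4
```

A task that does not match every condition is left alone, so a missed awaken simply falls back to the original yield duration.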

Test Plan:
  - Restarted a build plan (lease working copy + run `ls`) with this patch no-op'd, took about 16 seconds.
  - Restarted a build plan with this patch active, took about 1 second.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14178
2015-09-28 09:35:40 -07:00
epriestley
ec6d69e74d Give Drydock resources a proper expiry mechanism
Summary:
Fixes T6569. This implements an expiry mechanism for Drydock resources which parallels the mechanism for leases.

A few things are missing that we'll probably need in the future:

  - An "EXPIRES" command to update the expiration time. This would let resources be permanent while leased, then expire after, say, 24 hours without any leases.
  - A callback like `shouldActuallyExpireRightNow()` for resources and leases that lets them decide not to expire at the last second.
  - A callback like `didAcquireLease()` for resource blueprints, to parallel `didReleaseLease()`, letting them clear or extend their timer.

However, this stuff would mostly just let us tune behaviors, not really open up new capabilities.

Test Plan: Changed host resources to expire after 60 seconds, leased one, saw it vanish 60 seconds later.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14176
2015-09-28 09:35:14 -07:00
epriestley
3379904237 Allow Drydock leases to expire after a time limit
Summary: Ref T6569. If a lease is activated with an expiration date, schedule a task to try to clean it up after that time.

Test Plan:
  - Used `bin/drydock lease ... --until ...` to activate a lease in the near future.
  - Waited for a bit.
  - Saw it expire and get destroyed at the scheduled time.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T6569

Differential Revision: https://secure.phabricator.com/D14148
2015-09-23 13:54:27 -07:00
epriestley
fcb6d1e2fa Strip some obsolete code out of Drydock
Summary:
Ref T9252. This simplifies some Drydock code.

Most of this code relates to the old notion of Drydock being able to enumerate all the tasks it needs to complete in order to acquire a lease. The code has stepped back from this, since it's unnecessary, the queue is more powerful than it used to be, and it would be a lot of work to keep track of.

The ~only thing that should ever wait for leases in modern code is `bin/drydock lease`, and it's fine for it to just sit there sleeping, so this just does that.

This reduces the granularity of logging, but I'll address that separately in future logging-focused changes.

Test Plan: Used `bin/drydock lease` to acquire a lease, saw it acquire cleanly.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14147
2015-09-23 13:21:41 -07:00