Summary:
Ref T13253. Fixes T6615. See that task for discussion.
- Remove three keys which serve no real purpose: `dataID` doesn't do anything for us, and the two `leaseOwner` keys are unused.
- Rename `leaseOwner_2` to `key_owner`.
- Fix an issue where `dataID` was nullable in the active table and non-nullable in the archive table.
In practice, //all// workers have data, so all workers have a `dataID`: if they didn't, we'd already fatal when trying to move tasks to the archive table. Just clean this up for consistency, and remove the ancient codepath which imagined tasks with no data.
Test Plan:
- Ran `bin/storage upgrade`, inspected tables.
- Ran `bin/phd debug taskmaster`, worked through a bunch of tasks with no problems.
Reviewers: amckinley
Reviewed By: amckinley
Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam
Maniphest Tasks: T13253, T6615
Differential Revision: https://secure.phabricator.com/D20175
Summary:
Depends on D20175. Ref T12425. Ref T13253. Currently, importing commits can stall search index rebuilds, since index rebuilds use an older priority from before T11677 and weren't really updated for D16585.
In general, we'd like to complete all indexing tasks before continuing repository imports. A possible exception is if you rebuild an entire index with `bin/search index --rebuild-the-world`, but we could queue those at a separate lower priority if issues arise.
Test Plan: Ran some search indexing through the queue.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13253, T12425
Differential Revision: https://secure.phabricator.com/D20177
Summary:
Depends on D19919. Ref T11351. This method appeared in D8802 (note that "get...Object" was renamed to "get...Transaction" there, so this method was actually "new" even though a method of the same name had existed before).
The goal at the time was to let Harbormaster post build results to Diffs and have them end up on Revisions, but this eventually got a better implementation (see below) where the Harbormaster-specific code can just specify a "publishable object" where build results should go.
The new `get...Object` semantics ultimately broke some stuff, and the actual implementation in Differential was removed in D10911, so this method hasn't really served a purpose since December 2014. I think that broke the Harbormaster thing by accident and we just lived with it for a bit, then Harbormaster got some more work and D17139 introduced "publishable" objects which was a better approach. This was later refined by D19281.
So: the original problem (sending build results to the right place) has a good solution now, this method hasn't done anything for 4 years, and it was probably a bad idea in the first place since it's pretty weird/surprising/fragile.
Note that `Comment` objects still have an unrelated method with the same name. In that case, the method ties the `Comment` storage object to the related `Transaction` storage object.
Test Plan: Grepped for `getApplicationTransactionObject`, verified that all remaining callsites are related to `Comment` objects.
Reviewers: amckinley
Reviewed By: amckinley
Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam
Maniphest Tasks: T11351
Differential Revision: https://secure.phabricator.com/D19920
Summary:
Depends on D19918. Ref T11351. In D19918, I removed all calls to this method. Now, remove all implementations.
All of these implementations just `return $timeline`, only the three sites in D19918 did anything interesting.
Test Plan: Used `grep willRenderTimeline` to find callsites, found none.
Reviewers: amckinley
Reviewed By: amckinley
Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam
Maniphest Tasks: T11351
Differential Revision: https://secure.phabricator.com/D19919
Summary:
Ref T13216. See PHI970. Ref T13054. See some discussion in T13216.
When a Harbormaster Buildable object is first created for a Diff, it has no `containerPHID` since the revision has not yet been created.
We later (after creating a revision) send the Buildable a message telling it that we've added a container and it should re-link the container object.
Currently, we send this message in `applyExternalEffects()`, which runs inside the Differential transaction. If Harbormaster races quickly enough, it can read the `Diff` object before the transaction commits, and not see the container update.
Add a `didCommitTransaction()` callback after the transactions commit, then move the message code there instead.
Test Plan:
- See T13216 for substantial evidence that this change is on the right track.
- Before change: added `sleep(15)`, reproduced the issue reliably.
- After change: unable to reproduce issue even with `sleep(15)` (the `containerPHID` always populates correctly).
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13216, T13054
Differential Revision: https://secure.phabricator.com/D19807
Summary: Depends on D19785. Ref T13217. This converts many of the most common clause construction pathways to the new %Q / %LQ / %LO / %LA / %LJ semantics.
Test Plan: Browsed around a bunch, saw fewer warnings and no obvious behavioral errors. The transformations here are generally mechanical (although I did them by hand).
Reviewers: amckinley
Reviewed By: amckinley
Subscribers: hach-que
Maniphest Tasks: T13217
Differential Revision: https://secure.phabricator.com/D19789
Summary: Depends on D19796. Simplify some timing code by using phutil_microseconds_since() instead of duplicate casting and arithmetic.
Test Plan: Grepped for `1000000` to find these. Pulled, pushed, made a conduit call. This isn't exhaustive but it should be hard for these to break in a bad way since they're all just diagnostic.
Reviewers: amckinley
Reviewed By: amckinley
Differential Revision: https://secure.phabricator.com/D19797
Summary: Depends on D19173. Ref T13096. Adds an optional, disabled-by-default lock log to make it easier to figure out what is acquiring and holding locks.
Test Plan: Ran `bin/lock log --enable`, `--disable`, `--name`, etc. Saw sensible-looking output with log enabled and daemons restarted. Saw no additional output with log disabled and daemons restarted.
Maniphest Tasks: T13096
Differential Revision: https://secure.phabricator.com/D19174
Summary:
Depends on D18961. Ref T13049. Currently, longer exports don't give the user any feedback, and exports that take longer than 30 seconds are likely to timeout.
For small exports (up to 1,000 rows) continue doing the export in the web process.
For large exports, queue a bulk job and do them in the workers instead. This sends the user through the bulk operation UI and is similar to bulk edits. It's a little clunky for now, but you get your data at the end, which is far better than hanging for 30 seconds and then fataling.
Test Plan: Exported small result sets, got the same workflow as before. Exported very large result sets, went through the bulk flow, got reasonable results out.
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13049
Differential Revision: https://secure.phabricator.com/D18962
Summary:
Depends on D18947. Ref T13051. This goes through transaction tables and compacts the edge storage into the slim format.
I put this on `bin/garbage` instead of `bin/storage` because `bin/storage` has a lot of weird stuff about how it manages databases so that it can run before configuration (e.g., all the `--user`, `--password` type flags for configuring DB connections).
Test Plan:
Loaded an object with a bunch of transactions. Ran migration. Spot checked table for sanity. Loaded another copy of the object in the web UI, compared the two pages, saw no user-visible changes.
Here's a concrete example of the migration effect -- old row:
```
*************************** 44. row ***************************
id: 757
phid: PHID-XACT-PSTE-5gnaaway2vnyen5
authorPHID: PHID-USER-cvfydnwadpdj7vdon36z
objectPHID: PHID-PSTE-5uj6oqv4kmhtr6ctwcq7
viewPolicy: public
editPolicy: PHID-USER-cvfydnwadpdj7vdon36z
commentPHID: NULL
commentVersion: 0
transactionType: core:edge
oldValue: {"PHID-PROJ-wh32nih7q5scvc5lvipv":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-wh32nih7q5scvc5lvipv","dateCreated":"1449170691","seq":"0","dataID":null,"data":[]},"PHID-PROJ-5r2ed5v27xrgltvou5or":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-5r2ed5v27xrgltvou5or","dateCreated":"1449170683","seq":"0","dataID":null,"data":[]},"PHID-PROJ-zfp44q7loir643b5i4v4":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-zfp44q7loir643b5i4v4","dateCreated":"1449170668","seq":"0","dataID":null,"data":[]},"PHID-PROJ-okljqs7prifhajtvia3t":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-okljqs7prifhajtvia3t","dateCreated":"1448902756","seq":"0","dataID":null,"data":[]},"PHID-PROJ-3cuwfuuh4pwqyuof2hhr":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-3cuwfuuh4pwqyuof2hhr","dateCreated":"1448899367","seq":"0","dataID":null,"data":[]},"PHID-PROJ-amvkc5zw2gsy7tyvocug":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-amvkc5zw2gsy7tyvocug","dateCreated":"1448833330","seq":"0","dataID":null,"data":[]}}
newValue: {"PHID-PROJ-wh32nih7q5scvc5lvipv":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-wh32nih7q5scvc5lvipv","dateCreated":"1449170691","seq":"0","dataID":null,"data":[]},"PHID-PROJ-5r2ed5v27xrgltvou5or":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-5r2ed5v27xrgltvou5or","dateCreated":"1449170683","seq":"0","dataID":null,"data":[]},"PHID-PROJ-zfp44q7loir643b5i4v4":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-zfp44q7loir643b5i4v4","dateCreated":"1449170668","seq":"0","dataID":null,"data":[]},"PHID-PROJ-okljqs7prifhajtvia3t":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-okljqs7prifhajtvia3t","dateCreated":"1448902756","seq":"0","dataID":null,"data":[]},"PHID-PROJ-3cuwfuuh4pwqyuof2hhr":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-3cuwfuuh4pwqyuof2hhr","dateCreated":"1448899367","seq":"0","dataID":null,"data":[]},"PHID-PROJ-amvkc5zw2gsy7tyvocug":{"src":"PHID-PSTE-5uj6oqv4kmhtr6ctwcq7","type":"41","dst":"PHID-PROJ-amvkc5zw2gsy7tyvocug","dateCreated":"1448833330","seq":"0","dataID":null,"data":[]},"PHID-PROJ-tbowhnwinujwhb346q36":{"dst":"PHID-PROJ-tbowhnwinujwhb346q36","type":41,"data":[]},"PHID-PROJ-izrto7uflimduo6uw2tp":{"dst":"PHID-PROJ-izrto7uflimduo6uw2tp","type":41,"data":[]}}
contentSource: {"source":"web","params":[]}
metadata: {"edge:type":41}
dateCreated: 1450197571
dateModified: 1450197571
```
New row:
```
*************************** 44. row ***************************
id: 757
phid: PHID-XACT-PSTE-5gnaaway2vnyen5
authorPHID: PHID-USER-cvfydnwadpdj7vdon36z
objectPHID: PHID-PSTE-5uj6oqv4kmhtr6ctwcq7
viewPolicy: public
editPolicy: PHID-USER-cvfydnwadpdj7vdon36z
commentPHID: NULL
commentVersion: 0
transactionType: core:edge
oldValue: []
newValue: ["PHID-PROJ-tbowhnwinujwhb346q36","PHID-PROJ-izrto7uflimduo6uw2tp"]
contentSource: {"source":"web","params":[]}
metadata: {"edge:type":41}
dateCreated: 1450197571
dateModified: 1450197571
```
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13051
Differential Revision: https://secure.phabricator.com/D18948
Summary:
Fixes T13042. This hooks up the new "silent" mode from D18882 and makes it actually work.
The UI (where we tell you to go run some command and then reload the page) is pretty clumsy, but should solve some problems for now and can be cleaned up eventually. The actual mechanics (timeline aggregation, Herald interaction, etc.) are on firmer ground.
Test Plan:
- Made a normal bulk edit, got mail and feed stories.
- Made a silent bulk edit, no mail and no feed.
- Saw "Silent Edit" marker in timeline for silent edits:
{F5386245}
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T13042
Differential Revision: https://secure.phabricator.com/D18883
Summary: Noticed a couple of typos in the docs, and then things got out of hand.
Test Plan:
- Stared at the words until my eyes watered and the letters began to swim on the screen.
- Consulted a dictionary.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley, yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam
Differential Revision: https://secure.phabricator.com/D18693
Summary: This word is not spelled properly.
Test Plan: Read the word.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D18250
Summary:
Fixes T12803. An install is having difficulty diagnosing mail failures, and one component is that permanent task failures aren't reaching the log.
It's reasonable to send these to the log even when "phd.verbose" is off. See T12803 for a rough review of when we generate these failrues today.
Test Plan:
- Faked some exceptions.
- Got a result in the log (P2058) with `phd.verbose` turned off.
Reviewers: chad, amckinley
Reviewed By: chad
Maniphest Tasks: T12803
Differential Revision: https://secure.phabricator.com/D18106
Summary:
Closes T7829 as wontfix. Closes T7965 as wontfix. Closes T7800 as wontfix. Closes T2731 as wontfix. Closes T1271 as wontfix.
We aren't maintaining this at all (see, e.g., T7829) and a user reported a technically accurate security issue via HackerOne: <https://hackerone.com/reports/222870>
Just throw it away until we get to the eventual Conphernece bot/API update and can do this stuff correctly.
Test Plan: Grepped for `phabricatorbot`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T7965, T7829, T7800, T2731, T1271
Differential Revision: https://secure.phabricator.com/D17756
Under some workloads, the taskmaster may hibernate and launch more rapidly
than it should. Require 15 seconds of inactivity before hibernating. Also
hibernate for longer.
Auditors: chad
The `min()` vs `max()` fix in D17560 meant that the Trigger daemon only
hibernates for 5 seconds, so we do a full GC sweep every 5 seconds. This ends
up eating a fair amount of CPU for no real benefit.
The GC cursors should move to persistent storage, but just bump this default
up in the meantime.
Auditors: chad
Summary: Ref T12298. Like PullLocal daemons, this allows the last daemon in the pool to hibernate if there's no work to be done, and awakens the pool when work arrives.
Test Plan:
- Ran `bin/phd debug task --trace`.
- Saw the pool hibernate and look for tasks.
- Commented on an object.
- Saw the pool wake up and process the queue.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17559
Summary:
Ref T12298. Two minor daemon improvements:
- Make the "waiting" message reflect hibernation.
- Don't trigger a reload right after launching.
Test Plan:
- Read "waiting" message.
- Ran "bin/phd start", didn't see an immediate SIGHUP in the log.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17550
Summary: Ref T12298. This allows the PullLocal daemon to hibernate like the Trigger daemon, but automatically wakes it back up when it needs to do something.
Test Plan:
- Ran `bin/phd debug pulllocal --trace`.
- Saw the daemon hibernate after doing a checkup on repositories.
- Saw periodic queries to look for new update messages.
- After clicking "Update Now" in the web UI to schedule an update, saw the daemon wake up immediately.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17540
Summary: Ref T12331. These changes are intended to make it easier to debug T12331 since I'm having difficulty reproducing the issue locally.
Test Plan:
- Ran `bin/phd debug task --pool 4` and got an autoscaling pool.
- Ran `bin/worker flood --duration 3` and got some 3-second-long tasks to execute with `bin/worker execute ...`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12331
Differential Revision: https://secure.phabricator.com/D17431
Summary:
Ref T12298. The trigger daemon already has routine long-term sleep, and few external events can impact when it should ideally wake up. The relevant events are:
- Someone creates a new Nuance source (ideally, we should wake up right away and start polling it).
- Someone creates a Calendar event about 16 minutes in the future (ideally, we should send them a reminder in about a minute).
- Someone changes GC config to be extremely aggressive (ideally, we should immediately respect the change).
None of these cases are very important. We don't hibernate for more than 3 minutes, so the worst case is that your Nuance source takes 3 minutes to start importing or your Calendar notification comes two minutes too late (13 minutes before the event instead of 15).
This change makes GC sightly more CPU-expensive on average: currently, we do a GC sweep every 4 hours. After this change, we'll end up doing one every 3 minutes, because we lose the fact that we did a sweep recently when the daemon restarts.
We could fix this by keeping track of when the last GC sweep was in the database, instead of in the Daemon process, but the cost of a sweep is normally very small so I don't plan to do this anytime soon.
Test Plan:
- Ran `bin/phd debug trigger`, saw daemon go through 3-minute hibernate + restart cycles.
- Ran `bin/phd debug task`, saw daemon run normally.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12298
Differential Revision: https://secure.phabricator.com/D17408
Summary: Ref T10390. Simplifies dropdown by rolling out canUseInPanel in useless panels
Test Plan: Add a query panel, see less options.
Reviewers: epriestley
Reviewed By: epriestley
Subscribers: Korvin
Maniphest Tasks: T10390
Differential Revision: https://secure.phabricator.com/D17341
Summary:
I frequently run into a situation where I want to kill tasks that have accumulated a lot of failures regardless of what class they are. Or I'll want to kill every worker of a certain class but only if it has failed at least once. This change allows me to run `./bin/worker cancel --class <MYCLASS> --min-failure-count 5` to only kill tasks with at least 5 failed attempts.
The `--min-failure-count N` argument can be used by itself as well as with `--class CLASSNAME`. I don't think it makes sense for it to work with `--id ID`, but I'm not dead set on that or anything.
Test Plan: I ran the worker management workflow with and without the `--min-failure-count` argument and it worked as expected.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin, epriestley, yelirekim
Differential Revision: https://secure.phabricator.com/D16906
Summary:
This has been replaced by `PolicyCodex` after D16830. Also:
- Rebuild Celerity map to fix grumpy unit test.
- Fix one issue on the policy exception workflow to accommodate the new code.
Test Plan:
- `arc unit --everything`
- Viewed policy explanations.
- Viewed policy errors.
Reviewers: chad
Reviewed By: chad
Subscribers: hach-que, PHID-OPKG-gm6ozazyms6q6i22gyam
Differential Revision: https://secure.phabricator.com/D16831
Summary: Ref T5267. Fix one minor bug (paths were not being resolved properly) and one minor string issue (missing `%d` in a string).
Test Plan: Extracted strings, got a cleaner result.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T5267
Differential Revision: https://secure.phabricator.com/D16808
Summary: Ref T7931. This is still quite rough, but should technically send vaguely-useful email as part of the standard trigger infrastructure.
Test Plan: Ran `bin/phd start`, created an event shortly, saw reminder email send in `bin/mail list-outbound`.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T7931
Differential Revision: https://secure.phabricator.com/D16784
Summary:
Fixes T11771. Adds a lock around each GC process so we don't try to, e.g., delete old files on two machines at once just because they're both running trigger daemons.
The other aspects of this daemon (actual triggers; nuance importers) already have separate locks.
Test Plan: Ran `bin/phd debug trigger --trace`, saw daemon acquire locks and collect garbage.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11771
Differential Revision: https://secure.phabricator.com/D16739
Summary:
Fixes T11677. This makes two minor adjustments to the repository import daemons:
- The first step ("Message") now queues at a slightly-lower-than-default (for already-imported repositories) or very-low (for newly importing repositories) priority level.
- The other steps now queue at "default" priority level. This is actually what they already did, but without this change their behavior would be to inherit the priority level of their parents.
This has two effects:
- When adding new repositories to an existing install, they shouldn't block other things from happening anymore.
- The daemons will tend to start one commit and run through all of its steps before starting another commit. This makes progress through the queue more even and predictable.
- Before, they did ALL the message tasks, then ALL the change tasks, etc. This works fine but is confusing/uneven/less-predictable because each type of task takes a different amount of time.
Test Plan:
- Added a new repository.
- Saw all of its "message" steps queue at priority 4000.
- Saw followups queue at priority 2000.
- Saw progress generally "finish what you started" -- go through the queue one commit at a time, instead of one type of task at a time.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11677
Differential Revision: https://secure.phabricator.com/D16585
Summary: Previously, the chatbot docs instructed users to get certificates for the conduit API and put the cert in a `conduit.cert` config key. In order to get the chatbot to work, I needed to instead get an API key and put it in the `conduit.token` config entry.
Test Plan: Doc fix. Tried the new documented way and it worked.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: Korvin, epriestley
Differential Revision: https://secure.phabricator.com/D16443
Summary: Fixes T11508. The config entry `remarkup.ignored-object-names` already contains a blacklist of object names that should be ignored in the web UI. This change makes that blacklist also apply to the chatbot. This makes it possible to have a chatbot ignore things like V1, V2, Q1 and any other phrases the user may not want to generate links to objects.
Test Plan: Create objects (tasks, slowvotes, etc.) then mention the object names in chat (with the bot running). The bot should respond with helpful links to the given objects. Then add the object names to the blacklist through the config web UI. This apparently triggers the bot to restart itself. Then mention the object names in chat again. The bot should no longer respond with links because those object names have been added to the blacklist regex.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: epriestley
Maniphest Tasks: T11508
Differential Revision: https://secure.phabricator.com/D16442
Summary:
Fixes T11490. Currently, this query can not use a key and the table size may be quite large.
Adjust the query so it can use a key for both selection and ordering, and add that key.
Test Plan: Ran `EXPLAIN` on the old query in production, then added the key and ran `EXPLAIN` on the new query. Saw key in use, and "rows" examined drop from 29,273 to 15.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11490
Differential Revision: https://secure.phabricator.com/D16423
Summary:
The cluster synchronization code runs either actively (before returning a response to `git clone`, for example) or passively (routinely, as the daemons update reposiories).
The active sync runs as the web user (if running `git clone http://...`) or the VCS user (if running `git clone ssh://...`). But the passive sync runs as the daemon user.
All of these sync processes need to run actual commands as the daemon user (`git fetch ...`).
For the active ones, we must `sudo`.
For the passive ones, we're already the right user. We run the same code, and end up trying to sudo to ourselves, which `sudo` isn't happy about by default.
Depending on how `sudo` is configured and which users things are running as this might work anyway, but it's silly and if it doesn't work it requires you to go make non-obvious, weird config changes that are unintuitive and somewhat nonsensical. This is probably worse on the balance than adding a bit of complexity to the code.
Instead, test which user we're running as. If it's already the right user, don't sudo.
Test Plan:
- Ran `bin/repository update --trace` as daemon user, saw no more `sudo`.
- Ran a `git clone` to make sure that didn't break.
Reviewers: chad, avivey
Reviewed By: avivey
Differential Revision: https://secure.phabricator.com/D16391
Summary:
Ref T11458. Depends on D16388. Currently, we're very aggressive about closing connections in the taskmaster daemons.
This can end up taking up a lot of resources. In particular, because the outgoing port for outbound connections normally can not be reused for 60 seconds after a connection closes, we may exhaust outbound ports on the host if there's a big queue full of stuff that's being processed very quickly.
At a minimum, we //always// are holding open a `worker` connection, which we always need again right away. So even in the best case we end up opening/closing this about once per second and each daemon takes up about ~60 outbound ports when it should take up ~1.
So, make two adjustments:
- First, only close connections which we haven't issued a query on in the last 60 seconds. This should prevent us from closing connections that we'll need again immediately in most cases. In the worst case, we shouldn't be eating up any extra ports under default TCP behavior.
- Second, explicitly close connections. We were relying on implicit/GC behavior (maybe as a holdover from very long ago, before we got connection wrappers in place?), which probably did about the same thing but isn't as predictable and can't be profiled or instrumented.
Test Plan:
This is somewhat difficult to test completely convincingly in isolation since the problem behavior depends on production scales and the workload, and to some degree on configuration.
I tested that this stuff baiscally works by adding logging to connect/close and running the daemons, verifying that they churned connections a lot before this change (e.g., ~1/s even at no load) and churn rarely afterward (e.g., almost never at no load).
I ran some workload through them to make sure I didn't completely break anything.
The best real test is just seeing how production responds. Current inbound/outbound connections on `secure001` are 1,200:
```
secure001 $ netstat -t | grep :mysql | wc -l
1164
```
Current outbound from `repo001` are 18,600:
```
repo001 $ netstat -t | grep :mysql | wc -l
18663
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11458
Differential Revision: https://secure.phabricator.com/D16389
Summary:
Ref T11309. In that task, a user misunderstood two parts of this error:
- They took "exception" to mean "unexpected failure", when it was intended to mean "rare circumstance".
- They intereted the internal ID number of a commit to mean that Phabricator was malfunctioning.
Make the language of this condition more direct, explaining what the situation means in greater detail.
Additionally, we would previously re-throw this exception, which would make the daemon exit, wait a moment, and restart. This was normal and expected.
When //unexpected// failures occur, it's important do to this: it prevents a daemon failing in a loop from causing too many side effects (e.g., limit of 1 email per 5 seconds instead of thousands per second).
When expected, permanent failures occur, we do not need to do this: the task will not be retried. I just did it because it was slightly more consistent ("failures restart daemons") and we had few permanent failure types at the time.
We have more now, and restarting the daemons generates some additional logs which have the potential to confuse. Cycling the daemon also (intentionally) reduces the rate at which we process tasks, which can be bad for permanent failures like "deleted commit" because users can delete a huge number of commits and possibly clog up the queue with cycle-after-failure actions.
Test Plan:
Tried to process a deleted commit, saw a new message:
```
2016-07-11 9:30:22 AM [STDE] <VERB> PhabricatorTaskmasterDaemon Task 1428658 was cancelled: Commit "R55:6c46b7d0fb82a859ca3f87a95dc8dcceef8088c9" (with internal ID "282161") is no longer reachable from any branch, tag, or ref in this repository, so it will not be imported. This usually means that the branch the commit was on was deleted or overwritten.
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11309
Differential Revision: https://secure.phabricator.com/D16268
Summary:
Ref T3554. Makes `bin/worker cancel --class <classname>` work (cancel all tasks with that type).
This is useful in development if your queue is full of a bunch of gunk, and a need has occasionally arisen in production environments (usually "one option is cancel everything and move on").
Test Plan: Ran `bin/worker cancel` to cancel blocks of tasks by class name.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T3554
Differential Revision: https://secure.phabricator.com/D16267
Summary:
Trigger daemon is trying to find the next event to invoke before sleeping, but the query includes already-elapsed triggers.
It then tries to sleep for 0 seconds.
Test Plan:
On a new instance, schedule a single trigger of type `PhabricatorOneTimeTriggerClock` to a very near time.
Use top to see trigger daemon not going to 100% CPU once the event has elapsed.
Reviewers: #blessed_reviewers, epriestley
Subscribers: Korvin
Differential Revision: https://secure.phabricator.com/D15750
Summary:
Ref T10537. For Nuance, I want to introduce new sources (like "GitHub" or "GitHub via Nuance" or something) but this needs to modularize eventually.
Split ContentSource apart so applications can add new content sources.
Test Plan:
This change has huge surface area, so I'll hold it until post-release. I think it's fairly safe (and if it does break anything, the breaks should be fatals, not anything subtle or difficult to fix), there's just no reason not to hold it for a few hours.
- Viewed new module page.
- Grepped for all removed functions/constants.
- Viewed some transactions.
- Hovered over timestamps to get content source details.
- Added a comment via Conduit.
- Added a comment via web.
- Ran `bin/storage upgrade --namespace XXXXX --no-quickstart -f` to re-run all historic migrations.
- Generated some objects with `bin/lipsum`.
- Ran a bulk job on some tasks.
- Ran unit tests.
{F1190182}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15521
Summary: Ref T10537. Currently, when you have at least two cursors, the daemon can poll too frequently when processing the last source because it never hits the end-of-list condition.
Test Plan:
- Ran `bin/phd debug trigger`.
- Observed huge volumes of output before change as triggers fired as fast as possible.
- Observed reasonable poll frequency after change.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15464
Summary:
Ref T10537. More infrastructure:
- Put a `bin/nuance` in place with `bin/nuance import`. This has no useful behavior yet.
- Allow sources to be searched by substring. This supports `bin/nuance import --source whatever` so you don't have to dig up PHIDs.
Test Plan:
- Applied migrations.
- Ran `bin/nuance import --source ...` (no meaningful effect, but works fine).
- Searched for sources by substring in the UI.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15436
Summary:
Ref T10537. Some sources (like the future "GitHub Repository" source) need to poll remotes.
- Provide a mechanism for sources to emit import cursors.
- Hook them into the trigger daemon so they'll fire periodically.
- Provide some storage.
This diff does nothing useful or interesting, and is pure infrastructure.
Test Plan:
- Ran `bin/storage upgrade -f`, no adjustment issues.
- Poked around Nuance.
- Ran the trigger daemon, verified it didn't crash and checked for Nuance stuff to do.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10537
Differential Revision: https://secure.phabricator.com/D15435
Summary:
Every caller returns `true`. This was added a long time ago for Projects, but projects are no longer subscribable.
I don't anticipate needing this in the future.
Test Plan: Grepped for this method.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D15409
Summary:
Ref T6183. Ref T10054. Historically, only members could watch projects because there were some weird special cases with policies. These policy issues have been resolved and Herald is generally powerful enough to do equivalent watches on most objects anyway.
Also puts a "Watch Project" button on the feed panel to make the behavior and meaning more obvious.
Test Plan:
- Watched a project I was not a member of.
- Clicked the feed watch/unwatch button.
{F1064909}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T6183, T10054
Differential Revision: https://secure.phabricator.com/D15063
Summary: See Q266.
Test Plan: Created a bulk job, clicked "Details" instead of "Confirm", clicked "Continue" to get back to confirmation dialog.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D14985
Summary: Ref T9979. Convert all DestructionEngine behaviors to extensions.
Test Plan:
{F1033244}
Destroyed an object, verifying:
- Herald transcripts were destroyed;
- edges were destroyed;
- flags were destroyed;
- tokens were destroyed;
- transactions were destroyed;
- worker tasks were cancelled.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9979
Differential Revision: https://secure.phabricator.com/D14832
Summary:
Ref T9994. This fixes the first issue discussed on that task, which is that when a merge fails after "arc land", we would not clean up all the leases properly.
Specifically, when a merge fails, we use `queueTask()` to schedule a followup task. This followup destroys the lease and frees the underlying resource.
However, the default behavior of `queueTask()` is to //not queue tasks// if the parent task fails. This is a reasonable, safe behavior that was originally introduced in D8774, where it kept us from sending too much mail if a task did "send some mail" and then failed a little later on and got retried.
Since I think the default behavior is correct, I just special cased the behavior for Drydock to make it queue even on failure. These are the only types of followup tasks we currently want to queue on main task failure.
(It's possible that future Blueprints might want some kind of more specialized behavior, where some tasks queue only on success, but we can cross that bridge when we come to it.)
Test Plan:
- See T9994#149878 for test case setup.
- I ran that test case again with this patch, and saw the followup task queue properly in the `--trace` log, a correspoinding update task show up in `/daemon/`, and the lease get destroyed when I ran it a moment later.
{F1029915}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9994
Differential Revision: https://secure.phabricator.com/D14818
Summary: This logic is flipped.
Test Plan:
- Before change: ran `bin/phd debug task`, saw queries to the config table every second.
- After change: ran `bin/phd debug task`, saw queries to the config table every 10 seconds.
Reviewers: chad, joshuaspence
Reviewed By: chad, joshuaspence
Differential Revision: https://secure.phabricator.com/D14542
Summary:
Fixes T9808.
An instance imported a very large repository, generating approximately 4 million tasks over the course of a few days. A week later, these tasks started expiring and became candidates for garbage collection.
The GC works by deleting 100 rows at at time over and over again. It finds the rows it's going to delete by querying for old rows.
Currently, this query generates a `WHERE dateCreated < X ORDER BY id DESC` query. This query can not efficiently execute using a single key, because it relies on `dateCreated` order to find the rows, then on `id` order to sort them. With a table with 4M rows, this is slow.
This would still be OK, except that the query has to execute a lot of times since it only deletes 100 rows each time. Particularly, it needs to execute a total of ~40K times.
Instead, generate `WHERE dateCreated < X ORDER BY dateCreated DESC, id DESC`. This should have the same effect in general and the GC definitely doesn't care about the difference, but it should be more efficient at large scales.
Test Plan:
I had to `TRUNCATE` the problem table so I don't have a perfect repro to completely convincingly test this anymore. Both queries behave fine at small scales, which is why we haven't seen this before.
I was able to run the newer query in production before I nuked the table and have it complete in a reasonable amount of time, while the old query hung longer than I wanted to wait (several minutes?). The query plan for the new query was also a good one, while the query plan for the old query was terrible.
I loaded the daemon console and ran `bin/garbage collect --collector worker.tasks --trace`. I verified the queries looked reasonable and produced reasonable results in production.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9808
Differential Revision: https://secure.phabricator.com/D14505