Summary:
See discussion in D8773. Three small adjustments which should help prevent this kind of issue:
- When queueing followup tasks, hold them on the worker until we finish the task, then queue them only if the work was successful.
- Increase the default lease time from 60 seconds to 2 hours. Although most tasks finish in far less than 60 seconds, the daemons are generally stable nowadays and these short leases don't serve much of a purpose. I think they also date from an era where lease expiry and failure were less clearly distinguished.
- Increase the default wait-after-failure from 60 seconds to 5 minutes. This largely dates from the MetaMTA era, where Facebook ran services with high failure rates and it was appropriate to repeatedly hammer them until things went through. In modern infrastructure, such failures are rare.
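The first adjustment can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual PHP implementation: `Worker`, `FlakyWorker`, `queue_followup`, `run` and the list-based queue are all hypothetical names.

```python
class Worker:
    """Hypothetical worker that buffers followup tasks while it runs."""

    def __init__(self, queue):
        self.queue = queue    # shared task queue (a plain list in this sketch)
        self.pending = []     # followups held on the worker

    def queue_followup(self, task):
        # Hold the followup locally instead of queueing it right away.
        self.pending.append(task)

    def do_work(self):
        raise NotImplementedError

    def run(self):
        try:
            self.do_work()
        except Exception:
            # The work failed: drop the held followups so that a later
            # retry does not queue them a second time.
            self.pending = []
            raise
        # The work succeeded: only now queue the followups.
        self.queue.extend(self.pending)
        self.pending = []


class FlakyWorker(Worker):
    """Hypothetical worker that queues one followup and may then fail."""

    def __init__(self, queue, fail):
        super().__init__(queue)
        self.fail = fail

    def do_work(self):
        self.queue_followup("followup")
        if self.fail:
            raise RuntimeError("temporary failure")
```

With this shape, a task that fails after calling `queue_followup()` leaves the shared queue untouched.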
Test Plan:
- Verified that tasks queued properly after the main task was updated.
- Verified that leases default to 7200 seconds.
- Intentionally failed a task and verified default 300 second wait before retry.
- Removed all default leases shorter than 7200 seconds (there was only one).
- Checked all the wait before retry implementations for anything much shorter than 5 minutes (they all seem reasonable).
Reviewers: btrahan, sowedance
Reviewed By: sowedance
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D8774
Summary:
There are quite a few tests in Arcanist, libphutil and Phabricator that do something similar to `$this->assertEqual(false, ...)` or `$this->assertEqual(true, ...)`.
This is unnecessarily verbose and it would be cleaner if we had `assertFalse` and `assertTrue` methods.
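As a rough illustration of the idea (in Python rather than PHP, with a hypothetical `TestCase` class and method names), the helpers are thin wrappers around a comparison. This sketch chooses a strict check, so only the boolean values themselves pass; whether the real methods behave this way is an assumption.

```python
class TestCase:
    """Hypothetical, minimal test-case base class."""

    def assert_equal(self, expect, actual, message=None):
        if expect != actual:
            raise AssertionError(
                message or f"Expected {expect!r}, got {actual!r}")

    def assert_true(self, result, message=None):
        # Strict check: only the boolean True passes, not truthy values.
        if result is not True:
            raise AssertionError(message or f"Expected True, got {result!r}")

    def assert_false(self, result, message=None):
        # Strict check: only the boolean False passes, not falsy values.
        if result is not False:
            raise AssertionError(message or f"Expected False, got {result!r}")
```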
Test Plan: I contemplated adding a unit test for the `getCallerInfo` method but wasn't sure if it was required / where it should live.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley
CC: Korvin, epriestley, aran
Differential Revision: https://secure.phabricator.com/D8460
Summary:
Ref T2015. Currently, Drydock has a `wait-for-lease` workflow which is invoked in the background by the `lease` workflow.
The goal of this mechanism is to allow `bin/drydock lease` to print out logs as the lease is acquired. However, this predates the `runAllTasksInProcess` flags, and they provide a simpler and more robust way (potentially with `--trace` and `PhutilConsole`) to do synchronous execution and debug logging.
Simplify this whole mechanism: just run everything in-process in `bin/drydock lease`, and do logging via `--trace`. We could thread a `PhutilConsole` through things too, but this seems good enough for now.
Also various cleanup/etc.
Test Plan: Ran `bin/drydock lease`. Ran `bin/harbormaster build X --plan Y`, for `Y` being a Drydock-dependent build plan.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D7835
Summary:
Fixes two issues:
- When rendering a task's details, we currently issue a policy-oblivious query. Instead, issue a policy-aware query.
- The formatting is a little bit weird, with the top half in a box and the bottom half with an older style. Make them consistent.
Test Plan: Looked at the detail pages for several tasks in queue.
Reviewers: btrahan, chad
Reviewed By: chad
CC: aran
Differential Revision: https://secure.phabricator.com/D7812
Summary:
Ref T1049. See discussion in D7745. We have some specific interest in this for D7745, but generally we want to consume tasks with expired leases in roughly FIFO order, just like we consume new tasks in roughly FIFO order. Currently, when we select expired tasks we order them by `id`, but this is the original insert order, not lease expiration order. Instead, order by `leaseExpires`.
This query is actually much better than the old one was, since the WHERE part is `leaseExpires < VALUE`.
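A minimal sketch of the ordering change, using Python's `sqlite3` in place of MySQL. The `task` table, the `key_expires` index and the `select_expired` helper are hypothetical; only the `leaseExpires` column name comes from the text above.

```python
import sqlite3


def select_expired(conn, now, limit):
    """Return IDs of tasks whose lease has expired, oldest expiry first."""
    rows = conn.execute(
        "SELECT id FROM task WHERE leaseExpires < ? "
        "ORDER BY leaseExpires ASC LIMIT ?",
        (now, limit))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, leaseExpires INTEGER)")
# The WHERE clause and the ORDER BY both use the same column.
conn.execute("CREATE INDEX key_expires ON task (leaseExpires)")
conn.executemany(
    "INSERT INTO task VALUES (?, ?)",
    [(1, 300), (2, 100), (3, 200), (4, 900)])
```

Here `select_expired(conn, 500, 10)` yields the tasks in expiry order (2, 3, 1), whereas ordering by `id` would have yielded insert order (1, 2, 3).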
Test Plan: Ran `EXPLAIN` on the query. Ran a taskmaster in debug mode and saw it lease new and expired tasks successfully.
Reviewers: hach-que, btrahan
Reviewed By: hach-que
CC: aran
Maniphest Tasks: T1049
Differential Revision: https://secure.phabricator.com/D7746
Summary: Fixes T3610. This got un-scoped at some point, and is only used in Drydock so it escaped notice.
Test Plan: Ran `bin/drydock lease --type host` without hitting an exception about incorrect query construction.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T3610
Differential Revision: https://secure.phabricator.com/D6552
Summary: Fixes T3557. One thing which made T3557 kind of a mess was the lack of information about progress through temporary failures. Add a column which records a task's last failure time, and surface it in the console.
Test Plan: {F51277}
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T3557
Differential Revision: https://secure.phabricator.com/D6550
Summary: Fixes T2569. This is the other common exception source which is ambiguous. List the task ID explicitly to make debugging easier.
Test Plan: {F51268}
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2569
Differential Revision: https://secure.phabricator.com/D6549
Summary: Ref T2852. When a Differential revision is linked to an Asana task, show the related task in Differential.
Test Plan: {F49234}
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2852
Differential Revision: https://secure.phabricator.com/D6387
Summary:
Ref T2852. Asana sync tasks currently have a standard retry/backoff schedule, but the defaults are quite aggressive (retry every 60s forever). Instead, retry at increasing intervals and stop retrying after a few tries.
- Retry at intervals and stop retrying after a few iterations.
- Modernize some interfaces.
- Add better information about retry behaviors to the web UI.
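The retry schedule in the first bullet could look roughly like the sketch below. The doubling rule, the 60 second base and the limit of four tries are assumptions made for illustration; the actual intervals are not stated here. `get_wait_before_retry` is written in Python and is not the PHP hook itself.

```python
def get_wait_before_retry(failure_count, max_tries=4, base_wait=60):
    """Return seconds to wait before the next retry, or None to stop.

    `failure_count` is the number of failures so far, starting at 1.
    """
    if failure_count >= max_tries:
        # Too many failures: give up instead of retrying forever.
        return None
    # Double the wait after each failure: 60s, 120s, 240s, ...
    return base_wait * (2 ** (failure_count - 1))
```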
Test Plan: {F49194}
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2852
Differential Revision: https://secure.phabricator.com/D6381
Summary:
Ref T2852. Add a `log()` method to `PhabricatorWorker` to make debugging easier.
I renamed the similar Drydock-specific method.
Test Plan:
Used logging in a future revision:
...
<<< [36] <http> 211,704 us
Updating main task.
>>> [37] <http> https://app.asana.com/api/1.0/tasks/6153776820388
...
Reviewers: btrahan, chad
Reviewed By: chad
CC: aran
Maniphest Tasks: T2852
Differential Revision: https://secure.phabricator.com/D6296
Summary:
Ref T2852. I want to model Asana integration as a response to feed events. Currently, we queue one feed event for each HTTP hook.
Instead, always queue one feed event and then have it queue any necessary followup events (now, http hooks; soon, asana).
Add a script to make it easy to reproducibly fire feed event publishing.
Test Plan:
Republished a feed event and verified it hit configured HTTP hooks correctly.
$ ./bin/feed republish 5765774156541908292 --trace
>>> [2] <connect> phabricator2_feed
<<< [2] <connect> 1,660 us
>>> [3] <query> SELECT story.* FROM `feed_storydata` story JOIN `feed_storyreference` ref ON ref.chronologicalKey = story.chronologicalKey WHERE (ref.chronologicalKey IN (5765774156541908292)) GROUP BY story.chronologicalKey ORDER BY story.chronologicalKey DESC
<<< [3] <query> 595 us
>>> [4] <connect> phabricator2_differential
<<< [4] <connect> 760 us
>>> [5] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [5] <query> 478 us
>>> [6] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [6] <query> 449 us
>>> [7] <connect> phabricator2_user
<<< [7] <connect> 1,062 us
>>> [8] <query> SELECT * FROM `user` WHERE phid in ('PHID-USER-lqiz3yd7wmk64ejugvov')
<<< [8] <query> 540 us
>>> [9] <connect> phabricator2_file
<<< [9] <connect> 951 us
>>> [10] <query> SELECT * FROM `file` WHERE phid IN ('PHID-FILE-gq6dlsysvxbn3dgwvky7')
<<< [10] <query> 498 us
>>> [11] <query> SELECT * FROM `user_status` WHERE userPHID IN ('PHID-USER-lqiz3yd7wmk64ejugvov') AND UNIX_TIMESTAMP() BETWEEN dateFrom AND dateTo
<<< [11] <query> 507 us
Republishing story...
>>> [12] <query> SELECT story.* FROM `feed_storydata` story JOIN `feed_storyreference` ref ON ref.chronologicalKey = story.chronologicalKey WHERE (ref.chronologicalKey IN (5765774156541908292)) GROUP BY story.chronologicalKey ORDER BY story.chronologicalKey DESC
<<< [12] <query> 685 us
>>> [13] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [13] <query> 489 us
>>> [14] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [14] <query> 512 us
>>> [15] <query> SELECT * FROM `user` WHERE phid in ('PHID-USER-lqiz3yd7wmk64ejugvov')
<<< [15] <query> 601 us
>>> [16] <query> SELECT * FROM `file` WHERE phid IN ('PHID-FILE-gq6dlsysvxbn3dgwvky7')
<<< [16] <query> 405 us
>>> [17] <query> SELECT * FROM `user_status` WHERE userPHID IN ('PHID-USER-lqiz3yd7wmk64ejugvov') AND UNIX_TIMESTAMP() BETWEEN dateFrom AND dateTo
<<< [17] <query> 551 us
>>> [18] <query> SELECT story.* FROM `feed_storydata` story JOIN `feed_storyreference` ref ON ref.chronologicalKey = story.chronologicalKey WHERE (ref.chronologicalKey IN (5765774156541908292)) GROUP BY story.chronologicalKey ORDER BY story.chronologicalKey DESC
<<< [18] <query> 507 us
>>> [19] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [19] <query> 428 us
>>> [20] <query> SELECT * FROM `differential_revision` WHERE phid IN ('PHID-DREV-ywqmrj5zgkdloqh5p3c5')
<<< [20] <query> 419 us
>>> [21] <query> SELECT * FROM `user` WHERE phid in ('PHID-USER-lqiz3yd7wmk64ejugvov')
<<< [21] <query> 591 us
>>> [22] <query> SELECT * FROM `file` WHERE phid IN ('PHID-FILE-gq6dlsysvxbn3dgwvky7')
<<< [22] <query> 406 us
>>> [23] <query> SELECT * FROM `user_status` WHERE userPHID IN ('PHID-USER-lqiz3yd7wmk64ejugvov') AND UNIX_TIMESTAMP() BETWEEN dateFrom AND dateTo
<<< [23] <query> 593 us
>>> [24] <http> http://127.0.0.1/derp/
<<< [24] <http> 746,157 us
[2013-06-24 20:23:26] EXCEPTION: (HTTPFutureResponseStatusHTTP) [HTTP/500] Internal Server Error
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2852
Differential Revision: https://secure.phabricator.com/D6291
Summary: MySQL is not able to use indexes when searching for tuples.
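One common way to avoid a tuple comparison such as `(a, b) IN ((1, 2), (3, 4))` is to expand it into per-column equality checks. The sketch below shows that expansion in Python; it is an assumption about the general technique, not necessarily the exact change made here, and `build_pair_clause` is a hypothetical helper.

```python
def build_pair_clause(column_a, column_b, pairs):
    """Build `(a = ? AND b = ?) OR ...` plus the flat parameter list."""
    if not pairs:
        raise ValueError("need at least one pair")
    one = f"({column_a} = ? AND {column_b} = ?)"
    clause = " OR ".join(one for _ in pairs)
    # Flatten [(1, 2), (3, 4)] into [1, 2, 3, 4] to match the placeholders.
    params = [value for pair in pairs for value in pair]
    return clause, params
```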
Test Plan:
Explained the query before and after, saw `key_len` 16 instead of 8.
Also saw time 0.0 s instead of 2.9 s (but that was probably caused by warming up).
Reviewers: epriestley
Reviewed By: epriestley
CC: aran, Korvin
Differential Revision: https://secure.phabricator.com/D5580
Summary:
I wasn't able to reproduce the "recursion detected" in a real web request, but I saw lots of 1073741824 refcounts in `debug_zval_dump()` of $object.
I'm not sure how that happens.
Test Plan: D4807#4
Reviewers: epriestley
Reviewed By: epriestley
CC: aran, Korvin
Maniphest Tasks: T2432
Differential Revision: https://secure.phabricator.com/D4839
Summary: Fixes various array_combine() warnings for PHP < 5.4
Test Plan: lint/unit/grep
Reviewers: btrahan, vrana, chad
Reviewed By: chad
CC: aran
Differential Revision: https://secure.phabricator.com/D4660
Summary:
Some time ago, we added `ORDER BY id ASC` to the worker `UPDATE ...` query, because someone reported that their MySQL read slaves were complaining about the query (I can't find the exact error message, but something to the effect of the rows the query affected not being deterministic). This seemed harmless since it should be the same as the query's implicit order (I guess?), but actually made the query dramatically slower for large numbers of rows.
On my local machine, this query takes about 2 seconds with ~1M rows. If I run `SELECT`, or run `UPDATE` without ORDER BY, the query takes < 0.01s. I don't understand exactly what's happening -- my guess is something to do with the ORDER BY implying that a lot of rows need to be locked?
In T2372, a user is seeing 20-60s runtimes on this query.
I solved this by doing a SELECT, followed by an UPDATE. Each query runs quickly. This introduces the possibility of a race (two processes SELECT the same rows, then try to UPDATE), which we currently recover from by having the second UPDATE fail and then having that daemon try again 1 second later. This seems generally reasonable. Some alternatives I considered:
- We could SELECT ... FOR UPDATE, but failing and retrying a little later seems at least as good as blocking.
- We could select more rows than we need, and then try to lock some of them randomly. I think this would work well, but it's a bit more complex than what we're doing now so I left it until we have a clearer need.
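The SELECT-then-UPDATE approach described above can be sketched like this, using Python's `sqlite3` in place of MySQL. The `task` table and both helpers are hypothetical, and guarding the UPDATE with `leaseOwner IS NULL` is an assumption about how the losing process notices the race.

```python
import sqlite3


def select_candidates(conn, limit):
    """Cheap read: find unleased task IDs in insert order."""
    rows = conn.execute(
        "SELECT id FROM task WHERE leaseOwner IS NULL "
        "ORDER BY id ASC LIMIT ?",
        (limit,))
    return [row[0] for row in rows]


def try_lease(conn, owner, ids):
    """Try to lease the selected rows; return how many we actually got."""
    if not ids:
        return 0
    marks = ", ".join("?" for _ in ids)
    cursor = conn.execute(
        f"UPDATE task SET leaseOwner = ? "
        f"WHERE leaseOwner IS NULL AND id IN ({marks})",
        [owner] + list(ids))
    conn.commit()
    # Zero rows means another process won the race; retry a little later.
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, leaseOwner TEXT)")
conn.executemany("INSERT INTO task (id) VALUES (?)", [(1,), (2,), (3,)])
```

If two processes select the same candidates, the first `try_lease()` call leases them and the second affects no rows, so the second process can simply select again after a short wait.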
Test Plan:
Inserted ~1M tasks into the queue. Ran `phd debug taskmaster`, saw ~2s task updates. Applied patch. Ran `phd debug taskmaster`, saw <1ms updates. Ran `phd launch 8 taskmaster`, saw rapid completion of tasks.
This stuff also has fairly thorough unit test coverage.
Reviewers: vrana, btrahan
Reviewed By: vrana
CC: aran
Maniphest Tasks: T2372
Differential Revision: https://secure.phabricator.com/D4576
Summary:
Load the data for daemon worker tasks when viewing them, and present
the information in a useful way. This defaults to printing the JSON data,
but for some classes of worker it will also link to the corresponding
object, to make debugging problems with workers easier.
Test Plan:
load /daemon/task/NNN for a CommitParserWorker and a MetaMTAWorker, and
see the addition of a data field with useful content and link.
Reviewers: epriestley, vrana
Reviewed By: epriestley
CC: aran, Korvin
Differential Revision: https://secure.phabricator.com/D4226
Summary: Include task ID and class when raising this exception. I took a brief stab at doing this generically, but (a) we specifically raise this exception outside of normal try/catch because we can't follow normal recovery rules for it and (b) we don't have a reasonable PhutilProxyException or similar right now which would preserve stack traces, and don't have builtin exception nesting support until PHP 5.3.
Test Plan: Faked this exception, verified we get more information in the logs.
Reviewers: btrahan, vrana
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2193
Differential Revision: https://secure.phabricator.com/D4205
Summary:
People hit three issues with D3914:
- As per T2059, we applied a schema change from a `.php` patch, which currently does not work if you use a different user to make schema changes than for normal use.
  - Since the change in question is idempotent, just move it to a `.sql` patch. We'll follow up in T2059 and fix it properly.
- Rogue daemons at several installs used old code (expecting autoincrement) to insert into the new table (no autoincrement), thereby creating tasks with ID 0.
  - Rename the table so they'll fail.
  - This also makes the code a little more consistent.
- Some installs now have tasks with ID 0.
  - Use checks against null rather than against 0 so we can process these tasks.
The major issues this fixes are the schema upgrade failure in T2059, and the infinite loops in T2072 and elsewhere.
This isn't really a fully satisfactory fix. I'll discuss some next steps in T2072.
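The null-versus-0 check for tasks with ID 0 can be illustrated with a short Python sketch. `should_process` is a hypothetical helper; the point is only that a truthiness test treats ID 0 as "no ID".

```python
def should_process(task_id):
    """Return True when the task has an ID, including the ID 0."""
    # `if task_id:` would skip ID 0 because 0 is falsy, so compare
    # against None explicitly instead.
    return task_id is not None
```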
Test Plan: Created new tasks via MetaMTA/Differential. Ran tasks with `phd debug taskmaster`. Inserted a task 0 and verified it ran and archived correctly.
Reviewers: btrahan, vrana, nh
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2072, T2059
Differential Revision: https://secure.phabricator.com/D3973
Summary: See D3912 for discussion. InnoDB may reuse autoincrement IDs after restart; provide a way to avoid it.
Test Plan: Unit tests. Scheduled and executed tasks through `drydock lease --type host` and `phd debug taskmaster`.
Reviewers: vrana, btrahan
Reviewed By: vrana
CC: aran
Differential Revision: https://secure.phabricator.com/D3914
Summary:
- Remove EC2, RemoteHost, Application, etc., blueprints for now. They're very proof-of-concept and Blueprints are getting API changes I don't want to bother propagating for now. Leave the abstract base class and the LocalHost blueprint. I'll restore the more complicated ones once better foundations are in place.
- Remove the Allocate controller from the web UI. The original vision here was that you'd manually allocate resources in some cases, but it no longer makes sense to do so as all allocations come from leases now. This simplifies allocations and makes the rule for when we can clean up resources clear-cut (if a resource has no more active leases, it can be cleaned up). Instead, we'll build resources like the localhost and remote hosts lazily, when leases come in for them.
- Add some configuration to manage the localhost blueprint.
- Refactor `canAllocateResources()` into `isEnabled()` (for config checks) and `canAllocateMoreResources()` (for quota checks, e.g. too many resources are allocated already).
- Juggle some signatures to align better with a world where blueprints generally do allocate.
- Add some more logging and error handling.
- Fix an issue with log ordering.
Test Plan: Allocated some localhost leases.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3902
Summary:
This commit doesn't change license of any file. It just makes the license implicit (inherited from LICENSE file in the root directory).
We are removing the headers for these reasons:
- It wastes space in editors: less code is visible upon opening a file.
- It brings noise to diff of the first change of any file every year.
- It confuses Git file copy detection when creating small files.
- We don't have an explicit license header in other files (JS, CSS, images, documentation).
- Using license header in every file is not obligatory: http://www.apache.org/dev/apply-license.html#new.
This change is approved by Alma Chao (Lead Open Source and IP Counsel at Facebook).
Test Plan: Verified that the license survived only in LICENSE file and that it didn't modify externals.
Reviewers: epriestley, davidrecordon
Reviewed By: epriestley
CC: aran, Korvin
Maniphest Tasks: T2035
Differential Revision: https://secure.phabricator.com/D3886
Summary:
We lock tasks by setting `leaseOwner` to a unique value, but the value is currently unique-to-the-process rather than unique-to-the-query. This means that if a process leases a task, then leases another task, both tasks will have the same `leaseOwner`. This can cause an issue where we go to select the task we just leased and get the other task instead, if we aren't careful about the select construction.
We can avoid this by being clever and making sure the select is constructed correctly, but making the `leaseOwner` unique to the query is much simpler and more foolproof. This guarantees we always select only the rows we just leased.
Also remove `PhabricatorGoodForNothingWorker` since `PhabricatorTestWorker` fills its role of allowing things to be tested, and simplify the unit tests since we don't need to be clever about avoiding this issue any more.
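A minimal sketch of a per-query `leaseOwner`, in Python. The token format (process ID plus a counter) and both helpers are assumptions for illustration, and a dict stands in for the task table.

```python
import itertools
import os

_lease_counter = itertools.count(1)


def new_lease_owner():
    """Return a token that differs for every lease query in this process."""
    return f"{os.getpid()}:{next(_lease_counter)}"


def lease(tasks, task_id):
    """Lease one task, then select back only the rows this call leased."""
    owner = new_lease_owner()
    tasks[task_id] = owner
    # Because `owner` is unique to this call, tasks leased earlier by the
    # same process cannot match here.
    return [tid for tid, current in tasks.items() if current == owner]
```

With a per-process owner, the second `lease()` call would select both tasks; with a per-query owner, each call sees only its own.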
Test Plan: Ran unit tests.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3862
Summary:
This was the major goal of D3859/D3855, and to a lesser degree D3854/D3852.
As Drydock is allocating a resource, it may need to allocate other resources first. For example, if it's allocating a working copy, it may need to allocate a host first.
Currently, we have the process basically queue up the allocation (insert a task into the queue) and sleep() until it finishes. This is problematic for a bunch of reasons, but the major one is that if allocation takes more resources (host, port, machine, DNS) than you have daemons, they could all end up sleeping and waiting for some other daemon to do their work. This is really stupid. Even if you only take up some of them, you're spending slots sleeping when you could be doing useful work.
To partially get around this and make the CLI experience less dumb, there's this goofy `synchronous` flag that gets passed around everywhere and pushes the workflow through a pile of special cases. Basically the `synchronous` flag causes us to do everything in-process. But this is dumb too because we'd rather do things in parallel if we can, and we have to have a lot of special case code to make it work at all.
Get rid of all of this. Instead of sleep()ing, try to work on the tasks that need to be worked on. If another daemon grabbed them already that's fine, but in the worst case we just gracefully degrade and do everything in process. So we get the best of both worlds: if we have parallelizable tasks and free daemons, things will execute in parallel. If we have nonparallelizable tasks or no free daemons, things will execute in process.
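The in-process fallback can be sketched roughly as below. This is a single-process Python illustration with hypothetical names (`TaskQueue`, `DEPENDS_ON`, `allocate`); waiting for a task that another daemon has already leased is left out.

```python
class TaskQueue:
    """Hypothetical queue that tracks which tasks are already leased."""

    def __init__(self):
        self.leased = set()
        self.done = []

    def try_lease(self, name):
        if name in self.leased:
            return False    # another daemon already grabbed this task
        self.leased.add(name)
        return True


# Assumed dependency map: a working copy needs a host first.
DEPENDS_ON = {"working-copy": ["host"], "host": []}


def allocate(queue, resource):
    for dependency in DEPENDS_ON[resource]:
        # Instead of sleeping until some daemon handles the dependency,
        # try to lease it and do the work in this process.
        if queue.try_lease(dependency):
            allocate(queue, dependency)
    queue.done.append(resource)
```

When no other daemon takes the dependency, everything runs in process in dependency order; when one does, this process skips that work.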
Test Plan: Ran `drydock_control.php --trace` and saw it perform cascading allocations without sleeping or special casing.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3861
Summary:
- Clean up a TODO about permanent failures.
- Clean up a TODO about failing tasks after too many retries.
- Clean up a TODO about testing for bad leases.
- Make the lease/retry implementation more flexible and natural.
- Make completely bogus tasks fail permanently.
- Make PhabricatorMetaMTAWorker use new `getWaitBeforeRetry()` (as intended), not hackily implement logic in `getRequiredLeaseTime()`.
- Document worker hooks for failures and retries.
- Provide coverage on everything.
Test Plan: Ran unit tests. Ran `bin/phd debug taskmaster`.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3859
Summary: This simplifies the fairly thorny logic of leasing tasks a bit. I'm planning to introduce another callsite shortly for Drydock.
Test Plan: Ran `bin/phd debug taskmaster`, observed sensible queries and correct operation.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3855
Summary: Make mobile-friendly and provide UI to cancel/retry tasks. Remove display of task data to arbitrary users, as it may be sensitive.
Test Plan:
{F22502}
{F22503}
{F22504}
{F22505}
{F22506}
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3854
Summary:
Currently, when taskmasters complete a task it is immediately deleted. This prevents us from doing some general things, like:
- Supporting the idea of permanent failure (e.g., after N failures just stop trying).
- Showing the user how fast taskmasters are completing tasks.
- Showing the user how long tasks took to complete.
Having better visibility into this is important to Drydock, which builds on the task system. Also, generally buff debug output for task execution.
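One way to keep completed tasks around is to move them to an archive table along with a result and a duration. The sketch below uses Python's `sqlite3`; the table names, columns and `archive_task` helper are hypothetical and not the actual schema.

```python
import sqlite3


def archive_task(conn, task_id, result, duration_us):
    """Move a finished task from the active table to the archive table."""
    row = conn.execute(
        "SELECT id, taskClass FROM active_task WHERE id = ?",
        (task_id,)).fetchone()
    if row is None:
        raise KeyError(task_id)
    conn.execute(
        "INSERT INTO archive_task (id, taskClass, result, duration) "
        "VALUES (?, ?, ?, ?)",
        (row[0], row[1], result, duration_us))
    conn.execute("DELETE FROM active_task WHERE id = ?", (task_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE active_task (id INTEGER PRIMARY KEY, taskClass TEXT)")
conn.execute(
    "CREATE TABLE archive_task (id INTEGER PRIMARY KEY, taskClass TEXT, "
    "result TEXT, duration INTEGER)")
conn.execute("INSERT INTO active_task VALUES (1, 'ExampleWorker')")
```

Archived rows make it possible to report completion rates and durations, and a separate garbage collection step can delete them later.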
Test Plan: Ran `bin/phd debug taskmaster`. Ran `bin/phd debug garbage`. Queued some tasks via various systems.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran
Maniphest Tasks: T2015
Differential Revision: https://secure.phabricator.com/D3852
Summary:
- `kill_init.php` said "Moving 1000 files" - I hope that this is not some limit in `FileFinder`.
- [src/infrastructure/celerity] `git mv utils.php map.php; git mv api/utils.php api.php`
- Comment `phutil_libraries` in `.arcconfig` and run `arc liberate`.
NOTE: `arc diff` timed out so I'm pushing it without review.
Test Plan:
/D1234
Browsed around, especially in `applications/repository/worker/commitchangeparser` and `applications/` in general.
Auditors: epriestley
Maniphest Tasks: T1103
Summary: Last of the big final patches. Left a few debatable classes (12 out of about 400) that I'll deal with individually eventually.
Test Plan: Ran testEverythingImplemented.
Reviewers: btrahan
Reviewed By: btrahan
CC: aran, epriestley
Maniphest Tasks: T795
Differential Revision: https://secure.phabricator.com/D1881
Summary: See D133. Workers can also be subject to the same race; invert the
row relationship in the same way.
Test Plan: Launched repository master daemons and some taskmasters and used
the Daemon console to verify that they were able to process tasks. Manually
checked the database to make sure data got linked correctly and that new data
was inserted correctly.
Reviewers: jungejason
CC: tuomaspelkonen
Differential Revision: 135