Summary:
Ref T7352. Make sure modern `phd stop` can still read the old PID file format and stop the daemons, at least for now.
Without this, `stop` still detects them and tells you to `stop --force`, which works, but this makes things a good deal cleaner.
Test Plan: Ran `phd stop` from master, then `phd stop` from this revision. Saw old daemons stop cleanly.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11873
Summary:
Fixes T7352. This reduces the memory footprint for instances by combining these two similar daemons into one daemon which handles the responsibilities of both.
The fit isn't 100% perfect here but it's pretty close, and the GC daemon is fairly trivial.
Test Plan:
- Adjusted all the numbers to small numbers (5 second sleep, 120 second GC length).
- Added a ton of logging.
- Started trigger daemon.
- Saw it run a GC cycle.
- Saw it reschedule another cycle after 120 seconds (adjusted down from 4 hours).
- Reverted all the logging/small numbers.
- Ran `bin/phd start`, saw stable trigger daemon running.
- Grepped for removed daemon class name.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11872
Summary: Ref T7352. This is pretty straightforward. I renamed `phd.start-taskmasters` to `phd.taskmasters` for clarity.
Test Plan:
- Ran `phd start`, `phd start --autoscale-reserve 0.25`, `phd restart --autoscale-reserve 0.25`, etc.
- Examined PID file to see options were passed.
- I'm defaulting this off (0 reserve) and making it a flag rather than an option because it's a very advanced feature which is probably not useful outside of instancing.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11871
Summary:
Ref T7352. We were previously identifying things by `<daemonClass, overseerPID, startTime>` but that's not unique in a world where one overseer can run multiple daemons.
We already have an internal "daemonID", it just doesn't get written into the DB right now.
Start writing it, then use it to clean up `phd status`.
Test Plan: Ran `phd status`, got more accurate/useful output than previously.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11865
Summary: Ref T7352. This makes `phd stop` and `phd status` produce more reasonable output with the new PID file format.
Test Plan: Ran `phd stop`, `phd status`, etc.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11856
Summary:
Ref T7352. This changes `phd` to pass configuration to overseers over stdin. We still run one overseer per daemon.
The "status" stuff needs some cleanup, but it's mostly just UI/cosmetic.
Test Plan:
- Ran `phd debug`, `phd launch`, `phd start`, `phd status`, `phd stop`, etc.
- Verified PID files write in a reasonable format.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T7352
Differential Revision: https://secure.phabricator.com/D11855
Summary:
Right now, taskmasters on empty queues sleep for 30 seconds. With a default setup (4 taskmasters), this averages out to 7.5 seconds between the time you do anything that queues something and the time that the taskmasters start work on it.
On instances, which currently launch a smaller number of taskmasters, this wait is even longer.
Instead, sleep for the number of seconds that there are taskmasters, with a random offset. This makes the average wait to start a task from an empty queue 1 second, and the average maximum load of an empty queue also one query per second.
On loaded instances this doesn't matter, but this should dramatically improve behavior for less-loaded instances without any real tradeoffs.
Test Plan: Started several taskmasters, saw them jitter out of sync and then use short sleeps to give an empty queue about a 1s delay.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D11772
Summary:
Ref T7152. Ref T3554.
- When an administrator clicks "send invites", queue tasks to send the invites.
- Then, actually send the invites.
- Make the links in the invites work properly.
- Also provide `bin/worker execute` to make debugging one-off workers like this easier.
- Clean up some UI, too.
Test Plan:
We now get as far as the exception which is a placeholder for a registration workflow.
{F291213}
{F291214}
{F291215}
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T3554, T7152
Differential Revision: https://secure.phabricator.com/D11736
Summary: Fixes T7111. Also nails out the TODO to show the username
Test Plan: fired up the ole phabot, chatted "P123" and saw "P123: https://secure.phabricator.com/P123 - Masterwork From Distant Lands by epriestley"...!
Reviewers: epriestley
Reviewed By: epriestley
Subscribers: Korvin, epriestley
Maniphest Tasks: T7111
Differential Revision: https://secure.phabricator.com/D11664
Summary:
Ref T6881. This adds the worker, and a script to make it easier to test. It doesn't actually invoice anything.
I'm intentionally allowing the script to double-bill since it makes testing way easier (by letting you bill the same period over and over again), and provides a tool for recovery if billing screws up.
(This diff isn't very interesting, just trying to avoid a 5K-line diff at the end.)
Test Plan: Used `bin/phortune invoice ...` to get the worker to print out some date ranges which it would theoretically invoice.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11577
Summary: I got these wrong and the test didn't trigger for some reason that I haven't looked into.
Test Plan: `arc unit --everything`
Reviewers: hach-que, btrahan
Reviewed By: btrahan
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D11453
Summary:
Ref T6881. I tried to cheat here by not implementing this, but we need it for destroying triggers directly with `bin/remove destroy`, since that needs to load them by PHID.
So, cheat slightly less. Implement PolicyAware but not CursorPagedPolicyAware.
Test Plan:
- Used `bin/remove destroy` to destroy a trigger by PHID.
- Browsed daemon console.
- Ran trigger daemon.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11445
Summary:
Ref T6881. This makes it easier to fire a trigger and make sure it works properly. You can use the `--now` flag to travel through time, and test scheduling conditions with `--last` and `--next`. It will tell you when the trigger would reschedule.
Better than waiting 24 hours to see if things work.
Test Plan: Fired some backups, got useful output which made me think my code probably works correctly.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11438
Summary: Ref T6881. This is useful to show a "Next backup: 2:30 AM" sort of thing without requring callers to know how triggers work internally.
Test Plan: Showed that kind of thing in Instances.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11437
Summary:
Ref T6881. By design, the EXECUTION order only selects tasks which have been scheduled (since it performs a JOIN). This is inconsistent with other queries and problematic for withID/withPHID queries which may want to select an unscheduled task.
Switch to standard ID ordering by default.
Test Plan:
- Instances console now finds unscheduled triggers.
- Verified that all existing queries specify an explicit order.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11436
Summary: Ref T6881. When stuff with triggers is destroyed, it should destroy the triggers.
Test Plan: Will test in Instances.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11435
Summary: Ref T6881. Add a standard "just queue a task" trigger action; I expect almost all application code to use this.
Test Plan: Will test in Instances.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11429
Summary: Ref T6881. I just want to show trigger info in the instance management console.
Test Plan: Will test in Instances.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11428
Summary: Ref T6881. Before implementing subscriptions, I'm going to vet triggers by using them to do backups. Each instance will get a daily trigger for backups, and that should give us a smaller-scale test to catch issues and limitations, with more opportunities for something to go wrong since it fires more often.
Test Plan: Added unit tests.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11427
Summary:
Ref T6881. Hopefully, this is the hard part.
This adds a new daemon (the "trigger" daemon) which processes triggers, schedules them, and then executes them at the scheduled time. The design is a little complicated, but has these goals:
- High resistance to race conditions: only the application writes to the trigger table; only the daemon writes to the event table. We won't lose events if someone saves a meeting at the same time as we're sending a reminder out for it.
- Execution guarantees: scheduled events are guaranteed to execute exactly once.
- Support for arbitrarily large queues: the daemon will make progress even if there are millions of triggers in queue. The cost to update the queue is proportional to the number of changes in it; the cost to process the queue is proportional to the number of events to execute.
- Relatively good observability: you can monitor the state of the trigger queue reasonably well from the web UI.
- Modular Infrastructure: this is a very low-level construct that Calendar, Phortune, etc., should be able to build on top of.
It doesn't have this stuff yet:
- Not very robust to bad actions: a misbehaving trigger can stop the queue fairly easily. This is OK for now since we aren't planning to make it part of any other applications for a while. We do still get execute-exaclty-once, but it might not happen for a long time (until someone goes and fixes the queue), when we could theoretically continue executing other events.
- Doesn't start automatically: normal users don't need to run this thing yet so I'm not starting it by default.
- Not super well tested: I've vetted the basics but haven't run real workloads through this yet.
- No sophisticated tooling: I added some basic stuff but it's missing some pieces we'll have to build sooner or later, e.g. `bin/trigger cancel` or whatever.
- Intentionally not realtime: This design puts execution guarantees far above realtime concerns, and will not give you precise event execution at 1-second resolution. I think this is the correct goal to pursue architecturally, and certainly correct for subscriptions and meeting reminders. Events which execute after they have become irrelevant can simply decline to do anything (like a meeting reminder which executes after the meeting is over).
In general, the expectation for applications is:
- When creating an object (like a calendar event) that needs to trigger a scheduled action, write a trigger (and save the PHID if you plan to update it later).
- The daemon will process the event and schedule the action efficiently, in a race-free way.
- If you want to move the action, update the trigger and the daemon will take care of it.
- Your action will eventually dump a task into the task queue, and the task daemons will actually perform it.
Test Plan:
Using a test script like this:
```
<?php
require_once 'scripts/__init_script__.php';
$trigger = id(new PhabricatorWorkerTrigger())
->setAction(
new PhabricatorLogTriggerAction(
array(
'message' => 'test',
)))
->setClock(
new PhabricatorMetronomicTriggerClock(
array(
'period' => 33,
)))
->save();
var_dump($trigger);
```
...I queued triggers and ran the daemon:
- Verified triggers fire;
- verified triggers reschedule;
- verified trigger events show up in the web UI;
- tried different periods;
- added some triggers while the daemon was running;
- examined `phd debug` output for anything suspicious.
It seems to work in trivial use case, at least.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11419
Summary:
Ref T6881. This will probably make more sense in a couple of diffs, but this is a class that implements scheduling/recurrence rules. Two rules are provided:
- Trigger an event at a specific time (e.g., a meeting reminder notification).
- Trigger an event on the Nth day of every month (e.g., a subscription bill).
At some point, we'll presumably add a rule for T2896 (maybe using the "RRULE" spec) so you can do stuff like "the second to last thursday of every month", etc., but we don't need that for now.
(The "Nth day of every month, or move it back if no such day exists" rule doesn't seem to be expressible with the "RRULE" format, so implementing that wouldn't give us a superset of this. I think this rule is correct and desirable for this purpose, though.)
Test Plan: Added and executed unit tests.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6881
Differential Revision: https://secure.phabricator.com/D11403
Summary:
This is unusual, but if `getWorkerInstance()` throws we end up with an undefined `$worker` when recovering from the exception.
Instead, handle this case slightly more gracefully.
The easiest way to hit this is to schedule a task for a worker that doesn't exist (or remove an existing worker, which is what I did to hit it).
Test Plan: Saw a more graceful error recovery; ran some normal successful tasks out of the queue.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D11413
Summary: Ref T6822.
Test Plan: `grep`. This method is only called from within `PhutilArgumentWorkflow::__construct`.
Reviewers: epriestley, #blessed_reviewers
Reviewed By: epriestley, #blessed_reviewers
Subscribers: Korvin, epriestley
Maniphest Tasks: T6822
Differential Revision: https://secure.phabricator.com/D11415
Summary: Ref T6822. This method is only called from `PhutilDaemon::execute()` and can be made `protected`.
Test Plan: See D11404.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin, epriestley
Maniphest Tasks: T6822
Differential Revision: https://secure.phabricator.com/D11405
Summary: Fixes T6842. Allow the daemons to always be terminated, even if MySQL is down. I was hoping to be able to optionally enable this behavior with the `--force` flag, but this seems messy.
Test Plan:
```lang=bash
> ./bin/phd start
Freeing active task leases...
Freed 1 task lease(s).
Preparing to launch daemons.
NOTE: Logs will appear in '/var/tmp/phd/log/daemons.log'.
Starting daemons as phd
Launching daemon "PhabricatorRepositoryPullLocalDaemon".
Starting daemons as phd
Launching daemon "PhabricatorGarbageCollectorDaemon".
Starting daemons as phd
Launching daemon "PhabricatorTaskmasterDaemon".
Done.
> service mysql stop
mysql stop/waiting
> ./bin/phd stop
Interrupting daemon 'PhabricatorRepositoryPullLocalDaemon' (4263)...
Interrupting daemon 'PhabricatorGarbageCollectorDaemon' (4271)...
Interrupting daemon 'PhabricatorTaskmasterDaemon' (4287)...
Daemon 4263 exited.
Daemon 4271 exited.
Daemon 4287 exited.
```
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: Korvin, epriestley
Maniphest Tasks: T6842
Differential Revision: https://secure.phabricator.com/D11385
Summary:
Ref T2783. Currently, the repository edit page does some checks agaisnt the local system to look for binaries and files on disk. These checks don't make sense in a cluster environment.
Ideally, we could make a Conduit call to the host (e.g., add something like `diffusion.querysetupstatus`) to do these checks, but since they're pretty basic config things and cluster installs are advanced, it doesn't seem super worthwhile for now.
Test Plan: Saw fewer checks in a cluster repo.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T2783
Differential Revision: https://secure.phabricator.com/D11102
Summary: Ref T5402.
Test Plan:
- Queried archived tasks.
- Grepped for use sites and verified no other callsites are order-sensitive.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T5402
Differential Revision: https://secure.phabricator.com/D11089
Summary: Ref T5402. This more or less "fixes" it but there's probably some polish to do?
Test Plan:
stopped and started daemons. error logs look good.
ran bin/storage upgrade. noted that `adjust` added the appropriate indices for active and archive task.
Reviewers: epriestley
Reviewed By: epriestley
Subscribers: Korvin, epriestley
Maniphest Tasks: T5402
Differential Revision: https://secure.phabricator.com/D11044
Summary: Ref T5402. This cleans up some code and sets us up to use this sort of data more easily later.
Test Plan: viewed the daemon console from the web and the log of a specific archived daemon. both looked good. for other callsites looked really, really carefully.
Reviewers: epriestley
Reviewed By: epriestley
Subscribers: Korvin, epriestley
Maniphest Tasks: T5402
Differential Revision: https://secure.phabricator.com/D11042
Summary:
Ref T6742. Root cause of the issue:
- Daemon was running on a machine with a very long host name, which produced a lease name which was longer than 64 characters.
- MySQL wasn't set in STRICT_ALL_TABLES.
- The daemon would `UPDATE .. SET leaseOwner = <very long string>` to lock a task, and MySQL would silently truncate.
- The daemon would then try to select the locked task, but fail, because there's no matching lease owner.
To resolve this, use only the first 32 characters of the hostname. See IRC for more discussion.
Test Plan: Will confirm with reporter.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T6742
Differential Revision: https://secure.phabricator.com/D10998
Summary:
Fixes T6702. Ref T3554. Currently, tasks can be cancelled, retried and freed from the web UI by any logged in user.
This isn't appreciably dangerous (I can't come up with a way that a user could do anything security-affecting), but I think I probably intended this to be admin-only, but these actions should move to the CLI anyway.
Move them to the CLI. Lay some groundwork for some future `bin/worker cancel --class SomeTaskClass`, but don't implement that yet.
Test Plan: Used `cancel`, `retry` and `free` from the CLI. Hit all the error/success states.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T3554, T6702
Differential Revision: https://secure.phabricator.com/D10939
Summary:
Ref T6615. Mixing ASC and DESC ordering on a multipart key makes it dramatically less effective (or perhaps totally ineffective).
Reverse the meaning of the `priority` column so it goes in the same direction as the `id` column (both ascending, lower values execute sooner).
Test Plan:
- Queued 1.2M tasks with `bin/worker flood`.
- Processed ~1 task/second with `bin/phd debug taskmaster` before patch.
- Applied patch, took ~5 seconds for ~1.2M rows.
- Processed ~100-200 tasks/second with `bin/phd debug taskmaster` after patch.
- "Next in Queue" query on daemon page dropped from 1.5s to <1ms.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: aklapper, 20after4, epriestley
Maniphest Tasks: T6615
Differential Revision: https://secure.phabricator.com/D10895
Summary: Ref T6615. Ref T3554. We need better tooling around the queue eventually, so start here.
Test Plan: Added 100K+ tasks locally with `bin/worker flood`. Executed some of them with `bin/phd debug taskmaster` (we already have a TestWorker, used in unit tests).
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T3554, T6615
Differential Revision: https://secure.phabricator.com/D10894
Summary:
Fixes an issue with T5336 / D9871. We did 99% of the work here but didn't actually turn on the priority sorting. The unit test passed by default, which didn't catch this.
- Fix the unit test (it failed).
- Fix the query (test now passes).
- Add a "Next in Queue" element to the UI to make this kind of thing easier to spot/understand.
Test Plan: Ran unit test. Viewed "Next in Queue". Queued some tasks, flushed the queue. Web UI tracked the state sensibly.
Reviewers: joshuaspence, btrahan
Reviewed By: btrahan
Subscribers: cburroughs, epriestley
Differential Revision: https://secure.phabricator.com/D10766
Summary:
Ref T1191. Now that the whole database is covered, we don't need to do as much work to build expected schemata. Doing them database-by-database was helpful in converting, but is just reudndant work now.
Instead of requiring every application to build its Lisk objects, just build all Lisk objects.
I removed `harbormaster.lisk_counter` because it is unused.
It would be nice to autogenerate edge schemata, too, but that's a little trickier.
Test Plan: Database setup issues are all green.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley, hach-que
Maniphest Tasks: T1191
Differential Revision: https://secure.phabricator.com/D10620
Summary:
Ref T1191. When changing the column type of an AUTO_INCREMENT column, we currently may lose the autoincrement attribute.
Instead, support it. This is a bit messy because AUTO_INCREMENT columns interact with PRIMARY KEY columns (tables may only have one AUTO_INCREMENT column, and it must be a primary key). We need to migrate in more phases to avoid this issue.
Introduce new `auto` and `auto64` types to represent autoincrement IDs.
Test Plan:
- Saw autoincrement show up correctly in web UI.
- Fixed an autoincrement issue on the XHProf storage table with `bin/storage adjust` safely.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T1191
Differential Revision: https://secure.phabricator.com/D10607
Summary:
Ref T1191. Ref T6203. While generating expected schemata, I ran into these columns which seem to have sketchy nullability.
- Mark most of them for later resolution (T6203). They work fine today and don't need to block T1191. Changing them can break the application, so we can't autofix them.
- Forgive a couple of them that are sort-of reasonable or going to get wiped out.
Test Plan: Saw 94 remaining warnings.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: hach-que, epriestley
Maniphest Tasks: T1191, T6203
Differential Revision: https://secure.phabricator.com/D10593
Summary:
Ref T1191. Some notes here:
- Drops the old LDAP and OAuth info tables. These were migrated to the ExternalAccount table a very long time ago.
- Separates surplus/missing keys from other types of surplus/missing things. In the long run, my plan is to have only two notice levels:
- Error: something we can't fix (missing database, table, or column; overlong key).
- Warning: something we can fix (surplus anything, missing key, bad column type, bad key columns, bad uniqueness, bad collation or charset).
- For now, retaining three levels is helpful in generating all the expected scheamta.
Test Plan:
- Saw ~200 issues resolve, leaving ~1,300.
- Grepped for removed tables.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T1191
Differential Revision: https://secure.phabricator.com/D10580
Summary: T1191. Nothing very notable here.
Test Plan: Saw more blue in web UI.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D10522
Summary: Fixes T5884. Macro images are no longer public on most installs. We could generate tokens for them, but this (using Conduit to pull the file data) is easier and more correct.
Test Plan: Logged a bot into IRC and had it spam part of a macro before being killed for flooding.
Reviewers: btrahan
Reviewed By: btrahan
Subscribers: epriestley
Maniphest Tasks: T5884
Differential Revision: https://secure.phabricator.com/D10274