1
0
Fork 0
mirror of https://we.phorge.it/source/phorge.git synced 2024-12-27 16:00:59 +01:00
Commit graph

12 commits

Author SHA1 Message Date
epriestley
4adf55919c Port Diviner Core to Phabricator
Summary:
This implements most/all of the difficult parts of Diviner on top of Phabricator instead of as standalone components. See T988. In particular, here are the things I want to fix:

**Performance** The Diviner parser works in two stages. The first stage breaks source files into "Atoms". The second stage renders atoms into a display format (e.g., HTML). Diviner currently has a good caching story on the first step of the pipeline, but zero caching in the second step. This means it's very slow, even for a fairly small project like Phabricator. We must re-render every piece of documentation every time, instead of only changed documentation. Most of this diff concerns itself with addressing this problem. There's a fairly large explanatory comment about it, but the trickiest part is that when an atom changes, other atoms (defined in other places) may also change -- for example, if `class B extends A`, editing A should dirty B, even if B is in an entirely different file. We perform analysis in two stages to propagate these changes: first detecting direct changes, then detecting indirect changes. This isn't completely implemented -- we need to propagate 'extends' through more levels -- but I believe it's structurally correct and good enough until we actually document classes.

**Inheritance** Diviner currently has a very weak story on inheritance. I want to inherit a lot more metas/docs. If an interface documents a method, we should just pull that documentation in to every implementation by default (implementations can still override it if they want). It can be shown in grey or something, but it should be desirable and correct to omit documentation of a method implementation when you are implementing a parent. Similarly, I want to pull in inherited methods and @tasks and such. This diff sets up for that, by formalizing "extends" relationships between atoms.

**Overspecialization** Diviner currently specializes atoms (FileAtom, FunctionAtom, ClassAtom, etc.). This is pretty much not useful, because Atomizers (which produce the atoms) need to be highly specialized, and Renderers/Publishers (which consume the atoms) also need to be highly specialized. Nothing interesting actually lives in the atom specializations, and we don't benefit from having them -- it just costs us generality in storage/caches for them. In the new code, I've used a single Atom class to represent any type of atom.

**URIs** We have fairly hideous URIs right now, which are very cumbersome  For in-app doc links, I want to provide nice URIs ("/h/notfications" or similar) which are stable redirects, and probably add remarkup for it: !{notifications} or similar. This diff isn't related to that since it's too premature.

**Search** Once we have a database generation target, we can index the documentation.

**Design** Chad has some nice mocks.

Test Plan: Ran `bin/diviner generate`, `bin/diviner generate --clean`. Saw appropriate graph propagation after edits. This diff doesn't do anything very useful yet.

Reviewers: btrahan, vrana

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T988

Differential Revision: https://secure.phabricator.com/D4340
2013-01-07 14:04:23 -08:00
epriestley
ba489f9d85 Add a local configuration source and a non-environmental ENV config source
Summary:
See discussion in T2221. Before we can move configuration to the database, we have a bootstrapping problem: we need database credentials to live //somewhere// if we can't guess them (and we can only really guess localhost / root / no password).

Some options for this are:

  - Have them live in ENV variables.
    - These are often somewhat unfamiliar to users.
    - Scripts would become a huge pain -- you'd have to dump a bunch of stuff into ENV.
    - Some environments have limited ability to set ENV vars.
    - SSH is also a pain.
  - Have them live in a normal config file.
    - This probably isn't really too awful, but:
    - Since we deploy/upgrade with git, we can't currently let them edit a file which already exists, or their working copy will become dirty.
    - So they have to copy or create a file, then edit it.
    - The biggest issue I have with this is that it will be difficult to give specific, easily-followed directions from Setup. The instructions need to be like "Copy template.conf.php to real.conf.php, then edit these keys: x, y, z". This isn't as easy to follow as "run script Y".
  - Have them live in an abnormal config file with script access (this diff).
    - I think this is a little better than a normal config file, because we can tell users 'run phabricator/bin/config set mysql.user phabricator' and such, which is easier to follow than editing a config file.

I think this is only a marginal improvement over a normal config file and am open to arguments against this approach, but I think it will be a little easier for users to deal with than a normal config file. In most cases they should only need to store three values in this file -- db user/host/pass -- since once we have those we can bootstrap everything else. Normal config files also aren't going away for more advanced users, we're just offering a simple alternative for most users.

This also adds an ENVIRONMENT file as an alternative to PHABRICATOR_ENV. This is just a simple way to specify the environment if you don't have convenient access to env vars.

Test Plan: Ran `config set x y`, verified writes. Wrote to ENVIRONMENT, ran `PHABRICATOR_ENV= ./bin/repository`.

Reviewers: btrahan, vrana, codeblock

Reviewed By: codeblock

CC: aran

Maniphest Tasks: T2221

Differential Revision: https://secure.phabricator.com/D4294
2012-12-30 06:16:15 -08:00
epriestley
f6b1964740 Improve Search architecture
Summary:
The search indexing API has several problems right now:

  - Always runs in-process.
    - It would be nice to push this into the task queue for performance. However, the API currently passses an object all the way through (and some indexers depend on preloaded object attributes), so it can't be dumped into the task queue at any stage since we can't serialize it.
    - Being able to use the task queue will also make rebuilding indexes faster.
    - Instead, make the API phid-oriented.
  - No uniform indexing API.
    - Each "Editor" currently calls SomeCustomIndexer::indexThing(). This won't work with AbstractTransactions. The API is also just weird.
    - Instead, provide a uniform API.
  - No uniform CLI.
    - We have `scripts/search/reindex_everything.php`, but it doesn't actually index everything. Each new document type needs to be separately added to it, leading to stuff like D3839. Third-party applications can't provide indexers.
    - Instead, let indexers expose documents for indexing.
  - Not application-oriented.
    - All the indexers live in search/ right now, which isn't the right organization in an application-orietned view of the world.
    - Instead, move indexers to applications and load them with SymbolLoader.

Test Plan:
  - `bin/search index`
    - Indexed one revision, one task.
    - Indexed `--type TASK`, `--type DREV`, etc., for all types.
    - Indexed `--all`.
  - Added the word "saboteur" to a revision, task, wiki page, and question and then searched for it.
    - Creating users is a pain; searched for a user after indexing.
    - Creating commits is a pain; searched for a commit after indexing.
    - Mocks aren't currently loadable in the result view, so their indexing is moot.

Reviewers: btrahan, vrana

Reviewed By: btrahan

CC: 20after4, aran

Maniphest Tasks: T1991, T2104

Differential Revision: https://secure.phabricator.com/D4261
2012-12-21 14:21:31 -08:00
epriestley
e78898970a Implement SSHD glue and Conduit SSH endpoint
Summary:
  - Build "sshd-auth" (for authentication) and "sshd-exec" (for command execution) binaries. These are callable by "sshd-vcs", located [[https://github.com/epriestley/sshd-vcs | in my account on GitHub]]. They are based on precursors [[https://github.com/epriestley/sshd-vcs-glue | here on GitHub]] which I deployed for TenXer about a year ago, so I have some confidence they at least basically work.
    - The problem this solves is that normally every user would need an account on a machine to connect to it, and/or their public keys would all need to be listed in `~/.authorized_keys`. This is a big pain in most installs. Software like Gitosis/Gitolite solve this problem by giving you an easy way to add public keys to `~/.authorized_keys`, but this is pretty gross.
    - Roughly, instead of looking in `~/.authorized_keys` when a user connects, the patched sshd instead runs `echo <public key> | sshd-auth`. The `sshd-auth` script looks up the public key and authorizes the matching user, if they exist. It also forces sshd to run `sshd-exec` instead of a normal shell.
    - `sshd-exec` receives the authenticated user and any command which was passed to ssh (like `git receive-pack`) and can route them appropriately.
    - Overall, this permits a single account to be set up on a server which all Phabricator users can connect to without any extra work, and which can safely execute commands and apply appropriate permissions, and disable users when they are disabled in Phabricator and all that stuff.
  - Build out "sshd-exec" to do more thorough checks and setup, and delegate command execution to Workflows (they now exist, and did not when I originally built this stuff).
  - Convert @btrahan's conduit API script into a workflow and slightly simplify it (ConduitCall did not exist at the time it was written).

The next steps here on the Repository side are to implement Workflows for Git, SVN and HG wire protocols. These will mostly just proxy the protocols, but also need to enforce permissions. So the approach will basically be:

  - Implement workflows for stuff like `git receive-pack`.
  - These workflows will implement enough of the underlying protocol to determine what resource the user is trying to access, and whether they want to read or write it.
  - They'll then do a permissons check, and kick the user out if they don't have permission to do whatever they are trying to do.
  - If the user does have permission, we just proxy the rest of the transaction.

Next steps on the Conduit side are more simple:

  - Make ConduitClient understand "ssh://" URLs.

Test Plan: Ran `sshd-exec --phabricator-ssh-user epriestley conduit differential.query`, etc. This will get a more comprehensive test once I set up sshd-vcs.

Reviewers: btrahan, vrana

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T603, T550

Differential Revision: https://secure.phabricator.com/D4229
2012-12-19 11:08:07 -08:00
epriestley
07dc943215 Modernize the drydock script
Summary: Add a bin/drydock symlink and break it into workflows. Nothing too special here.

Test Plan: Ran `bin/drydock wait-for-lease`, `bin/drydock lease`, `bin/drydock help`, etc.

Reviewers: btrahan

Reviewed By: btrahan

CC: aran

Maniphest Tasks: T2015

Differential Revision: https://secure.phabricator.com/D3867
2012-11-01 15:30:14 -07:00
epriestley
5d1bd51627 Add a script to migrate files between storage engines
Summary: Quora requested this (moving to S3) but it's also clearly a good idea.

Test Plan:
Ran with various valid/invalid options to test options. Error/sanity checking seemed OK.

Migrated individual local files.

Migrated all my local files back and forth between engines several times.

Uploaded some new files.

Reviewers: btrahan, vrana

Reviewed By: vrana

CC: aran

Maniphest Tasks: T1950

Differential Revision: https://secure.phabricator.com/D3808
2012-10-25 11:36:38 -07:00
epriestley
7c934e4176 Add a basic "fact" application
Summary:
Basic "Fact" application with some storage, part of a daemon, and a control binary.

= Goals =

The general idea is that we have various statistics we'd like to compute, like the frequency of image macros, reviewer responsiveness, task close rates, etc. Computing these on page load is expensive and messy. By building an ETL pipeline and running it in a daemon, we can precompute statistics and just pull them out of "stats" tables.

One way to do this is just to completely hard-code everything, e.g. have a daemon that runs every hour which issues a big-ass query and dumps results into a table per-fact or per fact-group. But this has a bunch of drawbacks: adding new stuff to the pipeline is a pain, various fact aggregators can't share much code, updates are slow and expensive, we can never build generic graphs on top of it, etc.

I'm hoping to build an ETL pipeline which is generic enough that we can use it for most things we're interested in without needing schema changes, and so that installs can use it also without needing schema changes, while still being specific enough that it's fast and we can build useful stuff on top of it. I'm not sure if this will actually work, but it would be cool if it does so I'm starting pretty generally and we'll see how far I get. I haven't built this exact sort of thing before so I might be way off.

I'm basing the whole thing on analyzing entire objects, not analyzing changes to objects. So each part of the pipeline is handed an object and told "analyze this", not handed a change. It pretty much deletes all the old data about that thing and then writes new data. I think this is simpler to implement and understand, and it protects us from all sorts of weird issues where we end up with some kind of garbage in the DB and have to wipe the whole thing.

= Facts =

The general idea is that we extract "facts" out of objects, and then the various view interfaces just report those facts. This change has on type of fact, a "raw fact", which is directly derived from an object. These facts are concerete and relate specifically to the object they are derived from. Some examples of such facts might be:

  D123 has 9 comments.
  D123 uses macro "psyduck" 15 times.
  D123 adds 35 lines.
  D123 has 5 files.
  D123 has 1 object.
  D123 has 1 object of type "DREV".
  D123 was created at epoch timestamp 89812351235.
  D123 was accepted by @alincoln at epoch timestamp 8397981839.

The fact storage looks like this:

  <factType, objectPHID, objectA, valueX, valueY, epoch>

Currently, we supprot one optional secondary key (like a user PHID or macro PHID), two optional integer values, and an optional timestamp. We might add more later. Each fact type can use these fields if it wants. Some facts use them, others don't. For instance, this diff adds a "N:*" fact, which is just the count of total objects in the system. These facts just look like:

  <"N:*", "PHID-xxxx-yyyy", ...>

...where all other fields are ignored. But some of the more complex facts might look like:

  <"DREV:accept", "PHID-DREV-xxxx", "PHID-USER-yyyy", ..., ..., nnnn> # User 'yyyy' accepted at epoch 'nnnn'.
  <"FILE:macro", "PHID-DREV-xxxx", "PHID-MACR-yyyy", 17, ..., ...> # Object 'xxxx' uses macro 'yyyy' 17 times.

Facts have no uniqueness constraints. For @vrana's reviewer responsiveness stuff, we can insert multiple rows for each reviewer, e.g.

  <"DREV:reviewed", "PHID-DREV-xxxx", "PHID-USER-yyyy", nnnn, ..., mmmm> # User 'yyyy' reviewed revision 'xxxx' after 'nnnn' seconds at 'mmmm'.

The second value (valueY) is mostly because we need it if we sample anything (valueX = observed value, valueY = sample rate) but there might be other uses. We might need to add "objectB" at some point too -- currently we can't represent a fact like "User X used macro Y on revision Z", so it would be impossible to compute macro use rates //for a specific user// based on this schema. I think we can start here though and see how far we get.

= Aggregated Facts =

These aren't implemented yet, but the idea is that we can then take the "raw facts" and compute derived/aggregated/rollup facts based on the raw fact table. For example, the "count" fact can be aggregated to arrive at a count of all objects in the system. This stuff will live in a separate table which does have uniqueness constraints, and come in the next diff.

We might need some kind of time series facts too, not sure about that. I think most of our use cases today are covered by raw facts + aggregated facts.

Test Plan: Ran `bin/fact` commands and verified they seemed to do reasonable things.

Reviewers: vrana, btrahan

Reviewed By: vrana

CC: aran, majak

Maniphest Tasks: T1562

Differential Revision: https://secure.phabricator.com/D3078
2012-07-27 13:34:21 -07:00
epriestley
13d96e6377 Introduce "bin/repository" for repository management
Summary:
Nothing new or exciting here yet, just moving the random scripts/repositories/ things to bin/repository. Also add `repository list`.

(Console stuff comes from D2841.)

Test Plan: Ran `repository list`, `repository pull`, `repository discover`, `repository discover --verbose`, `repository help`.

Reviewers: jungejason, vrana

Reviewed By: vrana

CC: aran

Differential Revision: https://secure.phabricator.com/D2849
2012-06-25 12:35:37 -07:00
epriestley
bad22cb303 Add a bin/aphlict wrapper to handle aphlict config / daemonization
Summary: Simple wrapper script to configure, launch, daemonize and restart the Aphlict server.

Test Plan: Ran "bin/aphlict", checked /notification/status/. Ran "bin/aphlict --foreground".

Reviewers: jungejason, vrana

Reviewed By: jungejason

CC: aran

Maniphest Tasks: T944

Differential Revision: https://secure.phabricator.com/D2784
2012-06-18 15:11:19 -07:00
epriestley
087cc0808a Make SQL patch management DAG-based and provide namespace support
Summary:
This addresses three issues with the current patch management system:

  # Two people developing at the same time often pick the same SQL patch number, and then have to go rename it. The system catches this, but it's silly.
  # Second/third-party developers can't use the same system to manage auxiliary storage they may want to add.
  # There's no way to build mock databases for unit tests that need to do reads.

To resolve these things, you can now name your patches whatever you want and conflicts are just merge conflicts, which are less of a pain to fix than filename conflicts.

Dependencies are now a DAG, with implicit dependencies created on the prior patch if no dependencies are specified. Developers can add new concrete subclasses of `PhabricatorSQLPatchList` to add storage management, and define the dependency branchpoint of their patches so they apply in the correct order (although, generally, they should not depend on the mainline patches, presumably).

The commands `storage upgrade --namespace test1234` and `storage destroy --namespace test1234` will allow unit tests to build and destroy MySQL storage.

A "quickstart" mode allows an upgrade from scratch in ~1200ms. Destruction takes about 200ms. These seem like fairily reasonable costs to actually use in tests. Building from scratch patch-by-patch takes about 6000ms.

Test Plan:
  - Created new databases from scratch with and without quickstart in a separate test namespace. Pointed the webapp at the test namespaces, browsed around, everything looked good.
  - Compared quickstart and no-quickstart dump states, they're identical except for mysqldump timestamps and a few similar things.
  - Upgraded a legacy database to the new storage format.
  - Destroyed / dumped storage.

Reviewers: edward, vrana, btrahan, jungejason

Reviewed By: btrahan

CC: aran, nh

Maniphest Tasks: T140, T345

Differential Revision: https://secure.phabricator.com/D2323
2012-04-30 07:54:00 -07:00
epriestley
477954a57e Improve CLI script for account creation and document account/reg setup process
Summary:
There was an old "create_user.php" script but it really was only useful for
creating agents. Provide a more user-friendly script for creating the first
account.

Depends on D278.

Test Plan:
Used 'accountadmin' to create and edit accounts. Read documentation.

Reviewed By: tuomaspelkonen
Reviewers: jungejason, tuomaspelkonen, aran
CC: ccheever, aran, tuomaspelkonen
Differential Revision: 279
2011-05-12 18:44:53 -07:00
epriestley
4893146815 Improve parser scalability, fix a bug or two, provide 'phd', the Phabricator
Daemon interface.
2011-03-13 14:27:03 -07:00