Improve Search architecture
Summary:
The search indexing API has several problems right now:
- Always runs in-process.
- It would be nice to push this into the task queue for performance. However, the API currently passses an object all the way through (and some indexers depend on preloaded object attributes), so it can't be dumped into the task queue at any stage since we can't serialize it.
- Being able to use the task queue will also make rebuilding indexes faster.
- Instead, make the API phid-oriented.
- No uniform indexing API.
- Each "Editor" currently calls SomeCustomIndexer::indexThing(). This won't work with AbstractTransactions. The API is also just weird.
- Instead, provide a uniform API.
- No uniform CLI.
- We have `scripts/search/reindex_everything.php`, but it doesn't actually index everything. Each new document type needs to be separately added to it, leading to stuff like D3839. Third-party applications can't provide indexers.
- Instead, let indexers expose documents for indexing.
- Not application-oriented.
- All the indexers live in search/ right now, which isn't the right organization in an application-orietned view of the world.
- Instead, move indexers to applications and load them with SymbolLoader.
Test Plan:
- `bin/search index`
- Indexed one revision, one task.
- Indexed `--type TASK`, `--type DREV`, etc., for all types.
- Indexed `--all`.
- Added the word "saboteur" to a revision, task, wiki page, and question and then searched for it.
- Creating users is a pain; searched for a user after indexing.
- Creating commits is a pain; searched for a commit after indexing.
- Mocks aren't currently loadable in the result view, so their indexing is moot.
Reviewers: btrahan, vrana
Reviewed By: btrahan
CC: 20after4, aran
Maniphest Tasks: T1991, T2104
Differential Revision: https://secure.phabricator.com/D4261
2012-12-21 14:21:31 -08:00
|
|
|
<?php
|
|
|
|
|
2015-12-20 08:54:29 -08:00
|
|
|
final class PhabricatorIndexEngine extends Phobject {
|
2014-01-14 13:22:56 -08:00
|
|
|
|
2015-12-21 06:07:41 -08:00
|
|
|
private $object;
|
|
|
|
private $extensions;
|
|
|
|
private $versions;
|
|
|
|
private $parameters;
|
|
|
|
|
|
|
|
public function setParameters(array $parameters) {
|
|
|
|
$this->parameters = $parameters;
|
|
|
|
return $this;
|
|
|
|
}
|
|
|
|
|
|
|
|
public function getParameters() {
|
|
|
|
return $this->parameters;
|
|
|
|
}
|
|
|
|
|
|
|
|
public function setObject($object) {
|
|
|
|
$this->object = $object;
|
|
|
|
return $this;
|
|
|
|
}
|
|
|
|
|
|
|
|
public function getObject() {
|
|
|
|
return $this->object;
|
|
|
|
}
|
|
|
|
|
|
|
|
public function shouldIndexObject() {
|
|
|
|
$extensions = $this->newExtensions();
|
|
|
|
|
|
|
|
$parameters = $this->getParameters();
|
|
|
|
foreach ($extensions as $extension) {
|
|
|
|
$extension->setParameters($parameters);
|
|
|
|
}
|
|
|
|
|
|
|
|
$object = $this->getObject();
|
|
|
|
$versions = array();
|
|
|
|
foreach ($extensions as $key => $extension) {
|
|
|
|
$version = $extension->getIndexVersion($object);
|
|
|
|
if ($version !== null) {
|
|
|
|
$versions[$key] = (string)$version;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (idx($parameters, 'force')) {
|
|
|
|
$current_versions = array();
|
|
|
|
} else {
|
Allow index extensions to skip indexing if the object has not changed
Summary:
Fixes T9890. This allows IndexExtensions to emit an object version.
Before we build indexes, we check if the indexed version is the same as the current version. If it is, we just don't call that extension.
T9890 has a case where this is useful: a script went crazy and posted thousands of comments to a single task.
Without versioning, that results in the same comments being indexed over and over again. With versioning, most of the queue could just exit without doing any work.
Test Plan:
- Added a `sleep(1)` to the actual indexing, used `bin/search index --background` to queue up a lot of tasks, ran them with `bin/phd debug task`, saw them complete very quickly with only one actual index operation performed.
- Used `bin/search index --trace` and `bin/search index --trace --background` to observe the behavior of queries against the index version store, which looked sensible.
- Made comments/transactions, saw versions update.
- Used `bin/remove destroy`, verified index versions were purged.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9890
Differential Revision: https://secure.phabricator.com/D14845
2015-12-21 11:04:08 -08:00
|
|
|
$keys = array_keys($versions);
|
|
|
|
$current_versions = $this->loadIndexVersions($keys);
|
2015-12-21 06:07:41 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
foreach ($versions as $key => $version) {
|
|
|
|
$current_version = idx($current_versions, $key);
|
|
|
|
|
|
|
|
if ($current_version === null) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If nothing has changed since we built the current index, we do not
|
|
|
|
// need to rebuild the index.
|
|
|
|
if ($current_version === $version) {
|
|
|
|
unset($extensions[$key]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
$this->extensions = $extensions;
|
|
|
|
$this->versions = $versions;
|
|
|
|
|
|
|
|
// We should index the object only if there is any work to be done.
|
|
|
|
return (bool)$this->extensions;
|
|
|
|
}
|
|
|
|
|
|
|
|
public function indexObject() {
|
|
|
|
$extensions = $this->extensions;
|
|
|
|
$object = $this->getObject();
|
|
|
|
|
|
|
|
foreach ($extensions as $key => $extension) {
|
|
|
|
$extension->indexObject($this, $object);
|
|
|
|
}
|
|
|
|
|
Allow index extensions to skip indexing if the object has not changed
Summary:
Fixes T9890. This allows IndexExtensions to emit an object version.
Before we build indexes, we check if the indexed version is the same as the current version. If it is, we just don't call that extension.
T9890 has a case where this is useful: a script went crazy and posted thousands of comments to a single task.
Without versioning, that results in the same comments being indexed over and over again. With versioning, most of the queue could just exit without doing any work.
Test Plan:
- Added a `sleep(1)` to the actual indexing, used `bin/search index --background` to queue up a lot of tasks, ran them with `bin/phd debug task`, saw them complete very quickly with only one actual index operation performed.
- Used `bin/search index --trace` and `bin/search index --trace --background` to observe the behavior of queries against the index version store, which looked sensible.
- Made comments/transactions, saw versions update.
- Used `bin/remove destroy`, verified index versions were purged.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9890
Differential Revision: https://secure.phabricator.com/D14845
2015-12-21 11:04:08 -08:00
|
|
|
$this->saveIndexVersions($this->versions);
|
2015-12-21 06:07:41 -08:00
|
|
|
|
|
|
|
return $this;
|
|
|
|
}
|
|
|
|
|
|
|
|
private function newExtensions() {
|
|
|
|
$object = $this->getObject();
|
|
|
|
|
|
|
|
$extensions = PhabricatorIndexEngineExtension::getAllExtensions();
|
|
|
|
foreach ($extensions as $key => $extension) {
|
|
|
|
if (!$extension->shouldIndexObject($object)) {
|
|
|
|
unset($extensions[$key]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return $extensions;
|
|
|
|
}
|
|
|
|
|
Allow index extensions to skip indexing if the object has not changed
Summary:
Fixes T9890. This allows IndexExtensions to emit an object version.
Before we build indexes, we check if the indexed version is the same as the current version. If it is, we just don't call that extension.
T9890 has a case where this is useful: a script went crazy and posted thousands of comments to a single task.
Without versioning, that results in the same comments being indexed over and over again. With versioning, most of the queue could just exit without doing any work.
Test Plan:
- Added a `sleep(1)` to the actual indexing, used `bin/search index --background` to queue up a lot of tasks, ran them with `bin/phd debug task`, saw them complete very quickly with only one actual index operation performed.
- Used `bin/search index --trace` and `bin/search index --trace --background` to observe the behavior of queries against the index version store, which looked sensible.
- Made comments/transactions, saw versions update.
- Used `bin/remove destroy`, verified index versions were purged.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T9890
Differential Revision: https://secure.phabricator.com/D14845
2015-12-21 11:04:08 -08:00
|
|
|
private function loadIndexVersions(array $extension_keys) {
|
|
|
|
if (!$extension_keys) {
|
|
|
|
return array();
|
|
|
|
}
|
|
|
|
|
|
|
|
$object = $this->getObject();
|
|
|
|
$object_phid = $object->getPHID();
|
|
|
|
|
|
|
|
$table = new PhabricatorSearchIndexVersion();
|
|
|
|
$conn_r = $table->establishConnection('w');
|
|
|
|
|
|
|
|
$rows = queryfx_all(
|
|
|
|
$conn_r,
|
|
|
|
'SELECT * FROM %T WHERE objectPHID = %s AND extensionKey IN (%Ls)',
|
|
|
|
$table->getTableName(),
|
|
|
|
$object_phid,
|
|
|
|
$extension_keys);
|
|
|
|
|
|
|
|
return ipull($rows, 'version', 'extensionKey');
|
|
|
|
}
|
|
|
|
|
|
|
|
private function saveIndexVersions(array $versions) {
|
|
|
|
if (!$versions) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
$object = $this->getObject();
|
|
|
|
$object_phid = $object->getPHID();
|
|
|
|
|
|
|
|
$table = new PhabricatorSearchIndexVersion();
|
|
|
|
$conn_w = $table->establishConnection('w');
|
|
|
|
|
|
|
|
$sql = array();
|
|
|
|
foreach ($versions as $key => $version) {
|
|
|
|
$sql[] = qsprintf(
|
|
|
|
$conn_w,
|
|
|
|
'(%s, %s, %s)',
|
|
|
|
$object_phid,
|
|
|
|
$key,
|
|
|
|
$version);
|
|
|
|
}
|
|
|
|
|
|
|
|
queryfx(
|
|
|
|
$conn_w,
|
|
|
|
'INSERT INTO %T (objectPHID, extensionKey, version)
|
|
|
|
VALUES %Q
|
|
|
|
ON DUPLICATE KEY UPDATE version = VALUES(version)',
|
|
|
|
$table->getTableName(),
|
|
|
|
implode(', ', $sql));
|
|
|
|
}
|
|
|
|
|
Improve Search architecture
Summary:
The search indexing API has several problems right now:
- Always runs in-process.
- It would be nice to push this into the task queue for performance. However, the API currently passses an object all the way through (and some indexers depend on preloaded object attributes), so it can't be dumped into the task queue at any stage since we can't serialize it.
- Being able to use the task queue will also make rebuilding indexes faster.
- Instead, make the API phid-oriented.
- No uniform indexing API.
- Each "Editor" currently calls SomeCustomIndexer::indexThing(). This won't work with AbstractTransactions. The API is also just weird.
- Instead, provide a uniform API.
- No uniform CLI.
- We have `scripts/search/reindex_everything.php`, but it doesn't actually index everything. Each new document type needs to be separately added to it, leading to stuff like D3839. Third-party applications can't provide indexers.
- Instead, let indexers expose documents for indexing.
- Not application-oriented.
- All the indexers live in search/ right now, which isn't the right organization in an application-orietned view of the world.
- Instead, move indexers to applications and load them with SymbolLoader.
Test Plan:
- `bin/search index`
- Indexed one revision, one task.
- Indexed `--type TASK`, `--type DREV`, etc., for all types.
- Indexed `--all`.
- Added the word "saboteur" to a revision, task, wiki page, and question and then searched for it.
- Creating users is a pain; searched for a user after indexing.
- Creating commits is a pain; searched for a commit after indexing.
- Mocks aren't currently loadable in the result view, so their indexing is moot.
Reviewers: btrahan, vrana
Reviewed By: btrahan
CC: 20after4, aran
Maniphest Tasks: T1991, T2104
Differential Revision: https://secure.phabricator.com/D4261
2012-12-21 14:21:31 -08:00
|
|
|
}
|