1
0
Fork 0
mirror of https://we.phorge.it/source/arcanist.git synced 2024-11-22 06:42:41 +01:00

Don't compute intraline diffs if the input fails a coarse check for being huge

Summary:
Fixes T11744. Because intraline diffs are expensive to generate, we already bail out and decline to generate them for very long lines.

However, we currently split the inputs into lists of characters first, then check how long they are and make a decision to bail. For //huge// inputs (e.g., 1MB+), this is too late: just splitting them has a large CPU/RAM cost.

(These inputs are rare in normal source, but can appear in, e.g., JSON files written without newlines.)

Instead, add an extra "are the inputs really huge?" check first, and bail early if they are.

Test Plan:
  - Generated a 1MB "change a file full of Q to a file full of R" diff.
  - Before change: purged changeset cache; took about 7 seconds to load.
  - After change: purged changeset cache; took about 1 second to load.
  - Viewed some normal diffs to make sure intraline edits still displayed correctly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11744

Differential Revision: https://secure.phabricator.com/D16683
This commit is contained in:
epriestley 2016-10-07 07:27:48 -07:00
parent 483e985d08
commit 2ad15c499a

View file

@ -56,7 +56,30 @@ final class ArcanistDiffUtils extends Phobject {
);
}
return self::computeIntralineEdits($o, $n);
// Do a fast check for certainly-too-long inputs before splitting the
// lines. Inputs take ~200x more memory to represent as lists than as
// strings, so we can run out of memory quickly if we try to split huge
// inputs. See T11744.
$ol = strlen($o);
$nl = strlen($n);
$max_glyphs = 80;
// This has some wiggle room for multi-byte UTF8 characters, and the
// fact that we're testing the sum of the lengths of both strings. It can
// still generate false positives for, say, Chinese text liberally
// slathered with combining characters, but this kind of text should be
// vitually nonexistent in real data.
$too_many_bytes = (16 * $max_glyphs);
if ($ol + $nl > $too_many_bytes) {
return array(
array(array(1, $ol)),
array(array(1, $nl)),
);
}
return self::computeIntralineEdits($o, $n, $max_glyphs);
}
public static function applyIntralineDiff($str, $intra_stack) {
@ -155,7 +178,7 @@ final class ArcanistDiffUtils extends Phobject {
->getEditString();
}
public static function computeIntralineEdits($o, $n) {
private static function computeIntralineEdits($o, $n, $max_glyphs) {
if (preg_match('/[\x80-\xFF]/', $o.$n)) {
$ov = phutil_utf8v_combined($o);
$nv = phutil_utf8v_combined($n);
@ -166,7 +189,7 @@ final class ArcanistDiffUtils extends Phobject {
$multibyte = false;
}
$result = self::generateEditString($ov, $nv);
$result = self::generateEditString($ov, $nv, $max_glyphs);
// Now we have a character-based description of the edit. We need to
// convert into a byte-based description. Walk through the edit string and