mirror of https://we.phorge.it/source/phorge.git synced 2025-01-11 15:21:03 +01:00

Slightly improve UTF-8 handling in Differential

Summary:
See comments. I think this will fix the issue where we end up handing off
garbage to htmlspecialchars() after highlighting a file we've stuck full of \0
bytes.

The right fix for this is to make wordwrap and intraline-diff UTF-8 aware and
throw this whole thing away. I'll work on that, but I think this fixes the
immediate issue.
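
To illustrate the underlying problem, here is a minimal sketch (not code from this commit) of how a byte-oriented split can turn a valid UTF-8 string into an invalid one, and how a character-oriented split avoids it. The helper names `is_valid_utf8()` and `utf8_chunks()` are hypothetical, and `str_split()` stands in for the real wordwrap logic.

```php
<?php
// Hypothetical helper: true if $string is well-formed UTF-8.
// preg_match() with the /u modifier returns false on invalid input.
function is_valid_utf8($string) {
  return preg_match('//u', $string) === 1;
}

// Hypothetical helper: split into chunks of $width characters (not bytes).
function utf8_chunks($string, $width) {
  $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
  $chunks = array();
  foreach (array_chunk($chars, $width) as $group) {
    $chunks[] = implode('', $group);
  }
  return $chunks;
}

// "\xE2\x80\x9C" and "\xE2\x80\x9D" are curly quotes, three bytes each.
$line = "say \xE2\x80\x9Chi\xE2\x80\x9D";

// Byte-oriented: the fifth byte is the first byte of the opening quote,
// so the first chunk ends in the middle of a character.
$byte_chunks = str_split($line, 5);
var_dump(is_valid_utf8($byte_chunks[0])); // bool(false)

// Character-oriented: every chunk is still valid UTF-8.
$char_chunks = utf8_chunks($line, 5);
var_dump(is_valid_utf8($char_chunks[0])); // bool(true)
```

This is the kind of damage the replacement step in this change papers over. A UTF-8 aware wordwrap would make that step unnecessary.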

Test Plan:
Diffed the file with a UTF-8 quote in it and got a reasonable render in
Differential.

Reviewed By: jungejason
Reviewers: jungejason, aran, tuomaspelkonen
CC: aran, jungejason
Differential Revision: 504
epriestley 2011-06-23 12:01:00 -07:00
parent b093113d02
commit d031d3ae32


@@ -186,14 +186,6 @@ class DifferentialChangesetParser {
     $this->parsedHunk = true;
     $lines = $hunk->getChanges();
-    // Flatten UTF-8 into "\0". We don't support UTF-8 because the diffing
-    // algorithms are byte-oriented (not character oriented) and everyone seems
-    // to be in agreement that it's fairly reasonable not to allow UTF-8 in
-    // source files. These bytes will later be replaced with a "?" glyph, but
-    // in the meantime we replace them with "\0" since Pygments is happy to
-    // deal with that.
-    $lines = preg_replace('/[\x80-\xFF]/', "\0", $lines);
     $lines = str_replace(
       array("\t", "\r\n", "\r"),
       array(' ', "\n", "\n"),
@@ -702,9 +694,16 @@ class DifferentialChangesetParser {
   protected function tokenHighlight(&$render) {
+    // TODO: This is really terribly horrible and should be fixed. We have two
+    // byte-oriented algorithms (wordwrap and intraline diff) which are not
+    // unicode-aware and can accept a valid UTF-8 string but emit an invalid
+    // one by adding markup inside the byte sequences of characters. The right
+    // fix here is to make them UTF-8 aware. Short of that, we can repair the
+    // possibly-broken UTF-8 string into a valid UTF-8 string by replacing all
+    // UTF-8 bytes with a Unicode Replacement Character.
     foreach ($render as $key => $text) {
-      $render[$key] = str_replace(
-        "\0",
+      $render[$key] = preg_replace(
+        '/[\x80-\xFF]/',
         '<span class="uu">'."\xEF\xBF\xBD".'</span>',
         $text);
     }
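
The effect of the new replacement can be sketched in isolation. The snippet below is an illustration under assumptions, not the parser itself: `repair_high_bytes()` is a hypothetical standalone wrapper around the same `preg_replace()` call, and the `<b>` markup stands in for whatever the intraline diff or wordwrap might insert between the bytes of a character.

```php
<?php
// Hypothetical standalone version of the replacement done in tokenHighlight().
// The pattern has no /u modifier, so it matches individual bytes.
function repair_high_bytes($text) {
  $glyph = '<span class="uu">'."\xEF\xBF\xBD".'</span>';
  return preg_replace('/[\x80-\xFF]/', $glyph, $text);
}

// Markup inserted between the bytes of a three-byte curly quote
// ("\xE2\x80\x9C"), which makes the string invalid UTF-8.
$broken = "a\xE2<b>\x80\x9C</b>z";
$fixed = repair_high_bytes($broken);

var_dump(preg_match('//u', $broken) === 1);     // bool(false)
var_dump(preg_match('//u', $fixed) === 1);      // bool(true)

// Each high byte becomes its own replacement character, so one
// three-byte character renders as three glyphs.
var_dump(substr_count($fixed, "\xEF\xBF\xBD")); // int(3)
```

Because every byte at or above 0x80 is replaced, the output is always valid UTF-8, at the cost of losing the original non-ASCII characters.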