phorge-phorge/src/applications/files/engine/PhabricatorChunkedFileStorageEngine.php


Add a chunking storage engine for files

Summary: Ref T7149. This isn't complete and isn't active yet, but does basically work. I'll shore it up in the next few diffs.

The new workflow goes like this:

> Client, file.allocate(): I'd like to upload a file with length L, metadata M, and hash H.

Then the server returns `upload` (a boolean) and `filePHID` (a PHID). These mean:

| upload | filePHID | means |
|---|---|---|
| false | false | Server can't accept file. |
| false | true | File data already known, file created from hash. |
| true | false | Just upload normally. |
| true | true | Query chunks to start or resume a chunked upload. |

All but the last case are uninteresting and work like existing uploads with `file.uploadhash` (which we can eventually deprecate). In the last case:

> Client, file.querychunks(): Give me a list of chunks that I should upload.

This returns all the chunks for the file. Chunks have a start byte, an end byte, and a "complete" flag to indicate that the server already has the data.

Then, the client fills in chunks by sending them:

> Client, file.uploadchunk(): Here is the data for one chunk.

This stuff doesn't work yet or has some caveats:

- I haven't tested resume much.
- Files need an "isPartial()" flag for partial uploads, and the UI needs to respect it.
- The JS client needs to become chunk-aware.
- Chunk size is set crazy low to make testing easier.
- Some debugging flags that I'll remove soon-ish.
- Downloading works, but still streams the whole file into memory.
- This storage engine is disabled by default (hardcoded as a unit test engine) because it's still sketchy.
- Need some code to remove the "isPartial" flag when the last chunk is uploaded.
- Maybe do checksumming on chunks.

Test Plan:

- Hacked up `arc upload` (see next diff) to be chunk-aware and uploaded a readme in 18 32-byte chunks. Then downloaded it. Got the same file back that I uploaded.
- File UI now shows some basic chunk info for chunked files: {F336434}

Reviewers: btrahan

Reviewed By: btrahan

Subscribers: joshuaspence, epriestley

Maniphest Tasks: T7149

Differential Revision: https://secure.phabricator.com/D12060
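As a hedged sketch, the client side of this flow might look like the following. Method names come from the summary above; the parameter and field names (`contentLength`, `byteStart`, `byteEnd`, `complete`) are assumptions based on that summary, not a finalized API.

```php
// Hedged sketch of a chunk-aware upload client using ConduitClient.
// (Authentication setup omitted for brevity.)
$conduit = new ConduitClient('https://phabricator.example.com/api/');

$result = $conduit->callMethodSynchronous('file.allocate', array(
  'name'          => 'README',
  'contentLength' => strlen($data),
  'contentHash'   => sha1($data),
));

if ($result['upload'] && $result['filePHID']) {
  // Chunked upload: ask the server which chunks it still needs.
  $chunks = $conduit->callMethodSynchronous('file.querychunks', array(
    'filePHID' => $result['filePHID'],
  ));

  foreach ($chunks as $chunk) {
    if ($chunk['complete']) {
      // Resume support: the server already has this chunk's data.
      continue;
    }
    $conduit->callMethodSynchronous('file.uploadchunk', array(
      'filePHID'  => $result['filePHID'],
      'byteStart' => $chunk['byteStart'],
      'data'      => substr(
        $data,
        $chunk['byteStart'],
        $chunk['byteEnd'] - $chunk['byteStart']),
    ));
  }
}
```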
<?php
final class PhabricatorChunkedFileStorageEngine
extends PhabricatorFileStorageEngine {
public function getEngineIdentifier() {
return 'chunks';
}
public function getEnginePriority() {
return 60000;
}
/**
* We can write chunks if we have at least one valid storage engine
* underneath us.
*/
public function canWriteFiles() {
return (bool)$this->getWritableEngine();
}
public function hasFilesizeLimit() {
return false;
}
public function isChunkEngine() {
return true;
}
public function writeFile($data, array $params) {
// The chunk engine does not support direct writes.
throw new PhutilMethodNotImplementedException();
}
public function readFile($handle) {
// This is inefficient (it buffers the entire file in memory), but it
// makes the API work as expected. See getFileDataIterator() below for
// reads that stream a byte range instead.
$chunks = $this->loadAllChunks($handle, true);
$buffer = '';
foreach ($chunks as $chunk) {
$data_file = $chunk->getDataFile();
if (!$data_file) {
throw new Exception(pht('This file data is incomplete!'));
}
$buffer .= $data_file->loadFileData();
}
return $buffer;
}
public function deleteFile($handle) {
$engine = new PhabricatorDestructionEngine();
$chunks = $this->loadAllChunks($handle, true);
foreach ($chunks as $chunk) {
$engine->destroyObject($chunk);
}
}
private function loadAllChunks($handle, $need_files) {
$chunks = id(new PhabricatorFileChunkQuery())
->setViewer(PhabricatorUser::getOmnipotentUser())
->withChunkHandles(array($handle))
->needDataFiles($need_files)
->execute();
$chunks = msort($chunks, 'getByteStart');
return $chunks;
}
/**
* Compute a chunked file hash for the viewer.
*
* We cannot currently compute a real hash for chunked file uploads, because
* no single process sees all of the file data.
*
* We also cannot trust the hash that the user claims to have computed. If
* we trust the user, they can upload some `evil.exe` and claim it has the
* same file hash as `good.exe`. When another user later uploads the real
* `good.exe`, we'll just create a reference to the existing `evil.exe`. Users
* who download `good.exe` will then receive `evil.exe`.
*
* Instead, we rehash the user's claimed hash with account secrets. This
* allows users to resume file uploads, but not collide with other users.
*
* Ideally, we'd like to be able to verify hashes, but this is complicated,
* time-consuming, and gives us a fairly small benefit.
*
* @param PhabricatorUser Viewing user.
* @param string Claimed file hash.
* @return string Rehashed file hash.
*/
public static function getChunkedHash(PhabricatorUser $viewer, $hash) {
if (!$viewer->getPHID()) {
throw new Exception(
pht('Unable to compute chunked hash without real viewer!'));
}
$input = $viewer->getAccountSecret().':'.$hash.':'.$viewer->getPHID();
return self::getChunkedHashForInput($input);
}
public static function getChunkedHashForInput($input) {
$rehash = PhabricatorHash::digest($input);
// Add a suffix to identify this as a chunk hash.
$rehash = substr($rehash, 0, -2).'-C';
return $rehash;
}
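// Hedged worked example (all values hypothetical): for a viewer with
// account secret "s3cret" and PHID "PHID-USER-aaaa" who claims the
// content hash "f00dfeed...", the server computes:
//
//   $input  = 's3cret:f00dfeed...:PHID-USER-aaaa';
//   $rehash = PhabricatorHash::digest($input); // keyed (HMAC-style) digest
//   $rehash = substr($rehash, 0, -2).'-C';     // e.g. "1c0b...-C"
//
// Because the rehash mixes in per-account secrets, two users claiming the
// same content hash get different chunked hashes, so one user cannot
// collide with (and poison) another user's upload.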
public function allocateChunks($length, array $properties) {
$file = PhabricatorFile::newChunkedFile($this, $length, $properties);
$chunk_size = $this->getChunkSize();
$handle = $file->getStorageHandle();
$chunks = array();
for ($ii = 0; $ii < $length; $ii += $chunk_size) {
$chunks[] = PhabricatorFileChunk::initializeNewChunk(
$handle,
$ii,
min($ii + $chunk_size, $length));
}
$file->openTransaction();
foreach ($chunks as $chunk) {
$chunk->save();
}
$file->save();
$file->saveTransaction();
return $file;
}
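// Worked example: with the 4 MiB chunk size from getChunkSize() below,
// allocating a 10 MiB (10485760 byte) file creates three chunk records
// covering these half-open byte ranges:
//
//   [0, 4194304)  [4194304, 8388608)  [8388608, 10485760)
//
// The final chunk is clamped to the file length by min() above.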
/**
* Find a storage engine which is suitable for storing chunks.
*
* This engine must be a writable engine, must not be a chunk engine itself,
* and must either have no filesize limit or a limit larger than the chunk
* size.
*/
private function getWritableEngine() {
// NOTE: We can't just load writable engines here: deciding whether an
// engine is writable consults canWriteFiles(), which calls us, so we'd
// recurse forever.
$engines = PhabricatorFileStorageEngine::loadAllEngines();
foreach ($engines as $engine) {
if ($engine->isChunkEngine()) {
continue;
}
if ($engine->isTestEngine()) {
continue;
}
if (!$engine->canWriteFiles()) {
continue;
}
if ($engine->hasFilesizeLimit()) {
if ($engine->getFilesizeLimit() < $this->getChunkSize()) {
continue;
}
}
return true;
}
return false;
}
public function getChunkSize() {
return (4 * 1024 * 1024); // 4 MiB
}
public function getFileDataIterator(PhabricatorFile $file, $begin, $end) {
$chunks = id(new PhabricatorFileChunkQuery())
->setViewer(PhabricatorUser::getOmnipotentUser())
->withChunkHandles(array($file->getStorageHandle()))
->withByteRange($begin, $end)
->needDataFiles(true)
->execute();
return new PhabricatorFileChunkIterator($chunks, $begin, $end);
}
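// Hedged usage sketch ($engine and $file are assumed to be in hand):
// stream bytes [100, 200) of a chunked file without buffering the whole
// file in memory, unlike readFile() above:
//
//   $iterator = $engine->getFileDataIterator($file, 100, 200);
//   foreach ($iterator as $data) {
//     echo $data; // each iteration yields the in-range part of a chunk
//   }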
}