Sunday, December 5, 2010

S3 MultiPart Upload in boto

Amazon recently introduced MultiPart Upload to S3.  This new feature lets you upload large files in multiple parts rather than in one big chunk.  This provides two main benefits:

  • You can get resumable uploads and don't have to worry about a high-stakes upload of a 5GB file that might fail after 4.9GB.  Instead, you can upload in parts and know that all of the parts that have successfully uploaded are there, patiently waiting for the rest of the bytes to make it to S3.
  • You can parallelize your upload operation.  So, not only can you break your 5GB file into 1000 5MB chunks, you can run 20 uploader processes and get much better overall throughput to S3.
It took a few weeks, but we have just added full support for MultiPart Upload to the boto library.  This post gives a very quick intro to the new functionality to help get you started.

Below is a transcript from an interactive IPython session that exercises the new features.  Below that is a line-by-line commentary of what's going on.

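A minimal sketch of such a session might look like the following (the bucket name mybucket, the key name test.pdf, and the chunk file names xaa through xad produced by split are just placeholders; IPython's output lines are omitted):

    In [1]: import boto
    In [2]: c = boto.connect_s3()
    In [3]: b = c.lookup('mybucket')
    In [4]: mp = b.initiate_multipart_upload('test.pdf')  # start the transaction
    In [5]: mp.id  # the upload transaction ID assigned by S3
    In [6]: fp = open('xaa', 'rb')
    In [7]: mp.upload_part_from_file(fp, 1)
    In [8]: fp.close()
    In [9]: fp = open('xab', 'rb')
    In [10]: mp.upload_part_from_file(fp, 2)
    In [11]: fp.close()
    In [12]: fp = open('xac', 'rb')
    In [13]: mp.upload_part_from_file(fp, 3)
    In [14]: fp.close()
    In [15]: fp = open('xad', 'rb')
    In [16]: mp.upload_part_from_file(fp, 4)
    In [17]: fp.close()
    In [18]: for part in mp:  # list the parts uploaded so far
       ....:     print part.part_number, part.size, part.etag
       ....:
    In [19]: mp.complete_upload()
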
  1. Self-explanatory, I hope 8^)
  2. We create a connection to the S3 service and assign it to the variable c.
  3. We lookup an existing bucket in S3 and assign that to the variable b.
  4. We initiate a MultiPart Upload to bucket b.  We pass in the key_name.  This key_name will be the name of the object in S3 once all of the parts are uploaded.  This creates a new instance of a MultiPartUpload object and assigns it to the variable mp.
  5. You might want to do a bit of exploration of the new object.  In particular, it has an attribute called id which is the upload transaction ID assigned by S3.  This transaction ID must accompany all subsequent requests related to this MultiPart Upload.
  6. I open a local file.  In this case, I had a 17MB PDF file.  I split that into 5MB chunks using the split command ("split -b5m test.pdf").  This creates three 5MB chunks and one smaller chunk with the leftovers.  You can use larger chunk sizes if you want, but 5MB is the minimum size (except for the last chunk, of course).
  7. I upload this chunk to S3 using the upload_part_from_file method of the MultiPartUpload object.
  8. Close the file pointer.
  9. Open the file for the second chunk.
  10. Upload it.
  11. Close it.
  12. Open the file for the third chunk.
  13. Upload it.
  14. Close it.
  15. Open the file for the fourth and final chunk (the small one).
  16. Upload it.
  17. Close it.
  18. I can now examine all of the parts currently uploaded to S3 for this key_name.  As you can see, I can use the MultiPartUpload object as an iterator; when I do, the underlying generator handles any pagination of results from S3 automatically.  Each object returned is an instance of the Part class and has attributes such as part_number, size, and etag.
  19. Now that the last part has been uploaded, I can complete the MultiPart Upload transaction by calling the complete_upload method of the MultiPartUpload object.  If, on the other hand, I wanted to cancel the operation, I could call cancel_upload (see the snippet just below) and all of the parts that had been uploaded would be deleted from S3.
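Had I wanted to abandon the transaction instead of completing it, a single call would do it (assuming the same mp object from the session above):

    mp.cancel_upload()  # discards all parts uploaded so far for this transaction
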
This is a simple example.  However, to benefit fully from the MultiPart Upload functionality, you should consider introducing some concurrency into the mix, either by firing off separate threads or by using subprocesses to upload different parts in parallel.  The order in which the parts are uploaded doesn't matter, as long as they are numbered sequentially.
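As a rough sketch only, a subprocess-based approach might look something like this.  The upload_part helper, bucket name, key name, and chunk file names are hypothetical, and each worker opens its own connection since connections shouldn't be shared across processes:

    import boto
    from boto.s3.multipart import MultiPartUpload
    from multiprocessing import Pool

    def upload_part(args):
        # Hypothetical helper: re-attach to an existing MultiPart Upload
        # transaction inside a worker process and upload a single part.
        bucket_name, key_name, upload_id, part_num, filename = args
        bucket = boto.connect_s3().lookup(bucket_name)
        mp = MultiPartUpload(bucket)
        mp.key_name = key_name
        mp.id = upload_id
        fp = open(filename, 'rb')
        mp.upload_part_from_file(fp, part_num)
        fp.close()

    if __name__ == '__main__':
        bucket = boto.connect_s3().lookup('mybucket')
        mp = bucket.initiate_multipart_upload('test.pdf')
        chunks = ['xaa', 'xab', 'xac', 'xad']  # the pieces produced by split
        work = [('mybucket', 'test.pdf', mp.id, i + 1, name)
                for i, name in enumerate(chunks)]
        Pool(processes=4).map(upload_part, work)
        mp.complete_upload()

The parent process completes the transaction once the workers finish; since S3 keeps track of the parts as they arrive, it doesn't matter which process uploaded each one.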

Update

To find all of the current MultiPart Upload transactions for a given bucket, you can do this:
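For example, something along these lines should do it (where b is a Bucket object, as above):

    for mp in b.get_all_multipart_uploads():
        print mp.key_name, mp.id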