Sunday, December 5, 2010

S3 MultiPart Upload in boto

Amazon recently introduced MultiPart Upload to S3.  This new feature lets you upload large files in multiple parts rather than in one big chunk.  This provides two main benefits:

  • You get resumable uploads and don't have to worry about a high-stakes upload of a 5GB file that might fail after 4.9GB.  Instead, you can upload in parts and know that all of the parts that have successfully uploaded are there patiently waiting for the rest of the bytes to make it to S3.
  • You can parallelize your upload operation.  So, not only can you break your 5GB file into 1000 5MB chunks, you can run 20 uploader processes and get much better overall throughput to S3.
It took a few weeks, but we have just added full support for MultiPart Upload to the boto library.  This post gives a very quick intro to the new functionality to help get you started.

Below is a transcript from an interactive IPython session that exercises the new features.  Below that is a line by line commentary of what's going on.
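
In code form, the session went roughly like this (a sketch rather than the verbatim transcript; the bucket name, key name, and chunk file names are placeholders, not the exact values from the original session):

    import boto                                      # 1

    c = boto.connect_s3()                            # 2: connect to S3
    b = c.lookup('mybucket')                         # 3: look up an existing bucket
    mp = b.initiate_multipart_upload('test.pdf')     # 4: start the MultiPart Upload
    print mp.id                                      # 5: the upload transaction ID assigned by S3

    # 6-17: upload the four chunks produced by "split -b5m test.pdf"
    for i, chunk in enumerate(['xaa', 'xab', 'xac', 'xad']):
        fp = open(chunk, 'rb')
        mp.upload_part_from_file(fp, i + 1)          # part numbers start at 1
        fp.close()

    for part in mp:                                  # 18: list the parts uploaded so far
        print part.part_number, part.size, part.etag

    mp.complete_upload()                             # 19: assemble the parts into the final object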



  1. Self-explanatory, I hope 8^)
  2. We create a connection to the S3 service and assign it to the variable c.
  3. We lookup an existing bucket in S3 and assign that to the variable b.
  4. We initiate a MultiPart Upload to bucket b.  We pass in the key_name.  This key_name will be the name of the object in S3 once all of the parts are uploaded.  This creates a new instance of a MultiPartUpload object and assigns it to the variable mp.
  5. You might want to do a bit of exploration of the new object.  In particular, it has an attribute called id which is the upload transaction ID assigned by S3.  This transaction ID must accompany all subsequent requests related to this MultiPart Upload.
  6. I open a local file.  In this case, I had a 17MB PDF file.  I split that into 5MB chunks using the split command ("split -b5m test.pdf").  This creates three 5MB chunks and one smaller chunk with the leftovers.  You can use larger chunk sizes if you want, but 5MB is the minimum size (except for the last chunk, of course).
  7. I upload this chunk to S3 using the upload_part_from_file method of the MultiPartUpload object.
  8. Close the filepointer
  9. Open the file for the second chunk.
  10. Upload it.
  11. Close it.
  12. Open the file for the third chunk.
  13. Upload it.
  14. Close it.
  15. Open the file for the fourth and final chunk (the small one).
  16. Upload it.
  17. Close it.
  18. I can now examine all of the parts that are currently uploaded to S3 related to this key_name.  As you can see, I can use the MultiPartUpload object as an iterator and, when doing so, the generator object handles any pagination of results from S3 automatically.  Each object in the list is an instance of the Part class and, as you can see, has attributes such as part_number, size, and etag.
  19. Now that the last part has been uploaded, I can complete the MultiPart Upload transaction by calling the complete_upload method of the MultiPartUpload object.  If, on the other hand, I wanted to cancel the operation, I could call cancel_upload and all of the parts that had been uploaded would be deleted from S3.
This is a simple example.  To benefit fully from the MultiPart Upload functionality, though, you should consider introducing some concurrency into the mix: fire off separate threads or subprocesses to upload different parts in parallel, as in the sketch below.  The actual order in which the parts are uploaded doesn't matter as long as they are numbered sequentially.
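
Here is a rough sketch of one way to do that with the multiprocessing module; the bucket name, key name, and chunk file names are placeholders, and each worker re-attaches to the existing upload using the bucket name, key name, and upload id:

    import boto
    from multiprocessing import Pool
    from boto.s3.multipart import MultiPartUpload

    def upload_part(args):
        # Each worker process opens its own connection and rebuilds a
        # MultiPartUpload object pointing at the same transaction.
        bucket_name, key_name, upload_id, part_num, filename = args
        bucket = boto.connect_s3().lookup(bucket_name)
        mp = MultiPartUpload(bucket)
        mp.key_name = key_name
        mp.id = upload_id
        fp = open(filename, 'rb')
        mp.upload_part_from_file(fp, part_num)
        fp.close()

    if __name__ == '__main__':
        b = boto.connect_s3().lookup('mybucket')
        mp = b.initiate_multipart_upload('test.pdf')
        chunks = ['xaa', 'xab', 'xac', 'xad']
        args = [(b.name, mp.key_name, mp.id, i + 1, f) for i, f in enumerate(chunks)]
        Pool(4).map(upload_part, args)
        mp.complete_upload()    # complete_upload re-fetches the part list from S3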

Update

To find all of the current MultiPart Upload transactions for a given bucket, you can do this:
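
Something along these lines (a sketch, assuming the same bucket object b as in the session above):

    for mp in b.get_all_multipart_uploads():
        print mp.key_name, mp.id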

16 comments:

  1. I believe I have found, and perhaps fixed, a bug related to multipart upload of Unicode (utf8) key names.
    Scenario:
    b = _conn.get_bucket("mybucket")
    k = u'/\u00f1'.encode('utf8')
    mp = b.initiate_multipart_upload(k)
    mp = mp.upload_part_from_file(s,0)
    mp = mp.complete_upload()

    A fix that seems to work for me is to add a test in the endElement method of the MultiPartUpload class:

    elif name == 'Key':
        if type(value) == unicode:
            value = value.encode('utf8')

    ReplyDelete
  2. Could you explain a bit more about the problem you were having? I'm not sure I understand the fix you have provided. Also, it may be better to create an issue on the project page to better track this.

    ReplyDelete
  3. Sorry it wasn't clear. I posted issue #66 to the project page with a very short sample program: https://github.com/boto/boto/issues/#issue/66

    ReplyDelete
  4. This announcement is great news; I just hit the need for this feature in boto. Now the hard part: is this stable enough for production? Do you have any roadmap plan?

    Dave

    ReplyDelete
  5. I don't really have a roadmap plan. I've been trying to figure out the best way to handle that. It's kind of complicated trying to define the current "stability" of boto when it is made up of so many different modules at different levels of maturity.

    The Multipart Upload support is a recent addition, so I'm sure it's less mature, but it really is an incremental addition to existing code. From my POV, the best way to mature software is for it to be used in production, but I can also understand your reluctance to use it in production before it is 100% stable. All I can offer is my commitment to respond to any issues as quickly as possible.

    ReplyDelete
  6. Thanks again for the quick reply. Yes, I am a little concerned, but our application has a limited set of needs: some SQS and a few S3 read/write calls. Having another tag would make it easier to use, or at least give us something to report issues against. Is there a 2.0b4 in the near future?

    ReplyDelete
  7. Yes, there will be a 2.0b4 release soon. Can't say exactly when but probably within a week. I know I've said that before but this time I really mean it. Really!

    ReplyDelete
  8. Thank you for this post. This is exactly what I need. I'm having some trouble getting the multipart upload working properly though.

    When the parts are incorrectly numbered (something I discovered by accident when copying and pasting) so that they all have the same number, I get a file uploaded to my bucket that looks like what you might expect (its content is that of the last part). But when they are correctly numbered, nothing shows up in my bucket at all.

    - - -
    conn = S3Connection(AWS_ID, AWS_KEY)
    bucket = conn.lookup('mybucket')
    mpart = bucket.initiate_multipart_upload('testtmp')
    fp = open('w_aa', 'r')
    mpart.upload_part_from_file(fp, 1)
    fp.close()
    fp = open('w_ab', 'r')
    mpart.upload_part_from_file(fp, 2)
    fp.close()
    fp = open('w_ac', 'r')
    mpart.upload_part_from_file(fp, 3)
    fp.close()
    fp = open('w_ad', 'r')
    mpart.upload_part_from_file(fp, 4)
    fp.close()
    mpart.complete_upload()
    - - -

    Thankful for any help or pointers.

    ReplyDelete
  9. How large are the parts that you are uploading? What happens if you try to list the parts prior to completing the upload?

    ReplyDelete
  10. Aah, my mistake. My part files were just 1MB. I was supposed to split the file into 10MB parts. Now it works.

    Thank you.

    ReplyDelete
  11. There was a bug in boto, too. It was not correctly handling error responses that were sent in the body of 200 responses from the server, so it was failing silently on those. That was corrected a few weeks ago in the github master repo.

    ReplyDelete
  12. Thanks for implementing that feature! I wrote FileChunkIO, which lets you avoid splitting the file. And maybe my code example about parallel S3 multipart uploads using boto is of interest.
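
    The basic idea is to hand each slice of the file to upload_part_from_file as a FileChunkIO object instead of splitting the file on disk first. Roughly (the bucket name, file name, and chunk size here are just examples):

    import os
    import boto
    from filechunkio import FileChunkIO

    bucket = boto.connect_s3().lookup('mybucket')
    chunk_size = 5 * 1024 * 1024                     # 5MB minimum part size
    source_size = os.stat('test.pdf').st_size
    mp = bucket.initiate_multipart_upload('test.pdf')
    for i, offset in enumerate(range(0, source_size, chunk_size)):
        part_size = min(chunk_size, source_size - offset)
        # FileChunkIO exposes a slice of the file as a file-like object
        fp = FileChunkIO('test.pdf', 'r', offset=offset, bytes=part_size)
        mp.upload_part_from_file(fp, i + 1)
        fp.close()
    mp.complete_upload()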

    ReplyDelete
  13. Yes, I saw your article about parallel uploads and tweeted about it. Very cool. I'll check out your FileChunkIO, as well.

    ReplyDelete
  14. Any reason you're using the deprecated lookup method?

    ReplyDelete
  15. Fabian:

    Multipart copy is an issue when copying from one S3 bucket to another, as opposed to uploading from a local source to S3. Any ideas on a similar function to FileChunkIO that would take an S3 key, without having to download and re-upload?

    ReplyDelete