- You get resumable uploads and don't have to worry about a high-stakes upload of a 5GB file that might fail after 4.9GB. Instead, you can upload in parts and know that all of the parts that have successfully uploaded are there patiently waiting for the rest of the bytes to make it to S3.
- You can parallelize your upload operation. So, not only can you break your 5GB file into 1000 5MB chunks, you can also run 20 uploader processes and get much better overall throughput to S3.
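The part-splitting arithmetic behind that parallelism can be sketched in plain Python. This is just an illustration of the bookkeeping, not boto code; the 5MB floor is S3's limit for every part except the last:

```python
# Compute (part_number, offset, size) tuples for a multipart upload.
# S3 part numbers start at 1; every part except the last must be >= 5MB.
MIN_PART_SIZE = 5 * 1024 * 1024

def part_ranges(total_size, part_size=MIN_PART_SIZE):
    if part_size < MIN_PART_SIZE:
        raise ValueError("parts (except the last) must be at least 5MB")
    parts = []
    offset = 0
    part_number = 1
    while offset < total_size:
        size = min(part_size, total_size - offset)
        parts.append((part_number, offset, size))
        offset += size
        part_number += 1
    return parts

# A 17MB file yields three 5MB parts plus one 2MB leftover part.
```

Each tuple can then be handed to a separate uploader process, since parts may arrive in any order.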
Below is a transcript from an interactive IPython session that exercises the new features, followed by a line-by-line commentary of what's going on.
- Self-explanatory, I hope 8^)
- We create a connection to the S3 service and assign it to a variable.
- We look up an existing bucket in S3 and assign that to a variable.
- We initiate a MultiPart Upload to bucket b. We pass in the key_name, which will be the name of the object in S3 once all of the parts are uploaded. This creates a new instance of a MultiPartUpload object and assigns it to a variable.
- You might want to do a bit of exploration of the new object. In particular, it has an attribute called id, which is the upload transaction ID assigned by S3. This transaction ID must accompany all subsequent requests related to this MultiPart Upload.
- I open a local file. In this case, I had a 17MB PDF file. I split that into 5MB chunks using the split command ("split -b5m test.pdf"). This creates three 5MB chunks and one smaller chunk with the leftovers. You can use larger chunk sizes if you want, but 5MB is the minimum size (except for the last chunk, of course).
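If you'd rather not shell out to split, the same chunking can be done in Python. This is my own illustrative sketch, not part of boto; the chunk naming scheme is arbitrary:

```python
def split_file(path, chunk_size=5 * 1024 * 1024):
    """Split the file at `path` into chunk_size pieces, like `split -b5m`.

    Returns the list of chunk filenames (path.part1, path.part2, ...).
    """
    chunk_paths = []
    with open(path, "rb") as fp:
        index = 1
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            chunk_path = "%s.part%d" % (path, index)
            with open(chunk_path, "wb") as out:
                out.write(data)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths

# For a 17MB file this produces three 5MB chunks and one 2MB chunk.
```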
- I upload this chunk to S3 using the upload_part_from_file method of the MultiPartUpload object.
- Close the filepointer
- Open the file for the second chunk.
- Upload it.
- Close it.
- Open the file for the third chunk.
- Upload it.
- Close it.
- Open the file for the fourth and final chunk (the small one).
- Upload it.
- Close it.
- I can now examine all of the parts that are currently uploaded to S3 related to this key_name. As you can see, I can use the MultiPartUpload object as an iterator and, when doing so, the generator object handles any pagination of results from S3 automatically. Each object in the list is an instance of the Part class and, as you can see, has attributes such as part_number, size, and etag.
- Now that the last part has been uploaded, I can complete the MultiPart Upload transaction by calling the complete_upload method of the MultiPartUpload object. If, on the other hand, I wanted to cancel the operation, I could call cancel_upload, and all of the parts that had been uploaded would be deleted in S3.
I believe I have found, and perhaps fixed, a bug related to multipart upload of Unicode (utf8) key names.
b = _conn.get_bucket("mybucket")
k = u'/\u00f1'.encode('utf8')
mp = b.initiate_multipart_upload(k)
mp = mp.upload_part_from_file(s,0)
mp = mp.complete_upload()
A fix that seems to work for me is to add a test to the endElement method of the MultiPartUpload class:
elif name == 'Key':
    if type(value) == unicode:
        value = value.encode('utf8')
Could you explain a bit more about the problem you were having? I'm not sure I understand the fix you have provided. Also, it may be better to create an issue on the project page to better track this.
Sorry it wasn't clear. I posted issue #66 to the project page with a very short sample program: https://github.com/boto/boto/issues/#issue/66
This announcement is great news; I just hit the need for this feature in boto. Now the hard part: is this stable enough for production? Do you have any roadmap plan?
I don't really have a roadmap plan. I've been trying to figure out the best way to handle that. It's kind of complicated trying to define the current "stability" of boto when it is composed of so many different modules at different levels of maturity.
The Multipart Upload is a recent addition so I'm sure it's less mature but it really is an incremental addition to existing code. From my POV, the best way to mature software is for it to be used in production but I can also understand your reluctance to use it in production before it is 100% stable. All I can offer is my commitment to respond to any issues as quickly as possible.
Thanks again for the quick reply. Yes, I am a little concerned, but our application has a limited set of needs: some SQS and a few S3 read/write calls. Having another tag would make it easier to use, at least something to report issues against. Is there a 2.0b4 in the near future?
Yes, there will be a 2.0b4 release soon. Can't say exactly when but probably within a week. I know I've said that before but this time I really mean it. Really!
Thank you for this post. This is exactly what I need. I'm having some trouble getting the multipart upload working properly though.
When the parts are incorrectly numbered (something I discovered by accident while copying and pasting) and they all have the same number, I get a file uploaded to my bucket that looks like you might expect (its content is that of the last part). But when they are correctly numbered, nothing shows up in my bucket at all.
- - -
conn = S3Connection(AWS_ID, AWS_KEY)
bucket = conn.lookup('mybucket')
mpart = bucket.initiate_multipart_upload('testtmp')
fp = open('w_aa', 'r')
fp = open('w_ab', 'r')
fp = open('w_ac', 'r')
fp = open('w_ad', 'r')
- - -
Thankful for any help or pointers.
How large are the parts that you are uploading? What happens if you try to list the parts prior to completing the upload?
Aah, my mistake. My part files were just 1MB. I was supposed to split the file into 10MB parts. Now it works.
There was a bug in boto, too. It was not correctly handling error responses that were sent in the body of 200 responses from the server, so it was failing silently on these. That was corrected a few weeks ago in the github master repo.
Thanks for implementing that feature! I wrote FileChunkIO, which lets you avoid splitting the file. And maybe my code example about parallel S3 multipart uploads using boto is of interest.
Yes, I saw your article about parallel uploads and tweeted about it. Very cool. I'll check out your FileChunkIO, as well.
Any reason you're using the deprecated lookup method?
This multipart approach is an issue when copying from one S3 bucket to another, as opposed to uploading from a local source to S3. Any ideas on a function similar to FileChunkIO that would take an S3 key without having to download and re-upload?
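For what it's worth, later boto releases grew a copy_part_from_key method on MultiPartUpload that drives S3's server-side part copy, so no bytes pass through your machine. If your boto version has it, a bucket-to-bucket copy could be sketched like this (the method's availability and its inclusive start/end byte-range semantics are assumptions to verify against your boto version):

```python
def multipart_copy(conn, src_bucket_name, src_key_name,
                   dst_bucket_name, dst_key_name,
                   part_size=50 * 1024 * 1024):
    """Server-side S3-to-S3 copy in parts; no local download needed.

    Assumes MultiPartUpload.copy_part_from_key exists in your boto
    version; start/end are treated as inclusive byte offsets.
    """
    src_bucket = conn.lookup(src_bucket_name)
    total_size = src_bucket.get_key(src_key_name).size
    dst_bucket = conn.lookup(dst_bucket_name)
    mp = dst_bucket.initiate_multipart_upload(dst_key_name)
    part_number = 1
    offset = 0
    while offset < total_size:
        end = min(offset + part_size, total_size) - 1
        mp.copy_part_from_key(src_bucket_name, src_key_name,
                              part_number, start=offset, end=end)
        part_number += 1
        offset = end + 1
    return mp.complete_upload()
```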