Python AWS SDK Failed to Retrieve More Than 1000 Files
Recently, I ran into a limitation where AWS S3 returns at most 1000 objects per list request. This cap shows up when calling the boto3
SDK directly, but it doesn't bite with the awscli
command, because the CLI paginates behind the scenes. Here's a sample from my object storage bucket.
$ aws s3 ls s3://<BUCKET_NAME> --endpoint-url <ENDPOINT_URL> | wc -l # 1391
$ aws s3api list-objects --bucket <BUCKET_NAME> --endpoint-url <ENDPOINT_URL> \
| jq -r '.Contents | length' # 1391
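To see the cap from the Python side, a single bare list_objects_v2 call returns at most 1000 keys and sets IsTruncated when more remain. This is just a minimal sketch; access_key, secret_key, endpoint_url, and bucket_name are placeholders you'd fill in yourself.

import boto3

# Placeholder credentials/endpoint/bucket -- substitute your own values.
c = boto3.client(service_name="s3",
                 aws_access_key_id=access_key,
                 aws_secret_access_key=secret_key,
                 endpoint_url=endpoint_url)

resp = c.list_objects_v2(Bucket=bucket_name)
print(resp["KeyCount"])      # capped at 1000, even though the bucket holds 1391 objects
print(resp["IsTruncated"])   # True -> there are more objects left to fetch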
OK, that means we need a workaround in the Python AWS SDK. I believe there are more efficient ways to do it, but since I'm a novice in Python programming, I thought I'd do it plainly. Here's a code snippet showing how.
import boto3

# access_key, secret_key, endpoint_url, and bucket_name are assumed to be
# defined elsewhere (e.g. loaded from environment variables or a config file).
c = boto3.client(service_name="s3",
                 aws_access_key_id=access_key,
                 aws_secret_access_key=secret_key,
                 endpoint_url=endpoint_url)

cloudflare_files = list()

def retrieve_files_req(continuation_token):
    # The first request has no continuation token; subsequent requests pass
    # the token returned by the previous call.
    if continuation_token == "":
        s3_files_response = c.list_objects_v2(Bucket=bucket_name,
                                              MaxKeys=500,
                                              EncodingType="url")
    else:
        s3_files_response = c.list_objects_v2(Bucket=bucket_name,
                                              MaxKeys=500,
                                              EncodingType="url",
                                              ContinuationToken=continuation_token)

    for file in s3_files_response["Contents"]:
        cloudflare_files.append(file)
        print("Name: {}. Size: {}. Storage Class: {}".format(file["Key"],
                                                             file["Size"],
                                                             file["StorageClass"]))

    try:
        print("Success retrieve {} files ...".format(len(cloudflare_files)))
        # NextContinuationToken is only present when there are more objects left.
        return s3_files_response["NextContinuationToken"]
    except KeyError:
        return "RETRIEVE_COMPLETELY"

continuation_token = ""
retrieve_again = True
while retrieve_again:
    continuation_token = retrieve_files_req(continuation_token)
    if continuation_token == "RETRIEVE_COMPLETELY":
        retrieve_again = False

print("Finished retrieving {} files".format(len(cloudflare_files)))
There is a while loop that keeps calling retrieve_files_req
. Each call retrieves up to MaxKeys
files per request (the API default is 1000; in this case we're retrieving 500 files in one go).
Here's the thing.
If the bucket holds more files than MaxKeys
, the response includes a NextContinuationToken
, and we can pass this token on the next request, repeating until the key is no longer in the response. Once the key disappears from the response, it means we've already retrieved all the files in the bucket.
To check for the key, I put in a try-except clause. The function returns the token as long as it's present in the response. Once the key is gone, it returns RETRIEVE_COMPLETELY
and the while loop stops.
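For completeness, boto3 also ships with built-in paginators that handle the continuation token for you. This is just a minimal sketch of that alternative, reusing the same client c and bucket_name as above; it's not the approach the script uses.

# Alternative: let boto3's paginator handle NextContinuationToken internally.
paginator = c.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket=bucket_name,
                           PaginationConfig={"PageSize": 500})

all_files = []
for page in pages:
    all_files.extend(page.get("Contents", []))
print("Retrieved {} files".format(len(all_files)))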
If we run the script above, it’ll produce an output like this.
$ python3 retrieve-many-files.py
Success retrieve 500 files ...
Success retrieve 1000 files ...
Success retrieve 1391 files ...
Finished retrieving 1391 files
That concludes my mini tutorial on how to retrieve more than 1000 files from AWS S3 object storage (and other S3-compatible services like Cloudflare R2).