On occasion I like to grab large public data sets, often many gigabytes in size, and store them in S3 for experimentation. Because of the way S3 is designed, I can’t tell the AWS console or CLI to “go download the file at this URL” natively. It just doesn’t work that way.
While I could always download the file in a web browser and then upload it to S3 through the AWS console, that’s cumbersome and slow. A faster and easier way is to stream a copy of the data. At a shell prompt, I can combine wget and the AWS CLI’s aws s3 cp in a single command that writes the data to S3 as it’s being downloaded:
wget -O- https://PUBLICDATA.SITE/VeryLargeDataset.json | aws s3 cp - s3://bucket/folder/VeryLargeDataset.json
Simply modify this example command to point to the URL of the data set you want to work with as well as the S3 bucket, folder, and file name of your choosing.
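If you prefer curl over wget, a roughly equivalent pipeline (using the same placeholder URL and bucket) should work as well, since curl writes to standard output by default and -L follows any redirects:
curl -L https://PUBLICDATA.SITE/VeryLargeDataset.json | aws s3 cp - s3://bucket/folder/VeryLargeDataset.json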
If you’re curious about how this works, let’s break the command into its component parts:
wget -O- https://PUBLICDATA.SITE/VeryLargeDataset.json
The -O option tells wget where to write the downloaded file, but when you pass the “-” character after -O as above, wget writes to standard output instead of a named file.
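To see that behavior on its own, you can pipe wget’s output into any other command. As a quick sanity check against the same placeholder URL, this previews just the first few hundred bytes without saving anything to disk (-q suppresses wget’s progress output so only the data comes through):
wget -qO- https://PUBLICDATA.SITE/VeryLargeDataset.json | head -c 300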
We then pipe that output into the AWS S3 CLI, which accepts the same “-” character as its source argument to read from standard input:
aws s3 cp - s3://bucket/folder/VeryLargeDataset.json
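One caveat when uploading from standard input: the CLI can’t know the final object size ahead of time, so for very large streams (the AWS docs call this out for sizes over roughly 50 GB) you may need to supply an estimate with --expected-size so the multipart upload is chunked correctly. A sketch, assuming a data set of around 120 GB:
wget -O- https://PUBLICDATA.SITE/VeryLargeDataset.json | aws s3 cp - s3://bucket/folder/VeryLargeDataset.json --expected-size 120000000000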
In plain English: the full command downloads the file, writes it to standard output, and pipes that into the AWS CLI’s standard input, which streams it to S3. You’re limited only by what your infrastructure and shell can handle, which is plenty for almost everyone.
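Once the transfer finishes, a quick listing is an easy way to confirm the object landed where you expected it:
aws s3 ls s3://bucket/folder/ --human-readable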
Do you have a similar AWS CLI tip to share? Leave me a comment explaining it! And thanks for reading.