How to Use Amazon's S3 Web Service for Scaling Image Hosting

This post was written by Scott Windsor on the TeachStreet blog; that blog was shut down, so I moved the post here, as it was popular.

Most startups have been there: you have a simple site, and you want users to be able to upload photos of themselves or something else to share.  We were there as well just a few years ago, when building out the very first versions of TeachStreet.  During my earlier time at Amazon, I had worked on a few image hosting solutions, so I already knew some of the pitfalls and challenges of building such a system to scale.

Here were some of our high-level requirements:
  • Keep redundant copies of images in case of failure
  • Allow dynamic resizing and cropping of images (so we don't have to pre-generate them)
  • Must be fast (but cheap)
  • Must scale independently of our core web application
Having previously dealt with keeping source images in sync across multiple hosts, we knew it could be a challenge, and in the face of host failure, a huge pain.  Right around that time, S3 was gaining traction, and it solved our redundancy requirement.  We could push our images to Amazon and never have to worry about backing them up or keeping extra copies in case of hardware failure (that became Amazon's problem).

Our Solution

We chose to write a separate Rails app to serve these images and handle the resizing, cropping, or any other effects we needed.  RMagick (which uses ImageMagick) performs these transformations for us and lets us serve the result back to the user.  The process, sketched in code after the list below, is as follows:
  1. Handle request
  2. Fetch original source image from S3
  3. Resize/apply effects
  4. Return result back to user
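Concretely, a minimal sketch of that flow might look like the following, assuming the aws-sdk-s3 gem and RMagick. The bucket name, key layout, and JPEG quality here are illustrative assumptions, not the actual TeachStreet setup (which also layers disk caching on top, as described below).

```ruby
require 'aws-sdk-s3'
require 'rmagick'

S3_BUCKET = 'example-image-bucket'          # hypothetical bucket name
S3_CLIENT = Aws::S3::Client.new(region: 'us-east-1')

# guid identifies the source image; width/height come from the request URL
def render_image(guid, width, height)
  # 1-2. handle the request and fetch the original bytes from S3
  original = S3_CLIENT.get_object(bucket: S3_BUCKET, key: "originals/#{guid}").body.read

  # 3. resize and normalize with RMagick
  img = Magick::Image.from_blob(original).first
  img.resize_to_fit!(width, height)         # fit within width x height, keeping aspect ratio
  img.format = 'JPEG'
  img.strip!                                # drop EXIF/profile data to shrink the file

  # 4. return the JPEG bytes to hand back to the client
  img.to_blob { self.quality = 85 }
end
```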
Now we need to go back and optimize for the "fast" requirement.  Making a request to S3 for every incoming request (and resize) takes time.  For performance, each of our image servers caches the source image, and any resized versions, locally on disk.  Since images are never updated (only created) and each gets a unique ID, we don't have to worry about cache invalidation, only expiration.  We can then write a simple script that removes files from this disk cache whose access time is older than a certain threshold (say, 30 days); a sketch of such a script appears at the end of this section.  That way, if we change from one thumbnail size to another, the old thumbnail sizes eventually get purged.

Performance Optimizations

Implementation-wise, we can also use a few more tricks to eke out performance.  First, since we front our Rails app with nginx, we can use X-Sendfile to return the file location to nginx.  This saves the Rails app from having to stream the file data back, and on subsequent requests it amounts to just a file lookup on disk (the app doesn't have to read the contents of the file).  Also, we can ensure that all files are converted to JPEGs, then optimized and stripped of any extra header info.  This minimizes the file size as much as possible before sending it back, which improves overall latency and throughput for later requests.

Lastly, on the client side, we can trick browsers a bit further.  By creating extra DNS entries for the same servers, we can make the browser think they are different servers.  Many modern browsers allow a maximum of four simultaneous requests per host.  Our web app is then responsible for distributing the requests: by hashing and modding the URL, we can evenly distribute the images across four hostnames (also sketched at the end of this section).  This lets browsers parallelize requests, at the slight cost of extra DNS lookups.

By leveraging Amazon's S3 web service, we've been able to reduce the overhead of having to build and manage a redundant file store.
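As an illustration of the expiration script mentioned above, here is a minimal sketch; the cache directory path and the 30-day threshold are assumptions, not the values we actually ran.

```ruby
require 'find'

CACHE_DIR = '/var/cache/images'   # assumed location of the local disk cache
MAX_AGE   = 30 * 24 * 60 * 60     # 30 days, in seconds

# Delete any cached file whose last access time is older than the threshold.
Find.find(CACHE_DIR) do |path|
  next unless File.file?(path)
  File.delete(path) if Time.now - File.atime(path) > MAX_AGE
end
```

And here is a rough sketch of the hostname-distribution trick, assuming four hypothetical image hostnames; CRC32 stands in for whatever hash function the app actually used.

```ruby
require 'zlib'

# Hypothetical image hostnames that all point at the same servers.
IMAGE_HOSTS = %w[img0.example.com img1.example.com img2.example.com img3.example.com]

# Hash-and-mod the image path so a given image always maps to the same
# hostname (keeping browser caches warm) while spreading load across all four.
def image_url_for(path)
  host = IMAGE_HOSTS[Zlib.crc32(path) % IMAGE_HOSTS.size]
  "http://#{host}#{path}"
end

# e.g. image_url_for("/images/abc123/200x200.jpg")
```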

Further steps/more optimization?

Still, there are more steps we could take to optimize further, if needed.  First, if we know commonly requested image sizes and effects, we could prime the cache on image upload (a rough sketch appears at the end of this post).  This would avoid the extra lookup to S3 except in a failure case.  If our caches start to get very large (as we scale), we could use DNS to map to different servers, either increasing the number of DNS entries (modding out to a larger set) or routing to different servers based on the URL (for different image sizes, etc.).

Right now, most of our users are in the US.  If we had an international site, we might consider using different S3 regions for storage (in Singapore, Hong Kong, Japan, or Europe), as well as fronting images with a CDN.  Generally speaking, CDNs are quite expensive for scrappy little startups like us.  We could even consider using Amazon CloudFront as our CDN.

Alternatives?

Other alternatives we've seen to this problem vary.  Paperclip is a great plugin that provides much of the same functionality, but it doesn't provide on-the-fly resizing and is usually attached to a database model (our solution relies on external GUIDs for each image).  Cassandra (or MongoDB with GridFS) could also serve as an alternative backend to S3 if the latency of non-cached requests needs further improvement.
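For the cache-priming idea mentioned above, a minimal sketch might look like this, assuming RMagick, the same local cache directory as before, and a hypothetical list of popular sizes:

```ruby
require 'rmagick'
require 'fileutils'

CACHE_DIR    = '/var/cache/images'                 # assumed cache location
COMMON_SIZES = [[75, 75], [200, 200], [640, 480]]  # hypothetical "popular" sizes

# Generate the common sizes at upload time so the first real request for each
# size is already a local disk-cache hit rather than a round trip to S3.
def prime_cache(guid, original_blob)
  COMMON_SIZES.each do |w, h|
    img = Magick::Image.from_blob(original_blob).first
    img.resize_to_fit!(w, h)
    img.format = 'JPEG'
    img.strip!
    dir = File.join(CACHE_DIR, guid)
    FileUtils.mkdir_p(dir)
    img.write(File.join(dir, "#{w}x#{h}.jpg")) { self.quality = 85 }
  end
end
```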