Extracting high resolution images from hostile web sites
I have a friend in the printed media business. He gets to handle weird requests for technical things which have to happen, no matter what it entails. Sometimes this means wrangling Linux boxes, other times it means Macs, and I bet there are Windows boxes too. He deals with routers, VOIP phones, and their Internet presence(s) too.
But, now and then, he gets a really strange one. They were doing some kind of story about some new products and needed really good images for it. Apparently they were okay to just snag images off the web sites of the vendors involved (or so they said...), but most of the images were relatively low-res. The only way to get high-res ones was to click a "zoom" button, at which point this screwy Flash applet would come up and give a tiny little window into the image.
They needed the highest resolution possible for print purposes because these things were supposed to wind up on actual dead trees. These dumb web sites would only dribble out the data in annoying increments and would never just cough up a nice simple jpeg. They asked him for help, and he came and found me.
Looking at the network traffic revealed that this site used one Flash viewer to pull in some kind of resource which was itself another SWF. It was simple enough to just yank that second file straight out of the network dump for further analysis. Then I found some nice tools online which would split it up somewhat, but that still left these massive blobs behind. My image still wasn't available. I finally had to grovel around in the blobs for a JPEG header, and then just extracted from that to what I guessed was the end of that region. That turned out to be a usable high-res file and we went with it.
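The "grovel around for a JPEG header" step can be sketched in a few lines. This is the same guess-the-end approach I used, not a real JPEG parser: it scans for the SOI marker bytes (FF D8 FF) and then for the first EOI marker (FF D9) after them.

```python
def carve_jpeg(blob: bytes):
    """Return the first JPEG-looking region in blob, or None if none found."""
    # A JPEG starts with the SOI marker FF D8 followed by another marker byte.
    start = blob.find(b"\xff\xd8\xff")
    if start < 0:
        return None
    # The EOI marker FF D9 ends the image; search after the start.
    end = blob.find(b"\xff\xd9", start + 2)
    if end < 0:
        return None
    # Caveat: FF D9 can also show up inside an embedded thumbnail, so a real
    # carver would walk the marker segments instead of trusting the first hit.
    return blob[start:end + 2]
```

Run that over the blob pulled from the network dump, write the returned bytes to a .jpg, and see if it opens.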
Later on, I remembered 'binwalk' from my days of poking around flash ROM images for things like my stupid print server box and realized I should have just used that. It would have found the JPEG and given me proper offsets without having to do hexdump and crazy dd incantations by hand. So much for that.
More recently, this came up again with the content stored on a site which did things differently still. This one had a Flash applet, but then that applet proceeded to load tiles sort of like certain mapping web sites. I could tell it was tile-based since it would show a blurry scaled-up version for a moment before replacing it with a high-res copy as I scrolled around.
It was clear this one wasn't going to be a simple matter of extracting the image as a single blob from the network. Still, I fired up tcpdump and went digging. What I found was a series of hits to some kind of backend process with some rather interesting URL parameters. As I scrolled around, the X and Y coordinates changed. It was clear that this viewer made successive requests as my viewport moved.
Playing around with these URLs showed that it could be made to give nearly any offset within the file. Further, I could set two different corner locations so I could just ask for it to go from (0,0) to relatively big numbers for X and Y to have it hand me the entire image. Still, it was giving me SWF junk and I wanted it as a JPEG. I decided to dig some more.
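The trick of asking for two corners spanning the whole image can be sketched like this. To be clear, the host, path, and parameter names below are invented for illustration; the real ones came out of the packet capture, and only the idea of requesting (0,0) through some big X and Y is from the actual hack.

```python
from urllib.parse import urlencode

def full_image_url(base, width, height, fmt="jpeg"):
    """Build a single request covering (0,0) to (width,height)."""
    params = {
        "x1": 0, "y1": 0,           # top-left corner of the region
        "x2": width, "y2": height,  # bottom-right corner: big numbers also work
        "format": fmt,              # ask for a plain JPEG instead of SWF junk
    }
    return base + "?" + urlencode(params)
```

Once you know what each parameter means, the viewport parameters the viewer sends for its little tiles become a lever: point the corners at the whole image and the backend hands you everything in one shot.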
On the assumption that the URL parameters were unique, I did a web search for a few of them to see what I could find. I came up with a bunch of sites which apparently use that same image-serving technology, but then I also found something far better: documentation for the image server software! Now instead of guessing at the parameter meanings, I could make educated changes to them to get exactly what I wanted.
In the end I was able to deliver a minimal URL which provided a high-res copy of the item without any additional compression artifacts. They got their image and were happy, and I got a nice story to turn into a post and use as an example of what can be done when you poke around and do a little research.
All of this data is already out there. You just have to know how to ask for it.