Monthly Archives: April 2012

Google App Engine Myth

I'm still working with App Engine using Python, doing scraping of the same sites, but in this post, I'll write about what happened since my last post. One of the things I managed to do, was to download PDF's using Python Mechanize, Urllib and Urllib2. You may be pondering, well that is not something too complex.... as a matter of fact, it was, and it was terribly time consuming. The site I was scraping serves PDF's files, but the files aren't anchors that you can simply click and download. The site uses session, yes, like in PHP or any other language. So, I actually do not have a "link" per se, I have an input type image that with javascript hits an endpoint. That endpoint refreshes the server session and when we have the response back, we will be redirected (with javascript too) to another page, which is the landing page of the site I'm scraping. In the landing page, it will use my session to detect if I requested a PDF file and there the server will magically give me the file. Written like that doesn't sounds complex, but you have to take into account that

  • I can't use Javascript in Mechanize.
  • The only Javascript libraries for python in GAE (such as Python Spidermonkey) doesn't seems to help too much.
  • I can't use Selenium, because that won't run in GAE, and the client that hired me specifically wants to run this in GAE.

So, after a couple of days (I think that it took me 2 days to discover how the site worked using firebug and analyzing the requests) I came up with this.

browser = self.mechanize_browser
browser._factory.encoding = 'utf-8'
browser.select_form(nr=0)
browser[api_input_name] = api
response = browser.submit(name=search_input_name)
filename = parser.determine_position(response.read(), job_date)
if len(filename) > 0:
   browser.select_form(nr=0)
   # Create a custom request
   data = self.create_custom_api_download_request(api, browser, filename)
   browser.select_form(nr=0)
   # Prepare their ASPSESSION and simulate a submit,
   # that will guarantee
   # a fresh session for the next GET request
   browser.open(main_site, data)
   time.sleep(time_before_opening_page)
   # Now, we indicate their server that we will do a GET
   # this allows us to get the stream
   stream = browser.open(download_url)
   pdf_stream = stream.read()
def create_custom_api_download_request(self, api, browser, event_argument):
        """
        Create a custom request using urllib and return
        the encoded request parameter.
        The keys __EVENTKEY and __EVENTVALIDATION
        are tracking values that the server sends back
        on every page. They change per request
        @var api: String The api of the well
        @var browser: Mechanize.browser
        @var event_argument: The filename
        @return: urllib.urlencode dictionary
        """
        if browser.form is None:
            raise Exception('Form is NONE')
        api_input_name = self.config[self.config_key]['api_input']
        custom_post_params = \
        self.config[self.config_key]['download_post_params']
        payload = {}
        for key in custom_post_params:
            payload[key] = custom_post_params[key]
        payload['__EVENTVALIDATION'] = browser.form['__EVENTVALIDATION']
        payload['__VIEWSTATE'] = browser.form['__VIEWSTATE']
        payload['__EVENTARGUMENT'] = event_argument
        payload[api_input_name] = api
        return urllib.urlencode(payload)

A couple of notes

Using a custom factory for mechanize was required, since we were reading a raw PDF string, the default factory (ie, the parser that mechanize uses to read the data, such as BeautifulSoup) was having a problem with the raw pdf stream. So, using browser._factory.encoding = 'utf-8' solved that problem. Regarding the method determine_position, well don't pay attention to that, because that is just part of the business logic that the site has, and it has to be solved using that method, let's just say that the method locates the pdf "link" in a table, since I can have multiple results. Then, we create a custom request using urllib, that is the method create_custom_api_download_request . With that custom request, we will feed our mechanize browser instance and again, more complexities of the site. If I didn't put that sleep, I was going to be hitting the site really fast and I was getting bad responses, so I used a sleep to win some time. After that, we just use the method open, with our custom request but pointing to the landing page, and voila, I will get the pdf.

Downsides of doing this

Well, without taking into account that the whole flow is terribly complex, and I'm just writing about one specific thing I do, using GAE for this kind of tasks doesn't seems a pretty good idea.

GAE MYTH?

Well, now really the main thing. Our client is really focused and interested on using only GAE for this complex scraping app. He pointed me out about using "tasks", or the push tasks, because you could configure the rate of execution, blah blah blah. Our most important task is PDF Scraping, that I do with PdfMiner The thing is, this is an automated application, even creating a custom task won't help it, it is too "heavy" to use in GAE, it depletes the resources really fast. By that, I mean, if you have a $2 budget you will have to come up with a very good rate configuration. Pdfminer is the only good library that can actually give me results in XML that I can parse using lxml. The pdf files that I read, are complex tables converted to pdf from Microsoft Excel. It was a really complex task to figure out how it worked, but my client provided me with a sample for the first section of the pdf, and I worked out with the second part of the pdf. I can process 10 pdf's per minute, any value higher than that (ie, 20 tasks per minute, or 20 tasks per second) will end up with the queue dropping tasks because it can't process, and my budget will be depleted faster. See, I believe that if you are going to be using something experimental as GAE, you should first spend a lot of time researching, not just throw your cash there and expect immediate results. So, even though i got a budget increase of 5 bucks, I still can't have 24 real hours of uptime. The instance now is heavily focused on running pdf's but if I enable all the things that the instance should be doing, $5 isn't enough !. I managed to run with $2 for around 10 real hours, but again, the only thing that the application could do, was to scrape 10 pdf's per minute and every 15, it was sending HRD results to Fusion tables. (That is complex too ). When I say "real hours", I mean real hours, in app engine, it will show up something as 68 hours of uptime, but that are like 10 real hours.

CONCLUSION

Before jumping into something experimental, research , research and research even more. Before jumping into the GAE wagon, research a lot, and I can't stress out the "a lot" part. I don't blame GAE for this, I think that is a great thing from Google, but you have to use the right tool , and it happens to be that GAE is not the right tool when you don't have any plan and expect it to adapt magically to your needs... read the fine manual!