
GAE Pipelines

I keep working with Pipelines, and things haven't improved since last time. I'm still using the free version of GAE and scraping.

I've reached a point where I've refactored my code heavily over the last 9 days.

The flow is quite straightforward: open a page, scrape the information and follow links, where I'll have additional fields to extract.

The problem lies in how I handle the connection to the server. I can't move away from Soup; it isn't nice to fetch fields using node.parent.parent.parent, but that's the best I can do with Soup.

I'm seriously considering YQL, but I can't get real stats on how many records there are on the page I'm scraping.

One of the pages has 1600 rows that I need to extract. I still can't find the proper approach to handle this scraping; even creating the objects is difficult. It's hard to design objects around pipelines, and there's no easy way to measure how effective they should be. There are more constraints as soon as I think about adding a lot of parallel jobs, from simple things like datastore write ops going up because there are more pipelines, to the sheer amount of data to extract on this page, which is brutal.

Each record may take roughly 41 to 56 write ops, which only allows me around 5 runs before my quota runs out.

I can't index the entity, since so far all my attempts have failed in production.

The flow goes like this

  • Open the page
  • Post with a particular field (5 times)
  • [yield] and process HTML
  • For each of the n records, follow the link and
  • [yield] and extract

n is an integer ranging from 6, the lowest I've seen, up to 1600.

Each followed link is also a yield operation, which uses Soup internally to fetch more data, roughly like the sketch below.
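To make that flow concrete, here's a minimal sketch of how I'm structuring it, assuming the appengine-pipeline library is importable as pipeline and that the Soup in use is bs4. The class names, the href pattern and the 'something' regex are placeholders, not the real code.

import re

import pipeline
from bs4 import BeautifulSoup
from google.appengine.api import urlfetch


class ScrapeRecord(pipeline.Pipeline):
    """Child pipeline: follow one record link and extract its fields."""

    def run(self, record_url):
        html = urlfetch.fetch(record_url, deadline=60).content
        soup = BeautifulSoup(html, from_encoding='ascii')
        node = soup.find(text=re.compile('something'))
        # Soup-style traversal: climb up to the enclosing row to reach siblings.
        return node.parent.parent.parent.get_text() if node else None


class ScrapeListing(pipeline.Pipeline):
    """Parent pipeline: open the page, process the HTML, fan out per link."""

    def run(self, listing_url):
        html = urlfetch.fetch(listing_url, deadline=60).content
        soup = BeautifulSoup(html, from_encoding='ascii')
        for anchor in soup.find_all('a', href=re.compile('record')):
            # Every yield spawns a child pipeline (with its own datastore
            # writes), which is exactly what eats into the write-op quota.
            yield ScrapeRecord(anchor['href'])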

Here's the output; more or less we have this:

Posts   Children generated per link   Scraped records
1       20                            100%
5       Unlimited                     0%

If I search for only 1 record on a page that may have between 6 and 1600 records, and follow only 20 children, the 20 yielded children work fine. The pattern I'm forced to use with Soup is

import re
from bs4 import BeautifulSoup  # bs4, since from_encoding is a bs4 argument
soup = BeautifulSoup(fugly_html, from_encoding='ascii')
soup.find(text=re.compile('something'))

You may ask yourself: why use a regex and a search like that? Well, the page is a mess, and the content may be dynamic; I honestly can't guarantee things will be where I expect them. That particular line of code works 100% of the time, but only under those conditions.

The next row of that table explains how it goes: if I post 5 times, generate 5 children, and those children generate an unlimited number of connections, the re.compile basically fails. I tried upping the memory of the instance and opening the links before yielding the page, but the other thing playing against me is that each of the links is also a different record, so I can't cache that page and process it multiple times.

I'm seriously considering YQL, but I don't know if this has a workaround. I also have the 10 minute deadline for each pipeline child, and each generated child consumes at least 1 write op in my database, which also plays against my quota...

I may have to rely on Tasks, particularly Push Queues.
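If I go that route, the idea would be roughly this (a sketch only; the /tasks/scrape handler and the scrape queue name are made up and would live in queue.yaml):

from google.appengine.api import taskqueue


def enqueue_record(record_url):
    # Enqueue one push task per record link; the queue's rate in
    # queue.yaml then throttles how fast the workers actually run.
    taskqueue.add(
        url='/tasks/scrape',               # hypothetical worker handler
        queue_name='scrape',               # hypothetical queue in queue.yaml
        params={'record_url': record_url})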

Google App Engine

My ongoing project is deployed on Google App Engine.

The setup that I'm using includes

  • Python 2.7
  • HRD
  • Pipeline API
  • Mechanize
  • urlfetch
  • BeautifulSoup

I've been mentioning this for a while now, but with a couple of months of experience under my belt, I can say more about Google App Engine.

I'm using the free version, which has a daily quota. Once you hit it, the deployment goes down and there isn't much you can do except wait until the next day, when the quota resets and you're fresh to start again.

What consumes my quota?

Mostly HRD write and read operations, i.e., whenever I touch the datastore. This also includes the Pipeline tasks, which generate write operations on the HRD themselves. Since the nature of my project requires the pipelines, that plays against us too, but it's better contained than the regular write ops I have on my entities.

What's one of the challenges here?

Defining the entities with the proper indexes to reduce the write operation costs.

I saw a huge drop in write operations when I defined the indexes properly, though this took time. Finding the information and applying it took time, mostly because of the whole process of learning what my client wants; during this iterative process with him, we learn from our mistakes and improve on the next run.

The entities I'm working with aren't complex and don't use parent/child relations; they are straightforward entities. Why?

The workflow I'm building heavily uses the datastore to store data scraped from the web. A badly defined index will increase the write operations per record, which burns through your quota faster.
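As a rough illustration of what I mean (a sketch using ndb; the model and property names are invented): every property left indexed costs extra index writes on each put, so anything I never query on gets indexed=False.

from google.appengine.ext import ndb


class ScrapedRecord(ndb.Model):
    # Queried on, so it stays indexed.
    source = ndb.StringProperty()
    # Never used in a query or sort: unindexed, so its index writes go away.
    raw_value = ndb.StringProperty(indexed=False)
    # TextProperty is never indexed anyway.
    notes = ndb.TextProperty()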

Some stuff to keep in mind

There's a lot of information and plenty of opinions going around about Google App Engine, some of them very negative, but after using it for a while, it's clear that you have to choose the proper tool for the job.

Yes, sometimes it's a bit frustrating to find out that your application isn't working, or that you suddenly hit your quota and can't operate in production.

So far, I haven't had downtime due to Google errors. I mention this because I saw a lot of "you will have a lot of downtime" claims; that hasn't happened to me so far.

The development server may behave differently from what you get in production; I noticed this with the indexes.

One of my workflows was working perfectly locally, but when I uploaded it, I started to get a lot of index errors.
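A made-up query of the same shape as mine shows the trap: on the dev server it runs fine (the SDK quietly adds the needed entry to index.yaml for you), while production raises NeedIndexError until the composite index has been deployed and finished building.

from google.appengine.ext import ndb


class Record(ndb.Model):
    source = ndb.StringProperty()
    scraped_at = ndb.DateTimeProperty(auto_now_add=True)


def latest_for(source):
    # Equality filter on one property plus a sort on another needs a
    # composite index; the missing index only shows up once you deploy.
    return (Record.query(Record.source == source)
            .order(-Record.scraped_at)
            .fetch(20))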

I'm still trying to learn how to stub services, not the datastore, since that is properly documented, but mostly the User stub: I couldn't make it work with testbed.setenv(), so I had to use os.environ['var'] instead. I also have some complex things to test, which puts me in the position of weighing how much a test would actually solve and let me move forward, versus burning my work hours on a test task that won't cover much of the whole application.
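For reference, this is roughly what I've been experimenting with; the emails and IDs are placeholders, the method the SDK documents is spelled setup_env, and in my case only the os.environ route behaved consistently.

import os
import unittest

from google.appengine.ext import testbed


class UserStubTest(unittest.TestCase):
    def setUp(self):
        self.tb = testbed.Testbed()
        self.tb.activate()
        # Route A: let testbed fake the logged-in user.
        self.tb.setup_env(USER_EMAIL='someone@example.com',
                          USER_ID='1234',
                          USER_IS_ADMIN='0',
                          overwrite=True)
        self.tb.init_user_stub()
        # Route B: poke the environment variables directly.
        os.environ['USER_EMAIL'] = 'someone@example.com'
        os.environ['USER_ID'] = '1234'

    def tearDown(self):
        self.tb.deactivate()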

All in all, it's a nice exercise for the mind and I'm enjoying it so far; it's something different from what I was doing at my previous company, which was boring and dull.

Google App Engine Myth

I'm still working with App Engine using Python, scraping the same sites, but in this post I'll write about what has happened since my last post. One of the things I managed to do was download PDFs using Python Mechanize, urllib and urllib2. You may be thinking, well, that's not something too complex... as a matter of fact, it was, and it was terribly time consuming.

The site I'm scraping serves PDF files, but the files aren't anchors you can simply click and download. The site uses a session, yes, like in PHP or any other language. So I don't actually have a "link" per se; I have an input of type image that hits an endpoint with JavaScript. That endpoint refreshes the server session, and when the response comes back we are redirected (with JavaScript too) to another page, which is the landing page of the site I'm scraping. On the landing page, the server uses my session to detect whether I requested a PDF file, and there it will magically give me the file.

Written like that it doesn't sound complex, but you have to take into account that:

  • I can't use JavaScript in Mechanize.
  • The only JavaScript libraries for Python on GAE (such as Python Spidermonkey) don't seem to help much.
  • I can't use Selenium, because that won't run on GAE, and the client that hired me specifically wants this to run on GAE.

So, after a couple of days (I think it took me 2 days to figure out how the site worked, using Firebug and analyzing the requests), I came up with this:

# Snippet from the scraping flow; self.mechanize_browser, parser,
# api_input_name, search_input_name, main_site, download_url and
# time_before_opening_page all come from the enclosing class / config.
browser = self.mechanize_browser
browser._factory.encoding = 'utf-8'
browser.select_form(nr=0)
browser[api_input_name] = api
response = browser.submit(name=search_input_name)
filename = parser.determine_position(response.read(), job_date)
if len(filename) > 0:
    browser.select_form(nr=0)
    # Create a custom request
    data = self.create_custom_api_download_request(api, browser, filename)
    browser.select_form(nr=0)
    # Prepare their ASPSESSION and simulate a submit,
    # that will guarantee a fresh session for the next GET request
    browser.open(main_site, data)
    time.sleep(time_before_opening_page)
    # Now, we indicate to their server that we will do a GET;
    # this allows us to get the stream
    stream = browser.open(download_url)
    pdf_stream = stream.read()
def create_custom_api_download_request(self, api, browser, event_argument):
    """
    Create a custom request using urllib and return
    the encoded request parameters.
    The keys __VIEWSTATE and __EVENTVALIDATION
    are tracking values that the server sends back
    on every page. They change per request.
    @var api: String The api of the well
    @var browser: Mechanize.browser
    @var event_argument: The filename
    @return: URL-encoded string of POST parameters (urllib.urlencode)
    """
    if browser.form is None:
        raise Exception('Form is NONE')
    api_input_name = self.config[self.config_key]['api_input']
    custom_post_params = \
        self.config[self.config_key]['download_post_params']
    payload = {}
    for key in custom_post_params:
        payload[key] = custom_post_params[key]
    payload['__EVENTVALIDATION'] = browser.form['__EVENTVALIDATION']
    payload['__VIEWSTATE'] = browser.form['__VIEWSTATE']
    payload['__EVENTARGUMENT'] = event_argument
    payload[api_input_name] = api
    return urllib.urlencode(payload)

A couple of notes

Using a custom factory encoding for Mechanize was required: since we're reading a raw PDF string, the default factory (i.e., the parser Mechanize uses to read the data, such as BeautifulSoup) was having problems with the raw PDF stream. Setting browser._factory.encoding = 'utf-8' solved that.

Regarding the method determine_position, don't pay much attention to it; it's just part of the site's business logic and has to be solved that way. Let's just say it locates the PDF "link" in a table, since I can have multiple results.

Then we create a custom request using urllib; that's the method create_custom_api_download_request. With that custom request we feed our Mechanize browser instance, and again, more site complexities: if I didn't put that sleep in, I was hitting the site too fast and getting bad responses, so I used a sleep to buy some time. After that, we just call open() with our custom request, but pointing to the landing page, and voilà, I get the PDF.

Downsides of doing this

Well, even leaving aside that the whole flow is terribly complex (and I'm only writing about one specific thing I do), using GAE for this kind of task doesn't seem like a very good idea.

GAE MYTH?

Well, now the main thing. Our client is really focused on, and interested in, using only GAE for this complex scraping app. He pointed me towards using "tasks", i.e. push tasks, because you can configure the rate of execution, blah blah blah.

Our most important task is PDF scraping, which I do with pdfminer. The thing is, this is an automated application, and even creating a custom task won't help: the work is too "heavy" for GAE and depletes the resources really fast. By that I mean, if you have a $2 budget you will have to come up with a very good rate configuration. pdfminer is the only good library that can actually give me results in XML that I can parse using lxml. The PDF files that I read are complex tables converted to PDF from Microsoft Excel. It was a really complex task to figure out how they worked, but my client provided me with a sample for the first section of the PDF, and I worked out the second part myself.

I can process 10 PDFs per minute; anything higher than that (i.e. 20 tasks per minute, or 20 tasks per second) ends up with the queue dropping tasks because it can't keep up, and my budget gets depleted faster. See, I believe that if you are going to use something experimental like GAE, you should first spend a lot of time researching, not just throw your cash at it and expect immediate results. So, even though I got a budget increase of 5 bucks, I still can't get 24 real hours of uptime. The instance is now heavily focused on processing PDFs, but if I enable everything the instance should be doing, $5 isn't enough! I managed to run on $2 for around 10 real hours, but again, the only thing the application could do was scrape 10 PDFs per minute and, every 15 minutes, send HRD results to Fusion Tables (that is complex too). When I say "real hours", I mean real hours: App Engine will show something like 68 hours of uptime, but those are more like 10 real hours.
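For the curious, the PDF-to-XML step looks roughly like this (a sketch only; the file name is a placeholder and the page-iteration API differs between pdfminer versions):

from cStringIO import StringIO

from lxml import etree
from pdfminer.converter import XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_xml(fileobj):
    """Run pdfminer's layout analysis and return the XML as a string."""
    out = StringIO()
    manager = PDFResourceManager()
    converter = XMLConverter(manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    for page in PDFPage.get_pages(fileobj):
        interpreter.process_page(page)
    converter.close()
    return out.getvalue()


with open('sample.pdf', 'rb') as fp:
    tree = etree.fromstring(pdf_to_xml(fp))
    # From here on, lxml does the table reconstruction.
    textboxes = tree.findall('.//textbox')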

CONCLUSION

Before jumping into something experimental, research, research and research some more. Before jumping onto the GAE wagon, research a lot, and I can't stress the "a lot" part enough. I don't blame GAE for this; I think it's a great thing from Google, but you have to use the right tool, and it happens that GAE is not the right tool when you don't have a plan and expect it to adapt magically to your needs... read the fine manual!