Tag Archives: GAE

GAE Pipelines

I keep working with Pipelines, and things haven't improved since last time. I'm still on the free version of GAE, still scraping.

I've come to a point where I've refactored my code a lot over the last 9 days.

The flow is quite straightforward: open a page, scrape the information and follow links, where I'll have additional fields to extract.

The problem lies in how I handle the connection to the server. I can't move away from Soup; it isn't nice to fetch fields using node.parent.parent.parent, but that's the best I can do with Soup.

I'm seriously considering YQL, but I can't get real stats on how many records there are on the page I'm scraping.

One of the pages has 1600 rows that I need to extract. I still can't find the proper approach to handle this scraping; even creating the objects is difficult. It's hard to design objects with pipelines, and there's no easy way to measure how effective they should be. There are more constraints if I try to add a lot of parallel jobs, from simple things like datastore write ops going up because there are more pipelines, to the fact that the amount of data to extract on this page is quite brutal.

Each record may take about 41 to 56 write ops, which only allows me around 5 executions before my quota runs out.

I can't index the entity; so far, all my attempts have failed in production.

The flow goes like this:

  • Open the page
  • Post with a particular field (5 times)
  • [yield] and process HTML
  • For each of the n records, follow the links, and
  • [yield] and extract

n is an int between 6 (the lowest I've seen) and 1600.

Each followed link is also a yield operation, which uses Soup inside to fetch more data.
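Roughly, the shape of that fan-out with the Pipeline API looks like this (a sketch with made-up names and a placeholder regex, not my actual code):

import re

import pipeline  # the appengine-pipeline library
from bs4 import BeautifulSoup
from google.appengine.api import urlfetch


class ScrapeDetail(pipeline.Pipeline):
    """Follow one link and pull the extra fields out with Soup."""

    def run(self, url):
        html = urlfetch.fetch(url).content
        soup = BeautifulSoup(html, from_encoding='ascii')
        match = soup.find(text=re.compile('something'))
        return unicode(match) if match else None


class ScrapeListing(pipeline.Pipeline):
    """Open the listing page and yield one child pipeline per record link."""

    def run(self, listing_url):
        html = urlfetch.fetch(listing_url).content
        soup = BeautifulSoup(html, from_encoding='ascii')
        for anchor in soup.findAll('a', href=True):
            yield ScrapeDetail(anchor['href'])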

Here's the output; more or less we have this:

Posts   Children generated per link   Scraped records
1       20                            100%
5       Unlimited                     0%

If I search for only 1 record, and on that page (which may have between 6 and 1600 records) I follow only 20 children, the 20 yielded children work fine. The pattern I'm forced to use with Soup is:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(fugly_html, from_encoding='ascii')
soup.find(text=re.compile('something'))

You may wonder why I'm using a regex and a search like that. Well, the page is a mess, and the content may be dynamic; I honestly can't be sure things are going to be where I expect them. That particular line of code works 100% of the time, but only under those conditions.

The next row of the table shows how it goes wrong: if I post 5 times, those posts generate 5 children, those children generate an unlimited number of connections, and the re.compile basically fails. I tried upping the instance memory and opening the links before yielding the page, but the other thing playing against me is that each link is also a different record, so I can't cache one page and process it multiple times. I'm seriously considering YQL, but I don't know if this has a workaround. I also have the 10 minute deadline for each pipeline child, and each generated child consumes at least 1 write op in my database, which plays against my quota...

I may have to rely on Tasks, particularly Push.
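If I do go that route, enqueueing each record would look more or less like this (handler URL and queue name are made up):

from google.appengine.api import taskqueue

# One push task per record link, processed by a handler at the queue's own rate
taskqueue.add(url='/tasks/scrape_record',
              params={'record_url': record_url},
              queue_name='scrape-queue')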

LXML on Google App Engine

If you read the Google App Engine documentation, you can indeed use lxml if you configure your app.yaml to include that library.

For example:

libraries:
- name: lxml
  version: latest

Here's the problem

In my previous posts, I've been writing about using BeautifulSoup for scraping web pages, using either urllib or Mechanize.

When you enable lxml, BeautifulSoup will automatically start using it, which triggers the following error:

ParserError: Unicode parsing is not supported on this platform

Here's the StackOverflow page which shows the same error:

Lxml Unicode Parsererror
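As far as I understand it, bs4 simply picks the best parser it can find, so once lxml is declared in app.yaml Soup switches to it. Pinning the pure-Python parser explicitly would look like this (a sketch; I haven't verified it saves you on these pages):

from bs4 import BeautifulSoup

# Ask for the stdlib parser so bs4 doesn't hand the document to lxml
soup = BeautifulSoup(fugly_html, 'html.parser', from_encoding='ascii')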

Yeah, that was a huge step back for my project. I was happily using XPath expressions, mostly because of the nature of a really complex web page whose damn encoding I can't figure out, not even with python-chardet.

Soup will break, even if I use ascii as the encoding.

Decoding the string before handing it to Soup won't help either; I couldn't encode or decode it with either ascii or utf-8 without hitting the error from that StackOverflow page I linked above.

Mechanize will break if I try to send a custom request using utf-8 as the encoding.

I was determined to keep my XPath expressions, which were clear and let me extract the portions of the page I wanted without too many problems.

So I went to my next option, Py-Dom-Xpath.

I don't have to tell you that this approach didn't work out either... it needs a well-formed source.

I pushed minidom, Soup and this one to work together, but still, no cigar...

I had to suck it up, go back to Soup, and do some traversal over the nodes using regexes and awesome stuff like node.parent.parent.parent.parent.

The reason: I couldn't find an expression to fetch the row based on a string, which the web page sometimes awesomely stuffs all together like this:

<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>

I need the 33333 value, which is dynamic.

The only workaround I found was to regex 'Hell', traverse back to the tr, then ask for the fonts (findAll), and there's my 33333.
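A minimal sketch of that traversal, using the sample row above:

import re
from bs4 import BeautifulSoup

row = '<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>'
soup = BeautifulSoup(row, from_encoding='ascii')

# Find the label text, walk back up to its <tr>, then grab every <font>
# in that row; the last one holds the dynamic value.
label = soup.find(text=re.compile('Hell'))
tr = label.parent.parent.parent      # font -> td -> tr
value = tr.findAll('font')[-1].text  # u'33333'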

The only known thing about this page is that it will almost always have that structure. Sometimes it puts an anchor instead of a font, sometimes the values have \xa0's and \xc2's in them, sometimes there are huge white spaces that aren't rendered, and sometimes there are nbsp's too.

I had to hack that quickly, mostly due to a time shortage.

It would have been awesome to know this beforehand, since I made the mistake of just uploading to Google App Engine and testing lxml alone.

Here's an important tip

It doesn't matter at all that it works on your side with the SDK; the App Engine production instance is different, and it will bite you.

I spent time doing TDD, but it's becoming painful to maintain this approach of creating readable code and then hitting these surprises.

Last time I also thought that I had the indexes error fixed, but that wasn't the case.

After I spent time reading about indexes and how to create, vacuum and update them (Indexes), my application, which runs on a cron and HRD entities, worked without problems for two days.

Today, when I saw my lxml crash, I also noticed that I started getting the same index error again.

    Generator xxxxxx(*(), **{})#d211aee1aa9511e1ba78ed0703a61199 raised exception. NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date
    Traceback (most recent call last):
      File "pipeline.py", line 2003, in evaluate
        yielded = pipeline_iter.send(next_value)
      File "my_file.py", line 36, in run
        states = controller.fetch_records_to_process()
      File "my_file.py", line 78, in fetch_records_to_process
        rows = self.fetch_states_with_pending_pass()
      File "my_file.py", line 100, in fetch_states_with_pending_pass
        .fetch(self.min_rows_to_process)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2144, in fetch
        return list(self.run(limit=limit, offset=offset, **kwargs))
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2313, in next
        return self.__model_class.from_entity(self.__iterator.next())
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2809, in next
        next_batch = self.__batcher.next()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2671, in next
        return self.next_batch(self.AT_LEAST_ONE)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2708, in next_batch
        batch = self.__next_batch.get_result()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
        return self.__get_result_hook(self)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2460, in __query_result_hook
        str(exc) + '\nThe suggested index for this query is:\n' + yaml)
    NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date

    W2012-05-30 13:27:51.736

Giving up on pipeline ID "d211aee1aa9511e1ba78ed0703a61199" after 3 attempt(s); causing abort all the way to the root pipeline ID "d211aee1aa9511e1ba78ed0703a61199"

I don't know, I guess I'll have to rewrite the damn thing and remove one of the values and see how far it goes, or try to find more information.

Neither vacuuming, updating, nor deleting all the records and re-populating worked at all.
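For reference, the composite index the error keeps suggesting would be declared in index.yaml roughly like this (property names are the placeholders from the log):

indexes:
- kind: my_entity
  properties:
  - name: xx
  - name: xxxx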

Google App Engine

My ongoing project is deployed on Google App Engine.

The setup that I'm using includes

  • Python 2.7
  • HRD
  • Pipeline API
  • Mechanize
  • urlfetch
  • BeautifulSoup

I've been mentioning this for a while now, but with a couple of months of experience behind me, I can say more about Google App Engine.

I'm using the free version, which has a daily quota. Once you hit it, the deployment goes down and you can't do much except wait until the next day, when the quota resets and you're fresh to start again.

What consumes my quota?

Mostly HRD write and read operations, that is, whenever you touch your datastore. This also includes the Pipeline tasks, which generate write operations on the HRD as well. Since the nature of my project requires the pipelines, we have that playing against us too, but it's better contained than the regular write ops on my entities.

What's one of the challenges here?

Creating an instance with the proper indexes to reduce the write operation costs.

I saw a huge drop in write operations when I defined the indexes, though this took time. Finding the information and applying it took time, mostly because of the whole process of learning what my client wants; during this iterative process with him, we learn from our mistakes and improve on the next run.

The entities I'm working with aren't complex and don't use parent/child relations; they are straightforward entities. Why?

The workflow I'm working on heavily uses the datastore to store scraped data from the web. A badly defined index will increase the write operations per record, which burns your quota faster.
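The idea, sketched with made-up field names: every indexed property costs extra write ops per entity, so anything I never filter or sort on can skip the index.

from google.appengine.ext import db

class ScrapedRecord(db.Model):
    raw_html = db.TextProperty()              # TextProperty is never indexed
    notes = db.StringProperty(indexed=False)  # not queried, skip the index
    scraped_on = db.DateProperty()            # queried, keep the index
    processed = db.BooleanProperty()          # queried, keep the index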

Some stuff to keep in mind

There's a lot of information and opinion going around about Google App Engine, some of it very negative, but after using it for a while, it's clear that you have to choose the proper tool for the job.

Yes, sometimes it's a bit frustrating to find out that your application isn't working, or suddenly you hit your quota and you can't operate in production.

So far, I haven't had downtime due to Google errors. I mention this because I read a lot of "you will have a lot of downtime" comments. That hasn't happened to me so far.

The development server may behave differently from what you have in production; I noticed this with the indexes.

One of my workflows was working perfectly locally, but when I uploaded it, I started getting a lot of index errors.

I'm still trying to learn how to stub services, not the datastore (that is properly documented), mostly the Users stub, since I couldn't make it work with testbed.setup_env(); I had to fall back to os.environ['var']. I also have some complex stuff to test, which puts me in the position of weighing how much a test would solve and let me move forward, rather than consuming my work hours on a test task that won't cover much of the whole application.
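For the record, the documented pattern for the Users stub looks roughly like this (a sketch of the approach I keep fighting with, not what ended up in my tests):

from google.appengine.api import users
from google.appengine.ext import testbed

tb = testbed.Testbed()
tb.activate()
# The user service reads these environment variables
tb.setup_env(USER_EMAIL='someone@example.com',
             USER_ID='123',
             USER_IS_ADMIN='1',
             overwrite=True)
tb.init_user_stub()

current_user = users.get_current_user()  # returns the fake user
tb.deactivate()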

After all, it's a nice exercise for the mind and I'm enjoying it so far; it's something different from what I was doing previously at my other company, which was boring and dull.

Google App Engine Myth

I'm still working with App Engine using Python, scraping the same sites, but in this post I'll write about what happened since my last one. One of the things I managed to do was download PDFs using Python Mechanize, urllib and urllib2. You may be thinking, well, that's not something too complex... as a matter of fact it was, and it was terribly time consuming.

The site I'm scraping serves PDF files, but the files aren't anchors that you can simply click and download. The site uses a session, yes, like in PHP or any other language. So I don't actually have a "link" per se; I have an input of type image that hits an endpoint with javascript. That endpoint refreshes the server session, and when the response comes back, we get redirected (with javascript too) to another page, which is the landing page of the site I'm scraping. On the landing page, the server uses my session to detect that I requested a PDF file and magically gives me the file.

Written like that it doesn't sound complex, but you have to take into account that:

  • I can't use Javascript in Mechanize.
  • The only Javascript libraries for Python on GAE (such as Python Spidermonkey) don't seem to help much.
  • I can't use Selenium, because that won't run on GAE, and the client that hired me specifically wants this to run on GAE.

So, after a couple of days (I think it took me 2 days to figure out how the site worked using Firebug and analyzing the requests), I came up with this:

import time
import urllib

# (This is an excerpt: browser, api, job_date and the *_name / *_url values
# come from the surrounding class and its config.)
browser = self.mechanize_browser
browser._factory.encoding = 'utf-8'
browser.select_form(nr=0)
browser[api_input_name] = api
response = browser.submit(name=search_input_name)
filename = parser.determine_position(response.read(), job_date)
if len(filename) > 0:
    browser.select_form(nr=0)
    # Create a custom request
    data = self.create_custom_api_download_request(api, browser, filename)
    browser.select_form(nr=0)
    # Prepare their ASPSESSION and simulate a submit,
    # which guarantees a fresh session for the next GET request
    browser.open(main_site, data)
    time.sleep(time_before_opening_page)
    # Now we tell their server that we will do a GET;
    # this allows us to get the stream
    stream = browser.open(download_url)
    pdf_stream = stream.read()

def create_custom_api_download_request(self, api, browser, event_argument):
    """
    Create a custom request using urllib and return
    the encoded request parameters.
    The keys __VIEWSTATE and __EVENTVALIDATION
    are tracking values that the server sends back
    on every page. They change per request.
    @var api: String The api of the well
    @var browser: Mechanize.browser
    @var event_argument: The filename
    @return: urllib.urlencode'd string
    """
    if browser.form is None:
        raise Exception('Form is NONE')
    api_input_name = self.config[self.config_key]['api_input']
    custom_post_params = \
        self.config[self.config_key]['download_post_params']
    payload = {}
    for key in custom_post_params:
        payload[key] = custom_post_params[key]
    payload['__EVENTVALIDATION'] = browser.form['__EVENTVALIDATION']
    payload['__VIEWSTATE'] = browser.form['__VIEWSTATE']
    payload['__EVENTARGUMENT'] = event_argument
    payload[api_input_name] = api
    return urllib.urlencode(payload)

A couple of notes

Using a custom factory setting for Mechanize was required: since we were reading a raw PDF string, the default factory (i.e. the parser that Mechanize uses to read the data, much like BeautifulSoup) had problems with the raw PDF stream. Setting browser._factory.encoding = 'utf-8' solved that. Regarding the method determine_position, don't pay too much attention to it; it's just part of the business logic of the site and has to be solved with that method. Let's just say it locates the PDF "link" in a table, since I can have multiple results.

Then we create a custom request using urllib; that's the method create_custom_api_download_request. We feed that custom request to our Mechanize browser instance and, again, more site complexity: if I didn't put that sleep in, I'd hit the site too fast and get bad responses, so I used a sleep to buy some time. After that, we just use open with our custom request, but pointing at the landing page, and voilà, I get the PDF.

Downsides of doing this

Well, even setting aside that the whole flow is terribly complex (and I'm only writing about one specific thing I do), using GAE for this kind of task doesn't seem like a very good idea.

GAE MYTH?

Well, now for the main thing. Our client is really focused on and interested in using only GAE for this complex scraping app. He pointed me to "tasks", the push tasks, because you can configure the rate of execution, blah blah blah.

Our most important task is PDF scraping, which I do with PDFMiner. The thing is, this is an automated application, and even creating a custom task won't help; it is too "heavy" to run on GAE and depletes the resources really fast. By that I mean: if you have a $2 budget, you'd better come up with a very good rate configuration. PDFMiner is the only good library that can actually give me results in XML that I can parse using lxml. The PDF files I read are complex tables converted to PDF from Microsoft Excel. It was a really complex task to figure out how they worked, but my client provided me with a sample for the first section of the PDF, and I worked out the second part myself.

I can process 10 PDFs per minute; any value higher than that (say, 20 tasks per minute, or 20 tasks per second) ends up with the queue dropping tasks because it can't keep up, and my budget gets depleted faster. See, I believe that if you are going to use something as experimental as GAE, you should first spend a lot of time researching, not just throw your cash at it and expect immediate results.

So, even though I got a budget increase of 5 bucks, I still can't get 24 real hours of uptime. The instance is now heavily focused on processing PDFs, but if I enable everything the instance should be doing, $5 isn't enough! I managed to run on $2 for around 10 real hours, but again, the only thing the application could do was scrape 10 PDFs per minute and, every 15, send HRD results to Fusion Tables (that is complex too). When I say "real hours", I mean real hours: App Engine will show something like 68 hours of uptime, but that's more like 10 real hours.
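For the curious, the rate I settled on translates to a queue.yaml entry more or less like this (the queue name is made up):

queue:
- name: pdf-scrape
  rate: 10/m
  bucket_size: 1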

CONCLUSION

Before jumping into something experimental, research, research and research even more. Before jumping on the GAE wagon, research a lot, and I can't stress the "a lot" part enough. I don't blame GAE for this; I think it's a great thing from Google, but you have to use the right tool, and it happens that GAE is not the right tool when you don't have any plan and expect it to adapt magically to your needs... read the fine manual!