LXML on Google App Engine

If you read the Google App Engine documentation, you can indeed use lxml, as long as you configure your app.yaml to include that library.

For example:

libraries:
- name: lxml
  version: latest

Here's the problem

In my previous posts, I've been writing about using BeautifulSoup for scraping web pages, using either urllib or Mechanize.

When you enable lxml, BeautifulSoup will automatically start using it, which triggers the following error:

ParserError: Unicode parsing is not supported on this platform

Here's the StackOverflow question that shows the same error:

Lxml Unicode Parsererror
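One way to sidestep the whole mess, by the way, is to stop BeautifulSoup from auto-selecting lxml in the first place. This is just a sketch, assuming BeautifulSoup 4 (which lets you name the tree builder explicitly), not the code I actually shipped:

```python
from bs4 import BeautifulSoup

html = u"<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>"

# Naming the builder explicitly stops bs4 from auto-picking lxml
# just because it happens to be importable on the instance.
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("td")[1].get_text())  # -> 33333
```

You lose lxml's speed, but the pure-Python builder doesn't hit the Unicode parsing error on production.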

Yeah, that was a huge setback for my project. I was happily using XPath expressions, mostly because the page is so complex that not even python-chardet can tell me what the damn encoding is.

Soup will break even if I use ascii as the encoding.

Decoding the string before handing it to Soup won't help either: I couldn't encode or decode it cleanly with either ascii or utf-8, and every attempt hit the error from that StackOverflow page I linked above.

Mechanize will break if I try to send a custom request using utf-8 as the encoding.
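The best fallback I can suggest for a page with an unknowable encoding is a brute-force decoder: try the likely encodings strictly, then give up and replace the bad bytes. A minimal sketch (`tolerant_decode` is a name I made up, and the encoding list is a guess, not something the page guarantees):

```python
def tolerant_decode(raw):
    """Try strict UTF-8 first, then ASCII; as a last resort decode
    UTF-8 with the offending bytes replaced, so parsing can at
    least proceed instead of blowing up."""
    for enc in ("utf-8", "ascii"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.decode("utf-8", errors="replace")
```

You end up with U+FFFD replacement characters instead of the real text for the broken spots, but at least Soup gets clean Unicode.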

I was determined to keep my XPath expressions, which were clear and let me extract the portions of the page I wanted without too much trouble.

So I went to my next option: Py-Dom-Xpath.

I don't have to tell you that this approach didn't work out either... it needs a well-formed source.

I pushed minidom, Soup, and Py-Dom-Xpath to work together, but still, no cigar...
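The well-formedness problem is easy to reproduce: minidom (which Py-Dom-Xpath sits on top of) throws the moment the markup isn't well formed, and real scraped pages almost never are. A quick illustration:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Real-world table markup is rarely well formed: an unclosed td is
# enough to make minidom reject the whole document.
try:
    minidom.parseString("<tr><td><font>Hell:</font></td><td>33333")
    print("parsed")
except ExpatError as e:
    print("not well formed:", e)
```

So any minidom-based XPath layer needs an HTML tidier in front of it, which is exactly the kind of moving part I was trying to avoid.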

I had to suck it up, go back to Soup, and traverse the nodes using regexes and awesome stuff like node.parent.parent.parent.parent

The reason: I couldn't find an expression to fetch the row based on a string, because the web page awesomely mashes everything together, sometimes like this:

<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>

I need the 33333 value, which is dynamic.

Yet the only workaround I found was to regex for Hell, traverse back up to the tr, then ask for the fonts (findAll), and there's my 33333.
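That label-then-sibling trick doesn't strictly need Soup, by the way. Here's a stdlib-only sketch of the same idea, matching a label cell to the cell that follows it (`LabelValueScraper` and `value_after_label` are names I made up for illustration, not my real code):

```python
from html.parser import HTMLParser


class LabelValueScraper(HTMLParser):
    """Collect the text of each <td> so a label cell can be matched
    to the value cell that follows it, ignoring whatever font/anchor
    tags are nested inside."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data


def value_after_label(html, label):
    """Return the text of the cell right after the one containing label."""
    parser = LabelValueScraper()
    parser.feed(html)
    for i, cell in enumerate(parser.cells):
        if label in cell and i + 1 < len(parser.cells):
            return parser.cells[i + 1].strip()
    return None


row = "<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>"
print(value_after_label(row, "Hell:"))  # -> 33333
```

Because it only keys on td boundaries, it doesn't care whether the inner tag is a font or an anchor, which is exactly the variation this page throws at me.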

The only known thing about this page is that it will almost always have that structure; sometimes it puts an anchor instead of a font, sometimes the values contain \xa0's and \xc2's, sometimes there are huge runs of whitespace that aren't rendered, and sometimes there are nbsp's too.
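Cleaning those values is at least mechanical. A small helper along these lines (`clean_cell` is my name for it, assuming the text has already been decoded to Unicode so the \xc2\xa0 byte pairs show up as a single \xa0) flattens the non-breaking spaces and whitespace runs before you compare anything:

```python
import re


def clean_cell(text):
    """Normalize a scraped cell: turn non-breaking spaces (\xa0,
    i.e. decoded nbsp's) into plain spaces, then collapse any
    whitespace run to a single space."""
    text = text.replace(u"\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()
```

After this, the "huge white spaces that aren't rendered" stop mattering for string matching.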

I had to hack that together quickly, mostly due to time pressure.

It would have been awesome to know all this beforehand, since I made the mistake of just uploading to Google App Engine and testing lxml by itself.

Here's an important tip

It doesn't matter at all that it works on your side with the SDK; the App Engine production instance is different, and it will bite you.

I've spent time doing TDD, but it's becoming painful to maintain this approach of writing readable code and then hitting surprises like these.

Last time I also thought I had the index errors fixed, but that wasn't the case.

After I spent time reading about indexes and how to create, vacuum, and update them (Indexes), my application, which runs from a cron and uses HRD entities, worked without problems for two days.

Today, when I saw my lxml crash, I also noticed that I've started getting the same index error again.

    Generator xxxxxx(*(), **{})#d211aee1aa9511e1ba78ed0703a61199 raised exception. NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date
    Traceback (most recent call last):
      File "pipeline.py", line 2003, in evaluate
        yielded = pipeline_iter.send(next_value)
      File "my_file.py", line 36, in run
        states = controller.fetch_records_to_process()
      File "my_file.py", line 78, in fetch_records_to_process
        rows = self.fetch_states_with_pending_pass()
      File "my_file.py", line 100, in fetch_states_with_pending_pass
        .fetch(self.min_rows_to_process)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2144, in fetch
        return list(self.run(limit=limit, offset=offset, **kwargs))
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2313, in next
        return self.__model_class.from_entity(self.__iterator.next())
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2809, in next
        next_batch = self.__batcher.next()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2671, in next
        return self.next_batch(self.AT_LEAST_ONE)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2708, in next_batch
        batch = self.__next_batch.get_result()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
        return self.__get_result_hook(self)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2460, in __query_result_hook
        str(exc) + '\nThe suggested index for this query is:\n' + yaml)
    NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date

    W2012-05-30 13:27:51.736

Giving up on pipeline ID "d211aee1aa9511e1ba78ed0703a61199" after 3 attempt(s); causing abort all the way to the root pipeline ID "d211aee1aa9511e1ba78ed0703a61199"
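For the record, the NeedIndexError above translates almost verbatim into an index.yaml entry, something like this (with the real property names in place of the redacted xx/xxxx, and a direction added on the date if the query sorts descending):

```yaml
indexes:
- kind: my_entity
  properties:
  - name: xx
  - name: xxxx
```

The catch, and part of what I'm fighting, is that after deploying an index.yaml change the index has to finish building before queries stop raising NeedIndexError.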

I don't know; I guess I'll have to rewrite the damn thing and remove one of the values, see how far that goes, or try to find more information.

Neither vacuuming, updating, nor deleting all the records and re-populating worked at all.