Unignoring files in Bazaar

Bazaar is a great tool for quickly starting to version a Python project. For example, suppose you have this one in particular:

my_project 
 - __init__.py 
 - my_module.py 
 - my_module.pyc 
 - main.py 
 - library.so

To start versioning, at the root level execute the following commands:

$ bzr init
$ bzr add
$ bzr commit -m "Initial commit"

Bazaar, by default, will ignore all .pyc files, so we don't have to worry about committing them by mistake. .pyc files are not the only ones ignored by default: Bazaar will also ignore Vim buffer files (.*swp), dynamically linked libraries (.so) and some others. So, what if we need to "unignore" some of these default patterns?

Here is what we should do

Just create a .bzrignore file at the top level of the project and add the pattern you want to unignore, preceded by a ! mark. For example, if we want to start versioning all .so files, we just need to add the following pattern:

!*.so

If we check our repository status now, it will show:

unknown:
  library.so

Now we can add this file and start versioning it.

Happy "bazaaring"!

Creating a Backbone, RequireJs and Compass application from scratch with Yeoman

This guide will show you how to scaffold a Backbone + RequireJs + Compass application completely from scratch.

Along the way, we are going to cover how to install Node, NPM, RVM, Ruby, Compass and Yeoman (yo, Grunt, Bower). We will cover some of the common pitfalls and how to solve them. This guide assumes you are using a GNU/Linux based operating system; it was tested on Ubuntu 12.04 64-bit.

1. Node.js and NPM

First we install Node.js and the Node Package Manager (NPM). It is important NOT to use sudo when doing this; using sudo will get you into some nasty permission conflicts. Installing Node and NPM can present some difficulties depending on your machine's setup (firewall, users, etc.). I have found that this Gist has the right solution for most cases. Right now we are going to use the first solution.

We install Node:

echo 'export PATH=$HOME/local/bin:$PATH' >> ~/.bashrc
. ~/.bashrc
mkdir ~/local
mkdir ~/node-latest-install
cd ~/node-latest-install
curl http://nodejs.org/dist/node-latest.tar.gz | tar xz --strip-components=1
./configure --prefix=~/local
make install

Then we install NPM:

curl https://www.npmjs.org/install.sh | sh

To verify:

node -v
npm -v

Node should be v0.10.26 or newer, and NPM should be 1.4.3 or newer.

2. Compass

With Node and NPM in place, we need to install Compass.

"Compass is an open-source CSS authoring framework which uses the Sass stylesheet language to make writing stylesheets powerful and easy."

To install Compass, we will need Ruby.

If you already have Ruby installed, verify you have the latest version with

ruby -v

And update your ruby gems with

gem update --system

If you don't, we are going to install Ruby (the latest version at the time of writing is 2.1.1). There are different ways to do so; in this case we choose to do it through Ruby Version Manager (RVM).

"RVM is a command-line tool which allows you to easily install, manage, and work with multiple ruby environments from interpreters to sets of gems."

curl -L https://get.rvm.io | bash -s stable
source ~/.rvm/scripts/rvm
rvm install 2.1.1
gem install compass

verify:

ruby -v
compass -v

Open a new console and try ruby -v again. If you don't have the ruby command anymore, you have to enable "Run command as a login shell" in your console's settings. Why this is so is explained in this article.

3. Yeoman

To setup our application, we are going to use the Backbone generator for Yeoman. It will handle most of the hard work for us.

Yeoman is a great tool by Addy Osmani (the same guy behind the Backbone Fundamentals book) and others that will help you

"by scaffolding workflows for creating modern webapps, while at the same time mixing in many of the best practices that have evolved within the industry.""

It gives you three tools: yo, for scaffolding new apps; Grunt, for building, previewing, testing and any other task in your workflow; and Bower, for managing packages and their dependencies.

We install Yeoman, using the -g flag to tell NPM that the module should be available globally. Note that we don't use sudo at all when installing with npm!

npm install -g yo

If you see the following output when installing Yeoman:

[Yeoman Doctor] Uh oh, I found potential errors on your machine

[Error] NPM root value is not in your NODE_PATH
[info]
NODE_PATH = /usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript
NPM root = ~/local/lib/node_modules

[Fix] Append the NPM root value to your NODE_PATH variable
Add this line to your .bashrc
export NODE_PATH=$NODE_PATH:~/local/lib/node_modules
Or run this command
echo "export NODE_PATH=$NODE_PATH:~/local/lib/node_modules" >> ~/.bashrc && source ~/.bashrc

Just follow the instructions in the [Fix] section (if only all CLIs were this helpful!), adding the NPM root to your NODE_PATH.

For scaffolding our Backbone app, we need a Yeoman generator that knows how to do it. Let's install it:

npm install -g generator-backbone

To verify the installation, run

yo -h

You should see a list of generators under the message "Please choose a generator below." Backbone should be there. If it isn't listed, try

echo "export NODE_PATH=$NODE_PATH:~/local/lib/node_modules" >> ~/.bashrc && source ~/.bashrc

Scaffolding our application

We are almost ready. Create a directory for your application, and generate the Backbone app inside it:

mkdir <my-app-name>
cd <my-app-name>
yo backbone

When asked, select Bootstrap for Sass and RequireJs.

[?] What more would you like? 
 ⬢ Bootstrap for Sass
 ⬡ Use CoffeeScript
‣⬢ Use RequireJs

Next we install the project's dependencies:

npm install
bower install

Running our application

Our application is set up. To try it out, run Grunt with the "serve" task:

grunt serve

Our default browser should open at http://localhost:9000/, showing the scaffolded app.

In the console output you can see the different tasks Grunt performed for us to get the app up and running (tasks are defined in Gruntfile.js). You can see a "connect:livereload" task was run, and a "watch" task is still running. This means that, thanks to LiveReload, you can edit your index.html at /app and the browser will automatically refresh the page with your changes! A huge time-saver.

You can get to work on your Backbone app! The Yeoman generator not only created the app scaffold for us; we can also use it to create the basic parts of any Backbone app (models, views, etc.), for instance the router:

yo backbone:router ''

Happy coding!

True statements about agile software development management

Today on Hacker News I saw one of the best posts about software development management; it summarizes almost all of the don'ts and burdens that all of us have to deal with at some point in our careers.

This is it: Coconut Headphones: Why Agile Has Failed

I like Scrum. I think that, when used correctly, it is a powerful tool to help a team do its best in producing high-quality software. But I also think that having a ScrumMaster whose only programming experience was some Java homework from college and a few lines of Visual Basic is a waste of time for everybody.

The post reminded me of the boss I had before joining Devecoop. He strongly believed in the GitHub model: everyone should be allowed to code a feature the best way they see fit. However, I saw it in practice and it led to a poor architecture, mostly because of the lack of design leadership.

I believe that a minimal, and mostly technical, leadership must exist in every team, to ensure that the voice of the best and most experienced programmers is seriously taken into account (and thus defines the architecture and other important things) while, at the same time, everybody can have opinions.

Now I have to go back to coding. I will elaborate on that last approach soon, stay tuned!

Documenting directory trees with tree

I had to document the directory hierarchy of our running servers. It occurred to me to use the 'tree' command to generate a text file from the directory hierarchy that can then be added to our wiki, where you can also add a brief description of each directory by hand.

The tree command can generate the hierarchy starting from a specific directory. It can print to the screen, write to a text file, or generate an HTML file.

Example:

/usr
├── bin
├── games
├── include
├── lib
├── lib32
├── local
├── sbin
├── share
└── src

9 directories

To install it with apt:

$ sudo apt-get install tree

To copy the output to a text file you can use the -n option (to turn off the color escape characters) and -o to indicate an output file name:

$ tree -d -L 1 -n -o fhs.txt /

You can generate HTML with the -H option, which takes a base HREF as its argument:

$ tree -d -L 1 -n -H . -o fhs.html /

You can specify a pattern of files to include with the -P option and also list several directories to search. Don't forget to add single quotes around the -P pattern to prevent bash from expanding it:

$ tree -P '*.list' sources.list.d/ /etc/apt/

Postgres down in Flask

I'm using Flask and Flask-SQLAlchemy for a personal project. One of the recent Ubuntu upgrades changed postgresql.conf.

Suddenly my Flask-OpenID / Flask-Login integration, or just Flask in general, wasn't working at all.

After a couple of hours of debugging, I found the following link: https://github.com/celery/celery/issues/634

Basically, they added this as a default in postgresql.conf:

ssl = true

Since I don't have SSL set up, I didn't know what hit my box.

I can't find anything directly related in the changelog (http://changelogs.ubuntu.com/changelogs/pool/main/p/postgresql-9.1/postgresql-9.1_9.1.8-0ubuntu12.10/changelog) that mentions why that setting is forced to true. Either way, if you are using Postgres with Celery or Flask, you will notice that the application starts to crash and, through SQLAlchemy, starts throwing OperationalError exceptions.
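If you want to confirm this is what's hitting you, here is a minimal sketch (assuming Flask-SQLAlchemy and a local Postgres without certificates; the database name and credentials are placeholders) that tells libpq not to negotiate SSL at all through the connection URI. The alternative is simply setting ssl = false back in postgresql.conf and restarting Postgres.

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
# sslmode=disable is passed through to psycopg2/libpq and skips SSL entirely
app.config['SQLALCHEMY_DATABASE_URI'] = (
    'postgresql://user:password@localhost/mydb?sslmode=disable'
)
db = SQLAlchemy(app)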

GAE Pipelines

I keep working with Pipelines; things haven't improved since last time. I'm still using the free version of GAE and still scraping.

I've come to a spot where I've refactored my code a lot during the last 9 days.

The flow is quite straightforward: open a page, scrape the information and follow the links, where I'll have additional fields to extract.

The problem lies in how I handle the connection to the server. I can't move away from Soup; it isn't nice to fetch fields using node.parent.parent.parent, but that's the best I can do with Soup.

I'm seriously considering YQL, but I can't get real stats on how many records there are on the page I'm scraping.

One of the pages has 1600 rows that I need to extract. I still can't find the proper approach to handle this scraping; even creating the objects is difficult. It's hard to design objects with pipelines, and there's no easy way to measure how effective they should be. There are more constraints if I think about adding a lot of parallel jobs, from simple things like datastore write ops going up because there are more pipelines, to the fact that the amount of data to extract on this page is quite brutal.

Each record may take about 41 to 56 write ops, which only allows me to run around 5 executions before my quota runs out.

I can't index the entity since, so far, all my attempts have failed in production.

The flow goes like this

  • Open the page
  • Post with a particular field (5 times)
  • [yield] and process HTML
  • For each of the n records, follow the links and
  • [yield] and extract

n is an integer between 6 (the lowest I've seen) and 1600.

Each followed link is also a yield operation, which uses Soup internally to fetch more data.
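To make the shape of this concrete, here is a minimal sketch with the Pipeline API (the class names, URL and extract_links helper are made up for illustration; this is not my actual code):

import pipeline
from google.appengine.api import urlfetch

class ProcessDetailPipeline(pipeline.Pipeline):
    """Child pipeline: fetch one followed link and scrape its extra fields."""
    def run(self, url):
        result = urlfetch.fetch(url)
        # BeautifulSoup parsing of result.content would happen here
        return url

class ScrapeListingPipeline(pipeline.Pipeline):
    """Root pipeline: fetch the listing page and fan out one child per record."""
    def run(self, listing_url):
        result = urlfetch.fetch(listing_url)
        for link in extract_links(result.content):  # hypothetical helper
            # every yield becomes a separate child pipeline run
            # (and more datastore write ops against the quota)
            yield ProcessDetailPipeline(link)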

Here's the output; more or less we have this:

Posts   Children generated per link   Scraped records
1       20                            100%
5       Unlimited                     0

If I search for only 1 record, on a page that may have between 6 and 1600 records, and follow only 20 children, the 20 yielded children work fine. The pattern I'm forced to use with Soup is:

soup = BeautifulSoup(fugly_html, from_encoding='ascii')
soup.find(text=re.compile('something'))

You may wonder: why use a regex and a search like that? Well, the page is a mess, and the content may be dynamic; I honestly can't guarantee that things will be where I expect them to be. That particular line of code works 100% of the time only under those conditions.

The next row of the table above shows how it goes wrong: if I post 5 times, generate 5 children, and those children generate an unlimited number of connections, the re.compile search basically fails. I tried upping the memory of the instance and opening the links before yielding the page, but the other thing playing against me is that each of the links is also a different record, so I can't cache that page and process it multiple times.

I'm seriously considering YQL, but I don't know if this has a workaround. I also have the 10 minute deadline for each pipeline child, and each generated child consumes at least 1 write op in my database, which plays against my quota...

I may have to rely on Task Queues, particularly push queues.
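For reference, enqueueing a push task is roughly this simple (the /scrape_record handler and its parameter are hypothetical, and the execution rate would be capped in queue.yaml):

from google.appengine.api import taskqueue

# a handler mapped to /scrape_record would fetch and parse one record,
# at whatever rate the queue configuration allows
taskqueue.add(url='/scrape_record', params={'record_id': '1234'})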

LXML on Google App Engine

If you read the Google App Engine documentation, you can indeed use lxml if you configure your app.yaml to include that library.

For example

libraries:
- name: lxml
  version: latest

Here's the problem

In my previous posts, I've been writing about using BeautifulSoup for scraping web pages, using either urllib or Mechanize.

When you enable lxml, BeautifulSoup will automatically start using it, which triggers the following error:

ParserError: Unicode parsing is not supported on this platform

Here is the StackOverflow page which shows the same error:

Lxml Unicode Parsererror
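If you still want Soup to work with lxml enabled, one thing that may be worth trying (assuming the bs4 version on the instance accepts an explicit tree builder) is to force the pure-Python parser so it doesn't auto-select lxml. It won't give you xpath, though, which was the whole point of wanting lxml:

from bs4 import BeautifulSoup

# ask for the builtin parser explicitly, so bs4 does not pick the
# lxml builder just because the lxml library is now importable
soup = BeautifulSoup(fugly_html, 'html.parser', from_encoding='ascii')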

Yeah, that was a huge step back in my project. I was happily using xpath expressions, mostly because of the nature of a really complex web page, one so messy that not even with python-chardet can I find out what the damn encoding of the page is.

Soup will break, even if I use ascii as the encoding.

Decoding the string before sending it to Soup won't help either: I couldn't encode or decode it with either ascii or utf-8 without hitting the error from the StackOverflow page linked above.

Mechanize will break if I try to send a custom request using utf-8 as the encoding.

I was determined to keep my xpath expressions, which were clear and allowed me to extract the portions of the page I wanted without too many problems.

So I went to my next option, Py-Dom-Xpath.

I don't have to tell you that this approach didn't work out either... it needs a well-formed source.

I pushed minidom, Soup and this one to work together, but still, no cigar...

I had to suck it up, go back to Soup, and traverse the nodes using regexes and awesome stuff like node.parent.parent.parent.parent.

The reason: I couldn't find an expression to fetch the row based on a string, which the web page awesomely stuffs all together, sometimes like this:

<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>

I need the 33333 value, which is dynamic.

The only workaround I found was to regex 'Hell' and traverse back up to the tr, then ask for the fonts (findAll) to get that 33333.
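A minimal sketch of that traversal, using the fragment above (assuming bs4, which still accepts the old findParent/findAll names):

import re
from bs4 import BeautifulSoup

fragment = '<tr><td><font>Hell:</font></td><td><font>33333</font></td></tr>'
soup = BeautifulSoup(fragment, 'html.parser')
label = soup.find(text=re.compile('Hell'))   # locate the label text
row = label.findParent('tr')                 # walk back up to the containing row
fonts = row.findAll('font')                  # sometimes they are anchors instead
value = fonts[-1].get_text(strip=True)       # the dynamic value, '33333'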

The only known thing about this page is that it will almost always have that structure; sometimes it may use an anchor instead of a font, sometimes the values will have \xa0's and \xc2's, sometimes there are huge white spaces that aren't rendered, and sometimes there are nbsp's too.
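For the \xa0's and the stray whitespace, a small normalization helper (plain Python, nothing Soup-specific) goes a long way:

def normalize(text):
    # replace non-breaking spaces and collapse runs of whitespace
    return u' '.join(text.replace(u'\xa0', u' ').split())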

I had to hack that quickly, mostly due to a time shortage.

It would have been awesome to know this beforehand, since I made the mistake of just uploading to Google App Engine and testing lxml alone.

Here's an important tip

It doesn't matter at all that it works on your side with the SDK; the App Engine production instance is different, and it will bite you back.

I spent time doing TDD, but it's becoming painful to maintain this approach of writing readable code and then hitting these surprises.

Last time I also thought that I had the indexes error fixed, but that wasn't the case.

After I spent time reading about indexes and how to create, vacuum and update them (Indexes), my application, which works with a cron and HRD entities, ran without problems for two days.

Today, when I saw my lxml crash, I also noticed that I started to get the same index error again:

    Generator xxxxxx(*(), **{})#d211aee1aa9511e1ba78ed0703a61199 raised exception. NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date
    Traceback (most recent call last):
      File "pipeline.py", line 2003, in evaluate
        yielded = pipeline_iter.send(next_value)
      File "my_file.py", line 36, in run
        states = controller.fetch_records_to_process()
      File "my_file.py", line 78, in fetch_records_to_process
        rows = self.fetch_states_with_pending_pass()
      File "my_file.py", line 100, in fetch_states_with_pending_pass
        .fetch(self.min_rows_to_process)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2144, in fetch
        return list(self.run(limit=limit, offset=offset, **kwargs))
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2313, in next
        return self.__model_class.from_entity(self.__iterator.next())
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2809, in next
        next_batch = self.__batcher.next()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2671, in next
        return self.next_batch(self.AT_LEAST_ONE)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2708, in next_batch
        batch = self.__next_batch.get_result()
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
        return self.__get_result_hook(self)
      File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2460, in __query_result_hook
        str(exc) + '\nThe suggested index for this query is:\n' + yaml)
    NeedIndexError: no matching index found.
    The suggested index for this query is:
    - kind: my_entity
      properties:
      - name: xx --> boolean
      - name: xxxx --> date

    W2012-05-30 13:27:51.736

Giving up on pipeline ID "d211aee1aa9511e1ba78ed0703a61199" after 3 attempt(s); causing abort all the way to the root pipeline ID "d211aee1aa9511e1ba78ed0703a61199"

I don't know; I guess I'll have to rewrite the damn thing, remove one of the values and see how far it goes, or try to find more information.

Neither vacuuming, updating, nor deleting all the records and re-populating them worked at all.

Google App Engine

My ongoing project is deployed on Google App Engine.

The setup that I'm using includes

  • Python 2.7
  • HRD
  • Pipeline API
  • Mechanize
  • urlfetch
  • BeautifulSoup

I've been mentioning this for a while now, but with a couple of months of experience under my belt, I can say more things about Google App Engine.

I'm using the free version, which has a daily quota. After that, the deployment goes down and there isn't much you can do, except wait until the next day when the quota resets and you are fresh to start again.

What consumes my quota?

Mostly HRD write and read operations, that is, whenever you operate on your database. This also includes the Pipeline tasks, which generate write operations on the HRD as well. Since the nature of my project requires the pipelines, that plays against us too, but it is better contained than the regular write ops on my entities.

What's one of the challenges here?

Creating the entities with the proper indexes to reduce the write operation costs.

I saw a huge drop in write operations when I defined the indexes, though this took time. Finding the information and applying it took time, mostly because of the whole process of learning what my client wants; during this iterative process we learn from our mistakes and improve on the next run.

The entities I'm working with aren't complex and don't use parent/child relations; they are straightforward entities. Why?

The workflow I'm working on heavily uses the datastore to store data scraped from the web. A badly defined index will increase the write operations per record, which means burning through your quota faster.
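As an illustration (the entity and property names are made up, not my real model), marking properties you never filter or sort on with indexed=False skips their per-property index writes:

from google.appengine.ext import db

class ScrapedRecord(db.Model):
    scraped_on = db.DateTimeProperty(auto_now_add=True)  # queried: keep it indexed
    raw_value = db.StringProperty(indexed=False)         # never queried: no index writes
    notes = db.TextProperty()                            # TextProperty is never indexed anyway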

Some stuff to keep in mind

There's a lot of information and opinions going around about Google App Engine, some of them very negative, but after using it for a while, it's clear that you have to choose the proper tool for what you are building.

Yes, sometimes it's a bit frustrating to find out that your application isn't working, or that you suddenly hit your quota and can't operate in production.

So far, I have not had downtime due to Google errors. I mention this because I've seen a lot of "you will have a lot of downtime" claims. That hasn't happened to me so far.

The development SDK may behave differently from what you have in production; I noticed this with the indexes.

One of my workflows was working perfectly locally, but when I uploaded it, I started to get a lot of index errors.

I'm still trying to learn how to stub services. Not the datastore, since that is properly documented; mostly I'm talking about the User stub, which I couldn't make work with testbed.setup_env(), so I had to set os.environ['var'] by hand. I also have some complex stuff to test, which puts me in the position of weighing how much a test would actually solve and let me move forward, rather than consuming my work hours on a test task that won't cover much of the whole application.
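For the record, this is the pattern usually suggested for the user stub with testbed (a sketch, not my actual test code):

import unittest
from google.appengine.ext import testbed

class LoggedInUserTest(unittest.TestCase):
    def setUp(self):
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        # overwrite=True is needed to replace the values the SDK
        # already placed in os.environ
        self.testbed.setup_env(
            user_email='someone@example.com',
            user_id='123',
            user_is_admin='0',
            overwrite=True)
        self.testbed.init_user_stub()

    def tearDown(self):
        self.testbed.deactivate()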

After all, it's a nice exercise for the mind and I'm enjoying it so far; it's something different from what I was doing at my previous company, which was boring and dull.

Google App Engine Myth

I'm still working with App Engine and Python, scraping the same sites, but in this post I'll write about what has happened since my last post.

One of the things I managed to do was to download PDFs using Python Mechanize, urllib and urllib2. You may be thinking, well, that is not something too complex... as a matter of fact, it was, and it was terribly time consuming.

The site I'm scraping serves PDF files, but the files aren't anchors that you can simply click and download. The site uses sessions, yes, like in PHP or any other language. So I actually do not have a "link" per se; I have an input of type image that hits an endpoint with JavaScript. That endpoint refreshes the server session, and when the response comes back, we are redirected (with JavaScript too) to another page, which is the landing page of the site I'm scraping. On the landing page, the server uses my session to detect whether I requested a PDF file, and there it will magically give me the file.

Written like that it doesn't sound complex, but you have to take into account that:

  • I can't use Javascript in Mechanize.
  • The only JavaScript libraries for Python on GAE (such as Python Spidermonkey) don't seem to help much.
  • I can't use Selenium, because that won't run in GAE, and the client that hired me specifically wants to run this in GAE.

So, after a couple of days (I think it took me 2 days to discover how the site worked, using Firebug and analyzing the requests), I came up with this:

browser = self.mechanize_browser
browser._factory.encoding = 'utf-8'
browser.select_form(nr=0)
browser[api_input_name] = api
response = browser.submit(name=search_input_name)
filename = parser.determine_position(response.read(), job_date)
if len(filename) > 0:
   browser.select_form(nr=0)
   # Create a custom request
   data = self.create_custom_api_download_request(api, browser, filename)
   browser.select_form(nr=0)
   # Prepare their ASPSESSION and simulate a submit,
   # that will guarantee
   # a fresh session for the next GET request
   browser.open(main_site, data)
   time.sleep(time_before_opening_page)
   # Now, we indicate their server that we will do a GET
   # this allows us to get the stream
   stream = browser.open(download_url)
   pdf_stream = stream.read()

def create_custom_api_download_request(self, api, browser, event_argument):
        """
        Create a custom request using urllib and return
        the encoded request parameter.
        The keys __EVENTKEY and __EVENTVALIDATION
        are tracking values that the server sends back
        on every page. They change per request
        @var api: String The api of the well
        @var browser: Mechanize.browser
        @var event_argument: The filename
        @return: urllib.urlencode dictionary
        """
        if browser.form is None:
            raise Exception('Form is NONE')
        api_input_name = self.config[self.config_key]['api_input']
        custom_post_params = \
        self.config[self.config_key]['download_post_params']
        payload = {}
        for key in custom_post_params:
            payload[key] = custom_post_params[key]
        payload['__EVENTVALIDATION'] = browser.form['__EVENTVALIDATION']
        payload['__VIEWSTATE'] = browser.form['__VIEWSTATE']
        payload['__EVENTARGUMENT'] = event_argument
        payload[api_input_name] = api
        return urllib.urlencode(payload)

A couple of notes

Using a custom factory encoding for Mechanize was required: since we were reading a raw PDF string, the default factory (i.e., the parser that Mechanize uses to read the data, such as BeautifulSoup) was having problems with the raw PDF stream. Setting browser._factory.encoding = 'utf-8' solved that problem.

Regarding the method determine_position, don't pay too much attention to it; it is just part of the business logic of the site and has to be solved with that method. Let's just say that it locates the PDF "link" in a table, since I can have multiple results.

Then we create a custom request using urllib; that is the method create_custom_api_download_request. With that custom request we feed our Mechanize browser instance and, again, run into more of the site's complexities: if I didn't put that sleep in, I hit the site too fast and got bad responses, so I used a sleep to buy some time. After that, we just use the open method with our custom request, but pointing to the landing page, and voila, I get the PDF.

Downsides of doing this

Well, leaving aside that the whole flow is terribly complex (and I'm only writing about one specific thing I do), using GAE for this kind of task doesn't seem like a very good idea.

GAE MYTH?

Well, now for the main thing. Our client is really focused on, and interested in, using only GAE for this complex scraping app. He pointed me to "tasks", or push tasks, because you can configure the rate of execution, blah blah blah.

Our most important task is PDF scraping, which I do with PdfMiner. The thing is, this is an automated application, and even creating a custom task won't help: it is too "heavy" to run on GAE, it depletes the resources really fast. By that I mean that if you have a $2 budget you will have to come up with a very good rate configuration. PdfMiner is the only good library that can actually give me results in XML that I can parse using lxml. The PDF files I read are complex tables converted to PDF from Microsoft Excel. It was a really complex task to figure out how they worked, but my client provided me with a sample for the first section of the PDF, and I worked out the second part myself.

I can process 10 PDFs per minute; anything higher than that (i.e. 20 tasks per minute, or 20 tasks per second) ends up with the queue dropping tasks because it can't keep up, and my budget gets depleted faster. See, I believe that if you are going to use something as experimental as GAE, you should first spend a lot of time researching, not just throw your cash at it and expect immediate results.

So, even though I got a budget increase of 5 bucks, I still can't get 24 real hours of uptime. The instance is now heavily focused on processing PDFs, but if I enable all the things the instance should be doing, $5 isn't enough! I managed to run with $2 for around 10 real hours, but again, the only thing the application could do was scrape 10 PDFs per minute, and every 15 it was sending HRD results to Fusion Tables (that is complex too). When I say "real hours", I mean real hours: App Engine will show something like 68 hours of uptime, but those are more like 10 real hours.

CONCLUSION

Before jumping into something experimental, research, research, and research even more. Before jumping on the GAE wagon, research a lot, and I can't stress the "a lot" part enough. I don't blame GAE for this; I think it is a great thing from Google, but you have to use the right tool, and it happens that GAE is not the right tool when you don't have any plan and expect it to adapt magically to your needs... read the fine manual!