I'm still working with Pipelines, and things haven't improved since last time. I'm still on the free tier of GAE and still scraping.
I've reached a point where I've refactored my code heavily over the last 9 days.
The flow is quite straightforward: open a page, scrape the information, and follow links, where I'll have additional fields to extract.
The problem lies in how I handle the connection to the server. I can't move away from Soup; it isn't nice to fetch fields using node.parent.parent.parent, but that's the best I can do with Soup.
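To illustrate the parent-chain problem, here is a minimal sketch with hypothetical markup (not the real page): the value I want has no id or class of its own, so the only anchor is nearby text, and I have to climb back up from it.

```python
import re
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

# Hypothetical markup standing in for the real, messy page.
html = """
<table><tr>
  <td><span>Label: something</span></td>
  <td>value-i-want</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find(text=re.compile("something"))
# Climb from the matched text node up to the <tr>: span -> td -> tr.
row = label.parent.parent.parent
value = row.find_all("td")[1].get_text()
print(value)  # value-i-want
```

Any change in nesting depth silently breaks the `.parent` chain, which is why this feels so fragile.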
I'm seriously considering YQL, but I can't get real stats on how many records there are on the page I'm scraping.
One of the pages has 1600 rows that I need to extract. I still can't find the right approach to this scraping; even creating the objects is difficult. It's hard to design objects with Pipelines, since there's no easy way to measure how effective they'll be. There are more constraints if I try to add a lot of parallel jobs, from simple things like data write ops climbing because there are more pipelines, to the sheer amount of data to extract on this page, which is quite brutal.
Each record may take about 41 to 56 write ops, which only allows me around 5 executions before my quota runs out.
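The back-of-envelope math, as I understand it. Every number here is an assumption for illustration (I'm using 50,000 as a stand-in daily write quota; check your own dashboard), not a GAE guarantee:

```python
# All constants are assumptions for illustration, not GAE guarantees.
DAILY_WRITE_QUOTA = 50_000   # assumed free-tier datastore write quota
OPS_PER_RECORD_MAX = 56      # observed worst case per record

def max_runs(records_per_run, ops_per_record, quota=DAILY_WRITE_QUOTA):
    """How many full scraping runs fit inside the daily write quota."""
    return quota // (records_per_run * ops_per_record)

# A mid-sized page: 200 records at the worst-case 56 ops each.
print(max_runs(200, OPS_PER_RECORD_MAX))  # 4

# The 1600-row page doesn't even fit once at 41 ops per record.
print(max_runs(1600, 41))  # 0
```

Under these assumptions, the 1600-row page can't complete a single run inside the quota, which matches what I'm seeing.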
I can't index the entity; so far, all my attempts have failed in production.
The flow goes like this:
- Open the page
- Post with a particular field (5 times)
- [yield] and process HTML
- For each n records, follow the links and
- [yield] and extract
n is an int between 6 (the lowest I've seen) and 1600.
Each followed link is also a yield operation, which uses Soup internally to fetch more data.
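The flow above can be sketched as a generator, with fetching and extraction stubbed out (these stubs are stand-ins, not my real GAE/Soup code):

```python
# Minimal sketch of the control flow; fetch_page / post_search / extract
# are stand-in stubs, not the real urlfetch + Soup code.
def fetch_page(url):
    return "<html>...</html>"                  # stub: open the page

def post_search(page, field):
    return ["/detail/%s/%d" % (field, i) for i in range(3)]  # stub: n links

def extract(detail_html):
    return {"html": detail_html}               # stub: Soup parsing

def scrape(url, fields):
    page = fetch_page(url)                     # 1. open the page
    for field in fields:                       # 2. POST per field (5 times)
        for link in post_search(page, field):  # 3. process HTML, find n records
            yield extract(fetch_page(link))    # 4. follow each link and extract

records = list(scrape("http://example.com", ["a", "b", "c", "d", "e"]))
print(len(records))  # 5 fields x 3 stubbed links = 15
```

With the stubs, n is fixed at 3 per POST; in reality n is 6 to 1600, which is where the fan-out hurts.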
Here's the output; more or less we have this:
| Posts | Children generated per link | Scraped records |
|-------|-----------------------------|-----------------|
| 1     | 20                          | 100%            |
| 5     | Unlimited                   | 0               |
If I search for only 1 record on a page that may have between 6 and 1600 records, and follow only 20 children, the 20 yielded children work fine. The pattern I'm forced to use with Soup is:
```python
soup = BeautifulSoup(fugly_html, from_encoding='ascii')
soup.find(text=re.compile('something'))
```
You may wonder why I'm using a regex and a search like that. Well, the page is a mess and its content may be dynamic; I honestly can't guarantee that things will be where I expect them. That particular line of code works 100% of the time, but only under those conditions.

The second row of the table shows how it goes wrong: if I post 5 times and generate 5 children, and those children generate an unlimited number of connections, the re.compile basically fails. I tried raising the instance's memory and opening the links before yielding the page, but another thing working against me is that each of the links is also a different record, so I can't cache that page and process it multiple times.

I'm seriously considering YQL, but I don't know if this has a workaround. I also have the 10-minute deadline for each pipeline child, and each generated child consumes at least 1 write op in my database, which eats further into my quota...
I may have to rely on Tasks, particularly push queues.
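The push-task idea, sketched with a plain stdlib queue so it stands alone; on GAE the enqueue call would be `google.appengine.api.taskqueue.add` instead, with one task per record link so each task stays far below the 10-minute deadline:

```python
# Stdlib simulation of push tasks; on GAE, enqueue() would call
# taskqueue.add(...) and worker() would run in a push-queue handler.
from collections import deque

task_queue = deque()

def enqueue(record_url):
    task_queue.append({"url": record_url})

def worker(task):
    # Fetch + extract exactly one record per task (stubbed here).
    return "scraped:" + task["url"]

# Fan out: one small task per record link instead of one giant pipeline run.
for url in ["/r/1", "/r/2", "/r/3"]:
    enqueue(url)

results = [worker(task_queue.popleft()) for _ in range(len(task_queue))]
print(results)  # ['scraped:/r/1', 'scraped:/r/2', 'scraped:/r/3']
```

This doesn't reduce the write ops per record, but it would turn one long fragile run into many small retryable ones.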