Monkeyshines: guided scraper
Monkeyshines is a tool for algorithmic scraping.
It’s designed to handle large-scale scrapes that may exceed the capabilities of single-machine relational databases, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, tokyocabinet, etc.), and with distributed job queues (e.g. edamame/beanstalk).
Send Monkeyshines questions to the Infinite Monkeywrench mailing list
Install
This is best run standalone — not as a gem; it’s still in heavy development. I recommend cloning
- http://github.com/mrflip/edamame
- http://github.com/mrflip/wuclan
- http://github.com/mrflip/wukong
- http://github.com/mrflip/monkeyshines (this repo)
into a common directory.
Additionally, you’ll need some of these gems:
- addressable (2.1.0)
- extlib (0.9.12)
- htmlentities (4.2.0)
To build the gem, you’ll need
- git (1.2.2)
- jeweler (1.2.1)
- rake (0.8.7)
- rspec (1.2.6)
- rubyforge (1.0.4)
- sources (0.0.1)
And if you spell ruby with a ‘j’, you’ll want
- jruby-openssl (0.5.2)
- json-jruby (1.1.7)
Request Queue
Periodic requests
The request stream can be metered using read-through caching, a schedule (e.g. cron), or test-and-sleep.
- Scheduled: requests fire on a fixed schedule (e.g. from cron).
- Test-and-sleep: a queue of resources is cyclically polled, sleeping whenever a full pass turns up nothing to do (see the sketch below).
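
Here is a minimal sketch of the test-and-sleep loop. The names (TestAndSleepQueue, due?, scrape!) are hypothetical stand-ins, not the actual Monkeyshines/edamame API:

    # Cycle through a queue of resources; skip any that aren't due yet and
    # sleep whenever a full pass finds nothing to do.
    class TestAndSleepQueue
      def initialize(resources, poll_delay = 5)
        @resources  = resources   # each resource responds to #due? and #scrape!
        @poll_delay = poll_delay  # seconds to sleep when a pass finds no work
      end

      def run
        loop do
          did_work = false
          @resources.each do |resource|
            next unless resource.due?
            resource.scrape!
            did_work = true
          end
          sleep(@poll_delay) unless did_work  # bored: nothing was due this pass
        end
      end
    end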
Requests
- Base: simple fetch and store of URI. (URI specifies immutable unique resource)
- : single resource, want to check for updates over time.
- Timeline: a message stream, e.g. a Twitter search or a user timeline. Want to make paginated requests back to the last-seen item.
- Feed: poll the resource and extract its contents, storing items by GUID. Want to poll frequently enough that a single-page request gives full coverage.
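
To illustrate the Base case, here is a sketch of a simple fetch-and-store request. The class name and the store interface (a save(key, hash) method) are assumptions, not the actual Monkeyshines request classes:

    require 'net/http'
    require 'uri'
    require 'digest/md5'

    # Fetch an immutable resource named by its URL and store it keyed by a URL digest.
    class SimpleRequest
      attr_reader :url, :response, :scraped_at

      def initialize(url)
        @url = url
      end

      def fetch!
        @scraped_at = Time.now.utc
        @response   = Net::HTTP.get(URI(url))   # body of the GET response
        self
      end

      def store_into(store)
        store.save(Digest::MD5.hexdigest(url),
          { 'url' => url, 'scraped_at' => scraped_at.to_i, 'contents' => response })
      end
    end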
Scraper
- HttpScraper
  - JSON
  - HTML
    - \0 separates records, \t separates the initial fields
    - map \ to \\, then map tab, CR, and newline to \t, \r, and \n respectively (the control characters of note are 0x09, 0x0a, 0x0d, 0x7f); see the sketch after this list
- HeadScraper — records the HEAD parameters
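
The flat-record escaping above might look like this in Ruby (a sketch; the helper names are made up and the actual Monkeyshines serialization may differ):

    # Escape a field for the \0-separated-record, \t-separated-field format.
    # Backslash, tab, CR, and newline are replaced in a single pass, which is
    # equivalent to mapping \ to \\ before the control characters.
    FLAT_ESCAPES = { "\\" => "\\\\", "\t" => "\\t", "\r" => "\\r", "\n" => "\\n" }

    def escape_flat_field(str)
      str.gsub(/[\\\t\r\n]/, FLAT_ESCAPES)
    end

    # Join escaped fields with real tabs and terminate the record with \0.
    def flat_record(*fields)
      fields.map { |f| escape_flat_field(f.to_s) }.join("\t") + "\0"
    end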
Store
- Flat file (chunked)
- Key store
- Read-through cache
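
For example, a read-through cache can wrap any key store exposing [] and []= (a sketch using a plain Hash as the backing store; a tokyocabinet or MongoDB wrapper would slot in the same way):

    # Return the cached value if present; otherwise compute it with the block,
    # write it through to the backing store, and return it.
    class ReadThroughCache
      def initialize(backing_store)
        @store = backing_store
      end

      def fetch(key)
        cached = @store[key]
        return cached unless cached.nil?
        value = yield(key)
        @store[key] = value
        value
      end
    end

    cache = ReadThroughCache.new({})
    page  = cache.fetch('http://example.com/') { |url| "fetched #{url}" }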
Periodic
- Log only every N requests, or every t minutes.
- Restart the session every hour.
- Close the file and start a new chunk every 4 hours or so. (This mitigates data loss if a file is corrupted, and makes for easy batch processing.)
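
A sketch of the every-N-requests-or-t-seconds test (a hypothetical class, not the actual Monkeyshines periodic helper):

    # Fires when either N calls have been made or t seconds have elapsed
    # since the last time it fired.
    class Periodic
      def initialize(every_n, every_secs)
        @every_n, @every_secs = every_n, every_secs
        @count, @last_fired   = 0, Time.now
      end

      def ready?
        @count += 1
        return false unless (@count >= @every_n) || (Time.now - @last_fired >= @every_secs)
        @count, @last_fired = 0, Time.now
        true
      end
    end

    logger_check = Periodic.new(1000, 10 * 60)   # every 1000 requests or 10 minutes
    # inside the scrape loop:   log_progress if logger_check.ready?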
Pagination
Session
- Twitter Search: each request brings in up to 100 results in strict reverse-ID (pseudo-time) order. If the last item ID in a request is less than the previous scrape session’s max_id, or if fewer than 100 results are returned, the scrape session is complete. We maintain two scrape_intervals: one spans from the earliest-seen search hit to the highest one from the previous scrape; the other ranges backwards from the highest ID in this scrape session (the first item in the first successful page request) to the lowest (the last item on the most recent successful page request).
  - Set no upper limit on the first request.
  - Request by page, holding the max_id fixed.
  - Use the lowest ID from the previous request as the new max_id.
  - Or use the supplied ‘next page’ parameter.
- Twitter Followers: each request brings in 100 followers, in reverse order of when the relationship formed. A separate call to the user can tell you how many total followers there are, and you can record how many there were at the end of the last scrape, but there’s some slop (if 100 people in the middle of the list unfollow and 100 more people at the front follow, the total stays the same). High-degree accounts may have as many as 2M followers (20,000 calls).
- FriendFeed: up to four pages. A result set of fewer than 100 results signals the end.
- Paginated: one resource, but one that requires one or more requests to retrieve in full.
- Paginated + limit (max_id/since_date): rather than requesting by increasing page number, request pages with a limit parameter until the last item on a page overlaps the previous scrape. For example, say you are scraping search results, the max ID at the time of your last scrape was 120_000, and the current max_id is 155_000. Request the first page with no limit, then use the lowest ID on each page as the new limit_id, until the last result on a page is less than 120_000 (see the sketch after this list).
- Paginated + stop_on_duplicate: request pages until the last one on the page matches an already-requested instance.
- Paginated + velocity_estimate: estimate how many pages to request from the resource’s observed rate. For example, say a user acquires on average 4.1 followers/day and it has been 80 days since the last scrape. With 100 followers per request, you will want to request ceil( 4.1 * 80 / 100 ) = 4 pages.
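
Here is a sketch of the limit-style pagination loop used by the Twitter Search session and the ‘Paginated + limit’ strategy above. Names are hypothetical; it assumes a search_page(max_id) call that returns items (newest first) responding to #id, with nil meaning no upper limit:

    # Page backwards from the newest item, holding max_id at the lowest ID seen,
    # until results overlap the previous session or a short page signals the end.
    def paginate_back_to(prev_max_id, per_page = 100)
      max_id  = nil        # no upper limit on the first request
      results = []
      loop do
        page = search_page(max_id)             # stand-in for the actual API call
        break if page.empty?
        results.concat(page)
        break if page.size < per_page          # short page: no more results
        break if page.last.id < prev_max_id    # overlapped the previous scrape session
        max_id = page.last.id                  # lowest ID so far becomes the new limit
      end
      results
    end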
Rescheduling
We want to schedule the next scrape so that it yields a couple of pages, or at least a mostly-full page. This requires tracking a rate (num_items / timespan), with the resulting interval clamped to min_reschedule / max_reschedule bounds.
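
A sketch of the arithmetic, with made-up bounds (the velocity_estimate page count from the pagination section uses the same rate):

    MIN_RESCHEDULE = 10 * 60        # assumed lower bound: 10 minutes
    MAX_RESCHEDULE = 24 * 60 * 60   # assumed upper bound: 1 day

    # Seconds until the next scrape should yield roughly target_items new items.
    def next_interval(num_items, timespan_secs, target_items = 100)
      rate = num_items.to_f / timespan_secs                       # items per second
      interval = rate.zero? ? MAX_RESCHEDULE : target_items / rate
      interval.clamp(MIN_RESCHEDULE, MAX_RESCHEDULE)
    end

    # Pages needed to catch up: 4.1 followers/day for 80 days at 100/request => 4 pages.
    def pages_needed(rate_per_day, days_since_last, per_page = 100)
      (rate_per_day * days_since_last / per_page).ceil
    end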