
Monkeyshines :: scraper

Monkeyshines: guided scraper

Monkeyshines is a tool for doing an algorithmic scrape.

It’s designed to handle large-scale scrapes that may exceed the capabilities of single-machine relational databases, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, tokyocabinet, etc.), and with distributed job queues (e.g. edamame/beanstalk).

Send Monkeyshines questions to the Infinite Monkeywrench mailing list

Install

This is best run standalone — not as a gem; it’s still in heavy development. I recommend cloning

into a common directory.

Additionally, you’ll need some of these gems:

To build the gem, you’ll need

And if you spell ruby with a ‘j’, you’ll want

Request Queue

Periodic requests

The request stream can be metered in one of three ways: read-through, scheduled (e.g. cron), or test-and-sleep.
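As a rough illustration of the test-and-sleep option, here is a minimal sketch. It assumes "test-and-sleep" means: check whether enough time has passed since the last request and, if not, sleep for the remainder. The SleepMeter class and its parameters are hypothetical names made up for this example; they are not part of the Monkeyshines API.

```ruby
# Hypothetical helper for illustration only; not a Monkeyshines class.
class SleepMeter
  attr_reader :min_interval

  # min_interval: minimum number of seconds between two metered calls.
  # clock and sleeper are injectable so the meter can be exercised without real waiting.
  def initialize(min_interval,
                 clock:   -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(secs) { sleep(secs) })
    @min_interval = min_interval
    @clock        = clock
    @sleeper      = sleeper
    @last_at      = nil
  end

  # Test: has min_interval elapsed since the previous call?
  # Sleep: if not, wait out the remainder. Then run the block.
  def meter
    if @last_at
      wait = @min_interval - (@clock.call - @last_at)
      @sleeper.call(wait) if wait > 0
    end
    @last_at = @clock.call
    yield if block_given?
  end
end

# Usage (commented out so that loading this file does not sleep):
#   meter = SleepMeter.new(2.0)
#   urls.each { |url| meter.meter { fetch(url) } }   # urls and fetch are placeholders
```

The injectable clock is a design convenience for testing; a real scraper would likely just use the defaults.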

Requests

Scraper

x9 xa xd x7f

Store

Periodic

Pagination

Session

Rescheduling

The next scrape should be scheduled so that it returns a couple of pages or one mostly-full page. To do that, track a rate (num_items / timespan) and keep the reschedule interval clamped to the min_reschedule / max_reschedule bounds.
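One way to read the rescheduling rule above is sketched below. It assumes the rate is measured in items per second and that the goal is a fixed fraction of a page at the next scrape. Only num_items, timespan, min_reschedule and max_reschedule come from the text; the method name, items_per_page and target_pages are hypothetical.

```ruby
# Hypothetical sketch for illustration only; not the Monkeyshines implementation.
#
# num_items      - items seen during the last observation window
# timespan       - length of that window, in seconds
# items_per_page - assumed page size of the scraped resource
# target_pages   - how many pages' worth of items we want waiting at the next scrape
def next_reschedule_delay(num_items, timespan,
                          items_per_page:, target_pages: 0.9,
                          min_reschedule:, max_reschedule:)
  # With no observed activity there is no rate to divide by: wait as long as allowed.
  return max_reschedule if num_items <= 0 || timespan <= 0

  rate  = num_items.to_f / timespan               # items per second
  delay = (items_per_page * target_pages) / rate  # seconds until target_pages accumulate
  delay.clamp(min_reschedule, max_reschedule)
end

# A very busy source is clamped to the lower bound:
next_reschedule_delay(100_000, 60, items_per_page: 100,
                      min_reschedule: 60, max_reschedule: 86_400)   # => 60
# A nearly silent source is clamped to the upper bound:
next_reschedule_delay(1, 1_000_000, items_per_page: 100,
                      min_reschedule: 60, max_reschedule: 86_400)   # => 86400
```

A target_pages value slightly below 1 aims for the "mostly-full page" case; a value around 2 would aim for "a couple of pages".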

More info

There are many useful examples in the examples/ directory.

Credits

Monkeyshines was written by Philip (flip) Kromer (flip@infochimps.org / @mrflip) for the infochimps project.

Help!

Send monkeyshines questions to the Infinite Monkeywrench mailing list


