Spider, a Web spidering library for Ruby. It handles the robots.txt,
scraping, collecting, and looping so that you can just handle the data.
== Examples
=== Crawl the Web, loading each page in turn, until you run out of memory
 require 'spider'
 Spider.start_at('http://mike-burns.com/') {}
=== To handle erroneous responses
 require 'spider'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.on :failure do |a_url, resp, prior_url|
     puts "URL failed: #{a_url}"
     puts " linked from #{prior_url}"
   end
 end
=== Or handle successful responses
 require 'spider'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.on :success do |a_url, resp, prior_url|
     puts "#{a_url}: #{resp.code}"
     puts resp.body
     puts
   end
 end
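The +:every+ hook (used again in the URL-graph example below) can log each fetched page regardless of outcome. A minimal sketch, assuming +:every+ fires for successful and failed responses alike:
 require 'spider'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.on :every do |a_url, resp, prior_url|
     # Runs for every fetched page, whatever the response was.
     puts "visited #{a_url} (linked from #{prior_url})"
   end
 end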
=== Limit to just one domain
 require 'spider'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.add_url_check do |a_url|
     a_url =~ %r{^http://mike-burns.com.*}
   end
 end
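A url check can be any Ruby predicate, not just a regexp match. For instance, a sketch that compares the parsed host using the standard-library URI class (only +add_url_check+ here is part of Spider's API):
 require 'spider'
 require 'uri'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.add_url_check do |a_url|
     # Compare the parsed host rather than pattern-matching the string.
     URI(a_url).host == 'mike-burns.com' rescue false
   end
 end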
=== Pass headers to some requests
 require 'spider'
 Spider.start_at('http://mike-burns.com/') do |s|
   s.setup do |a_url|
     if a_url =~ %r{^http://.*wikipedia.*}
       headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
     end
   end
 end
=== Use memcached to track cycles
 require 'spider'
 require 'spider/included_in_memcached'
 SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
 Spider.start_at('http://mike-burns.com/') do |s|
   s.check_already_seen_with IncludedInMemcached.new(SERVERS)
 end
=== Track cycles with a custom object
 require 'spider'
 class ExpireLinks < Hash
   def <<(v)
     self[v] = Time.now
   end

   def include?(v)
     # Treat a URL as already seen if it was visited within the last 24 hours.
     self[v] && (self[v] + 86400) >= Time.now
   end
 end
 Spider.start_at('http://mike-burns.com/') do |s|
   s.check_already_seen_with ExpireLinks.new
 end
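The ExpireLinks example suggests that all Spider needs from a cycle tracker is <tt><<(url)</tt> and <tt>include?(url)</tt>. Assuming that is the whole interface, Ruby's built-in Set works as a simple in-memory tracker:
 require 'spider'
 require 'set'
 Spider.start_at('http://mike-burns.com/') do |s|
   # Set already responds to << and include?, so it can stand in directly.
   s.check_already_seen_with Set.new
 end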
=== Store nodes to visit with Amazon SQS
 require 'spider'
 require 'spider/next_urls_in_sqs'
 Spider.start_at('http://mike-burns.com') do |s|
   s.store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY)
 end
=== Store nodes to visit with a custom object
 require 'spider'
 class MyArray < Array
   def pop
     super
   end

   def push(a_msg)
     super(a_msg)
   end
 end
 Spider.start_at('http://mike-burns.com') do |s|
   s.store_next_urls_with MyArray.new
 end
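As the MyArray example suggests, the store only needs to respond to +push+ and +pop+. A hypothetical LoggingQueue, for instance, could report its size as Spider works through it:
 require 'spider'
 class LoggingQueue < Array
   def pop
     # Report how many URLs are still waiting before handing one out.
     warn "queue size: #{size}"
     super
   end

   def push(a_msg)
     super(a_msg)
   end
 end
 Spider.start_at('http://mike-burns.com') do |s|
   s.store_next_urls_with LoggingQueue.new
 end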
=== Create a URL graph
 require 'spider'
 nodes = {}
 Spider.start_at('http://mike-burns.com/') do |s|
   s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
   s.on(:every) do |a_url, resp, prior_url|
     nodes[prior_url] ||= []
     nodes[prior_url] << a_url
   end
 end
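After the crawl, +nodes+ is an ordinary Hash mapping each page to the URLs it links to; writing it out (here as Graphviz DOT, to an illustrative graph.dot) is plain Ruby and involves no Spider API:
 File.open('graph.dot', 'w') do |f|
   f.puts 'digraph spider {'
   nodes.each do |from, links|
     # One edge per link, from the page where it was found to its target.
     links.each { |to| f.puts "  #{from.to_s.inspect} -> #{to.to_s.inspect};" }
   end
   f.puts '}'
 end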
=== Use a proxy
 require 'net/http_configuration'
 require 'spider'
 http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
                                          :proxy_port => 8881)  
 http_conf.apply do
   Spider.start_at('http://img.4chan.org/b/') do |s|
     s.on(:success) do |a_url, resp, prior_url|
       File.open(a_url.gsub('/',':'),'w') do |f|
         f.write(resp.body)
       end
     end
   end
 end
== Author
John Nagro [email protected]
Mike Burns http://mike-burns.com [email protected] (original author)
Help from Matt Horan and Henri Cook.
With `robot_rules' from James Edward Gray II via
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589