I have heard many good things about Google Webmaster Tools, and set out to get brockbouchard.net registered. One of the best features of the webmaster tools is that you can build a “Sitemap” for your site (which is just XML describing your site’s content) and submit it directly to Google. However, generating the Sitemap at first looks like an arduous task. Fortunately, some individuals in the Rails community set out to make the task easier for all of us:
- Alastair Brunton, with improvements from Harry Love, created a means of generating your Sitemap dynamically for each model in your database.
- Phil Misiowiec at Webficient created a tool to generate a Sitemap for your Rails app’s static content.
While each is incredibly useful, I wanted a solution that combined both. I thus took the code created by all of the above, and extended their solutions to generate a Sitemap for both your dynamic and static content all at once. Along the way, I also ran into and fixed a problem with the dynamic Sitemap generator whereby the XML created was a single line and Google was rejecting it with a nondescript error.
To get up and running with all of this, do the following:
1. Make sure you have the “mechanize” gem installed:
sudo gem install mechanize
2. Be sure to create a “sitemaps” subfolder in your [rails_app]/public directory.
3. Copy the two files below to your [rails_app]/lib directory:
# static_crawler.rb
require 'mechanize'

class StaticCrawler
  # EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]
  # BRB - In my case, I want to index document types like doc and pdf
  EXTENSIONS_IGNORED = %w[.csv .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .png .rss .swf .xml]
  PROTOCOLS_IGNORED = %w[feed ftp itms javascript mailto]

  def initialize(starting_url, credentials = nil, quiet_mode = false, sitemap = false, debug = false)
    @bad_pages = []
    @agent = WWW::Mechanize.new
    @sitemap = sitemap
    @debug = debug
    @visited_pages = []
    if credentials
      # Split on the first colon only, so passwords may contain colons
      creds = credentials.split(':', 2)
      @agent.basic_auth(creds[0], creds[1])
    end
    @quiet_mode = quiet_mode
    @starting_url = starting_url
    # Fall back to the URL's host for dotless names such as localhost
    @starting_url_domain = starting_url[/([a-z0-9-]+)\.([a-z.]+)/i] || URI.parse(starting_url).host
    puts "domain: #{@starting_url_domain}" if @debug
    extract_and_call_urls(starting_url)
    generate_sitemap if @sitemap
  end

  def extract_and_call_urls(url)
    # get page
    puts "#{@visited_pages.size+1} #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "error: #{url}, #{exception.message}"
      return
    end
    # for any content types we may have missed above, exit if content type is not html
    return if page.instance_of?(WWW::Mechanize::File) || page.content_type.index('text/html') == nil
    # add to array
    @visited_pages << url
    # get links found on page
    links = page.links
    # for each link, call the url if not in history
    links.each do |link|
      extract_and_call_urls(link.href) unless ignore_url?(link.href) || @visited_pages.include?(link.href)
    end
  end

  private

  def ignore_url?(url)
    begin
      # Skip absolute links that point outside the site being crawled
      return ignored = true if url.nil? ||
        (url.include?('http') && !url.include?(@starting_url_domain)) ||
        @bad_pages.include?(url) ||
        PROTOCOLS_IGNORED.find{ |prt| url =~ /#{prt}:/ } != nil ||
        EXTENSIONS_IGNORED.find{ |ext| url =~ /#{Regexp.escape(ext)}$/ } != nil
    ensure
      puts "ignored: #{url}" if ignored and @debug
    end
  end

  def generate_sitemap
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)
    xml.instruct!
    xml.urlset(:xmlns=>'http://www.sitemaps.org/schemas/sitemap/0.9') {
      @visited_pages.each do |url|
        unless @starting_url == url
          xml.url {
            xml.loc(@starting_url + url)
            xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
            xml.changefreq('weekly')
          }
        end
      end
    }
    save_file(xml_str)
    # BRB - don't need to call this as something similar is called at the end of ModelCrawler
    # update_google
  end

  # Saves the xml file to disk. This could also be used to ping the webmaster tools
  def save_file(xml)
    File.open(RAILS_ROOT + '/public/sitemaps/static.xml', "w+") do |f|
      f.write(xml)
    end
  end

  # Notify google of the new sitemap
  # def update_google
  #   sitemap_uri = @starting_url + '/sitemap.xml'
  #   escaped_sitemap_uri = URI.escape(sitemap_uri)
  #   Net::HTTP.get('www.google.com',
  #                 '/webmasters/tools/ping?sitemap=' +
  #                 escaped_sitemap_uri)
  # end
end
# model_crawler.rb
require 'net/http'
require 'uri'
require 'zlib'

# A class specific to the application which generates a google sitemap from the contents of the database.
# Author: Alastair Brunton
# Modified: Harry Love 2008-06-09
class ModelCrawler
  def initialize(base_url, sources)
    @base_url = base_url
    @sources = sources
  end

  # 1. Iterate through each model's #get_paths method
  # 2. Create sitemap file for each model
  # 3. Create sitemap index file
  # 4. Ping Google
  def generate
    @sources.each do |source|
      # look up the model class by name and call the get_paths method on it
      path_ar = source.constantize.get_paths
      xml = generate_sitemap(path_ar)
      save_file(source, xml)
    end
    index = generate_sitemap_index(@sources)
    save_file('index', index)
    update_google
  end

  # Create a sitemap document for a model
  def generate_sitemap(path_ar)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str)
    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      path_ar.each do |path|
        xml.url {
          xml.loc(@base_url + path[:url])
          xml.lastmod(path[:last_mod])
          xml.changefreq('weekly')
        }
      end
    }
    xml_str
  end

  # Create a sitemap index document
  def generate_sitemap_index(sitemaps)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)
    xml.instruct!
    xml.sitemapindex(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      xml.sitemap {
        xml.loc(@base_url + "/sitemaps/static.xml")
        xml.lastmod(Time.now.strftime('%Y-%m-%d'))
      }
      sitemaps.each do |site|
        xml.sitemap {
          xml.loc(@base_url + "/sitemaps/#{site}.xml.gz")
          xml.lastmod(Time.now.strftime('%Y-%m-%d'))
        }
      end
    }
    xml_str
  end

  # Save the xml file (gzipped) to disk
  def save_file(source, xml)
    File.open(RAILS_ROOT + "/public/sitemaps/#{source}.xml.gz", 'w+') do |f|
      gz = Zlib::GzipWriter.new(f)
      gz.write xml
      gz.close
    end
  end

  # Notify Google of the new sitemap index file
  def update_google
    sitemap_uri = @base_url + '/sitemaps/index.xml.gz'
    escaped_sitemap_uri = URI.escape(sitemap_uri)
    Net::HTTP.get('www.google.com', '/webmasters/tools/ping?sitemap=' + escaped_sitemap_uri)
  end
end
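Note that ModelCrawler calls get_paths on every model you list and reads :url and :last_mod from each entry it gets back, so each of your models needs to provide that class method. Here is a minimal sketch of what it might look like. The Project class below is a plain-Ruby stand-in so the snippet runs on its own; the /projects/:id URL pattern, the updated_at attribute and the all method are my assumptions, so substitute your own routes and finder.

```ruby
# Hypothetical stand-in for an ActiveRecord model. In a real app this
# would be your existing model class backed by the database.
class Project
  attr_reader :id, :updated_at

  def initialize(id, updated_at)
    @id = id
    @updated_at = updated_at
  end

  # Stand-in for a database query (assumption); replace with your finder.
  def self.all
    [new(1, Time.utc(2009, 1, 15, 12, 0, 0)),
     new(2, Time.utc(2009, 2, 1, 8, 30, 0))]
  end

  # Returns one hash per record with the two keys ModelCrawler reads:
  # :url (a path appended to the base URL) and :last_mod (a W3C datetime).
  def self.get_paths
    all.map do |project|
      { :url => "/projects/#{project.id}",
        :last_mod => project.updated_at.utc.strftime('%Y-%m-%dT%H:%M:%S+00:00') }
    end
  end
end

Project.get_paths.each { |path| puts "#{path[:url]} #{path[:last_mod]}" }
```

Because :url is appended directly to the base URL, it should start with a slash and the base URL should not end with one.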
4. Alter deploy.rb
Now you’ll need an entry in your [rails_app]/config/deploy.rb file to copy your Sitemaps over with each new release:
namespace :sitemap do
  desc "Copy the sitemap files after deploy"
  task :copy_sitemap, :roles => :app, :on_error => :continue do
    puts "copying Rails sitemap files"
    run "cp #{previous_release}/public/sitemaps/* #{current_release}/public/sitemaps/"
  end
end
after :deploy, 'sitemap:copy_sitemap'
5. Create a rake task
Now add a rake task to actually perform the Sitemap generation by creating the [rails_app]/lib/tasks/sitemap.rake file and adding the following code:
require 'static_crawler'
require 'model_crawler'

site_url = ENV['URL'] || 'http://localhost:3000'

namespace :sitemap do
  desc 'Crawl the site and create sitemap xml files for both static and dynamic content. Set CREDS as username:password if you are hitting a password protected site.'
  task(:generate => :environment) do
    # Generate static sitemap (the crawl runs inside StaticCrawler#initialize)
    StaticCrawler.new(site_url, ENV['CREDS'], true, true, false)
    # Generate dynamic sitemaps for each of the models listed in the array
    models = %w( Project )
    sitemap = ModelCrawler.new(site_url, models)
    sitemap.generate
  end
end
6. Set up a cron task
Finally, add an entry in your crontab to periodically run the rake task and generate Sitemaps:
30 9 * * * cd /path/to/rails/app && /path/to/rake sitemap:generate URL=http://domain.com RAILS_ENV=production
Be sure to verify the path to your rake command. It can be different on some systems.
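Once the task has run, it can be worth checking that the gzipped files really contain the URLs you expect before trusting the cron job. The following is only a rough sketch of such a check using Ruby's standard zlib library. The sitemap_locs helper is a name I made up for illustration, and the sample XML is written to a temporary directory so the snippet runs anywhere; in practice you would point it at a file such as public/sitemaps/index.xml.gz.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helper: returns every <loc> value found in a gzipped sitemap.
def sitemap_locs(path)
  xml = nil
  Zlib::GzipReader.open(path) { |gz| xml = gz.read }
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten
end

# Throwaway sample data standing in for a generated sitemap file.
sample = '<?xml version="1.0" encoding="UTF-8"?>' +
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
         '<url><loc>http://domain.com/projects/1</loc></url>' +
         '<url><loc>http://domain.com/projects/2</loc></url>' +
         '</urlset>'

Dir.mktmpdir do |dir|
  path = File.join(dir, 'Project.xml.gz')
  Zlib::GzipWriter.open(path) { |gz| gz.write(sample) }
  puts sitemap_locs(path)
end
```

The regular expression is a quick sanity check rather than real XML parsing, which is enough to spot an empty file or URLs built from the wrong base.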