SharePoint 2007 – Sitemap XML files for Better Crawling by Search Engines

If you are using SharePoint to build a public-facing web site and you’d like your site to be indexed (sometimes called crawled in SharePoint speak) by the popular Internet search engines (Google, Live Search, Yahoo, etc.), you’ll want to pay particular attention to this article.

By providing two XML files named sitemap.xml and sitemap_index.xml at the root of your site, you will improve your odds of having search engines properly crawl your SharePoint site.  For example: http://www.example.com/sitemap.xml and http://www.example.com/sitemap_index.xml

When search engines crawl your web site, they can use the URLs listed in the sitemap file to discover your pages. The data provided in the sitemap tells the search engine which pages to crawl, how frequently each page is updated and how important each page is relative to the other pages on your site. Sitemap files do not guarantee that your site will be indexed or receive high search engine rankings; however, providing them is an important step to assist search engines, which could lead to better results.

Here’s how the two files work together: sitemap.xml is an XML file that lists the pages in your web site, while sitemap_index.xml is a file that references your sitemap.xml and allows you to provide multiple sitemap files.

Here is an example of what a sitemap.xml document should look like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.example.com/index.html</loc>
    <lastmod>2008-04-20</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
</url>
<url>
    <loc>http://www.example.com/news.html</loc>
    <lastmod>2008-08-22</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
</url>
<url>
    <loc>http://www.example.com/tips.html</loc>
    <lastmod>2008-08-22</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
</url>
</urlset>

Basically, you want to have one <url> tagset for each page on your web site that you want indexed.  The tag <loc> is the only required element inside the <url> tagset; the tags <lastmod>, <changefreq> and <priority> are optional.
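If you would rather generate the file than type it by hand, here is a minimal Python sketch of the idea. It is only an illustration, not how any particular tool does it: the function name build_sitemap and the list-of-dictionaries input format are assumptions made up for this example. It writes one <url> tagset per page, always emits <loc>, and adds the optional tags only when a value is supplied.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as bytes (hypothetical helper for illustration).

    Each page is a dict that must contain "loc"; the keys "lastmod",
    "changefreq" and "priority" are optional and skipped when absent.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child of <url>.
        ET.SubElement(url, "loc").text = page["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # Requires Python 3.8+ for the xml_declaration argument.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        {"loc": "http://www.example.com/index.html",
         "lastmod": "2008-04-20", "changefreq": "monthly", "priority": 0.9},
        {"loc": "http://www.example.com/news.html"},
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

A side benefit of building the file with an XML library is that special characters in URLs, such as the ampersand in a query string, are escaped for you.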

A sitemap.xml file should contain no more than 50,000 URLs. If this is exceeded, you will need to split your URLs across several sitemap files and create a sitemap index file (sitemap_index.xml), which can contain up to 1000 sitemap.xml entries (newer versions of the sitemaps.org protocol allow up to 50,000).  Sitemap files can also be compressed with gzip if they become very large.

Here is an example of what a sitemap_index.xml file should look like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap.xml</loc>
      <lastmod>2008-05-01T16:44:12+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml</loc>
      <lastmod>2008-05-01</lastmod>
   </sitemap>
</sitemapindex>
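To show how the splitting and the index file fit together, here is a rough Python sketch under a few assumptions: the helper names (chunk_urls, build_sitemap_index, plan_sitemaps) are invented for this example, and the file naming (sitemap.xml, sitemap2.xml, ...) simply follows the sample above. It splits a list of URLs into groups of at most 50,000 and produces an index that points at one sitemap file per group.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit mentioned in this article


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls, lastmod=None):
    """Return sitemap index XML (bytes) listing the given sitemap URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod:  # <lastmod> is optional
            ET.SubElement(sitemap, "lastmod").text = lastmod
    # Requires Python 3.8+ for the xml_declaration argument.
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


def plan_sitemaps(base_url, urls, size=MAX_URLS_PER_SITEMAP):
    """Return (index_xml, files), where files maps a file name to its URLs."""
    files = {}
    for n, chunk in enumerate(chunk_urls(urls, size), start=1):
        name = "sitemap.xml" if n == 1 else f"sitemap{n}.xml"
        files[name] = chunk
    index_xml = build_sitemap_index([f"{base_url}/{name}" for name in files])
    return index_xml, files


if __name__ == "__main__":
    urls = [f"http://www.example.com/page{i}.html" for i in range(5)]
    # A tiny chunk size is used here only to make the split visible.
    index_xml, files = plan_sitemaps("http://www.example.com", urls, size=2)
    print(index_xml.decode("utf-8"))
    print({name: len(chunk) for name, chunk in files.items()})
```

Each group of URLs in the returned dictionary would still need to be written out as its own sitemap file; the sketch only plans the split and builds the index.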

The www.sitemaps.org web site has a great page that details the complete specification for sitemap files.

Sitemap and sitemap index files are simple to create if you have a small site, but how would you create them for a large site with many pages, or one with extremely dynamic content (pages and subsites being added and deleted frequently)?  The answer is to use one of the following tools (there may be more) to generate your sitemap files for you.  Both of the following software packages are written to explore a SharePoint site using the SharePoint object model, then create the sitemap files and automatically save them at the root of your site.

KWizCom SharePoint XML Site Map Builder

Tim Dobrinski’s sitemap tool