SEO For ASP.NET Web Sites: Getting content indexed

This is the last in a series of articles that explore how to optimise ASP.NET web sites for search engines. Each article in the series concentrates on a specific topic and looks at the features and tools available to ASP.NET developers to help make sites search engine friendly and thereby improve rankings and click throughs from search results. This article explores how to ensure that search engines are aware of all of the content you want them to crawl and how to inform them when the location of that content is no longer valid.

The full series of articles consists of

Sitemaps

Search engines generally follow links, which is how they build up a picture of the world wide web. If your site content is well linked, most search engines will be able to locate and crawl all of it. Nevertheless, you may want to consider constructing and submitting a sitemap to Google, Bing or any other search engine, not only to help them locate your content, but also to help them understand more about it, such as its relative importance among all other content on your site, or when it was last modified.

A sitemap is a structured document. Google and Bing both support multiple XML-based formats, including the Sitemap 0.9 protocol, RSS 2.0 and Atom 1.0, and each also supports some formats of its own. The sitemap should include the URLs of all the pages that you would like the search engines to crawl and index. The .NET Framework includes an API for building RSS and Atom feeds in the System.ServiceModel.Syndication namespace, but there is nothing that supports the sitemap protocol. It is an uncomplicated structure, though, so building your own sitemap generator is not difficult. First, here's an overview of the structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mikesdotnetting.com/</loc>
    <lastmod>2015-12-01T12:00:00Z</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

The root of the document is urlset, which contains a collection of url nodes. The only required node within the url element is loc, which contains the URL of a resource. The lastmod node holds the date and time that the resource was last modified, in W3C Datetime format. You can use the changefreq node to suggest how frequently the resource is likely to be updated, using one of a predefined list of values. Finally, the priority node lets you tell a search engine what priority you want to give a resource (from 0.0 to 1.0) relative to other resources on your site. Guidance on how to use this measure is provided on the sitemaps.org site.

The requirements for generating a sitemap are: a class to represent the sitemap and its items; an enumeration for the changefreq values; a means of populating the items; and a way to serialize the populated sitemap to XML. First the classes for the SiteMap and its items:

public class SiteMap
{
    public List<SiteMapItem> Items { get; set; }
}

public class SiteMapItem
{
    public Uri Loc { get; set; }
    public DateTime? LastMod { get; set; }
    public ChangeFrequency ChangeFreq { get; set; }
    public double? Priority { get; set; }
}

Here's the ChangeFrequency enumeration to ensure that only valid values are available for selection:

public enum ChangeFrequency
{
    NotSet,
    Always,
    Hourly,
    Daily,
    Weekly,
    Monthly,
    Yearly,
    Never
}

The SiteMapWriter class is responsible for serialising a SiteMap instance to XML. It relies on System.Linq, System.Xml and System.Xml.Linq, so you need to include using directives for those namespaces:

public class SiteMapWriter
{
    private SiteMap _siteMap;

    public SiteMapWriter(SiteMap siteMap)
    {
        _siteMap = siteMap;
    }

    public void WriteTo(XmlWriter writer)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        var xml = new XDocument(new XDeclaration("1.0", "utf-8", "yes"),
            new XElement(ns + "urlset",
                _siteMap.Items.Select(item => new XElement(ns + "url",
                    new XElement(ns + "loc", item.Loc),
                    item.LastMod.HasValue ? new XElement(ns + "lastmod", item.LastMod.Value.ToString("s") + "Z") : null,
                    item.ChangeFreq != ChangeFrequency.NotSet ? new XElement(ns + "changefreq", item.ChangeFreq.ToString().ToLower()) : null,
                    item.Priority.HasValue ? new XElement(ns + "priority", item.Priority.Value) : null
                ))
            )
        );
        xml.Save(writer);
    }
}

This code takes a SiteMap instance and iterates its Items property, which is a collection of SiteMapItems. It will generate url nodes for each item, only outputting child nodes that have values assigned. How you create your SiteMap will depend very much on the nature of your site. Mine is generated from data retrieved from the database along with some hard coded entries for the home page:

public SiteMap GetSiteMap()
{
    using (DotnettingContext context = new DotnettingContext())
    {
        var items = context.Articles.Select(a => new 
        {
            ID = a.ArticleID,
            Title = a.Headline,
            LastModified = a.DateAmended,
            DateCreated = a.DateCreated
        }).ToList().Select(i => new SiteMapItem
        {
            Loc = new Uri(string.Format("http://www.mikesdotnetting.com/article/{0}/{1}", i.ID, i.Title.ToSlug())),
            ChangeFreq = ChangeFrequency.Daily,
            LastMod = i.LastModified ?? i.DateCreated
        }).ToList();

        items.Add(new SiteMapItem {
            Loc = new Uri("http://www.mikesdotnetting.com"),
            ChangeFreq = ChangeFrequency.Hourly
        });
        return new SiteMap { Items = items };
    }
}

If you have a Web Forms site, a generic handler (.ashx file) is a good choice for exposing a SiteMap:

public class sitemap : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        var service = new SiteMapService();
        var siteMap = service.GetSiteMap();
        var siteMapWriter = new SiteMapWriter(siteMap);
        using (var output = XmlWriter.Create(context.Response.Output))
        {
            context.Response.ContentType = "text/xml";
            siteMapWriter.WriteTo(output);
        }
    }

    public bool IsReusable
    {
        get
        {
            return false;
        }
    }
}

The code above assumes that the GetSiteMap method is housed in a class called SiteMapService. You can put it anywhere that suits your application. The code then takes the XML generated by the SiteMapWriter class and writes it to the Response when the handler is requested and the ProcessRequest method is called by the ASP.NET runtime.
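
By default, the handler above is requested at /sitemap.ashx. If you would rather expose it at an extensionless URL such as /sitemap (the address used in the robots.txt example later in this article), one option is to map a route to the handler. The following is just a minimal sketch using ASP.NET routing; the SitemapRouteHandler and RegisterRoutes names are purely illustrative, and you need using directives for System.Web and System.Web.Routing:

public class SitemapRouteHandler : IRouteHandler
{
    public IHttpHandler GetHttpHandler(RequestContext requestContext)
    {
        // Serve the generic handler shown above for requests to the mapped route
        return new sitemap();
    }
}

public static class RouteConfig
{
    public static void RegisterRoutes(RouteCollection routes)
    {
        // Call this from Application_Start in Global.asax
        routes.Add("Sitemap", new Route("sitemap", new SitemapRouteHandler()));
    }
}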

In an MVC application, you are likely to create a custom ActionResult to serve up the sitemap:

public class SiteMapResult : ActionResult
{
    private SiteMap _siteMap;

    public SiteMapResult(SiteMap siteMap)
    {
        _siteMap = siteMap;
    }

    public override void ExecuteResult(ControllerContext context)
    {

        context.HttpContext.Response.ContentType = "text/xml";

        var siteMapWriter = new SiteMapWriter(_siteMap);

        using (var output = XmlWriter.Create(context.HttpContext.Response.Output))
        {
            siteMapWriter.WriteTo(output);
        }
    }
}

The code for this is very similar to the generic handler. The main difference is that the sitemap is passed in to the ActionResult's constructor rather than being instantiated within the class. The SiteMapResult is used in a controller action:

public SiteMapResult SiteMap()
{
    var siteMap = _service.GetSiteMap();
    return new SiteMapResult(siteMap);
}
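
With the default route, the action above would be reached at /sitemap/sitemap (assuming it lives in a controller named SiteMapController). If you want the tidier /sitemap URL, you can register a dedicated route ahead of the default one in RouteConfig. A minimal sketch; the route name and controller name are assumptions:

routes.MapRoute(
    name: "SiteMap",
    url: "sitemap",
    defaults: new { controller = "SiteMap", action = "SiteMap" }
);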

Once you have created your sitemap, you need to tell search engines about it. There are two ways to do this. The first is to register the sitemap file individually with each search engine. Google and Bing both provide webmaster tools that enable you to submit your sitemap to them. I'm not sure about other search engines, but I guess it would be quite time consuming to search around for such a feature and make individual submissions. Fortunately, the second method for informing search engines about your sitemap is a lot easier: do it in your robots.txt file.

Robots.txt

A robots.txt file is a way for site owners to give instructions to web robots, including search engine crawlers, about what they should and shouldn't crawl. It is a plain text file that must be placed in the root folder of your site. It is requested by crawlers that observe what is known as the /robots.txt standard, so that they can honour the directives it contains. My robots.txt file contains the following:

sitemap:	http://www.mikesdotnetting.com/sitemap
user-agent: *
disallow: /downloads/

The first line contains the sitemap directive and points to the canonical URL of my sitemap. The second line specifies that what follows should apply to all user agents (bots, crawlers etc). The third line says that the contents of the downloads directory should not be indexed. This directory holds a few zip files containing working samples of code for some of my articles. The article includes a link to the download file, so a search engine will certainly crawl, and may index, the content. I could also remove the files from the attentions of search engines by adding rel="nofollow" to the download link itself, but this approach has two main flaws: first, I have to remember to do it each and every time, and second, someone else might link to the zip file without adding the nofollow attribute to their link.

You might be tempted to exclude robots from crawling your images, JavaScript or CSS files. However, this would be a very bad idea. Google in particular requests access to these files so that it can download your complete page as a user would see it, partly so that it can assess download speed, which plays a part in its ranking algorithm. In fact, Google states that blocking these resources may well result in suboptimal rankings for your content.
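
If a folder that you have disallowed happens to contain scripts or stylesheets that your pages rely on, you can explicitly re-allow them. The allow directive and the wildcard/end-of-URL ($) syntax are not part of the original robots.txt standard, but the major search engines support them. This is purely an illustrative sketch, not taken from my actual file:

user-agent: *
disallow: /downloads/
allow: /*.css$
allow: /*.js$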

HTTP Status Codes

In many evolving dynamic sites, it's not uncommon for URL structures to change and for items to move about or disappear altogether. It is also a fact of life that your URLs can get mangled when visitors attempt to reproduce them on their own site, in forums, on social media and so on. The correct use of HTTP status codes is your best tool for informing search engines about any updates they should make to their index when they attempt to crawl content that no longer exists, never existed in the first place, or has moved. The most common HTTP Status codes that you will work with are:

  • 301 - Moved Permanently
  • 302 - Found (Moved Temporarily)
  • 404 - Not Found
  • 500 - Internal Server Error

301 Moved Permanently is used to tell search engines that the URL they requested has been replaced permanently with another one. This is the status code returned by the Friendly URLs framework if you request an .aspx file directly. It is also the code returned by the URL Rewrite module when you specify the redirectType as permanent. 301 is also returned when you use Response.RedirectPermanent in Web Forms or the RedirectPermanent or RedirectToRoutePermanent methods in ASP.NET MVC. When you return a 301 status code, any SEO value that the original URL has earned will be transferred to the new location.
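
For example, if an article's URL changes, an MVC action mapped to the old URL pattern can issue the 301 itself. This is a hedged sketch rather than code from my own site; the OldArticle action and the "Article" route name are assumptions:

public ActionResult OldArticle(int id)
{
    // Returns a 301 so that crawlers update their index with the new URL
    // and transfer any value earned by the old one
    return RedirectToRoutePermanent("Article", new { id });
}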

302 Found (Moved Temporarily) serves no useful purpose from an SEO perspective. It is the status code that's generated whenever you use Response.Redirect in Web Forms, or the Redirect and RedirectToRoute methods in MVC. When you return a 302, any SEO value that the original URL has earned stays with the old URL and search engine indexes are not updated.

404 Not Found tells the search engine that the requested URL is not valid and that there is no replacement. Search engines will drop such URLs from their index. Advice from SEO experts is that you should create a custom 404 page providing alternative navigation options instead of relying on the default web server 404 page. That way, you are more likely to retain the visitor.

500 Internal Server Error is returned when an unhandled error occurs on the web server. Ideally, you should never return a 500 status code. However, there are circumstances outside of your control that may result in unhandled errors, such as loss of database connectivity, files being moved, SMTP servers being down and so on. You should log all 500 errors and examine the causes, mitigating them where possible. If your application raises exceptions because a request contains a malformed URL - for example, you expected and tried to process an int in your code, but it was missing or out of the expected range of values - you should raise a new exception with an HTTP status code of 404 instead and show the user your custom 404 page.
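
For example, an MVC action that expects a valid article id can convert a bad or missing value into a 404 rather than letting it surface as a 500. This is only a sketch; the Article action and the GetArticle method on the service class are assumptions here:

public ActionResult Article(int id = 0)
{
    var article = _service.GetArticle(id);
    if (article == null)
    {
        // Raise a 404 so the custom Not Found page is shown instead of a server error
        throw new HttpException(404, "Article not found");
    }
    return View(article);
}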

Configuring Custom Error Pages With The Correct Status Code

This section shows how to configure error pages for 404 and 500 errors in Web Forms and in MVC 5 and earlier. Most existing examples show how to configure custom errors using the customErrors section in the web.config file. This should only really be used if you are hosting on IIS 6. I recommend using the httpErrors section within system.webServer in the web.config file instead. If you follow this approach, you should remove the customErrors section.

In a Web Forms application, you should create separate pages specifically to handle 404 and 500 errors. In MVC, you will create separate views. The 404 error page/view is displayed when a resource cannot be found, but the application itself is still working, so it can safely make use of master or layout pages that include dynamic content drawn from a database, such as a navigation system. The page to be displayed in the event of a 500 error should not include any dynamic content. If the cause of the 500 error is a database connection failure, your custom error page will never be shown if it relies on a master or layout page that tries to connect to the database.

Here's an example of the relevant section of a web.config file in a Web Forms application:

<system.webServer>
  <httpErrors errorMode="DetailedLocalOnly" existingResponse="Replace">
    <remove statusCode="404" />
    <remove statusCode="500"/>
    <error statusCode="404" path="/NotFound.aspx" responseMode="ExecuteURL"/>
    <error statusCode="500" path="/Error.aspx" responseMode="ExecuteURL"/>
  </httpErrors>
</system.webServer>

The errorMode value specifies that the custom page will be displayed to remote users, but that detailed error pages will be displayed if the client is local - e.g. during development. A commonly seen alternative for the errorMode value is Custom, which results in the custom pages being displayed every time. Both NotFound.aspx and Error.aspx are located in the root of the site.

The problem with this setup is that, by default, an HTTP status code of 200 (OK) is returned when the custom pages are displayed. That's no use from an SEO point of view. It will likely result in the search engine indexing your error page! The easy workaround is to set the correct status code in the Page_Load event of the relevant page:

public partial class NotFound : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Response.StatusCode = 404;
    }
}

public partial class Error : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Response.StatusCode = 500;
    }
}

The web.config entry for the MVC 5 (and earlier) version is almost identical:

<system.webServer>
  <httpErrors errorMode="Custom" existingResponse="Replace">
    <remove statusCode="404"/>
    <remove statusCode="500"/>
    <error statusCode="404" responseMode="ExecuteURL" path="/error/notfound"/>
    <error statusCode="500" responseMode="ExecuteURL" path="/error/error"/>
  </httpErrors>
</system.webServer>

Since it's MVC, the path values point to URLs that resolve to controller actions. In my example, the controller is called ErrorController:

public class ErrorController : Controller
{
    public ActionResult Error()
    {
        Response.StatusCode = 500;
        return View();
    }

    public ViewResult NotFound()
    {
        Response.StatusCode = 404;
        return View();
    }
}

The same problem concerning undesirable status codes occurs with an MVC site, so the correct values are set just before the views are returned.

Summary

This article looked at how to let search engines know what to index on your site, and what to ignore. It showed how to build a sitemap and expose it in Web Forms and MVC, and how to let search engines know it exists via the robots.txt file. It then covered the main HTTP status codes that search engines are interested in, and how they affect what's indexed. Finally, the article explained how to incorporate custom error pages in your site while ensuring that the correct HTTP status code is returned to the user.