Handling Legacy URLs with ASP.NET MVC

According to Google's Webmaster Tools, there are about 15,000 incoming links to my site. 13,000 of those reference an .aspx file on disk. When I convert to MVC, with new search-engine-friendly URLs, all those links will break unless I do something about it. Presenting users with a 404 (file not found) is not an option. I need to show them the content they were expecting, and update search engines with the news that things have changed. Here's how I will be managing those legacy URLs.

Quick Overview of Routing

System.Web.Routing was introduced to ASP.NET with 3.5 SP1 to be used primarily with Dynamic Data applications. With the vast majority of technologies, a URL maps to a physical file on disk (examples include .aspx, .ashx, .asp, .htm, .php, .gif etc.). System.Web.Routing's main purpose is to provide a means to map, or route, requests (URLs) to resources other than these types of physical file. Within an ASP.NET MVC application, these resources are generally represented by methods exposed by classes that inherit from System.Web.Mvc.Controller, known as controller actions. To begin with, understanding how this all works can be as difficult as getting to grips with Regular Expressions. However, we'll give it a try.

All routes derive from an abstract class called RouteBase. This defines the properties and methods that you might need to manage your own route, although in the main, you will use the out-of-the-box System.Web.Routing.Route subclass. A collection of routes is kept in the RouteTable.Routes property. The default setup within Global.asax of an MVC application contains a class, MvcApplication, which includes a static method (RegisterRoutes) that builds the RouteTable's Routes collection. Within the Application_Start() event, this static method is called, and the configuration of routes for the application is persisted throughout its lifetime.

When a request is made to an MVC application, an HttpModule (UrlRoutingModule) checks RouteTable.Routes for an entry that matches the pattern of the requested URL. It starts from the top and checks each entry in turn until it finds one that matches. To test an entry, the module invokes that Route object's GetRouteData() method, which returns null if the route does not match, or a RouteData object providing information about the route, such as how it is to be handled. The GetRouteData() method takes one parameter, an HttpContextBase object, which holds all the information you need about the HTTP request that was made, including query string values, form values, cookie data, HTTP headers etc. Finally, the HttpModule invokes the RouteData's RouteHandler, which is typically of type MvcRouteHandler. It passes the handler a RequestContext object, which contains an HttpContextBase object and a RouteData object. The reason I have covered all of this is that GetRouteData() is the point at which we need to step in to examine whether a request is for a legacy .aspx file. We then need to make sure the user agent is directed to the URL that causes the correct controller action to be invoked.
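
The "first match wins" behaviour described above can be modelled in a few lines. This is only a simplified sketch: ISketchRoute, ContainsRoute and FirstMatchSketch are hypothetical stand-ins invented for illustration, not the System.Web.Routing types, and the matching rule (a plain substring test) is an assumption chosen to keep the example short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for RouteBase: returns null when the route does not apply.
public interface ISketchRoute
{
    string Match(string url);
}

// Hypothetical route that "matches" when the URL contains a given fragment.
public class ContainsRoute : ISketchRoute
{
    private readonly string fragment;
    private readonly string name;

    public ContainsRoute(string fragment, string name)
    {
        this.fragment = fragment;
        this.name = name;
    }

    public string Match(string url)
    {
        return url.Contains(fragment) ? name : null;
    }
}

public static class FirstMatchSketch
{
    // Walks the list from the top and stops at the first route that returns non-null.
    public static string Resolve(IList<ISketchRoute> routes, string url)
    {
        foreach (var route in routes)
        {
            var result = route.Match(url);
            if (result != null) return result;
        }
        return null; // nothing matched
    }

    public static void Main()
    {
        var routes = new List<ISketchRoute>
        {
            new ContainsRoute("Article.aspx", "legacy"), // registered first, so checked first
            new ContainsRoute("Article", "show")
        };

        Console.WriteLine(Resolve(routes, "/Article.aspx?ArticleID=100")); // prints "legacy"
        Console.WriteLine(Resolve(routes, "/Article/100/Some-Title"));     // prints "show"
    }
}
```

The order of registration matters here: if the two entries were swapped, the legacy URL would be claimed by the more general entry before the legacy one was ever consulted.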

Before looking at how all that is achieved, there are some other problems that need to be addressed. The first is SEO, or Search Engine Optimisation. I have already decided that my replacement for e.g. Article.aspx?ArticleID=100 will be Article/100/Experimenting-with-jQuery-Draggables-and-ASP.NET. In other words, as well as the ID of the article (see later for why), I am using the title. Instead of spaces (or %20) between words in the title, there will be hyphens. It appears that search engines like finding keywords in the URL, and will give more weight to their relevance. It also appears that search engines are happy with hyphens, which they see as spaces. Not all of the titles of my articles lend themselves to this approach. For example, anything that currently includes some form of punctuation might look strange. So the first thing I need is a method to clean them up:


namespace MikesDotnetting.Helpers
{
  public static class UrlTidy
  {
    public static string ToCleanUrl(string urlToTidy)
    {
      var url = urlToTidy.Trim();
      url = url.Replace("  ", " ").Replace(" - "," ").Replace(" ", "-").Replace(",", "").Replace("...","");
      return url;
    }
  }
}
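
As a quick check of what the helper produces, here is a small console sketch. The ToCleanUrl method is repeated from the listing above so that the example stands alone; the CleanUrlDemo class and the second sample title are made up for illustration.

```csharp
using System;

public static class UrlTidy
{
    // Same replacements, in the same order, as the helper shown above.
    public static string ToCleanUrl(string urlToTidy)
    {
        var url = urlToTidy.Trim();
        url = url.Replace("  ", " ").Replace(" - ", " ").Replace(" ", "-").Replace(",", "").Replace("...", "");
        return url;
    }
}

public static class CleanUrlDemo
{
    public static void Main()
    {
        // Prints: Experimenting-with-jQuery-Draggables-and-ASP.NET
        Console.WriteLine(UrlTidy.ToCleanUrl("Experimenting with jQuery Draggables and ASP.NET"));

        // Hypothetical title with a dash, a comma and an ellipsis.
        // Prints: Routing-a-quick-overview-with-examples
        Console.WriteLine(UrlTidy.ToCleanUrl("Routing - a quick overview, with examples..."));
    }
}
```

Only the punctuation listed in the method is handled, so a title containing other characters (a question mark or a colon, say) would pass through unchanged.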

The next thing involves the controller that is responsible for managing requests for articles. The Route entry that causes the relevant controller action to be invoked is as follows:


routes.MapRoute(
     "Show",
     "{controller}/{id}/{title}",
     new { controller = "Article", action = "Show", id = "", title = "" }
 );

And the action itself:


public ActionResult Show(int id)
{
  return View(repository.GetArticle(id));
}

If you compare the two, you will see that the {title} parameter is ignored by the controller action. All it looks for is the id, which gets passed to the GetArticle() method of my Repository. The reason for this is that it is a lot quicker to find data according to the primary key of a table than it is to do a string comparison. In addition, I am changing the title that appears in the URL by the addition of hyphens and the removal of other punctuation, so trying to compare an actual title to the representation of one within a URL is going to be problematic. In other words, the title part of the URL is purely decorative as far as MVC is concerned. There is another reason for keeping it decorative: I may want to edit a title at some stage after links have been published. If I do, the article will still be found so long as the ID appears correctly in a request. Nevertheless, I need a title when handling legacy Article.aspx requests. I explain why when I come to use it, but in the meantime, I'll add a GetArticleTitle() method to the Repository:


public IEnumerable<ArticleTitle> GetArticleTitle(int id)
{
  return (de.ArticleSet
             .Where(a => a.ArticleID == id)
             .Select(a => new ArticleTitle
                            {
                              Head = a.Headline
                            }));
}

Now to the real business - building my custom Route object. I'll plonk all the code for the LegacyUrlRoute class in one go and then explain it:


using System;
using System.Web;
using System.Web.Routing;
using MikesDotnetting.Controllers;

namespace MikesDotnetting.Helpers
{
  public class LegacyUrlRoute : RouteBase
  {
    public override RouteData GetRouteData(HttpContextBase httpContext)
    {
      const string status = "301 Moved Permanently";
      var request = httpContext.Request;
      var response = httpContext.Response;
      var title = "";
      var legacyUrl = request.Url.ToString();
      var newUrl = "";
      var id = request.QueryString.Count != 0 ? request.QueryString[0] : "";
      
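      // Redirect only when the request is for the legacy page and carries a numeric ID;
      // Convert.ToInt32 would throw on an empty or non-numeric query string value.
      int articleId;
      if (legacyUrl.Contains("Article.aspx") && int.TryParse(id, out articleId))
      {
        var rep = new ArticleRepository();
        var article = rep.GetArticleTitle(articleId);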
        foreach (var a in article)
          title = UrlTidy.ToCleanUrl(a.Head);
        newUrl = "Article/" + id + "/" + title;
        response.Status = status;
        response.RedirectLocation = newUrl;
        response.End();
      }
      return null;
    }

    public override VirtualPathData GetVirtualPath(RequestContext requestContext,
                                                   RouteValueDictionary values)
    {
      return null;
    }
  }
}

First thing to notice: all Route objects must derive from RouteBase. LegacyUrlRoute is no different. Both of the abstract methods of RouteBase are overridden: GetRouteData(), which returns a RouteData object, and GetVirtualPath(), which returns a VirtualPathData object. However, my GetRouteData() override never actually returns a RouteData object; it only ever returns null. That's because no request that matches the condition within the method will be routed at all. It is answered with a redirect instead.

Initially, some variables and one constant are created. The constant is an HTTP status code that informs user agents (browsers and search engine bots) that the resource they are looking for has been moved permanently to another location. It should make no difference to existing links on blogs and forums that human visitors follow, but search engines will hopefully update their indexes, and this is why I need a title: I want the search engines to store the whole link. The other variables reference the current HTTP Request and Response "contexts" (Microsoft really love that word, don't they?), the currently requested URL and a query string value (where it exists).

If the currently requested URL contains the string "Article.aspx", it's a legacy URL. The first thing that happens is that the article title is obtained from the method in the Articles Repository that was introduced earlier. (I suspect that when this goes live, I shall map IDs to titles in an XML file and reference that instead of calling the database.) The title is then tidied up by the helper ToCleanUrl() method, and used to construct a new MVC URL. From there, an HTTP response is prepared and sent. The status code is provided using the constant, and the new location for future requests is assigned to the RedirectLocation property. Response.End() is called, which prevents any further processing for this particular request, and the response is sent back to the user agent. No RouteData structures were built or referenced, and no HttpHandlers invoked.

If the requested URL does not contain "Article.aspx", null is returned so that the UrlRoutingModule can continue to try to match the URL to other routes within the RouteTable.Routes collection.

One final task remains: registering the LegacyUrlRoute in the application's RouteTable. That's done right at the beginning of the RegisterRoutes method in Global.asax, so that it is checked before any other route:


public static void RegisterRoutes(RouteCollection routes)
{
  routes.IgnoreRoute("{resource}.axd/{*pathInfo}");
  
  routes.Add(new LegacyUrlRoute());

  // ... the remaining routes, including "Show", are registered after this point
}

And now, if a request is made to an old URL, such as http://www.mikesdotnetting.com/Article.aspx?ArticleID=100, it is automatically redirected to http://www.mikesdotnetting.com/Article/100/Experimenting-with-jQuery-Draggables-and-ASP.NET with the correct header sent to the user agent.
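
One way to confirm that the header really is being sent is to request a legacy URL without following redirects and look at the status code and Location header. The console sketch below is one possible check rather than part of the site's code; RedirectCheck and its method names are invented for illustration. It uses HttpWebRequest with AllowAutoRedirect switched off so that the 301 itself can be inspected.

```csharp
using System;
using System.Net;

public static class RedirectCheck
{
    // Returns the Location header when the URL answers with a 301, otherwise null.
    // Note: GetResponse() throws a WebException for error responses such as 404.
    public static string GetPermanentRedirectTarget(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false; // inspect the redirect rather than following it
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return IsPermanentRedirect(response.StatusCode)
                ? response.Headers["Location"]
                : null;
        }
    }

    public static bool IsPermanentRedirect(HttpStatusCode status)
    {
        return status == HttpStatusCode.MovedPermanently; // 301
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: RedirectCheck <legacy-url>");
            return;
        }

        var target = GetPermanentRedirectTarget(args[0]);
        Console.WriteLine(target ?? "No 301 redirect was returned.");
    }
}
```

Running it against one of the old Article.aspx addresses should print the new location if the route is registered correctly. Because LegacyUrlRoute sets a relative RedirectLocation, the value printed may be a relative path rather than a full URL.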