Mon, 14 Jan 08

Fun with mod_rewrite (and its special variables REQUEST_URI, IS_SUBREQ and THE_REQUEST)

I wasted a few hours of my life, over the weekend, playing with mod_rewrite. I had, what I thought to be, a fairly simple problem to solve: I wanted requests for index.html within any subdirectory of my site to be redirected to the root of that subdirectory. So, a request for example.com/subdir/index.html would be (301) redirected to example.com/subdir/.

I thought I could do this without using RewriteCond. I tried something like this.

RewriteRule ^(.*)index\.html$ $1 [R=301]

I would expect this to match requests for URLs that end with index.html and redirect them to the URL minus index.html. Although this correctly redirected /subdir/index.html to /subdir/ it would also attempt to redirect /subdir/ to /subdir/ resulting in an infinite loop. It took me a very long time to work out, what I believe, is going on. If I understand correctly, apache will send every request (i.e. not just the request you issue from the browser) through mod_rewrite. When we request /subdir/, apache will try, and fail, to match that URL with our regex (/subdir/ doesn’t end with index.html). Internally, apache then ‘requests’ index.html (since index.html is the DirectoryIndex). This request matches our regex which results in apache sending the 301 redirect and us ending up with an infinite loop. If I mis-understand what’s going on please feel free to correct me.

Based on this assumption, I figured I’d need to use RewriteCond to somehow ignore the internal request for index.html. I initially added a condition that checked for the REQUEST_URI ending in index.html (RewriteCond %{REQUEST_URI} index\.html$). This suffered the same problem as my initial attempt. A little more reading led me to the IS_SUBREQ special variable. I thought this might allow me to ignore the ‘internal request’ for index.html (RewriteCond %{IS_SUBREQ} false) but I couldn’t get it to have any effect.

In searching for answers to this problem I kept seeing THE_REQUEST special variable used in the solution, although never an explanation as to why it’s used (specifically in preference to REQUEST_URI). I ended up using THE_REQUEST in my solution (not technically my solution as I stole it from a site that I failed to bookmark) and it seems to work OK.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/
RewriteRule ^(.*)index\.html$ $1 [R=301,L]

I’m guessing that this is because THE_REQUEST contains the full details (the stuff that appears in the log file) of the request from the client (browser) and will therefore be empty for any ‘internal requests’. Anyone know if my assumptions are correct?