MyComputingArt




.htaccess: an example of the Apache per-directory configuration file

I use Blosxom as my weblog application, so I had to add some settings to the Apache .htaccess configuration file to fine-tune how my site is served:
  1. tell Apache to load index.cgi in addition to the default index.htm
  2. rewrite the requested address adding www. if it's omitted
  3. rewrite the requested address adding index.cgi if it's omitted
  4. and for bandwidth reduction, forbid robots from crawling some pages
(Updated May 17 2014)
  1. Tell Apache to load index.cgi in addition to the default index.htm with the DirectoryIndex directive:
    DirectoryIndex index.cgi
  2. Rewrite the requested address adding www. if it's omitted:
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTP_HOST} !^www\..* [NC]
    RewriteRule ^(.*) http://www.%{HTTP_HOST}/$1 [R=301]
    
    Using the mod_rewrite module I tell Apache:
    • activate the rewrite engine:
      RewriteEngine On
    • what the base URL path of this directory is: in a .htaccess file Apache matches the rules against the part of the requested path that is relative to the directory, and RewriteBase is the prefix it puts back in front of a relative substitution:
      RewriteBase /
      If a user requests http://www.mycomputingart.com/something/, the rules that follow are matched against the sub-path something/ only. For instance, if my site were at http://www.somewhere.com/users/Z24/ I would have written:
      RewriteBase /users/Z24/
      meaning that the rewrite works on the sub-path under /users/Z24/.
      If http://www.mycomputingart.com/users/Z24/linux/tutorial.htm were requested, the rules would be matched against linux/tutorial.htm, without the /users/Z24/ prefix, and the prefix would be prepended again to any relative path produced by the rewrite.
      Mapping a URL path to a physical path is usually not needed in the .htaccess file, because most of the time it is done with the Alias directive in the main configuration file, httpd.conf. For instance,
      Alias /users/Z24 /home/z24/www
      maps the URL sub-path /users/Z24 to the local directory /home/z24/www, so that the Apache server behind the domain www.mycomputingart.com answers the previous HTTP request by serving the file tutorial.htm located in /home/z24/www/linux/.
    • now, check whether the request satisfies this condition:
      RewriteCond %{HTTP_HOST} !^www\..* [NC]
      The RewriteCond directive states the condition; %{HTTP_HOST} is a server variable holding the host name the browser sent in the request (mycomputingart.com or www.mycomputingart.com); the regular expression !^www\..* means "not (!) starting (^) with www. (www\.) followed by anything (.*)"; and [NC] means "no case" (WWW and www are considered equal).
    • when the condition is satisfied, rewrite the address:
      RewriteRule ^(.*) http://www.%{HTTP_HOST}/$1 [R=301]
      The RewriteRule directive tells Apache how to rewrite the request once the RewriteCond is satisfied: take everything (.*) from the beginning (^) of the requested path below RewriteBase, store it in memory and replace it with http://www., followed by the content of the server variable HTTP_HOST, a slash and the content that has just been stored ($1 is the first memorized group; a group is stored when it is enclosed in parentheses: (.*) in this case).
      The [R=301] flag tells Apache to answer with a redirect, because the requested URL "has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs", as RFC 2616 states for the 301 (Moved Permanently) status code.
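The condition and rule of step 2 can be tried out away from the server. The following is only a sketch in Python, not what Apache runs internally: the function name canonical_redirect and the sample host names are assumptions made for illustration, and the path is assumed to be already relative to RewriteBase /.

```python
import re

# Same regular expression as the RewriteCond; re.IGNORECASE plays the role of [NC].
WWW_PREFIX = re.compile(r"^www\..*", re.IGNORECASE)


def canonical_redirect(host, path):
    """Return the 301 target for a request, or None when no redirect is needed.

    host is the value of HTTP_HOST; path is the requested path relative to
    RewriteBase /, that is, without the leading slash.
    """
    if WWW_PREFIX.match(host):
        # The "!" in the RewriteCond negates the match: the host already
        # starts with www., so the rule is not applied.
        return None
    # RewriteRule ^(.*) stores the whole path as $1.
    captured = re.match(r"^(.*)", path).group(1)
    return "http://www." + host + "/" + captured


print(canonical_redirect("mycomputingart.com", "configurations/"))
print(canonical_redirect("WWW.mycomputingart.com", "configurations/"))
```

The first call prints the rewritten address with www. added, the second prints None because the host name already starts with www (in any case).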
  3. Rewrite the requested address adding index.cgi if it's omitted; it should be done only if the requested URI is:
    • the root (www.mycomputingart.com)
    • a single post (www.mycomputingart.com/configurations/both/Z24.htaccess.html)
    • a category (www.mycomputingart.com/configurations/)
    If someone asks for the URI of an image Apache must not rewrite the URI.
    I don't repeat
    RewriteEngine On
    RewriteBase /
    
    because they are already set above; the new rewrite rules are:
    RewriteCond %{REQUEST_URI} ^$|txt$|1993$|html$ [OR]
    RewriteCond %{REQUEST_URI} !\.[a-z]{3,4}$ [NC]
    RewriteRule !^index\.cgi.* - [C]
    RewriteRule (.*) index.cgi/$1
    
    • According to the first condition, the requested URI (server variable REQUEST_URI) must be either blank (^$), that is a location, or (|) end in txt, 1993 or html (txt$|1993$|html$), the Blosxom flavours. [OR] means exactly "or": it makes this condition an alternative to the next one, which says that the requested URI must not (!) end ($) with a file extension: a dot (\.) followed by 3 or 4 alphabetic characters ([a-z]{3,4}), matched case-insensitively ([NC]). Since REQUEST_URI always begins with a slash, ^$ never matches in practice and the root is accepted by the second condition instead.
      In simpler words, the requested URI must be the root, a location, a file with an extension matching a Blosxom flavour, or anything else that has no extension.
    • If one of the two conditions is satisfied, check if the requested address does not begin (!^) with index.cgi and do nothing (-), then chain ([C]) this rule to the next.
      Chaining two or more rules means that if a rule is not matched, all the rules chained after it are skipped; if it is matched, the next chained rule is processed.
    • The second RewriteRule puts index.cgi at the beginning of the path relative to RewriteBase. In other words, it replaces everything ((.*)) after the / with index.cgi/ followed by the same everything ($1).
      Combining the two rules, this is the result: if an address containing index.cgi has been requested, pass the requested address without rewriting; if it doesn't, rewrite the address adding index.cgi before any sub-path.
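To check which addresses step 3 rewrites, the two conditions and the two chained rules can be imitated with Python's re module. This is a rough sketch under stated assumptions: the function name rewrite and the sample URIs (such as /images/logo.png) are hypothetical, and the request URI is assumed to start with a slash, with the path relative to RewriteBase / obtained by dropping it.

```python
import re

FLAVOUR = re.compile(r"^$|txt$|1993$|html$")             # first RewriteCond
EXTENSION = re.compile(r"\.[a-z]{3,4}$", re.IGNORECASE)  # second RewriteCond, before the "!"
INDEX_CGI = re.compile(r"^index\.cgi.*")                 # first RewriteRule, before the "!"


def rewrite(request_uri):
    """Return the rewritten path relative to RewriteBase /, or None if untouched."""
    # [OR]: either the URI ends with a flavour, or it has no file extension.
    conditions_ok = bool(FLAVOUR.search(request_uri)) or not EXTENSION.search(request_uri)
    if not conditions_ok:
        return None  # e.g. an image: the first rule fails, the chained rule is skipped
    relative = request_uri.lstrip("/")
    if INDEX_CGI.match(relative):
        return None  # already goes through index.cgi: the chain stops here
    # Second rule: (.*) becomes index.cgi/$1
    return "index.cgi/" + relative


for uri in ("/", "/configurations/", "/configurations/both/Z24.htaccess.html",
            "/images/logo.png", "/index.cgi/configurations/"):
    print(uri, "->", rewrite(uri))
```

The root, the category and the single post get index.cgi/ prepended, while the image and the address that already contains index.cgi are left alone.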
  4. Keep robots from crawling tags. I experienced excessive bandwidth usage due to some robots crawling all possible combinations of tags; to stop them I instruct Apache to answer robots that crawl URIs containing a tag with a 403 Forbidden error:
    RewriteCond %{QUERY_STRING} tags [NC]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)Googlebot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)Yahoo [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)msnbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)Yandex [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)bingbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(.*)Baiduspider [NC]
    RewriteRule .* - [F]
    
    • When a URI contains a question mark, as in http://www.mycomputingart.com/?-tags=.htaccess, the REQUEST_URI server variable contains the address only (the part to the left of ?), without the parameters, while the QUERY_STRING server variable contains the parameters (the part to the right of ?).
      The first condition searches for tags (case-insensitively) in the QUERY_STRING server variable.
    • The following conditions apply the rule only if the URI is requested by one of the listed robots, based on the user agent string with which the robot identifies itself (HTTP_USER_AGENT server variable). Robots' user agent strings can be found on useragentstring.com.
    • The RewriteRule leaves the URI unrewritten (-) and returns a 403 Forbidden status ([F] flag).
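The intended logic of step 4 is "the query string contains tags AND the user agent is one of the listed robots". A small Python sketch of that logic follows; the function name is_forbidden and the sample user agent strings are illustrative assumptions, and the code only mirrors the description given here, not Apache's internal evaluation.

```python
import re

TAGS = re.compile(r"tags", re.IGNORECASE)  # RewriteCond %{QUERY_STRING} tags [NC]
ROBOTS = ("Googlebot", "Yahoo", "msnbot", "Yandex", "bingbot", "Baiduspider")
# One pattern per RewriteCond %{HTTP_USER_AGENT} line, [NC] included.
ROBOT_PATTERNS = [re.compile(r"^(.*)" + name, re.IGNORECASE) for name in ROBOTS]


def is_forbidden(query_string, user_agent):
    """True when the request would get the 403 status of the [F] flag."""
    is_tag_request = TAGS.search(query_string) is not None
    # The robot conditions are alternatives to each other ([OR]).
    is_listed_robot = any(p.search(user_agent) for p in ROBOT_PATTERNS)
    return is_tag_request and is_listed_robot


# Sample user agent strings, shortened for illustration.
robot = "Mozilla/5.0 (compatible; Googlebot/2.1)"
browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"

print(is_forbidden("-tags=.htaccess", robot))    # robot crawling a tag
print(is_forbidden("-tags=.htaccess", browser))  # ordinary visitor on a tag
print(is_forbidden("", robot))                   # robot on a page without tags
```

Only the first request is refused; ordinary visitors can still browse tags, and robots can still crawl everything else.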
At the end of the .htaccess file there are these access rules too:
<Files .htaccess>
order allow,deny
deny from all
</Files>
These rules tell Apache to deny access to the .htaccess file to everyone.

See the rewrite and rewrite flags guides on the Apache site and the Perl Compatible Regular Expressions man page: they are enlightening.
And maybe this .htaccess tester and this robot simulator can be useful too.


Posted by: Z24 | Wed, May 04 2011 | Category: /configurations/both | Permanent link | home
To contact the webmaster and author write to: info<at>mycomputingart<dot>com
© mycomputingart.com, year(today()).