Public Web Stats - Don't let competitors get the upper hand
Anyone who has used a web statistics program knows that the data on a web stats page is very valuable to a competitor. Something this valuable should be kept safe from prying eyes, right? In some cases, though, web stats are extremely insecure and need configuration beyond the default settings. This article deals mainly with AWStats, the program I am most familiar with and one that many readers will know too. With that said, let's get started.
Overview
- How to search for web statistics via Google Hacking
- Ways to prevent your stats from becoming public
How it works
Go to Google and search for inurl:"awstats.pl". I came up with 296,000 pages with public stats using just that search string. As you can see, everything about these sites is made public thanks to Google and poor web administration practices. Now that you have seen the problem, I am going to show you some methods to prevent this sort of snooping on your own site.
How to prevent it
First and foremost, many websites are hosted on shared hosting accounts, where the web stats are installed either automatically or by an installer such as Fantastico. If that describes you, start by contacting your web host.
Beyond that, there are two methods you can apply yourself. One is to restrict search indexing of the cgi-bin and other sensitive directories within your web space using a robots.txt file. This keeps well-behaved search spiders such as Googlebot from spidering your stats pages. The other is to use .htaccess to prevent access to that directory altogether. For the sake of example and simplicity, I will explain only the former.
The robots.txt file is a simple text file that search engines look for as a set of guidelines for indexing the site. It tells the spider where it is welcome and where it is not. I am not going to go into major detail on how to set up a robots.txt file; this is just a quick overview of how to protect your cgi-bin with one.
First, create a plain text document in the root folder of your site and name it robots.txt. Open the file and type these two lines:
User-agent: *
Disallow: /cgi-bin
Explanation: the first line names the user agent (i.e. the search engine spider the rules apply to). Using a star, or "wildcard", makes the rule apply to all spiders. The second line tells those spiders not to crawl any path that begins with /cgi-bin. There are literally thousands of spiders out there, so here is a short list of the ones I found.
Most Well-Known Spiders
- GOOGLEBOT - http://www.google.com
- SLURP - http://www.yahoo.com
- MSNBOT - http://www.msn.com
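If you want to check the wildcard rule before uploading it, here is a minimal sketch using the robots.txt parser from Python's standard library. It only simulates how a rule-following spider would read the file. The domain www.example.com and the helper name is_allowed are placeholders I made up for illustration, and I am assuming the stats script lives at /cgi-bin/awstats.pl.

```python
from urllib.robotparser import RobotFileParser

# The two-line wildcard file described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs (assumptions for this example, not a real stats site).
STATS_URL = "http://www.example.com/cgi-bin/awstats.pl"
HOME_URL = "http://www.example.com/index.html"

for agent in ("Googlebot", "Slurp", "msnbot"):
    print(
        agent,
        "| stats page allowed:", is_allowed(ROBOTS_TXT, agent, STATS_URL),
        "| home page allowed:", is_allowed(ROBOTS_TXT, agent, HOME_URL),
    )
```

With the wildcard rule, every spider in the list should be refused the stats page while the rest of the site stays crawlable. Keep in mind this only tells you what the file says, not whether a particular spider will obey it.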
So far we have learned how to make a robots.txt file that disallows all spiders from spidering our cgi-bin directory (I chose this directory because it is the primary location where most stats scripts are installed). Now let's learn how to disallow just one spider.
User-agent: Googlebot
Disallow: /cgi-bin
The above instructions tell the Googlebot spider to exclude the /cgi-bin directory of the site from its crawl. You can change the user agent to whatever spider you want and set specific instructions for that spider.
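To see that a single-spider rule really leaves the other spiders alone, the same standard-library parser can be pointed at the Googlebot-only file. Again, this is just a sketch: blocked_agents is a hypothetical helper name, and www.example.com stands in for your own domain.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only rules shown above.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /cgi-bin
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents from `agents` that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Placeholder URL (an assumption for this example).
STATS_URL = "http://www.example.com/cgi-bin/awstats.pl"

print(blocked_agents(GOOGLEBOT_ONLY, ["Googlebot", "Slurp", "msnbot"], STATS_URL))
```

Only Googlebot should come back as blocked, which is the point: a rule written for one user agent does nothing to stop the others, so the wildcard version is the safer default for a stats directory.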
This is by no means the most secure way of preventing access to your cgi-bin folder. A robots.txt file only asks spiders to stay away, and anyone who knows the URL can still open the stats page directly. It is, however, a simple precaution that makes public viewing of your stats pages less probable. For a more secure method, use the .htaccess file to restrict access to the directory.
I hope this tutorial has made webmasters aware that their competition could very well learn their next move, or even their latest clients, from public stats, and that they should take precautionary steps to prevent it.