Fixing 404 errors

Tue Jul 13 17:19:31 2004
Posted by Tony Lawrence
Search Keys: 404, error_log, custom 404, web/html
Referencing: /Unix/custom404.html

A 404 error is what you get when your browser tries to access a page that doesn't exist. Maybe you mistyped something, or the link you followed was mistyped by someone else, or maybe the webmaster moved it or renamed it or just deleted it. It's annoying for you, and sites that care about your visit try to avoid it happening.



Well, we can't stop 404's 100%, and frankly dealing with them is an annoyance for those of us maintaining the website too. It's bad enough that other sites cause us problems with incorrect links, but it is really annoying when we cause our own problems.

Unfortunately, tracking these things down and fixing them is a bit of a pain. The "Custom 404" page and associated script referred to above corrects a lot of common errors automatically, and tries to offer help when it can't just redirect you to the right page, but I need to keep updating it as I find new sources of errors. Sometimes the fix is as simple as just making a symbolic link, but if it is from an outside source, I want to correct it if I can. Even if it was caused by my own error, I may still want to add correction code in case that original error gets picked up by someone else.

So, to help me find errors, I have a Perl script that reads in the error_log, and compares it to a log of "corrections" already made by the Custom 404 script (this is necessary because the 404 ends up in my logs even though it was corrected). The script ignores pages that have already been corrected, and spits out a list of 404's I need to at least investigate. Many of these will be confused web spiders - it's really amazing how dumb some of these things are. For example, /MacOSX/macosxcupstofile.html contains this text:



sudo lpadmin -p tofile -E -v socket://localhost:12000 -m raw


Dumb spiders regularly think that is a link:



[Sun Jul 11 07:07:05 2004] [error] [client 217.107.152.79] File does not exist: /usr/local/www/vhosts/vps.pcunix.com/htdocs/MacOSX/socket://localhost:12000/
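One way to keep that kind of noise out of the list would be to skip any path that has a URL scheme buried in the middle of it. This is only a sketch of the idea, not something the script in this article does; the helper name `looks_like_spider_junk` and the pattern are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when a URL scheme (socket://, http://, ...)
# shows up in the middle of what should be a local path.
sub looks_like_spider_junk {
  my ($path) = @_;
  return $path =~ m{/[a-z][a-z0-9+.-]*://}i ? 1 : 0;
}

for my $path ('/MacOSX/socket://localhost:12000/',
              '/Books/creatingcoolwebsites.html') {
  my $verdict = looks_like_spider_junk($path) ? 'skip' : 'keep';
  print "$verdict $path\n";
}
```

A check like this would discard the `socket://` entry while leaving ordinary pages alone. It is deliberately narrow, so other kinds of spider confusion would still need a look by hand.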


I have the script count the number of uncorrected 404 occurrences so that I can devote immediate effort to the more serious problems. The output of the script might look something like this:








/blog/b930.html 2
/SCOFAQ/news:comp.unix.admin 1
/cgi-bin/fmail.pl 1
/Books/creatingcoolwebsites.html 10
/e51/SCOFAQ/FAQ_scotec8xsession.html 1


Obviously I need to jump on that "creatingcoolwebsites.html" problem right away.
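Since the counts are what decide priority, the list could also be printed with the worst offenders first. The following is a minimal sketch using the sample counts above; the hash name `%count` and the helper `by_count_desc` are made up for the example and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample counts taken from the output shown above.
my %count = (
  '/blog/b930.html'                      => 2,
  '/SCOFAQ/news:comp.unix.admin'         => 1,
  '/cgi-bin/fmail.pl'                    => 1,
  '/Books/creatingcoolwebsites.html'     => 10,
  '/e51/SCOFAQ/FAQ_scotec8xsession.html' => 1,
);

# Return the paths ordered by descending count; ties sort alphabetically.
sub by_count_desc {
  my ($c) = @_;
  return sort { $c->{$b} <=> $c->{$a} or $a cmp $b } keys %$c;
}

printf "%5d %s\n", $count{$_}, $_ for by_count_desc(\%count);
```

With this ordering the "creatingcoolwebsites.html" entry lands at the top of the list instead of wherever the hash happens to put it.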

See that "fmail.pl"? That's a script kiddy trying to break in:



205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"


Checking his other attempts proves it:



205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl HTTP/1.0" 404 2320 " http://aplawrence.com/" "-"
205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"


Nothing to worry about there.
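Pulling one client's other failed requests out of the access log can be scripted as well. This is a rough sketch, assuming the access log uses the layout of the entries shown above; the function name `misses_for_client` is hypothetical, and the two sample lines are the ones quoted in this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the 404 requests made by one client from access_log lines.
# Assumes the log layout of the entries quoted in the article.
sub misses_for_client {
  my ($ip, @lines) = @_;
  my @hits;
  for my $line (@lines) {
    # captures: client, timestamp, method, path, status
    next unless $line =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) /;
    next unless $1 eq $ip and $5 == 404;
    push @hits, "$2 $3 $4";
  }
  return @hits;
}

my @sample = (
  '205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl HTTP/1.0" 404 2320 " http://aplawrence.com/" "-"',
  '205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"',
);

print "$_\n" for misses_for_client('205.158.224.234', @sample);
```

In real use the lines would come from reading the access log rather than from a hard-coded list.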

The actual script is pretty simple:



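#!/usr/bin/perl
# ck404.pl - list uncorrected 404s from error_log with a count of each
use strict;
use warnings;

open(my $log, '<', "www/logs/error_log") or die "error_log: $!";
open(my $corr, '<', "www/data/corrections") or die "corrections: $!";
my %corrected;   # paths the Custom 404 script has already corrected
my %count;       # uncorrected path => number of occurrences

# Keep only the part of each corrections line before "->"
while (<$corr>) {
  chomp;
  s/->.*//;
  s/^ +//;
  s/ +$//;
  $corrected{$_} = 1;
}
close $corr;

while (<$log>) {
  chomp;
  s/.*htdocs//;             # strip everything up to the document root
  s/.*cgi-bin/\/cgi-bin/;   # same for requests under cgi-bin
  s/^ +//;
  s/ +$//;
  next if $corrected{$_};   # already handled, ignore it
  $count{$_}++;
}
close $log;

foreach (keys %count) {
  print "$_ $count{$_}\n";
}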




This does generate some extra garbage now and then; it doesn't need to be perfect - it's just a helper script that saves me time.

Well, I've got a few hundred 404's I need to go look at. Most of them will probably be spider errors, or things I can easily fix, but invariably there will be some new 404 mixup to deal with, and the Custom 404 code will grow some more.




Comments
ok i'm quite new to website design, etc. anyhow, i just put a new web up and - to get to my question - the site buttons link to the corresponding pages on my personal computer however do NOT seem to work from certain other computers. their web browsers seem to be the same (explorer) as mine. the message "error 404" keeps coming up, saying the page doesn't exist or was moved. suggestions?

-----

Actually, this isn't the place to post a request for help -- newsgroups and forums are where you should go. Also, if all else fails, RTFM! Whatever software you are using to compose and expose your pages will (or should) have the info you need. If it doesn't, perhaps you shouldn't be using it, eh?

However, to address your particular situation, you might wish to carefully examine how you set up your links. It could be that they point to locations relative to the filespace in which the pages were generated. The result would be that everything would work for you, but when pages are requested by a remote client, the links will not make any sense to that client.

Links are relative to the document base, which if confined entirely to the realm of a PC running Windows (I assume that's what you have, since you mentioned "explorer"), would be c:\somedir\somepage.html. That's not going to mean much to someone coming in over the Internet, eh? That's because the webserver confines its knowledge of the filesystem to a particular area. For example, on my webserver, my website (www.bcstechnology.net) is stored in /bcs00/port80.d/bcs.d (obviously not Windows). That is referred to as the document home. From the server's perspective, the document home is the root of the entire website. So any link you might select on my site will be referring to /<documenthome>/whatever.html, not /bcs00/port80.d/bcs.d/whatever.html.

BTW, you shouldn't be using Internet Exploder to test a web site. Just because it works with IE doesn't mean that the page structure conforms to W3C standards. You might be dismayed to discover that your site isn't accessible to those who are smart enough to stay away from Microsoft's crappy and bug-ridden browser.

--BigDumbDinosaur

---December 29, 2004

Right. http://aplawrence.com/forum.html is our forum here.

The most common error is just leaving off a slash:

Unixart/ksh.html vs. /Unixart/ksh.html

The non-slash will work if called from /index.html but not from /Unixart/index.html

--TonyLawrence





---December 29, 2004
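The missing-slash mistake described in the comment above is easy to see with a small resolver. This is a simplified sketch of how a browser combines the referring page with a link; `resolve_link` is a hypothetical helper, and it ignores ".." segments, query strings, and full URLs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Resolve a link against the path of the page it appears on.
sub resolve_link {
  my ($page, $link) = @_;
  return $link if $link =~ m{^/};     # leading slash: used as-is
  (my $dir = $page) =~ s{[^/]*$}{};   # directory of the referring page
  return $dir . $link;
}

for my $page ('/index.html', '/Unixart/index.html') {
  print "$page: Unixart/ksh.html -> ",
        resolve_link($page, 'Unixart/ksh.html'), "\n";
}
```

From /index.html the link resolves to /Unixart/ksh.html, but from /Unixart/index.html it becomes /Unixart/Unixart/ksh.html, which is the sort of request that ends up as a 404.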


