After the Storm - Internet Technologies
How to Recognize a Human Being
Part of the series "The Zen of Serving Web Pages"
By Christian Treber

How can you tell man and machine apart when looking at web logs? All these hits - was it somebody browsing your website, or was it a crawler collecting information for a search engine? An automated tool scavenging email addresses for the next spam attack?

I used to think of a web log as a record of what people have downloaded from my server, and how often. Not quite so easy!

First off, not every request is a download. Maybe just the header has been requested (caching proxies do that to check if something has changed), or form data has been posted, or maybe someone used the web server as a proxy. This all depends on the operation and, in the case of proxy traffic, on the URL (if it starts with "http://", it's a proxy request).

Even if the request was a download of a URL (a GET operation), it might not have been successful. Maybe the URL did not exist, or the user wasn't properly authorized, or the server had a bad day. The result code tells us how things went.

And after all that, the URL might not have been requested by a person (with a browser), but by a machine. Search services use crawlers to automatically download whole web sites and index them. Link checkers might probe other web pages' external links to your site for correctness. Spammers might try to extract email addresses from your pages.

If we want to answer the question "what have people been looking at?", we need to filter for requests that are GET operations on a local URL, that were successful, and that were submitted by a browser. This is what "user filtered" reports are about.

We are of course also interested in requests that used other operations, employed the web server as a proxy, failed, or were initiated by a machine. But they are the subject of other, surely interesting reports!
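
The four conditions of a "user filtered" report can be sketched in a few lines of Python. This is only an illustration, not the filter used for the actual reports. It assumes the log is in the common Apache "combined" format, treats any 2xx result code as success, and decides "machine or browser" by looking for a handful of substrings in the user agent. The names `LOG_PATTERN`, `MACHINE_MARKERS` and `is_user_request` are made up for this example, and a realistic marker list would be much longer.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "METHOD URL PROTOCOL" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical substrings that hint at a crawler, link checker or harvester.
MACHINE_MARKERS = ("bot", "crawler", "spider", "slurp", "linkcheck")


def is_user_request(line):
    """Return True if the log line looks like a successful page view by a person."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return False  # not a line we can interpret

    # 1. The operation must be a download, not HEAD, POST etc.
    if match.group("method") != "GET":
        return False

    # 2. The URL must be local; an absolute URL means a proxy request.
    if match.group("url").lower().startswith("http://"):
        return False

    # 3. The request must have succeeded (assumption: any 2xx result code).
    if not 200 <= int(match.group("status")) <= 299:
        return False

    # 4. The client must look like a browser, not a machine.
    agent = (match.group("agent") or "").lower()
    if not agent:
        return False  # no user agent logged: assume a tool
    return not any(marker in agent for marker in MACHINE_MARKERS)
```

Counting the lines of a log file for which `is_user_request` returns True gives a rough "user filtered" hit count. Whether a 304 (not modified) answer should count as a successful view, and how to treat clients that fake a browser's user agent, are judgment calls this sketch leaves open.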
© 1998-2005 Christian Treber, ct@ctreber.com. All rights reserved. The author takes no responsibility for linked external pages, the content of which by no means reflects his own opinion, convictions etc.