Knowing your site's audience is of paramount importance for every webmaster. There are lots of tools to help you gain that precious information, from simple ones to amazingly pretty Web 2.0 creatures. But sometimes all you need is a UNIX shell and some knowledge of tools like awk, grep and sed.
In most installations Apache uses the so-called "Combined" log format, which contains most of the info you need. On Red Hat-style Linux distributions Apache log files are usually stored in /var/log/httpd/ (Debian and Ubuntu keep them in /var/log/apache2/), and the one we are interested in is called access_log. The "Combined" log format is defined in the main Apache configuration file, httpd.conf, in the following way:
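For reference, here is a small sketch that pulls the definition out of a config file. The LogFormat lines are the stock ones from Apache's default configuration; the scratch file is only an assumption that lets the snippet run anywhere, so on a real server point CONF at your own httpd.conf (its location varies between distributions).

```shell
# Scratch file standing in for httpd.conf (assumption: your real file
# carries the same stock LogFormat lines; check it to be sure).
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
EOF

# Keep only the line that defines the "combined" nickname.
combined_def=$(grep -E '^[[:space:]]*LogFormat .*[[:space:]]combined$' "$CONF")
echo "$combined_def"

rm -f "$CONF"
```

Each %-directive stands for one field: %h is the client address, %t the timestamp, %r the request line, %>s the final status code, %b the response size.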
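Looks a bit scary, but in fact it's not that awful. If you check the Apache documentation you will find all the gory details (the format itself is documented there too). What's important now is that every log line consists of 9 fields separated by spaces. A few of them (the timestamp, the request, the User-Agent) contain spaces themselves, so awk, which splits on every space, will number its fields differently. To get an idea of what it looks like, we should take a look at the actual file: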
We are interested in only a few fields. The first item (10.20.30.40) is the IP address of the client fetching the page. The item in square brackets is a timestamp. After the word GET comes the name of the file the user wants. The number 200 is the HTTP return code, which means 'OK' in this case (404, say, means 'Not Found'). The URL in quotes is the so-called referer: this is where the user (the user's browser, actually) comes from. It can be either your own site or some external site like www.google.com. Finally, the last field is the User-Agent, i.e. a browser identification string, which happens to be Firefox 8 here (in a couple of years it's gonna be Firefox 18, or even Firefox 24; well, time flies).
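To see how awk numbers these pieces, here is a rough sketch run against a single made-up line. Only the IP address, the GET, the 200, the Google referer and Firefox 8 come from the description above; the timestamp, file name and size are invented for illustration.

```shell
# Hypothetical log line in "Combined" format (not taken from a real log).
line='10.20.30.40 - - [14/Nov/2011:09:15:02 +0100] "GET /index.html HTTP/1.1" 200 5120 "http://www.google.com/" "Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0"'

# awk splits on whitespace, so the timestamp takes two fields ($4, $5)
# and the request takes three ($6, $7, $8), pushing the status to $9.
fields=$(printf '%s\n' "$line" | awk '{
    print "client IP:   " $1
    print "timestamp:   " $4, $5
    print "file:        " $7
    print "status code: " $9
    print "referer:     " $11
}')
echo "$fields"
```

So for awk the client address is $1, the requested file is $7 and the return code is $9; the User-Agent starts at $12 and runs to the end of the line.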
So, we have a huge file filled with lines like the one above. What can we get from it (besides eye floaters)? All sorts of things you never ever knew you could dig from a common log! Let's start crunching! First of all, let's assign a shell variable called LOG a value pointing to your log file, so we don't have to type its name a good hundred times:
Code
$ LOG=/var/log/httpd/access_log
Please note that '$' at the beginning of a line stands for the shell prompt. It marks input lines: you need to type in everything that goes after the '$', while the '$' itself is printed by the shell. If there's no '$' sign at the beginning of a line, it is output. Now, how big is the file? How many lines (let's call them records) do we have?
Code
$ wc -l $LOG
5217775 /var/log/httpd/access_log
Pretty big, 5 million lines! And how far back does it go?
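A minimal sketch of that query follows. It is written as a script without the '$' prompt, and it runs against a two-line made-up log (SAMPLE_LOG is a hypothetical stand-in) so it works anywhere; on a real server use "$LOG" in its place.

```shell
# Made-up stand-in for the real access_log.
SAMPLE_LOG=$(mktemp)
cat > "$SAMPLE_LOG" <<'EOF'
10.20.30.40 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.41 - - [02/Nov/2011:11:30:00 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
EOF

# First line only ("head -n 1" is the portable spelling of "head -1"),
# then fields 4 and 5: the timestamp and its timezone offset.
first_seen=$(head -n 1 "$SAMPLE_LOG" | awk '{print $4, $5}')
echo "$first_seen"

rm -f "$SAMPLE_LOG"
```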
Here "head -1" means "give us the first line of the file", and the awk statement means "print the fourth and the fifth fields of the line", which together hold the timestamp. Using similar awk statements you can get the list of IP addresses of all users who tried to access your website. I'm sure you do not want to watch all five million addresses flooding the screen (remember the terminal from the Matrix movie? Oh, good old days!).
I guess you would rather see only the unique IP addresses (and even just the top ten of them; it's nice to know your adoring fans!):
Code
$ awk '{print $1}' $LOG | sort | uniq
We had to sort the list of addresses coming from awk first, then use the uniq utility to drop the repeated lines (uniq only collapses duplicates that sit next to each other, hence the sort). Still, the list is way too long (curious readers who want to know exactly how long can add "| wc -l" to the end of the above command; the number is still huge, somewhere around 1 million lines). We can sort it once again to view only the top 10 customers (their IP addresses, actually):
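Roughly, the pipeline looks like the sketch below. The six log lines are made up and SAMPLE_LOG is a hypothetical stand-in so the snippet runs anywhere; on a real server feed "$LOG" to awk instead.

```shell
# Made-up stand-in for the real access_log.
SAMPLE_LOG=$(mktemp)
cat > "$SAMPLE_LOG" <<'EOF'
10.20.30.40 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.40 - - [01/Nov/2011:10:00:09 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
10.20.30.42 - - [01/Nov/2011:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.40 - - [01/Nov/2011:10:02:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:03:00 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
EOF

# Extract IPs, group identical ones, count each group,
# order by count (largest first) and keep the first ten.
top_ips=$(awk '{print $1}' "$SAMPLE_LOG" | sort | uniq -c | sort -nr | head -n 10)
echo "$top_ips"

rm -f "$SAMPLE_LOG"
```

Each output line is a count followed by an address; the exact padding in front of the count depends on your uniq implementation.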
Here we add the -c (count) option to the uniq command so it also outputs how many times each line was repeated. The second sort command ("sort -nr") is a numerical (rather than alphabetical) sort in reverse order (bigger numbers first). This gives us a list of IPs and their frequencies. Finally, the head command limits the output to the first 10 lines (10 is the default value; if you want the top 25, simply use "head -25"). You can run the same kind of awk query to get the top 10 most popular files (pages) of your web site. The file name is the seventh field, so:
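Something along these lines should do. Again the log lines are invented and SAMPLE_LOG is a hypothetical stand-in for "$LOG"; the only change from the IP query is the field number.

```shell
# Made-up stand-in for the real access_log.
SAMPLE_LOG=$(mktemp)
cat > "$SAMPLE_LOG" <<'EOF'
10.20.30.40 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.40 - - [01/Nov/2011:10:00:01 +0000] "GET /favicon.ico HTTP/1.1" 200 318 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:05:00 +0000] "GET /favicon.ico HTTP/1.1" 200 318 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:05:02 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
10.20.30.42 - - [01/Nov/2011:10:09:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.42 - - [01/Nov/2011:10:09:01 +0000] "GET /favicon.ico HTTP/1.1" 200 318 "-" "-"
EOF

# Field 7 is the requested path; count and rank it like the IPs before.
top_files=$(awk '{print $7}' "$SAMPLE_LOG" | sort | uniq -c | sort -nr | head -n 10)
echo "$top_files"

rm -f "$SAMPLE_LOG"
```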
Often the most popular file is /favicon.ico, a small web icon that you usually see in a browser tab. The next thing we want is to limit the statistics to the last month only (who cares what was going on on your web site in 1812!). grep helps a lot! By adding a grep command we only let through lines that contain the string "/Nov/2011":
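A sketch of the filtered query, assuming the same top-files pipeline with grep put in front of it. The sample lines and SAMPLE_LOG are made up; use "$LOG" on a real server.

```shell
# Made-up stand-in for the real access_log, spanning two months.
SAMPLE_LOG=$(mktemp)
cat > "$SAMPLE_LOG" <<'EOF'
10.20.30.40 - - [28/Oct/2011:08:00:00 +0000] "GET /old.html HTTP/1.1" 200 401 "-" "-"
10.20.30.41 - - [29/Oct/2011:08:00:00 +0000] "GET /old.html HTTP/1.1" 200 401 "-" "-"
10.20.30.42 - - [30/Oct/2011:08:00:00 +0000] "GET /old.html HTTP/1.1" 200 401 "-" "-"
10.20.30.40 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.41 - - [02/Nov/2011:10:00:00 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
10.20.30.42 - - [03/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
EOF

# Keep only November 2011 records, then rank the requested files.
nov_files=$(grep "/Nov/2011" "$SAMPLE_LOG" | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 10)
echo "$nov_files"

rm -f "$SAMPLE_LOG"
```

Keep in mind that grep looks at the whole line, so a request whose URL or referer happens to contain "/Nov/2011" would slip through as well.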
Sometimes it's important to know the HTTP return codes. For most sites they are mostly 200 (OK) and 404 (Not Found). There are other return codes too, and it helps to know them when investigating web server problems:
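One way to get that breakdown, sketched under the assumption that the return code is awk's ninth field (true for requests of the usual "GET /path HTTP/1.1" shape). The sample lines and SAMPLE_LOG are invented; use "$LOG" on a real server.

```shell
# Made-up stand-in for the real access_log.
SAMPLE_LOG=$(mktemp)
cat > "$SAMPLE_LOG" <<'EOF'
10.20.30.40 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.40 - - [01/Nov/2011:10:00:04 +0000] "GET /missing.html HTTP/1.1" 404 209 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"
10.20.30.41 - - [01/Nov/2011:10:01:07 +0000] "GET /old.html HTTP/1.1" 301 231 "-" "-"
10.20.30.42 - - [01/Nov/2011:10:02:00 +0000] "GET /nope.html HTTP/1.1" 404 209 "-" "-"
10.20.30.42 - - [01/Nov/2011:10:02:09 +0000] "GET /about.html HTTP/1.1" 200 734 "-" "-"
EOF

# Field 9 is the status code; count how often each one occurs.
codes=$(awk '{print $9}' "$SAMPLE_LOG" | sort | uniq -c | sort -nr)
echo "$codes"

rm -f "$SAMPLE_LOG"
```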
You can squeeze many other interesting tidbits out of your logs if you know a bit of that awk-grep-sort-uniq kung fu. Actually, all the above stuff was pretty basic, just to show you how easy and simple it is. Much more sophisticated queries can be performed. Finally, feel free to share your own Apache log parsing recipes in the comments.