Session 09 - Standard Linux Tools - Gathering System Information - EXERCISE

Requirements:

  • access_log.tgz (on Blackboard)
  1. Get the file access_log.tgz

  1. access_log.tgz is a tarred and compressed file. Use the file command to verify this.

Click to reveal answer
$ file access_log.tgz

access_log.tgz: gzip compressed data, last modified: Thu Apr 26 07:50:09 2007, from Unix, original size modulo 2^32 768000
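
The file output above only confirms the gzip layer. As a minimal sketch of what "tarred and compressed" means, the following builds a tiny .tgz in two separate steps and then peels off only the gzip layer. The scratch directory and the demo_log names are made up for this demo and are not part of the exercise.

```shell
# Scratch directory and file names are hypothetical, used only for this demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'line one\nline two\n' > demo_log

tar cf demo_log.tar demo_log           # layer 1: bundle the file into a tar archive
gzip -c demo_log.tar > demo_log.tgz    # layer 2: compress the archive with gzip

# Strip only the gzip layer and let tar list what is inside.
listing=$(gzip -dc demo_log.tgz | tar tf -)
echo "$listing"
```

The z flag of tar does the gzip step for you, which is why a single tar command is enough for access_log.tgz.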



  1. Use tar to extract and decompress the file.

Click to reveal answer
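$ tar tzf access_log.tgz    # t only lists the contents, nothing is extracted yet

access_log

$ tar xzf access_log.tgz    # x extracts, z decompresses

$ ls -l

-rw-r--r-- 1 user users 756899 Apr 26 2007 access_log

-rw-r--r-- 1 user users 36991 Nov 28 12:41 access_log.tgz



  1. access_log is a log file from an Apache web server. Where would this file normally be located?

Click to reveal answer
Usually in a directory under /var/log: /var/log/apache2/ on Debian and Ubuntu, /var/log/httpd/ on Red Hat based systems. The exact path is set by the CustomLog directive in the Apache configuration.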



  1. Is it text or binary? Use file to find out.

Click to reveal answer
$ file access_log



  1. What is the size of the file in bytes?

Click to reveal answer
756899 (the size column of the ls -l output above; wc -c access_log reports the same number)


  1. How many lines are in this file?

Click to reveal answer
$ wc -l access_log

4224 access_log



  1. Look in the file to get some idea of the contents. Use at least three different commands.

Click to reveal answer
$ less access_log    # exit less by pressing 'q'

$ cat access_log

$ head access_log

$ tail access_log



  1. Over what period of time did the web server use this file?
    • Look at the beginning of the file (head)
    • Look at the end of the file (tail)

Click to reveal answer
$ head access_log    # note the top-most date and time

$ tail access_log    # note the bottom-most date and time



  1. What is the first column of the log file? How are the columns separated?

Click to reveal answer
The first column is the connecting IP address; the columns are separated by a single space.
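
To see what space-separated columns mean in practice, here is a small sketch that splits one log line with cut. The line is the Googlebot robots.txt request that appears in this log. Note that the timestamp contains a space itself, so it occupies fields 4 and 5. The variable names are only for illustration.

```shell
line='66.249.65.115 - - [10/Apr/2006:19:09:11 +0200] "GET /robots.txt HTTP/1.1" 404 296 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

ip=$(printf '%s\n' "$line" | cut -d " " -f1)         # field 1: connecting IP address
stamp=$(printf '%s\n' "$line" | cut -d " " -f4,5)    # fields 4 and 5: timestamp with time zone
status=$(printf '%s\n' "$line" | cut -d " " -f9)     # field 9: HTTP status code

echo "ip:     $ip"
echo "stamp:  $stamp"
echo "status: $status"
```

Because cut splits on every single space, quoted parts such as the request or the user agent are spread over several fields.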


  1. Display the first column on your screen using cut

Click to reveal answer
$ cut -d " " -f1 access_log

145.99.163.83

145.99.163.83

62.234.138.228

213.84.219.101

...



  1. Display and sort the first column

Click to reveal answer
$ cut -d " " -f1 access_log | sort

128.30.52.13

128.30.52.13

128.30.52.13

128.30.52.34

128.30.52.34

...



  1. Reuse the previous command and add less to the command chain

Click to reveal answer
$ cut -d " " -f1 access_log | sort | less    # exit less by pressing 'q'



  1. Use uniq to filter out duplicates in the first column

Click to reveal answer
$ cut -d " " -f1 access_log | sort | uniq

128.30.52.13

128.30.52.34

131.211.84.18

133.27.228.132

...
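
The sort step before uniq matters, because uniq only collapses duplicate lines that are directly adjacent. A minimal sketch with four sample addresses (taken from the cut output earlier in this exercise, in an order chosen for the demo) shows the difference:

```shell
ips='145.99.163.83
62.234.138.228
145.99.163.83
145.99.163.83'

# Without sort, only the last two (adjacent) duplicates are merged.
unsorted=$(printf '%s\n' "$ips" | uniq | wc -l | tr -d ' ')
# With sort, all identical lines end up next to each other first.
sorted=$(printf '%s\n' "$ips" | sort | uniq | wc -l | tr -d ' ')
# sort -u combines both steps in one command.
short=$(printf '%s\n' "$ips" | sort -u | wc -l | tr -d ' ')

echo "uniq alone:  $unsorted"
echo "sort | uniq: $sorted"
echo "sort -u:     $short"
```

Here uniq alone reports 3 lines, while both sorted variants report the correct 2 distinct addresses.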



  1. Reuse the previous command and add less to the command chain

Click to reveal answer
$ cut -d " " -f1 access_log | sort | uniq | less    # exit less by pressing 'q'


  1. How many unique IP addresses are there in the first column?

Click to reveal answer
$ cut -d " " -f1 access_log | sort | uniq | wc -l

117



  1. Use less again to look at the access_log file, and see if you can spot entries that include a mention of 'Googlebot'.

Click to reveal answer
Example:

66.249.65.115 - - [10/Apr/2006:19:09:11 +0200] "GET /robots.txt HTTP/1.1" 404 296 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.115 - - [10/Apr/2006:19:09:11 +0200] "HEAD / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"



  1. This web server has been regularly visited by the Googlebot. Use grep to find the lines which contain the string 'Googlebot'

Click to reveal answer
$ grep Googlebot access_log

OR

$ grep -i googlebot access_log    # what does -i do?
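
As a hedged illustration of the difference between the two commands, the sketch below counts matches in four made-up log lines. The addresses, timestamps and the lowercase "googlebot" user agent are hypothetical sample data, not entries from access_log. The -c flag makes grep print the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample lines standing in for access_log.
log='192.0.2.10 - - [10/Apr/2006:19:09:11 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
192.0.2.20 - - [10/Apr/2006:19:12:40 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux) Firefox/1.5"
192.0.2.30 - - [10/Apr/2006:19:15:02 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
192.0.2.40 - - [10/Apr/2006:19:20:33 +0200] "GET / HTTP/1.1" 200 512 "-" "googlebot-lookalike/0.1"'

exact=$(printf '%s\n' "$log" | grep -c Googlebot)       # case-sensitive match
anycase=$(printf '%s\n' "$log" | grep -c -i googlebot)  # -i ignores upper and lower case

echo "case-sensitive:   $exact"
echo "case-insensitive: $anycase"
```

With this sample the case-sensitive search finds 2 lines and the case-insensitive search finds 3.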



  1. When did the Googlebot first visit?

Click to reveal answer
$ grep -i googlebot access_log | head



  1. When did the Googlebot last visit?

Click to reveal answer
$ grep -i googlebot access_log | tail
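
head and tail print ten lines by default, so the answer still has to be read off the first or last line. One possible refinement, sketched here on made-up sample lines (addresses and timestamps are hypothetical), is to keep exactly one line with -n 1 and cut out the text between the square brackets.

```shell
# Hypothetical sample lines standing in for access_log, in chronological order.
log='192.0.2.20 - - [09/Apr/2006:08:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Firefox/1.5"
192.0.2.10 - - [10/Apr/2006:19:09:11 +0200] "GET /robots.txt HTTP/1.1" 404 296 "-" "Googlebot/2.1"
192.0.2.10 - - [11/Apr/2006:03:30:45 +0200] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
192.0.2.11 - - [12/Apr/2006:22:15:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
192.0.2.20 - - [13/Apr/2006:09:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Firefox/1.5"'

# First cut keeps everything after "[", second cut keeps everything before "]".
first=$(printf '%s\n' "$log" | grep -i googlebot | head -n 1 | cut -d "[" -f2 | cut -d "]" -f1)
last=$(printf '%s\n' "$log" | grep -i googlebot | tail -n 1 | cut -d "[" -f2 | cut -d "]" -f1)

echo "first visit: $first"
echo "last visit:  $last"
```

This relies on the log being written in chronological order, which is the normal case for an Apache access log.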



  1. From how many different IP addresses did the Googlebot visit the web server? Solve in one command using pipes or use temporary files. Use sort, grep, cut, wc, and uniq. The cut command needs the flags -f and -d.

Click to reveal answer
$ grep -i googlebot access_log | cut -d " " -f1 | sort | uniq | wc -l

73
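
The question also allows temporary files instead of pipes. A minimal sketch of that variant follows, run on a small made-up log so it is self-contained. The sample lines, the scratch directory and all file names in it are hypothetical.

```shell
tmp=$(mktemp -d)    # scratch directory for the intermediate files

# Hypothetical sample data standing in for access_log.
cat > "$tmp/sample_log" <<'EOF'
192.0.2.10 - - [10/Apr/2006:19:09:11 +0200] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
192.0.2.30 - - [10/Apr/2006:19:10:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Firefox/1.5"
192.0.2.20 - - [10/Apr/2006:19:11:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
192.0.2.10 - - [10/Apr/2006:19:12:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
EOF

# One command per step, each writing its result to a file for the next step.
grep -i googlebot "$tmp/sample_log" > "$tmp/hits"
cut -d " " -f1 "$tmp/hits" > "$tmp/ips"
sort "$tmp/ips" > "$tmp/sorted"
uniq "$tmp/sorted" > "$tmp/unique"
count=$(wc -l < "$tmp/unique" | tr -d ' ')

echo "distinct Googlebot addresses: $count"
rm -r "$tmp"        # clean up the intermediate files
```

The pipe version does the same work without leaving files behind, but the intermediate files can be handy for inspecting each step with less.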



  1. BONUS: Create a command chain (a.k.a. "one-liner") that displays the 10 IP addresses most often used by the Googlebot.

Click to reveal answer
$ grep -i googlebot access_log | cut -d " " -f1 | sort | uniq -c | sort -n -r | head
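
The counting part of this chain can be tried on a handful of addresses. In the sketch below the addresses are hypothetical sample data, and the trailing awk only normalizes the leading spaces that uniq -c puts in front of each count.

```shell
# Hypothetical sample addresses: .10 appears three times, .30 twice, .20 once.
ips='192.0.2.10
192.0.2.30
192.0.2.10
192.0.2.20
192.0.2.30
192.0.2.10'

# uniq -c prefixes every distinct line with its number of occurrences;
# sort -n -r then orders those counts numerically, largest first.
top=$(printf '%s\n' "$ips" | sort | uniq -c | sort -n -r | head -n 2 | awk '{print $1, $2}')

echo "$top"
```

In the exercise answer, head without options prints the first ten lines, which is exactly the top 10 once the counts are sorted in descending order.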