WATCHING OUT FOR YOUR VPS

Introduction

We all love our VPSs to be in tip-top shape, always ready to respond to requests. However, there are times when you run into trouble: sluggish responses or even timeouts. It is good to always have an “In Case of Emergency” plan, such as re-routing traffic to a backup server or sending notifications to customers, to protect your service when the unpredictable happens.

With the immediate fire put out, or at least brought under control, you need to begin investigating what exactly has gone wrong. We will explore some basic tools that help you identify issues with your VPS.

#!Begin

Uptime

Uptime is a command that provides a quick glance at some useful information. It gives you the time the system has been up, along with the number of users logged in and the system load averages for 1, 5 and 15 minute (rolling) windows.

[missionctrl@orbit ~]$ uptime

 19:14:14 up 71 days,  2:14,  1 user,  load average: 0.08, 0.06, 0.05

This shows that the VPS has been up for 71 days, 2 hours and 14 minutes. There is currently one user logged on to the system. Understanding the load average is not straightforward. On a single-CPU system, a load average of 1.00 means that the CPU is 100% utilized. A value greater than 1 is okay as long as you have multiple CPUs. (To get the count of CPUs, use the command grep 'model name' /proc/cpuinfo | wc -l). So a load average of 2.00 for a 4 CPU system means that the overall CPU utilization is at roughly 50%.
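
The arithmetic above can be sketched as a small shell function. This is only an illustration: load_pct is a hypothetical helper name, not a standard command, and it treats the load average as a rough proxy for CPU utilization, just as the paragraph above does.

```shell
# load_pct is a hypothetical helper (not a standard command).
# It expresses a load average as a percentage of total CPU capacity.
# Usage: load_pct LOAD_AVERAGE CPU_COUNT
load_pct() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", (load / cpus) * 100 }'
}

load_pct 2.00 4    # prints 50, the 4 CPU example from the text
load_pct 1.00 1    # prints 100, a fully loaded single CPU
```

On a live Linux system you could feed it real values, for example load_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)".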

Here is a result from another VPS:

cmd@user:~$ uptime

 14:29:34 up 7 days, 15:41,  1 user,  load average: 8.10, 8.02, 8.01

cmd@user:~$ grep 'model name' /proc/cpuinfo | wc -l

8

For an 8 CPU VPS, we can see that every CPU is being fully utilized, and it is time to review the running processes and begin some investigation. This brings us to the next command.

Top

Top is an interactive application that displays running processes and CPU utilization details. Running top gives you a screen similar to this:

top - 12:08:58 up 12 days, 12:25,  1 user,  load average: 0.52, 0.38, 0.27

Tasks:  24 total,   1 running,  23 sleeping,   0 stopped,   0 zombie

%Cpu(s):  4.0 us,  9.3 sy,  0.0 ni, 86.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem:    524288 total,    63300 used,   460988 free,        0 buffers

KiB Swap:   524288 total,    29468 used,   494820 free.    16240 cached Mem

 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND

    1 root      20   0   28120    784    576 S  0.0  0.1   0:13.38 systemd

    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd/6726

    3 root      20   0       0      0      0 S  0.0  0.0   0:00.00 khelper/6726

   66 root      20   0   28812   3300   3192 S  0.0  0.6   1:32.32 systemd-journal

  190 root      20   0  186900    592    316 S  0.0  0.1   0:20.31 rsyslogd

  215 root      20   0   12604      8      4 S  0.0  0.0   0:00.00 agetty

  216 root      20   0   12604      8      4 S  0.0  0.0   0:00.00 agetty

 2134 root      20   0   38744      8      4 S  0.0  0.0   0:00.00 systemd-udevd

 2209 systemd+  20   0   25692      8      4 S  0.0  0.0   0:00.00 systemd-resolve

 2848 root      20   0   40584   1268    956 S  0.0  0.2   0:00.00 cron

 2849 hetrixt+  20   0    4280    652    548 S  0.0  0.1   0:00.00 sh

 2850 hetrixt+  20   0   11608   1420   1160 S  0.0  0.3   0:00.07 bash

 3376 www-data  20   0   61232    960    420 S  0.0  0.2   0:57.03 lighttpd

 3545 root      20   0   55132    376    260 S  0.0  0.1   0:18.31 sshd

 3740 root      20   0   82676   3864   3000 S  0.0  0.7   0:00.03 sshd

 4029 cmd       20   0   82676   1944   1056 S  0.0  0.4   0:00.00 sshd

 4030 cmd       20   0   20208   2040   1544 S  0.0  0.4   0:00.00 bash

 4131 cmd       20   0   21924   1528   1100 R  0.0  0.3   0:00.00 top

 4414 hetrixt+  20   0   11608    640    376 S  0.0  0.1   0:00.00 bash

 4415 hetrixt+  20   0    8444    804    672 S  0.0  0.2   0:00.00 vmstat

 4416 hetrixt+  20   0    4212    584    484 S  0.0  0.1   0:00.00 tail

 7578 root      20   0    4196     60     36 S  0.0  0.0   0:10.64 runsvdir

 8378 Debian-+  20   0   53248     84     32 S  0.0  0.0   0:00.24 exim4

22127 root      20   0   25848    184    116 S  0.0  0.0   0:06.81 cron

Use the arrow keys to navigate the list. Press q to exit. In many ways, top tops all the other commands we explore here, as it provides information on running processes, memory consumption and also uptime.

The first line is similar to the output from uptime. We see there are 24 tasks in total, only one of them running, and none of them consuming much CPU. The two lines just above the process list give the memory and swap details.
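
When the process list is long, it can help to filter it instead of scrolling. The sketch below assumes top is run in batch mode (top -b -n 1) and that the columns are in the default layout shown above. busy_procs is a hypothetical helper name, and the sample rows are made up for illustration.

```shell
# busy_procs is a hypothetical helper (not part of top).
# It reads `top -b -n 1` output on stdin and prints PID, %CPU and COMMAND
# for every process at or above the given %CPU threshold.
# Assumes the default column layout shown above (%CPU is column 9).
busy_procs() {
    awk -v min="$1" '
        in_list && NF >= 12 && $9 + 0 >= min + 0 { print $1, $9, $12 }
        $1 == "PID" { in_list = 1 }   # process rows start after the header
    '
}

# Made-up sample rows, in the same layout as the screen above.
busy_procs 50 <<'EOF'
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3376 www-data  20   0   61232    960    420 S 87.5  0.2   0:57.03 lighttpd
 3545 root      20   0   55132    376    260 S  0.3  0.1   0:18.31 sshd
EOF
```

With the sample rows this prints only the lighttpd line. On a real system the equivalent call would be top -b -n 1 | busy_procs 50.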

Homework assignment: install htop and try it out.

FREE -M

You will also want to check memory usage in terms of available space, swap allocation etc. Use free -m to give you this information.

[cmd@user ~]$ free -m

              total        used        free      shared  buff/cache   available

Mem:           1024         479         457         180          86         454

Swap:             0           0           0

The -m flag is used to report data in Megabytes. You could change it to -h, which translates to “human readable” form. All values are converted and suffixed by appropriate G/M/K to represent Gigabytes/Megabytes/Kilobytes respectively.

Adding the -s flag with a number of seconds, for example free -m -s 5, refreshes the values at that interval (every 5 seconds here). This will show if the memory consumption is increasing over time.
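
If you would rather watch a single number than the whole table, the "Mem:" row can be reduced to a percentage. This is a minimal sketch that assumes the column order shown in the sample output above. mem_used_pct is a hypothetical helper name.

```shell
# mem_used_pct is a hypothetical helper (not part of free).
# It reads `free -m` output on stdin and prints used memory as a percentage
# of total, taken from the "Mem:" row (total is column 2, used is column 3).
mem_used_pct() {
    awk '$1 == "Mem:" { printf "%.0f\n", ($3 / $2) * 100 }'
}

# The sample output from above: 479 of 1024 MB used, so this prints 47.
mem_used_pct <<'EOF'
              total        used        free      shared  buff/cache   available
Mem:           1024         479         457         180          86         454
Swap:             0           0           0
EOF
```

On a live system the call would be free -m | mem_used_pct.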

DF -H

df stands for “disk free” and is used to check disk space utilization. When invoked as is, it displays the disk allocation and utilization of all available filesystems on your node.

Filesystem     1K-blocks    Used Available Use% Mounted on

udev            16407644       0  16407644   0% /dev

tmpfs            3287496    1128   3286368   1% /run

/dev/sda4      954185180 3247948 902397484   1% /

tmpfs           16437460       0  16437460   0% /dev/shm

tmpfs               5120       0      5120   0% /run/lock

tmpfs           16437460       0  16437460   0% /sys/fs/cgroup

/dev/loop0         88704   88704         0 100% /snap/core/4486

/dev/sda2        1998672  147424   1730008   8% /boot

/dev/loop1         89088   89088         0 100% /snap/core/4830

tmpfs            3287492       0   3287492   0% /run/user/1000

/dev/loop2         89088   89088         0 100% /snap/core/4917

If the large numbers look hard to read, you could use the -h flag to display data in human readable form.

Filesystem      Size  Used Avail Use% Mounted on

udev             16G     0   16G   0% /dev

tmpfs           3.2G  1.2M  3.2G   1% /run

/dev/sda4       910G  3.1G  861G   1% /

tmpfs            16G     0   16G   0% /dev/shm

tmpfs           5.0M     0  5.0M   0% /run/lock

tmpfs            16G     0   16G   0% /sys/fs/cgroup

/dev/loop0       87M   87M     0 100% /snap/core/4486

/dev/sda2       2.0G  144M  1.7G   8% /boot

/dev/loop1       87M   87M     0 100% /snap/core/4830

tmpfs           3.2G     0  3.2G   0% /run/user/1000

/dev/loop2       87M   87M     0 100% /snap/core/4917

If you want to display used and available inodes, pass the -i flag to give a result like this:

Filesystem       Inodes  IUsed    IFree IUse% Mounted on

udev            4101911    472  4101439    1% /dev

tmpfs           4109365    687  4108678    1% /run

/dev/sda4      60661760 145756 60516004    1% /

tmpfs           4109365      1  4109364    1% /dev/shm

tmpfs           4109365      4  4109361    1% /run/lock

tmpfs           4109365     18  4109347    1% /sys/fs/cgroup

/dev/loop0        12819  12819        0  100% /snap/core/4486

/dev/sda2        131072    313   130759    1% /boot

/dev/loop1        12841  12841        0  100% /snap/core/4830

tmpfs           4109365     10  4109355    1% /run/user/1000

/dev/loop2        12842  12842        0  100% /snap/core/4917
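
The Use% (or IUse%) column is usually what you end up scanning for. The sketch below shows one way to pick out filesystems above a threshold. It assumes the column layout shown in the df outputs above. full_filesystems is a hypothetical helper name.

```shell
# full_filesystems is a hypothetical helper (not part of df).
# It reads `df` output on stdin and prints the mount point and Use% of every
# filesystem at or above the given percentage. Loop devices are skipped because
# read-only snap images always report 100%.
full_filesystems() {
    awk -v min="$1" '
        NR > 1 && $1 !~ /^\/dev\/loop/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= min + 0) print $6, $5
        }
    '
}

# A few rows from the sample output above, checked against a 5% threshold.
full_filesystems 5 <<'EOF'
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda4      954185180 3247948 902397484   1% /
/dev/loop0         88704   88704         0 100% /snap/core/4486
/dev/sda2        1998672  147424   1730008   8% /boot
EOF
```

With these rows only /boot is reported. On a live system you might run df -P | full_filesystems 90, where -P keeps each filesystem on a single line.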

DU

Along similar lines, the command du is worth exploring. du, or disk usage, is used to identify which folders and/or files are consuming the most space. It differs from df in that you can drill down to a particular folder and check its usage. It is invoked as:

du /path/to/directory

A sample of this output is shown here:

[cmd@user ~]$ sudo du /var/log

4       /var/log/ntpstats

60      /var/log/apt

98316   /var/log/journal/81ab9be955ae4eb489a0d397a990251d

98320   /var/log/journal

4       /var/log/lxd

4       /var/log/landscape

644     /var/log/installer

28      /var/log/unattended-upgrades

4       /var/log/dist-upgrade

119320  /var/log

Common flags for du include -h, which prints data in human readable form. The above output now looks like this:

4.0K    /var/log/ntpstats

60K     /var/log/apt

97M     /var/log/journal/81ab9be955ae4eb489a0d397a990251d

97M     /var/log/journal

4.0K    /var/log/lxd

4.0K    /var/log/landscape

644K    /var/log/installer

28K     /var/log/unattended-upgrades

4.0K    /var/log/dist-upgrade

117M    /var/log

-a is used to print the sizes of individual files as well as directories

-c is for printing a grand total line at the end of the display

-s prints only a single summary line for each path given, without the details

Flags can be combined to provide the desired level of information.

[cmd@user ~]$ du /var/log -s -h

117M    /var/log
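
To find the biggest consumers quickly, du output can be sorted by size. The sketch below assumes the sizes are plain numbers (so no -h). largest_first is a hypothetical helper name.

```shell
# largest_first is a hypothetical helper (not part of du).
# It reads `du` output on stdin (sizes as plain numbers, so no -h) and prints
# the N largest entries, biggest first.
largest_first() {
    sort -rn | head -n "$1"
}

# Rows from the sample `du /var/log` output above.
largest_first 2 <<'EOF'
4       /var/log/ntpstats
60      /var/log/apt
98320   /var/log/journal
644     /var/log/installer
119320  /var/log
EOF
```

With these rows it prints the /var/log total followed by /var/log/journal. On a live system the call would be sudo du /var/log | largest_first 5.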

NETSTAT

You get a call that the application running on your server is not reachable. You have checked all the messages and done your ps search. There are no error messages and you can see the process running. As you are wondering what could have gone wrong, you decide to check which port the application is listening on. As it turns out, someone had changed the configuration file and the port number is different. This is where netstat helps: it shows which ports are open and which process owns each one.

Some key flags used to control netstat output are listed below (they can be combined):

Flag	Description
-l	List only ports that are in Listen state
-t	List only ports that use TCP
-u	List only ports that use UDP
-n	List information, but do not perform any lookups on port, host or user names
-p	List information along with Process ID & Program Name

Combining the flags, we can get some information that can help in debugging. For example, to check whether lighttpd is listening for TCP connections on port 80, you could enter either of the following:

netstat -ltpn | grep "lighttpd"

netstat -ltpn | grep ":80"

Either of the above commands should show you the entry (if everything is working):

tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3376/lighttpd

tcp6       0      0 :::80                   :::*                    LISTEN      3376/lighttpd
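
In a script, a yes/no answer is often more useful than a line of output. The sketch below assumes the netstat column layout shown above, with the local address in column 4 and the state in column 6. is_listening is a hypothetical helper name.

```shell
# is_listening is a hypothetical helper (not part of netstat).
# It reads `netstat -ltn` output on stdin and succeeds (exit status 0) if any
# socket in the LISTEN state is bound to the given port.
is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# The sample lines from above, without the PID/Program column.
if is_listening 80 <<'EOF'
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
EOF
then
    echo "port 80 is listening"
else
    echo "nothing is listening on port 80"
fi
```

On a live system the check would be netstat -ltn | is_listening 80.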

LAST REBOOT

This command shows when the system was rebooted. Here is a sample output:

[cmd@user ~]$ last reboot

reboot   system boot  2.6.32-042stab12 Sat Apr 28 02:51 - 19:44 (68+16:53)

reboot   system boot  2.6.32-042stab12 Wed Jan  3 20:39 - 19:44 (182+23:04)

reboot   system boot  2.6.32-042stab12 Wed Jan  3 20:39 - 20:39  (00:00)

This command is particularly useful to identify if there have been any shutdowns resulting in downtime on your server. In most cases you should be able to account for every boot event. Note that it could be a provider-initiated reboot (e.g. a kernel patch that was required as a result of the Meltdown & Spectre bugs), though you would normally have received advance notification of this.

JOURNALCTL -XE

Systems that run systemd (Ubuntu 16.04+, CentOS 7+, Fedora, Debian) also run a daemon called journald, which keeps logs of boot messages, kernel messages and messages from various services. The journalctl app can be used to query and display the results from journald’s logging.

Just issuing the journalctl command displays all the logs from the beginning. Let us make it easier to filter out messages. To filter messages based on a service, use the -u flag.

journalctl -u mariadb.service

This lists all messages issued by mariadb.service. Over a period of time, that could be pages of information, so let us filter it further to messages since the last boot:

journalctl -b -u mariadb.service

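The -b flag limits the output to messages since the last boot. In case you restarted the machine due to a problem and want to see messages from the boot before the most recent one, pass -1 as an offset to -b, like so (this only works if journald is configured to keep its logs across reboots):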

journalctl -b -1 -u mariadb.service

If you know the time frame around when the error occurred, you can add the --since flag:

journalctl --since today                    # Lists journal entries from today

journalctl --since "2018-07-05 13:20:00"    # Lists entries from 13:20 on 5th July

You can add the -u flag to limit the output to the services you want.
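
Putting the filters together, a small wrapper can save some typing during an incident. This is a sketch only. svc_errors is a hypothetical name, and it adds two journalctl options not covered above: -p err, which limits the output to messages of priority "err" or more severe, and --no-pager, which prints directly instead of opening a pager.

```shell
# svc_errors is a hypothetical wrapper (not part of systemd).
# It lists messages of priority "err" or worse for one unit since a given time.
# Usage: svc_errors UNIT SINCE
svc_errors() {
    journalctl --no-pager -u "$1" -p err --since "$2"
}

# Example call (commented out so the sketch is safe to paste on any machine):
# svc_errors mariadb.service "2018-07-05 13:20:00"
```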

Application Log Entries

Most applications keep log entries based on settings such as ERROR/DEBUG level. These logs are usually in /var/log/{application-name}, unless the application uses a different setting. Please consult the application’s documentation for exact locations and how the log messages can be diagnosed.

!#Conclusion

Hopefully, this article gives you a basic toolkit to look under the hood and see which bolts need to be tightened. For advanced users or for critical applications, we recommend using monitoring tools (self-hosted or third party), though that is an article for a later date.