
Do advanced Linux disk usage diagnostics with this sysadmin tool

Use topdiskconsumer to address disk space issues when you're unable to interrupt production.

In Free up space on your Linux system with this open source tool, I introduced Top Disk Consumer Report Generator (topdiskconsumer), an open source tool to help find files, directories, and deleted files that are consuming unnecessary storage on your system. 

Occasionally, you'll find a discrepancy between your consumed space and what common userland tools report. Or you might be asked to reclaim space but are disallowed by your change advisory board (CAB) from interrupting production. topdiskconsumer, in combination with basic system administration skills, enables you to resolve these issues with relative ease and little risk in production.


In this article, I will cover three advanced disk utilization diagnostic topics, including one that leverages a feature I added in version 0.6.

Case 1: Files hidden by mounted volumes

When diagnosing a disk usage issue, you may struggle to free up space and notice a seemingly odd situation: the usage reported by df doesn't agree with what the files on the disk account for. For example, while troubleshooting a system, you see that df reports only 3.2GB available, yet the largest disk consumers reported when you run topdiskconsumer against the root filesystem don't account for all of the used space:

[root@klazarsk root]# topdiskconsumer --limit 5 --path / 
#_# BEGIN REPORT
== Server Time at start: ==
Sun Mar 26 10:17:31 EDT 2023

== Filesystem Utilization on [ / ]: ==
Filesystem                Type  Size  Used  Avail  Use%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   50G   47G   3.2G   94%   /        

== Inode Information for [ / ]: ==
Filesystem                Type  Inodes   IUsed   IFree    IUse%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   6908960  228666  6680294  4%     /        

== Storage device behind directory [ / ]: ==
/dev/mapper/RHELCSB-Root

== 5 Largest Files on [ / ]: ==
14G /root/test/java.log
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
293M    /var/lib/rpm/Packages
254M    /usr/lib/vmware/view/html5mmr/libcef.so

== 5 Largest Directories on [ / ]: ==
39G total
39G /
24G /root
14G /root/test
7.8G    /usr
5.1G    /var

== 5 Largest Files on [ / ] Older Than 30 Days: ==
14G /root/test/java.log
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive

== 5 Largest Deleted Files on [ / ] With Open Handles: ==
Size       COMMAND  File Handle       Filename
11.0313MB  sssd_be  /proc/2239/fd/18  /var/lib/sss/mc/initgroups

== Elapsed Time: ==
0h:0m:5s

== Server Time at completion: ==
Sun Mar 26 10:17:36 EDT 2023


#_# END REPORT
[root@klazarsk root]#

In this hypothetical production case, the CAB says the Java log cannot be rotated at this time, so you must look at other options. However, as you dig deeper into the directories on this filesystem and raise the limit on the number of files displayed, you still cannot find where the space has gone. When, out of curiosity, you compare du / to df, you notice a discrepancy in the reported used space: df reports 47G is used, but du reports only 39G:

[root@klazarsk ~]# df -hTP / 
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/RHELCSB-Root xfs    50G   47G  3.2G  94% /

[root@klazarsk ~]# du -hscx / 
39G /
39G total
[root@klazarsk ~]#
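You can script this comparison. Below is a hedged helper of my own (not part of topdiskconsumer) that reports the df-versus-du gap for a path; the number is only meaningful when the path is a filesystem mount point:

```shell
# Hypothetical helper (not part of topdiskconsumer): report the difference
# between the space df says is used and the space du can account for.
# Only meaningful when $1 is a filesystem mount point; a large positive gap
# hints at files hidden under mounts or deleted files with open handles.
usage_gap_kb() {
    df_used=$(df -kP "$1" | awk 'NR==2 {print $3}')
    du_used=$(du -sxk "$1" 2>/dev/null | awk '{print $1}')
    echo $(( df_used - du_used ))
}

# /tmp is used here purely for illustration; on the system above you
# would run: usage_gap_kb /
echo "df/du gap: $(usage_gap_kb /tmp) KiB"
```

A persistent multi-gigabyte gap that du cannot explain is your cue to start looking under mount points, as in this case.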

But wait, this jogs your memory: wasn't there a migration of /home to a separate volume recently? You wonder whether the sysadmin who performed that migration overlooked something. However, you cannot stop production to unmount /home and check, so how can you do a thorough analysis of this volume without interrupting production?

The solution? A bind mount in conjunction with topdiskconsumer's --alt-root option. A bind mount remounts a directory under an arbitrary second directory, letting you see under the original's mount points. In this example, bind mount / on /mnt/root:

[root@klazarsk ~]# mount --bind / /mnt/root 

[root@klazarsk ~]# mount | grep -i root
/dev/mapper/RHELCSB-Root on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/RHELCSB-Root on /mnt/root type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

[root@klazarsk ~]# 

If you rerun topdiskconsumer, you confirm your suspicions: When /home was migrated to a separate partition, a large file, /home/olduser/bigfile, was left behind under /home on the / filesystem:

[root@klazarsk ~]# topdiskconsumer --limit 5 --alt-root /mnt/root
#_# BEGIN REPORT
== Server Time at start: ==
Sun Mar 26 11:52:46 EDT 2023

== Filesystem Utilization on [ /mnt/root ]: ==
Filesystem                Type  Size  Used  Avail  Use%  Mounted    on
/dev/mapper/RHELCSB-Root  xfs   50G   47G   3.2G   94%   /mnt/root  

== Inode Information for [ /mnt/root ]: ==
Filesystem                Type  Inodes   IUsed   IFree    IUse%  Mounted    on
/dev/mapper/RHELCSB-Root  xfs   6885984  228667  6657317  4%     /mnt/root  

== Storage device behind directory [ /mnt/root ]: ==


== 5 Largest Files on [ /mnt/root ]: ==
14G /mnt/root/root/test/java.log
11G /mnt/root/root/rhel-8.5-x86_64-dvd.iso
8.1G    /mnt/root/home/olduser/bigfile   <============= THIS FILE
433M    /mnt/root/opt/BlueJeans/resources/app.asar
293M    /mnt/root/var/lib/rpm/Packages

== 5 Largest Directories on [ /mnt/root ]: ==
47G total
47G /mnt/root
24G /mnt/root/root
14G /mnt/root/root/test
8.1G    /mnt/root/home/olduser
8.1G    /mnt/root/home

== 5 Largest Files on [ /mnt/root ] Older Than 30 Days: ==
14G /mnt/root/root/test/java.log
11G /mnt/root/root/rhel-8.5-x86_64-dvd.iso
433M    /mnt/root/opt/BlueJeans/resources/app.asar
254M    /mnt/root/usr/lib/vmware/view/html5mmr/libcef.so
208M    /mnt/root/usr/lib/locale/locale-archive

== 5 Largest Deleted Files on [ /mnt/root ] With Open Handles: ==
Size       COMMAND  File Handle       Filename
11.0313MB  sssd_be  /proc/2239/fd/18  /var/lib/sss/mc/initgroups

== Elapsed Time: ==
0h:0m:6s

== Server Time at completion: ==
Sun Mar 26 11:52:52 EDT 2023


#_# END REPORT
[root@klazarsk ~]# 

Remove the file, then verify the space has been freed and the reported usage matches expectations. Then you can unmount the bind mount of the / filesystem at /mnt/root:

[root@klazarsk ~]# rm /mnt/root/home/olduser/bigfile
rm: remove regular file '/mnt/root/home/olduser/bigfile'? y

[root@klazarsk ~]# df -hTP / 
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/RHELCSB-Root xfs    50G   39G   12G  78% /

[root@klazarsk ~]# umount /mnt/root 
[root@klazarsk ~]# 

You've learned that you can bind mount a filesystem on an alternate mount point to check underneath other filesystems' mount points for files hidden by an active mount, and that those files can be accessed, and even cleaned up, through the bind mount without ever interrupting production services.


Case 2: File is in use, but you need to reclaim the space it is using

When administering application servers in a dedicated environment, it's not unusual to discover very large console .out log files that are being actively written to by Java servers. In this example, topdiskconsumer identifies a 21GB log file, /root/test/java.log:

[root@klazarsk ~]# topdiskconsumer --limit 5
#_# BEGIN REPORT
== Server Time at start: ==
Sun Mar 26 11:26:56 EDT 2023

== Filesystem Utilization on [ / ]: ==
Filesystem                Type  Size  Used  Avail  Use%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   50G   46G   4.1G   92%   /        

== Inode Information for [ / ]: ==
Filesystem                Type  Inodes   IUsed   IFree    IUse%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   8699240  228675  8470565  3%     /        

== Storage device behind directory [ / ]: ==
/dev/mapper/RHELCSB-Root

== 5 Largest Files on [ / ]: ==
21G /root/test/java.log    <=============== THIS FILE
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
293M    /var/lib/rpm/Packages
254M    /usr/lib/vmware/view/html5mmr/libcef.so

== 5 Largest Directories on [ / ]: ==
46G total
46G /
31G /root
21G /root/test
7.8G    /usr
5.2G    /var

== 5 Largest Files on [ / ] Older Than 30 Days: ==
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive
208M    /opt/google/chrome/chrome

== 5 Largest Deleted Files on [ / ] With Open Handles: ==
Size       COMMAND  File Handle       Filename
11.0313MB  sssd_be  /proc/2239/fd/18  /var/lib/sss/mc/initgroups

== Elapsed Time: ==
0h:0m:5s

== Server Time at completion: ==
Sun Mar 26 11:27:01 EDT 2023


#_# END REPORT
[root@klazarsk ~]# 

In this hypothetical example, the engineer who deployed the application failed to implement a logrotate profile, and the engineer who coded it didn't leverage Java's log rotation. Although the CAB disallows you from interrupting production services, you must free up space to avoid an outage.

At a loss, you consult your escalation point and are reminded that ASCII logs are nearly always safe to truncate. With that knowledge, you decide to copy-truncate the file manually: leverage gzip -c to dump the compressed file to stdout, redirect it to a new file on another filesystem, truncate the original using a redirector, and then move the archive back to the Java directory:

[root@klazarsk ~]# file /root/test/java.log
/root/test/java.log: ASCII text

[root@klazarsk ~]# tail -n 5 /root/test/java.log
Mon Mar 26 11:44:19 EDT 2023
Mon Mar 26 11:44:19 EDT 2023
Mon Mar 26 11:44:19 EDT 2023
Mon Mar 26 11:44:19 EDT 2023
Mon Mar 26 11:44:19 EDT 2023

[root@klazarsk ~]# gzip -c /root/test/java.log > /tmp/java.log.20230326.gz 

[root@klazarsk ~]# > /root/test/java.log

[root@klazarsk ~]# tail -f /root/test/java.log
Mon Mar 26 11:45:13 EDT 2023
Mon Mar 26 11:45:14 EDT 2023
Mon Mar 26 11:45:15 EDT 2023
Mon Mar 26 11:45:16 EDT 2023
Mon Mar 26 11:45:17 EDT 2023
Mon Mar 26 11:45:18 EDT 2023
Mon Mar 26 11:45:19 EDT 2023
^C

[root@klazarsk ~]# mv /tmp/java.log.20230326.gz /root/test/

This leaves you with 22GB available, and you've verified that the log is still being written to without issue!

[root@klazarsk ~]# df -hTP / 
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/RHELCSB-Root xfs    50G   29G   22G  58% /

[root@klazarsk ~]#

This works with most ASCII log files because process log files are almost always 100% write-only, read-never. That makes the files safe to copy and truncate in place without ever interrupting the process; in fact, the process never even knows the file has been manipulated. It is this write-only nature of most logs that makes truncation possible in production when you must keep systems up without interruption and need to reclaim space from large logs.
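If you want to convince yourself this is safe before touching production, here is a self-contained rehearsal of the copy-then-truncate idea (scratch files only; all names are illustrative), with a background writer standing in for the Java process:

```shell
# Scratch-directory rehearsal of manual copy-truncate (illustrative only).
workdir=$(mktemp -d)
log="$workdir/app.log"

# Background writer standing in for the Java process: it holds the log
# open in append mode, the way most daemons do.
( exec >>"$log"
  for i in $(seq 1 100); do echo "line $i"; sleep 0.01; done ) &
writer=$!

sleep 0.3                                   # let some lines accumulate
gzip -c "$log" > "$workdir/app.log.gz"      # archive a copy elsewhere
: > "$log"                                  # truncate in place; no unlink

wait "$writer"                              # writer was never interrupted
remaining=$(wc -l < "$log")
echo "lines that landed after truncation: $remaining"
```

Because the writer's descriptor is in append mode, its writes simply land at the new end of the truncated file; the process neither crashes nor needs a restart.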

As an aside, logrotate's copytruncate feature works the same way, just fast enough to retain nearly 100% of log notices: It creates a copy of the file, then truncates it in place without ever unlinking it (hence, "copytruncate").
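For comparison, a minimal logrotate profile using copytruncate might look like the following. The path is the hypothetical Java log from this example, and the rotation options are illustrative, not a recommendation:

```
/root/test/java.log {
    daily
    rotate 7
    compress
    copytruncate
    missingok
    notifempty
}
```

Dropping a stanza like this into /etc/logrotate.d/ would have prevented this scenario in the first place.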

Case 3: topdiskconsumer reports a very large file you deleted is still using space

In this example, the server alerts you at the end of the day about very low disk space. As with previous examples, the CAB has instituted a hard change freeze, and appealing to them for approval to restart the Java process will take far too long. By the time they even read the request to consider approval, the Java process will have gone down.

Per change-control guidelines, rotating the logs is allowed even under a hard change freeze, so you compress a copy of the log and then try deleting it. But when you run topdiskconsumer, you find the file has been deleted but is reported as held open, and you need to reclaim that space. You try to reach the one emergency contact authorized to allow an exclusion from change control, but that escalation point is unavailable.

You take a closer look at the report and notice that the "deleted files with open handles" section lists a file handle (explained below) with the Java log you deleted, and it's holding on to 23.6GB worth of space:

[root@klazarsk ~]# topdiskconsumer --limit 5 
#_# BEGIN REPORT
== Server Time at start: ==
Mon Mar 27 12:20:14 EDT 2023

== Filesystem Utilization on [ / ]: ==
Filesystem                Type  Size  Used  Avail  Use%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   50G   49G   1.8G   97%   /        

== Inode Information for [ / ]: ==
Filesystem                Type  Inodes   IUsed   IFree    IUse%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   3804200  228512  3575688  7%     /        

== Storage device behind directory [ / ]: ==
/dev/mapper/RHELCSB-Root

== 5 Largest Files on [ / ]: ==
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
293M    /var/lib/rpm/Packages
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive

== 5 Largest Directories on [ / ]: ==
26G total
26G /
11G /root
7.8G    /usr
5.2G    /var
3.4G    /var/log

== 5 Largest Files on [ / ] Older Than 30 Days: ==
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive
208M    /opt/google/chrome/chrome

== 5 Largest Deleted Files on [ / ] With Open Handles: ==
Size       COMMAND  File Handle       Filename
23613.9MB  java     /proc/24895/fd/0  /root/test/java.log


== Elapsed Time: ==
0h:0m:5s

== Server Time at completion: ==
Mon Mar 27 12:20:19 EDT 2023


#_# END REPORT

You decide to double-check and verify the file really was deleted:

[root@klazarsk ~]# ls -lh /root/test/java.log 
ls: cannot access '/root/test/java.log': No such file or directory

[root@klazarsk ~]# ls -lh /root/test/
total 116M
-rw-r--r--. 1 root root 116M Mar 27 01:25 java.log.2

[root@klazarsk ~]# 

This is where topdiskconsumer's detailed "deleted files with open handles" report comes in handy. It does the hard work of parsing lsof output to identify any unlinked-but-held-open files and their file handles. A file handle is a reference a process holds on a file; even if you try to force a deletion, as long as a process maintains a handle on the file because it is in use, the space will not be released until the process exits (or, in the case of many processes, until you send it a SIGHUP).

Normally you would have to identify the PID holding the file open by using lsof, and then go to /proc/[pid]/fd and manually find the file handle to figure out where that file can be accessed. But topdiskconsumer automates this for you, making it much easier.
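If you're curious what that automation boils down to, here is a hedged sketch of my own that walks /proc directly and lists descriptors pointing at deleted files; it is a simplified stand-in for the lsof parsing the tool performs:

```shell
# Hypothetical re-implementation of the scan topdiskconsumer automates:
# list file descriptors that point at deleted (unlinked) files.
# Run as root to see every process; unprivileged users see only their own.
list_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

list_deleted_open
```

The kernel marks each such symlink with a trailing " (deleted)", which is all the detection logic needs.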

In this example, if you look at /proc/24895/fd, you can verify the handle and see that file descriptor 0 points to the deleted log:

[root@klazarsk ~]# ls -lh /proc/24895/fd
total 0
lr-x------. 1 root root 64 Mar 27 12:53 0 -> '/root/test/java.log (deleted)'
lrwx------. 1 root root 64 Mar 27 12:53 1 -> /dev/pts/0
lrwx------. 1 root root 64 Mar 27 12:53 2 -> /dev/pts/0
lrwx------. 1 root root 64 Mar 27 12:53 3 -> 'socket:[234187]'

[root@klazarsk ~]# 
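You can safely rehearse this recovery technique with scratch files. The sketch below (paths and the sleep stand-in are illustrative) deletes a file while a process holds it open, then reads the contents back through /proc:

```shell
# Scratch-file rehearsal: delete a file while a process holds it open,
# then read the "deleted" contents back through /proc. The sleep process
# stands in for a daemon holding a log open on fd 0.
workdir=$(mktemp -d)
printf 'precious log line\n' > "$workdir/app.log"

sleep 30 < "$workdir/app.log" &
holder=$!
sleep 1                                  # let the holder open the file

rm "$workdir/app.log"                    # unlinked, but space not reclaimed
recovered=$(cat "/proc/$holder/fd/0")    # still readable via the handle
echo "recovered: $recovered"

kill "$holder"
```

Everything done to java.log through /proc/24895/fd/0 below works on exactly this principle.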

However, in the time that has passed while you tried to obtain an emergency change approval, the process has kept writing, so simply truncating now would lose quite a few log messages. You accept that some logged notices will be lost, but you want to retain as many as possible. (If you really need to minimize losses because the log is critical, you could dump the tail out to a file with tail -n [number] /proc/24895/fd/0 > [newfile], but that isn't the case here.)

So, as in Case 2, leverage gzip to stdout and redirect the output to another filesystem with space. You find there is enough space on /home, so you can temporarily store the archive there (/home/klazarsk in this case). But first, make sure the process is still actively running and logging:

[root@klazarsk ~]# df -t xfs -hTP
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/RHELCSB-Root xfs    50G   49G  1.1G  98% /
/dev/nvme0n1p2           xfs   3.0G  461M  2.6G  16% /boot
/dev/mapper/RHELCSB-Home xfs   100G   49G   52G  49% /home

[root@klazarsk ~]# tail -f /proc/24895/fd/0
Mon Mar 27 13:06:59 EDT 2023
Mon Mar 27 13:07:00 EDT 2023
Mon Mar 27 13:07:01 EDT 2023
Mon Mar 27 13:07:02 EDT 2023
Mon Mar 27 13:07:03 EDT 2023
Mon Mar 27 13:07:04 EDT 2023
Mon Mar 27 13:07:05 EDT 2023
Mon Mar 27 13:07:06 EDT 2023
Mon Mar 27 13:07:07 EDT 2023
Mon Mar 27 13:07:08 EDT 2023
Mon Mar 27 13:07:09 EDT 2023
^C

[root@klazarsk ~]# 

You've verified it's logging as expected. Next, gzip the held-open file to an alternate location, tail off the last 1,000 lines to capture what was written while gzipping, and then truncate the file with the > redirector:

[root@klazarsk ~]# gzip -1 /proc/24895/fd/0 -c > /home/klazarsk/java.log-20230427.gz 
gzip: /proc/24895/fd/0: file size changed while zipping

[root@klazarsk ~]# tail -n 1000 /proc/24895/fd/0 > /root/test/java.log

[root@klazarsk ~]# > /proc/24895/fd/0

[root@klazarsk ~]# 

Lastly, verify production was uninterrupted:

[root@klazarsk ~]# tail -f /proc/24895/fd/0
Mon Mar 27 13:30:30 EDT 2023
Mon Mar 27 13:30:31 EDT 2023
Mon Mar 27 13:30:32 EDT 2023
Mon Mar 27 13:30:33 EDT 2023
Mon Mar 27 13:30:34 EDT 2023
Mon Mar 27 13:30:35 EDT 2023
Mon Mar 27 13:30:36 EDT 2023
Mon Mar 27 13:30:37 EDT 2023
Mon Mar 27 13:30:38 EDT 2023
Mon Mar 27 13:30:39 EDT 2023
Mon Mar 27 13:30:40 EDT 2023
^C

[root@klazarsk ~]# 

You can see the space has been immediately reclaimed, averting an outage, and now you have 25GB available!

[root@klazarsk ~]# topdiskconsumer --limit 5
#_# BEGIN REPORT
== Server Time at start: ==
Mon Mar 27 13:35:34 EDT 2023

== Filesystem Utilization on [ / ]: ==
Filesystem                Type  Size  Used  Avail  Use%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   50G   26G   25G    52%   /        

== Inode Information for [ / ]: ==
Filesystem                Type  Inodes    IUsed   IFree     IUse%  Mounted  on
/dev/mapper/RHELCSB-Root  xfs   26214400  228531  25985869  1%     /        

== Storage device behind directory [ / ]: ==
/dev/mapper/RHELCSB-Root

== 5 Largest Files on [ / ]: ==
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
293M    /var/lib/rpm/Packages
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive

== 5 Largest Directories on [ / ]: ==
26G total
26G /
11G /root
7.8G    /usr
5.2G    /var
3.4G    /var/log

== 5 Largest Files on [ / ] Older Than 30 Days: ==
11G /root/rhel-8.5-x86_64-dvd.iso
433M    /opt/BlueJeans/resources/app.asar
254M    /usr/lib/vmware/view/html5mmr/libcef.so
208M    /usr/lib/locale/locale-archive
208M    /opt/google/chrome/chrome

== 5 Largest Deleted Files on [ / ] With Open Handles: ==
Size       COMMAND  File Handle       Filename
11.0313MB  sssd     /proc/1543/fd/19  /var/lib/sss/mc/initgroups

== Elapsed Time: ==
0h:0m:5s

== Server Time at completion: ==
Mon Mar 27 13:35:39 EDT 2023


#_# END REPORT
[root@klazarsk ~]# 

About file handles

You now know how to interpret the "deleted files with open handles" section of the report to reclaim disk space, but you may be wondering: What is a file handle, and why does the process continue to hold the file open after I've deleted it? Why are these extra steps necessary?

It sounds complex, but it's really quite simple: Processes create "file handles" to hold files open, sort of like having an attorney or consultant on retainer for easy use. If the file is deleted, the system will "delete" the file by removing the hard link to it, but the file itself is not erased from the disk until the process relinquishes the handle. (Note: whether or not the file contents are zeroed out depends on the filesystem type, filesystem options, and mount options.)

The impact is that if you delete a large log file by filename while a process has an open file handle to it (you can see these via /proc/$PID/fd, or run lsof [filename] before deleting it), the space remains tied up by the process. Once you rm the file, the link to it is gone, and it can no longer be accessed through normal tools, except through /proc/$PID/fd/[filehandle].

The reason the kernel holds on to these files until the handles are relinquished, rather than reclaiming the space immediately, is to prevent user errors from interrupting a running process. If the system allowed the deletion to take effect immediately, and a process that expects the file tried to read it without exception handling for the missing data, the process would crash, resulting in an outage.

A side effect of how Unix and Unix-like operating systems such as Red Hat Enterprise Linux work is that these kernel file descriptors can be accessed directly through the /proc virtual filesystem. As a result, you can safely "truncate" log files through their file descriptors, since ASCII log files are very nearly (but not always) 100% write-only, read-never by the process; that is the effect leveraged in this example. There are ways to determine whether a process reads from a file (strace, for example), but that is outside the scope of this article.

The process in Case 3 is similar to the copytruncate feature explained in Case 2 above.

[ To learn more about the rm command, see What happens when you delete a file in Linux? ]

Wrapping up

I have been working on topdiskconsumer for several years (my previous article describes the journey). You can find the script in my public GitHub repository. There are still some bugs, and an interoperability issue on some Linux distributions has been reported; fixes will come as I carve out time to continue enhancing the utility.

Topdiskconsumer is a valuable tool, but no tool can substitute for proper analysis and system administration skills. In the coming months, I will look at potentially adding more proactive features to the utility, but at the end of the day, experience is required to properly interpret the report and take corrective action.


Kimberly Lazarski

Kimberly is a Red Hat Senior Storage Consultant focused on utilizing open-source technologies to tackle customer problems.
