
Free up space on your Linux system with this open source tool

Try the Top Disk Consumer Report Generator to help find files, directories, and deleted files that are consuming unnecessary storage on your system.

Whether you're a sysadmin by career or you're just dealing with a home business server in your broom closet that's running out of space, you probably need to find out where disk space is going to waste. Tools like du and df will get you part of the way there. But when you need to clean up space quickly, you need a different type of utility to fill the gap.
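For example, df reports overall usage for a filesystem, and du (kept to a single filesystem with -x) can surface the biggest directories, but chaining them together for every mount gets tedious:

$ df -h /home                                               # overall usage of the filesystem
$ sudo du -xh /home 2>/dev/null | sort -rh | head -n 10     # ten biggest directories, one filesystem only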


This is where my Top Disk Consumer Report Generator utility comes in handy. This open source tool (GPLv3) allows you to clear space with minimal detective work. Run topdiskconsumer from any directory, and it reports the overall free space on that filesystem, its largest files and directories, its largest older files, and deleted files with space still in use.

This utility can be used for routine housekeeping, troubleshooting disk usage on production servers, and as a training tool for junior sysadmins.

Background

A few years ago, I created this script, and I have maintained (and rewritten) it ever since. It was poorly coded at first because it grew organically: I tacked on each feature as I needed to gather information for ticket updates justifying actions taken on servers. My little one-liner eventually grew into a 2.5KB behemoth that was widely used by customers of a large hosting company. However, apart from one variable (intNumFiles) that could be assigned at the beginning of the script, it was entirely hard-coded, and it awkwardly parsed utilities like df to extract specific stats in a very inefficient way.

I had always intended to rewrite it as a full command-line utility, including switches to turn off specific reports, timeouts, formatting options other than bbcode, and a help screen.

I still have my original version of this script, and if it weren't for how it grew organically and how well it has served as a teaching tool for my mentees, I'd be embarrassed by how bad the code is. What makes me proud of it is that it has helped me teach junior engineers how to take the information and tools they already have and parse and reuse the output to drive other tools.

Install Top Disk Consumer Report Generator

The installation process is straightforward. Here are the basic steps:

1. Open the storagetoolkit repository in a web browser.

2. Click the green CODE button above the file listings.

3. Click the copy icon to copy the repository URL.

[Image: Use the GitHub Code button to download the code for storagetoolkit (Kimberly Lazarski, CC BY-SA 4.0)]

4. Open a terminal and change to the directory where you want to download the files.

5. Type git clone https://github.com/klazarsk/storagetoolkit.git to clone the project.

6. Enter cd storagetoolkit to change to the subdirectory Git created for you.

7. Enter sudo chmod a+x topdiskconsumer to ensure the script has execute permissions.

8. List the files to verify the permissions; your username will own them:

$ sudo chmod a+x topdiskconsumer
[klazarsk@klazarsk storagetoolkit]$ ls -lh
total 16K
-rw-rw-r--. 1 klazarsk klazarsk 3.4K Jan  6 15:29 README.md
-rwxrwxr-x. 1 klazarsk klazarsk  11K Jan  6 15:29 topdiskconsumer

9. Use sudo or root privileges to copy the file to /usr/bin or another system directory in the search path (or ~/.local/bin of the account you wish to run it from, if you do not want it in your system directory):

$ sudo cp topdiskconsumer /usr/bin -v
[sudo] password for klazarsk:
'topdiskconsumer' -> '/usr/bin/topdiskconsumer'
[klazarsk@klazarsk storagetoolkit]$ ls -lh /usr/bin/topdiskconsumer
-rwxr-xr-x. 1 root root 11K Jan  6 15:32 /usr/bin/topdiskconsumer

Now you can execute the file from anywhere on the system.
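To confirm that the script is on your search path, you can ask the shell where it resolves and print its help screen:

$ command -v topdiskconsumer
/usr/bin/topdiskconsumer
$ topdiskconsumer --help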

Find which files and directories are using storage

To use this tool, place the topdiskconsumer file into a directory in your search path, chmod it as executable, and run it from any directory on the filesystem you want to analyze.

The script outputs a report of overall disk usage on the filesystem, the top 20 largest files, the top 20 largest directories, and the top 20 largest files older than 30 days. It also finds "ghosts" by identifying unlinked (deleted) files that are still consuming space due to open file handles. It even shows the file handle so you can easily reclaim that space.
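If you're curious how such "ghosts" can be found by hand, lsof can list open files whose on-disk link count has dropped to zero. This is the general technique, not necessarily the script's exact method:

$ sudo lsof -nP +L1 /home     # open files on /home with link count 0 (deleted but still held open)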

I've incorporated HTML, ANSI, and bbcode formatting for boldface headers, with ANSI as the default. I've also provided command-line options, including timeouts, omitting metadata, and skipping individual reports to save time. The full list of options appears below.

If you want to run it on enormous filesystems, I recommend using the command-line switches to turn off the reports you don't care about and running it in a screen or tmux session. Alternatively, you can set a timeout so each report section is killed after the specified time (the report can take days to run on mounts with an enormous number of files).
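For example, here is one way to run a bounded report inside tmux; the exact duration syntax that --timeout accepts may vary, so check --help on your installed version:

$ tmux new-session -s diskreport
$ sudo topdiskconsumer --timeout 30m --skipold --skipunlinked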


Sample output

Try a sample run to view the top five disk consumers on your system. Here is the output, limited to five items per section. (Note: The 5 Largest Directories section displays "total" as an extra entry above the limit count; this is intentional.)

$ sudo ./topdiskconsumer --limit 5

#_# BEGIN REPORT
== Server Time at start: ==
Wed Jan  4 13:53:08 EST 2023

== Filesystem Utilization on [ /home ]: ==
Filesystem             Type  Size  Used  Avail  Use%  Mounted on
/dev/mapper/RHEL-Home  xfs   100G  45G   56G    45%   /home

== Inode Information for [ /home ]: ==
Filesystem             Type  Inodes    IUsed   IFree     IUse%  Mounted on
/dev/mapper/RHEL-Home  xfs   52428800  483742  51945058  1%     /home

== Storage device behind directory [ /home ]: ==
/dev/mapper/RHEL-Home

== 5 Largest Files on [ /home ]: ==
21G     /home/user/VirtualMachines/rhel8.qcow2
618M    /home/user/ceph-common.tar
500M    /home/user/scratch/bigFile
405M    /home/user/Downloads/working/backup.tar
281M    /home/user/.local/share/aaaaaaa1f.file

== 5 Largest Directories on [ /home ]: ==
45G	total
45G	/home
44G	/home/user
21G	/home/user/VirtualMachines
18G	/home/user/.local/share
18G	/home/user/.local

== 5 Largest Files on [ /home ] Older Than 30 Days: ==
21G     /home/user/VirtualMachines/rhel8.qcow2
618M	/home/user/ceph-common.tar
500M	/home/user/scratch/bigFile
405M	/home/user/Downloads/working/backup.tar
281M	/home/user/abc123.file

== 5 Largest Deleted Files on [ /home ] With Open Handles: ==
Size  COMMAND  File Handle          Filename
4MB   chrome   /proc/2728808/fd/14  /home/user/.config/foo.pma

== Elapsed Time: ==
0h:0m:30s

== Server Time at completion: ==
Wed Jan  4 13:53:38 EST 2023


#_# END REPORT
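Once the report identifies a deleted file with an open handle, you can reclaim the space without rebooting: restart the process that holds the handle, or, if the process must keep running, truncate the deleted file through its /proc file handle. Using the chrome example from the report above (truncating data out from under a running program can confuse it, so prefer a restart when you can):

$ sudo truncate -s 0 /proc/2728808/fd/14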


Refine the output with flags

When executed with no arguments, the script identifies the mount point containing the current working directory and runs the report from there, listing the top 20 largest files, the top 20 largest directories, and the top 20 largest files older than 30 days.
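If you're wondering how a script can resolve the mount point behind the current directory, findmnt does it in one call. This illustrates the general technique, not necessarily the script's internal implementation:

$ findmnt -n -o TARGET --target "$PWD"
/home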

You can refine the output further with command-line arguments, which are detailed in the documentation (also available with the --help option); a combined example follows the list:

  • --format (or -f) [format]: Format headings as html, bbcode (bold), or ansi (for terminals and rich text). The default is ansi.
  • --path (or -p) [path]: Set a path; topdiskconsumer runs its report from the mount point containing that path.
  • --limit (or -l) [number]: Limit each report section to this number of entries. The default is 20.
  • --timeout (or -t) [duration]: Set a timeout for each section of the report. Note that specifying a timeout may produce incomplete and misleading results.
  • --skipold (or -o): Skip the report of the largest files older than 30 days.
  • --skipdir (or -d): Skip the report of the largest directories.
  • --skipmeta (or -m): Omit metadata such as reserved blocks, start and end times, and duration.
  • --skipunlinked (or -u): Skip the report of deleted files with open handles.
  • --skipfiles (or -f): Skip the report of the largest files.
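As a combined example (the path and values here are illustrative), the following invocation reports the ten largest files and old files on the mount holding /var/log, skips the slower directory walk, and writes an HTML-formatted report:

$ sudo topdiskconsumer --path /var/log --limit 10 --skipdir --format html > /tmp/varlog-report.html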

Wrap up

I like to use Hungarian notation for my variable and function names, I prefer grep -E over egrep (which is just a shell-script wrapper around grep -E), and I usually end Bash statements with semicolons because it makes my scripts easier to collapse and adapt into a one-liner.
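For instance, trailing semicolons let a block collapse into a one-liner without any editing. Here's a contrived sketch in that style (strReportPath is a made-up variable for illustration, not code from the script):

intNumFiles=20;
strReportPath="$PWD";
echo "Reporting the top ${intNumFiles} files under ${strReportPath}";

becomes:

intNumFiles=20; strReportPath="$PWD"; echo "Reporting the top ${intNumFiles} files under ${strReportPath}";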

You can access the complete source code in my repository.

