Find lost files with Scalpel

July 11, 2019 | Seth Kenlon | 5-minute read

As a system administrator, part of your responsibility is to help users manage their data. One of the vital aspects of doing that is to ensure your organization has a good backup plan, and that your users either make their backups regularly, or else don’t have to because you’ve automated the process.

However, sometimes the worst happens. A file gets deleted by mistake, a filesystem becomes corrupt, or a partition gets lost, and for whatever reason, the backups don’t contain what you need.

As we discussed in How to prevent and recover from accidental file deletion in Linux, before trying to recover lost data, you must find out why the data is missing in the first place. It’s possible that a user has simply misplaced the file, or that there is a backup that the user isn’t aware of. But if a user has indeed removed a file with no backups, then you know you need to recover a deleted file. If a partition table has become scrambled, though, then the files aren’t really lost at all, and you might want to consider using TestDisk to recover the partition table, or the partition itself.

What happens if your file or partition recovery isn’t successful, or is only partially successful? Then it’s time for Scalpel. Scalpel performs file carving operations based on patterns describing unique file types. It looks for these patterns based on binary strings and regular expressions, and then extracts the file accordingly.

This tool isn’t currently being maintained, but it’s ever-reliable, compiling and running exactly as expected. If you’re running Red Hat Enterprise Linux (RHEL) 7, RHEL 8, or Fedora, you can download Scalpel’s RPM installers, along with its dependency, libtre, from klaatu.fedorapeople.org.

Starting with Scalpel

Scalpel comes bundled with a comprehensive list of file types and their most unique identifying features. Sometimes, a file can be identified by predictable text at its head and tail:

htm    n    50000   <html         </html>

At other times, cryptic-looking hex codes are necessary:

jpg    y   200000000    \xff\xd8\xff\xe0\x00\x10    \xff\xd9

Scalpel expects you to duplicate /etc/scalpel.conf, and then edit your copy to include the file types you hope to recover and to exclude the file types you know you don’t need. For instance, if you know you don’t have or care about .fws files, then comment that line out of the file. Doing this can speed up the recovery process and reduce false positives.
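
As a minimal sketch of that step, the helper below comments out one file type in your copy of the config. The function name disable_type and the file name my-scalpel.conf are illustrative and not part of Scalpel, and the sketch assumes each definition starts with its extension at the beginning of a line:

```shell
# disable_type CONF EXT
# Hypothetical helper: comment out every active definition of EXT in CONF.
disable_type() {
    conf=$1
    ext=$2
    # Prefix matching lines with '#'. Lines that are already comments
    # do not match, because '#' comes before the extension.
    sed -E "s/^([[:space:]]*)(${ext}[[:space:]])/\1#\2/" "$conf" > "$conf.tmp" &&
        mv "$conf.tmp" "$conf"
}
```

For example, after running cp /etc/scalpel.conf my-scalpel.conf, the command disable_type my-scalpel.conf fws would disable the .fws definition in the copy.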

In the configuration file, the format of a file definition is, from left to right:

The file’s extension.
Whether the header and footer are case sensitive (y or n).
The minimum and maximum file size you want Scalpel to find.
A standard header that identifies the beginning of the file.
A standard footer that identifies the end of the file.

The footer field is optional. If no footer is provided, then Scalpel extracts the number of bytes you set as the file type’s maximum value.
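
To see how those fields line up in practice, here is a small sketch that prints each active definition in a config file with its fields labeled. The function name list_types is hypothetical, and the sketch assumes fields are separated by whitespace and that comment lines begin with #:

```shell
# list_types CONF
# Hypothetical helper: print every uncommented definition with labeled fields.
list_types() {
    awk '!/^[[:space:]]*#/ && NF >= 4 {
        # The footer is optional, so a definition may have only four fields.
        footer = (NF >= 5) ? $5 : "(none)"
        printf "ext=%s case=%s size=%s header=%s footer=%s\n", $1, $2, $3, $4, footer
    }' "$1"
}
```

Running list_types my-scalpel.conf before a scan is a quick way to confirm which file types are actually enabled.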

You might find that a recovery effort only rescues part of a file, such as this mostly-recovered JPG:

An incomplete JPG file.

This result means that you probably need to increase the maximum size for that file type, and then re-scan, so that the end of the file can be recovered, too:

A repaired JPG file.
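
If you adjust sizes often, the change can be scripted. The following is only a sketch: set_max_size is a hypothetical helper, the new size in the usage note is an arbitrary example, and the rewritten line is rejoined with tabs on the assumption that Scalpel accepts any whitespace between fields:

```shell
# set_max_size CONF EXT SIZE
# Hypothetical helper: replace the size field (third column) of EXT's definition.
set_max_size() {
    awk -v ext="$2" -v size="$3" 'BEGIN { OFS = "\t" }
        $1 == ext { $3 = size }   # only active definitions of EXT are changed
        { print }' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For instance, set_max_size my-scalpel.conf jpg 400000000 doubles the JPG limit shown earlier. Re-run the scan afterward to check whether the whole image is now recovered.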

Defining new file types

First, make a copy of the Scalpel configuration file. If all your users generate similar data, then you may only need one config file for your entire organization. Or, you might find it better to have one config file per department.

To add your own file types to a Scalpel config, start with some investigative forensics.

For text files, you ideally have some predictable structure you can anticipate. For instance, an XML file usually starts with an <?xml declaration, and an HTML file starts with <html and ends with </html>. Binary files are similarly predictable. Using the hexdump command, you can view a typical header from the file type you want to define. Here are the results for an XCF, the default layered graphic file from GIMP:

$ head --bytes 8 example.xcf | hexdump --canonical
00000000  67 69 6d 70 20 78 63 66         |gimp xcf|
00000008

This output is from a Red Hat Enterprise Linux 8 system. On older systems, the short -C option may be necessary instead of --canonical:

$ head --bytes 8 example.xcf | hexdump -C
00000000  67 69 6d 70 20 78 63 66         |gimp xcf|
00000008

The canonical output of hexdump displays the address in the far left column, and the decoded values on the far right. In the center column are the hexadecimal values of the first 8 bytes of the XCF file.

Most binary file definitions in /etc/scalpel.conf look pretty similar to that output, except that each value is prefaced with the \x escape sequence to denote that it is a hexadecimal byte. For instance, a JPG file looks like this in the configuration file:

jpg     y     200000000     \xff\xd8\xff\xe0\x00\x10     \xff\xd9
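
Typing those escapes by hand is error-prone, so here is a minimal sketch that converts raw bytes on standard input into the \x notation. The function name to_scalpel_hex is hypothetical, and od is used instead of hexdump only because its output is easier to post-process:

```shell
# to_scalpel_hex
# Hypothetical helper: read bytes on stdin, print them as \xNN\xNN...
to_scalpel_hex() {
    # od prints the bytes as two-digit hex values; tr strips the separators.
    hex=$(od -An -v -tx1 | tr -d ' \n')
    # Put a literal \x in front of every pair of hex digits.
    printf '%s\n' "$hex" | sed 's/../\\x&/g'
}
```

For example, head --bytes 8 example.xcf | to_scalpel_hex prints \x67\x69\x6d\x70\x20\x78\x63\x66 for the XCF header shown above.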

Compare that value with a test hexdump of the first 6 bytes (because that’s how many bytes scalpel.conf contains in its JPG definition) of any JPG file on your system:

$ head --bytes 6 example.jpg | hexdump --canonical
00000000  ff d8 ff e0 00 10                    |......|
00000006

Compare the footer with the last 2 bytes to match what the config file shows:

$ tail --bytes -2 example.jpg | hexdump --canonical
00000000  ff d9                        |..|
00000002

These values match up, so you can be confident that valid JPG files probably all start and end in a predictable sequence.
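
The same comparison can be automated. This sketch checks whether a file begins and ends with the byte sequences you expect. The names hex_of and matches_markers are hypothetical, and the header and footer are passed as plain hex digits without the \x prefix:

```shell
# hex_of: print stdin as one continuous string of hex digits.
hex_of() {
    od -An -v -tx1 | tr -d ' \n'
}

# matches_markers FILE HEADER_HEX FOOTER_HEX
# Hypothetical helper: succeed only if FILE starts with HEADER_HEX
# and ends with FOOTER_HEX.
matches_markers() {
    file=$1
    header=$2
    footer=$3
    head_len=$(( ${#header} / 2 ))   # two hex digits per byte
    foot_len=$(( ${#footer} / 2 ))
    [ "$(head -c "$head_len" "$file" | hex_of)" = "$header" ] &&
        [ "$(tail -c "$foot_len" "$file" | hex_of)" = "$footer" ]
}
```

For example, matches_markers example.jpg ffd8ffe00010 ffd9 && echo "matches the config" tests a file against the JPG definition from scalpel.conf.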

Note: The Ogg entry in the scalpel.conf file is misleading, as it lacks the \x escape sequence. If you need to recover an Ogg file, fix this, or replace its definition.

Getting to work

Now you need to obtain the same level of confidence for every file type you need to recover (such as XCF, in the previous example). To reiterate, this is your workflow for defining the binary file types common to the victim drive:

Get the hexadecimal values of the first few bytes of a file type using the head --bytes n command.
Get the last few bytes using the tail --bytes -n command.
Repeat this process on several different files of the same type to confirm consistency of this pattern, adjusting the length of your header and footer patterns as required.
Enter the header and footer values into your custom Scalpel config, using the \x notation to identify each byte as a hexadecimal character.

Follow this sequence for each important binary file type you need to recover.
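
The "repeat on several files" step above can be sketched as a short survey. The function name header_survey is hypothetical. It prints each distinct header it finds, preceded by the number of files that share it, so a single output line suggests a consistent pattern:

```shell
# header_survey N FILE...
# Hypothetical helper: tally the first N bytes (as hex) across several files.
header_survey() {
    n=$1
    shift
    for f in "$@"; do
        head -c "$n" "$f" | od -An -v -tx1 | tr -d ' \n'
        echo   # one header per line
    done | sort | uniq -c | sort -rn
}
```

For example, header_survey 8 *.xcf shows whether all of your XCF samples agree on their first 8 bytes. If more than one line appears, try a shorter length until the samples converge.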

If a file is plaintext, provide a common header and footer, such as #!/bin/sh for shell scripts, # (the space after the # is important) for Markdown files with an h1-level title, <?xml for XML files, and so on.
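
Once you have settled on a header, and optionally a footer, appending the definition to your copy of the config can be sketched like this. The function add_type is hypothetical, and the size of 50000 in the usage note is only a placeholder to adjust for your data:

```shell
# add_type CONF EXT CASE_FLAG MAX_SIZE HEADER [FOOTER]
# Hypothetical helper: append one tab-separated definition to CONF.
add_type() {
    conf=$1
    ext=$2
    case_flag=$3
    max_size=$4
    header=$5
    footer=${6:-}
    if [ -n "$footer" ]; then
        printf '%s\t%s\t%s\t%s\t%s\n' "$ext" "$case_flag" "$max_size" "$header" "$footer"
    else
        # The footer is optional, so leave the field out entirely.
        printf '%s\t%s\t%s\t%s\n' "$ext" "$case_flag" "$max_size" "$header"
    fi >> "$conf"
}
```

For example, add_type my-scalpel.conf sh y 50000 '#!/bin/sh' adds a footerless definition for shell scripts. A header that contains a space, such as the Markdown example, may need escaping, so check the comments at the top of your scalpel.conf for the notation it expects.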

When you’re ready to run Scalpel, create a directory where it can place your rescued files:

$ mkdir /run/media/seth/rescuer/scalped

Note: Do not create this directory on the same volume that contains the lost data.
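
A quick pre-flight check can catch that mistake. This is a rough sketch, assuming both paths are on mounted filesystems. The names volume_of and same_volume are hypothetical, and the check only compares the filesystem name that df reports, so treat a match as a warning to investigate and not as proof:

```shell
# volume_of PATH: print the filesystem name that df reports for PATH.
volume_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# same_volume PATH_A PATH_B
# Hypothetical helper: succeed if df reports the same filesystem for both paths.
same_volume() {
    [ "$(volume_of "$1")" = "$(volume_of "$2")" ]
}
```

For example, same_volume /run/media/seth/rescuer/scalped /run/media/seth/victim && echo "Pick a different rescue directory" prints the warning only when both paths appear to share a volume.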

If the victim drive is not yet mounted, mount it, and then run Scalpel:

$ scalpel -c my-scalpel.conf \
  -o /run/media/seth/rescuer/scalped \
  /run/media/seth/victim

You can also run Scalpel on a disk image:

$ scalpel -c my-scalpel.conf \
  -o ~/scalped ~/victim.img

When Scalpel is done, review the files in your designated rescue directory.
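
To get an overview before opening files one by one, a short tally by extension can help. This sketch assumes the rescued files keep the extensions from your config. The function name count_by_ext is hypothetical, and any log file that Scalpel writes into the directory, such as audit.txt, is counted along with the rest:

```shell
# count_by_ext DIR
# Hypothetical helper: count the files under DIR, grouped by extension.
count_by_ext() {
    find "$1" -type f -name '*.*' |
        sed 's/.*\.//' |      # keep only the text after the last dot
        sort | uniq -c | sort -rn
}
```

For example, count_by_ext /run/media/seth/rescuer/scalped lists the most common recovered file types first.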

All in all, it’s best to make backups so you can avoid doing file recovery at all. But, should the worst happen, try Scalpel and carve carefully.

About the author

Seth Kenlon

Linux geek

Seth Kenlon is a Linux geek, open source enthusiast, free culture advocate, and tabletop gamer. Between gigs in the film industry and the tech industry (not necessarily exclusive of one another), he likes to design games and hack on code (also not necessarily exclusive of one another).
