As originally published in root - journal for the UNIX professional (now defunct)
Volume 2 Number 6 - November/December 1990, pp. 24-30

Raw source files: getfree.c textpieces.c

Feature Article

A Sobering Excursion Into Data Recovery
By Tim Eliseo

When system backups are administered well, users tend to develop a feeling of security that their data are safe and that any loss is generally limited to a day's work. This sometimes leads to a slightly careless attitude about accidentally deleting files. When, in this situation, a backup fails, the result is often catastrophe.

This was the situation I was confronted with recently while working on a software development project. Due to momentary confusion with a symbolic link, one of my team managed to delete most of the source tree for the project (he commented that a particular rm seemed to take a suspiciously long time). We contacted system support and were soon informed that a successful backup had not been made for over two months! Our UNIX disk partition was simply a file to the underlying mainframe operating system. Backups were made of this file as a whole, but during the prior two months the file was inaccessible to the backup software, a fact that somehow got overlooked by the administrative staff.

I realized that our only hope of recovery was to scan the raw deleted data on the filesystem. Considering the magnitude of the loss, it was clearly worth a considerable amount of effort to attempt recovery, so I proceeded to develop the utility software described here. There were a number of factors in our favor, allowing us to make some helpful assumptions:

We were operating with the Berkeley Fast File System. Its allocation strategies tend to localize related data and often keep file blocks contiguous. This minimized the pain of chasing segments of files all over the disk.
Like most filesystems, ours was almost completely full. Out of 500 megabytes, 20 were free before the mishap, and about 3 were lost. This restricted the search to about 73 megabytes (23 plus another 50 megabytes for the 10% filesystem "waste").
The only significant files we cared to recover were text files, mostly C source code. It was fairly easy to figure out the file name from the file contents. Non-text files would have been virtually impossible to recover unless one could figure out how to fit the pieces together by context. (For reasons too esoteric to mention, I am using the term non-text instead of binary to refer to data that is not restricted to a stream of variable-length, line-oriented records of ASCII characters.)
Our disk was used mostly for editing and compiling, so most of the free space before the deletion was filled with non-text data. This reduced the amount of text that had to be analyzed manually.
Since our files were maintained under SCCS, we had two chances of recovery for each file. If we couldn't find the s-file, we could settle for the g-file (the file that get creates).
We had two ways to verify what we had recovered. First, the checksums on the s-files allowed us to verify that file fragments had been spliced correctly. Second, we had a current set of object libraries which allowed us to verify that we had found the latest versions.
The affected filesystem was not the root filesystem, therefore it was not necessary to write on it for the operation of the system.
We had enough space on other filesystems to rebuild the lost files and store temporary data.
We detected the mistake almost immediately, so we were able to prevent the free space from being reallocated by halting our own work, restricting logins, and killing any daemons that could possibly write to the filesystem.

A Quick Look at a Berkeley Filesystem

A Berkeley Fast File System consists of a disk partition divided into a number of cylinder groups. Each cylinder group contains a copy of the superblock (a structure containing the parameters of the filesystem), a cylinder group header (a structure containing information specific to each cylinder group), inodes (information about files, sans hierarchical and naming information), a free block map, and data blocks. The dumpfs utility provides a nice summary of the filesystem information and is a good first step when attempting to recover deleted data.

The free block map is maintained as a bit map with each bit representing the availability of one fragment. A block consists of a number of fragments, and a fragment consists of a number of disk sectors (normally 512 bytes each). The allocation of whole blocks and fragments for a file is fairly complex, but suffice it to say that once a file is deleted, this relationship is lost and the deleted data can only be dealt with at the fragment level.

Of course, this description is just the tip of the iceberg. The definitive reference on the Berkeley file system is chapter 7 (while you're at it, read the whole book!) of The Design and Implementation of the 4.3BSD UNIX Operating System by Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman [Addison-Wesley, 1989] or the older paper (contained in the BSD document set) A Fast File System for UNIX by Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry [1983].

Reading the Deleted Data

The first utility I wrote to aid recovery is getfree (shown on pages 27-28). It reads the data from unallocated fragments and writes it to standard output. The syntax is:

getfree

block filesystem

cylinder group #

The filesystem is a file name such as /dev/ra0g, and it is followed by a list of the cylinder group numbers from which you wish to read data. There are two reasons why the program takes a list of cylinder group numbers rather than automatically scanning every group. If you look at the output of dumpfs, you will notice that each cylinder group has a time stamp. This indicates the time of last update of the cylinder group header. Since modifying the free map by deleting files causes a header update, I was assured that we need not bother looking in cylinder groups with time stamps older than the time of the mishap. The other reason is that it was necessary to work on only a few groups at a time due to limited space on the "work" filesystem.

So How Does It Work, Exactly?

Let's take a closer look at getfree. Use of the macros included with <ufs/fs.h> have made the tasks of coding and understanding far easier. Since what we're doing is very similar to the internal algorithms used in the kernel, most of the complex calculations have already been done for us in the macros.

Unfortunately, it's often not clear whether a structure member or macro deals with bytes, device blocks, fragments, allocation blocks, or cylinders. Even more confusing is that just about everything is called a block. We must, therefore, always carefully keep in mind which units we're working in. Often a quick look at the kernel source code can answer a lot of questions (or generate more, depending on how long you look at it).

The first step in getfree is to read the superblock and verify its magic number. The superblock is located at a fixed location in the filesystem. This is fortunate since it must be located before anything else is known about the filesystem. Now in possession of the filesystem parameters, we loop for each cylinder group requested.

The macro cgtod calculates the location of the cylinder group header and free list, expressed in fragments. Using fs_fshift to convert to bytes, we seek to and read the header and free list, verifying the magic number to be sure we've got the right stuff. We then scan the free list for free fragments, reading the data in each one. Note that the location of the data is calculated relative to the beginning of the cylinder group. The cylinder group header and inodes are at a variable offset from the beginning of the cylinder group, with data before and after, so for simplicity the free list covers the entire cylinder group. The cgdmin macro, which would appear to be what we want to use, actually only gives us the location of data after the inodes.

getfree transfers data to stdout without any attempt at interpretation or conversion, but with one important filter. Any fragment consisting entirely of the same repeated byte is not written. Since deleted and unused areas of files are often zero-filled, and disk controllers generally initialize data areas with a repeated byte when formatting (supposedly the "hardest" bit pattern to read, and therefore most likely to cause a format verification error in a marginal disk block), this filtering tends to eliminate a lot of trash, saving space on the recovery filesystem.

Extracting Text

So now I had a big file with our source code in it but also with lots of non-text data. The next utility, textpieces (shown on pages 28-30), scans a file created by getfree for fragments containing data that looks like text and writes each apparently contiguous piece to a separate file, ignoring fragments that appear to be non-text. The syntax is:

textpieces

output file name prefix

The standard input is the output from getfree. The argument is the prefix to be used to name the output files, which will have names of the form prefix.nnn, where nnn is a number starting with 001. A report is written to standard output.

Two heuristics are used to classify the data:

A fragment containing any byte with bit 7 (the high-order bit) set is considered non-text and discarded.
A fragment containing NUL (0) characters within the data, as opposed to only at the end of the fragment, is discarded.

The "good" fragments are written out to a series of numbered files. It is assumed that NULs at the end of a fragment represent the remaining space in the fragment past the end of a file. As long as there are contiguous fragments with no bit 7s set and no NULs, data are written to the same output file. The string /*~~~*/ is written between each fragment for easy identification of the boundaries.

When a fragment with trailing NULs is encountered, the fragment is written without the trailing NULs and the output file is switched to the next file. Files ending this way are generally the ones that contain a complete recovery of a deleted file.

When a fragment to be discarded is encountered, any current output file is reported as "partial," and the output is switched to the next file. It is possible, however unlikely, that this was a file that just happened to end on a fragment boundary. It is more likely that a piece of this file must be found elsewhere.

For our circumstances, the criteria for selecting text and non-text data worked very well. As a precaution against missing something, I reported what was found in each discarded fragment and looked over the report manually for questionable entries. As it turned out, all of the discarded fragments were clearly non-text data; a great majority were composed of either mostly NULs or had bit 7 set on approximately half of the bytes.

Reconstructing The Lost Files

The next and final step in the process was also the most tedious and time consuming. Each of the hundreds of files created by textpieces had to be examined, identified, split (if it contained parts from different files), combined with other incomplete parts (if it did not contain an entire file), and stripped of the /*~~~*/ strings between connecting parts.

In our situation, we were attempting to recover about 3 megabytes of data contained in files ranging in size around 5-20K each. Approximately 70% of the files were each wholly contained in one recovery file. The remaining files had to be spliced together from pieces in two or more recovery files. Our intimate familiarity with the files allowed us to match up parts fairly easily since a small piece was all we needed in order to identify a file. If we had been relatively unfamiliar with the files, this task would have been far more difficult (in the worst case requiring n! attempted matches). Often we ended up with two or three versions of the same file and had to decide which was the most current, although with SCCS files this wasn't a problem.

Conclusion

In the end we recovered all of our source code (even the SCCS s-files), which seemed to us virtually miraculous. All we ended up losing were a couple of Makefiles. For us, the attempt at recovery was well worth the effort.

At several points I had contemplated writing a more interactive utility which would allow the user to split up, tag, and combine pieces easily, possibly while reading directly from the damaged filesystem rather than requiring a copy on the recovery filesystem, but I concluded that nothing short of a major software project was a match for these rudimentary utilities, combined with a good text editor and other UNIX utilities.

textpieces could easily be modified to select different types of data, such as allowing some percentage of the fragments to have bit 7 set (common when an extended character set is used), or look for patterns present in a certain type of binary-encoded and/or fixed-length-record format. If a fixed record length was relatively prime to the fragment size, textpieces could also help to splice, or at least classify, pieces.

Although no one expects this sort of situation to happen to them, it likely will to many at some time. The experience taught us a lot about UNIX file system structure, disk layout policies, and usage patterns. Other than taking more personal responsibility for backups, there is little way to prepare for such a catastrophe. Masochists are, of course, encouraged to intentionally delete their own files for practice.

Tim Eliseo is a consultant who specializes in UNIX system software design and implementation; he is a certified MS-DOS hater. He can be reached at:

4470 Plantation Drive
Fair Oaks, CA 95628-5639
(916) 966-0509
[old email - no longer valid]

#ifndef lint
static char sccsID[] = "@(#)getfree.c 1.3 90/10/25 22:05:42";
#endif

/* Read data from BSD filesystem free blocks

   Copyright (c) 1988, 1990 Timothy R. Eliseo.  All rights reserved.
   Permission is granted to use but not to sell this software provided that the
   above message is not removed and that any modification is noted.
*/

#include <sys/types.h>
#include <sys/param.h>
#include <stdio.h>
#include <ufs/fs.h>	/* <sys/fs.h> on some systems */
#include <sys/file.h>
#include <malloc.h>

#ifndef offsetof	/* ANSI C would define this in <stddef.h> */
#define offsetof(s_name, m_name)	(size_t)&(((s_name *)0)->m_name)
#endif

main(argc, argv)
char *argv[];
{
    int fsfd;		/* file descriptor of the filesystem */
    struct fs fs;	/* filesystem header (superblock) */
    struct cg cg;	/* cylinder group header */
    u_char *freelist,	/* pointer to cg free list */
      *dp;
    static u_char data[MAXBSIZE];	/* a block of data */

    if (argc < 2) {
	fprintf(stderr, "Usage: %s <block fs> <cg #> ...\n", argv[0]);
	exit(1);
    }

    /* open the filesystem */
    if ((fsfd = open(argv[1], O_RDONLY)) < 0) {
	perror(argv[1]);
	exit(2);
    }

    /* read the superblock */
    lseek(fsfd, SBLOCK * DEV_BSIZE, L_SET);
    read(fsfd, &fs, sizeof fs);
    if (fs.fs_magic != FS_MAGIC) {
	fprintf(stderr, "Invalid magic number %lx in superblock\n",
	  fs.fs_magic);
	exit(3);
    }

    /* allocate a buffer for the maximum size of a cylinder group free list */
    freelist = (u_char *)malloc(fs.fs_bsize - offsetof(struct cg, cg_free[0]));

    argc--;	/* past the program name */
    argv++;

    /* loop for each cylinder group # on the command line */
    while (--argc) {
	int cgn;
	long cgs, frn;

	cgn = atoi(*++argv);	/* cylinder group number */
	cgs = cgbase(&fs, cgn);	/* the base of the cylinder group in frags */

	/* read the cg header */
	lseek(fsfd, (long)cgtod(&fs, cgn) << fs.fs_fshift, L_SET);
	read(fsfd, &cg, offsetof(struct cg, cg_free[0]));
	if (cg.cg_magic != CG_MAGIC) {
	    fprintf(stderr, "Invalid magic number %lx in cylinder group %d\n",
	      cg.cg_magic, cgn);
	    exit(3);
	}

	/* read the free list, located right after the fixed size header info */
	read(fsfd, freelist, (cg.cg_ndblk + 7) / 8);

	/* loop for each free block */
	for (frn = 0; frn < cg.cg_ndblk; frn++) {
	    if (freelist[frn / NBBY] >> frn % NBBY & 1) {  /* block free? */

		/* read the block's data */
		lseek(fsfd, (long)(cgs + frn) << fs.fs_fshift, L_SET);
		read(fsfd, data, fs.fs_fsize);

	      /* write the data only if it's not filled with a repeated byte */
		for (dp = data + 1; dp < data + fs.fs_fsize; dp++)
		    if (data[0] != *dp)
			break;
		if (dp != data + fs.fs_fsize)
		    write(1, data, fs.fs_fsize);
	    }	/* if free */
	}	/* for each block */
    }	/* for each cg */
    return 0;
}

#ifndef lint
static char sccsID[] = "@(#)textpieces.c 1.4 90/10/25 22:08:17";
#endif

/* Break getfree output into continuous-text pieces and discard non-text data

   Copyright (c) 1988, 1990 Timothy R. Eliseo.  All rights reserved.
   Permission is granted to use but not to sell this software provided that the
   above message is not removed and that any modification is noted.
*/

#include <stdio.h>
#include <strings.h>

/* The size of a fragment is defined here to avoid cluttering this program
   with filesystem dependencies.  It is shown by dumpfs as "fsize" */
#define FG_SIZE 1024

int isopen = 0;
FILE *outf;
int fragno = 0, firstfrag;
char outfn[200],	/* used to construct an output filename */
  *endpos,		/* points past the prefix */
  newfn[200];		/* used to rename the file to a "partial" */

main(argc, argv)
char *argv[];
{
    unsigned char buf[FG_SIZE], *bp;
    int nnuls, curnuls, ngrps, nbit7s, curfn = 0;

    if (argc != 2) {
	fprintf(stderr, "Usage: %s <prefix>\n", argv[0]);
	exit(1);
    }

    strcpy(outfn, argv[1]);
    strcpy(newfn, argv[1]);
    endpos = outfn + strlen(outfn);

    while (fread(buf, FG_SIZE, 1, stdin)) {
	nnuls = ngrps = nbit7s = curnuls = 0;

	/* check each byte in the fragment */
	for (bp = buf; bp < buf + FG_SIZE; bp++) {
	    if (*bp & 0x80)
		nbit7s++;
	    if (!*bp) {
		/* scan for the first non-NUL */
		while (curnuls++, ++bp < buf + FG_SIZE) {
		    if (*bp) {
			nnuls += curnuls;
			curnuls = 0;
			ngrps++;
			break;
		    }
		}
		bp--;	/* since the outer loop will increment bp */
	    }
	}
	if (nnuls || nbit7s) {		/* any embedded NULs or high bits? */
	    close_partial();
	    printf(
	      "Fragment %d rejected: %d NULs in %d groups (%d at end), %d high bits\n",
	      fragno, nnuls, ngrps, curnuls, nbit7s);
	} else if (isopen || curnuls != FG_SIZE) {
	    /* if the data is good and it's not just another NUL-filled
	       fragment at the end of a block */
	    if (isopen) {
		fputs("/*~~~*/", outf);	/* write a separator string */
	    } else {
		/* start a new file */
		firstfrag = fragno;
		sprintf(endpos, ".%03d", ++curfn);
		if (!(outf = fopen(outfn, "w"))) {
		    perror(outfn);
		    exit(2);
		}
		isopen = 1;
	    }

	    /* write all but any trailing NULs */
	    if (fwrite(buf, FG_SIZE - curnuls, 1, outf) != 1) {
		perror(outfn);
		exit(2);
	    }
	    if (curnuls) {
		/* if there were trailing NULs, close the current file and
		   report it */
		fclose(outf);
		isopen = 0;
		printf("File %s: fragments %d-%d (%d NULs)\n", outfn,
		  firstfrag, fragno, curnuls);
	    }
	}	/* if good data */
	fragno++;
    }	/* while read not eof */
    close_partial();
    return 0;
}

/* close a file of "partial" data, rename it to end in "p", and report */
close_partial()
{
    if (isopen) {
	fclose(outf);
	isopen = 0;
	strcpy(newfn + (endpos - outfn), endpos);
	strcat(newfn, "p");
	rename(outfn, newfn);
	printf("File %s: fragments %d-%d (partial)\n", newfn, firstfrag,
	  fragno - 1);
    }
}

Back to my home page