A while ago, my NAS bit the dust with a bad boot flash. Swapping in a new flash drive and getting the system back online was fairly simple thanks to incremental backups to the storage array, but the failure got me thinking about why the drive had died. Flash lifespan is closely tied to how much it gets written to and in what patterns, so an analysis of who’s writing where should help figure out whether anything on my system is being a bad citizen.

There are a few potential options. Plenty of tools will tell you about disk I/O, but breaking it down by process is the tricky part. Since there’s presumably all kinds of system activity, raw disk I/O numbers would be a pain to work with. Linux kernels (>= 2.6.20) have decent explicit support for per-process I/O accounting, with userland tools like iotop to take advantage of it. Unfortunately, it doesn’t appear that there’s a BSD equivalent. top -m io shows up in searches, but it doesn’t distinguish I/O by mountpoint, which makes it of questionable use on a busy system with a lot of drives.

The lack of purpose-built options leaves us with DTrace. For the uninitiated, DTrace is a dynamic tracing framework that allows arguments, return values and thread/process contexts to be collected at the entry and exit points of syscalls and kernel functions. It’s very neat and gets a lot of use in my day job. For a starting point I’d recommend Brendan Gregg’s excellent book, although it does focus more on Solaris/Illumos than the BSDs.

There are some existing DTrace scripts designed for analyzing I/O. However, the ones I’ve tried don’t work on FreeBSD due to unported features (e.g. the fsinfo provider or the fds[] array) or differences in kernel structures. For example, some versions of rwsnoop fail outright on my system (assignment errors like "fd_ofiles is not a member of struct filedesc", caused by changed kernel structures), and others, like the one in sysutils/DTraceToolkit, simply remove the features I want, presumably because they no longer work:

Ports DTraceToolkit rwsnoop:

        printf("%5d %6d %-12.12s %1s %7d \n",
            uid, pid, execname, self->rw, (int)self->size);

Original Brendan Gregg DTraceToolkit rwsnoop:

	printf("%5d %6d %-12.12s %1s %7d %s\n",
	    uid, pid, execname, self->rw, (int)self->size, self->vpath);

Super-naive dtracing won’t get us very far. write(2) takes an FD rather than a path, and returns the number of bytes written. Neither of these (nor the other argument, the number of bytes to write) says anything on its own about where, in a filesystem sense, the write is destined, so it’s not as simple as writing a probe that prints an argument or return value. To get useful information about where a write is headed and which process it’s coming from, examining kernel internals is a must.
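To see the problem concretely, here’s about the most naive probe possible (a sketch; for write(2), arg0 is the fd and arg2 is the requested byte count):

```d
/* Naive tracing: we can see the fd and byte count, but nothing
 * about which filesystem or device the write is destined for. */
syscall::write:entry
{
        printf("%s (pid %d) wrote %d bytes to fd %d\n",
            execname, pid, arg2, arg0);
}
```

An fd like 3 means nothing on its own without the per-process file table that maps it to an open file, which is exactly what the kernel spelunking below recovers.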

Happily, this isn’t the first time a problem like this has been tackled. In this thread, a hacker named Sergey details his partially successful approach to mapping an FD to a path name. This would definitely work in my case - all I really need is some indicator of the filesystem being written to, since I don’t care where on the device the write landed, only which device. Another person built on Sergey’s work in this forum thread, noting that you can get at the struct statfs through the vnode, by way of the thread struct DTrace has access to in the probe’s context.

Bad luck, though: none of these snippets work on my FreeBSD 11.1-RELEASE system. It looks like the kernel structures have shifted since FreeBSD 8. This necessitates taking a stroll through the kernel code to figure out the path from our active thread struct to per-vnode filesystem info. Given the deeply nested nature of kernel structures, this can go on for a while, and some familiarity with big C codebases helps a lot. What follows are snippets from each definition in the chain, with certain long ones trimmed for clarity. Tools like OpenGrok or cscope are super helpful for this sort of navigation.

thread https://github.com/freebsd/freebsd/blob/master/sys/sys/proc.h#L223

/*
 * Kernel runnable context (thread).
 * This is what is put to sleep and reactivated.
 * Thread context.  Processes may have multiple threads.
 */
struct thread {
	struct mtx	*volatile td_lock; /* replaces sched lock */
	struct proc *td_proc; /* (*) Associated process. */ <===

proc https://github.com/freebsd/freebsd/blob/master/sys/sys/proc.h#L558

/*
 * Process structure.
 */
struct proc {
	LIST_ENTRY(proc) p_list;	/* (d) List of all processes. */
	TAILQ_HEAD(, thread) p_threads;	/* (c) all threads. */
	struct mtx	p_slock;	/* process spin lock */
	struct ucred	*p_ucred;	/* (c) Process owner's identity. */
	struct filedesc *p_fd; /* (b) Open files. */ <===

filedesc https://github.com/freebsd/freebsd/blob/master/sys/sys/filedesc.h#L79

struct filedesc {
	struct	fdescenttbl *fd_files; /* open files table */ <===

fdescenttbl https://github.com/freebsd/freebsd/blob/master/sys/sys/filedesc.h#L67

struct fdescenttbl {
	int	fdt_nfiles;		/* number of open files allocated */
	struct	filedescent fdt_ofiles[0];	/* open files */ <===
};

filedescent https://github.com/freebsd/freebsd/blob/master/sys/sys/filedesc.h#L55

struct filedescent {
	struct file	*fde_file;	/* file structure for open file */ <===
	struct filecaps	 fde_caps;	/* per-descriptor rights */
	uint8_t		 fde_flags;	/* per-process open file flags */
	seq_t		 fde_seq;	/* keep file and caps in sync */
};

file https://github.com/freebsd/freebsd/blob/master/sys/sys/file.h#L170

struct file {
	void		*f_data;	/* file descriptor specific data */
	struct fileops	*f_ops;		/* File operations */
	struct ucred	*f_cred;	/* associated credentials. */
	struct vnode *f_vnode; /* NULL or applicable vnode */ <===

vnode https://github.com/freebsd/freebsd/blob/master/sys/sys/vnode.h#L100

struct vnode {
	...

	/*
	 * Filesystem instance stuff
	 */
	struct	mount *v_mount; /* u ptr to vfs we are in */ <===

mount https://github.com/freebsd/freebsd/blob/master/sys/sys/mount.h#L189

struct mount {
	struct statfs mnt_stat; /* cache of filesystem stats */ <===

statfs https://github.com/freebsd/freebsd/blob/master/sys/sys/mount.h#L73

struct statfs {
	...
	char	  f_fstypename[MFSNAMELEN]; /* filesystem type name */    <===
	char	  f_mntfromname[MNAMELEN];  /* mounted filesystem */      <===
	char	  f_mntonname[MNAMELEN]; /* directory on which mounted */ <===

From the above, I threw together the first working version:

syscall::write:entry
/ execname == "python2.7" /
{
        /* struct filedescent */
        this->filedescent = curthread->td_proc->p_fd->fd_files->fdt_ofiles[arg0];
        /* struct file */
        this->file = this->filedescent.fde_file;
        /* struct vnode */
        this->vp = this->file ? this->file->f_vnode : 0;
        /* struct mount */
        this->mount = this->vp ? this->vp->v_mount : 0;
        /* mnt_stat is a struct statfs */
        this->fi_fs = this->mount ? stringof(this->mount->mnt_stat.f_fstypename) : "<unknown>";
        this->fi_mount = this->mount ? stringof(this->mount->mnt_stat.f_mntonname) : "<unknown>";
        this->fi_mountfrom = this->mount ? stringof(this->mount->mnt_stat.f_mntfromname) : "<unknown>";

        printf("%s %d %s %s %s\n", execname, pid, this->fi_fs, this->fi_mount, this->fi_mountfrom);
}

Initially, I’m filtering for my python process so I don’t blow up my shell with every write on the system.

>>> f = open("foo", "w+")
>>> f.write("bar")
>>> f.close()
  1  63227                      write:entry python2.7 58053 ufs / /dev/da0s1a

Cool!

Does it work correctly on writes to other filesystems?

  3  63227                      write:entry python2.7 58509 zfs /media media

Yep. Now that we’ve got the fundamentals, all that’s left is cleaning it up and doing some data collection.

Revised:

syscall::write:entry
{
        /* struct filedescent */
        this->filedescent = curthread->td_proc->p_fd->fd_files->fdt_ofiles[arg0];
        /* struct file */
        this->file = this->filedescent.fde_file;
        /* struct vnode */
        this->vp = this->file ? this->file->f_vnode : 0;
        /* struct mount */
        this->mount = this->vp ? this->vp->v_mount : 0;
        /* mnt_stat is a struct statfs */
        self->fi_fs = this->mount ? stringof(this->mount->mnt_stat.f_fstypename) : "<unknown>";
        self->fi_mount = this->mount ? stringof(this->mount->mnt_stat.f_mntonname) : "<unknown>";
        self->fi_mountfrom = this->mount ? stringof(this->mount->mnt_stat.f_mntfromname) : "<unknown>";
}

syscall::write:return
/ self->fi_mount == "/" /
{
        @writes[execname] = count();

        printf("%s %d %s %s %s %d\n", execname, pid, self->fi_fs, self->fi_mount, self->fi_mountfrom, arg1);
        self->fi_fs = 0;
        self->fi_mount = 0;
        self->fi_mountfrom = 0;
}

Since DTrace doesn’t have conditional flow control quite like C does, I had to put the filtering logic in a predicate on a return probe. This way, we only analyze writes to mountpoint /, which is the flash device. An added bonus of this structure is that the return probe can also capture the write size (arg1). The filesystem strings became thread-local (self->) so they survive from the entry probe to the return probe, while the intermediate pointers stayed clause-local (this->) to minimize probe impact.
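The @writes aggregation prints automatically when the script exits, but it can also be dumped explicitly with a custom format via an END clause. A small optional sketch:

```d
/* Print the per-process write counts on exit; %@d formats the
 * aggregation value for each key. */
END
{
        printa("%-32s %@8d\n", @writes);
}
```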

After running it for a while during a period of average system operation, these were the results:

  winbindd                                                          4
  bash                                                              9
  dd                                                               11
  smbd                                                             24
  transmission-daemon                                             348
  nmbd                                                           7492

When I was setting everything back up I noted that although the actual data lives on the zpool, transmission’s homedir is on /. I figured that it might be writing resume data, stats and freshly added torrents, so I expected moderate write activity. These results confirmed that, and I’m definitely relocating transmission.

The surprising part is the nmbd activity. This is a NetBIOS name server that’s presumably part of my Samba stack. This activity is a lot more concerning and makes me quite glad I decided to look into write activity.

Checking the saved printf output, it looks like there’s a constant drumbeat of small writes:

  1  63228                     write:return nmbd 65947 ufs / /dev/da0s1a 205
  0  63228                     write:return nmbd 65948 ufs / /dev/da0s1a 205
  2  63228                     write:return nmbd 65949 ufs / /dev/da0s1a 205
  2  63228                     write:return nmbd 65950 ufs / /dev/da0s1a 205
  2  63228                     write:return nmbd 65951 ufs / /dev/da0s1a 205

These happen every couple of seconds, always from a new PID. Since the system is fairly quiet, this manifests as the PID increasing monotonically most of the time. A quick glance at top confirms that this is actually happening and not some kind of strange artifact of the tracing. Incidentally, I guess that’s why this system runs through PIDs fairly quickly.

Now, I need to have a good look at my Samba config and see if I can figure out what exactly is causing this. I have no reason to believe that current system behavior significantly differs from what it was like pre-recovery, so given these results I’m not at all surprised that the first boot flash didn’t last as long as I had originally predicted.

I hope this post was illustrative of some of the utility and fun that can be had with DTrace. Getting a look into what’s actually happening on a live system at a level that’s deeper and more customizable than the usual crop of system monitoring tools can be an enlightening experience.


Addendum: it eventually came to my attention that a better way of doing this may be to use the vfs provider. There’s a script that does this on the FreeBSD forums that almost works out of the box on my system - the path translation is fairly borked, but with some tweaking I could probably have gotten my desired data out of this with less effort than my original approach. Oh well: this process was still instructive and now I’ve got another fun tool to play with.
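The same idea can also be expressed against the VOP layer rather than the syscall layer, which sidesteps the fd-to-vnode mapping entirely since the vnode arrives as a probe argument. The sketch below is not the forum script; it uses an fbt hook on VOP_WRITE_APV, and the probe name and argument layout are assumptions that vary by release, so treat it as an untested illustration:

```d
/* Hypothetical VOP-layer version: args[1] is assumed to be the
 * struct vop_write_args, whose a_vp member is the target vnode. */
fbt::VOP_WRITE_APV:entry
{
        this->vp = args[1]->a_vp;
        @writes[execname, this->vp->v_mount ?
            stringof(this->vp->v_mount->mnt_stat.f_mntonname) :
            "<unknown>"] = count();
}
```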