Sys Army Knife – What’s in list x but not list y?

It’s time once again to pull out your sys army knife and explore how to best use some of the tools available to system administrators out there! These “sys army knife” posts explore how to use common Linux/Unix command line tools to accomplish tasks that system administrators may encounter day-to-day.

I’m regularly involved in large-scale data center migration projects, so I quite commonly have to look at two different lists of things and figure out which entries are unique to each list.

For instance, I might have a list of machines that we’re planning to migrate. If someone gives me an updated list of machines in the data center, I have to figure out whether there are machines we no longer have to migrate, or new machines we have to plan for.

Sysadmins do it with one line

If each of your lists contains only unique values, this task can be done with a simple one-liner, like this:

cat file1 file2 file2 | sort | uniq -u

For example, let’s say that I have two lists. The first list is in a file named x and looks like this:

appserver01
appserver02
dbserver01
webserver01
webserver02
webserver03

The second list is in a file named y and looks like this:

appserver02
appserver03
dbserver01
webserver01
webserver03
webserver04

This shows the lines unique to x:

$ cat x y y | sort | uniq -u
appserver01
webserver02

… and this shows the lines unique to y:

$ cat y x x | sort | uniq -u
appserver03
webserver04

How does it work?!?

The commands above take one copy of one file and two copies of a second file, sort the combined lines, and then print only the lines that occur exactly once.

You start with one copy of the first file, which means you have one copy of every line in that file. Then you add two copies of the second file. This means that you will have three copies of any line that is in both files, and two copies of any line that only occurs in the second file, but you’ll still only have one copy of any line that only exists in the first file. Thus, if you search for lines that only occur once in the final results, you’ll only find lines that are unique to the first file.

Here’s a little more detail:

The first part of the command (cat file1 file2 file2) concatenates together one copy of file1 and two copies of file2 and spits that out.

We then take the output of that cat command and pipe (‘|’) it to the sort command, which will produce a sorted copy of the data it receives. We need to do this because the next command we use only compares adjacent lines, so it won’t produce correct results unless identical lines sit next to each other in its input.

Finally, we pipe the sort output to the ‘uniq’ command. The ‘-u’ option to the uniq command tells it to only print unique lines (i.e., lines that only exist once).
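To see the counting at work, it can help to swap ‘uniq -u’ for ‘uniq -c’, which prefixes every line with the number of times it occurs. The following is only a sketch: it recreates the sample lists x and y in a scratch directory (the use of mktemp and the variable names workdir and only_in_x are my assumptions, not part of the original one-liner).

```shell
# Work in a scratch directory so no existing files named x or y are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample lists.
printf '%s\n' appserver01 appserver02 dbserver01 webserver01 webserver02 webserver03 > x
printf '%s\n' appserver02 appserver03 dbserver01 webserver01 webserver03 webserver04 > y

# uniq -c shows how often each line occurs in the combined input:
# 1 = only in x, 2 = only in y, 3 = in both files.
cat x y y | sort | uniq -c

# uniq -u keeps just the lines with a count of 1, i.e. the lines unique to x.
only_in_x=$(cat x y y | sort | uniq -u)
printf '%s\n' "$only_in_x"
```

The lines that ‘uniq -c’ reports with a count of 1 are exactly the ones that ‘uniq -u’ prints.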

There can be only one…

You may encounter situations where the contents of your lists have duplicate values. If you have no duplicate values in file1, but duplicate values in file2, the command chain will still work as expected. However, if you have duplicate values in file1, all of those values will be dropped from the output even if they only exist in file1. This is because the ‘uniq -u’ command prints only the lines that occur exactly once in the combined input, and a line that appears twice in file1 no longer qualifies.

The quick and easy way around this is to simply create a copy of file1 that removes any duplicates before starting:

sort -u file1 > file1.nodupes

Then use that file without the duplicates in the command chain:

cat file1.nodupes file2 file2 | sort | uniq -u
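If you would rather not leave a temporary file lying around, the two steps can probably be folded into a single pipeline by letting cat read the de-duplicated list from standard input (‘-’). This is a sketch under my own assumptions: the function name unique_to_first and the sample data are hypothetical, chosen only to show the difference.

```shell
# unique_to_first is a hypothetical helper name, not a standard command.
# $1 = the list that may contain duplicates, $2 = the list to compare against.
unique_to_first() {
    # sort -u removes duplicates from the first list; "cat -" reads that
    # result from stdin and appends two copies of the second list.
    sort -u "$1" | cat - "$2" "$2" | sort | uniq -u
}

# Made-up sample data: "alpha" appears twice in file1 and nowhere in file2.
workdir=$(mktemp -d)
printf '%s\n' alpha alpha beta gamma > "$workdir/file1"
printf '%s\n' beta delta delta > "$workdir/file2"

# Without removing duplicates first, "alpha" is missed.
without_fix=$(cat "$workdir/file1" "$workdir/file2" "$workdir/file2" | sort | uniq -u)
echo "Without removing duplicates:"
printf '%s\n' "$without_fix"

# With duplicates removed, both "alpha" and "gamma" are reported.
with_fix=$(unique_to_first "$workdir/file1" "$workdir/file2")
echo "With duplicates removed:"
printf '%s\n' "$with_fix"
```

The separate file1.nodupes approach is still handy when you want to reuse the cleaned list for several comparisons.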

The beauty of it all

This may seem like an esoteric problem that you’re not likely to encounter very often, but you might be surprised how often this problem comes up. Here are just a few examples off the top of my head:

  • Find files that are unique between two servers
  • Find installed packages that are unique between two servers
  • Use old and new server lists to figure out which servers are gone and which are new

These commands are all very simple standard commands that exist on pretty much any Unix or Linux system out there. I started using these commands way back in the late ’80s on a MicroVAX II running ULTRIX and have since used them on multiple versions of AIX, BSD, HP-UX, IRIX, Linux, and SunOS/Solaris.
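As a rough illustration of the first example in the list above, here is a sketch that compares file listings. Two scratch directories stand in for the same path on two servers. The directory names, file names, and the use of find and mktemp are all my assumptions; in practice you would generate each listing on its own server and copy the two list files to one place.

```shell
# Two scratch directories stand in for the same path on two servers.
workdir=$(mktemp -d)
mkdir "$workdir/server1" "$workdir/server2"
touch "$workdir/server1/httpd.conf" "$workdir/server1/legacy.conf"
touch "$workdir/server2/httpd.conf" "$workdir/server2/new.conf"

# Build one list of relative paths per "server". Running find from inside
# each directory keeps the paths comparable between the two lists.
(cd "$workdir/server1" && find . -type f) > "$workdir/files1"
(cd "$workdir/server2" && find . -type f) > "$workdir/files2"

# Same one-liner as before, once in each direction.
only_on_server1=$(cat "$workdir/files1" "$workdir/files2" "$workdir/files2" | sort | uniq -u)
only_on_server2=$(cat "$workdir/files2" "$workdir/files1" "$workdir/files1" | sort | uniq -u)

echo "Only on server1:"
printf '%s\n' "$only_on_server1"
echo "Only on server2:"
printf '%s\n' "$only_on_server2"
```

Note that this only compares file names, not file contents.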
