Unix: How to find files lacking certain strings
So, I’m working on a convoluted web site, and a problem comes up. It seems that some vitally important code was not included in some pages (for the sake of argument, let’s say it’s a copyright string). This particular site has an ungodly mix of files, including .htm, .html, and .jsp files. Some of the .jsp files are actual pages, and others are stubs to be included in other .jsp pages. The majority of the full .jsp pages include a “footer.jsp” that has the desired string, so they’re good. But I need to generate a list of the full pages, of whatever sort, that lack this string.
The inverse of this problem is easy, and is the kind of thing I use all the time:
find . \( -name \*.htm -o -name \*.html -o -name \*.jsp \) -exec grep -il "myString" {} \;
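The grouping of the -name tests matters in a command like this: find's implicit AND binds tighter than -o, so without escaped parentheses an -exec attaches only to the last -name. Here is a minimal sketch of the difference, using a scratch directory and made-up file names:

```shell
# Scratch directory with three hypothetical pages that all contain the string.
demo=$(mktemp -d)
cd "$demo" || exit 1
for f in a.htm b.html c.jsp; do
  echo "myString" > "$f"
done

# Ungrouped: -exec is ANDed only with the last -name test,
# so grep runs on the .jsp file alone.
ungrouped=$(find . -name '*.htm' -o -name '*.html' -o -name '*.jsp' -exec grep -il "myString" {} \; | sort)

# Grouped: the escaped parentheses make -exec apply to all three extensions.
grouped=$(find . \( -name '*.htm' -o -name '*.html' -o -name '*.jsp' \) -exec grep -il "myString" {} \; | sort)

printf 'ungrouped:\n%s\n' "$ungrouped"
printf 'grouped:\n%s\n' "$grouped"
```

When find's output is piped to xargs instead, there is no explicit action, and the default -print applies to the whole expression, so the parentheses are not needed there.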
Initially, I thought using the -v flag to grep would work for me, but grep -vl returns practically every file it sees. The -v flag inverts the match line by line, not file by file, so -l lists any file that has at least one line lacking the string. Then there’s the problem that I need to match “full” pages rather than included .jsp stubs.
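To make the -v behaviour concrete, here is a small sketch with two made-up files, one that contains the string and one that does not. Both end up in the -vl listing, because each has at least one line without the string.

```shell
# Two hypothetical files in a scratch directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '<p>body</p>\nmyString\n' > has.html
printf '<p>body</p>\n<p>more</p>\n' > lacks.html

# -v alone prints the lines that do NOT contain the string, from both files.
grep -v "myString" has.html lacks.html

# Adding -l lists every file with at least one such line, which is both of them.
listed=$(grep -vl "myString" has.html lacks.html)
echo "$listed"
```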
So here’s how the Mighty Power of Unix came to my rescue:
find . -name \*.htm -o -name \*.html -o -name \*.jsp | xargs grep -il "</html>" | sort -u > full_pages.txt
provides me with a list of pages that are not mere inclusions, if you accept my assumption that an inclusion won’t match the closing HTML tag.
Then I generate a list of full pages that contain the magic string or include the footer.jsp that would contain it:
find . -name \*.htm -o -name \*.html -o -name \*.jsp | xargs grep -il "</html>" | xargs grep -le "uniqueCopyrightTag\|footer\.jsp" | sort -u > pages_with_string.txt
Then I compare the two sorted lists to find out which full pages lack both the magic string and the include (comm -13 prints only the lines unique to the second file):
comm -13 pages_with_string.txt full_pages.txt
Wow. There it is!
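The three steps can be tried end to end in a scratch directory. The sketch below uses four made-up files and its own names for the intermediate lists. It calls comm with -13 so that only lines unique to the second file are printed. Treat it as an illustration of the approach, not the exact commands run on the real site.

```shell
# Scratch tree with four hypothetical files.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '<html>uniqueCopyrightTag</html>\n' > tagged.html
printf '<html><jsp:include page="footer.jsp"/></html>\n' > with_footer.jsp
printf '<html>no notice here</html>\n' > missing.htm
printf '<div>stub fragment</div>\n' > stub.jsp

# Step 1: full pages, i.e. files that contain a closing </html> tag.
find . -name '*.htm' -o -name '*.html' -o -name '*.jsp' \
  | xargs grep -il "</html>" | sort -u > full_pages.txt

# Step 2: the subset of full pages that carry the tag or include the footer.
xargs grep -l -e "uniqueCopyrightTag" -e "footer\.jsp" < full_pages.txt \
  | sort -u > pages_with_string.txt

# Step 3: -1 and -3 suppress "only in first file" and "in both",
# leaving the lines found only in full_pages.txt.
missing=$(comm -13 pages_with_string.txt full_pages.txt)
echo "$missing"
```

Both lists go through sort first because comm expects sorted input. This plain xargs form also assumes file names without spaces or quotes.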
I bet there’s an easier way. Post an example in the comments if you know of one!
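One possibly simpler route, offered as a sketch and not a definitive answer: GNU and BSD grep have a -L option (--files-without-match in GNU grep) that lists the files in which no line matches. That is the per-file inverse that -v does not give. It is not specified by POSIX, so its availability on a given system is an assumption. The file names below are made up for the demonstration.

```shell
# Scratch tree with four hypothetical files.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '<html>uniqueCopyrightTag</html>\n' > tagged.html
printf '<html><jsp:include page="footer.jsp"/></html>\n' > with_footer.jsp
printf '<html>no notice here</html>\n' > missing.htm
printf '<div>stub fragment</div>\n' > stub.jsp

# The first grep keeps full pages; the second, with -L, keeps those that
# contain neither the tag nor the footer include.
lacking=$(find . \( -name '*.htm' -o -name '*.html' -o -name '*.jsp' \) -print \
  | xargs grep -il "</html>" \
  | xargs grep -L -e "uniqueCopyrightTag" -e "footer\.jsp" \
  | sort)
echo "$lacking"
```

This does the job in one pipeline, with no intermediate files and no comm step.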
NOTE: All commands are on a single line, regardless of whether they wrap in this particular display.