Iterating a directory in command line Tika

Apache Tika is best used as a library to wrap your own code around. Its GUI application is a toy, and its command line version isn’t all that great either. The command line can be improved with a little scripting, though.

If you run Tika with a list of files for arguments, through a wild card or explicitly, like this:

 java -jar tika-app.jar -d *

You’ll get something like this:

application/java-archive
application/x-sh
application/x-sh

That’s not useful, since it doesn’t tell you which file goes with which output. To make things worse, if the file names have spaces, Tika will get confused.

I’ve come up with a script that will iterate Tika over the files in a directory and identify them, skipping over subdirectories. Various improvements and changes are possible, but this is working well enough to add to the Tika segment of my file format tools course, letting the students do a lot more than they’d be able to otherwise.

The best way I’ve found to echo the arguments cleanly to Tika is to build a command and pass it to a subordinate shell. This means significant overhead, and hitting control-C generally gets you out only from the subordinate shell, so the loop continues unless you control-C a lot of times. Other approaches I’ve tried didn’t succeed in escaping spaces correctly for Tika. I’m no shell wizard, so I may have missed a trick.
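For what it's worth, the usual shell answer to spaces is double-quoting: an unquoted variable is split on whitespace, while a double-quoted one stays a single argument. Here is a minimal sketch of the difference. The helper count_args is hypothetical and exists only for this illustration, and whether Tika itself is happy with the resulting argument is an assumption to check.

```shell
# count_args is a hypothetical helper for illustration only:
# it prints how many arguments it received.
count_args() {
    echo "$#"
}

name="my report.pdf"

count_args $name      # unquoted: split at the space, prints 2
count_args "$name"    # quoted: one argument, prints 1
```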

Suggestions for improvement are welcome.

UPDATE Nov. 27: I’ve made a couple of improvements to the script. It can now be run from your path and pick up the Tika Jar file in a directory of your choice, and it traps control-C so you only have to hit it once.


#!/bin/sh
# Tika directory iteration script
# Author: Gary McGath
# Takes a directory as an argument and runs Tika to identify MIME type
# of each file in the directory.
# Skips over subdirectories.

# Edit TIKAJAR to the path to your Tika jar file.
TIKAJAR=~/Software/Tika/tika-app.jar

# Signal names without the SIG prefix work in every POSIX shell.
trap "echo Script terminated; exit;" INT TERM

if [ ! -d "$1" ] ; then
	echo "$1 is not a directory"
	exit 2
fi

for v in "$1"/*
do
	if [ -d "${v}" ] ; then
		continue
	fi
	echo "$v:"
	# Build the command line, backslash-escaping spaces in the file name.
	vv="java -jar ${TIKAJAR} -d $(printf '%s\n' "$v" | sed 's/ /\\ /g')"
	printf '%s\n' "$vv" | sh
done
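
As one possible improvement, here is a sketch that skips the subordinate shell and calls java directly with the file name in double quotes. It assumes Tika accepts a path containing spaces when it arrives as a single argument, which I haven't checked against every Tika version. The function name identify_dir is made up for this example, and TIKAJAR is the same setting as in the script above. Unlike the script above, it looks only at regular files, so it also skips things like sockets and device files.

```shell
# Sketch only: identify each regular file in a directory by calling
# Tika directly instead of piping a built command to a second shell.
# TIKAJAR is an assumed location; edit it to point at your Tika jar file.
TIKAJAR="${TIKAJAR:-$HOME/Software/Tika/tika-app.jar}"

# identify_dir is a hypothetical helper name, not part of Tika.
identify_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "$dir is not a directory" >&2
        return 2
    fi
    for f in "$dir"/*; do
        # Skip subdirectories, and the unexpanded pattern left behind
        # when the directory is empty.
        [ -f "$f" ] || continue
        # Quoting "$f" hands the name to java as one argument, spaces included.
        printf '%s: %s\n' "$f" "$(java -jar "$TIKAJAR" -d "$f")"
    done
}
```

To try it, paste the function into a script and call it with a directory, for example identify_dir ~/Documents. Each file name is printed on the same line as its type. Since no second shell is involved, interrupting it may also be simpler, though I'd test that before relying on it.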
