Although the name of the NLNZ (National Library of New Zealand) Metadata Extraction Tool suggests extracting metadata more than identifying files, FITS uses it as part of its set of format identification tools. It employs a set of adapters to access the following file formats: BMP, GIF, JPEG, TIFF, MS Word, WordPerfect, OpenOffice, MS Works, MS Excel, MS PowerPoint, PDF, WAV, MP3, BWF, FLAC, HTML, XML, and ARC. It also has a generic adapter to report basic file system information about other files. It’s available as open source on SourceForge under the Apache Public License. Output is in XML, with a choice of schemas. Like many other identification tools, it’s written in Java and can run on any desktop system that supports Java applications. It has command line versions for Unix and Windows, as well as a GUI version. The most recent update was in June 2014. A brief Developer’s Guide and an installation guide are available.
Like JHOVE, the NLNZ tool has its own code for processing various file formats, some of which are complicated, and like JHOVE, it’s met with varying degrees of success. The source code of the Word adapter says that it “adapts all Microsoft Word files from version 2.0 to XP/2003.” The PDF adapter says it handles versions 1.1 through 1.5 (the latest version, which is also the ISO standard, is 1.7).
Each NLNZ adapter checks whether a file passes some basic tests for its format; if it doesn’t, other adapters are tried. That certainly qualifies it as an identification tool within the range of formats it handles.
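The try-each-adapter-then-fall-back pattern can be sketched in a few lines. This is only an illustration of the idea in Python, not the NLNZ tool’s actual Java code: the `Adapter` class, the `pick_adapter` function, and the particular signature checks are all hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Adapter:
    """Hypothetical stand-in for a format adapter (not the NLNZ API)."""
    name: str
    accepts: Callable[[bytes], bool]  # cheap test on the file's leading bytes


# Illustrative checks based on well-known file signatures.
ADAPTERS = [
    Adapter("PDF", lambda head: head.startswith(b"%PDF-")),
    Adapter("GIF", lambda head: head[:6] in (b"GIF87a", b"GIF89a")),
    Adapter("BMP", lambda head: head.startswith(b"BM")),
    Adapter("WAV", lambda head: head[:4] == b"RIFF" and head[8:12] == b"WAVE"),
]

# The generic adapter accepts anything, so it only serves as the last resort.
GENERIC = Adapter("generic", lambda head: True)


def pick_adapter(head: bytes) -> Adapter:
    """Return the first adapter whose basic test passes, else the generic one."""
    for adapter in ADAPTERS:
        if adapter.accepts(head):
            return adapter
    return GENERIC
```

Under these assumptions, `pick_adapter(open(path, "rb").read(16))` would name a candidate format for a file, and anything unrecognized lands on the generic adapter. A real adapter would go on to do deeper checks than a signature match.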
The source code is available only within the ZIP files for each version; this makes it difficult to tell how actively specific parts have been maintained. A spot check, though, suggests that many of the adapters haven’t been kept up to date.
I wasn’t able to run it. The launch script, metadata.sh, seems to make assumptions about the METAHOME directory that are inconsistent with the file structure, and I gave up after some diddling with it. If I get more information, I’ll update this post.
Often software projects in the library world come out of an initial burst of funding, after which it’s hard to maintain the programming staff time to do all the necessary updates. I think the NLNZ Metadata Extraction Tool may be a case in point.
Next: JHOVE2. To read this series from the beginning, start here.
Microsoft Word files can betray your privacy
When you create a Microsoft Word file, you may think that all the information you’re giving is what you type into it and keep in the final version. If you’re seriously concerned about confidentiality, you can’t count on that. A file’s metadata can include information about its source and history which you never realized was there. Redaction may not remove all the information it’s supposed to chop out.
When you or somebody else installed Word on your computer, you were asked to enter information about yourself. It gets put into every file you create. If multiple people edit a document, the information on all of them gets into the metadata. Most people don’t mind, but in some cases it could be revealing too much information. If you entered gibberish or silly comments, they go into your documents.
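One way to see some of this for yourself, assuming the file is in the newer .docx format (a ZIP package) rather than the older binary .doc format, is to read the core properties part directly. The sketch below uses only the Python standard library; the function name `identity_fields` is made up for the example, and it looks only at the creator and last-modified-by fields, not at everything Word may store.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def identity_fields(docx_path):
    """Return the creator and last-modified-by names stored in a .docx file.

    Hypothetical helper for illustration; assumes the package contains
    the usual docProps/core.xml part.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", default=None, namespaces=NS),
        "last_modified_by": root.findtext(
            "cp:lastModifiedBy", default=None, namespaces=NS
        ),
    }
```

Calling `identity_fields("report.docx")` on a document of your own (the file name here is just a placeholder) shows which names would travel with the file if you sent it to someone.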