Tag Archives: software

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware of how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or PDF/A-3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

Song identification on GitHub

The code for my song identification “nichesourcing” web application is now available on GitHub. It’s currently aimed at one project, as I’d mentioned in my earlier post, but has potential for wide use. It allows the following:

  • Users can register as editors or contributors. Only registered users have access.
  • Editors can post recording clips with short descriptions.
  • Contributors can view the list of clips and enter reports on them.
  • Reports specify type of sound, participants, song titles, and instruments. Contributors can enter as much or as little information as they’re able to.
  • Editors can modify clip metadata, delete clips, and delete reports.
  • Contributors and editors can view reports.
  • More features are planned, including an administrator role.
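
The role rules in the list above can be sketched as a small permission table. This is only an illustration in Python, not the application's PHP code; the action names (`post_clip`, `enter_report`, and so on) are made up for the example, and the table mirrors only what the list states.

```python
from enum import Enum


class Role(Enum):
    EDITOR = "editor"
    CONTRIBUTOR = "contributor"


# Hypothetical action names; they mirror the feature list, not the real code.
PERMISSIONS = {
    Role.EDITOR: {
        "post_clip",
        "edit_clip_metadata",
        "delete_clip",
        "delete_report",
        "view_reports",
    },
    Role.CONTRIBUTOR: {
        "view_clips",
        "enter_report",
        "view_reports",
    },
}


def can(role, action):
    """Return True if a user with this role may perform the action.

    A role of None stands for an unregistered visitor, who has no access.
    """
    if role is None:
        return False
    return action in PERMISSIONS[role]
```

With a table like this, adding the planned administrator role would mean adding one more entry rather than touching every check.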

This is my first PHP coding project of any substance, so I’m willing to accept comments about my overall coding approach. It’s inevitable that, to some degree, I’m writing PHP as if it’s Java. If there are any standard practices or patterns I’m overlooking, let me know.

Update on JHOVE

I’ve updated the UTF-8 module in the JHOVE source on GitHub to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.
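
For readers who haven’t looked at how block reporting works, here is a minimal sketch of the idea: a sorted table of code point ranges and a lookup against it. It’s written in Python for brevity and is not JHOVE’s Java code; the names `BLOCKS`, `block_of`, and `blocks_used` are invented for the example, and the table holds only Basic Latin plus a handful of the blocks that Unicode 7.0.0 added.

```python
from bisect import bisect_right

# (first code point, last code point, block name), sorted by first code point.
# A tiny sample only; a real table lists every block in the standard.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0xAB30, 0xAB6F, "Latin Extended-E"),
    (0x10600, 0x1077F, "Linear A"),
    (0x1F650, 0x1F67F, "Ornamental Dingbats"),
    (0x1F800, 0x1F8FF, "Supplemental Arrows-C"),
]
STARTS = [first for first, _last, _name in BLOCKS]


def block_of(code_point):
    """Return the name of the block containing code_point, or None."""
    # Find the last block whose first code point is <= code_point.
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        _first, last, name = BLOCKS[i]
        if code_point <= last:
            return name
    return None


def blocks_used(data):
    """Decode UTF-8 bytes and return the set of known blocks they use."""
    text = data.decode("utf-8")
    found = {block_of(ord(ch)) for ch in text}
    found.discard(None)  # characters outside the sample table
    return found
```

Supporting a new Unicode version in a design like this is mostly a matter of adding rows to the table, which is why the update was a small one.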

I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that is as useful as it was before.

Comments, suggestions, and code contributions are welcome, as always.

Mavenized JHOVE

I’m not a Maven maven, but more of a Maven klutz. Nonetheless, I’ve managed to push a Mavenized version of JHOVE to GitHub that compiles for me. I haven’t tried to do anything beyond compiling. If anyone would like to help clean it up, please do.

This kills the continuity of file histories which Andy worked so hard to preserve, since Maven has its own ideas of where files should be. The histories are available under the deleted files in their old locations, if you look at the original commit.

JHOVE, continued

There’s been enough encouragement in email and on Twitter for my proposal to move JHOVE to GitHub that I’ll be going ahead with it. Andy Jackson has told me he has some almost-finished work to migrate the CVS history along with the project, so I’m waiting on that for the present. Watch this space for more news.

The state of JHOVE

As you may have noticed, I’ve been neglectful of JHOVE since last September, when 1.11 came out. Issues are continuing to arise, and people are still using it, and I’m not getting anything done about them.

The problem is that my current job has rather long hours, and when I come home from it, looking at more Java code isn’t at the top of my list of things to do. I’m very glad people are still using JHOVE, close to a decade after I started work on it as a contractor to the Harvard Library, but I’m not getting anything actually done.

It would help if there were more contributions from others, and its being on the moribund SourceForge isn’t helping. I think I could muster the energy to move it to GitHub, where more contributors might be interested. There’s already a Mavenized version by Andy Jackson there, which doesn’t include the Java source code but provides some important scaffolding and pom.xml files. It probably makes sense to start by forking this. This migration should also make the horrible JHOVE build procedure easier.

If this is something you’d like to see, let me know. I’d like some reassurance that this will actually help before I start.

FITS website

Last spring, I attended a hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of whom is me, since my work was on an earlier repository that was incorporated into this one.

It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.

Ninjas, samurai, and artists

Lately I haven’t been posting as much on this blog. My professional responsibilities have shifted, and much as I still love the issues of file formats, I don’t have as much time to give them. There are still general programming issues worth blogging about, though, and I’ll address them here from time to time, hopefully along with occasional file format posts.

This weekend I borrowed a book from the company library called Secrets of the JavaScript Ninja. It’s a better book than I expected from the title.

In an inspired error, the cover shows not a ninja but a samurai, with colorful armor, a banner, and a long sword. A ninja is a hit-and-run assassin; he shows up out of nowhere, attacks, and vanishes. A samurai is a dedicated soldier, the Japanese equivalent of a knight, and he follows a code of honor. The software development world has too many ninjas. The samurai is a better, if not ideal, model.

From the title I was expecting a cookbook, one of the many books that provide formulas to follow but no deep understanding. Instead, Secrets of the JavaScript Ninja is about understanding the language. Such books are even rarer for JavaScript than for most programming languages. I’d always tended to think of JavaScript as a half-baked derivative of Java, one where features such as classes, packages, and inheritance were left out to cut it down to a scripting language. This book, though, shows that it’s really quite a different language, a very powerful one in its own right. I still think the language has serious problems, the biggest being lack of standardization, but reading through the book, I’m learning how to think in JavaScript’s terms and to make use of features which other languages don’t have.

It’s not the language I want to talk about here, though; it’s the approach to any language or software technology. Too many programmers don’t have any deep understanding of their craft; they just have a bag of tools that they expect to solve problems for them. The worst can’t do much more than run a Web search for the code they need or beg on Stack Overflow for a solution to their problem. Once I actually had to fix some PHP that consultants from a big-name company had pasted from a website and didn’t know how to adapt to the problem at hand — and I don’t even know PHP! Those are your ninjas.

Even the samurai isn’t a great metaphor. They were a part of Japan’s entrenched feudal culture, and their opposition to capitalism promoted the warlike mindset that culminated in Japan’s role in World War II. Our metaphors should be based on creativity, not war and violence. We should think of ourselves as architects, sculptors, artists. We learn a craft and master the tools that go into it. There’s a real psychological affinity; I know a lot more software developers who are skilled musicians than are skilled fighters.

When I see a book called Secrets of the JavaScript Artist, then I’ll be pleased.

Update: I just noticed that the book itself says: “Ninjas were chosen for their martial arts skills rather than for their social standing or education. Dressed in black and with their faces covered, they were sent on missions alone or in small groups to attack the enemy with subterfuge and stealth, using any tactics to assure success; their only code was one of secrecy.” Just what you want in a programmer, right?

The FITS Blitz

Back in May, after an enjoyable trip to the University of Leeds, I worked for a month on improving the Harvard Library’s FITS tool for combining the results of several file format identification and validation tools. The results were well received and the Harvard Library incorporated some of my work in the main line of FITS. Still, there were a lot of loose ends left and more work to be done.

Things are picking up again with a “FITS Blitz” that’s starting this week. Paul Wheatley writes that “in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. ‘FITS Blitz’ will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven’t been damaged by the changes.”

I’ve moved on to other things, so I won’t be able to participate, but I wish them every success.

Charles Stross on Microsoft Word

Not many people are brilliant writers and also have the technical knowledge to comment on file formats intelligently. When it does happen, it’s worth reading. So I recommend to you Why Microsoft Word Must Die by Charles Stross.

I’ve been on a digital preservation panel with Stross, and he can talk as expertly on the subject as I can. When it comes to Word, he knows a lot more about the format than I do, and he can demolish it more eloquently than I could even if I had the same level of knowledge.