Monthly Archives: March 2016


If anything causes more controversy than DRM (digital rights management), it’s joining DRM with an open standard. The World Wide Web Consortium’s Encrypted Media Extensions Working Draft is generating controversy in plenty.

Cory Doctorow has declared: “The World Wide Web Consortium’s decision to make DRM part of HTML5 doesn’t just endanger security researchers, it also endangers the next version of all the video products and services we rely on today: from cable TV to iTunes to Netflix.”
Continue reading

The Java file format API graveyard

If you look for Java libraries to support specific file formats, you’ll soon come upon the gloomy graveyard of Java APIs. Sun and Oracle have a history of devising nice packages for reading and writing different kinds of files, only to abandon their maintenance. You can still find pages for them, and it takes a close look to figure out that they aren’t supported any more.

Java Advanced Imaging (JAI) was nice in its time. It still has a page on Oracle’s website, but the latest “what’s new” item is dated 2007. The page brags about customer success stories as if it were still usable code. I’ve tried working with it. It’s out of sync with the current com.sun classes, and I got only limited use out of it. In its time it was a good way to read and write image files.

Java Media Framework (JMF) runs on a 166 MHz Pentium or 160 MHz PowerPC. The downloaded jars are dated May 1, 2003. It had a nice list of supported formats.

If you’re working with audio files, javax.sound looks more encouraging. Its API is listed with Java 8. The class javax.sound.sampled.AudioSystem supports reading and writing of audio files. I can’t find a list of the supported formats.
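In the absence of a documented list, one option is to ask the runtime itself. The sketch below uses AudioSystem.getAudioFileTypes(), which reports the file types the installed providers can write; as far as I know there is no matching call that lists readable types, so treat this as a partial answer. The class name ListAudioTypes is made up for illustration.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper class; only the javax.sound.sampled calls are real API.
class ListAudioTypes {
    /** Returns the audio file types that the installed providers can write. */
    static AudioFileFormat.Type[] writableTypes() {
        return AudioSystem.getAudioFileTypes();
    }

    public static void main(String[] args) {
        // The result depends on the JVM and on any service providers on the class path.
        for (AudioFileFormat.Type type : writableTypes()) {
            System.out.println(type + " (." + type.getExtension() + ")");
        }
    }
}
```

The output varies from one Java installation to another, which may be part of why no fixed list is published.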

Java does reliably support some formats. Its handling of text encodings is versatile, and it handles ZIP and GZIP.
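As a small illustration of that built-in support, here is a sketch that compresses a string with GZIP and reads it back, naming the text encoding explicitly rather than relying on the platform default. The class and method names are invented for the example; the java.util.zip and java.nio.charset classes are standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class showing a GZIP round trip with an explicit encoding.
class GzipRoundTrip {
    /** Encodes the text as UTF-8 and compresses it with GZIP. */
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Decompresses GZIP data and decodes the result as UTF-8. */
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int count;
            while ((count = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, count);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "Plain text with non-ASCII characters: é, ü, 日本語";
        System.out.println(decompress(compress(original)).equals(original));
    }
}
```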

Third-party code can come to the rescue. For reading and writing PDF, Apache PDFBox looks like the best bet. You can use Apache Tika with lots of formats, if you just need to extract metadata. Another alternative is to use ImageMagick, but it runs natively rather than under the JVM, so you have to invoke it with exec calls. im4java and JMagick can save some of the tedium. There are open source Java libraries for reading and writing specific file formats. Some may be good, some not.

If you need to deal with the guts of file formats in Java, you’ll usually have to find some good third-party code or start writing your own.

Security risk in “target=_blank”

I’ve often used “target=_blank” in my posts so that people can click on a link without leaving the original page. So do many people. This turns out to be a seriously risky practice, though. When you open a window with an anchor tag specifying “target=_blank”, you give the target window control of the original window’s location object! This means that the target window can send the original window to another page of its choosing, possibly a phishing page.

We could also call this a security hole in the HTML DOM, or perhaps in the whole idea of allowing JavaScript in Web pages. I use NoScript with Firefox so that unfamiliar pages won’t run JavaScript, preventing them from exploiting this hole. I can’t expect everybody reading this blog to do that, though. To protect against exploits, I’d need to add “rel=noopener” for some browsers and “rel=noreferrer” for others. That would require custom JavaScript, which my blog host won’t let me do, and would be a lot of work just to modify link behavior. Starting with this post, I’m not using “target=_blank” in my links. The sites I’ve linked to in the past are reputable, as far as I know, so the risk from existing links should be minimal. At least I hope so; supposedly trustworthy websites allow advertisers to include unvetted JavaScript, allowing malware attacks.
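For anyone who does control their HTML before it’s published, the fix can be applied without JavaScript. The following is a rough sketch in Java, not something I run on this blog: it adds both rel values to anchor tags that specify target=_blank and have no rel attribute yet. The class name LinkHardener is invented, and the regular expression is a simplification that will miss or mishandle unusual markup; a real HTML parser would be more reliable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex-based sketch, not a full HTML parser.
class LinkHardener {
    // An opening <a ...> tag whose attributes include target=_blank, quoted or not.
    private static final Pattern BLANK_ANCHOR = Pattern.compile(
            "<a\\b([^>]*\\btarget\\s*=\\s*[\"']?_blank[\"']?[^>]*)>",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern HAS_REL =
            Pattern.compile("\\brel\\s*=", Pattern.CASE_INSENSITIVE);

    /** Adds rel="noopener noreferrer" to target=_blank anchors that lack a rel attribute. */
    static String harden(String html) {
        Matcher matcher = BLANK_ANCHOR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String attributes = matcher.group(1);
            String tag = HAS_REL.matcher(attributes).find()
                    ? matcher.group()   // already has a rel attribute: leave it alone
                    : "<a" + attributes + " rel=\"noopener noreferrer\">";
            matcher.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(harden("<a href=\"http://example.com/\" target=\"_blank\">a link</a>"));
    }
}
```

Tags that already carry a rel attribute are deliberately left untouched, so existing values such as nofollow aren’t overwritten; they would still need checking by hand.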

Closed captioning formats

An online discussion led to my learning about Udemy’s support for closed captioning and to the formats available for it. Since I hadn’t heard about these formats before, I’m guessing a lot of other people haven’t. They can be useful not only for accessibility but for preservation, since they provide a textual version of spoken words in a video. These are just some notes on what I’ve found in a cursory investigation. In general, sites that support closed captioning expect a text file in one of several formats, which has to have at least the text of the caption, its starting time, and its duration or ending time.
Continue reading

The (information) machine stops

The “Digital Dark Age” discussion has started up again on Twitter, and again I find myself in the minority position. It really is possible to have Twitter discussions on complex topics and say something intelligent, but it isn’t easy. More than 140 characters at a time are needed, and it’s been a while since I last wrote about the subject at length, so let’s get back to it. The last post that I wrote on this was “Dataliths vs. the Digital Dark Age”, and I hope you’ll read that before continuing here, since I don’t want to just repeat myself.

Maybe the question needs to be turned around. Let’s not ask what could trigger a Digital Dark Age, but what conditions are necessary and sufficient for the really long-term preservation of information. What will minimize the risk of widespread loss of today’s history, literature, and daily news?
Continue reading


The weather’s been great lately, so here’s a special offer, just through March 13, on my Udemy course on file format identification tools: Just $12 with the coupon MARCH11! The list price is $28. This also celebrates Udemy’s fixing a … Continue reading

Photoshop’s PSD file format

Photoshop’s native format, PSD, doesn’t get a lot of discussion. It’s Photoshop’s default format, and people use it for projects if only for that reason, so we really should know something about it. A lively place to start is “Fun Photoshop File Format Facts” on the Postlight blog. For serious investigation, look at Adobe’s specification. There’s also a short article on, with some information about the format’s history.
Continue reading

Update on my Udemy courses

Udemy has made some serious changes to its pricing rules. This will result in some price changes in my courses, starting on April 4.

In one respect, this is a good thing. Currently, any course participating in Udemy’s marketing programs is periodically subject to huge discounts on zero notice. A $300 course might suddenly be offered for $10. If students enroll in the course through the marketing program, the instructor may get as little as 25% of that. On the other hand, if students enroll using my coupon codes, I get to keep 97% of the money. It’s not hard to see how this can put instructors in a price war against themselves. I want to sell courses through coupons so that Udemy doesn’t gobble up most of the money you pay, but this encourages instructors to set a high price and then discount it heavily so students will use the coupons.

This wasn’t making anybody happy, so Udemy has changed its policies, promising not to discount courses by more than 50%. But this comes with a new set of price restrictions on the courses. All prices have to be between $20 and $50 and — I don’t know why — be a multiple of $5. We can’t give discounts of more than 50% with our own coupons. If a coupon violates this limit, we can’t change it; it will just expire on April 4.

This means I’ll be making the following changes in my prices:

  • Managing metadata with ExifTool: The list price will drop from $36 to $30.
  • Personal digital preservation: The list price will go up from $16 to $20.
  • How to tell a file’s format: Five open source tools: The list price will go down from $28 to $25.

If you’re here, the list prices are irrelevant, since you’ll be buying using the coupon code unless you like spending more and letting me have less. But there are also changes in the coupons. Until April 4, you’ll be able to enroll in the ExifTool course with the code EXIF14 for $14.00. Starting April 4, you’ll have to use the code EXIF15 with a price of $15.00.

The introductory offer for Personal Digital Preservation expired at the end of February. The new code PRESERVE lets you enroll for $11. This won’t change.

The coupon code TOOLKIT for How to Tell a File’s Format: Five Open Source Tools continues to get you a $20 price.

The biggest annoyance is that I like to give students a really deep discount for a course that builds on another one (e.g., on the ExifTool course for those who’ve taken the file identification tools course), and I’ll be limited in what I can do there.

By way of compensation, I’m offering a special rate on Personal Digital Preservation till April 4: Just $8 with the coupon code MARCHAIR! After April 4, you won’t be able to get that low a price for any paid Udemy course.

Hopefully this will all work out well. I’m looking into adding another course, though it’s too soon to give specifics.

JHOVE PNG module, progress report

There’s now a JHOVE PNG module on my GitHub site. The relevant new classes are com.mcgath.jhove.module.PngModule and everything in the package com.mcgath.jhove.module.png. I could have continued from Lauri’s code as I mentioned in my previous post, but I like a more factored approach, so I continued with my own code, which has a separate class for each chunk type. Take a look at the top-level file FORKNOTES for what I’ve been doing.

It does a pretty decent job of validating files and extracting metadata now, but some chunk types are still ignored, and there are some design decisions on the extracted metadata that I’m not sure about yet. Also, JHOVE modules usually have a lot of metadata about themselves, and that’s not complete yet. If anyone wants to play with it, keeping in mind that it’s not stable code yet, please do and submit issue reports for bugs and suggestions.
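For readers who haven’t looked inside a PNG file, here’s a minimal sketch of the chunk structure that any such module has to walk. This is not the code from the module; PngChunkWalker is a made-up class that only checks the file signature, verifies each chunk’s CRC, and lists the chunk types in order. A real validator would also check chunk ordering and contents, and would guard against absurd chunk lengths before allocating memory.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration of PNG chunk layout; not part of JHOVE.
class PngChunkWalker {
    // Every PNG file starts with these eight bytes.
    private static final byte[] SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Returns the chunk types in file order, stopping after IEND. */
    static List<String> chunkTypes(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] signature = new byte[8];
        data.readFully(signature);
        if (!Arrays.equals(signature, SIGNATURE)) {
            throw new IOException("Not a PNG file");
        }
        List<String> types = new ArrayList<>();
        while (true) {
            int length = data.readInt();            // 4-byte big-endian data length
            if (length < 0) {
                throw new IOException("Chunk length out of range");
            }
            byte[] type = new byte[4];              // 4-byte ASCII chunk type
            data.readFully(type);
            byte[] body = new byte[length];         // sketch only: no size limit applied
            data.readFully(body);
            long storedCrc = data.readInt() & 0xFFFFFFFFL;

            // The CRC covers the type and data fields, but not the length.
            CRC32 crc = new CRC32();
            crc.update(type);
            crc.update(body);
            String name = new String(type, StandardCharsets.US_ASCII);
            if (crc.getValue() != storedCrc) {
                throw new IOException("Bad CRC in chunk " + name);
            }
            types.add(name);
            if (name.equals("IEND")) {
                return types;
            }
        }
    }

    /** Convenience overload for data already held in memory. */
    static List<String> chunkTypes(byte[] png) throws IOException {
        return chunkTypes(new ByteArrayInputStream(png));
    }
}
```

A design with a separate class per chunk type would dispatch on the four-character name at the point where this sketch simply records it.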