Category Archives: commentary

Identifying files by programming language

Most of today’s programming languages look vaguely similar. They’re derived from the C syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly.

The software should do some basic checks to make sure it wasn’t handed a binary file with a false extension, which could be dangerous. A code file should be a text file, regardless of the language. (This isn’t strictly true, but non-text languages like Piet and Velato are just obscure for the sake of obscurity.) The UK National Archives recognizes XML and JSON (which is a subset of JavaScript) but doesn’t talk about programming languages as file formats. ExifTool identifies lots of formats but makes no attempt to discern programming languages.
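As a rough illustration of the kind of basic check described above, here is a minimal Python sketch that guesses whether a file's leading bytes are text. The function name `looks_like_text` and the 8192-byte sample size are my own choices for illustration, not taken from any identification tool. It assumes UTF-8 (or ASCII) source code, so a UTF-16 file would be misjudged as binary.

```python
import codecs


def looks_like_text(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic guess: do the leading bytes look like UTF-8 text?"""
    sample = data[:sample_size]
    # NUL bytes are common in executables and rare in source code.
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    # If the sample was cut short, tolerate a multibyte character
    # split at the cut; otherwise require a complete, valid sequence.
    is_whole_file = len(data) <= sample_size
    try:
        decoder.decode(sample, final=is_whole_file)
    except UnicodeDecodeError:
        return False
    return True
```

A check like this only says "probably text"; it says nothing about which language the text is written in.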

Zip bombs: Blown up out of proportion?

A Vice.com article has brought fresh publicity to an old trick. The so-called “Zip bomb” is a Zip file with a fantastically high compression ratio. Researcher David Fifield created a 46-megabyte file that expands into 4.5 petabytes. That’s a compression ratio of about 100 million. Fifield’s own article provides a lot more technical information.

The article says such files are “so deeply compressed that they’re effectively malware.” That strikes me as a bit of an exaggeration. “Nuisanceware” seems more accurate, if there’s such a word. However, they could be used in a denial of service attack. They could crash a server or browser, and the work removing the expanded files could cause some downtime. A Zip bomb might be a setup for another attack, tying up system resources and distracting administrators.
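One plausible defense is to look at how much data an archive claims to contain before extracting anything. The Python sketch below uses the standard `zipfile` module to compare the declared uncompressed size with the size of the archive itself. The function names and the thresholds (a ratio of 100 and a 1 GiB cap) are arbitrary assumptions for illustration. The declared sizes come from the archive's own headers, which a hostile file can falsify, so this is a first filter and not a complete defense.

```python
import io
import zipfile


def expansion_ratio(archive: bytes) -> float:
    """Declared uncompressed bytes divided by the archive's actual size."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        declared = sum(info.file_size for info in zf.infolist())
    return declared / len(archive)


def is_suspicious(archive: bytes,
                  max_ratio: float = 100.0,
                  max_total: int = 1 << 30) -> bool:
    """Flag archives that claim to expand enormously (thresholds are assumptions)."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        declared = sum(info.file_size for info in zf.infolist())
    if declared > max_total:
        return True
    return declared / len(archive) > max_ratio
```

Dividing by the size of the whole archive, not by the sum of the per-entry compressed sizes, matters when entries share compressed data, since the per-entry figures would then count the same bytes many times.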

The tape obsolescence problem

An ABC News Australia article calls attention to the problem of archives on magnetic tape. Author James Elton clearly knows something about digital preservation issues, as the article goes beyond the usual generalities and hand-wringing.

Tapes, on the other hand, can only be read by format-specific machines.

And dozens of formats of magnetic tape were created through the last century — one-inch, two-inch, various versions of Betamax.


Web archiving and languages

Web archiving is difficult. Few sites consist entirely of static, self-contained content. Most use JavaScript, often from external sites. Responsive pages are designed to look different in different environments. An archive needs to make a snapshot that reflects its appearance at a given point in time, but what exactly does that mean? Should an archive pick an appearance for one reasonable set of parameters, or should it try to keep the page’s dynamic nature? Will the fact that it’s an archive rather than an interactive browser affect what the server gives it?

When ebooks die

Microsoft’s eBook Store is closing. According to the announcement, “starting July 2019 your ebooks will no longer be available to read, but you’ll get a full refund for all book purchases.” This shows a basic truth about DRM book purchases: you don’t actually own your copy. You can use it only as long as the provider supports it. It was honest of Microsoft to refund all “purchases,” but digital oblivion eventually awaits all DRM-protected materials.

Andy Ihnatko once told me that DRM is safe because “Amazon will be around forever.” It won’t. The fact that a company as big and stable as Microsoft is abandoning support for its DRM-protected products reminds us that all such products exist only as long as the provider has sufficient motivation and ability. It’s questionable whether Amazon’s protected ebooks from today will be readable in 2050, let alone “forever.”

HTML mail is a terrible idea — but at least please do it right

Originally email consisted just of text messages. They were straightforward to read. It was very hard to send malware in a convincing way, since the recipient would have to extract any malicious attachment and run it by hand. There was a hoax in 1994 warning of the alleged “Goodtimes virus”, which caused a lot of merriment among the computer-literate. The only “virus” was the hoax email itself, which the less computer-literate forwarded to all their friends.

Then came HTML mail, a huge advance in email insecurity. Now malicious URLs could hide behind links or even be opened automatically. Messages could include JavaScript to exploit client weaknesses and trick recipients. Today, almost everyone recognizes these advantages, and malware and phishing by email are multi-billion-dollar businesses.

Doing it right, or not doing it at all

Even so, there are good and bad ways to create HTML mail.

PDF/A-4

It looks as if I’ll have a little input into the upcoming PDF/A-4 standardization process; earlier this month I got an email from the 3D PDF Consortium inviting me to participate, and I responded affirmatively. While waiting for whatever happens next, I should figure out what PDF/A-4 is all about.

ISO has a placeholder for it, where it’s also called “PDF/A-NEXT.” There’s some substantive information on PDFlib. What’s interesting right at the start is that it will build on PDF/A-2, not PDF/A-3. A lot of people in the library and archiving communities thought A-3 jumped the shark when it allowed attachments of any kind without limitation. It’s impossible to establish a document’s archival suitability if it has opaque content.

Path traversal bugs in archive formats

Malware has shown up which takes advantage of a path traversal bug in the WinRAR archiving utility. The bug, which reportedly existed for 19 years, is fixed in the latest version. The problem stems from an old, buggy DLL which WinRAR used. It allowed the expansion of an archive with a file that would be extracted to an absolute path rather than the destination folder. In this case, the path was the system startup folder. The next time the computer was rebooted, it would run the malware file.
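The general defense against this class of bug is to resolve each entry's path before extraction and reject anything that lands outside the destination folder. Here is a minimal Python sketch of that check. The function name `is_safe_member` is hypothetical, and this is not how WinRAR or its DLL works. It assumes the entry name uses the host platform's path separators.

```python
import os


def is_safe_member(dest_dir: str, member_name: str) -> bool:
    """True if extracting member_name would stay inside dest_dir."""
    dest = os.path.realpath(dest_dir)
    # os.path.join discards dest when member_name is absolute, so an
    # absolute entry resolves outside dest and is rejected below.
    target = os.path.realpath(os.path.join(dest, member_name))
    try:
        return os.path.commonpath([dest, target]) == dest
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

An extractor would call this for every entry and skip or abort on any that fail, before writing a single byte.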

What part of “No Flash” doesn’t Microsoft understand?

If you disable Flash on Microsoft Edge, Microsoft ignores your setting — but only for Facebook’s domains. It sounds too conspiratorial to be true, but a number of generally reliable websites confirm it.

Bleeping Computer: “Microsoft’s Edge web browser comes with a hidden whitelist file designed to allow Facebook to circumvent the built-in click-to-play security policy to autorun Flash content without having to ask for user consent.”

ZDNet: “Microsoft’s Edge browser contains a secret whitelist that lets Facebook run Adobe Flash code behind users’ backs. The whitelist allows Facebook Flash content to bypass Edge security features such as the click-to-play policy that normally prevents websites from running Flash code without user approval beforehand.”

The police body camera data problem

The Washington Post reports that some police departments are dropping body camera programs because of the expense. I’ll admit that my first gut reaction on seeing the story was that it’s just an excuse. In some cases it probably is. But it’s a fact that while the cameras are cheap, storing and managing large amounts of video data isn’t. The question needs objective examination.