Web archiving and languages

Web archiving is difficult. Few sites consist entirely of static, self-contained content. Most use JavaScript, often loaded from external sites. Responsive pages are designed to look different in different environments. An archive needs to make a snapshot that reflects a page’s appearance at a given point in time, but what exactly does that mean? Should an archive pick an appearance for one reasonable set of parameters, or should it try to keep the page’s dynamic nature? Will the fact that it’s an archive crawler rather than an interactive browser affect what the server gives it?

What’s my language?

One of the less obvious issues is what language the page is archived in. This has two aspects. One is the language which the lang attribute declares. The <html> tag should have a lang attribute, and other elements may also have one, perhaps for a different language. The other is the language actually used for the content. Archives shouldn’t be totally English-centric, and the accepted languages in the HTTP request will affect the content that comes back.
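As a rough illustration of the first aspect, here is a minimal sketch that lists every lang attribute a page declares, using only Python's standard library. The class name, function name, and sample markup are made up for this example; it only reports what the page claims, not what language the text is really in.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect every lang attribute, noting which element declared it."""

    def __init__(self):
        super().__init__()
        self.declared = []  # (tag, lang) pairs in document order

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang" and value:
                self.declared.append((tag, value))


def declared_languages(html_text):
    """Return the (tag, lang) pairs found in html_text."""
    collector = LangCollector()
    collector.feed(html_text)
    return collector.declared


# Hypothetical sample page: English overall, with one German paragraph.
SAMPLE = '<html lang="en"><body><p>Hello</p><p lang="de">Guten Tag</p></body></html>'

if __name__ == "__main__":
    for tag, lang in declared_languages(SAMPLE):
        print(tag, lang)
```

An archive could record these declarations next to the Accept-Language value it sent, so that a later reader can tell whether the two agree.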

An additional complication, as I saw during my recent trip to Germany, is that some sites ignore the Accept-Language header in favor of the client IP address. Most sites do it right, but some big sites like Google use the IP address. This means that the apparent location of an archive client will affect what it receives. So do other factors.
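For sites that do honor the header, a crawler can state its language preferences explicitly. The sketch below builds an Accept-Language value and attaches it to a request with Python's urllib; the helper names and the example.org URL are assumptions for illustration, and the request is only constructed, not sent. It does nothing about sites that decide by IP address.

```python
import urllib.request


def accept_language(preferences):
    """Build an Accept-Language value from (tag, q) pairs, most preferred first."""
    parts = []
    for tag, q in preferences:
        # A quality value of 1 is the default, so it can be left out.
        parts.append(tag if q >= 1.0 else f"{tag};q={q:g}")
    return ", ".join(parts)


def build_request(url, preferences):
    """Return a Request object carrying the chosen language preferences."""
    headers = {"Accept-Language": accept_language(preferences)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical URL; prefer German, fall back to English.
    request = build_request("https://example.org/", [("de", 1.0), ("en", 0.5)])
    print(request.header_items())
```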

Oh, Kannada!

Systems for archiving the Web face another complication. They crawl through links, and following some of them can have side effects. This turned out to explain a puzzle which archivists at Old Dominion University faced. When they looked at five different archives of Barack Obama’s tweets, they discovered that only 53% of them were in English. That is, his tweets were in English as he wrote them, but the labels, links, titles, etc. provided by Twitter were in another language. The most common language after English was Kannada. Millions of people in India speak Kannada, but it’s not well known elsewhere, and even the most fanatical birther wouldn’t claim it was Obama’s native language.

The article explores a lot of the ways language selection can go wrong. Authors Sawood Alam and Plinio Vargas think the explanation is in the list of alternate links (using <link rel="alternate">) for different languages. These aren’t intended for display in the browser, but to help software look for a suitable version of the page.
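To see what a crawler finds in such a list, here is a small sketch that pulls the language alternates out of a page's markup. It assumes the links carry the standard hreflang attribute; the class name, sample markup, and example.org URLs are invented for illustration and are not Twitter's actual markup.

```python
from html.parser import HTMLParser


class AlternateLinks(HTMLParser):
    """Collect (hreflang, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        hreflang = attributes.get("hreflang")
        href = attributes.get("href")
        if "alternate" in rel_tokens and hreflang and href:
            self.alternates.append((hreflang, href))


# Hypothetical page head with three language alternates and one stylesheet.
SAMPLE_HEAD = """
<head>
  <link rel="stylesheet" href="/site.css">
  <link rel="alternate" hreflang="fr" href="https://example.org/page?lang=fr">
  <link rel="alternate" hreflang="pl" href="https://example.org/page?lang=pl">
  <link rel="alternate" hreflang="kn" href="https://example.org/page?lang=kn">
</head>
"""

if __name__ == "__main__":
    parser = AlternateLinks()
    parser.feed(SAMPLE_HEAD)
    for language, url in parser.alternates:
        print(language, url)
```

A crawler that treats every href as something to fetch will visit each of these URLs in turn.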

Following these links sets a cookie telling Twitter to prefer a language. Following the link for French sets a cookie asking for subsequent content in French. There’s a long list of these alternate links, and the last one is for the Kannada language! If a client crawls each of these links, it will get a different language cookie for each one, replacing the previous one. At the end of the list, it will have a cookie asking to deliver content in Kannada.
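The overwriting effect is easy to reproduce in a toy simulation. The sketch below is not Twitter's code: fake_server is a made-up stand-in that sets a language cookie whenever a URL has a ?lang= parameter, and the crawler keeps a single shared cookie store, as described above. The URLs and the default of English are assumptions.

```python
from urllib.parse import urlparse, parse_qs


def fake_server(url, cookies):
    """Pretend site: ?lang= sets a cookie, and the cookie picks the interface language."""
    query = parse_qs(urlparse(url).query)
    set_cookie = {}
    if "lang" in query:
        set_cookie["lang"] = query["lang"][0]
    # An explicit parameter wins, then the stored cookie, then a default of English.
    interface_language = set_cookie.get("lang") or cookies.get("lang") or "en"
    return interface_language, set_cookie


def crawl(urls):
    """Visit urls in order with one shared cookie store, like a naive crawler."""
    jar = {}
    seen = []
    for url in urls:
        interface_language, set_cookie = fake_server(url, jar)
        jar.update(set_cookie)  # each new language cookie replaces the last one
        seen.append((url, interface_language))
    return seen, jar


# Hypothetical crawl order: the alternates first, then the plain page.
URLS = [
    "https://example.org/page?lang=fr",
    "https://example.org/page?lang=pl",
    "https://example.org/page?lang=kn",
    "https://example.org/page",
]

if __name__ == "__main__":
    seen, jar = crawl(URLS)
    for url, language in seen:
        print(language, url)
```

The plain page at the end carries no language parameter, yet it comes back in Kannada ("kn") because that was the last cookie stored.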

This cookie behavior isn’t unreasonable in normal use. Someone who views Twitter in Polish probably wants to keep viewing it in Polish, regardless of geolocation or browser settings. It’s only when the site is crawled that things get strange.

The article suggests a couple of possible solutions, such as quickly expiring cookies or putting each request in a separate sandbox. Disabling cookies isn’t likely to work well; the article says that some sites expect cookies to be retained at least through a redirect sequence. Sandboxing requests, so that each one is the equivalent of a separate curl operation, might be the best bet.
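One way to approximate the sandboxing idea with Python's standard library is to give every fetch its own fresh cookie jar. This is a sketch of the general approach, not the article's implementation; the function names and the timeout value are my own choices. Redirects within a single call still share the jar, so cookies survive a redirect sequence but nothing carries over to the next request.

```python
import urllib.request
from http.cookiejar import CookieJar


def sandboxed_opener():
    """Return an opener with its own empty cookie jar, plus the jar itself."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_isolated(url, timeout=30):
    """Fetch one URL in isolation; cookies it sets are discarded afterwards."""
    opener, _jar = sandboxed_opener()
    # Redirects are followed inside this one call and reuse the same jar.
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_isolated once per alternate link means a cookie set by the French link is gone before the Polish link is requested, at the cost of losing any session state a site legitimately needs across pages.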

Nailing the Jello to the wall

The problem for archivists is that today’s typical website isn’t a fixed document. It’s more of a service or application. What you see depends on your computer model, operating system, browser, screen, geographic location, time, and the server you happen to reach. Getting a consistent snapshot of a site is an art, especially if it’s one like Twitter, which is designed never to look the same twice. The snapshots which people find in the Internet Archive and other archives are sometimes confusing or disappointing.

Heraclitus said that you can’t step into the same river twice. That describes the Internet pretty well, too.
