In April of 2010, Twitter and the Library of Congress announced they had partnered to create an archive of public tweets, something Twitter does not offer directly to users. “Expect to see an emphasis on the scholarly and research implications of the acquisition,” the LOC said on its blog, touting the plan as a sign of the Library’s tech savvy: “if you’re looking for a place where important historical and other information in digital form should be preserved for the long haul, we’re it!”
Today, the Library has released a white paper explaining why it hasn’t been able to make an archive available. As of today, it says, the “Library has not yet provided researchers access to the archive,” despite “approximately 400 inquiries.”
In fact, it appears that the federal institution — famed for its preservation of physical documents — has bitten off more than it can technically chew on a big data project of immense scope and expense. The Library also says its agreement with Twitter prevents it from making the archive widely accessible, but that goal now appears to have been infeasible since the start.
An archive exists, but it’s raw, private, and functionally unsearchable. A single query, the Library says, “could take 24 hours.” The Library blames a lack of off-the-shelf software for its troubles, saying that with existing technology, an instant search system “would require an extensive infrastructure of hundreds if not thousands of servers,” which it says is cost prohibitive.
“Cost prohibitive” might not mean much: Deputy Librarian of Congress Robert Dizard told the Washington Post that the Library has spent “tens of thousands” of dollars on the project so far, which isn’t much in the context of a large-scale data project like this. It’s less than it would have cost to have retained, say, a single top-notch freelance developer since the day of the announcement. It’s also a tiny sum relative to the Library’s budget (it requested $643 million for 2013).
Twitter’s terms in the deal, which had not been fully outlined until now and which prohibit “[providing] a substantial portion of the collection on its web site in a form that can be easily downloaded,” offer more clues that this was never destined to be a fully open, free-to-search archive. Twitter never said it would be, but it implied some level of public access. At the time, however, it was still working on another archive product with a company that actually could build software for a task like this: Google. That relationship has since dissolved.
Unless Twitter (or someone else) pitches in quite a bit of engineering effort, the LOC project is going to be an archive in a very narrow sense: An overflowing lockbox of tweets, with just a few sets of keys.