Inside The Race To Capture Government Websites Before They Vanish Forever

In a small, windowless room in San Francisco, rows of computers whir with an intensity that borders on a scream. This may be an ordinary scene for a data center located less than an hour’s drive from Silicon Valley, but these machines are engaged in an extraordinary task.

With the Nov. 5 election just two weeks away, they’re harvesting vast amounts of government data before the White House welcomes new residents or former ones in January. The information will live on in the End of Term Web Archive, a giant repository of federal government websites preserved for the historical record as one administrative term ends and a new one begins. Librarians, archivists and technologists across the country join forces every four years to donate time, effort and resources to what they dub the end-of-term crawl, with the resulting datasets available to the public for free.

“It is important to capture these websites because they can provide a snapshot of government messaging before and after the transition of terms,” reads a description of the project from the Internet Archive, a nonprofit that provides free access to digitized materials including websites, software applications, music and audiovisual and print materials.

Data collected in end-of-term crawls lives on in the organization’s Wayback Machine, which contains billions of webpages and other cultural artifacts that can be searched and, in this case, downloaded as bulk data for machine-assisted analysis. Researchers have used the datasets to examine, among other things, the history of climate change policy, the reuse of suspended U.S. government accounts on social media platform X and how PDFs can most effectively disseminate government information.

It’s here at the Internet Archive’s cavernous San Francisco headquarters, and at data centers elsewhere around the Bay Area, where Internet Archive computers have begun harvesting government domains, such as .gov and .mil, from the legislative, executive and judicial branches. The initiative aims to safeguard the annals of history — and, say project participants, democracy itself.

“Citizens have a right to access information about what their government is doing in their name,” says James Jacobs, a government information librarian at Stanford University, an End of Term Web Archive project partner. “That’s why libraries have long collected these materials to make sure they are organized, preserved and easily accessible for the long term.”

Today, with misinformation flooding the web, it’s a job some project team members view as more critical than ever.

Why Does Web Content Disappear?

Over the past two decades, government information, like that of nearly every sector, has overwhelmingly moved online. But there’s no guarantee it will remain there undisturbed. Web content disappears from view at an alarming rate, according to a Pew Research Center study on “digital decay” released in May. It found that 38% of webpages that existed in 2013 are no longer accessible a decade later and that 23% of news webpages contain at least one broken link, as do 21% of webpages from government sites.
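
To make “digital decay” concrete: at its simplest, a link-rot check just requests each URL and flags anything that no longer resolves or answers with an error code. The short Python sketch below illustrates the idea; the URL list is hypothetical, not drawn from the Pew study.

```python
# A minimal link-rot checker: request each URL and flag failures.
# Illustrative only; the URLs below are hypothetical examples.
import urllib.request

URLS = [
    "https://www.usa.gov/",                  # a live government site
    "https://www.example.gov/retired-page",  # hypothetical vanished page
]

def is_broken(url: str) -> bool:
    """Return True if the URL fails to resolve or returns an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            return resp.status >= 400
    except OSError:  # urllib.error.URLError (a subclass) covers DNS and HTTP errors
        return True

for url in URLS:
    print(("BROKEN  " if is_broken(url) else "OK      ") + url)
```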

Government documents vanish for reasons other than so-called link rot.

With the dawn of a new administration, “the new management will often rearrange things,” says Mark Graham, director of the Internet Archive’s Wayback Machine. “Often they just move things around and they don’t necessarily move things around in a predictable fashion or with proper thought about redirection.” A researcher seeking original source material from the Environmental Protection Agency during the Obama administration, for example, might have no idea where to look.

Information also can get intentionally wiped out for political reasons, Graham notes. Following Donald Trump’s win in 2016, some government observers feared the new administration might delete or censor key environmental data, a worry that mobilized an influx of volunteers eager to participate in the end-of-term web crawl. The Trump campaign didn’t respond to a request for comment.

The End of Term Web Archive aims to allay such concerns by making documentation that may no longer be found on the live web openly accessible. There is no clear federal mandate to preserve data, Jacobs notes, and agencies can apply or ignore the law as they see fit.

“By preserving as many federal websites as possible every four years and making them easily accessible, we make sure that everyone can still get to all of that information online,” says Malea Walker, a reference librarian in the serial and government publications division of the Library of Congress, an End of Term Web Archive partner. Others include the University of North Texas Libraries, the U.S. Government Publishing Office, the National Archives and Records Administration and Common Crawl, a nonprofit that crawls the web and makes its archives and datasets available to the public for free.

Through The Web Wormhole

Exploring prior End of Term Web Archives feels like stepping through a portal to the past. Pages scraped from the official White House website during George W. Bush’s two terms, for instance, capture the moment Samuel Alito got sworn in as a U.S. Supreme Court justice and messages Bush delivered to U.S. troops on the ground in Iraq during the war there. Official pages from the Bush era also offer a fascinating look at the web’s evolution. They feature fewer and far smaller images than current webpages typically do, and they have narrower columns of smaller, more crowded text than today’s airier web design fashions dictate.

The National Archives and Records Administration, the government agency charged with preserving and documenting government and historical records, conducted the first large-scale capture of the federal web in 2004 at the end of Bush’s first term, preserving a little under 6.5 terabytes of data.

In 2008, when NARA decided not to continue the archival work, other organizations stepped in. That year’s end-of-term crawl collected more than 15 terabytes, and the data grabs have steadily grown since. The 2020 crawl amassed more than 260 terabytes of data, and the 2024 crawl, which started just a few weeks ago, has already archived about 150 terabytes (roughly 25,000 HD movies, at about 6 gigabytes apiece).

“And we’re just getting started,” Graham tells me. Between the ever-ballooning web and the project’s expanded emphasis on video, with its larger files, “I’m confident this one will be larger than any of them,” he says. “It may even be larger than all of the prior archives put together.”

How You Can Join The Effort

Heritrix, the Internet Archive’s open-source web-crawling software, will continue indexing government websites through next year. Once Heritrix crawls a site, pages appear on the Wayback Machine within hours, though cyberattacks on the Internet Archive this month have caused stalls and intermittent hiccups.
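
For readers who want to verify that a page has made it into the archive, the Wayback Machine exposes a public availability API. Here is a minimal Python sketch, with a hypothetical .gov page standing in for a real target:

```python
# Query the Wayback Machine's public availability API for the most recent
# capture of a page. The example .gov URL is hypothetical.
import json
import urllib.parse
import urllib.request

def latest_snapshot(url: str):
    """Return the URL of the newest Wayback Machine capture, or None."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api, timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

if __name__ == "__main__":
    page = "https://www.epa.gov/climate-research"  # hypothetical example page
    print(latest_snapshot(page) or f"No capture found for {page}")
```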

A digital data grab this massive couldn’t happen without machines, but there’s a very human component at play here. Via an online tool, project members nominate links they think should be preserved — URLs buried deep within government websites, social media feeds, PDF files. Through March 31, 2025, the general public can offer suggestions, too, and recommend how data that’s already been nominated should get prioritized.

These nominations “allow us to empower individuals, whether they are librarians, researchers, activists or journalists, to identify resources that are important to their work,” says Mark Phillips, associate dean for digital libraries at the University of North Texas and a longtime member of the End of Term Web Archive squad. The team’s outreach has centered on one message: human involvement, and human discernment, matter.

Once an archive is assembled, the Library of Congress holds one copy, and the University of North Texas hosts another. This year’s archive may prove the largest yet, but the effort to collect it, so far at least, hasn’t been driven by the urgency that followed Trump’s 2016 victory.

“That was the time that people understood why libraries see this as so important,” says Brewster Kahle, the Internet Archive’s founder. “When there was a declaration that the new administration was going to take a very different stance on women’s health, climate issues, all of these things, there was an uprising of people to try to help the end-of-term crawl.”

One way they did so was at “data rescue events” organized by the Environmental Data and Governance Initiative, a project partner and network of professionals who contextualize and analyze changes to environmental data and governance practices.

“They would get people together and they would look through the government web and try to find environmental data that they knew needed to be collected and preserved for the long term,” Jacobs says, “and so we collected a lot more data than in other years because it was targeted.”

While this year hasn’t seen the kind of data rescue events that proliferated after the 2016 election, the mission of the End of Term Web Archive remains clear.

“We’re building on civil discourse,” Kahle says. “We’re building on an understanding of the land, of the rivers, of our history, of our proceedings. Let’s make sure that it’s able to be accessed.”