The Internet Archive on the future of the web - Protocol — The people, power and politics of tech
Anna Kramer
People

The internet is splitting apart. The Internet Archive wants to save it all forever.

The Internet Archive has grand ambitions for preserving the internet. But in order to do that, Big Tech has to stay out of the way.

Brewster Kahle, the founder of the Internet Archive, worries about how the splintering internet could end a golden age for the Internet Archive.

Photo: Internet Archive

The internet's first librarian likes to reminisce. The early internet is like a fantasy for the founder of the Internet Archive, a place he returns to over and over again in conversation when questions about the present turn dark or depressing. Brewster Kahle might know more about the early years of the web than anyone else.

He has occasion to talk about the Archive's beginnings perhaps more than he should these days. Discussing its future can at times be grim, or, at the very least, uncertain. The glories of the Wayback Machine, the petabytes of data capturing every day of human existence online in warehouses scattered across the world, the smooth system of crawlers marching from my Twitter to the homepage for the Russian government to Clubhouse in China: in the grand scheme of history, all of this could be an ephemeral golden age.

The so-called balkanization of the internet isn't just a theoretical problem for the Internet Archive. If internet firewalls stay up in China, Iran and Russia, new content continues to move mostly behind paywalls and passwords, and U.S. political leaders decide it's finally time for Section 230 to go, the crawlers whose simple formulas have preserved the last few decades for future historians might not do the same for more than the next few decades.

"There are more and more walled gardens where you can't go. We just have crawlers going at a crazy scale, and they can get blocked just like anybody can get blocked," said Jefferson Bailey, the Archive's director of web archiving and data services.

But even still, until someone or something fundamentally changes the rules of the web, the Internet Archive will keep doing what it's been doing since 1996: preserving every fragment of text you or I are ever likely to read. Tech's walled gardens might make it harder to get a perfect picture, but the small team of librarians, digital archivists and software engineers at the Internet Archive plan to keep bringing the world the Wayback Machine, the Open Library, the Software Archive, etc., until the end of time. Literally.

The balkanization of the internet

When Kahle was a student at MIT in the early '80s, he used a professor's ID to break into the Harvard Law library to access cases for a project. If there was a moment in his lifetime that encapsulated the closed nature of access to information before the internet, it was that.

But today, anyone can find the information he needed back then without so much as a library card. "Usually, things are very closed and locked down. Historically, this is a very rare moment," he said.

That could soon change, however. "Are we at risk of locking down? Yes, absolutely," he said. The Internet Archive is currently blocked in China, and occasionally in Russia, India and Turkey as well, and that's just at the whim of national governments that have the tools to enforce a block. According to Kahle and Bailey, corporations are just as capable of fracturing the web in ways that make it harder to access and archive; even "user lock-in" to a specific browser and set of products could one day create internet bubbles, and then walls, based on the products people pay for.

"The Facebooks and the Googles are taking over, and they want to make money," Bailey said. The more people act on the internet behind a password and the more the web becomes corporate, the more the open internet ethos fades away from the public consciousness, easing the way toward that splintering that Kahle fears.

"That's a strategic concern for everyone. Of course, it impacts archiving, too," Bailey said. The archive does its best to capture Twitter, Tumblr, Instagram, YouTube, Vimeo, Facebook and others. Facebook is the hardest, because the company is archiving-unfriendly in general, according to Bailey. But in reality, if any of these social companies decided they wanted to stop the Internet Archive from doing its job, they probably could, he said.

"We're embedded in the community," Bailey said. "At the end of the day, we're just a library."

Kahle fears that the eventual "walling" of the internet could develop in an incongruous place: from tech companies eager for regulation that would cement their own status by stifling future innovation. For example, almost any proposed change to Section 230, which protects website owners from legal liability for content created and posted by their users, would destroy the delicate legal framework that protects the Internet Archive's work (as well as Wikipedia and user-contributed projects), according to Kahle. Facebook's Mark Zuckerberg is among the many tech leaders to express support for a rewrite.

And tech companies, book publishers and even the music industry have lobbied to limit, change or even remove general copyright fair use exceptions, as well as specific copyright and use exemptions for libraries. Changes to these laws could (accidentally or intentionally, depending on who you ask) make it much harder for people to share their creative work online, and for groups like the Internet Archive to save it.

"Why are they doing this? Some people say it's money. But when you have oligarchies, it's really about protecting against new entrants in the market," Kahle said. At the end of the day, large companies have adapted to the current legal regimes, and they have the money and technical know-how to be able to advocate for stricter regulations that would allow them to preserve their monopolies while changing or limiting fair-use protections.

How the Internet Archive decides what to archive

Until the day these more existential problems firm into something Kahle can fight with more than words, the Internet Archive's day-to-day struggle is preserving the constantly transient web. Web pages have an average lifespan of about 90 days before they change or disappear, and so the Archive needs to capture those pages at a minimum of every 90 days to preserve a full picture of the web over time.
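The 90-day rule can be pictured as a simple scheduling check. The sketch below is only an illustration: the 90-day figure comes from the reporting above, while the function name, the data shape and the example URLs are hypothetical and say nothing about how the Archive's crawlers are actually built.

```python
from datetime import datetime, timedelta

# The 90-day figure is from the article; everything else here
# (names, data shape, example URLs) is a hypothetical illustration.
RECAPTURE_INTERVAL = timedelta(days=90)


def pages_due_for_capture(last_captured, now):
    """Return URLs whose most recent capture is at least 90 days old.

    last_captured maps a URL to the datetime of its latest capture,
    or to None if the page has never been captured.
    """
    due = []
    for url, captured_at in last_captured.items():
        if captured_at is None or now - captured_at >= RECAPTURE_INTERVAL:
            due.append(url)
    return sorted(due)


if __name__ == "__main__":
    now = datetime(2021, 3, 1)
    history = {
        "https://example.com/": datetime(2021, 2, 1),       # 28 days ago
        "https://example.org/news": datetime(2020, 11, 1),  # 120 days ago
        "https://example.net/new-page": None,               # never captured
    }
    print(pages_due_for_capture(history, now))
```

A page that has never been captured is treated as due immediately, which is one reasonable choice among several.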

The archivists employ three main strategies to capture most of what might be important for future historians. Bailey wouldn't guess exactly what percentage of the web they manage to preserve — "I'd look like an idiot," he said — because no one really can guess the size or scale of the internet. (Don't get there in your head, if you can avoid it. How would you even measure: by data size? Number of objects? Number of distinct URLs?) "There's no use being anxious over what's outside your control," he said.

The archivists start by considering the entirety of the web and seeking out the most important fraction. They capture a shallow outline of the entire internet (every single URL and associated homepage that's accessible), and then they dive deep into as many pages as possible for the top 5 million or so most-visited websites. This creates a fairly flat, bird's-eye view of the internet.
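To make the shallow-versus-deep distinction concrete, here is a rough sketch of a two-tier crawl plan. It is a toy under stated assumptions, not a description of the Archive's crawlers: the link graph is an in-memory dictionary standing in for real page fetches, and the function names and the fixed depth limit are hypothetical (the article says only that the team goes as deep as possible on top sites).

```python
from collections import deque


def crawl(start_url, links, max_depth):
    """Breadth-first crawl from start_url, following links up to max_depth hops.

    links maps each URL to the URLs it links to; it stands in for
    fetching and parsing real pages.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    captured = []
    while queue:
        url, depth = queue.popleft()
        captured.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return captured


def plan_captures(homepages, top_sites, links, deep_depth=2):
    """Capture only the homepage for most sites, and go deeper for top sites."""
    plan = {}
    for homepage in homepages:
        depth = deep_depth if homepage in top_sites else 0
        plan[homepage] = crawl(homepage, links, depth)
    return plan


if __name__ == "__main__":
    links = {
        "a.example/": ["a.example/about", "a.example/blog"],
        "a.example/blog": ["a.example/blog/post-1"],
        "a.example/blog/post-1": ["a.example/blog/post-1/comments"],
        "b.example/": ["b.example/contact"],
    }
    plan = plan_captures(["a.example/", "b.example/"], {"a.example/"}, links)
    for site, pages in plan.items():
        print(site, pages)
```

In this toy run, the hypothetical top site `a.example/` is followed two links deep, while `b.example/` gets a homepage capture only.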

To get a more three-dimensional picture, they seek other signals of importance, ranging from news aggregators to the entirety of a national domain (like Cuba, France, Somalia, etc.) when there is an important event, and even every single YouTube URL ever shared on Twitter (they can't capture all of YouTube, but at least they can capture the videos people deem important enough to share elsewhere).

And finally, other institutions can use the Internet Archive to build their own archiving services, usually creating specialized collections around topics like human rights or bioengineering. All of those collections are then copied back into the Wayback Machine, which is the publicly accessible version of the web archive.

Abbie Grotke, the web archiving team lead at the Library of Congress, has been involved in this work in one way or another for over 20 years. The Library of Congress's own archive is one of the special collections built in collaboration with Bailey, and it contains about 2.4 petabytes and over 18 billion objects, ranging from U.S. government websites to the most culturally important memes. Grotke has given her life to preserving the internet for the Library of Congress.

The work itself is technically an enormous task, but it boils down to one simple goal. "We're just trying to capture changes over time," she said.


Brewster Kahle is the internet's first librarian. Photo: Internet Archive


The Library of Congress began capturing websites in 2000, focusing mostly on political collections and on at-risk websites that might be taken down before they can be captured. "We're always sort of worried about, are we collecting everything we need to be collecting? Is there something we're missing?" said Amber Paranick, one of the Library of Congress's reference librarians. But this problem isn't that different because it's digital: "That's always the dilemma of the librarian."

The web archive alone is about 45 petabytes, or 45,000 terabytes, and the Internet Archive itself is about double that size (the group has other collections, like a huge database of educational films, music and even long-gone software programs).

It's impossible to conceptualize actually usable, accessible data at that scale, let alone make it text-searchable. So while the Archive has some projects to use machine learning to identify some images, like pictures of horses, Bailey likes to think about the odd, unimaginable applications that have emerged and how they foretell grander uses in the future.

The Wayback Machine has evolved to play an important role in patent litigation, for example. People fighting over patent ownership look for what's called "prior art," which indicates who might have first thought of a product. In one case, when two people were disputing who first created a specific design for hubcap rims, one was able to prove their ownership by finding an old website that had been archived in the Wayback Machine.
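For readers curious what such a lookup involves, the sketch below queries the Wayback Machine's public availability endpoint for the snapshot closest to a given date. It is a minimal illustration, not legal tooling: the helper names are hypothetical, and whether a particular capture counts as prior art is a question for a court, not a script.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the lookup URL; timestamp is digits such as YYYYMMDD."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return (timestamp, snapshot_url) from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(url, timestamp=None):
    """Ask the Wayback Machine for the capture closest to timestamp."""
    with urlopen(build_query(url, timestamp), timeout=10) as reply:
        return closest_snapshot(json.load(reply))


if __name__ == "__main__":
    # Requires network access; example.com is a placeholder site.
    print(lookup("example.com", "20060101"))
```

Separating the URL builder and the response parser from the network call keeps the two pure functions easy to check without touching the live service.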

And there are other use cases, too: The people building open-source translation tools at Mozilla have found the Internet Archive's collection of websites in multiple languages useful for training those tools. There is very little printed or digitized material that has large amounts of the same text in two languages, but many official websites do, which can help build quality translation tools for "minor languages," such as English-Swahili translation, according to Bailey.

The future of our histories

When I asked Kahle how he thinks about preserving today for historians centuries away, he grew philosophical. He sent links in the Zoom chat, first to the Google doc for a book he wrote, then a Nation piece, then a long blog post he wrote in 2015. By the time we ended the call, I had piles of reading material, most of it dense, most of it dated.

There's value to all of this history, he told me. "What we're able to do now is know about your individual history. We're able to get to the specificity of the historical record. Which I think is going to really be engaging in 100 years' time. What would you give for a video of your great-grandmother? It would just give you this ballast, it would give you an anchoring, that we right now lack," he said. "We're living in the perpetual present, and that is dangerous." Kahle believes our history makes us better people, and gives us better knowledge. But history isn't financially lucrative.

Social media companies want us to focus on tomorrow, not on the posts we made a year ago. Publishers do, too. HarperCollins is suing the archive to try to prevent it from sharing out-of-print books in its digital library, arguing that publicly sharing out-of-print books is a massive violation of copyright laws. While at first it might seem odd that publishers would care about books that aren't in print anymore, for companies whose business depends on people buying new things, archiving so that people can focus on the past is not in their financial interest.

"They are erasing the past through every legal and political means they can," Kahle said.

If the balkanization of the internet can be prevented, the Internet Archive could transform the way we learn about larger historical moments, Kahle said. History books and historians are limited to a few textual works, mostly by the powerful people of the time. With the Internet Archive, the everyday history will become suddenly accessible to those studying our time. Imagine if each of us could look back on our great-grandparents and know what they said or thought at age 15, and then 25, and 50. The Archive would allow that.

The Archive could also force historians to become professional data miners. "There will be a lot of these comparison studies at a much larger scale in the future — every tweet from every president in 30 years. Longitudinal analysis could be done with petabytes of data," Bailey said. The research questions themselves may not change much; they will just stretch over bigger timelines and larger comparisons.

"We're in the process of building macroscopes," Kahle said.

Caught in a golden age

More than 1 million people use the Internet Archive every day. Most of them seek out the Wayback Machine, but people also read the digitized books in the Archive's Open Library, or watch movies from the huge archive of public domain films.

"We love the dreamers, the people who come to this new medium with their ideas. The dreams are important to archive, whatever happens," Kahle said. Despite the existential threats to his work and to the values of the open internet, Kahle wants to be hopeful.

"Those who want to monopolize the internet are very well-funded. We need to communicate and deliver the value of openness. Am I optimistic we can do that? I'd say yes. But it's based on an enormous number of people wanting it to happen," he said.

"Some believe that people will only do things if you pay them, others that people are just sheep," Kahle said. "None of that is true. They may not be interested in the same things, but when we look at what people produce on the internet, if it's about the things they care about … They'll prove you wrong in a nanosecond."

Anna Kramer

Anna Kramer is a reporter at Protocol (@anna_c_kramer), where she helps write and produce Source Code, Protocol's daily newsletter. Prior to joining the team, she covered tech and small business for the San Francisco Chronicle and privacy for Bloomberg Law. She is a recent graduate of Brown University, where she studied International Relations and Arabic and wrote her senior thesis about surveillance tools and technological development in the Middle East.
