January 23, 2013; Source: MIT Technology Review
Move over, Google? Back away, Bing? Innovative researchers and business developers seeking to mine the Web's data deeply require access to a robust, intelligent Web crawler and gargantuan storage capacity to create a mirror of the Web, something only a few tech giants like Google and Microsoft have mustered, until now. The new nonprofit Common Crawl's powerful Web crawler has harvested more than five billion pages, held on 81 terabytes of servers, opening the door at minimal cost to entrepreneurs and curious minds who see the Web as one giant database and a golden opportunity.
University of California, Santa Barbara professor Ben Zhao, who uses Web data to study social sites, notes that Common Crawl compiles data in a new and useful way. “Fresh, large-scale crawls are quite rare, and I am not personally aware of places to get large crawl data on the Web,” he says. Common Crawl founder Gilad Elbaz is considered a visionary entrepreneur with several inventions to his credit. He co-founded Applied Semantics, whose ubiquitous AdSense technology is now responsible for a large chunk of Google’s revenue and profits.
Elbaz saw a fundamental need for affordable access to data and filled it. He tells the MIT Technology Review, “The Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top…But simply doing the huge amount of work that’s necessary to get at all that information is a large blocker; few organizations…have had the resources to do that.” Note that the Internet Archive, also a nonprofit, is somewhat similar in that it offers users a copy of the Web at a chosen point in time via its “Wayback Machine.” However, the Internet Archive, which has been described as a “clunky” database, doesn’t enable users to analyze an enormous database of the Web in the way that Common Crawl does.
For $25, anyone with an Amazon cloud computing account can tap into Common Crawl data and start building in any number of unforeseen ways, perhaps fulfilling needs that Web users would not have thought possible. For instance, TinEye is a “reverse search engine” that uses Common Crawl to find images related to ones that users upload. Another Common Crawl-based startup, Lucky Oyster, helps people make sense of their social data.
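For readers curious about the mechanics: Common Crawl publishes its crawl archives in a public Amazon S3 bucket named “commoncrawl,” so an individual crawl file can be fetched over plain HTTPS once its relative path is known. The brief sketch below assumes that bucket layout and uses an illustrative, made-up file path (not a real archive) simply to show how such a download URL is assembled:

```python
# Minimal sketch, assuming Common Crawl's public S3 bucket "commoncrawl".
# The example path passed in below is hypothetical, for illustration only.

CC_BUCKET_URL = "https://commoncrawl.s3.amazonaws.com"

def commoncrawl_url(relative_path: str) -> str:
    """Join a crawl-data relative path onto the public bucket URL."""
    return f"{CC_BUCKET_URL}/{relative_path.lstrip('/')}"

# Hypothetical crawl archive path:
print(commoncrawl_url("crawl-data/example-segment/warc/example.warc.gz"))
```

From there, any S3-aware tool or plain HTTP client could retrieve the file; the $25 figure cited above refers to the Amazon computing resources a user might rent to process the data, not to the data itself.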
Last year, Common Crawl’s code contest recognized several other innovators making sense of the inexhaustible Common Crawl data in unique ways, including the Online Sentiment Towards Congressional Bills project, which contest judge Peter Warden lauded for a sort of “Occupy Congress” aggregation of disparate conversations about the Hill on the Web: “This work can highlight how ordinary Web users think about bills in Congress, and give their views influence on decision-making,” Warden said. “By moving beyond polls into detailed analysis of people’s opinions on new laws, it shows how open data can ‘democratize’ democracy itself!”
According to Zhao, proprietary concerns about usurpation of data by social media may hobble one of the more promising aspects of Common Crawl’s utility. “Social sites are quite sensitive about their content these days,” Zhao says, “and many implement anti-crawling mechanisms to limit the speed at which anyone can access their content.” Social networking sites such as Facebook and Twitter, as well as other sites whose value is based on data, could see Common Crawl-based startups as competitors. What do you think about Common Crawl’s future? Will it be squelched by powerful Internet interests, or will it enable us to use the Web in more interesting and helpful ways? –Louis Altman