NYU and DARPA Dive Deeper Into the Web

By Samuel Greengard  |  Posted 2015-04-02

NYU received a grant from DARPA to develop more sophisticated methods to locate and explore hard-to-find information on the surface Web and the deep Web.

A growing problem for individuals, businesses and government agencies that are conducting research on the Internet is locating desired information quickly and easily. In many cases, search engines like Google and Bing are fine for basic tasks, but they may fail when a topic is complex or esoteric.

As Juliana Freire, a professor of computer science and engineering at the Polytechnic School of Engineering at New York University, puts it: "If you are looking for specific information, you often wind up with millions of results, and none of it is very relevant."

Freire is among a group of researchers attempting to change this. NYU recently received a $3.6 million grant from the U.S. Defense Advanced Research Projects Agency (DARPA) to develop more sophisticated methods to locate and explore hard-to-find information on the surface Web and on the deep Web, which standard commercial search engines typically do not index.

"An important focus of this project is to make it possible to search out information on the dark Web," she adds.

Program Focuses on Three Primary Areas

The three-year program, called Memex, focuses on three primary areas: domain-specific indexing of open, public Web content; domain-specific search capabilities; and Department of Defense (DoD) applications that could aid in battling crime or sharing strategic information within the military.

The project could deliver huge benefits for analysts, writers, researchers, attorneys, and government and law enforcement officials. But accomplishing the task will require a huge effort.

The sheer enormity of the Web, as well as its flat and broad structure, makes certain content difficult to find. What's more, it's often tough to determine who is publishing or producing dark Web content, the quality of the information and the motivations of the site operators.

Memex is not a search engine. It's open-source Java software that's designed to run on all major platforms, including Windows, Mac and Linux. (A mobile tool may come later.) The program focuses on improved algorithms and analytics for Web crawling.

Freire, who previously developed DeepPeep, a search engine for the National Science Foundation, says that one goal is to avoid the intensive development requirements of the vertical search engines that are used by specific industries or fields.

The system uses machine learning to classify and categorize topics and gauge their relevance. "As we crawl the Web, we discover new data sources for information, and we update the crawler," Freire says.
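The article does not describe Memex's internals, but the approach Freire outlines, scoring candidate pages for topical relevance and crawling the most promising ones first, is the classic "focused crawler" pattern. The Java sketch below is purely illustrative (the class names `RelevanceScorer` and `PageRef`, the keyword weights, and the example URLs are all hypothetical, not Memex's actual code); it stands in a simple keyword-weight scorer for a trained classifier and uses a priority queue as the crawl frontier.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;

public class FocusedCrawlSketch {

    // Stand-in for a trained relevance classifier: sums per-token weights.
    // A real system would use a learned model rather than hand-set weights.
    static class RelevanceScorer {
        private final Map<String, Double> weights;

        RelevanceScorer(Map<String, Double> weights) {
            this.weights = weights;
        }

        double score(String text) {
            double s = 0.0;
            for (String token : text.toLowerCase().split("\\W+")) {
                s += weights.getOrDefault(token, 0.0);
            }
            return s;
        }
    }

    // A URL paired with its predicted relevance score.
    static class PageRef {
        final String url;
        final double relevance;

        PageRef(String url, double relevance) {
            this.url = url;
            this.relevance = relevance;
        }
    }

    public static void main(String[] args) {
        RelevanceScorer scorer = new RelevanceScorer(
                Map.of("trafficking", 2.0, "escort", 1.5, "recipe", -1.0));

        // Frontier ordered by descending predicted relevance, so the
        // crawler visits the most promising pages first.
        PriorityQueue<PageRef> frontier = new PriorityQueue<>(
                Comparator.comparingDouble((PageRef p) -> -p.relevance));

        // Short snippets stand in for fetched page or anchor text.
        frontier.add(new PageRef("http://example.org/a",
                scorer.score("report on human trafficking networks")));
        frontier.add(new PageRef("http://example.org/b",
                scorer.score("a simple cookie recipe")));
        frontier.add(new PageRef("http://example.org/c",
                scorer.score("escort listings and trafficking indicators")));

        while (!frontier.isEmpty()) {
            PageRef p = frontier.poll();
            System.out.println(p.url + " score=" + p.relevance);
        }
    }
}
```

Under these toy weights, the page mentioning both "escort" and "trafficking" is dequeued first, which is the essence of focused crawling: spend the crawl budget where the classifier predicts relevance.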

One area the project is initially focusing on is human trafficking. Researchers are attempting to stretch beyond basic news and listings of non-governmental organizations by diving into escort sites, Web pages dedicated to delivering brides and other similar content. Memex could also deliver valuable information in areas such as antiterrorism, child pornography and drug enforcement.

"The task our group is addressing is how to make this process more efficient and less costly," Freire says. "[We want to] make the technology more usable to a wider group of individuals."

NYU is among four universities—along with Carnegie Mellon University, Stanford and the University of Southern California—that are tackling different aspects of the same project. Several private companies are also contributing to this research and development effort.


Samuel Greengard, a Baseline contributor, writes about business, technology and other topics. His forthcoming book, The Internet of Things (MIT Press), will be released this spring.

