Understanding Semantic Web Technologies
As the amount of information in enterprise databases and online data stores expands exponentially each year, enterprises face the very real problem of sifting through it all and sharing it among disparate systems and end users.
Enter semantic web search technology.
The problem is that as the volume of information and the number of systems increase, traditional index search methods become less effective. A cadre of technologists sees hope in the form of semantic technology, a non-proprietary way of categorizing and connecting data with contextual information to make it easier to organize and search. However, many executives simply do not know what semantic technology really is, and the idea of implementing it seems about as complex and indecipherable as hieroglyphics were before the Rosetta Stone was discovered.
“Semantic technologies are early in their maturity and market adoption,” Gartner analysts wrote in a report. “Many organizations will struggle to understand semantic approaches and view such technology as ‘bleeding edge,’ avoiding it because they are risk-averse.”
Though some executives are scratching their heads over what semantic technology is exactly, Gartner believes it has the potential to help mainstream enterprises with the growing information management problem. At the Gartner Emerging Trends and Technologies Roadshow in May, analysts with the research firm said semantic technology would be among the 10 most disruptive technologies in the next four years.
The Need for Something Different
As Ted Friedman, an analyst with Gartner, explains, IT organizations are increasingly being tasked to help users share information on the Web and within the enterprise. “Sharing of information is getting more and more important all the time for organizations as they try to achieve greater levels of productivity, agility and simply make their organizations more effective,” he says. Unfortunately, information sharing remains a major challenge for organizations, Friedman adds. “They’ve got these silos of information across the business. People don’t really understand where information resides, what it looks like, what it means, or the semantics around it. And as such, they find it difficult to share. In effect, they are talking different languages.”
Within the enterprise, information might be stored in different forms with different contexts and different terminologies surrounding it. For example, a widget might be referred to with two different SKUs in two different databases, or an address might be written in two formats. “Basically, the amount of data is growing, not only in volume but the number of different sources it is coming from,” says Irene Polikoff, CEO of TopQuadrant, which produces tools for developing semantic applications. “It’s beyond the capabilities of the average organization to handle all of this. The pain level is pretty high and growing. It’s a balance between that pain level and any pain level associated with doing something new.”
A growing number of organizations are coming to the conclusion that they can’t really solve this problem using the same methods as before, so new methods are needed, Polikoff says. John Giannandrea of Metaweb Technology puts it simply. “One of the truisms of life is that human knowledge is massive,” says Giannandrea, who researched semantics for years at Netscape and has been helping to develop Metaweb’s open, shared database of freely available information. “It’s both messy and highly creative.”
The big word in semantics is “context,” says Lynda Moulton, analyst for Gilbane Group. The ultimate semantic engine, she notes, will allow you to pose a question in natural language and give you the precise answer. How does it do that? The search logic built into these engines has to account not just for the context of the content, but also for the question that’s coming through.
“It’s got to say, first, what is she really asking about? Then, it has to say what do we have out here that’s going to match this inquiry?” Right now, there are two ways computers can search for something, Moulton explains. “One is sequentially, literally taking what you’re looking for and going through until they find a match. Then, they say ‘Here it is,’ and it keeps going and going,” she says. “Of course, that’s really slow. And, then, there’s the old technique of indexing. So, all the instances of a given word are indexed in one list with pointers going back to the document and placed in the document where that word occurred, which makes the computer find stuff a lot faster.”
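The two approaches Moulton describes can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the sample documents and the function names `sequential_search` and `build_index` are invented here, and tokenizing by splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict


def sequential_search(documents, word):
    """Scan every document in order and record each place `word` occurs."""
    hits = []
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            if token == word:
                hits.append((doc_id, position))
    return hits


def build_index(documents):
    """Build an index: each word maps to a list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index


# Hypothetical sample documents, invented for illustration.
documents = {
    "doc1": "prices rose sharply last quarter",
    "doc2": "she planted a rose in the garden",
}

index = build_index(documents)

# Both calls find the same occurrences; the index answers with a single
# dictionary lookup instead of rescanning every document.
print(sequential_search(documents, "rose"))
print(index.get("rose", []))
```

The index costs one pass over the content up front, after which each query is a lookup rather than a full scan. That trade is what makes indexing faster for repeated searches.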
As search technology has matured, it has progressed from using simple indexing techniques to more sophisticated algorithms based on linguistics, she says. For example, the verb “to rise” might appear in past tense, future tense and so on, but the index search engine will still be able to recognize it. So an index search could search the word “rose” and assume that anything related to the verb “to rise” is appropriate, she explains.
“That has obvious problems because someone who types ‘rose’ might be looking for a flower. You go through all of these unsatisfactory results because the search engine made assumptions about how you were using language and it isn’t always appropriate, so you get results that aren’t relevant,” Moulton says. “This has been the trouble of computer scientists providing search engines to find smarter and better ways of indexing things contextually.”
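The “rose” problem is easy to reproduce. The sketch below assumes a tiny hand-written table of verb forms (`VERB_FORMS`, invented here); real engines rely on much larger linguistic resources, so treat this as a toy model of the assumption Moulton describes rather than how any particular product works.

```python
# Hypothetical lookup table mapping inflected forms to a base verb.
VERB_FORMS = {"rose": "rise", "risen": "rise", "rises": "rise", "rising": "rise"}


def normalize(token):
    """Reduce a token to its base form if the table knows it."""
    return VERB_FORMS.get(token, token)


def search(documents, query):
    """Return ids of documents containing any word with the same base form."""
    target = normalize(query.lower())
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(normalize(token) == target for token in text.lower().split())
    ]


# Hypothetical sample documents, invented for illustration.
documents = {
    "finance": "interest rates are rising again",
    "garden": "a red rose needs full sun",
}

# A gardener typing "rose" also gets the finance document, because the
# engine assumed the word was a form of "to rise".
print(search(documents, "rose"))
```

Nothing in the table records which sense of “rose” the user meant, so the engine cannot tell the two documents apart.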
Leaning on Ontologies
Semantic technology finds better ways of indexing contextually by creating what Moulton calls a “hierarchy of terminology,” which semantics experts call an ontology. “You’ve probably heard of taxonomies, which are just tree-structures of language. An ontology takes it to another level: rather than just having two dimensions, broader and narrower, you have an unlimited number of dimensions of how words or charts are related to each other,” Moulton explains. “I can say a steering wheel is a part of a car, a wheel is a part of a car, and an engine is a part of a car, [but] a carburetor is a part of an engine. That is a broader-narrower concept. If you take it to another dimension and say engines are systems within cars that drive cars forward, and carburetors are one of the components of engines, now, you have a whole new layer of relationships between the words.”
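Moulton's car example can be modeled as a set of named relationships between terms. The following is a minimal sketch, assuming a hypothetical `Ontology` class and the relation names `part_of` and `drives`, both chosen here for illustration; production ontologies are normally written in dedicated languages rather than ad hoc Python.

```python
from collections import defaultdict


class Ontology:
    """Terms connected by any number of named relationship types."""

    def __init__(self):
        # relation name -> subject term -> set of object terms
        self._relations = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, obj):
        self._relations[relation][subject].add(obj)

    def related(self, subject, relation):
        """Terms directly linked to `subject` by `relation`."""
        return set(self._relations[relation].get(subject, set()))

    def related_transitive(self, subject, relation):
        """Follow `relation` repeatedly, e.g. carburetor -> engine -> car."""
        seen, frontier = set(), [subject]
        while frontier:
            current = frontier.pop()
            for nxt in self._relations[relation].get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return seen


onto = Ontology()
# The broader-narrower dimension a taxonomy would also capture.
onto.add("steering wheel", "part_of", "car")
onto.add("wheel", "part_of", "car")
onto.add("engine", "part_of", "car")
onto.add("carburetor", "part_of", "engine")
# A second dimension a plain tree cannot express.
onto.add("engine", "drives", "car")

print(onto.related_transitive("carburetor", "part_of"))
print(onto.related("engine", "drives"))
```

Adding a new kind of relationship is just another call to `add` with a new relation name, which is the sense in which an ontology has an open-ended number of dimensions.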
Ontologies create a web of connections that can act as shorthand for the engine users employ to search quickly through information across data sources.
“We have quite a bit of interest from people who have lots of diverse content,” says Polikoff of TopQuadrant. “Let’s say you have a retailer with a catalog of many different items, ranging from refrigerators to carpets to electronics and so on. They need to be able to build this catalog and integrate other sources into it in a quick way. Ontologies can act less rigidly than data structures and this allows us to quickly build the model of their catalog and put data into the model and search quickly.”
This creates a new layer of metadata to make it easier to navigate through the information, Polikoff explains.
“The idea is that you have all these different data sources with different formats, and that’s a problem for people. So you have, let’s call it, the data-sources layer that exists in all kinds of organizations and enterprises,” she says. “This technology allows you to put a new layer on top of that. We can call it the semantic web layer that consists of certain kinds of models that allow one to map these different data sources to a common vocabulary. That gives you the power to provide very rich information spaces of many different sorts on which you can build many different applications.”
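The layering Polikoff describes can be pictured as a set of mappings from each source's own field names to a shared vocabulary. The sketch below is an assumption-laden illustration: the source names (`warehouse_db`, `web_catalog`), field names, and records are all invented, and real semantic tooling expresses such mappings in ontology models rather than Python dictionaries.

```python
# Hypothetical mappings: each source's field names -> shared vocabulary terms.
FIELD_MAPPINGS = {
    "warehouse_db": {"sku": "product_id", "descr": "product_name", "qty": "quantity"},
    "web_catalog": {"item_code": "product_id", "title": "product_name", "stock": "quantity"},
}


def to_common(source, record):
    """Rename a record's fields to the shared vocabulary, dropping unmapped fields."""
    mapping = FIELD_MAPPINGS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def find(unified_records, **criteria):
    """Search the unified records using only shared-vocabulary terms."""
    return [
        record
        for record in unified_records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical records from two differently structured sources.
records = [
    ("warehouse_db", {"sku": "W-100", "descr": "Widget", "qty": 40}),
    ("web_catalog", {"item_code": "W-100", "title": "Widget", "stock": 12, "promo": True}),
]

unified = [to_common(source, record) for source, record in records]
matches = find(unified, product_id="W-100")
print(matches)
```

An application built on `find` never needs to know that one source calls the identifier `sku` and the other `item_code`. Bringing in a third source means adding one more mapping, not rewriting the application.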
Making Data More Usable
The flexible, more robust connections between data created by ontologies are especially appealing to specialized verticals that must rely on making connections between disparate collections of data to make breakthroughs in their work. For example, employees at life sciences and drug research companies could do wonders if they had easier access to, and better knowledge of, little-known studies and information hidden in archives scattered around the organization.
“In life sciences, there is great need to integrate data sources. They have so much information about biological data, drug data, chemical data and so on,” Polikoff says. “Our customers there use our product to integrate their data and allow researchers and various scientists to search connections in data in a free-form way. So it is not determined how things could be connected; they are discovering connections as they browse and search things. [That’s] because lots of connections in science are discovered by chance. You collect so much data you have to bring it together and let people look at it critically.”
Other early adopters of semantics include law firms and other companies seeking to sift through mountains of court documents, government and intelligence agencies in need of a way to find a needle in the haystack of public and top-secret information, and even the banking industry. Some fraud-prevention companies have been using semantics to take transaction information collected by their systems to get a clearer picture of when and where fraud may be occurring.
“Semantic technology can give you the agility to take a lot of siloed data points and give you a holistic view across an enterprise,” says Ken Harris, vice president of product development for ACI Worldwide, a fraud-detection software developer that recently partnered with a semantics company called Metatomix to better integrate information collected by its antifraud software. “The real power in it is being able to take beyond just a taxonomy approach of just standard data or business-process flow and actually being able to take that to the next level of conceptual or theory-type models and applying those to the data that you actually have.”
Semantics helps ACI Worldwide create a “view of fraud and interact with the ever-changing environment of fraud in a way that is very powerful,” Harris says. The best way to solve this problem is to take fraud and apply logic to fraud and understand what the meaning of that is, he says. “This is the technology that is going to change that overnight.”
Many experts, however, do believe that it will take a little longer than overnight for semantics to make a real difference within most mainstream organizations.
“If I were to put a number of years (it depends on the general economic situation and a number of factors), I would definitely say it’s less than five years, probably less than three years before we see mainstream adoption,” Polikoff says. Gilbane Group’s Moulton is less enthusiastic about semantic technology’s near-term prospects. “We’re talking a decade or more for it to really work well. It’s like voice recognition. It’s just kind of creeping along and creeping along and it’s getting a lot better, but it’s still not everywhere. It doesn’t always work really well,” she says. “It’s the interface that’s the real issue. It’s not the technology. These are design problems more than technology problems.”
Many other obstacles must be conquered for semantic technology to really be picked up by the average enterprise. Foremost is the issue of developing ontologies.
“Ontologies really have to get built up. They get built up in two ways. One is through humans creating them for these sophisticated applications, and there are government agencies and professionals who do this,” Moulton says. “And the other way is through machines culling context and by learning how language is being used.”
A number of ontology languages and standards have already been created to help organizations and tools developers build up ontologies in a uniform manner. But some observers are critical of the current semantic ecosystem and language structure. Giannandrea of Metaweb believes these languages are unnecessarily complex. “It’s great that the schema can be malleable and contributed to,” he says. “But it’s not OK if, in order to do that, you’ve got to know ontology languages. There’s nothing fundamentally wrong with these ontology languages, and I understand the value. It’s just that you’re being asked to buy more than you need in order to get the benefits. You’ve got these markup languages like RDF or N3 or OWL, and you’ve basically got to buy into a whole tool chain before you can get these up.”
Metaweb is championing better integration with existing APIs and markup languages, and the company is using a wiki-style base of volunteers to build up a web of connections between public information in order to bypass the complication of these ontology languages. Limitations in most organizations’ database infrastructure create the biggest roadblock to the use of semantics within the enterprise, Giannandrea says.
“While there are some business standards being created for invoicing, and this and that transaction, the majority of the semantic meaning the companies are capturing within their databases is locked up in the database,” Giannandrea says. “If I have a personnel database, and it has people’s names, date of birth and managerial position, the schema for that and the meaning of those terms for a field value is basically unique to that database.”
A whole industry has been created to aid database cross-referencing and schema matching because that’s what it takes anytime an organization would like to connect two databases together.
“We talk to many CIOs who say this is complete madness. We have these tools that let the data flow back and forth, but we don’t know that the field might have the same meaning,” Giannandrea says. To create numerous connections between the data, there is a need for what’s called a “triple predicate-based” database.
“You’d have Arnold Schwarzenegger in the system and Maria Shriver in the system, and you’re going to represent that they’re married by adding ‘Arnold is married to Maria. Maria is married to Arnold.’ These sorts of triple predicate-based systems have existed since the 1960s, and there’s general agreement that if you’re going to represent structured but open-ended human knowledge, you’ve got to use them,” he says.
“The problem is that most relational databases are not appropriate for storing and operating on these scales. Column database stores, which are used for data warehousing, are also not appropriate for queries on these relational stores. So, the problem is, you need a new kind of database. And that’s a little bit of a problem if you’re an enterprise and have all of your data in a relational data store. People are beginning to recognize that. There are a lot of database researchers, and well-known people in the field are beginning to write about this.”
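A triple predicate-based store of the kind Giannandrea describes keeps every fact as a subject, a predicate, and an object. The sketch below is a minimal in-memory illustration, assuming a hypothetical `TripleStore` class and a wildcard-style `query` method invented here; it says nothing about how Metaweb or any production system stores or scales such data.

```python
class TripleStore:
    """Facts held as (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            triple
            for triple in self._triples
            if (subject is None or triple[0] == subject)
            and (predicate is None or triple[1] == predicate)
            and (obj is None or triple[2] == obj)
        )


store = TripleStore()
# The marriage is recorded in both directions, as in the example above.
store.add("Arnold Schwarzenegger", "is married to", "Maria Shriver")
store.add("Maria Shriver", "is married to", "Arnold Schwarzenegger")

print(store.query(predicate="is married to"))
print(store.query(subject="Maria Shriver"))
```

New kinds of facts need no schema change: adding a predicate such as “was born in” is just another triple. That openness is what a fixed relational table layout lacks.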
The vision of semantics is great; it just needs to be simplified in its execution, Giannandrea says. “You have to make this stuff acceptable,” he says. “We think the underlying idea is fantastic. Our computers need to be able to understand the concepts so that they can then do more for you. That’s a great vision. It’s just that the current tool chain is a little too academic and based too much in the realm of artificial intelligence.”