Just as most of us recognize "good looks" but can't explain what they're made of, most of us can't explain what constitutes a "language". Language is obviously important. The fundamental thing that sets apart a society or community is a shared language. Ask a parent of young children; their job gets much easier once the toddler can express its wants and needs in a language that does not include pointing, tears, screams or tantrums.
Is the world of work much different? Our productivity depends on language. Once we learn the language of a profession, supplier or workplace, we know how to partner and procure the resources we need to be productive. Sadly, language is also an inhibitor. Steep learning curves prevent others from contributing to our success, and the curves get steeper as businesses grow both the amount of information and the number of sources that provide it. IDC just completed two important studies that illustrate this point. In the first study, they attempted to size the digital universe and came to the conclusion that it had reached 281 million terabytes and would grow by a factor of ten by 2011. They also concluded that the number of information sources is growing 50% faster than the amount of information they contain. In the second study, they determined that employees spend 25% of their time searching for information, at a cost to their employer of $14k per employee per year. If we fail to apply technology to this problem, it is easy to see how the growth of information will be an increasing drag on the productivity of the workforce. But what technology?
Enterprise Search solutions such as Autonomy, FAST, Endeca or the Google Search Appliance are commonly applied to this problem. The state of the art allows businesses to connect a wide range of information sources, including business applications, email servers, shared file systems and internal websites, which is a good thing to be sure. In fact, we're starting to see consolidation and commoditization with Microsoft acquiring FAST and the advent of Open Source Search Solutions such as Nutch and Solr. However, commoditized or not, these solutions commonly don't understand the nouns, verbs and relations that are specific to the task or function a particular group needs to perform. In other words, the opportunity has shifted from connecting a wide range of information sources to making that information productive.
We believe the key to making that information productive is the understanding of language. Fortunately there is a strong, worldwide community of (computational) linguists and text engineers who have been building powerful and freely available technology. We can use this technology to mechanically analyze vertical or community languages used by specific groups of users of a search system. Using Cloud Computing and Software as a Service (SaaS) principles, we can build applications that are cost effective and can tackle many business opportunities. Openwater is one of a growing list of companies focused on this opportunity. We provide services to improve enterprise search. These include: Text Engineering services to build dictionaries of terms and common ways to refer to them (acronyms); Structurization services to discover structure in weakly structured or unstructured information; Contextual Analysis services to derive meaning from context; and Usage Analysis services to track the interaction (through applications) between users and search in order to tune and enhance relevancy. Using these services, businesses dramatically improve the productivity of customers, partners and employees. By delivering as a service and by taking advantage of open standards, we eliminate barriers (cost and complexity) and apply the technology to many business problems.
As an example, consider forums delivered over the internet. Members of a forum use a language with varying degrees of precision to ask and answer questions. As an application, the forum has a structure and an established set of rules or protocols to provide context. However, most of the important information is locked up in dialog expressed in the (natural) language of the community. Openwater applies text engineering services to discover the particular terms and acronyms used by this community. Does "CA" mean "California" or "Computer Associates"? It depends on the forum, and text engineering discovers the right answer. Structurization services discover the abstract, topic, question, responses and other metadata of any forum thread. Is a user posting or responding to a question about a specific product or problem? Index enhancers extract people, product names, problem descriptions, error codes, concepts and/or solutions. Contextual Analysis of all this information further enhances the description of the dialog. Is a given post a request for more configuration information? Is it "I have the same problem, anybody have a solution"? Is it a response to a request for additional information? Is it the solution? Is it a "Thank You" note? We're not going to claim that the machine always gets it right, so we allow user interaction (authoring or tagging) to further refine the description of a post or thread. We have observed the combination of these services increase the efficacy of search by over 400%. Equally important, these enhanced indexes allow for seamless inclusion of information in a knowledge infrastructure without the expensive and time-consuming editing, transformation or loading of content into a knowledge base. In short, technology makes the forum more productive and useful for those who are participating in the forum and for those who would like to benefit from their work and experience.
Of course an expert or moderator could mark up a forum in this way. However, this is an expensive use of very skilled resources that can't scale with more activity. In fact we did mark up a forum to create a benchmark for our algorithms. The results were nothing short of amazing. Without any help, our algorithms were able to classify forum posts into ten (!) different types, identifying the questions with 86% accuracy, the solutions with 78% accuracy and the other 8 types with 79% accuracy. These numbers are high enough that average forum members will correct the algorithmic mistakes. This application of technology makes better use of the information produced by users, and makes it easier for users to produce information. Leveraging these assets increases productivity and delivers competitive advantage. Let's explore the technology we use.
Within our content management, directory and business systems, we have important information about our products, versions, components, customers, employees, groups, opportunities and problems; these are valuable elements of a language. Text engineering leverages these systems to build dictionaries of terms and common ways to refer to them. These dictionaries can be applied to any document (structured or unstructured) to enhance its description and link it to other documents with shared properties. For example, a product catalogue exists on most websites, most IT organizations have an LDAP directory of employees, and most documents contain authors as well as names of products. Analyzing these three sources allows us to create a web of links between documents, products and people, not unlike the linking structure that Google uses to drive its PageRank algorithm.
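To make the idea concrete, here is a minimal sketch of dictionary-based tagging and linking. It is an illustration only, not a description of how Openwater's service is built: the dictionary entries, the document texts and the function names (build_matcher, tag_document, link_documents) are all hypothetical, and a real dictionary would be generated from a product catalogue or an LDAP directory rather than typed in by hand.

```python
import re
from collections import defaultdict

# Hypothetical dictionary: canonical term -> common ways to refer to it.
TERM_DICTIONARY = {
    "product:Widget Pro": ["Widget Pro", "WP"],
    "person:Jane Doe": ["Jane Doe", "jdoe"],
}

# Hypothetical documents, keyed by document id.
DOCS = {
    "d1": "Jane Doe reviewed the Widget Pro release.",
    "d2": "jdoe filed a bug against WP 2.1.",
    "d3": "Nothing relevant here.",
}


def build_matcher(dictionary):
    """Compile one case-insensitive pattern covering every known alias."""
    alias_to_term = {}
    for term, aliases in dictionary.items():
        for alias in aliases:
            alias_to_term[alias.lower()] = term
    # Try longer aliases first so "Widget Pro" is preferred over "WP".
    ordered = sorted(alias_to_term, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, alias_to_term


def tag_document(text, pattern, alias_to_term):
    """Return the set of canonical terms mentioned in the text."""
    return {alias_to_term[m.group(1).lower()] for m in pattern.finditer(text)}


def link_documents(docs, dictionary):
    """Map each canonical term to the ids of the documents that mention it."""
    pattern, alias_to_term = build_matcher(dictionary)
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tag_document(text, pattern, alias_to_term):
            index[term].add(doc_id)
    return dict(index)


LINKS = link_documents(DOCS, TERM_DICTIONARY)
```

Two documents that share an entry in LINKS are linked through that product or person, even when they use different names for it. Matching on word boundaries is a deliberately simple choice; ambiguous acronyms would need the contextual signals discussed elsewhere in this post.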
Documents and applications assign structure to content. Most documents have at least a title and an author, and this blog has headers and posters. The structure may be very explicit, as in Microsoft Office metadata, or it may be inferred from the presence of page boundaries and font changes. This structure provides valuable information about the document and can be used to enhance its description. When combined with text engineering, we get to ask more insightful questions: what products has this person blogged about? Clearly, transactional systems are useless without their structure. As we combine structured and unstructured information, structurization is a critical component of any search system.
Text engineering and structurization provide information that drives Contextual Analysis. For example, if the second-to-last post of a forum thread was created by an expert (karma > 1000) and was followed by a post with words stating or implying "thank you", Contextual Analysis infers that the second-to-last post is the answer. The same type of analysis enables far better Spam Detection than conventional methods.
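The rule in this example can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our production logic: the Post type, the THANKS_PHRASES list and the function names are hypothetical, and simple phrase matching stands in for the richer analysis needed to detect text that merely implies thanks.

```python
from dataclasses import dataclass
from typing import List, Optional

EXPERT_KARMA = 1000  # threshold taken from the example above

# Hypothetical phrases that state or imply "thank you".
THANKS_PHRASES = ("thank you", "thanks", "that worked", "that fixed it")


@dataclass
class Post:
    author: str
    karma: int  # karma of the author at the time of posting
    text: str


def implies_thanks(text: str) -> bool:
    """Crude stand-in for detecting a thank-you note."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in THANKS_PHRASES)


def infer_answer_index(thread: List[Post]) -> Optional[int]:
    """Return the index of the inferred answer, or None if the rule does not fire."""
    if len(thread) < 2:
        return None
    candidate, follow_up = thread[-2], thread[-1]
    if candidate.karma > EXPERT_KARMA and implies_thanks(follow_up.text):
        return len(thread) - 2
    return None


THREAD = [
    Post("newbie", 12, "My install fails with error 1603. Any ideas?"),
    Post("guru", 4200, "Clear the temp folder and rerun the installer as admin."),
    Post("newbie", 12, "That worked, thank you!"),
]
ANSWER_INDEX = infer_answer_index(THREAD)
```

Here the rule marks the post by "guru" as the answer. If the same reply came from an author with low karma, the function would return None and the thread would be left for other signals, or for users, to classify.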
Search is not a passive activity. Users refine searches and decide which documents are useful by using them. We add this information to our Contextual Analysis and share the results. We provide rich yet simple authoring tools based on Web 2.0 technology to make it really easy for users to interact with search and improve the results for their community.
Language is an underutilized asset in business. In fact, language is most often a barrier that must be overcome. Consider our toddler: once she learns to express herself, she will spend the next 20 years learning the basics of English BEFORE learning the language of a specific trade or profession. We should consider this the next time we deploy a system to service employees or customers. Is language a barrier? How can we use technology to bring down this barrier and better serve our employees, partners and customers?
At Openwater, we are working hard to bring this technology to your business. We built our service using open standards so that you benefit from the best minds in the business, allowing you to focus on leveraging your assets and servicing your employees or customers efficiently. To learn more, please visit www.openwaternet.com.