The Alembic Workbench Environment for Natural Language Engineering
2. Project Overview
Information extraction is emerging as an important technology for the intelligence community in processing open source text. Information extraction (IE) can be described as the summarization of events and facts of interest from unformatted text into detailed database entries. The "Alembic" project at MITRE has developed a substantial initial capability towards an effective IE system, one that demonstrated impressive performance levels at the most recent Message Understanding Conference (MUC-6) and at the Multilingual Entity Task evaluation organized as part of the May 1996 TIPSTER conference. The Alembic system, like others, makes use of specialized heuristics that recognize and extract the relevant facts from texts. What is different and important about the Alembic system is its ability to mix hand-coded heuristic rules with machine-learned heuristic rules. The Alembic Workbench project is developing an information extraction porting environment in which the training data that drive this learning process are generated quickly by a mixed-initiative annotation process involving a human user and the machine learning mechanism itself. The project aims to construct an environment in which the user's knowledge of the domain is transferred quickly and effectively into training data and extraction heuristics through the cooperative activity of machine learning and evaluative feedback to the user.
Portability remains a major problem for information extraction systems. It takes multiple person-months for even the best IE systems to be modified and customized for a new domain. This process must include the creation of annotated training and test materials to support the development and refinement of the IE system. The objective of the Alembic Workbench project is to reduce the time needed to customize an IE system for a new domain to a matter of days. Many of today's systems are becoming not only more easily adaptable to support this process, but automatically (self-) adaptive through machine learning techniques. However, one important ingredient still requires a human in the loop: adaptive and adaptable systems need training and test data against which to measure and improve their performance. The knowledge engineering bottleneck is increasingly being transformed into a training data acquisition bottleneck. Our approach is to use an iterative bootstrapping procedure that interleaves human annotation and machine learning early in the process of generating training and testing data. By using partially accurate rules to annotate new text automatically, we believe that human data preparation can be simplified and streamlined to one of review and editing. In addition, the annotator can scrutinize and improve the generated rules themselves to further quicken the pace of performance improvement. In fact, the process of creating and reviewing the training data turns out to be the same process by which the information extraction heuristics are defined and refined. High-quality annotated text and high-quality information extraction heuristics are developed simultaneously.
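The bootstrapping procedure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the token-level data model, the function names, and above all the "learner", which merely memorizes reviewed labels and stands in for Alembic's rule-sequence learning. The human annotator is simulated by a lookup in a hypothetical gold lexicon.

```python
from typing import Callable, Dict, List, Tuple

Tagged = Tuple[str, str]  # (token, label); "O" means "not an entity"


def learn_rules(training: List[Tagged]) -> Dict[str, str]:
    """Toy learner (assumption): memorize the reviewed label of each token."""
    return {token: label for token, label in training}


def pre_annotate(tokens: List[str], rules: Dict[str, str]) -> List[Tagged]:
    """Apply the current, partially accurate rules to unseen text."""
    return [(token, rules.get(token, "O")) for token in tokens]


def bootstrap(
    batches: List[List[str]],
    review: Callable[[List[Tagged]], List[Tagged]],
) -> Tuple[Dict[str, str], List[int]]:
    """Interleave machine pre-annotation with human review, batch by batch."""
    training: List[Tagged] = []
    rules: Dict[str, str] = {}
    edits_per_batch: List[int] = []
    for tokens in batches:
        proposed = pre_annotate(tokens, rules)  # the machine goes first
        corrected = review(proposed)            # the human reviews and edits
        edits_per_batch.append(
            sum(1 for p, c in zip(proposed, corrected) if p != c)
        )
        training.extend(corrected)              # reviewed text becomes training data
        rules = learn_rules(training)           # retrain before the next batch
    return rules, edits_per_batch


# Simulated annotator for demonstration only: consults a hypothetical gold lexicon.
GOLD = {"MITRE": "ORGANIZATION", "Bedford": "LOCATION", "Smith": "PERSON"}


def simulated_review(proposed: List[Tagged]) -> List[Tagged]:
    """Return the labels a human reviewer would leave behind after editing."""
    return [(token, GOLD.get(token, "O")) for token, _ in proposed]
```

Calling `bootstrap(batches, simulated_review)` on a few tokenized batches returns both the learned rules and the number of labels the reviewer had to change in each batch. In a setting like the one described above, that count would be expected to fall as the rules improve, which is the sense in which annotation becomes review and editing.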
Our intent is to develop an environment in which there is synergy among the following activities:
All four of these components (text annotation, rule sequence application, machine learning of rule sequences, and evaluation and performance reporting) will be equipped with an API to enhance their value to other systems. The text annotation component will be carefully designed to make it as easy as possible to tag text with different kinds of markup. The API for this component takes the form of a TIPSTER-compliant document representation, in which parallel sets of annotation files are maintained separately from a source text file that is never modified directly. Further, this API will support the parsing and generation of SGML-formatted documents for representing this same annotation set. The rule sequence module may be invoked from external tools, and may be applied either to the parallel tag file format used by the tailoring environment or to any SGML document. The machine learning mechanism for inducing new rule sequences will eventually also be equipped with an API that will allow it to be applied to SGML training corpora developed independently of the tailoring tool. The evaluation and performance reporting tools built for use within the GUI will also be callable from external tools.
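The relationship between the parallel annotation representation and its SGML rendering can be sketched as follows. This is a hedged illustration in Python, not the workbench API: the `Annotation` class, its field names, and the `to_sgml` function are hypothetical, and the sketch handles only non-overlapping annotations and does not escape SGML metacharacters in the source text. The `ENAMEX` element with a `TYPE` attribute follows the MUC-6 named-entity markup convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: offsets into a source text that is never modified."""
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive
    tag: str        # SGML element name, e.g. "ENAMEX"
    type: str       # value of the TYPE attribute, e.g. "PERSON"


def to_sgml(text: str, annotations: List[Annotation]) -> str:
    """Render non-overlapping standoff annotations as inline SGML markup."""
    pieces: List[str] = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor or ann.end < ann.start or ann.end > len(text):
            raise ValueError("overlapping or out-of-range annotation")
        pieces.append(text[cursor:ann.start])  # untagged text before the span
        pieces.append(
            f'<{ann.tag} TYPE="{ann.type}">{text[ann.start:ann.end]}</{ann.tag}>'
        )
        cursor = ann.end
    pieces.append(text[cursor:])               # untagged text after the last span
    return "".join(pieces)
```

Because the annotations only carry offsets, several independent annotation sets can refer to the same source file, and an inline SGML view can be regenerated from any of them on demand.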
To date, after four months of work, we have an initial workbench capability. This is now being used to annotate closed-captioned text from video. We are also using the Alembic Workbench to perform named-entity tagging (people, organizations, locations) for Portuguese news texts. Early indications are quite positive. In the case of named-entity tagging, an amount of training text equal to that provided to participants in MUC-6 (approximately 35,000 words, with 3,500 tagged entities) has been annotated in 6-7 hours of effort over a period of a week, a rate of roughly 5,000 to 5,800 words per hour. We intend to begin a careful evaluation of the efficiency of the tool in the near future.
While the Alembic Workbench project is still in its early stages, we can already see directions in which we would like to further enhance this suite of synergistic language processing tools. A developing focus for Alembic is the rapid porting of all of its modules (pre-processing, part-of-speech tagging, phrase-tagging, parsing) to many different languages. Not only is this an important interest for intelligence agencies, it also allows us to concentrate our development on empirical methods for building language processing capability. Below are some of the ways in which the workbench could be enhanced to better support this multilingual emphasis.