Release Notes and Manual for Version 1.0.
Marc Verhagen, November 2007
The Tarsqi Toolkit (TTK) is a set of components for extracting temporal information from newswire text. TTK extracts time expressions, events, subordination links and temporal links; in addition, it ensures consistency of temporal information. See http://tarsqi.org for more general information on the TARSQI project and for descriptions of TTK and its components. The Tarsqi Toolkit comes bundled with the Tango annotation tool (http://timeml.org/site/tango/) and a graphical user interface.
This manual contains the following sections:
The Tarsqi Toolkit is copyright ©2007 of Brandeis University and is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
The Tempex module is copyright of The MITRE Corporation and is distributed under the license in tempex-license.pdf.
The Python wrapper for the TreeTagger (treetaggerwrapper.py) is copyright ©2004 of CNRS and distributed under the GNU-GPL Version 2. It was developed by Laurent Pointal.
The data in data/in/TimeBank are copyrighted by the various content providers and can be used for academic purposes only.
The toolkit requires at least version 2.3 of Python and version 5.8 of Perl. Older Perl versions may work. The toolkit has been tested on the following platforms:
XML::Parser module. This is problematic for those who use OS X because the version of Perl that is bundled with OS X does not contain XML::Parser. You have several options here. One is to download and install XML::Parser from CPAN. Another is to install ActivePerl from ActiveState.
The toolkit graphical user interface requires the wxPython package.
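The Python and Perl requirements above can be checked before installation. The following is a minimal sketch, not part of TTK: the function names are hypothetical, and it assumes that loading the module with `perl -MXML::Parser -e 1` is an adequate test for XML::Parser.

```python
import subprocess
import sys

MIN_PYTHON = (2, 3)  # minimum Python version stated in this manual


def python_ok(version_info=sys.version_info):
    """Return True if the given interpreter version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def perl_has_xml_parser(perl="perl", run=subprocess.call):
    """Return True if `perl -MXML::Parser -e 1` exits with status 0.

    `run` is injectable so the check can be exercised without Perl installed.
    """
    try:
        return run([perl, "-MXML::Parser", "-e", "1"]) == 0
    except OSError:
        # The Perl executable itself could not be started.
        return False
```

Calling `perl_has_xml_parser("/usr/local/ActivePerl-5.8/bin/perl")` would test a specific Perl installation instead of the one found on the path.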
Note to Windows users.
There is currently no neatly packaged Windows version of TTK. However, most code is written to be cross-platform and the toolkit can be made to run on Windows (albeit with some effort). We are currently working on a Windows package that better integrates the part-of-speech tagger and the MaxEnt classifier. A Windows-friendly version will be released as soon as possible.
This is a three-step process: (i) unpacking the archive, (ii)
installing the part-of-speech tagger and (iii) setting up TTK for your
platform and environment.
% gunzip -c ttk-1.0.tar.gz | tar xp
This will unpack TTK into a directory named
and this directory needs to have sub directories
Other POS taggers can be used instead of the IMS TreeTagger. The easiest case would be a tagger that uses the same input format as the TreeTagger and creates files in the same output format. In that case, only the tag_fragment method in
will need to be edited according to your needs. A future version of the toolkit will make it easier to install other taggers.
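When swapping in another tagger, it can help to verify that its output really has the same shape as the TreeTagger's. The sketch below assumes that shape is one token per line with tab-separated token, part-of-speech tag and lemma; this is an assumption about the format, the function name is hypothetical, and the code is not TTK's tag_fragment method.

```python
def parse_tagger_output(text):
    """Parse tab-separated tagger output into (token, pos, lemma) tuples.

    Lines with fewer than three fields are padded with None, so a tagger
    that omits lemmas still yields three-element tuples.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        fields += [None] * (3 - len(fields))  # pad short lines
        rows.append(tuple(fields[:3]))
    return rows
```

A replacement tagger whose output parses into the same tuples as the TreeTagger's is a good candidate for the easy case described above.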
ttk-1.0/code and running the setup.py script. The toolkit comes bundled with classifier binaries and the setup script installs the correct version. Classifier binaries are included only for Mac OS X and Linux. The script also guesses where to find a Perl executable that is sufficient for TTK (that is, one that includes an XML parser). In general, it will simply use perl as the Perl command unless it finds an ActivePerl distribution. This default can be overridden either by supplying arguments to the setup script or by editing the file settings.txt. The two most likely ways to use the script are:
% python setup.py platform=linux
% python setup.py platform=osx perl=/usr/local/ActivePerl-5.8/bin/perl
The first example sets up the classifier for Linux and sets the Perl path to perl (unless specified otherwise in settings.txt). The second installs the classifier for OS X and sets the Perl path to the given value. You can also use the perl switch to point to other non-standard Perl locations. See the documentation in setup.py for more details. A future version of TTK will add Windows as a supported platform.
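The choice between the two invocations above can be automated. The following is a rough sketch under stated assumptions: the helper name and the mapping from Python's sys.platform values to the platform argument are illustrative, and only the two platform values shown in this manual (linux and osx) are used.

```python
import sys

# Platform values shown in this manual; no other values are assumed to exist.
PLATFORM_BY_SYS = {"linux": "linux", "darwin": "osx"}


def setup_command(sys_platform=sys.platform, perl=None):
    """Build an argument list for setup.py from the host platform."""
    for prefix, name in PLATFORM_BY_SYS.items():
        if sys_platform.startswith(prefix):
            args = ["python", "setup.py", "platform=" + name]
            break
    else:
        # No bundled classifier binaries for this platform.
        raise ValueError("no bundled classifier for " + sys_platform)
    if perl:
        args.append("perl=" + perl)  # optional explicit Perl path
    return args
```

The returned list can be passed to subprocess.call from within the ttk-1.0/code directory.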
code directory of the distribution and type
python tarsqi.py <input_type> [flags] <infile> <outfile>
python tarsqi.py <input_type> [flags] <indir> <outdir>
The following input types are defined:
simple-xml: An input type that should be used for default XML. It assumes that a document contains a tag named TEXT that wraps the data that need to be parsed and that the data have not yet been processed in any way. The tag that wraps the data can be overridden using the content_tag flag (see below). This input type can be used for the files in
timebank: This input type is very similar to simple-xml; the main difference is that it activates a component that parses document creation times for the various TimeBank formats. This input type can be used for the files in code/data/in/TimeBank, which consists of all TimeBank files with all tags (except some document level tags) stripped out.
rte3: Use this input type when processing pre-processed data from code/data/in/RTE3, which contains data from the Third Pascal Textual Entailment Challenge.
Flags are feature-value pairs where the feature and value are separated by an equals sign. The following flags are defined:
extension=String: Restricts which files are processed, which is useful when processing an entire directory. The default is the empty string, which matches any extension.
trap_errors=(True|False): Determines whether errors inside components are trapped. The default is that errors are not trapped.
content_tag=String: Can be used to override the default content tag of the input type.
pipeline=String: Can be used to override the default pipeline determined by the data source identifier. A pipeline is a comma-separated string of component names. Allowed component names are PREPROCESSOR, GUTIME, EVITA, SLINKET, S2T, BLINKER, CLASSIFIER and LINK_MERGER. The order of the components in the pipeline specification is significant. Some examples are:
The first example instructs TTK to take a file, preprocess it and add time expressions and events. For the second example, preprocessing, times and events are taken for granted and only links are added.
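The input types, flags and component names listed above can be checked before tarsqi.py is called. The sketch below is a hypothetical helper, not part of TTK: it assembles the documented command line as an argument list and rejects names that do not appear in this manual. The file names and the pipeline value in the usage note are illustrative only.

```python
INPUT_TYPES = ("simple-xml", "timebank", "rte3")
COMPONENTS = ("PREPROCESSOR", "GUTIME", "EVITA", "SLINKET",
              "S2T", "BLINKER", "CLASSIFIER", "LINK_MERGER")
FLAGS = ("extension", "trap_errors", "content_tag", "pipeline")


def tarsqi_command(input_type, source, target, **flags):
    """Build `python tarsqi.py <input_type> [flags] <in> <out>` as a list."""
    if input_type not in INPUT_TYPES:
        raise ValueError("unknown input type: " + input_type)
    args = ["python", "tarsqi.py", input_type]
    for name in sorted(flags):
        if name not in FLAGS:
            raise ValueError("unknown flag: " + name)
        value = str(flags[name])
        if name == "pipeline":
            # Every comma-separated component must be a documented name.
            unknown = [c for c in value.split(",") if c not in COMPONENTS]
            if unknown:
                raise ValueError("unknown components: " + ", ".join(unknown))
        args.append(name + "=" + value)  # feature=value, as described above
    return args + [source, target]
```

For instance, `tarsqi_command("simple-xml", "in.xml", "out.xml", pipeline="PREPROCESSOR,GUTIME,EVITA")` yields an argument list that can be passed to subprocess.call. Because the order of components is significant, the helper keeps the pipeline string exactly as given.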
% pythonw gui.py
% python gui.py
python otherwise. In both cases the wxPython package needs to be installed. The GUI has three advantages over using the command line version:
There is no separate manual for the GUI, but usage should be pretty straightforward. Functionality can be summed up as follows:
data/in/User directory, which is then selected as the input file.
pydoc command. Unfortunately, this command crashes on many of the toolkit modules. To create browsable documentation in ttk-1.0/docs/code you can use the
% cd ttk-1.0
% python make_documentation.py
This creates an index.html with a list of links to all modules. For each module, and each class and function in that module, the documentation strings are printed. There are also links to the source code of each function.
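The kind of information described above can be gathered with Python's standard inspect module. The following is a minimal sketch of that idea, assuming nothing about how make_documentation.py is actually implemented; the function name is hypothetical.

```python
import inspect


def collect_docstrings(module):
    """Return {qualified name: docstring} for a module and for the
    classes and functions defined in it."""
    docs = {module.__name__: inspect.getdoc(module) or ""}
    for name, obj in inspect.getmembers(module):
        if not (inspect.isclass(obj) or inspect.isfunction(obj)):
            continue
        if getattr(obj, "__module__", None) != module.__name__:
            continue  # skip names imported from other modules
        docs[module.__name__ + "." + name] = inspect.getdoc(obj) or ""
    return docs
```

Writing the resulting dictionary out as HTML, one page per module plus an index, would give output similar in spirit to what is described above.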
The Tango annotation tool was developed by Linda van Guilder, Andrew
See, Bob Knippen and Alex Baron.
If you have problems installing the toolkit or if you want to report a
bug, please send an email to email@example.com. When reporting a
bug, please tell us what platform you're using (including Perl and
Python versions) and include a file that illustrates the errant
behavior. A database with known issues will be made available on the
Suggestions, criticisms, disappointments, feature requests and kudos
are also welcome at the above address.
9. Future Work
The next major revision of the TARSQI Toolkit will be numbered 1.1 and will
be released in early 2008. Minor revisions will be released with
version number 1.0.X and will concentrate on bug fixes (as well as on
providing a non-problematic Windows version).
The following major changes to the code base are now in progress or under consideration: