Tempeval-2 Training Data — Release Notes
Version 1 - Marc Verhagen - March 11th, 2010
In this document, we describe the Tempeval training data. The description focuses on the English data, but the format for all other languages, with the exception of the French data, is essentially the same. Available in this release is the first batch of training data for Chinese, English, French, Italian and Spanish.
The unavoidable updates to this document will be published on the Tempeval2 pages at http://www.timeml.org/tempeval2/, which also contain a link to the Tempeval Google group.
Data description
Training data are given for five languages. Not all languages provide data for all tasks. The table below lists the contents of this distribution; the letters refer to the tasks in the Tempeval proposal.
| language | tokens | A | B | C | D | E | F |
| Chinese | 23,000 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| English | 17,000 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Italian | 26,000 | ✓ | ✓ | | | | |
| French | 10,000 | ✓ | ✓ | | | | |
| Spanish | 47,000 | ✓ | ✓ | ✓ | ✓ | | |
In addition, for French there is a small set of 3000 tokens with event, timex and temporal link annotation, but the temporal links are not split into subtasks.
This is the first of two training data releases. The next release is scheduled for March 21st and will, as far as we can see now, contain the following extras:
| Chinese | no more training data expected |
| English | data for all six tasks for an additional 35,000 tokens |
| French | increases in the set of 3000 tokens with event, timex and temporal link annotation |
| Italian | temporal relations from tasks C, D and E added to the current set of 26,000 tokens |
| Korean | 10,000 tokens with events and times |
| Spanish | temporal relations from tasks E and F added to the current data set |
The data are given in a set of vertical files with tab-separated rows. One file, named "base-segmentation.tab", contains the tokenized text. The other files are (1) an optional file with a lexical category for each token, (2) files with extents for events and timex3 tags, (3) files with attributes of events and timexes, and (4) files with temporal relations. Below is a fragment of the file with the tokenized text (throughout, we use the fragment "... of an impudent American whom Sony hosted for a year while..." as an example).
    wsj_0037 27 5 of
    wsj_0037 27 6 an
    wsj_0037 27 7 impudent
    wsj_0037 27 8 American
    wsj_0037 27 9 whom
    wsj_0037 27 10 Sony
    wsj_0037 27 11 hosted
    wsj_0037 27 12 *T*-58
    wsj_0037 27 13 for
    wsj_0037 27 14 a
    wsj_0037 27 15 year
    wsj_0037 27 16 while
Each token is uniquely identified by a file name, a sentence offset and a token offset. Both sentence and token offsets start at 0. For example, in the fragment above, "impudent" (sentence offset 27, token offset 7) is the eighth token in the 28th sentence of the file wsj_0037. The files in this example were derived from the Penn Treebank and contain the Treebank's empty categories, such as "*T*-58" above. Some languages provide a file with lexical categories for each token; its format is the same as that of the file with the tokens.
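As an illustration of this addressing scheme, here is a minimal Python sketch that loads token rows into a dictionary keyed by position. The helper name `read_tokens` is hypothetical, and the sketch assumes rows are separated by tabs exactly as described above; it is not part of the release.

```python
from typing import Dict, Iterable, Tuple

# (file name, sentence offset, token offset); both offsets are 0-based
TokenKey = Tuple[str, int, int]


def read_tokens(lines: Iterable[str]) -> Dict[TokenKey, str]:
    """Map each token position to its token string (hypothetical helper)."""
    tokens: Dict[TokenKey, str] = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            continue  # skip blank or malformed rows
        doc, sent, tok, text = cols[0], int(cols[1]), int(cols[2]), cols[3]
        tokens[(doc, sent, tok)] = text
    return tokens


# Rows copied from the fragment above, with tabs assumed as separators.
sample = [
    "wsj_0037\t27\t7\timpudent\n",
    "wsj_0037\t27\t8\tAmerican\n",
    "wsj_0037\t27\t12\t*T*-58\n",
]
tokens = read_tokens(sample)
print(tokens[("wsj_0037", 27, 7)])  # prints: impudent
```

The same function should accept an open file handle for "base-segmentation.tab", since iterating over a text file yields lines; the file encoding is not stated in these notes and would have to be checked per language.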
Events and their attributes are stored in two files: "event-extents.tab" and "event-attributes.tab".
    wsj_0037 27 11 event e561 1

    wsj_0037 27 11 event e561 1 aspect NONE
    wsj_0037 27 11 event e561 1 modality
    wsj_0037 27 11 event e561 1 polarity POS
    wsj_0037 27 11 event e561 1 tense PAST
Again, the first three columns uniquely identify a token position in the source. The other columns of "event-extents.tab" contain the tag name (always "event" in this file), the tag id, and the instance id. The tag id is unique to the file, and usually, but not necessarily, unique to the corpus. The instance id is there to deal with cases like "She plays soccer on Monday and Wednesday", where there are two instances of the "play soccer" event. In the vast majority of cases, there is only one instance and the value in the last column is "1". The file "event-attributes.tab" contains two more columns: an attribute name and an attribute value.
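The attribute rows can be folded into one record per event instance. The sketch below is one possible way to do that, assuming tab-separated rows with the eight columns described above; the function name `read_event_attributes` is hypothetical. An attribute row without a value, like the "modality" row in the fragment, is stored as an empty string.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (file name, event id, instance id)
EventKey = Tuple[str, str, int]


def read_event_attributes(lines: Iterable[str]) -> Dict[EventKey, Dict[str, str]]:
    """Group attribute rows by event instance (hypothetical helper)."""
    events: Dict[EventKey, Dict[str, str]] = defaultdict(dict)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # skip blank or malformed rows
        doc, eid, inst, name = cols[0], cols[4], int(cols[5]), cols[6]
        # A missing eighth column is treated as an empty value.
        value = cols[7] if len(cols) > 7 else ""
        events[(doc, eid, inst)][name] = value
    return dict(events)


# Rows copied from the fragment above, with tabs assumed as separators.
sample = [
    "wsj_0037\t27\t11\tevent\te561\t1\taspect\tNONE\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tmodality\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tpolarity\tPOS\n",
    "wsj_0037\t27\t11\tevent\te561\t1\ttense\tPAST\n",
]
events = read_event_attributes(sample)
print(events[("wsj_0037", "e561", 1)]["tense"])  # prints: PAST
```

Keying on the file name as well as the tag id follows from the remark that tag ids are only guaranteed to be unique within a file.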
Timexes and their attributes are stored in "timex-extents.tab" and "timex-attributes.tab".
    wsj_0037 27 14 timex3 t2 1
    wsj_0037 27 15 timex3 t2 1

    wsj_0037 27 14 timex3 t2 1 type DURATION
    wsj_0037 27 14 timex3 t2 1 value P1Y
These files use the same columns as their eventive counterparts. Note, however, that the attributes are associated with the first token of the timex extent, and that the instance number is always "1" for timexes.
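Because a timex can span several tokens while its attributes hang off the first one, it can help to collect the extent rows per timex and sort them. The following is a minimal sketch under the assumption of tab-separated rows; `timex_spans` is a hypothetical name and the small `tokens` dictionary stands in for the contents of "base-segmentation.tab".

```python
from typing import Dict, Iterable, List, Tuple

TimexKey = Tuple[str, str]   # (file name, timex id)
Position = Tuple[int, int]   # (sentence offset, token offset)


def timex_spans(extent_lines: Iterable[str]) -> Dict[TimexKey, List[Position]]:
    """Collect the sorted token positions of each timex (hypothetical helper)."""
    spans: Dict[TimexKey, List[Position]] = {}
    for line in extent_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # skip blank or malformed rows
        key = (cols[0], cols[4])
        spans.setdefault(key, []).append((int(cols[1]), int(cols[2])))
    for positions in spans.values():
        positions.sort()
    return spans


# Rows copied from the fragment above, with tabs assumed as separators.
extents = [
    "wsj_0037\t27\t14\ttimex3\tt2\t1\n",
    "wsj_0037\t27\t15\ttimex3\tt2\t1\n",
]
# Stand-in for the token table; only the two tokens of the timex are listed.
tokens = {("wsj_0037", 27, 14): "a", ("wsj_0037", 27, 15): "year"}

spans = timex_spans(extents)
positions = spans[("wsj_0037", "t2")]
first_token = positions[0]  # the position that carries the attributes
text = " ".join(tokens[("wsj_0037", s, t)] for s, t in positions)
print(first_token, text)  # prints: (27, 14) a year
```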
The four classes of Tempeval temporal relations are all stored with the same format. The associations between task identifier and file name are in the table below.
| Task C | tlinks-event-timex.tab |
| Task D | tlinks-dct-events.tab |
| Task E | tlinks-main-events.tab |
| Task F | tlinks-subordinated-events.tab |
The vertical files for the temporal relations all have four columns: (1) file name, (2) tag id of the first element in the relation, (3) tag id of the second element in the relation, and (4) the relation type. Here is a fragment of "tlinks-event-timex.tab", with the temporal relation between "hosted" and "a year".
    wsj_0037 e561 t2 overlap
For Task D ("tlinks-dct-events.tab"), the third column is always "t0", where "t0" is a special id that refers to the Document Creation Time (DCT). Note that the exact value of the DCT is not specified in the file with timexes, since it is not necessarily associated with an offset. Instead, the DCT timex is given for each file in a separate file named "dct.tab".
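The four-column layout is simple enough to read with a few lines of code. The sketch below is an assumption-laden illustration rather than an official reader: `TLink` and `read_tlinks` are hypothetical names, tabs are assumed as separators, and the second sample row is made up to show how a Task D row pointing at "t0" would look.

```python
from typing import Iterable, List, NamedTuple


class TLink(NamedTuple):
    doc: str       # file name
    first: str     # tag id of the first element
    second: str    # tag id of the second element ("t0" refers to the DCT)
    relation: str  # relation type


def read_tlinks(lines: Iterable[str]) -> List[TLink]:
    """Parse four-column temporal relation rows (hypothetical helper)."""
    links: List[TLink] = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 4:
            continue  # skip blank or malformed rows
        links.append(TLink(*cols))
    return links


sample = [
    "wsj_0037\te561\tt2\toverlap\n",  # row from the fragment above
    "wsj_0037\te561\tt0\toverlap\n",  # made-up Task D row, for illustration only
]
links = read_tlinks(sample)
dct_links = [link for link in links if link.second == "t0"]
print(len(links), len(dct_links))  # prints: 2 1
```

The layout of "dct.tab" is not described in these notes, so the sketch does not attempt to read it.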
Annotation process
Annotation of the trial data proceeded in several phases:
1. Annotation of event and timex extents
2. Annotation of event and timex attributes
3. Annotation of temporal links
The output of phase 1 was input to phases 2 and 3. For each phase, dual annotation followed by an adjudication phase was used. Annotators were given several documents to guide them in their tasks. Which documents were used is language specific; see the docs directory for each language for guidelines and other documentation. Some documentation is still forthcoming.
Browsing the data
The tabular data format is rather uninviting for manual inspection, so we created HTML dumps for all tables. Links are available in the list below.
There is no browser for French because the French data are in XML format and can be viewed as is. Note that the Italian data give a glimpse into some of the temporal relations even though those relations are not part of this official release. Consider this a sneak preview of the final data delivery.
Evaluation data
The evaluation data will be held out from the entire corpus, but otherwise be identical in format to the training data. However, depending on the task some data will be left out:
| Tasks A and B | Only the file "base-segmentation.tab" will be provided. The task is to create the files "timex-extents.tab", "timex-attributes.tab", "event-extents.tab", and "event-attributes.tab". |
| Tasks C through F | All files will be provided, but the relation types in the files with temporal links will be set to "NONE". The task here is to replace "NONE" with one of the acceptable relation types. |
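For Tasks C through F, producing a submission therefore amounts to rewriting the fourth column. The following sketch shows one way to do this with a pluggable predictor; `fill_relations` and `always_overlap` are hypothetical names, tabs are assumed as separators, and the constant label "overlap" is used only because it is the one relation type shown in these notes. The full set of acceptable relation types is not listed here.

```python
from typing import Callable, Iterable, List

# A predictor receives (file name, first id, second id) and returns a relation type.
Predictor = Callable[[str, str, str], str]


def fill_relations(lines: Iterable[str], predict: Predictor) -> List[str]:
    """Replace the "NONE" placeholder in the fourth column (hypothetical helper)."""
    out: List[str] = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 4 and cols[3] == "NONE":
            cols[3] = predict(cols[0], cols[1], cols[2])
        out.append("\t".join(cols))
    return out


def always_overlap(doc: str, first: str, second: str) -> str:
    """Trivial baseline that ignores its input; a real system would go here."""
    return "overlap"


# Evaluation-style row: ids taken from the example above, relation masked.
masked = ["wsj_0037\te561\tt2\tNONE\n"]
filled = fill_relations(masked, always_overlap)
print(filled[0].split("\t"))  # prints: ['wsj_0037', 'e561', 't2', 'overlap']
```

Rows that do not carry the placeholder are passed through unchanged, so the function can be run over a file more than once without altering earlier predictions.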
The evaluation data will be released on March 28th. Scorer scripts will be made available on the Tempeval website before March 21st.
Language-specific remarks
The French data are not provided as vertical files, but as XML files. Likewise, the evaluation data for French will be in XML format.
The Document Creation Time is given in listings in the data directory for three languages: Chinese, English and Spanish. For French, the DCT is given as a timex inside the XML document where the attribute functionInDocument has value PUBLICATION_TIME. In most cases, the DCT is also reflected in the file name. For Italian, the DCT is typically given as the first timex in each document, marked with t1. There are a couple of exceptions though:
- cs.morph015 and cs.morph016 have the DCT 1995-08-07T18:00, signaled by t0
- els.morph041 has the DCT 1986-09-30T18:00, signaled by t0
- els.morph042 has the DCT 1986-09-29T18:00, signaled by t0
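The Italian convention and its exceptions can be captured in a small lookup. This is only a sketch: the function name `italian_dct_id` is hypothetical, the document names are assumed to appear in the data exactly as written in the list above, and "some_other_doc" is a made-up name standing in for any document that follows the typical pattern.

```python
# Documents listed above whose DCT is signaled by t0 instead of t1.
ITALIAN_DCT_ON_T0 = {
    "cs.morph015",
    "cs.morph016",
    "els.morph041",
    "els.morph042",
}


def italian_dct_id(doc_name: str) -> str:
    """Return the timex id expected to hold the DCT (hypothetical helper)."""
    return "t0" if doc_name in ITALIAN_DCT_ON_T0 else "t1"


print(italian_dct_id("els.morph041"))    # prints: t0
print(italian_dct_id("some_other_doc"))  # prints: t1
```

Since t1 is described as the typical case only, the result for unlisted documents is best treated as a default to verify against the data.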
Contacts
For general questions on Tempeval and questions on this document, email marc@cs.brandeis.edu. For questions on the data sets, send an email to the Google group. You can also email one of the people below, depending on which language your question is about:
| Chinese | Nianwen Xue | xuen@brandeis.edu |
| English | Marc Verhagen | marc@cs.brandeis.edu |
| French | Andre Bittar | andre.bittar@linguist.jussieu.fr |
| Italian | Tommaso Caselli | t.caselli@gmail.com |
| Korean | Seohyun Im | ish97@cs.brandeis.edu |
| Spanish | Roser Saurí | roser.sauri@barcelonamedia.org |