Tempeval-2 Trial Data — Release Notes
In this document, we describe the English trial data. The format for all other languages is essentially the same, although small differences are possible. Trial data are currently available for English and Italian.
Data description
The trial data are given in a set of vertical files with tab-separated rows. One file, named "base-segmentation.tab", contains the tokenized text. The other files are (1) a file with a lexical category for each token, (2) files with extents for events and timex3 tags, (3) files with attributes of events and timexes, and (4) files with temporal relations. Below is a fragment of the file with the tokenized text (throughout, we use the fragment "... of an impudent American whom Sony hosted for a year while..." as an example).
    wsj_0037 27 5 of
    wsj_0037 27 6 an
    wsj_0037 27 7 impudent
    wsj_0037 27 8 American
    wsj_0037 27 9 whom
    wsj_0037 27 10 Sony
    wsj_0037 27 11 hosted
    wsj_0037 27 12 *T*-58
    wsj_0037 27 13 for
    wsj_0037 27 14 a
    wsj_0037 27 15 year
    wsj_0037 27 16 while
Each token is uniquely identified by a filename, a sentence offset, and a token offset. Both sentence and token offsets start at 0. For example, in the fragment above, "impudent" is the eighth token in the 28th sentence of the file wsj_0037. The files in this sample are derived from the Penn Treebank and contain the empty categories from the Treebank, as with "*T*-58" above.
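As a rough illustration of the addressing scheme, the rows of "base-segmentation.tab" could be read as sketched below. The `Token` record and the `parse_segmentation` helper are hypothetical names chosen for this example and are not part of the release; the sketch assumes exactly four tab-separated columns per row.

```python
from typing import NamedTuple


class Token(NamedTuple):
    doc: str       # file name, e.g. "wsj_0037"
    sentence: int  # zero-based sentence offset
    offset: int    # zero-based token offset within the sentence
    text: str


def parse_segmentation(lines):
    """Parse tab-separated rows of base-segmentation.tab into Token records."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc, sentence, offset, text = line.split("\t", 3)
        tokens.append(Token(doc, int(sentence), int(offset), text))
    return tokens


# Three rows taken from the fragment above, with real tab separators.
sample = [
    "wsj_0037\t27\t7\timpudent\n",
    "wsj_0037\t27\t8\tAmerican\n",
    "wsj_0037\t27\t12\t*T*-58\n",
]
tokens = parse_segmentation(sample)
first = tokens[0]
# Offsets start at 0, so add 1 to get the human-readable position.
print(f"{first.text!r} is token {first.offset + 1} of sentence {first.sentence + 1}")
```

Keeping the offsets as integers makes it straightforward to sort tokens or to look them up by the (filename, sentence, token) triple used in the other files.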
Events and their attributes are stored in two files, "event-extents.tab" and "event-attributes.tab". A fragment of each is shown below, in that order.
wsj_0037 27 11 event e561 1
    wsj_0037 27 11 event e561 1 aspect NONE
    wsj_0037 27 11 event e561 1 modality
    wsj_0037 27 11 event e561 1 polarity POS
    wsj_0037 27 11 event e561 1 tense PAST
Again, the first three columns uniquely identify a token position in the source. The other columns of "event-extents.tab" contain the tag name (always "event" in this file), the tag id, and the instance id. The tag id is unique to the file, and usually, but not necessarily, unique to the corpus. The instance id is there to deal with cases like "She plays soccer on Monday and Wednesday", where there are two instances of the "play soccer" event. In the vast majority of cases, there is only one instance and the value in the last column is "1". The file "event-attributes.tab" contains two more columns: an attribute name and an attribute value.
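The attribute rows can be folded into one record per event instance. The following is a minimal sketch, not an official reader: `parse_event_attributes` is a hypothetical helper, and it assumes that a row with no value (as with "modality" in the fragment above) either ends after the attribute name or carries an empty eighth column.

```python
from collections import defaultdict


def parse_event_attributes(lines):
    """Group attribute rows by (filename, tag id, instance id)."""
    events = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete attribute row
        doc, _sentence, _offset, _tag, tag_id, instance, name = fields[:7]
        # Assumption: a missing value is recorded as an empty string.
        value = fields[7] if len(fields) > 7 else ""
        events[(doc, tag_id, instance)][name] = value
    return dict(events)


# Rows from the fragment above, with real tab separators.
rows = [
    "wsj_0037\t27\t11\tevent\te561\t1\taspect\tNONE\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tmodality\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tpolarity\tPOS\n",
    "wsj_0037\t27\t11\tevent\te561\t1\ttense\tPAST\n",
]
events = parse_event_attributes(rows)
print(events[("wsj_0037", "e561", "1")])
```

The filename is part of the key because, as noted above, a tag id is only guaranteed to be unique within a file.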
Timexes and their attributes are stored in "timex-extents.tab" and "timex-attributes.tab". A fragment of each is shown below, in that order.
    wsj_0037 27 14 timex3 t2 1
    wsj_0037 27 15 timex3 t2 1

    wsj_0037 27 14 timex3 t2 1 type DURATION
    wsj_0037 27 14 timex3 t2 1 value P1Y
These files use the same columns as their event counterparts. Note, however, that the attributes are associated with the first token of the timex extent, and that the instance id is always "1" for timexes.
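Because an extent spans one row per token, the surface text of a tag can be recovered by joining the extent rows against the tokenized text. The sketch below assumes six tab-separated columns in the extents file and four in "base-segmentation.tab"; `extent_text` is a hypothetical helper written for this example.

```python
def extent_text(extent_lines, token_lines):
    """Map (filename, tag id) to the text covered by the tag's extent."""
    # Index every token by its (filename, sentence offset, token offset).
    words = {}
    for line in token_lines:
        doc, sentence, offset, text = line.rstrip("\n").split("\t", 3)
        words[(doc, int(sentence), int(offset))] = text

    # Collect the token positions belonging to each tag.
    spans = {}
    for line in extent_lines:
        doc, sentence, offset, _tag, tag_id, _instance = line.rstrip("\n").split("\t")
        spans.setdefault((doc, tag_id), []).append((int(sentence), int(offset)))

    # Join the tokens of each tag in document order.
    return {
        (doc, tag_id): " ".join(words[(doc, s, t)] for s, t in sorted(positions))
        for (doc, tag_id), positions in spans.items()
    }


token_rows = [
    "wsj_0037\t27\t13\tfor\n",
    "wsj_0037\t27\t14\ta\n",
    "wsj_0037\t27\t15\tyear\n",
]
timex_rows = [
    "wsj_0037\t27\t14\ttimex3\tt2\t1\n",
    "wsj_0037\t27\t15\ttimex3\tt2\t1\n",
]
texts = extent_text(timex_rows, token_rows)
print(texts[("wsj_0037", "t2")])
```

The first position in each sorted span is also the token that carries the timex attributes, so the same index can be used to attach "type" and "value" to the recovered text.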
The four classes of Tempeval temporal relations are all stored in the same format. The associations between task identifier and file name are given in the table below.
| Task C | tlinks-event-timex.tab |
| Task D | tlinks-dct-events.tab |
| Task E | tlinks-main-events.tab |
| Task F | tlinks-subordinated-events.tab |
The vertical files for the temporal relations all have four columns: (1) filename, (2) tag id of the first element in the relation, (3) tag id of the second element in the relation, and (4) the relation type. Here is a fragment of "tlinks-event-timex.tab", with the temporal relation between "hosted" and "a year".
wsj_0037 e561 t2 overlap
For Task D ("tlinks-dct-events.tab"), the third column is always "t0", a special id that refers to the Document Creation Time (DCT). The exact value of the DCT is not specified in the timex files, since it is not necessarily associated with a token offset. Instead, the DCT timex for each document is given in a separate file named "dct.tab".
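The four-column layout lends itself to a very small reader, sketched below together with a sanity check for the Task D convention. Both helper names are hypothetical, and the Task D row in the example is invented for illustration only; it is not taken from the data.

```python
def parse_tlinks(lines):
    """Parse rows of a tlinks file into (filename, first id, second id, relation)."""
    links = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc, first, second, relation = line.split("\t")
        links.append((doc, first, second, relation))
    return links


def all_anchored_to_dct(links):
    """Check the Task D convention that the second element is always "t0"."""
    return all(second == "t0" for _doc, _first, second, _relation in links)


# Row from the fragment above (event-timex link).
event_timex = parse_tlinks(["wsj_0037\te561\tt2\toverlap\n"])

# Hypothetical Task D row, made up for this example.
event_dct = parse_tlinks(["wsj_0037\te561\tt0\toverlap\n"])

print(event_timex[0])
print(all_anchored_to_dct(event_dct), all_anchored_to_dct(event_timex))
```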
Annotation process
Annotation of the trial data proceeded in several phases:
1. Dual annotation of event extents
2. Judgement of event extents
3. Dual annotation of timex extents
4. Judgement of timex extents
5. Annotation of event and timex attributes
6. Annotation of temporal links
The output of phases 2 and 4 was input to phases 5 and 6. Note that for the trial data, no dual annotation was attempted. This will be different for the training and evaluation data.
Annotators were given four documents to guide them in their tasks: TimeML overview, event annotation guidelines, timex annotation guidelines, and TLINK annotation guidelines. The annotators were also asked to refer to additional descriptions of the ISO format for timex values given in the TIDES Temporal Annotation Guidelines Version 1.0.2 document.
Trial data statistics
| | English | Italian |
| files | 2 | 8 |
| tokens | 5111 | 3355 |
| task A: timexes | 55 | 57 |
| task B: events | 764 | 519 |
| task C: event-timex links | 52 | 48 |
| task D: event-dct links | 764 | 519 |
| task E: links between main events | 193 | 87 |
| task F: links in subordination contexts | 173 | 94 |
Training and evaluation data
The English training data will be different in at least the following respects:
- Size. The training corpus will be an order of magnitude larger.
- Document selection. The documents in the trial data are very large compared to most documents expected to be in the training data. In addition, the first document in the trial set is a book description; documents of this kind will be excluded from the training data.
- Annotation quality. The trial data have not been checked rigorously, and the annotation guidelines were written while the annotation was in progress. For the training data, we use fixed guidelines and dual annotation.
- Temporal relations. The set of temporal relations may be changed for some tasks.
- Eliminating little glitches. For example, not all events and timexes have attributes, due to a hiccup in the annotation procedure. This will be fixed for the training data.
The evaluation data will be held out from the entire corpus, but will otherwise be identical in format to the training data. However, depending on the task, some data will be left out:
| Tasks A and B | Only the file "base-segmentation.tab" will be provided. The task would be to create the files "timex-extents.tab", "timex-attributes.tab", "event-extents.tab", and "event-attributes.tab". |
| Tasks C through F | All files will be provided, but the relation types in the files with temporal links will be set to "NONE". The task here is to replace "NONE" with one of the acceptable relation types. |
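For Tasks C through F, a system's output step could look roughly like the sketch below, which rewrites every "NONE" relation with a predicted label. The `fill_relations` helper and the constant baseline are hypothetical and stand in for a real classifier; the only relation label used is "overlap", which appears in the fragment earlier in this document, and the set of acceptable labels for each task is not assumed here.

```python
def fill_relations(lines, predict):
    """Replace "NONE" in the relation column with the label returned by predict."""
    output = []
    for line in lines:
        doc, first, second, relation = line.rstrip("\n").split("\t")
        if relation == "NONE":
            relation = predict(doc, first, second)
        output.append("\t".join((doc, first, second, relation)))
    return output


def always_overlap(doc, first, second):
    """Trivial baseline for illustration: label every pair "overlap"."""
    return "overlap"


# An evaluation-style row, built from the tlinks fragment with the label blanked.
evaluation_rows = ["wsj_0037\te561\tt2\tNONE\n"]
filled = fill_relations(evaluation_rows, always_overlap)
print(filled[0])
```

Rows whose relation is already set are passed through unchanged, so the same function can be run over a partially labeled file.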
Scorer scripts will be made available with the training data.