Tempeval-2 Trial Data — Release Notes

In this document, we describe the English trial data. The format for all other languages is essentially the same, although small differences are possible. Trial data are currently available for English and Italian.

Data description

The trial data are given in a set of vertical files with tab-separated rows. One file, named "base-segmentation.tab", contains the tokenized text. The other files are (1) a file with a lexical category for each token, (2) files with extents for events and timex3 tags, (3) files with attributes of events and timexes, and (4) files with temporal relations. Below is a fragment of the file with the tokenized text (throughout, we use the fragment "... of an impudent American whom Sony hosted for a year while..." as an example).

wsj_0037	27	5	of
wsj_0037	27	6	an
wsj_0037	27	7	impudent
wsj_0037	27	8	American
wsj_0037	27	9	whom
wsj_0037	27	10	Sony
wsj_0037	27	11	hosted
wsj_0037	27	12	*T*-58
wsj_0037	27	13	for
wsj_0037	27	14	a
wsj_0037	27	15	year
wsj_0037	27	16	while

Each token is uniquely identified by a filename, a sentence offset and a token offset. Both sentence and token offsets start at 0. For example, in the fragment above, "impudent" is the eighth token in the 28th sentence of the file wsj_0037. The files in this sample are derived from the Penn Treebank and contain the empty categories from the Treebank, as with "*T*-58" above.
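As an illustration of the addressing scheme, here is a minimal sketch in Python that reads rows in the "base-segmentation.tab" layout into a dictionary keyed by (filename, sentence offset, token offset). The function name read_tokens and the inline sample are assumptions made for this example; they are not part of the release.

```python
from typing import Dict, Iterable, Tuple

# (filename, sentence offset, token offset); both offsets are 0-based.
TokenKey = Tuple[str, int, int]


def read_tokens(lines: Iterable[str]) -> Dict[TokenKey, str]:
    """Map each token position to its text (hypothetical helper)."""
    tokens: Dict[TokenKey, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        filename, sent, tok, text = line.split("\t")
        tokens[(filename, int(sent), int(tok))] = text
    return tokens


# Two rows copied from the fragment above.
SAMPLE = [
    "wsj_0037\t27\t7\timpudent\n",
    "wsj_0037\t27\t12\t*T*-58\n",
]

tokens = read_tokens(SAMPLE)
print(tokens[("wsj_0037", 27, 7)])
```

With a real file, the same function could be called on an open file handle, for example read_tokens(open("base-segmentation.tab", encoding="utf-8")), assuming the file is UTF-8 encoded.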

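Events and their attributes are stored in two files: "event-extents.tab" and "event-attributes.tab". In the fragment below, the first row is from "event-extents.tab" and the remaining four rows are from "event-attributes.tab".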

wsj_0037	27	11	event	e561	1
wsj_0037	27	11	event	e561	1	aspect	NONE
wsj_0037	27	11	event	e561	1	modality	
wsj_0037	27	11	event	e561	1	polarity	POS
wsj_0037	27	11	event	e561	1	tense	PAST

Again, the first three columns uniquely identify a token position in the source. The other columns of "event-extents.tab" contain the tag name (always "event" in this file), the tag id, and the instance id. The tag id is unique to the file, and usually, but not necessarily, unique to the corpus. The instance id is there to deal with cases like "She plays soccer on Monday and Wednesday", where there are two instances of the "play soccer" event. In the vast majority of cases, there is only one instance and the value in the last column is "1". The file "event-attributes.tab" contains two more columns: an attribute name and an attribute value.
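The column layout of "event-attributes.tab" can be turned into one attribute dictionary per event instance. The sketch below is one possible way to do this in Python; the helper name read_attributes is an assumption. Note that it strips only the line ending, so an empty attribute value (as with "modality" in the fragment above) is kept as an empty string rather than lost.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (filename, tag id, instance id)
InstanceKey = Tuple[str, str, int]


def read_attributes(lines: Iterable[str]) -> Dict[InstanceKey, Dict[str, str]]:
    """Collect attribute name/value pairs per tag instance (hypothetical helper)."""
    attrs: Dict[InstanceKey, Dict[str, str]] = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")  # keep trailing tabs: the value may be empty
        if not line:
            continue
        fields = line.split("\t")
        filename, _sent, _tok, _tag, tag_id, instance, name = fields[:7]
        value = fields[7] if len(fields) > 7 else ""
        attrs[(filename, tag_id, int(instance))][name] = value
    return dict(attrs)


# The four attribute rows from the fragment above.
SAMPLE = [
    "wsj_0037\t27\t11\tevent\te561\t1\taspect\tNONE\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tmodality\t\n",
    "wsj_0037\t27\t11\tevent\te561\t1\tpolarity\tPOS\n",
    "wsj_0037\t27\t11\tevent\te561\t1\ttense\tPAST\n",
]

event_attrs = read_attributes(SAMPLE)
print(event_attrs[("wsj_0037", "e561", 1)])
```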

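Timexes and their attributes are stored in "timex-extents.tab" and "timex-attributes.tab". In the fragment below, the first two rows are from "timex-extents.tab" and the last two rows are from "timex-attributes.tab".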

wsj_0037	27	14	timex3	t2	1
wsj_0037	27	15	timex3	t2	1
wsj_0037	27	14	timex3	t2	1	type	DURATION
wsj_0037	27	14	timex3	t2	1	value	P1Y

These files use the same columns as their event counterparts. Note, however, that the attributes are associated with the first token of the timex extent, and that the instance id is always "1" for timexes.
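Because an extent can span several tokens, one row per token, it may help to group the extent rows by tag id. The following Python sketch does this and recovers the text of the timex "a year"; the names extent_spans and extent_text, and the small TOKENS dictionary standing in for the tokenized text, are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Stand-in for the tokenized text: (filename, sentence, token) -> text.
TOKENS = {
    ("wsj_0037", 27, 14): "a",
    ("wsj_0037", 27, 15): "year",
}

# The two extent rows from the fragment above.
EXTENT_ROWS = [
    "wsj_0037\t27\t14\ttimex3\tt2\t1\n",
    "wsj_0037\t27\t15\ttimex3\tt2\t1\n",
]

SpanKey = Tuple[str, str]  # (filename, tag id)


def extent_spans(rows: Iterable[str]) -> Dict[SpanKey, List[Tuple[int, int]]]:
    """Group token positions by tag, sorted in text order (hypothetical helper)."""
    spans: Dict[SpanKey, List[Tuple[int, int]]] = defaultdict(list)
    for row in rows:
        row = row.rstrip("\r\n")
        if not row:
            continue
        filename, sent, tok, _tag, tag_id, _instance = row.split("\t")
        spans[(filename, tag_id)].append((int(sent), int(tok)))
    return {key: sorted(positions) for key, positions in spans.items()}


def extent_text(key: SpanKey, spans, tokens) -> str:
    """Join the tokens covered by one extent with single spaces."""
    filename, _tag_id = key
    return " ".join(tokens[(filename, s, t)] for s, t in spans[key])


spans = extent_spans(EXTENT_ROWS)
first_token = spans[("wsj_0037", "t2")][0]  # position that carries the attributes
print(extent_text(("wsj_0037", "t2"), spans, TOKENS))
```

The first position in each sorted span is the token to look up in the attributes file, since attributes are attached to the first token of the extent.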

The four classes of Tempeval temporal relations are all stored in the same format. The associations between task identifier and file name are given in the table below.

Task C    tlinks-event-timex.tab
Task D    tlinks-dct-events.tab
Task E    tlinks-main-events.tab
Task F    tlinks-subordinated-events.tab

The vertical files for the temporal relations all have four columns: (1) filename, (2) tag id of the first element in the relation, (3) tag id of the second element in the relation, and (4) the relation type. Here is a fragment of "tlinks-event-timex.tab", with the temporal relation between "hosted" and "a year".

wsj_0037	e561	t2	overlap

For Task D ("tlinks-dct-events.tab"), the third column is always "t0", where "t0" is a special id that refers to the Document Creation Time (DCT). Note that the exact value of the DCT is not specified in the file with timexes, since it is not necessarily associated with an offset. Instead, the DCT timex is given for each file in a separate file named "dct.tab".
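The four-column relation format is simple to read programmatically. Below is a minimal Python sketch; the TLink type, the read_tlinks function, and the DCT_ID constant are names chosen for this example. The second sample row is hypothetical: it only illustrates a Task D style link to "t0" and is not taken from the data.

```python
from typing import Iterable, List, NamedTuple

DCT_ID = "t0"  # special id that refers to the Document Creation Time


class TLink(NamedTuple):
    filename: str
    first_id: str   # tag id of the first element in the relation
    second_id: str  # tag id of the second element in the relation
    relation: str   # relation type


def read_tlinks(lines: Iterable[str]) -> List[TLink]:
    """Parse rows of a tlinks file (hypothetical helper)."""
    links = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            links.append(TLink(*line.split("\t")))
    return links


SAMPLE = [
    "wsj_0037\te561\tt2\toverlap\n",  # row from the fragment above
    "wsj_0037\te561\tt0\tNONE\n",     # hypothetical row, for illustration only
]

links = read_tlinks(SAMPLE)
dct_links = [link for link in links if link.second_id == DCT_ID]
print(links[0].relation, len(dct_links))
```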

Annotation process

Annotation of the trial data proceeded in several phases:

  1. Dual annotation of event extents
  2. Judgement of event extents
  3. Dual annotation of timex extents
  4. Judgement of timex extents
  5. Annotation of event and timex attributes
  6. Annotation of temporal links

The output of phases 2 and 4 was input to phases 5 and 6. Note that for the trial data, no dual annotation was attempted. This will be different for the training and evaluation data.

Annotators were given four documents to guide them in their tasks: TimeML overview, event annotation guidelines, timex annotation guidelines, and TLINK annotation guidelines. The annotators were also asked to refer to additional descriptions of the ISO format for timex values given in the TIDES Temporal Annotation Guidelines Version 1.0.2 document.

Trial data statistics

                                         English  Italian
files                                          2        8
tokens                                      5111     3355
task A: timexes                               55       57
task B: events                               764      519
task C: event-timex links                     52       48
task D: event-dct links                      764      519
task E: links between main events            193       87
task F: links in subordination contexts      173       94

Training and evaluation data

The English training data will be different in at least the following respects:

  1. Size. The training corpus will be an order of magnitude larger.
  2. Document selection. The documents in the trial data are very large compared to most documents expected to be in the training data. In addition, the first document in the trial set is a book description; documents of this kind will be excluded from the training data.
  3. Annotation quality. The trial data have not been checked rigorously, and the annotation guidelines were written while the annotation was in progress. For the training data, we use fixed guidelines and dual annotation.
  4. Temporal relations. The set of temporal relations may be changed for some tasks.
  5. Eliminating small glitches. For example, not all events and timexes have attributes, due to a hiccup in the annotation procedure. This will be fixed for the training data.

The evaluation data will be held out from the entire corpus, but will otherwise be identical in format to the training data. However, depending on the task, some data will be left out:

Tasks A and B: Only the file "base-segmentation.tab" will be provided. The task is to create the files "timex-extents.tab", "timex-attributes.tab", "event-extents.tab", and "event-attributes.tab".
Tasks C through F: All files will be provided, but the relation types in the files with temporal links will be set to "NONE". The task here is to replace "NONE" with one of the acceptable relation types.
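For Tasks C through F, a system therefore only needs to rewrite the fourth column of the relation files. The Python sketch below shows one way this could be organized. The function fill_relations and the placeholder predictor always_overlap are assumptions for this example; the predictor simply returns "overlap", the one relation type shown in this document, and stands in for a real classifier.

```python
from typing import Callable, Iterable, List

# A predictor maps (filename, first tag id, second tag id) to a relation type.
Predictor = Callable[[str, str, str], str]


def fill_relations(lines: Iterable[str], predict: Predictor) -> List[str]:
    """Replace every "NONE" relation with a predicted one (hypothetical helper)."""
    out = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        filename, first_id, second_id, relation = line.split("\t")
        if relation == "NONE":
            relation = predict(filename, first_id, second_id)
        out.append("\t".join((filename, first_id, second_id, relation)))
    return out


def always_overlap(filename: str, first_id: str, second_id: str) -> str:
    """Placeholder predictor; a real system would inspect the annotations."""
    return "overlap"


# Evaluation-style row: same ids as the fragment above, relation set to "NONE".
filled = fill_relations(["wsj_0037\te561\tt2\tNONE\n"], always_overlap)
print(filled[0])
```

Rows whose relation is already something other than "NONE" pass through unchanged, so the same function can be run on partially filled files.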

Scorer scripts will be made available with the training data.