Tempeval-2 Training Data — Release Notes

Version 1 - Marc Verhagen - March 11th, 2010

In this document, we describe the Tempeval training data. The description focuses on the English data, but the format for all other languages, with the exception of the French data, is essentially the same. Available in this release is the first batch of training data for Chinese, English, French, Italian and Spanish.

The unavoidable updates to this document will be published on the Tempeval2 pages at http://www.timeml.org/tempeval2/, which also contain a link to the Tempeval Google group.

Data description

Training data are given for five languages. Not all languages provide data for all tasks. The list below shows the contents of this distribution; the letters refer to the tasks in the Tempeval proposal.

language tokens A B C D E F
Chinese 23,000
English 17,000
Italian 26,000        
French 10,000        
Spanish 47,000    

In addition, for French there is a small set of 3000 tokens with event, timex and temporal link annotation, but the temporal links are not split into subtasks.

This is the first of two training data releases. The next release will be on March 21st and will, as far as we can see now, contain the following extras:

Chinese no more training data expected
English data for all six tasks for an additional 35,000 tokens
French increases in the set of 3000 tokens with event, timex and temporal link annotation
Italian temporal relations from tasks C, D and E added to the current set of 26,000 tokens
Korean 10,000 tokens with events and times
Spanish temporal relations from tasks E and F added to the current data set

The data are given in a set of vertical files with tab-separated rows. One file, named "base-segmentation.tab", contains the tokenized text. The other files are (1) an optional file with a lexical category for each token, (2) files with extents for events and timex3 tags, (3) files with attributes of events and timexes, and (4) files with temporal relations. Below is a fragment of the file with the tokenized text (throughout, we use the fragment "... of an impudent American whom Sony hosted for a year while..." as an example).

wsj_0037	27	5	of
wsj_0037	27	6	an
wsj_0037	27	7	impudent
wsj_0037	27	8	American
wsj_0037	27	9	whom
wsj_0037	27	10	Sony
wsj_0037	27	11	hosted
wsj_0037	27	12	*T*-58
wsj_0037	27	13	for
wsj_0037	27	14	a
wsj_0037	27	15	year
wsj_0037	27	16	while

Each token is uniquely defined by a file name, a sentence offset and a token offset. Both sentence offsets and token offsets start at 0. For example, in the fragment above, "impudent" is the eighth token in the 28th sentence of the file wsj_0037. The files in this example were derived from the Penn Treebank and contain the empty categories from the Treebank, as with "*T*-58" above. Some languages provide a file with lexical categories for each token; its format is the same as that of the file with the tokens.
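As an illustration of the addressing scheme, here is a minimal Python sketch that reads rows in the "base-segmentation.tab" layout into a dictionary keyed by (file name, sentence offset, token offset). The function name read_tokens and the in-memory sample are ours and are not part of the distribution.

```python
from typing import Dict, Iterable, Tuple

# (file name, sentence offset, token offset); both offsets start at 0
TokenKey = Tuple[str, int, int]


def read_tokens(lines: Iterable[str]) -> Dict[TokenKey, str]:
    """Map each token position to its token text."""
    tokens: Dict[TokenKey, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        doc, sent, tok, text = fields
        tokens[(doc, int(sent), int(tok))] = text
    return tokens


# Three rows copied from the fragment above.
sample = [
    "wsj_0037\t27\t7\timpudent\n",
    "wsj_0037\t27\t8\tAmerican\n",
    "wsj_0037\t27\t12\t*T*-58\n",
]
tokens = read_tokens(sample)
print(tokens[("wsj_0037", 27, 7)])  # impudent
```

An open file handle can be passed instead of the list, since iterating over a text file also yields one line at a time. The character encoding of the files is not stated here, so it would have to be checked per language.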

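Events and their attributes are stored in two files: "event-extents.tab" and "event-attributes.tab". In the fragment below, the first row is from "event-extents.tab" and the remaining rows are from "event-attributes.tab".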

wsj_0037	27	11	event	e561	1
wsj_0037	27	11	event	e561	1	aspect	NONE
wsj_0037	27	11	event	e561	1	modality	
wsj_0037	27	11	event	e561	1	polarity	POS
wsj_0037	27	11	event	e561	1	tense	PAST

Again, the first three columns uniquely identify a token position in the source. The other columns of "event-extents.tab" contain the tag name (always "event" in this file), the tag id, and the instance id. The tag id is unique to the file, and usually, but not necessarily, unique to the corpus. The instance id is there to deal with cases like "She plays soccer on Monday and Wednesday", where there are two instances of the "play soccer" event. In the vast majority of cases, there is only one instance and the value in the last column is "1". The file "event-attributes.tab" contains two more columns: an attribute name and an attribute value.
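The column layout described above lends itself to grouping attribute rows per event instance. The following Python sketch does this under the assumption that an instance is identified by file name, tag id and instance id; the function name collect_attributes is hypothetical. An empty attribute value, as for modality in the fragment, is kept as an empty string.

```python
from collections import defaultdict
from typing import Dict, Tuple

# (file name, tag id, instance id)
InstanceKey = Tuple[str, str, str]


def collect_attributes(text: str) -> Dict[InstanceKey, Dict[str, str]]:
    """Group the rows of an attributes file by event or timex instance."""
    grouped = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not an attribute row
        doc, tag_id, instance, name = fields[0], fields[4], fields[5], fields[6]
        # The value column may be empty or missing altogether.
        value = fields[7] if len(fields) > 7 else ""
        grouped[(doc, tag_id, instance)][name] = value
    return dict(grouped)


# Attribute rows copied from the fragment above.
event_attributes = (
    "wsj_0037\t27\t11\tevent\te561\t1\taspect\tNONE\n"
    "wsj_0037\t27\t11\tevent\te561\t1\tmodality\t\n"
    "wsj_0037\t27\t11\tevent\te561\t1\tpolarity\tPOS\n"
    "wsj_0037\t27\t11\tevent\te561\t1\ttense\tPAST\n"
)
events = collect_attributes(event_attributes)
print(events[("wsj_0037", "e561", "1")]["tense"])  # PAST
```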

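Timexes and their attributes are stored in "timex-extents.tab" and "timex-attributes.tab". In the fragment below, the first two rows are from "timex-extents.tab" and the last two are from "timex-attributes.tab".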

wsj_0037	27	14	timex3	t2	1
wsj_0037	27	15	timex3	t2	1
wsj_0037	27	14	timex3	t2	1	type	DURATION
wsj_0037	27	14	timex3	t2	1	value	P1Y

These files use the same columns as their eventive counterparts. But note that the attributes are associated with the first token of the timex extent. Also note that the instance number is always "1" for timexes.
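Because a timex can span several tokens, its text has to be reassembled from the extent rows. The sketch below shows one way to do that in Python, assuming a token dictionary keyed by (file name, sentence offset, token offset); the name timex_spans and the tiny inline data are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TokenKey = Tuple[str, int, int]  # (file name, sentence offset, token offset)


def timex_spans(extent_text: str,
                tokens: Dict[TokenKey, str]) -> Dict[Tuple[str, str], List[str]]:
    """Return the tokens of each timex, in text order, keyed by (file, tag id)."""
    positions = defaultdict(list)
    for line in extent_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        doc, sent, tok, tag_id = fields[0], fields[1], fields[2], fields[4]
        positions[(doc, tag_id)].append((int(sent), int(tok)))
    return {
        (doc, tag_id): [tokens[(doc, s, t)] for s, t in sorted(offsets)]
        for (doc, tag_id), offsets in positions.items()
    }


# Rows and tokens copied from the fragments above.
tokens = {("wsj_0037", 27, 14): "a", ("wsj_0037", 27, 15): "year"}
extents = (
    "wsj_0037\t27\t14\ttimex3\tt2\t1\n"
    "wsj_0037\t27\t15\ttimex3\tt2\t1\n"
)
spans = timex_spans(extents, tokens)
print(" ".join(spans[("wsj_0037", "t2")]))  # a year
```

Sorting the offsets also gives the first token of the extent, which is the position that the attribute rows refer to.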

The four classes of Tempeval temporal relations are all stored with the same format. The associations between task identifier and file name are in the table below.

Task C tlinks-event-timex.tab
Task D tlinks-dct-events.tab
Task E tlinks-main-events.tab
Task F tlinks-subordinated-events.tab

The vertical files for the temporal relations all have four columns: (1) file name, (2) tag id of the first element in the relation, (3) tag id of the second element in the relation, and (4) the relation type. Here is a fragment of "tlinks-event-timex.tab", with the temporal relation between "host" and "a year".

wsj_0037	e561	t2	overlap

For Task D ("tlinks-dct-events.tab"), the third column is always "t0", where "t0" is a special id that refers to the Document Creation Time. Note that the exact value of the DCT is not specified in the file with timexes since it is not necessarily associated with an offset. Instead, the DCT timex is given for each file in a separate file named "dct.tab".
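The four-column relation files can be read with a few lines of Python. The sketch below is only an illustration: the TLink type and the helper names are ours, and the second sample row (a link to "t0") is invented to show the Task D case, it is not taken from the data.

```python
from typing import List, NamedTuple

DCT_ID = "t0"  # special id for the Document Creation Time


class TLink(NamedTuple):
    doc: str       # file name
    first: str     # tag id of the first element
    second: str    # tag id of the second element
    relation: str  # relation type


def read_tlinks(text: str) -> List[TLink]:
    """Parse the rows of a temporal relation file."""
    links = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        links.append(TLink(*fields))
    return links


def links_to_dct(links: List[TLink]) -> List[TLink]:
    """Keep only the links whose second element is the DCT."""
    return [link for link in links if link.second == DCT_ID]


sample = (
    "wsj_0037\te561\tt2\toverlap\n"  # row from the fragment above
    "wsj_0037\te561\tt0\toverlap\n"  # invented row, for illustration only
)
links = read_tlinks(sample)
print(len(links), len(links_to_dct(links)))  # 2 1
```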

Annotation process

Annotation of the trial data proceeded in several phases:

  1. Annotation of event and timex extents
  2. Annotation of event and timex attributes
  3. Annotation of temporal links

The output of phase 1 was input to phases 2 and 3. For each phase, dual annotation followed by an adjudication phase was used. Annotators were given several documents to guide them in their tasks. Which documents were used is language specific; see the docs directory for each language for guidelines and other documentation. Some documentation is still forthcoming.

Browsing the data

The tabular data format is rather uninviting for manual inspection. Therefore, we created HTML dumps for all tables. Links are available in the list below.

There is no browser for French because the French data are in XML format and can be viewed as is. Note that the Italian data give a glimpse into some of the temporal relations even though those relations are not part of this official release. Consider this a sneak preview of the final data delivery.

Evaluation data

The evaluation data will be held out from the entire corpus, but will otherwise be identical in format to the training data. However, depending on the task, some data will be left out:

Tasks A and B: Only the file "base-segmentation.tab" will be provided. The task is to create the files "timex-extents.tab", "timex-attributes.tab", "event-extents.tab", and "event-attributes.tab".
Tasks C through F: All files will be provided, but the relation types in the files with temporal links will be set to "NONE". The task here is to replace "NONE" with one of the acceptable relation types.
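For Tasks C through F, producing a submission amounts to rewriting the fourth column of each relation file. Here is a rough Python sketch of that step. The predict callback stands in for whatever system a participant builds, and the constant "overlap" baseline is a placeholder taken from the example row earlier in this document, not a recommended strategy.

```python
from typing import Callable


def fill_relations(text: str,
                   predict: Callable[[str, str, str], str]) -> str:
    """Replace every "NONE" relation type with the output of predict()."""
    out = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 4 and fields[3] == "NONE":
            doc, first, second = fields[:3]
            fields[3] = predict(doc, first, second)
        out.append("\t".join(fields))
    return "\n".join(out) + "\n" if out else ""


def always_overlap(doc: str, first: str, second: str) -> str:
    """Placeholder system that ignores its input."""
    return "overlap"


evaluation_rows = "wsj_0037\te561\tt2\tNONE\n"
print(fill_relations(evaluation_rows, always_overlap), end="")
```

Rows that are not in the expected four-column shape, or that already carry a relation type, are passed through unchanged.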

The evaluation data will be released on March 28th. Scorer scripts will be made available on the Tempeval website before March 21st.

Language-specific remarks

The French data are not provided as vertical files, but as XML files. Likewise, the evaluation data for French will be in XML format.

The Document Creation Time is given in listings in the data directory for three languages: Chinese, English and Spanish. For French, the DCT is given as a timex inside the XML document where the attribute functionInDocument has value PUBLICATION_TIME. In most cases, the DCT is also reflected in the file name. For Italian, the DCT is typically given as the first timex in each document, marked with t1. There are a couple of exceptions though:

Contacts

For general questions on Tempeval and questions on this document, email marc@cs.brandeis.edu. For questions on the data sets, send an email to the Google group. You can also email one of the people below, depending on the language your question is about:

Chinese Nianwen Xue xuen@brandeis.edu
English Marc Verhagen marc@cs.brandeis.edu
French Andre Bittar andre.bittar@linguist.jussieu.fr
Italian Tommaso Caselli t.caselli@gmail.com
Korean Seohyun Im ish97@cs.brandeis.edu
Spanish Roser Saurí roser.sauri@barcelonamedia.org