EVALITA 2009 - Italian Parsing Task

Dependency Parsing Track: Pilot Subtask


Dipartimento di Informatica, Università di Pisa

Dipartimento di Linguistica, Università di Pisa

Istituto di Linguistica Computazionale (ILC) - CNR

Corpus summary

The pilot dependency subtask (pilotDepPar) uses as development data the TANL dependency annotated corpus, jointly developed by the Istituto di Linguistica Computazionale (ILC-CNR) and the University of Pisa within the project "Analisi di Testi per il Semantic Web e il Question Answering" [1]. The TANL corpus is a revision of the ISST-CoNLL corpus [2] used in the multilingual track of the CoNLL-2007 shared task. The ISST-CoNLL corpus was in turn derived from the Italian Syntactic-Semantic Treebank [3], specifically from its morpho-syntactic and syntactic dependency annotation levels.

Evaluation

The evaluation will be based on three data sets:
  1. Training Corpus (TrainSet-pilotDepPar): data annotated with the TANL tagset, to be used for training the systems participating in the pilot subtask
  2. Development Corpus (DevSet-pilotDepPar): a smaller corpus to be used for development
  3. Test Set (TestSet-pilotDepPar): blind test data for the evaluation (available from 10 September 2009)

Corpora statistics

Training corpus

#sentences 2,868
#tokens 66,528

Development corpus

#sentences 241
#tokens 4,745
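
The counts above can be checked against a downloaded file with a short script. The following is a minimal sketch, assuming the corpus files use a CoNLL-style layout (one token per line, tab-separated columns, a blank line between sentences); this page does not state the file format, so treat that layout as an assumption. The function name count_conll and the toy sample are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, Tuple


def count_conll(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, tokens) for CoNLL-style input.

    Assumption: every non-blank line is one token and blank lines
    separate sentences.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            # A blank line closes the sentence currently being read.
            sentences += 1
            in_sentence = False
    if in_sentence:
        # The last sentence may not be followed by a blank line.
        sentences += 1
    return sentences, tokens


# Invented toy sample (not corpus data): two sentences, four tokens.
sample = (
    "1\tIl\t_\t_\t_\t_\t2\tdet\t_\t_\n"
    "2\tcane\t_\t_\t_\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\t_\t_\t_\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\t_\t_\t_\t_\t0\tROOT\t_\t_\n"
).splitlines()

n_sentences, n_tokens = count_conll(sample)
print(n_sentences, n_tokens)
```

To run it on a real file, pass an open file object, for example count_conll(open(path, encoding="utf-8")), where path is wherever the archive was unpacked. From the figures reported above, the average sentence length is about 23.2 tokens in the training corpus (66,528 / 2,868) and about 19.7 tokens in the development corpus (4,745 / 241).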

Copyright and license

The TANL Dependency Annotated Corpus is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License. Participants will be asked to agree to the license conditions when downloading the resource.

Resource download

TANL Dependency Annotated Corpus (Training and Development corpora): TANL_DEP.tgz

Test set download

TANL Dependency Annotated Corpus (Test set): isst_test.evalita.gz
TANL Dependency Annotated Corpus (Gold set): isst_gold.evalita.gz
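
With both the blind test set and the gold set available, a system's output can be compared against the gold annotation. The sketch below computes unlabeled and labeled attachment scores, the measures commonly used for dependency parsing. This page does not name the official metric or scorer, so this is only an illustration under stated assumptions: both files follow the CoNLL-X column order (HEAD in column 7, DEPREL in column 8), they are aligned line by line, and every token is scored, punctuation included. The function name attachment_scores and the toy data are hypothetical.

```python
from typing import Iterable, Tuple


def attachment_scores(
    gold_lines: Iterable[str], system_lines: Iterable[str]
) -> Tuple[float, float]:
    """Return (UAS, LAS) for two line-aligned CoNLL-X style inputs.

    Assumption: column 7 (index 6) is HEAD and column 8 (index 7) is DEPREL.
    """
    total = 0
    head_ok = 0
    head_and_label_ok = 0
    for gold, system in zip(gold_lines, system_lines):
        if not gold.strip():
            continue  # blank line: sentence boundary, nothing to score
        gold_cols = gold.rstrip("\n").split("\t")
        sys_cols = system.rstrip("\n").split("\t")
        total += 1
        if gold_cols[6] == sys_cols[6]:
            head_ok += 1
            if gold_cols[7] == sys_cols[7]:
                head_and_label_ok += 1
    if total == 0:
        return 0.0, 0.0
    return head_ok / total, head_and_label_ok / total


# Invented toy data (not corpus data): four tokens.
gold = [
    "1\tIl\t_\t_\t_\t_\t2\tdet\t_\t_",
    "2\tcane\t_\t_\t_\t_\t3\tsubj\t_\t_",
    "3\tdorme\t_\t_\t_\t_\t0\tROOT\t_\t_",
    "4\t.\t_\t_\t_\t_\t3\tpunc\t_\t_",
]
system = [
    "1\tIl\t_\t_\t_\t_\t2\tdet\t_\t_",    # head and label correct
    "2\tcane\t_\t_\t_\t_\t3\tobj\t_\t_",   # head correct, label wrong
    "3\tdorme\t_\t_\t_\t_\t0\tROOT\t_\t_", # head and label correct
    "4\t.\t_\t_\t_\t_\t2\tpunc\t_\t_",     # head wrong
]

uas, las = attachment_scores(gold, system)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

An official scorer may differ from this sketch, for instance in how it treats punctuation or misaligned files, so scores computed this way should be read as approximate.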

Acknowledgements

Giuseppe Attardi, Maria Simi, Eva Maria Vecchi, Simone Marchi, Antonio Fuschetto, Francesco Tamberi

References

[1] G. Attardi et al. 2008. Tanl (Text Analytics and Natural Language processing). Project Analisi di Testi per il Semantic Web e il Question Answering, http://medialab.di.unipi.it/wiki/SemaWiki.
[2] S. Montemagni, M. Simi 2007. The Italian dependency annotated corpus developed for the CoNLL-2007 Shared Task. ILC Technical Report, January 2007, available at http://www.ilc.cnr.it/tressi_prg/ISST@CoNNL2007/ISST/ISST@CoNNL2007.pdf
[3] S. Montemagni et al. 2003. Building the Italian Syntactic-Semantic Treebank. In Abeillé (ed.), Building and Using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, 189–210.

Contacts

Felice Dell'Orletta
Dipartimento di Informatica,
Università di Pisa
e-mail felice.dellorletta@ilc.cnr.it
tel. +39 050 3152847

Alessandro Lenci
Dipartimento di Linguistica,
Università di Pisa
e-mail alessandro.lenci@ilc.cnr.it
tel. +39 050 2215638

Simonetta Montemagni
Istituto di Linguistica Computazionale
Area della ricerca di Pisa - CNR
e-mail simonetta.montemagni@ilc.cnr.it
tel. +39 050 3152850