EVALITA 2009 - Italian Parsing Task
Dependency Parsing Track: Pilot Subtask
Dipartimento di Informatica, Università di Pisa
Dipartimento di Linguistica, Università di Pisa
Istituto di Linguistica Computazionale (ILC) - CNR
Corpus summary
The pilot dependency subtask (pilotDepPar) uses the TANL dependency
annotated corpus as its development set. The corpus was jointly
developed by the Istituto di Linguistica Computazionale (ILC-CNR) and
the University of Pisa within the project "Analisi di Testi per il
Semantic Web e il Question Answering" [1]. It originated as a revision
of the ISST-CoNLL corpus [2], used in the multilingual track of the
CoNLL-2007 shared task, which was in turn built from the Italian
Syntactic-Semantic Treebank [3], specifically from its morpho-syntactic
and syntactic dependency annotation levels.
Evaluation
The evaluation will be based on three data sets:
- Training Corpus (TrainSet-pilotDepPar): data annotated with the Tanl
tagset, to be used for training the systems participating in the pilot
subtask
- Development Corpus (DevSet-pilotDepPar): a smaller corpus to be used
for development
- Test Set (TestSet-pilotDepPar): blind test data for the evaluation
(available from 10 September 2009)
Corpora statistics
Training corpus
| #sentences | 2,868 |
| #tokens | 66,528 |
Development corpus
| #sentences | 241 |
| #tokens | 4,745 |
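Counts like those above can be checked after download with a short script. The following is only a minimal Python sketch, and it assumes the corpus files follow the usual CoNLL convention of one token per line with a blank line between sentences; that layout is not stated on this page. The function name count_conll and the sample data are hypothetical, and the sample's columns are placeholders rather than real Tanl annotations.

```python
from typing import Iterable, Tuple


def count_conll(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, tokens) for CoNLL-style input.

    Assumption: one token per non-blank line, and sentences are
    separated by one or more blank lines.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            # Non-blank line: one token of the current sentence.
            tokens += 1
            in_sentence = True
        elif in_sentence:
            # First blank line after tokens closes the sentence.
            sentences += 1
            in_sentence = False
    if in_sentence:
        # File ended without a trailing blank line.
        sentences += 1
    return sentences, tokens


# Hypothetical two-sentence sample; "_" marks columns left unspecified.
SAMPLE = (
    "1\tIl\t_\t_\t_\t_\t2\t_\t_\t_\n"
    "2\tcane\t_\t_\t_\t_\t3\t_\t_\t_\n"
    "3\tdorme\t_\t_\t_\t_\t0\t_\t_\t_\n"
    "4\t.\t_\t_\t_\t_\t3\t_\t_\t_\n"
    "\n"
    "1\tPiove\t_\t_\t_\t_\t0\t_\t_\t_\n"
    "2\t.\t_\t_\t_\t_\t1\t_\t_\t_\n"
)

if __name__ == "__main__":
    n_sentences, n_tokens = count_conll(SAMPLE.splitlines())
    print(n_sentences, n_tokens)
```

To run it on a real file, an open file handle can be passed in place of the sample lines, since the function accepts any iterable of strings.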
Copyright and license
The TANL Dependency annotated corpus is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
Participants will be asked to agree to the license conditions when
downloading the resource.
Resource download
TANL Dependency Annotated Corpus (Training and Development corpora):
TANL_DEP.tgz
Test set download
TANL Dependency Annotated Corpus (Test set):
isst_test.evalita.gz
TANL Dependency Annotated Corpus (Gold set):
isst_gold.evalita.gz  
Acknowledgements
Giuseppe Attardi, Maria Simi, Eva Maria Vecchi, Simone Marchi, Antonio
Fuschetto, Francesco Tamberi
References
[1] G. Attardi et al. 2008. Tanl (Text Analytics and Natural Language
processing). Project Analisi di Testi per il Semantic Web e il Question
Answering, http://medialab.di.unipi.it/wiki/SemaWiki.
[2] S. Montemagni, M. Simi. 2007. The Italian dependency annotated
corpus developed for the CoNLL-2007 Shared Task. ILC Technical Report,
January 2007, available at
http://www.ilc.cnr.it/tressi_prg/ISST@CoNNL2007/ISST/ISST@CoNNL2007.pdf
[3] S. Montemagni et al. 2003. Building the Italian Syntactic-Semantic
Treebank. In Abeillé (ed.), Building and using Parsed Corpora, Language
and Speech series, Kluwer, Dordrecht, 189–210.
Contacts