Corpora

We provide an integrated tool to obtain and preprocess 35 NER corpora. The script performs the following steps:
  • Download the corpora
  • Convert the corpora
  • Split sentences, tokenize and POS-tag the text (if not already provided)
  • Deterministically split each corpus into training, development and test sets with a ratio of 60:10:30
The tool and further documentation are available in this Git repository. For further details on the corpora and the preprocessing, please consult the accompanying publication.
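
To illustrate what a deterministic 60:10:30 split means in practice, here is a minimal Python sketch. It is not the tool's actual implementation: the function name `deterministic_split`, the fixed seed and the shuffle-then-slice strategy are assumptions made for this example. The tool itself may split differently (for instance by document or without shuffling).

```python
import random


def deterministic_split(sentences, ratios=(60, 10, 30), seed=42):
    """Split sentences into train/dev/test sets, reproducibly.

    Hypothetical helper for illustration only; `ratios` are integer
    percentages and `seed` is an arbitrary fixed value.
    """
    items = list(sentences)
    # A fixed seed makes the shuffle, and therefore the split, repeatable.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * ratios[0] // total
    n_dev = len(items) * ratios[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = deterministic_split(sentences)
print(len(train), len(dev), len(test))  # prints: 60 10 30
```

Because the seed is fixed, calling the function again on the same input yields the same three sets, which is what makes results comparable across experiments.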

If you use the tool in your research, please cite us.