Corpora

We provide an integrated tool to obtain and preprocess 35 NER corpora. The script performs the following steps:

Download of the corpora
Conversion of the corpora
Sentence splitting, tokenization, and POS tagging (if not already provided by the corpus)
Deterministic split into training, development, and test sets with a ratio of 60:10:30
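
To illustrate the last step, a deterministic 60:10:30 split could be sketched roughly as follows. This is not the tool's actual implementation: the function name `deterministic_split`, the fixed seed, and the seeded shuffle are assumptions made only for this example.

```python
import random


def deterministic_split(items, ratios=(60, 10, 30), seed=42):
    """Split items into train/dev/test parts, reproducibly.

    Hypothetical helper: ratios are integer percentages and
    must sum to 100. The seed value is an arbitrary choice.
    """
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")

    shuffled = list(items)
    # A locally seeded generator yields the same permutation for the
    # same seed and input, so repeated runs produce identical splits.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * ratios[0] // 100
    n_dev = n * ratios[1] // 100

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    # The test set takes the remainder, so no item is dropped by rounding.
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Usage: split 100 placeholder sentence IDs.
train, dev, test = deterministic_split(range(100))
print(len(train), len(dev), len(test))
```

Fixing the seed is what makes the split repeatable, so results obtained on the same corpus by different people remain comparable.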

The tool and further documentation are available in this Git repository. For further details on the corpora and the preprocessing, please consult the accompanying publication.

If you use the tool in your research, please cite us.

HUNER

An off-the-shelf tagger for biomedical entities

Corpora