This is easy with the Brown Corpus, which ships with NLTK's nltk_data. To create data for part-of-speech tagging, just call tagged_sents(). If you pass categories, you can restrict the data to a single domain (besides news there are reviews, fiction, romance, mystery, and so on).
import nltk
from nltk.corpus import brown

# Tagged sentences from the news category of the Brown Corpus
corpus = brown.tagged_sents(categories='news')

def dataset(N=100):
    """Build (word sequence, POS tag sequence) pairs from the first N sentences."""
    d = []
    for tagged_sent in corpus[:N]:
        untagged_sent = nltk.tag.untag(tagged_sent)          # strip the tags -> list of words
        pos_sequence = [pos for (word, pos) in tagged_sent]  # keep only the tags
        d.append((untagged_sent, pos_sequence))
    return d

if __name__ == "__main__":
    data = dataset()
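As a quick check of what the corpus provides, the following sketch lists the available Brown Corpus categories and prints the first few (word, POS) pairs of a news sentence; the exact output depends on your local nltk_data installation.

from nltk.corpus import brown

# All Brown Corpus domains (news, reviews, fiction, romance, mystery, ...)
print(brown.categories())

# First few (word, POS) pairs of the first news sentence
print(brown.tagged_sents(categories='news')[0][:5])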