Quickstart¶
There are six main tasks supported by SEACoreNLP at the moment, namely:
Word Tokenization
Sentence Segmentation
Part-of-speech Tagging
Named Entity Recognition
Constituency Parsing
Dependency Parsing
SEACoreNLP provides classes that can perform each of these tasks.
Classes¶
Segmentation Tasks¶
For tokenization and sentence segmentation, we provide the Tokenizer and SentenceSplitter classes.
As we do not provide natively trained segmenters for now, the only segmenters available come from third-party libraries.
To instantiate a segmenter, use the from_library method if you have a specific library in mind, or the from_default method to use the default segmenter for a given language.
from seacorenlp.data.tokenizers import Tokenizer, SentenceSplitter
text = 'ผมอยากกินข้าว'
# Default Tokenizer
tokenizer = Tokenizer.from_default('th')
tokenizer.tokenize(text)
# Output: [ผม, อยาก, กิน, ข้าว]
# Specific Tokenizer
tokenizer = Tokenizer.from_library('pythainlp', engine='newmm')
tokenizer.tokenize(text)
# Output: [ผม, อยาก, กินข้าว]
longer_text = 'Tôi muốn ăn cơm. Chị muốn đi du lịch.'
# Default SentenceSplitter
splitter = SentenceSplitter.from_default('vi')
splitter.split_sentences(longer_text)
# Output: ['Tôi muốn ăn cơm.', 'Chị muốn đi du lịch.']
Tagging & Parsing Tasks¶
For tagging (POS, NER) and parsing (constituency, dependency) tasks, we provide natively trained models as well as third-party models. These can be used by instantiating the relevant class with the from_pretrained and from_library methods respectively.
from seacorenlp.tagging import POSTagger
th_text = 'ผมอยากกินข้าว'
# Native Models
native_tagger = POSTagger.from_pretrained('pos-th-ud-xlmr')
native_tagger.predict(th_text)
# Output: [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]
# External Models
# Include keyword arguments as necessary (see respective class documentation)
external_tagger = POSTagger.from_library('pythainlp', corpus='orchid')
external_tagger.predict(th_text)
# Output: [('ผม', 'PPRS'), ('อยาก', 'XVMM'), ('กิน', 'VACT'), ('ข้าว', 'NCMN')]
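As the outputs above show, predict returns a plain list of (token, tag) tuples, so the results can be filtered with ordinary Python. As a minimal sketch (using the documented output of the native tagger above, not a live model call):

```python
# The native tagger's documented output for 'ผมอยากกินข้าว'
ud_tags = [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]

# Collect every token carrying a given Universal Dependencies tag
nouns = [token for token, tag in ud_tags if tag == 'NOUN']
print(nouns)  # ['ข้าว']
```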
For the full list of models available, refer to Model Performance.
Here are some examples for each task:
from seacorenlp.tagging import POSTagger, NERTagger
from seacorenlp.parsing import ConstituencyParser, DependencyParser
# POS Tagging
pos_text = 'Tôi muốn ăn cơm.'
pos_tagger = POSTagger.from_pretrained('pos-vi-ud-xlmr')
pos_tagger.predict(pos_text)
# Output: [('Tôi', 'PROPN'), ('muốn', 'VERB'), ('ăn', 'VERB'), ('cơm', 'NOUN'), ('.', 'PUNCT')]
# NER
ner_text = 'Thủ tướng Trung Quốc Ôn Gia Bảo đã đến thăm Việt Nam vào năm 2004.'
ner_tagger = NERTagger.from_library('underthesea')
ner_tagger.predict(ner_text)
# Output:
# [('Thủ tướng', 'O'),
# ('Trung Quốc', 'B-LOC'),
# ('Ôn', 'B-PER'),
# ('Gia Bảo', 'I-PER'),
# ('đã', 'O'),
# ('đến', 'O'),
# ('thăm', 'O'),
# ('Việt Nam', 'B-LOC'),
# ('vào', 'O'),
# ('năm', 'O'),
# ('2004', 'O'),
# ('.', 'O')]
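The NER output follows the BIO scheme (B- marks the start of an entity, I- a continuation, O a non-entity token). A small helper, sketched here against the documented output above rather than a live model call, can group the tagged tokens into entity spans:

```python
def bio_to_entities(tagged):
    """Group (token, BIO-tag) pairs into (entity text, label) spans."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in tagged:
        if tag.startswith('B-'):
            # A new entity begins; flush any entity in progress
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens:
            current_tokens.append(token)
        else:
            # An O tag ends any entity in progress
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_label))
    return entities

ner_output = [('Thủ tướng', 'O'), ('Trung Quốc', 'B-LOC'), ('Ôn', 'B-PER'),
              ('Gia Bảo', 'I-PER'), ('đã', 'O'), ('đến', 'O'), ('thăm', 'O'),
              ('Việt Nam', 'B-LOC'), ('vào', 'O'), ('năm', 'O'),
              ('2004', 'O'), ('.', 'O')]
print(bio_to_entities(ner_output))
# [('Trung Quốc', 'LOC'), ('Ôn Gia Bảo', 'PER'), ('Việt Nam', 'LOC')]
```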
# Constituency Parsing
const_text = 'Saya pergi ke sekolah'
const_parser = ConstituencyParser.from_pretrained('cp-id-kethu-benepar-xlmr-best')
trees = const_parser.predict(const_text)
print(trees[0])
# Output:
# (TOP
# (S
# (NP-SBJ (PRP Saya))
# (VP (VB pergi) (PP (IN ke) (NP (NN sekolah))))))
# Dependency Parsing
dep_text = 'Saya pergi ke sekolah'
dep_parser = DependencyParser.from_pretrained('dp-id-ud-xlmr')
results = dep_parser.predict(dep_text)
print(results[0])
# Output: [('Saya', 2, 'nsubj'), ('pergi', 0, 'root'), ('ke', 4, 'case'), ('sekolah', 2, 'obl')]
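Judging from the example output, each dependency tuple is (token, head index, relation), where head indices appear to be 1-based token positions and 0 denotes the artificial root (this is the usual CoNLL-U convention, inferred here from the example rather than stated by the library). Under that assumption, the tuples can be resolved into readable (dependent, relation, head word) edges:

```python
# The parser's documented output for 'Saya pergi ke sekolah'
dep_output = [('Saya', 2, 'nsubj'), ('pergi', 0, 'root'),
              ('ke', 4, 'case'), ('sekolah', 2, 'obl')]

tokens = [word for word, _, _ in dep_output]
# Assumes 1-based head indices with 0 as the artificial ROOT node
edges = [(word, rel, 'ROOT' if head == 0 else tokens[head - 1])
         for word, head, rel in dep_output]
print(edges)
# [('Saya', 'nsubj', 'pergi'), ('pergi', 'root', 'ROOT'),
#  ('ke', 'case', 'sekolah'), ('sekolah', 'obl', 'pergi')]
```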