Quickstart

SEACoreNLP currently supports six main tasks, namely:

  • Word Tokenization

  • Sentence Segmentation

  • Part-of-speech Tagging

  • Named Entity Recognition

  • Constituency Parsing

  • Dependency Parsing

SEACoreNLP provides classes that can perform each of these tasks.

Classes

Segmentation Tasks

For tokenization and sentence segmentation, we provide the Tokenizer and SentenceSplitter classes. As we do not currently provide natively trained segmenters, the only segmenters available come from third-party libraries.

To instantiate a segmenter, use the from_library method if you have a specific library in mind, or the from_default method if you would like to use the default segmenter for a given language.

from seacorenlp.data.tokenizers import Tokenizer, SentenceSplitter

text = 'ผมอยากกินข้าว'

# Default Tokenizer
tokenizer = Tokenizer.from_default('th')
tokenizer.tokenize(text)
# Output: [ผม, อยาก, กิน, ข้าว]

# Specific Tokenizer
tokenizer = Tokenizer.from_library('pythainlp', engine='newmm')
tokenizer.tokenize(text)
# Output: [ผม, อยาก, กินข้าว]

longer_text = 'Tôi muốn ăn cơm. Chị muốn đi du lịch.'

# Default SentenceSplitter
splitter = SentenceSplitter.from_default('vi')
splitter.split_sentences(longer_text)
# Output: ['Tôi muốn ăn cơm.', 'Chị muốn đi du lịch.']
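
# The two segmenters compose naturally: split a passage into sentences first,
# then tokenize each sentence. This is a sketch that assumes a default
# Vietnamese tokenizer is registered via from_default('vi') (only the Thai
# default is shown above); adjust the language code as needed.
splitter = SentenceSplitter.from_default('vi')
tokenizer = Tokenizer.from_default('vi')  # assumption: a 'vi' default exists

for sentence in splitter.split_sentences(longer_text):
    print(tokenizer.tokenize(sentence))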

Tagging & Parsing Tasks

For tagging (POS, NER) and parsing (constituency, dependency) tasks, we provide both natively trained models and third-party models. Instantiate the relevant class with the from_pretrained method for native models, or the from_library method for third-party models.

from seacorenlp.tagging import POSTagger

th_text = 'ผมอยากกินข้าว'

# Native Models
native_tagger = POSTagger.from_pretrained('pos-th-ud-xlmr')
native_tagger.predict(th_text)
# Output: [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]

# External Models
# Include keyword arguments as necessary (see respective class documentation)
external_tagger = POSTagger.from_library('pythainlp', corpus='orchid')
external_tagger.predict(th_text)
# Output: [('ผม', 'PPRS'), ('อยาก', 'XVMM'), ('กิน', 'VACT'), ('ข้าว', 'NCMN')]

For the full list of models available, refer to Model Performance.

Here are some examples for each task:

from seacorenlp.tagging import POSTagger, NERTagger
from seacorenlp.parsing import ConstituencyParser, DependencyParser

# POS Tagging

pos_text = 'Tôi muốn ăn cơm.'
pos_tagger = POSTagger.from_pretrained('pos-vi-ud-xlmr')
pos_tagger.predict(pos_text)
# Output: [('Tôi', 'PROPN'), ('muốn', 'VERB'), ('ăn', 'VERB'), ('cơm', 'NOUN'), ('.', 'PUNCT')]


# NER

ner_text = 'Thủ tướng Trung Quốc Ôn Gia Bảo đã đến thăm Việt Nam vào năm 2004.'
ner_tagger = NERTagger.from_library('underthesea')
ner_tagger.predict(ner_text)
# Output:
# [('Thủ tướng', 'O'),
#  ('Trung Quốc', 'B-LOC'),
#  ('Ôn', 'B-PER'),
#  ('Gia Bảo', 'I-PER'),
#  ('đã', 'O'),
#  ('đến', 'O'),
#  ('thăm', 'O'),
#  ('Việt Nam', 'B-LOC'),
#  ('vào', 'O'),
#  ('năm', 'O'),
#  ('2004', 'O'),
#  ('.', 'O')]
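
# The NER output uses standard BIO tags on (token, tag) pairs, so it is
# straightforward to post-process. The helper below is not part of
# SEACoreNLP; it is an illustrative function that collapses the pairs
# into (entity text, label) spans.

def extract_entities(tagged_tokens):
    entities = []
    current_tokens, current_label = [], None
    for token, tag in tagged_tokens:
        if tag.startswith('B-'):
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_label))
    return entities

extract_entities(ner_tagger.predict(ner_text))
# Output: [('Trung Quốc', 'LOC'), ('Ôn Gia Bảo', 'PER'), ('Việt Nam', 'LOC')]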


# Constituency Parsing

const_text = 'Saya pergi ke sekolah'
const_parser = ConstituencyParser.from_pretrained('cp-id-kethu-benepar-xlmr-best')
trees = const_parser.predict(const_text)
print(trees[0])
# Output:
# (TOP
#  (S
#    (NP-SBJ (PRP Saya))
#    (VP (VB pergi) (PP (IN ke) (NP (NN sekolah))))))
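
# If the returned trees are nltk.Tree objects (an assumption based on the
# bracketed output above; check the ConstituencyParser documentation),
# the usual nltk tree methods apply:
tree = trees[0]
print(tree.leaves())   # tokens of the sentence
print(tree.label())    # root label, e.g. TOP
for np in tree.subtrees(lambda t: t.label().startswith('NP')):
    print(np)          # each noun phrase subtree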


# Dependency Parsing

dep_text = 'Saya pergi ke sekolah'
dep_parser = DependencyParser.from_pretrained('dp-id-ud-xlmr')
results = dep_parser.predict(dep_text)
print(results[0])
# Output: [('Saya', 2, 'nsubj'), ('pergi', 0, 'root'), ('ke', 4, 'case'), ('sekolah', 2, 'obl')]
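
# Each triple is (token, head index, relation), with heads given as 1-based
# positions in the sentence and 0 marking the root (as in the output above).
# A short illustrative snippet that resolves each head index to its head word:
tokens = [token for token, head, rel in results[0]]
for token, head, rel in results[0]:
    head_word = 'ROOT' if head == 0 else tokens[head - 1]
    print(f'{token} --{rel}--> {head_word}')
# Output:
# Saya --nsubj--> pergi
# pergi --root--> ROOT
# ke --case--> sekolah
# sekolah --obl--> pergi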