Part-of-speech Tagging Module

Part-of-speech (POS) tagging is the task of assigning a POS tag to each token in a sentence.

seacorenlp.tagging.pos

Module for Part-of-speech Tagging

POSTagger()

Base class for instantiating a specific POS tagger (an AllenNLP Predictor)

POS Taggers

Model Performance

| Language   | Package     | Model Name           | Architecture                     | Size        | Dataset   | Accuracy (%) | F1 (%) |
|------------|-------------|----------------------|----------------------------------|-------------|-----------|--------------|--------|
| Indonesian | SEACoreNLP  | pos-id-ud-xlmr-best  | XLM-R (Base) + FFNN              | 774.4MB     | UD-ID-GSD | 93.90        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-xlmr       | XLM-R (Base) + FFNN              | 47KB (FFNN) | UD-ID-GSD | 92.44        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-indobert   | IndoBERT (Base) + FFNN           | 462.1MB     | UD-ID-GSD | 91.54        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-bilstm     | Embeddings (200) + Bi-LSTM       | 16.3MB      | UD-ID-GSD | 90.19        | –      |
| Indonesian | Trankit*    | XLM-R Base           | Embeddings + Adapters + FFNN     | ?           | UD-ID-GSD | 93.57        | –      |
| Indonesian | Stanza      | –                    | word2vec/fastText + Bi-LSTM      | 17.3MB      | UD-ID-GSD | 93.40        | –      |
| Indonesian | Malaya      | XLNET                | Transformer Embedding + CRF      | 446.6MB     | UD-ID-GSD | 93.24        | –      |
| Indonesian | Malaya      | BERT                 | Transformer Embedding + CRF      | 426.4MB     | UD-ID-GSD | 93.18        | –      |
| Indonesian | Malaya      | ALXLNET              | Transformer Embedding + CRF      | 46.8MB      | UD-ID-GSD | 92.82        | –      |
| Indonesian | Malaya      | Tiny-BERT            | Transformer Embedding + CRF      | 57.7MB      | UD-ID-GSD | 92.70        | –      |
| Indonesian | Malaya      | ALBERT               | Transformer Embedding + CRF      | 48.7MB      | UD-ID-GSD | 92.55        | –      |
| Indonesian | Malaya      | Tiny-ALBERT          | Transformer Embedding + CRF      | 22.4MB      | UD-ID-GSD | 90.00        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-xlmr-best  | XLM-R (Base) + FFNN              | 755.8MB     | UD-TH-PUD | 97.20        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-xlmr       | XLM-R (Base) + FFNN              | 44KB (FFNN) | UD-TH-PUD | 92.89        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-bilstmcrf  | Embeddings (100) + Bi-LSTM + CRF | 2.1MB       | UD-TH-PUD | 89.10        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-bilstm     | Embeddings (100) + Bi-LSTM       | 2.1MB       | UD-TH-PUD | 88.48        | –      |
| Thai       | PyThaiNLP   | –                    | Averaged Perceptron              | ?           | UD-TH-PUD | 99.09        | –      |
| Thai       | PyThaiNLP   | –                    | Unigram                          | ?           | UD-TH-PUD | 93.18        | –      |
| Thai       | PyThaiNLP   | RDRPOSTagger         | RDR (Rule-based)                 | ?           | UD-TH-PUD | 93.18        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-xlmr-best  | XLM-R (Base) + FFNN              | 755.1MB     | UD-VI-VTB | 93.07        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-xlmr       | XLM-R (Base) + FFNN              | 41KB (FFNN) | UD-VI-VTB | 91.90        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-phobert    | PhoBERT (Base) + FFNN            | 438MB       | UD-VI-VTB | 92.92        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-bilstm     | Embeddings (256) + Bi-LSTM       | 8.4MB       | UD-VI-VTB | 85.21        | –      |
| Vietnamese | Trankit*    | XLM-R Base           | Embeddings + Adapters + FFNN     | ?           | UD-VI-VTB | 89.70        | –      |
| Vietnamese | Stanza      | –                    | word2vec/fastText + Bi-LSTM      | 18.1MB      | UD-VI-VTB | 79.50        | –      |
| Vietnamese | UnderTheSea | –                    | CRF                              | 2.68MB      | ?         | ?            | ?      |
| Vietnamese | VnCoreNLP*  | MarMoT               | CRF                              | 28.3MB      | VLSP 2013 | 95.88        | –      |

class seacorenlp.tagging.pos.POSTagger

Base class for instantiating a specific POS tagger (an AllenNLP Predictor)

Options for model_name:
  • E.g. pos-th-ud-xlmr

  • Refer to the model performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • stanza (For Indonesian and Vietnamese)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

Defaults available for the following languages:
  • id: Indonesian

  • ms: Malay

  • th: Thai

  • vi: Vietnamese

classmethod from_default(lang)

Returns a default Predictor based on the language specified.

Parameters

lang (str) – The 2-letter ISO 639-1 code of the desired language

Return type

Predictor
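
For illustration, a minimal usage sketch of this classmethod (the language code follows the defaults listed above):

from seacorenlp.tagging.pos import POSTagger

# Load the default POS tagger for Thai
tagger = POSTagger.from_default(lang="th")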

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided.

Parameters

model_name (str) – Name of the model

Return type

Predictor
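
For example, a minimal sketch loading one of the natively trained models listed in the performance table above:

from seacorenlp.tagging.pos import POSTagger

# Load a natively trained Indonesian model by its model name
tagger = POSTagger.from_pretrained("pos-id-ud-xlmr-best")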

Keyword arguments for POSTagger.from_library

If choosing pythainlp:
  • engine - POS tagging engine (Default = perceptron)

    • perceptron - Averaged perceptron

    • unigram - Unigram tagger

    • artagger - RDRPOSTagger (rule-based)

  • corpus - Corpus used in model training (affects tagset) (Default = orchid_ud)

    • orchid - ORCHID dataset (XPOS)

    • orchid_ud - ORCHID dataset with XPOS mapped automatically to UPOS

    • pud - Parallel Universal Dependencies dataset (UPOS)

    • lst20 - LST20 dataset (XPOS)

  • tokenizer - Tokenizer engine used to tokenize the text for the POS tagger (Default = attacut)

    • attacut - Good balance between accuracy and speed

    • newmm - Dictionary-based, uses Thai Character Clusters and maximal matching, may produce longer tokens

If choosing stanza:
  • lang - Language for stanza

    • id - Indonesian

    • vi - Vietnamese

If choosing malaya:
  • engine - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
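
As a sketch of how these keyword arguments combine (the engine, corpus and tokenizer values are all taken from the options listed above):

from seacorenlp.tagging.pos import POSTagger

# Thai tagger from PyThaiNLP: averaged perceptron trained on PUD,
# with attacut used for tokenization
thai_tagger = POSTagger.from_library(
    "pythainlp", engine="perceptron", corpus="pud", tokenizer="attacut"
)

# Malay tagger from Malaya: quantized ALXLNET transformer
malay_tagger = POSTagger.from_library("malaya", engine="alxlnet", quantized=True)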

AllenNLP Predictor

The tagger returned by POSTagger.from_default, POSTagger.from_library or POSTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.

class allennlp.predictors.Predictor

predict(text)
Parameters

text (str) – Text to predict on

Returns

List of tuples containing the token’s text and its predicted POS tag

Return type

List[Tuple[str, str]]
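
Putting the pieces together, a minimal sketch (the tags in the comment are illustrative, not guaranteed model output):

from seacorenlp.tagging.pos import POSTagger

tagger = POSTagger.from_default(lang="id")
tokens = tagger.predict("Saya suka makan nasi goreng.")
# e.g. [("Saya", "PRON"), ("suka", "VERB"), ("makan", "VERB"), ...]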

Command Line Training

SEACoreNLP provides a CLI for training POS taggers. The general command to run is as follows:

seacorenlp train --task=pos train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can or must be specified for CLI training are detailed in the following table:

| Argument       | Details                                                                        | Options                                                                  |
|----------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| model_name     | Name of the model architecture to use                                          | bi-lstm / bi-lstm-crf / Huggingface transformer name (e.g. xlm-roberta-base) |
| use_pretrained | Set this to true if using a pre-trained model from Huggingface                 | true / false                                                             |
| freeze         | Whether to freeze the pre-trained model's parameters when training             | true / false                                                             |
| embedding_dim  | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer                                                                  |
| hidden_dim     | No. of dimensions to use for the Bi-LSTM's hidden state                        | Integer                                                                  |
| num_epochs     | No. of epochs to train for                                                     | Integer                                                                  |
| batch_size     | Batch size for training                                                        | Integer                                                                  |
| patience       | Early stopping patience                                                        | Integer                                                                  |
| lr             | Learning rate                                                                  | Float (e.g. 1e-5)                                                        |
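
A hypothetical invocation combining these arguments might look as follows, assuming every argument is passed in the same key=value form as the data paths (the paths and values here are placeholders, not shipped defaults):

seacorenlp train --task=pos \
    train_data_path=data/train.conllu validation_data_path=data/dev.conllu \
    model_name=xlm-roberta-base use_pretrained=true freeze=false \
    num_epochs=10 batch_size=32 patience=3 lr=1e-5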

DatasetReaders for POS Taggers

When training, the models read in the data using an AllenNLP DatasetReader. The default reader used for POS taggers expects data in the Universal Dependencies CoNLL-U format; that is, the token text is in the second column and the UPOS tag in the fourth.

An example from the UD-ID-GSD dataset is shown here:

# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1    Ahli    ahli    PROPN   NSD     Number=Sing     4       nsubj   _       MorphInd=^ahli<n>_NSD$
2    rekayasa        rekayasa        NOUN    NSD     Number=Sing     1       compound        _       MorphInd=^rekayasa<n>_NSD$
3    optik   optik   NOUN    NSD     Number=Sing     2       compound        _       MorphInd=^optik<n>_NSD$
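
For intuition, here is a minimal sketch of reading those two columns from a CoNLL-U file (illustrative only; actual loading is handled by the AllenNLP DatasetReader):

def read_conllu_pos(path):
    """Yield (token, UPOS) pairs from a CoNLL-U file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and sentence-level comments
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
            if "-" in cols[0] or "." in cols[0]:
                continue
            # Column 2 holds the token form, column 4 the UPOS tag
            yield cols[1], cols[3]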

Training Details for Native Models

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. The exception is Thai: UD-TH-PUD does not come with official splits, so we randomly split the data 90:10 into train and test sets; Thai models were trained on the train set and validated on the test set.