Part-of-speech Tagging Module

Part-of-speech (POS) tagging is the task of assigning a POS tag to each token in a sentence.

seacorenlp.tagging.pos

Module for Part-of-speech Tagging

POSTagger()

Base class for instantiating a specific POS tagger (an AllenNLP Predictor)

POS Taggers

Model Performance

| Language   | Package     | Model Name           | Architecture                     | Size        | Dataset   | Accuracy (%) | F1 (%) |
|------------|-------------|----------------------|----------------------------------|-------------|-----------|--------------|--------|
| Indonesian | SEACoreNLP  | pos-id-ud-xlmr-best  | XLM-R (Base) + FFNN              | 774.4MB     | UD-ID-GSD | 93.90        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-xlmr       | XLM-R (Base) + FFNN              | 47KB (FFNN) | UD-ID-GSD | 92.44        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-indobert   | IndoBERT (Base) + FFNN           | 462.1MB     | UD-ID-GSD | 91.54        | –      |
| Indonesian | SEACoreNLP  | pos-id-ud-bilstm     | Embeddings (200) + Bi-LSTM       | 16.3MB      | UD-ID-GSD | 90.19        | –      |
| Indonesian | Trankit*    | XLM-R Base           | Embeddings + Adapters + FFNN     | ?           | UD-ID-GSD | 93.57        | –      |
| Indonesian | Stanza      | –                    | word2vec/fastText + Bi-LSTM      | 17.3MB      | UD-ID-GSD | 93.40        | –      |
| Indonesian | Malaya      | XLNET                | Transformer Embedding + CRF      | 446.6MB     | UD-ID-GSD | 93.24        | –      |
| Indonesian | Malaya      | BERT                 | Transformer Embedding + CRF      | 426.4MB     | UD-ID-GSD | 93.18        | –      |
| Indonesian | Malaya      | ALXLNET              | Transformer Embedding + CRF      | 46.8MB      | UD-ID-GSD | 92.82        | –      |
| Indonesian | Malaya      | Tiny-BERT            | Transformer Embedding + CRF      | 57.7MB      | UD-ID-GSD | 92.70        | –      |
| Indonesian | Malaya      | ALBERT               | Transformer Embedding + CRF      | 48.7MB      | UD-ID-GSD | 92.55        | –      |
| Indonesian | Malaya      | Tiny-ALBERT          | Transformer Embedding + CRF      | 22.4MB      | UD-ID-GSD | 90.00        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-xlmr-best  | XLM-R (Base) + FFNN              | 755.8MB     | UD-TH-PUD | 97.20        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-xlmr       | XLM-R (Base) + FFNN              | 44KB (FFNN) | UD-TH-PUD | 92.89        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-bilstmcrf  | Embeddings (100) + Bi-LSTM + CRF | 2.1MB       | UD-TH-PUD | 89.10        | –      |
| Thai       | SEACoreNLP  | pos-th-ud-bilstm     | Embeddings (100) + Bi-LSTM       | 2.1MB       | UD-TH-PUD | 88.48        | –      |
| Thai       | PyThaiNLP   | –                    | Averaged Perceptron              | ?           | UD-TH-PUD | 99.09        | –      |
| Thai       | PyThaiNLP   | –                    | Unigram                          | ?           | UD-TH-PUD | 93.18        | –      |
| Thai       | PyThaiNLP   | RDRPOSTagger         | RDR (Rule-based)                 | ?           | UD-TH-PUD | 93.18        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-xlmr-best  | XLM-R (Base) + FFNN              | 755.1MB     | UD-VI-VTB | 93.07        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-xlmr       | XLM-R (Base) + FFNN              | 41KB (FFNN) | UD-VI-VTB | 91.90        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-phobert    | PhoBERT (Base) + FFNN            | 438MB       | UD-VI-VTB | 92.92        | –      |
| Vietnamese | SEACoreNLP  | pos-vi-ud-bilstm     | Embeddings (256) + Bi-LSTM       | 8.4MB       | UD-VI-VTB | 85.21        | –      |
| Vietnamese | Trankit*    | XLM-R Base           | Embeddings + Adapters + FFNN     | ?           | UD-VI-VTB | 89.70        | –      |
| Vietnamese | Stanza      | –                    | word2vec/fastText + Bi-LSTM      | 18.1MB      | UD-VI-VTB | 79.50        | –      |
| Vietnamese | UnderTheSea | –                    | CRF                              | 2.68MB      | ?         | ?            | ?      |
| Vietnamese | VnCoreNLP*  | MarMoT               | CRF                              | 28.3MB      | VLSP 2013 | 95.88        | –      |

class seacorenlp.tagging.pos.POSTagger

Base class for instantiating a specific POS tagger (an AllenNLP Predictor)

Options for model_name:
  • E.g. pos-th-ud-xlmr

  • Refer to the model performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • stanza (For Indonesian and Vietnamese)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

Defaults available for the following languages:
  • id: Indonesian

  • ms: Malay

  • th: Thai

  • vi: Vietnamese

classmethod from_default(lang)

Returns a default Predictor based on the language specified.

Parameters

lang (str) – The 2-letter ISO 639-1 code of the desired language

Return type

Predictor
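
For illustration, a minimal usage sketch of this classmethod (the language code follows the defaults listed above):

from seacorenlp.tagging.pos import POSTagger

# Load the default POS tagger for Thai
tagger = POSTagger.from_default(lang="th")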

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided.

Parameters

model_name (str) – Name of the model

Return type

Predictor
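
For example, a minimal sketch loading one of the natively trained models listed in the performance table above:

from seacorenlp.tagging.pos import POSTagger

# Load a natively trained Indonesian model by its model name
tagger = POSTagger.from_pretrained("pos-id-ud-xlmr-best")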

Keyword arguments for POSTagger.from_library

If choosing pythainlp:
  • engine - POS tagging engine (Default = perceptron)

    • perceptron - Averaged perceptron

    • unigram - Unigram tagger

    • artagger - RDRPOSTagger (rule-based)

  • corpus - Corpus used in model training (affects tagset) (Default = orchid_ud)

    • orchid - ORCHID dataset (XPOS)

    • orchid_ud - ORCHID dataset with XPOS mapped automatically to UPOS

    • pud - Parallel Universal Dependencies dataset (UPOS)

    • lst20 - LST20 dataset (XPOS)

  • tokenizer - Tokenizer engine used to tokenize the text for the POS tagger (Default = attacut)

    • attacut - Good balance between accuracy and speed

    • newmm - Dictionary-based, uses Thai Character Clusters and maximal matching, may produce longer tokens

If choosing stanza:
  • lang - Language for stanza

    • id - Indonesian

    • vi - Vietnamese

If choosing malaya:
  • engine - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
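
As a sketch of how these keyword arguments combine (the engine, corpus and tokenizer values are all taken from the options listed above):

from seacorenlp.tagging.pos import POSTagger

# Thai tagger from PyThaiNLP: averaged perceptron trained on PUD,
# with attacut used for tokenization
thai_tagger = POSTagger.from_library(
    "pythainlp", engine="perceptron", corpus="pud", tokenizer="attacut"
)

# Malay tagger from Malaya: quantized ALXLNET transformer
malay_tagger = POSTagger.from_library("malaya", engine="alxlnet", quantized=True)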

AllenNLP Predictor

The tagger returned by POSTagger.from_default, POSTagger.from_library or POSTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.

class allennlp.predictors.Predictor

predict(text)
Parameters

text (str) – Text to predict on

Returns

List of tuples containing the token’s text and its predicted POS tag

Return type

List[Tuple[str, str]]
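
Putting the pieces together, a minimal sketch (the tags in the comment are illustrative, not guaranteed model output):

from seacorenlp.tagging.pos import POSTagger

tagger = POSTagger.from_default(lang="id")
tokens = tagger.predict("Saya suka makan nasi goreng.")
# e.g. [("Saya", "PRON"), ("suka", "VERB"), ("makan", "VERB"), ...]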

Command Line Training

SEACoreNLP provides a CLI for training POS taggers. The general command to run is as follows:

seacorenlp train --task=pos train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can or must be specified for CLI training are detailed in the following table:

| Argument       | Details                                                                        | Options                                                                  |
|----------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| model_name     | Name of the model architecture to use                                          | bi-lstm / bi-lstm-crf / Huggingface transformer name (e.g. xlm-roberta-base) |
| use_pretrained | Set this to true if using a pre-trained model from Huggingface                 | true / false                                                             |
| freeze         | Whether to freeze the pre-trained model's parameters when training             | true / false                                                             |
| embedding_dim  | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer                                                                  |
| hidden_dim     | No. of dimensions to use for the Bi-LSTM's hidden state                        | Integer                                                                  |
| num_epochs     | No. of epochs to train for                                                     | Integer                                                                  |
| batch_size     | Batch size for training                                                        | Integer                                                                  |
| patience       | Early stopping patience                                                        | Integer                                                                  |
| lr             | Learning rate                                                                  | Float (e.g. 1e-5)                                                        |
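
A hypothetical invocation combining these arguments might look as follows, assuming every argument is passed in the same key=value form as the data paths (the paths and values here are placeholders, not shipped defaults):

seacorenlp train --task=pos \
    train_data_path=data/train.conllu validation_data_path=data/dev.conllu \
    model_name=xlm-roberta-base use_pretrained=true freeze=false \
    num_epochs=10 batch_size=32 patience=3 lr=1e-5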

DatasetReaders for POS Taggers

When training, the models read in the data using an AllenNLP DatasetReader. The default reader used for POS taggers expects data in the Universal Dependencies CoNLL-U format; that is, the token text is in the second column and the UPOS tag in the fourth.

An example from the UD-ID-GSD dataset is shown here:

# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1    Ahli    ahli    PROPN   NSD     Number=Sing     4       nsubj   _       MorphInd=^ahli<n>_NSD$
2    rekayasa        rekayasa        NOUN    NSD     Number=Sing     1       compound        _       MorphInd=^rekayasa<n>_NSD$
3    optik   optik   NOUN    NSD     Number=Sing     2       compound        _       MorphInd=^optik<n>_NSD$
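
For intuition, here is a minimal sketch of reading those two columns from a CoNLL-U file (illustrative only; actual loading is handled by the AllenNLP DatasetReader):

def read_conllu_pos(path):
    """Yield (token, UPOS) pairs from a CoNLL-U file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and sentence-level comments
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
            if "-" in cols[0] or "." in cols[0]:
                continue
            # Column 2 holds the token form, column 4 the UPOS tag
            yield cols[1], cols[3]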

Training Details for Native Models

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. The exception is Thai: UD-TH-PUD does not come with official splits, so we randomly split the data 90:10 into train and test sets; Thai models were trained on the train set and validated on the test set.