Part-of-speech Tagging Module

Part-of-speech (POS) tagging is the task of assigning a POS tag to each token in a sentence.
seacorenlp.tagging.pos

Module for part-of-speech tagging.

- POSTagger – Base class to instantiate a specific POSTagger (AllenNLP Predictor)
POS Taggers

Model Performance
| Language | Package | Model Name | Architecture | Size | Dataset | Accuracy (%) | F1 (%) |
|---|---|---|---|---|---|---|---|
| Indonesian | SEACoreNLP | pos-id-ud-xlmr-best | XLM-R (Base) + FFNN | 774.4MB | UD | 93.90 | |
| Indonesian | SEACoreNLP | pos-id-ud-xlmr | XLM-R (Base) + FFNN | 47KB (FFNN) | UD | 92.44 | |
| Indonesian | SEACoreNLP | pos-id-ud-indobert | IndoBERT (Base) + FFNN | 462.1MB | UD | 91.54 | |
| Indonesian | SEACoreNLP | pos-id-ud-bilstm | Embeddings (200) + Bi-LSTM | 16.3MB | UD | 90.19 | |
| Indonesian | | XLM-R Base | Embeddings + Adapters + FFNN | ? | | 93.57 | |
| Indonesian | | | word2vec/fastText + Bi-LSTM | 17.3MB | | 93.40 | |
| Indonesian | Malaya | XLNET | Transformer Embedding + CRF | 446.6MB | | 93.24 | |
| Indonesian | Malaya | BERT | Transformer Embedding + CRF | 426.4MB | | 93.18 | |
| Indonesian | Malaya | ALXLNET | Transformer Embedding + CRF | 46.8MB | | 92.82 | |
| Indonesian | Malaya | Tiny-BERT | Transformer Embedding + CRF | 57.7MB | | 92.70 | |
| Indonesian | Malaya | ALBERT | Transformer Embedding + CRF | 48.7MB | | 92.55 | |
| Indonesian | Malaya | Tiny-ALBERT | Transformer Embedding + CRF | 22.4MB | | 90.00 | |
| Thai | SEACoreNLP | pos-th-ud-xlmr-best | XLM-R (Base) + FFNN | 755.8MB | UD | 97.20 | |
| Thai | SEACoreNLP | pos-th-ud-xlmr | XLM-R (Base) + FFNN | 44KB (FFNN) | UD | 92.89 | |
| Thai | SEACoreNLP | pos-th-ud-bilstmcrf | Embeddings (100) + Bi-LSTM + CRF | 2.1MB | UD | 89.10 | |
| Thai | SEACoreNLP | pos-th-ud-bilstm | Embeddings (100) + Bi-LSTM | 2.1MB | UD | 88.48 | |
| Thai | PyThaiNLP | Averaged Perceptron | | ? | | 99.09 | |
| Thai | PyThaiNLP | Unigram | | ? | | 93.18 | |
| Thai | PyThaiNLP | RDRPOSTagger | RDR (Rule-based) | ? | | 93.18 | |
| Vietnamese | SEACoreNLP | pos-vi-ud-xlmr-best | XLM-R (Base) + FFNN | 755.1MB | UD | 93.07 | |
| Vietnamese | SEACoreNLP | pos-vi-ud-xlmr | XLM-R (Base) + FFNN | 41KB (FFNN) | UD | 91.90 | |
| Vietnamese | SEACoreNLP | pos-vi-ud-phobert | PhoBERT (Base) + FFNN | 438MB | UD | 92.92 | |
| Vietnamese | SEACoreNLP | pos-vi-ud-bilstm | Embeddings (256) + Bi-LSTM | 8.4MB | UD | 85.21 | |
| Vietnamese | | XLM-R Base | Embeddings + Adapters + FFNN | ? | | 89.70 | |
| Vietnamese | | | word2vec/fastText + Bi-LSTM | 18.1MB | | 79.50 | |
| Vietnamese | underthesea | | CRF | 2.68MB | ? | ? | ? |
| Vietnamese | | MarMoT | CRF | 28.3MB | VLSP 2013 | 95.88 | |
class seacorenlp.tagging.pos.POSTagger

Base class to instantiate a specific POSTagger (AllenNLP Predictor).

- Options for `model_name`:
  - e.g. `pos-th-ud-xlmr`
  - Refer to the table containing POS tagger performance above for the full list
- Options for `library_name`:
  - `malaya` (for Indonesian/Malay)
  - `stanza` (for Indonesian and Vietnamese)
  - `pythainlp` (for Thai)
  - `underthesea` (for Vietnamese)
- Defaults available for the following languages:
  - `id`: Indonesian
  - `ms`: Malay
  - `th`: Thai
  - `vi`: Vietnamese
classmethod from_default(lang)

Returns a default Predictor based on the language specified.

- Parameters:
  - lang (`str`) – The 2-letter ISO 639-1 code of the desired language
- Return type: Predictor
classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided. Keyword arguments can be passed as necessary.

- Parameters:
  - library_name (`str`) – Name of the third-party library
  - **kwargs – Additional keyword arguments specific to each library
- Return type: Predictor
Keyword arguments for POSTagger.from_library

- If choosing `pythainlp`:
  - `engine` – POS tagging engine (Default = `perceptron`)
    - `perceptron` – Averaged perceptron
    - `unigram` – Unigram tagger
    - `artagger` – RDRPOSTagger (rule-based)
  - `corpus` – Corpus used in model training (affects tagset) (Default = `orchid_ud`)
    - `orchid` – ORCHID dataset (XPOS)
    - `orchid_ud` – ORCHID dataset with XPOS mapped automatically to UPOS
    - `pud` – Parallel Universal Dependencies dataset (UPOS)
    - `lst20` – LST20 dataset (XPOS)
  - `tokenizer` – Tokenizer engine used to tokenize text for the POS tagger (Default = `attacut`)
    - `attacut` – Good balance between accuracy and speed
    - `newmm` – Dictionary-based; uses Thai Character Clusters and maximal matching; may produce longer tokens
- If choosing `stanza`:
  - `lang` – Language for Stanza
    - `id` – Indonesian
    - `vi` – Vietnamese
- If choosing `malaya`:
  - `engine` – Transformer model to use (Default = `alxlnet`)
    - `bert`
    - `tiny-bert`
    - `albert`
    - `tiny-albert`
    - `xlnet`
    - `alxlnet`
  - `quantized` – Boolean for whether to use a quantized transformer (Default = `False`)
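To make the keyword-argument behaviour concrete, the sketch below shows how library-specific defaults can be merged with user-supplied overrides before being forwarded to a third-party tagger. This is an illustrative pattern only, not the actual seacorenlp implementation; the `stanza` default of `"id"` is an assumption.

```python
# Illustrative sketch (NOT the real seacorenlp internals) of how
# from_library-style keyword arguments can be resolved against the
# per-library defaults documented above.
LIBRARY_DEFAULTS = {
    "pythainlp": {"engine": "perceptron", "corpus": "orchid_ud", "tokenizer": "attacut"},
    "stanza": {"lang": "id"},  # assumed default, not documented above
    "malaya": {"engine": "alxlnet", "quantized": False},
}

def resolve_library_kwargs(library_name, **kwargs):
    """Merge documented defaults with user overrides for a given library."""
    if library_name not in LIBRARY_DEFAULTS:
        raise ValueError(f"Unsupported library: {library_name!r}")
    # User-supplied keyword arguments take precedence over the defaults.
    return {**LIBRARY_DEFAULTS[library_name], **kwargs}
```

For example, `resolve_library_kwargs("pythainlp", engine="unigram")` keeps the default `corpus` and `tokenizer` while switching the engine.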
AllenNLP Predictor

The POSTagger instance returned by POSTagger.from_library, POSTagger.from_default or POSTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.
Command Line Training

SEACoreNLP provides a CLI for training POS taggers. The general script to run is as follows:

```
seacorenlp train --task=pos train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]
```
The arguments that can/need to be specified for CLI training are detailed in the following table:

| Argument | Details | Options |
|---|---|---|
| model_name | Name of model architecture to use | |
| use_pretrained | If using a pre-trained model from Huggingface, set this to `True` | |
| freeze | Whether to freeze the pre-trained model's parameters when training | |
| embedding_dim | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer |
| hidden_dim | No. of dimensions to use for the Bi-LSTM's hidden state | Integer |
| num_epochs | No. of epochs to train for | Integer |
| batch_size | Batch size for training | Integer |
| patience | Early stopping patience | Integer |
| lr | Learning rate | Float |
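Putting these arguments together, a training invocation might look like the following sketch. The data paths and argument values shown here (including the `model_name` value) are illustrative assumptions, not values confirmed by this page:

```shell
seacorenlp train --task=pos \
    train_data_path=data/train.conllu \
    validation_data_path=data/dev.conllu \
    model_name=bilstm \
    embedding_dim=200 \
    hidden_dim=256 \
    num_epochs=20 \
    batch_size=32 \
    patience=5 \
    lr=0.001
```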
DatasetReaders for POS Taggers

When training, the models read in the data using an AllenNLP DatasetReader. The default reader for POS taggers expects data in the Universal Dependencies (CoNLL-U) format: the tokens are in the second column and the UPOS tags are in the fourth column. An example from the UD-ID-GSD dataset is shown here:
```
# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1	Ahli	ahli	PROPN	NSD	Number=Sing	4	nsubj	_	MorphInd=^ahli<n>_NSD$
2	rekayasa	rekayasa	NOUN	NSD	Number=Sing	1	compound	_	MorphInd=^rekayasa<n>_NSD$
3	optik	optik	NOUN	NSD	Number=Sing	2	compound	_	MorphInd=^optik<n>_NSD$
```
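The column convention described above can be sketched with a few lines of plain Python. This is a minimal illustration of how such a file is parsed (FORM in column 2, UPOS in column 4), not the actual AllenNLP DatasetReader:

```python
# Minimal sketch of reading (token, UPOS) pairs from CoNLL-U lines.
# Comment lines start with "#", token lines are tab-separated, and a
# blank line ends a sentence. Not the real AllenNLP reader.
def read_ud_sentences(lines):
    sentences = []
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):      # metadata, e.g. "# sent_id = dev-s1"
            continue
        if not line.strip():          # blank line terminates a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split("\t")
        tokens.append(cols[1])        # column 2: FORM (the token)
        tags.append(cols[3])          # column 4: UPOS
    if tokens:                        # flush the final sentence
        sentences.append((tokens, tags))
    return sentences
```

Applied to the three example lines above, this yields the tokens `["Ahli", "rekayasa", "optik"]` with tags `["PROPN", "NOUN", "NOUN"]`.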
Training Details for Native Models

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. For Thai, however, we made a random 90:10 split into train and test sets: Thai models were trained on the train set and validated on the test set.
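A 90:10 random split like the one described for Thai can be sketched as follows. The seed and procedure here are assumptions for illustration, not the exact split used by SEACoreNLP:

```python
import random

def train_test_split(sentences, test_ratio=0.1, seed=0):
    """Illustrative random split (e.g. 90:10 when test_ratio=0.1)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # shuffle a copy, deterministically
    cut = int(len(items) * (1 - test_ratio))
    return items[:cut], items[cut:]
```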