.. currentmodule:: seacorenlp.tagging.pos

#############################
Part-of-speech Tagging Module
#############################

Part-of-speech (POS) tagging is the task of assigning a POS tag to each token in a sentence.

**********************
seacorenlp.tagging.pos
**********************

.. automodule:: seacorenlp.tagging.pos

.. autosummary::

   POSTagger

***********
POS Taggers
***********

Model Performance
=================

.. csv-table::
   :file: ../intro/tables/pos-taggers.csv
   :header-rows: 1

.. autoclass:: POSTagger
   :members: from_default, from_library, from_pretrained

Keyword arguments for POSTagger.from_library
============================================

If choosing ``pythainlp``:

* ``engine`` - POS tagging engine (Default = ``perceptron``)

  * ``perceptron`` - Averaged perceptron
  * ``unigram`` - Unigram tagger
  * ``artagger`` - RDRPOSTagger (rule-based)

* ``corpus`` - Corpus used in model training (affects the tagset) (Default = ``orchid_ud``)

  * ``orchid`` - ORCHID dataset (XPOS)
  * ``orchid_ud`` - ORCHID dataset with XPOS mapped automatically to UPOS
  * ``pud`` - Parallel Universal Dependencies dataset (UPOS)
  * ``lst20`` - LST20 dataset (XPOS)

* ``tokenizer`` - Tokenizer engine used to tokenize text for the POS tagger (Default = ``attacut``)

  * ``attacut`` - Good balance between accuracy and speed
  * ``newmm`` - Dictionary-based; uses Thai Character Clusters and maximal matching; may produce longer tokens

If choosing ``stanza``:

* ``lang`` - Language for stanza

  * ``id`` - Indonesian
  * ``vi`` - Vietnamese

If choosing ``malaya``:

* ``engine`` - Transformer model to use (Default = ``alxlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``
  * ``alxlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``POSTagger`` instance returned by ``POSTagger.from_default``, ``POSTagger.from_library`` or ``POSTagger.from_pretrained`` is an instance of the ``Predictor`` class from the AllenNLP library.

.. class:: allennlp.predictors.Predictor

   .. method:: predict(text)

      :param text: Text to predict on
      :type text: str
      :return: List of tuples containing each token's text and its predicted POS tag
      :rtype: List[Tuple[str, str]]
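The sketch below shows how the pieces above fit together. It is illustrative only: it assumes that ``from_library`` takes the library name as its first argument and forwards the keyword arguments listed above, and the Thai sample sentence and its output are not taken from the library's documentation.

.. code-block:: python

   from seacorenlp.tagging.pos import POSTagger

   # Assumption: the library name is passed as the first argument and the
   # keyword arguments listed above are forwarded to the underlying library.
   tagger = POSTagger.from_library(
       "pythainlp",
       engine="perceptron",
       corpus="orchid_ud",
       tokenizer="attacut",
   )

   # predict() returns a list of (token, POS tag) tuples.
   for token, tag in tagger.predict("แมวกินปลา"):  # "The cat eats fish."
       print(token, tag)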
*********************
Command Line Training
*********************

SEACoreNLP provides a CLI for training POS taggers. The general command is as follows:

.. code-block:: shell

   seacorenlp train --task=pos \
       train_data_path=PATH \
       validation_data_path=PATH \
       [ARGUMENTS ...]

The arguments that can or must be specified for CLI training are detailed in the following table:

.. csv-table::
   :file: tables/pos-arguments.csv
   :header-rows: 1

DatasetReaders for POS Taggers
==============================

During training, the models read in the data using an AllenNLP ``DatasetReader``. The default reader for POS taggers expects data in the Universal Dependencies (CoNLL-U) format, i.e. the tokens are in the second column and the UPOS tags are in the fourth column. An example from the UD-ID-GSD dataset is shown here:

.. code-block::

   # sent_id = dev-s1
   # text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
   1 Ahli ahli PROPN NSD Number=Sing 4 nsubj _ MorphInd=^ahli_NSD$
   2 rekayasa rekayasa NOUN NSD Number=Sing 1 compound _ MorphInd=^rekayasa_NSD$
   3 optik optik NOUN NSD Number=Sing 2 compound _ MorphInd=^optik_NSD$

Training Details for Native Models
==================================

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. For Thai, however, we performed a random 90:10 split into train and test sets; the Thai models were trained on the train set and validated on the test set.
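For concreteness, the following sketch shows one way such a sentence-level 90:10 split could be produced from a single CoNLL-U file. The file names and the fixed random seed are assumptions made for illustration; it does not reproduce the exact split used for the released Thai models.

.. code-block:: python

   import random
   from pathlib import Path

   # Hypothetical file names; the actual Thai corpus files are not named here.
   source = Path("th-corpus.conllu")

   # CoNLL-U sentences are separated by blank lines.
   sentences = [s for s in source.read_text(encoding="utf-8").strip().split("\n\n") if s]

   random.seed(0)            # fixed seed for a reproducible split
   random.shuffle(sentences)

   cutoff = int(len(sentences) * 0.9)  # 90:10 train/test split
   Path("th-train.conllu").write_text("\n\n".join(sentences[:cutoff]) + "\n", encoding="utf-8")
   Path("th-test.conllu").write_text("\n\n".join(sentences[cutoff:]) + "\n", encoding="utf-8")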