.. currentmodule:: seacorenlp.parsing.dependency

#########################
Dependency Parsing Module
#########################

Dependency parsing is the task of analyzing the syntactic structure of a sentence under the framework of dependency grammar. It establishes relationships between head words and the words that depend on them.

*****************************
seacorenlp.parsing.dependency
*****************************

.. automodule:: seacorenlp.parsing.dependency

.. autosummary::

   DependencyParser

******************
Dependency Parsers
******************

Model Performance
=================

.. csv-table::
   :file: tables/dependency-parsers.csv
   :header-rows: 1

.. [#f1] The scores displayed here under ``UAS`` and ``LAS`` for the Malaya
   models are reported as ``Arc Accuracy`` and ``Types Accuracy`` in the
   official `Malaya documentation <https://malaya.readthedocs.io/>`_. We
   believe that these metrics correspond and have therefore reported them as
   ``UAS`` and ``LAS`` in order to standardize the way we report metrics, but
   it is unclear what the author of the documentation meant exactly by these
   terms.

.. autoclass:: DependencyParser
   :members: from_library, from_pretrained

Keyword arguments for DependencyParser.from_library
=====================================================

If choosing ``malaya``:

* ``model`` - Transformer model to use (Default = ``alxlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``
  * ``alxlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``DependencyParser`` instance returned by ``DependencyParser.from_library``
or ``DependencyParser.from_pretrained`` is an instance of the ``Predictor``
class from the AllenNLP library. It first segments the text into sentences and
then parses each sentence.

.. class:: allennlp.predictors.Predictor

   .. method:: predict(text)

      :param text: Text to predict on
      :type text: str
      :return: A list of sentences, where each sentence is a list of tuples
         in the format (Token Text, Head, Dependency Relation)
      :rtype: List[List[Tuple[str, int, str]]]

*********************
Command Line Training
*********************

The architecture for training dependency parsers in SEACoreNLP is based on
Dozat and Manning's `Bi-LSTM model with Deep Biaffine Attention
<https://arxiv.org/abs/1611.01734>`_. It comprises the following layers:

Embeddings (Word + POS) > Bi-LSTM Encoder > Biaffine Attention

The embeddings can either be trained from scratch or taken from a pretrained
transformer model from Huggingface.

The general script to run to train a dependency parser with this architecture
is as follows:

.. code-block:: shell

   seacorenlp train --task=dependency \
                    train_data_path=PATH \
                    validation_data_path=PATH \
                    [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in
the following table:

.. csv-table::
   :file: tables/dp-arguments.csv
   :header-rows: 1

For reference, the default parameters provided by AllenNLP (in accordance with
Dozat and Manning's paper) are as follows:

* ``embedding_dim`` = 100
* ``pos_tag_embedding_dim`` = 100
* ``lstm_input_dim`` = 200
* ``lstm_hidden_dim`` = 400
* ``lstm_layers`` = 3
* ``lstm_dropout`` = 0.3
* ``batch_size`` = 128
* ``lr`` = 0.001

DatasetReaders for DependencyParsers
======================================

When training, the models read in the data using an AllenNLP ``DatasetReader``
and accept only a particular data format. We currently use the default
``DatasetReader`` provided by AllenNLP for dependency parsing. It expects the
Universal Dependencies (CoNLL-U) format.
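As a rough sketch, this reader can also be used directly to inspect a training
file. The snippet below assumes the ``allennlp-models`` package is installed
and that it exposes ``UniversalDependenciesDatasetReader`` under the import
path shown, which may vary across AllenNLP versions:

.. code-block:: python

   # Sketch only: this import path comes from allennlp-models 2.x and is
   # an assumption, not part of the SEACoreNLP API.
   from allennlp_models.structured_prediction import (
       UniversalDependenciesDatasetReader,
   )

   reader = UniversalDependenciesDatasetReader()

   # read() yields one AllenNLP Instance per sentence in the .conllu file.
   for instance in reader.read("path/to/train.conllu"):
       print(instance["words"])         # the tokens of the sentence
       print(instance["head_indices"])  # the head position of each token
       print(instance["head_tags"])     # the dependency relation labels
       break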
An example from the `UD-ID-GSD
<https://github.com/UniversalDependencies/UD_Indonesian-GSD>`_ dataset is
shown here:

.. code-block::

   # sent_id = dev-s1
   # text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
   1   Ahli       ahli       PROPN   NSD   Number=Sing   4   nsubj      _   MorphInd=^ahli_NSD$
   2   rekayasa   rekayasa   NOUN    NSD   Number=Sing   1   compound   _   MorphInd=^rekayasa_NSD$
   3   optik      optik      NOUN    NSD   Number=Sing   2   compound   _   MorphInd=^optik_NSD$

Training Details for Native Models
==================================

All our native models are trained, validated and tested on the official
train/validation/test splits provided by Universal Dependencies. For Thai,
however, we performed a random 90:10 split into train and test sets; the Thai
models were trained on the train set and validated on the test set.
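For illustration, a minimal sketch of such a 90:10 sentence-level split over a
CoNLL-U file might look like the following (the file names and the fixed seed
are hypothetical; this is not necessarily the exact procedure used for our
Thai models):

.. code-block:: python

   import random

   # Read the corpus and split it into sentence blocks, which are
   # separated by blank lines in the CoNLL-U format.
   with open("th_corpus.conllu", encoding="utf-8") as f:
       sentences = [s for s in f.read().strip().split("\n\n") if s]

   random.seed(42)  # fixed seed so the split is reproducible
   random.shuffle(sentences)

   cutoff = int(len(sentences) * 0.9)
   splits = {"train.conllu": sentences[:cutoff],
             "test.conllu": sentences[cutoff:]}

   for path, subset in splits.items():
       with open(path, "w", encoding="utf-8") as f:
           f.write("\n\n".join(subset) + "\n")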