.. currentmodule:: seacorenlp.parsing.dependency

#########################
Dependency Parsing Module
#########################

Dependency parsing is the task of analyzing the syntactic structure of a sentence under the framework of dependency grammar. It establishes relationships between head words and the words that depend on them.

*****************************
seacorenlp.parsing.dependency
*****************************

.. automodule:: seacorenlp.parsing.dependency

.. autosummary::

   DependencyParser

******************
Dependency Parsers
******************

Model Performance
=================

.. csv-table::
   :file: tables/dependency-parsers.csv
   :header-rows: 1

.. [#f1] The scores displayed here under ``UAS`` and ``LAS`` for the Malaya
   models are reported as ``Arc Accuracy`` and ``Types Accuracy`` in the
   official `Malaya documentation <https://malaya.readthedocs.io/>`_. We
   believe that these metrics correspond and have therefore reported them as
   ``UAS`` and ``LAS`` in order to standardize the way we report metrics, but
   it is unclear what the author of the documentation meant exactly by these
   terms.

.. autoclass:: DependencyParser
   :members: from_library, from_pretrained

Keyword arguments for DependencyParser.from_library
=====================================================

If choosing ``malaya``:

* ``model`` - Transformer model to use (Default = ``alxlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``
  * ``alxlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``DependencyParser`` instance returned by ``DependencyParser.from_library``
or ``DependencyParser.from_pretrained`` is an instance of the ``Predictor``
class from the AllenNLP library. It first segments the text into sentences and
then parses each sentence.

.. class:: allennlp.predictors.Predictor

   .. method:: predict(text)

      :param text: Text to predict on
      :type text: str
      :return: A list of sentences, where each sentence is a list of tuples
         in the format (Token Text, Head, Dependency Relation)
      :rtype: List[List[Tuple[str, int, str]]]

*********************
Command Line Training
*********************

The architecture for training dependency parsers in SEACoreNLP is based on
Dozat and Manning's `Bi-LSTM model with Deep Biaffine Attention
<https://arxiv.org/abs/1611.01734>`_. It comprises the following layers:

Embeddings (Word + POS) > Bi-LSTM Encoder > Biaffine Attention

The embeddings can either be trained from scratch or taken from a pretrained
transformer model from Huggingface.

The general script to run to train a dependency parser with this architecture
is as follows:

.. code-block:: shell

   seacorenlp train --task=dependency \
                    train_data_path=PATH \
                    validation_data_path=PATH \
                    [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in
the following table:

.. csv-table::
   :file: tables/dp-arguments.csv
   :header-rows: 1

For reference, the default parameters provided by AllenNLP (in accordance with
Dozat and Manning's paper) are as follows:

* ``embedding_dim`` = 100
* ``pos_tag_embedding_dim`` = 100
* ``lstm_input_dim`` = 200
* ``lstm_hidden_dim`` = 400
* ``lstm_layers`` = 3
* ``lstm_dropout`` = 0.3
* ``batch_size`` = 128
* ``lr`` = 0.001

DatasetReaders for DependencyParsers
======================================

When training, the models read in the data using an AllenNLP ``DatasetReader``
and accept only a particular data format. We currently use the default
``DatasetReader`` provided by AllenNLP for dependency parsing. It expects the
Universal Dependencies (CoNLL-U) format.
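As a rough sketch, this reader can also be used directly to inspect a training
file. The snippet below assumes the ``allennlp-models`` package is installed
and that it exposes ``UniversalDependenciesDatasetReader`` under the import
path shown, which may vary across AllenNLP versions:

.. code-block:: python

   # Sketch only: this import path comes from allennlp-models 2.x and is
   # an assumption, not part of the SEACoreNLP API.
   from allennlp_models.structured_prediction import (
       UniversalDependenciesDatasetReader,
   )

   reader = UniversalDependenciesDatasetReader()

   # read() yields one AllenNLP Instance per sentence in the .conllu file.
   for instance in reader.read("path/to/train.conllu"):
       print(instance["words"])         # the tokens of the sentence
       print(instance["head_indices"])  # the head position of each token
       print(instance["head_tags"])     # the dependency relation labels
       break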
An example from the `UD-ID-GSD
<https://github.com/UniversalDependencies/UD_Indonesian-GSD>`_ dataset is
shown here:

.. code-block::

   # sent_id = dev-s1
   # text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
   1   Ahli       ahli       PROPN   NSD   Number=Sing   4   nsubj      _   MorphInd=^ahli_NSD$
   2   rekayasa   rekayasa   NOUN    NSD   Number=Sing   1   compound   _   MorphInd=^rekayasa_NSD$
   3   optik      optik      NOUN    NSD   Number=Sing   2   compound   _   MorphInd=^optik_NSD$

Training Details for Native Models
==================================

All our native models are trained, validated and tested on the official
train/validation/test splits provided by Universal Dependencies. For Thai,
however, we performed a random 90:10 split into train and test sets; the Thai
models were trained on the train set and validated on the test set.
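For illustration, a minimal sketch of such a 90:10 sentence-level split over a
CoNLL-U file might look like the following (the file names and the fixed seed
are hypothetical; this is not necessarily the exact procedure used for our
Thai models):

.. code-block:: python

   import random

   # Read the corpus and split it into sentence blocks, which are
   # separated by blank lines in the CoNLL-U format.
   with open("th_corpus.conllu", encoding="utf-8") as f:
       sentences = [s for s in f.read().strip().split("\n\n") if s]

   random.seed(42)  # fixed seed so the split is reproducible
   random.shuffle(sentences)

   cutoff = int(len(sentences) * 0.9)
   splits = {"train.conllu": sentences[:cutoff],
             "test.conllu": sentences[cutoff:]}

   for path, subset in splits.items():
       with open(path, "w", encoding="utf-8") as f:
           f.write("\n\n".join(subset) + "\n")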