Dependency Parsing Module

Dependency parsing is the task of analysing the syntactic structure of a sentence under the framework of dependency grammar, establishing relationships between head words and the words that depend on them.

seacorenlp.parsing.dependency

Module for Dependency Parsing

DependencyParser()

Base class to instantiate specific DependencyParser (AllenNLP Predictor)

Dependency Parsers

Model Performance

| Language   | Package     | Model Name          | Architecture                                    | Size                       | Dataset      | UAS (%) [1] | LAS (%) |
|------------|-------------|---------------------|-------------------------------------------------|----------------------------|--------------|-------------|---------|
| Indonesian | SEACoreNLP  | dp-id-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 841.5MB                    | UD-ID-GSD    | 88.10       | 82.23   |
|            |             | dp-id-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.5MB (Classifier layers) | UD-ID-GSD    | 86.02       | 80.17   |
|            |             | dp-id-ud-indobert   | Bi-LSTM + Deep Biaffine Attention               | 67.5MB (Classifier layers) | UD-ID-GSD    | 86.67       | 81.04   |
|            |             | dp-id-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 63.3MB                     | UD-ID-GSD    | 84.23       | 78.70   |
|            | Trankit*    | XLM-R Base          | Embeddings + Adapters + Deep Biaffine Attention |                            | UD-ID-GSD    | 86.55       | 80.28   |
|            | Stanza      |                     | Bi-LSTM + Deep Biaffine Attention               | 95.3MB                     | UD-ID-GSD    | 85.17       | 79.19   |
|            | Malaya      | XLNET               |                                                 | 450.2MB                    | Augmented UD | 93.10       | 92.50   |
|            |             | ALXLNET             |                                                 | 50.0MB                     | Augmented UD | 89.40       | 88.60   |
|            |             | BERT                |                                                 | 426.0MB                    | Augmented UD | 85.50       | 84.80   |
|            |             | ALBERT              |                                                 | 50.0MB                     | Augmented UD | 81.10       | 79.30   |
|            |             | Tiny-BERT           |                                                 | 59.5MB                     | Augmented UD | 71.80       | 69.40   |
|            |             | Tiny-ALBERT         |                                                 | 24.8MB                     | Augmented UD | 70.80       | 67.30   |
| Thai       | SEACoreNLP  | dp-th-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 823.7MB                    | UD-TH-PUD    | 89.74       | 82.30   |
|            |             | dp-th-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.9MB (Classifier layers) | UD-TH-PUD    | 88.33       | 82.39   |
|            |             | dp-th-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 57.5MB                     | UD-TH-PUD    | 81.06       | 73.67   |
|            | spaCy-Thai  |                     | UDPipe                                          | 4.82MB                     | UD-TH-PUD    | ?           | ?       |
| Vietnamese | SEACoreNLP  | dp-vi-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 822.5MB                    | UD-VI-VTB    | 77.79       | 71.03   |
|            |             | dp-vi-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.3MB (Classifier layers) | UD-VI-VTB    | 77.37       | 73.65   |
|            |             | dp-vi-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 57.2MB                     | UD-VI-VTB    | 67.56       | 63.96   |
|            | Trankit*    | XLM-R Large         | Embeddings + Adapters + Deep Biaffine Attention |                            | UD-VI-VTB    | 71.07       | 65.37   |
|            | Stanza      |                     | Bi-LSTM + Deep Biaffine Attention               | 93.1MB                     | UD-VI-VTB    | 53.63       | 48.16   |
|            | UnderTheSea |                     | Bi-LSTM + Deep Biaffine Attention               | ?                          | ?            | ?           | ?       |
|            | VnCoreNLP*  |                     | Transition-based Parser                         | 15.3MB                     | VnDT         | 79.02       | 73.39   |

[1] The scores shown here under UAS and LAS for the Malaya models are reported as "Arc Accuracy" and "Types Accuracy" in the official Malaya documentation. We believe these correspond to UAS and LAS and report them as such to standardise the way we report metrics, although the Malaya documentation does not define the terms precisely.

class seacorenlp.parsing.dependency.DependencyParser

Base class to instantiate specific DependencyParser (AllenNLP Predictor)

Options for model_name:
  • E.g. dp-th-ud-xlmr

  • Refer to table containing Dependency Parser performance for full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided

Parameters

model_name (str) – Name of the model

Return type

Predictor

Keyword arguments for DependencyParser.from_library

If choosing malaya:
  • model - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
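The following is a minimal usage sketch based on the two classmethods documented above; the model name is taken from the performance table and the keyword arguments follow the malaya list.

# Minimal usage sketch; model and library names are taken from this page.
from seacorenlp.parsing.dependency import DependencyParser

# Natively trained AllenNLP model (see the performance table for the full list of names)
parser = DependencyParser.from_pretrained("dp-th-ud-xlmr")

# Third-party parser via Malaya, using the keyword arguments listed above
malaya_parser = DependencyParser.from_library("malaya", model="xlnet", quantized=False)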

AllenNLP Predictor

The DependencyParser instance returned by DependencyParser.from_library or DependencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the text into sentences and then parses each sentence.

class allennlp.predictors.Predictor
predict(text)
Parameters

text (str) – Text to predict on

Returns

List of lists of tuples in the format (Token Text, Head, Dependency Relation), one inner list per sentence

Return type

List[List[Tuple[str, int, str]]]
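For illustration, a sketch of creating a parser and calling predict on raw text; the example sentence and printed values are illustrative, not actual model output.

# Create a parser, then parse raw text; the parser segments the text into
# sentences and parses each one. Printed values are illustrative assumptions.
from seacorenlp.parsing.dependency import DependencyParser

parser = DependencyParser.from_pretrained("dp-th-ud-xlmr")
parses = parser.predict("Some text containing one or more sentences.")
for sentence in parses:                    # one inner list per sentence
    for token_text, head, relation in sentence:
        print(token_text, head, relation)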

Command Line Training

The architecture for training dependency parsers in SEACoreNLP is based on Dozat and Manning’s Bi-LSTM model with Deep Biaffine Attention. It comprises the following layers:

Embeddings (Word + POS) > Bi-LSTM Encoder > Biaffine Attention

The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.
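For reference, Dozat and Manning's deep biaffine attention scores each candidate arc between a dependent word i and a head word j with a biaffine transform over MLP projections of the Bi-LSTM states. A sketch of the arc-scoring term (our notation, summarising the paper rather than this codebase's exact implementation):

s_arc(i, j) = h_dep(i)^T U h_head(j) + u^T h_head(j)

where h_dep(i) and h_head(j) are separate MLP projections of the Bi-LSTM output for the dependent and the candidate head. Dependency labels are scored with an analogous biaffine classifier, and the final tree is typically recovered with a maximum-spanning-tree decoder.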

The general script to run to train a dependency parser with this architecture is as follows:

seacorenlp train --task=dependency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

| Argument              | Details                                                                              | Options                                               |
|-----------------------|--------------------------------------------------------------------------------------|-------------------------------------------------------|
| model_name            | Name of the pre-trained model to use                                                 | Huggingface transformer name (e.g. xlm-roberta-base)  |
| use_pretrained        | Set to true if using a pre-trained model                                              | true / false                                          |
| freeze                | Whether to freeze the pre-trained model's parameters during training                  | true / false                                          |
| embedding_dim         | No. of dimensions for word embeddings (if not using a pre-trained model)              | Integer                                               |
| pos_tag_embedding_dim | No. of dimensions for POS tag embeddings                                              | Integer                                               |
| lstm_input_dim        | No. of dimensions of the input to the Bi-LSTM (i.e. the embedding output dimension)   | Integer                                               |
| lstm_hidden_dim       | No. of dimensions of the Bi-LSTM's hidden state                                       | Integer                                               |
| lstm_layers           | No. of Bi-LSTM layers                                                                 | Integer                                               |
| lstm_dropout          | Dropout between Bi-LSTM layers (applied if lstm_layers > 1)                           | Float                                                 |
| num_epochs            | No. of epochs to train for                                                            | Integer                                               |
| batch_size            | Batch size for training                                                               | Integer                                               |
| patience              | Early stopping patience                                                               | Integer                                               |
| lr                    | Learning rate                                                                         | Float (scientific notation such as 1e-5 is accepted)  |

For reference, the default parameters provided by AllenNLP (in accordance with Dozat and Manning’s paper) are as follows:

  • embedding_dim = 100

  • pos_tag_embedding_dim = 100

  • lstm_input_dim = 200

  • lstm_hidden_dim = 400

  • lstm_layers = 3

  • lstm_dropout = 0.3

  • batch_size = 128

  • lr = 0.001
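For example, a command along the following lines would fine-tune a parser on top of a pretrained XLM-R encoder; the file paths and argument values here are illustrative assumptions, not recommendations.

seacorenlp train --task=dependency \
    train_data_path=data/id_gsd-ud-train.conllu \
    validation_data_path=data/id_gsd-ud-dev.conllu \
    model_name=xlm-roberta-base \
    use_pretrained=true \
    freeze=false \
    pos_tag_embedding_dim=100 \
    num_epochs=50 \
    batch_size=32 \
    patience=10 \
    lr=1e-5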

DatasetReaders for DependencyParsers

When training, the models read in the data using an AllenNLP DatasetReader, which accepts only a particular data format.

We currently use the default DatasetReader provided by AllenNLP for dependency parsing. It expects the Universal Dependencies format.

An example from the UD-ID-GSD dataset is shown here:

# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1    Ahli    ahli    PROPN   NSD     Number=Sing     4       nsubj   _       MorphInd=^ahli<n>_NSD$
2    rekayasa        rekayasa        NOUN    NSD     Number=Sing     1       compound        _       MorphInd=^rekayasa<n>_NSD$
3    optik   optik   NOUN    NSD     Number=Sing     2       compound        _       MorphInd=^optik<n>_NSD$
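The snippet below is a hedged sketch of loading such a file with AllenNLP's default Universal Dependencies reader (registered as universal_dependencies); the exact import path can differ between allennlp-models versions, and the file path is illustrative.

# Sketch: read a CoNLL-U file into AllenNLP Instances.
# The import path and file name are assumptions, not taken from this page.
from allennlp_models.structured_prediction import UniversalDependenciesDatasetReader

reader = UniversalDependenciesDatasetReader()
for instance in reader.read("data/id_gsd-ud-dev.conllu"):
    print(instance)  # each Instance holds the words, POS tags, heads and relations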

Training Details for Native Models

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. The exception is Thai, for which we made a random 90:10 split into train and test sets; the Thai models were trained on the train set and validated on the test set.