Dependency Parsing Module

Dependency parsing is the task of analysing the syntactic structure of a sentence under the framework of dependency grammar, establishing relationships between head words and the words that depend on them.

seacorenlp.parsing.dependency

Module for Dependency Parsing

DependencyParser()

Base class to instantiate specific DependencyParser (AllenNLP Predictor)

Dependency Parsers

Model Performance

| Language   | Package     | Model Name          | Architecture                                    | Size                       | Dataset      | UAS (%) [1] | LAS (%) |
|------------|-------------|---------------------|-------------------------------------------------|----------------------------|--------------|-------------|---------|
| Indonesian | SEACoreNLP  | dp-id-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 841.5MB                    | UD-ID-GSD    | 88.10       | 82.23   |
|            |             | dp-id-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.5MB (Classifier layers) | UD-ID-GSD    | 86.02       | 80.17   |
|            |             | dp-id-ud-indobert   | Bi-LSTM + Deep Biaffine Attention               | 67.5MB (Classifier layers) | UD-ID-GSD    | 86.67       | 81.04   |
|            |             | dp-id-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 63.3MB                     | UD-ID-GSD    | 84.23       | 78.70   |
|            | Trankit*    | XLM-R Base          | Embeddings + Adapters + Deep Biaffine Attention |                            | UD-ID-GSD    | 86.55       | 80.28   |
|            | Stanza      |                     | Bi-LSTM + Deep Biaffine Attention               | 95.3MB                     | UD-ID-GSD    | 85.17       | 79.19   |
|            | Malaya      | XLNET               |                                                 | 450.2MB                    | Augmented UD | 93.10       | 92.50   |
|            |             | ALXLNET             |                                                 | 50.0MB                     | Augmented UD | 89.40       | 88.60   |
|            |             | BERT                |                                                 | 426.0MB                    | Augmented UD | 85.50       | 84.80   |
|            |             | ALBERT              |                                                 | 50.0MB                     | Augmented UD | 81.10       | 79.30   |
|            |             | Tiny-BERT           |                                                 | 59.5MB                     | Augmented UD | 71.80       | 69.40   |
|            |             | Tiny-ALBERT         |                                                 | 24.8MB                     | Augmented UD | 70.80       | 67.30   |
| Thai       | SEACoreNLP  | dp-th-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 823.7MB                    | UD-TH-PUD    | 89.74       | 82.30   |
|            |             | dp-th-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.9MB (Classifier layers) | UD-TH-PUD    | 88.33       | 82.39   |
|            |             | dp-th-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 57.5MB                     | UD-TH-PUD    | 81.06       | 73.67   |
|            | spaCy-Thai  |                     | UDPipe                                          | 4.82MB                     | UD-TH-PUD    | ?           | ?       |
| Vietnamese | SEACoreNLP  | dp-vi-ud-xlmr-best  | Bi-LSTM + Deep Biaffine Attention               | 822.5MB                    | UD-VI-VTB    | 77.79       | 71.03   |
|            |             | dp-vi-ud-xlmr       | Bi-LSTM + Deep Biaffine Attention               | 67.3MB (Classifier layers) | UD-VI-VTB    | 77.37       | 73.65   |
|            |             | dp-vi-ud-scratch    | Bi-LSTM + Deep Biaffine Attention               | 57.2MB                     | UD-VI-VTB    | 67.56       | 63.96   |
|            | Trankit*    | XLM-R Large         | Embeddings + Adapters + Deep Biaffine Attention |                            | UD-VI-VTB    | 71.07       | 65.37   |
|            | Stanza      |                     | Bi-LSTM + Deep Biaffine Attention               | 93.1MB                     | UD-VI-VTB    | 53.63       | 48.16   |
|            | UnderTheSea |                     | Bi-LSTM + Deep Biaffine Attention               | ?                          | ?            | ?           | ?       |
|            | VnCoreNLP*  |                     | Transition-based Parser                         | 15.3MB                     | VnDT         | 79.02       | 73.39   |

[1] The scores shown here under UAS and LAS for the Malaya models are reported as "Arc Accuracy" and "Types Accuracy" in the official Malaya documentation. We believe these correspond to UAS and LAS and report them as such to standardise the way we report metrics, although the Malaya documentation does not define the terms precisely.

class seacorenlp.parsing.dependency.DependencyParser

Base class to instantiate specific DependencyParser (AllenNLP Predictor)

Options for model_name:
  • E.g. dp-th-ud-xlmr

  • Refer to table containing Dependency Parser performance for full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided

Parameters

model_name (str) – Name of the model

Return type

Predictor

Keyword arguments for DependencyParser.from_library

If choosing malaya:
  • model - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
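The following is a minimal usage sketch based on the two classmethods documented above; the model name is taken from the performance table and the keyword arguments follow the malaya list.

# Minimal usage sketch; model and library names are taken from this page.
from seacorenlp.parsing.dependency import DependencyParser

# Natively trained AllenNLP model (see the performance table for the full list of names)
parser = DependencyParser.from_pretrained("dp-th-ud-xlmr")

# Third-party parser via Malaya, using the keyword arguments listed above
malaya_parser = DependencyParser.from_library("malaya", model="xlnet", quantized=False)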

AllenNLP Predictor

The DependencyParser instance returned by DependencyParser.from_library or DependencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the text into sentences and then parses each sentence.

class allennlp.predictors.Predictor
predict(text)
Parameters

text (str) – Text to predict on

Returns

List of lists of tuples in the format (Token Text, Head, Dependency Relation), one inner list per sentence

Return type

List[List[Tuple[str, int, str]]]
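For illustration, a sketch of creating a parser and calling predict on raw text; the example sentence and printed values are illustrative, not actual model output.

# Create a parser, then parse raw text; the parser segments the text into
# sentences and parses each one. Printed values are illustrative assumptions.
from seacorenlp.parsing.dependency import DependencyParser

parser = DependencyParser.from_pretrained("dp-th-ud-xlmr")
parses = parser.predict("Some text containing one or more sentences.")
for sentence in parses:                    # one inner list per sentence
    for token_text, head, relation in sentence:
        print(token_text, head, relation)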

Command Line Training

The architecture for training dependency parsers in SEACoreNLP is based on Dozat and Manning’s Bi-LSTM model with Deep Biaffine Attention. It comprises the following layers:

Embeddings (Word + POS) > Bi-LSTM Encoder > Biaffine Attention

The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.
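For reference, Dozat and Manning's deep biaffine attention scores each candidate arc between a dependent word i and a head word j with a biaffine transform over MLP projections of the Bi-LSTM states. A sketch of the arc-scoring term (our notation, summarising the paper rather than this codebase's exact implementation):

s_arc(i, j) = h_dep(i)^T U h_head(j) + u^T h_head(j)

where h_dep(i) and h_head(j) are separate MLP projections of the Bi-LSTM output for the dependent and the candidate head. Dependency labels are scored with an analogous biaffine classifier, and the final tree is typically recovered with a maximum-spanning-tree decoder.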

The general script to run to train a dependency parser with this architecture is as follows:

seacorenlp train --task=dependency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

| Argument              | Details                                                                              | Options                                               |
|-----------------------|--------------------------------------------------------------------------------------|-------------------------------------------------------|
| model_name            | Name of the pre-trained model to use                                                 | Huggingface transformer name (e.g. xlm-roberta-base)  |
| use_pretrained        | Set to true if using a pre-trained model                                              | true / false                                          |
| freeze                | Whether to freeze the pre-trained model's parameters during training                  | true / false                                          |
| embedding_dim         | No. of dimensions for word embeddings (if not using a pre-trained model)              | Integer                                               |
| pos_tag_embedding_dim | No. of dimensions for POS tag embeddings                                              | Integer                                               |
| lstm_input_dim        | No. of dimensions of the input to the Bi-LSTM (i.e. the embedding output dimension)   | Integer                                               |
| lstm_hidden_dim       | No. of dimensions of the Bi-LSTM's hidden state                                       | Integer                                               |
| lstm_layers           | No. of Bi-LSTM layers                                                                 | Integer                                               |
| lstm_dropout          | Dropout between Bi-LSTM layers (applied if lstm_layers > 1)                           | Float                                                 |
| num_epochs            | No. of epochs to train for                                                            | Integer                                               |
| batch_size            | Batch size for training                                                               | Integer                                               |
| patience              | Early stopping patience                                                               | Integer                                               |
| lr                    | Learning rate                                                                         | Float (scientific notation such as 1e-5 is accepted)  |

For reference, the default parameters provided by AllenNLP (in accordance with Dozat and Manning’s paper) are as follows:

  • embedding_dim = 100

  • pos_tag_embedding_dim = 100

  • lstm_input_dim = 200

  • lstm_hidden_dim = 400

  • lstm_layers = 3

  • lstm_dropout = 0.3

  • batch_size = 128

  • lr = 0.001
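For example, a command along the following lines would fine-tune a parser on top of a pretrained XLM-R encoder; the file paths and argument values here are illustrative assumptions, not recommendations.

seacorenlp train --task=dependency \
    train_data_path=data/id_gsd-ud-train.conllu \
    validation_data_path=data/id_gsd-ud-dev.conllu \
    model_name=xlm-roberta-base \
    use_pretrained=true \
    freeze=false \
    pos_tag_embedding_dim=100 \
    num_epochs=50 \
    batch_size=32 \
    patience=10 \
    lr=1e-5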

DatasetReaders for DependencyParsers

When training, the models read in the data using an AllenNLP DatasetReader, which accepts only a particular data format.

We currently use the default DatasetReader provided by AllenNLP for dependency parsing. It expects the Universal Dependencies format.

An example from the UD-ID-GSD dataset is shown here:

# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1    Ahli    ahli    PROPN   NSD     Number=Sing     4       nsubj   _       MorphInd=^ahli<n>_NSD$
2    rekayasa        rekayasa        NOUN    NSD     Number=Sing     1       compound        _       MorphInd=^rekayasa<n>_NSD$
3    optik   optik   NOUN    NSD     Number=Sing     2       compound        _       MorphInd=^optik<n>_NSD$
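The snippet below is a hedged sketch of loading such a file with AllenNLP's default Universal Dependencies reader (registered as universal_dependencies); the exact import path can differ between allennlp-models versions, and the file path is illustrative.

# Sketch: read a CoNLL-U file into AllenNLP Instances.
# The import path and file name are assumptions, not taken from this page.
from allennlp_models.structured_prediction import UniversalDependenciesDatasetReader

reader = UniversalDependenciesDatasetReader()
for instance in reader.read("data/id_gsd-ud-dev.conllu"):
    print(instance)  # each Instance holds the words, POS tags, heads and relations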

Training Details for Native Models

All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. The exception is Thai, for which we made a random 90:10 split into train and test sets; the Thai models were trained on the train set and validated on the test set.