Dependency Parsing Module¶
Dependency parsing is the task of analyzing the syntactic structure of a sentence under the framework of dependency grammar. It establishes relationships between head words and the words that depend on them.
seacorenlp.parsing.dependency¶
Module for Dependency Parsing
DependencyParser | Base class to instantiate a specific DependencyParser (AllenNLP Predictor)
Dependency Parsers¶
Model Performance¶
Language | Package | Model Name | Architecture | Size | Dataset | UAS (%) ¹ | LAS (%)
---|---|---|---|---|---|---|---
Indonesian | SEACoreNLP | dp-id-ud-xlmr-best | Bi-LSTM + Deep Biaffine Attention | 841.5MB | UD | 88.10 | 82.23
Indonesian | SEACoreNLP | dp-id-ud-xlmr | Bi-LSTM + Deep Biaffine Attention | 67.5MB (Classifier layers) | UD | 86.02 | 80.17
Indonesian | SEACoreNLP | dp-id-ud-indobert | Bi-LSTM + Deep Biaffine Attention | 67.5MB (Classifier layers) | UD | 86.67 | 81.04
Indonesian | SEACoreNLP | dp-id-ud-scratch | Bi-LSTM + Deep Biaffine Attention | 63.3MB | UD | 84.23 | 78.70
Indonesian |  | XLM-R Base | Embeddings + Adapters + Deep Biaffine Attention |  |  | 86.55 | 80.28
Indonesian |  |  | Bi-LSTM + Deep Biaffine Attention | 95.3MB |  | 85.17 | 79.19
Indonesian | Malaya | XLNET |  | 450.2MB |  | 93.10 | 92.50
Indonesian | Malaya | ALXLNET |  | 50.0MB |  | 89.40 | 88.60
Indonesian | Malaya | BERT |  | 426.0MB |  | 85.50 | 84.80
Indonesian | Malaya | ALBERT |  | 50.0MB |  | 81.10 | 79.30
Indonesian | Malaya | Tiny-BERT |  | 59.5MB |  | 71.80 | 69.40
Indonesian | Malaya | Tiny-ALBERT |  | 24.8MB |  | 70.80 | 67.30
Thai | SEACoreNLP | dp-th-ud-xlmr-best | Bi-LSTM + Deep Biaffine Attention | 823.7MB | UD | 89.74 | 82.30
Thai | SEACoreNLP | dp-th-ud-xlmr | Bi-LSTM + Deep Biaffine Attention | 67.9MB (Classifier layers) | UD | 88.33 | 82.39
Thai | SEACoreNLP | dp-th-ud-scratch | Bi-LSTM + Deep Biaffine Attention | 57.5MB | UD | 81.06 | 73.67
Thai |  |  |  | 4.82MB |  | ? | ?
Vietnamese | SEACoreNLP | dp-vi-ud-xlmr-best | Bi-LSTM + Deep Biaffine Attention | 822.5MB | UD | 77.79 | 71.03
Vietnamese | SEACoreNLP | dp-vi-ud-xlmr | Bi-LSTM + Deep Biaffine Attention | 67.3MB (Classifier layers) | UD | 77.37 | 73.65
Vietnamese | SEACoreNLP | dp-vi-ud-scratch | Bi-LSTM + Deep Biaffine Attention | 57.2MB | UD | 67.56 | 63.96
Vietnamese |  | XLM-R Large | Embeddings + Adapters + Deep Biaffine Attention |  |  | 71.07 | 65.37
Vietnamese |  |  | Bi-LSTM + Deep Biaffine Attention | 93.1MB |  | 53.63 | 48.16
Vietnamese |  |  | Bi-LSTM + Deep Biaffine Attention | ? | ? | ? | ?
Vietnamese |  |  | Transition-based Parser | 15.3MB | VnDT | 79.02 | 73.39
¹ The scores displayed here under UAS and LAS for the Malaya models are reported as "Arc Accuracy" and "Types Accuracy" in the official Malaya documentation. We believe these correspond to UAS and LAS and have reported them as such to standardize the way we report metrics, though the documentation does not define these terms precisely.
class seacorenlp.parsing.dependency.DependencyParser¶
Base class to instantiate a specific DependencyParser (AllenNLP Predictor)
- Options for model_name: e.g. dp-th-ud-xlmr. Refer to the table containing Dependency Parser performance above for the full list.
- Options for library_name:
  - malaya (for Indonesian/Malay)
  - pythainlp (for Thai)
  - underthesea (for Vietnamese)
classmethod from_library(library_name, **kwargs)¶
Returns a third-party Predictor based on the name of the library provided. Keyword arguments can be passed as necessary.

- Parameters:
  - library_name (str) – Name of third-party library
  - **kwargs – Additional keyword arguments specific to each library
- Return type: Predictor
Keyword arguments for DependencyParser.from_library¶
- If choosing malaya:
  - model – Transformer model to use (Default = alxlnet). Options: bert, tiny-bert, albert, tiny-albert, xlnet, alxlnet
  - quantized – Boolean for whether to use a quantized transformer (Default = False)
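As an illustration, here is a minimal sketch of passing these keyword arguments through from_library (assuming the malaya package is installed):

```python
from seacorenlp.parsing.dependency import DependencyParser

# Load Malaya's BERT-based dependency parser with quantization enabled;
# model and quantized are the keyword arguments documented above.
parser = DependencyParser.from_library("malaya", model="bert", quantized=True)
```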
AllenNLP Predictor¶
The DependencyParser instance returned by DependencyParser.from_library or DependencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the text into sentences and then parses each sentence.
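A minimal usage sketch (the prediction call below follows AllenNLP's generic Predictor interface; its exact signature is an assumption here):

```python
from seacorenlp.parsing.dependency import DependencyParser

# Load a native model by name (see the performance table above for the
# full list); passing the name directly to from_pretrained is an assumption.
parser = DependencyParser.from_pretrained("dp-th-ud-xlmr")

# The Predictor first segments the text into sentences, then parses each one.
output = parser.predict("ข้อความภาษาไทย")  # raw text in the model's language
```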
Command Line Training¶
The architecture for training dependency parsers in SEACoreNLP is based on Dozat and Manning’s Bi-LSTM model with Deep Biaffine Attention. It comprises the following layers:
Embeddings (Word + POS) > Bi-LSTM Encoder > Biaffine Attention
The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.
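For intuition, below is a minimal PyTorch sketch of the biaffine arc scorer that sits at the end of this pipeline. This is an illustrative rendering of Dozat and Manning's scoring function, not SEACoreNLP's actual implementation; all names are hypothetical.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Sketch of the deep biaffine arc scorer (Dozat & Manning, 2017)."""

    def __init__(self, encoder_output_dim: int, arc_dim: int = 500):
        super().__init__()
        # Separate MLPs produce "head" and "dependent" views of each word
        self.head_mlp = nn.Sequential(nn.Linear(encoder_output_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(encoder_output_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))  # biaffine weight
        self.b = nn.Parameter(torch.zeros(arc_dim))           # head-only bias

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: (batch, seq_len, encoder_output_dim), e.g. Bi-LSTM output
        h = self.head_mlp(encoded)  # head representations
        d = self.dep_mlp(encoded)   # dependent representations
        # scores[b, i, j] = d_i^T U h_j + b^T h_j: score of word j heading word i
        scores = d @ self.U @ h.transpose(1, 2) + (h @ self.b).unsqueeze(1)
        return scores  # (batch, seq_len, seq_len); argmax over j predicts heads
```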
The general script to run to train a dependency parser with this architecture is as follows:
seacorenlp train --task=dependency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]
The arguments that can/need to be specified for CLI training are detailed in the following table:
Argument | Details | Options
---|---|---
model_name | Name of pre-trained model to use | Huggingface transformer name (e.g. …)
use_pretrained | Whether to use a pre-trained model (set this to True if so) | True / False
freeze | Whether to freeze the pre-trained model's parameters when training | True / False
embedding_dim | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer
pos_tag_embedding_dim | No. of dimensions to use for POS tag embeddings | Integer
lstm_input_dim | No. of dimensions of input to the Bi-LSTM (i.e. dimensions of the embedding output) | Integer
lstm_hidden_dim | No. of dimensions to use for the Bi-LSTM's hidden state | Integer
lstm_layers | No. of Bi-LSTM layers to use | Integer
lstm_dropout | Dropout probability between Bi-LSTM layers (applied if lstm_layers > 1) | Float
num_epochs | No. of epochs to train for | Integer
batch_size | Batch size for training | Integer
patience | Early stopping patience | Integer
lr | Learning rate | Float
For reference, the default parameters provided by AllenNLP (in accordance with Dozat and Manning’s paper) are as follows:
- embedding_dim = 100
- pos_tag_embedding_dim = 100
- lstm_input_dim = 200
- lstm_hidden_dim = 400
- lstm_layers = 3
- lstm_dropout = 0.3
- batch_size = 128
- lr = 0.001
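Putting it together, a hypothetical invocation might look like this (all paths and values are placeholders):

```
seacorenlp train --task=dependency \
    train_data_path=data/train.conllu \
    validation_data_path=data/dev.conllu \
    model_name=xlm-roberta-base \
    use_pretrained=True \
    freeze=False \
    num_epochs=50 \
    batch_size=128 \
    lr=0.001
```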
DatasetReaders for DependencyParsers¶
When training, the models read in the data using an AllenNLP DatasetReader and accept only a particular data format. We currently use the default DatasetReader provided by AllenNLP for dependency parsing, which expects the Universal Dependencies format.
An example from the UD-ID-GSD dataset is shown here:
# sent_id = dev-s1
# text = Ahli rekayasa optik mendesain komponen dari instrumen optik seperti lensa, mikroskop, teleskop, dan peralatan lainnya yang mendukung sifat cahaya.
1 Ahli ahli PROPN NSD Number=Sing 4 nsubj _ MorphInd=^ahli<n>_NSD$
2 rekayasa rekayasa NOUN NSD Number=Sing 1 compound _ MorphInd=^rekayasa<n>_NSD$
3 optik optik NOUN NSD Number=Sing 2 compound _ MorphInd=^optik<n>_NSD$
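For reference, a minimal sketch of loading such a file with AllenNLP's default UD reader (assumes the allennlp and allennlp-models packages are installed; the file path is a placeholder):

```python
from allennlp_models.structured_prediction import UniversalDependenciesDatasetReader

reader = UniversalDependenciesDatasetReader()
# Any UD-formatted .conllu file works here
for instance in reader.read("id_gsd-ud-dev.conllu"):
    print(instance.fields["words"])         # the tokens of the sentence
    print(instance.fields["head_indices"])  # gold head index for each token
    break
```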
Training Details for Native Models¶
All our native models are trained, validated and tested on the official train/validation/test splits provided by Universal Dependencies. For Thai, however, we performed a random 90:10 split into train and test sets; Thai models were trained on the train set and validated on the test set.
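Since the docs do not specify how the Thai split was produced, the following is only a hypothetical sketch of a 90:10 sentence-level split of a CoNLL-U file (all names are placeholders):

```python
import random

def split_conllu(path: str, train_path: str, test_path: str, seed: int = 42) -> None:
    # CoNLL-U sentences are separated by blank lines
    with open(path, encoding="utf-8") as f:
        sentences = [s for s in f.read().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)
    cut = int(0.9 * len(sentences))  # 90% train, 10% test
    for out_path, subset in ((train_path, sentences[:cut]), (test_path, sentences[cut:])):
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n\n".join(subset) + "\n")
```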