Constituency Parsing Module

Constituency parsing is the task of breaking a text down into its constituents, i.e. nested phrases such as noun phrases and verb phrases.

seacorenlp.parsing.constituency

Module for Constituency Parsing

ConstituencyParser()

Base class to instantiate specific ConstituencyParser (AllenNLP Predictor)

Constituency Parsers

Model Performance

| Language   | Package    | Model Name                    | Architecture | Size                       | Dataset         | F1 (%) |
|------------|------------|-------------------------------|--------------|----------------------------|-----------------|--------|
| Indonesian | SEACoreNLP | cp-id-kethu-benepar-xlmr-best | Benepar      | 825.9MB                    | Kethu           | 82.85  |
| Indonesian | SEACoreNLP | cp-id-kethu-xlmr              | AllenNLP     | 15.2MB (Classifier layers) | Kethu           | 77.05  |
| Indonesian | Malaya     | -                             | XLNET        | 498.0MB                    | Augmented Kethu | 83.31  |
| Indonesian | Malaya     | -                             | BERT         | 470.0MB                    | Augmented Kethu | 80.35  |
| Indonesian | Malaya     | -                             | ALBERT       | 180.0MB                    | Augmented Kethu | 79.01  |
| Indonesian | Malaya     | -                             | Tiny-BERT    | 125.0MB                    | Augmented Kethu | 76.79  |
| Indonesian | Malaya     | -                             | Tiny-ALBERT  | 56.7MB                     | Augmented Kethu | 70.84  |

class seacorenlp.parsing.constituency.ConstituencyParser

Base class to instantiate specific ConstituencyParser (AllenNLP Predictor)

Options for model_name:
  • E.g. cp-id-kethu-xlmr

  • Refer to the Constituency Parser performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained ConstituencyParser based on the model name provided

Parameters

model_name (str) – Name of the model

Returns

An AllenNLP Predictor that performs constituency parsing

Return type

Predictor
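
As a usage sketch (assuming the package is installed and the model weights can be downloaded), a natively trained parser can be loaded by one of the model names from the table above:

from seacorenlp.parsing.constituency import ConstituencyParser

# Load the natively trained AllenNLP-based Indonesian model by name
# (see the Model Performance table for the full list of model names).
parser = ConstituencyParser.from_pretrained("cp-id-kethu-xlmr")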

Keyword arguments for ConstituencyParser.from_library

If choosing malaya:
  • model - Transformer model to use (Default = xlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
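
As an illustrative sketch (assuming malaya is installed and its pretrained weights are available), these defaults can be overridden as follows:

from seacorenlp.parsing.constituency import ConstituencyParser

# Request Malaya's ALBERT parser with its quantized transformer;
# omitting both keyword arguments falls back to model="xlnet", quantized=False.
parser = ConstituencyParser.from_library("malaya", model="albert", quantized=True)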

AllenNLP Predictor

The ConstituencyParser instance returned by ConstituencyParser.from_library or ConstituencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the text into sentences and then parses each sentence into an NLTK tree.

class allennlp.predictors.Predictor
predict(text)
Parameters

text (str) – Text to predict on

Returns

List of NLTK Trees, one for each sentence

Return type

List[Tree]
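
A minimal sketch of calling the predictor (the Indonesian input is adapted from the dataset example further below; the exact parse produced depends on the model chosen):

# predict() segments the text into sentences and returns one nltk.Tree per sentence.
trees = parser.predict("Selama bertahun-tahun monyet mengganggu warga Delhi.")
for tree in trees:
    tree.pretty_print()                 # ASCII rendering of the parse tree
    print(tree.label(), tree.leaves())  # root label and token leaves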

Command Line Training

There are two architectures used to train constituency parsers in SEACoreNLP.

Berkeley Neural Parser

The first follows that of the Berkeley Neural Parser. The architecture comprises embeddings followed by a variable number of self-attention layers, and it has so far outperformed AllenNLP’s architecture in our experiments. For more details, please refer to their original paper.

We currently do not provide a way to train models using this architecture as it does not follow the AllenNLP framework. Please refer to Berkeley Neural Parser’s Github for instructions on how to train a constituency parser using their architecture.

AllenNLP Constituency Parser

We provide a CLI for training models using AllenNLP’s own Constituency Parser architecture. The architecture is as follows:

Embeddings > Bi-LSTM Encoder > Bi-directional Span Extraction > Feedforward Neural Network

Please refer to AllenNLP’s original paper “Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (2018)” for more details.

The embeddings can either be trained from scratch or taken from a pretrained transformer model on Huggingface.

To train a model with AllenNLP’s Constituency Parser, the general script to run is as follows:

seacorenlp train --task=constituency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

| Argument        | Details                                                                        | Options                                               |
|-----------------|--------------------------------------------------------------------------------|-------------------------------------------------------|
| model_name      | Name of pre-trained model to use                                               | Huggingface transformer name (e.g. xlm-roberta-base)  |
| use_pretrained  | Set to true if using a pre-trained model                                       | true / false                                          |
| freeze          | Whether to freeze the pre-trained model’s parameters during training           | true / false                                          |
| embedding_dim   | No. of dimensions for word embeddings (if not using a pre-trained model)       | Integer                                               |
| lstm_input_dim  | No. of dimensions of the input to the Bi-LSTM (i.e. the embedding output size) | Integer                                               |
| lstm_hidden_dim | No. of dimensions for the Bi-LSTM’s hidden state                               | Integer                                               |
| lstm_layers     | No. of Bi-LSTM layers to use                                                   | Integer                                               |
| lstm_dropout    | % of dropout between Bi-LSTM layers (applied only if lstm_layers > 1)          | Integer                                               |
| ff_hidden_dim   | No. of dimensions for the final feedforward layer                              | Integer                                               |
| num_epochs      | No. of epochs to train for                                                     | Integer                                               |
| batch_size      | Batch size for training                                                        | Integer                                               |
| patience        | Early stopping patience                                                        | Integer                                               |
| lr              | Learning rate                                                                  | Float (e.g. 1e-5)                                     |
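
Putting this together, a hypothetical invocation that fine-tunes xlm-roberta-base might look like the following (all paths and hyperparameter values are placeholders, and arguments are assumed to be passed as key=value pairs as in the template above):

seacorenlp train --task=constituency \
  train_data_path=data/kethu/train.ptb \
  validation_data_path=data/kethu/val.ptb \
  use_pretrained=true \
  model_name=xlm-roberta-base \
  freeze=false \
  lstm_input_dim=768 \
  lstm_hidden_dim=300 \
  lstm_layers=2 \
  ff_hidden_dim=250 \
  num_epochs=50 \
  batch_size=16 \
  patience=10 \
  lr=1e-5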

DatasetReaders for ConstituencyParsers

When training, the models read in the data using an AllenNLP DatasetReader and take in only a particular data format.

We currently use the default DatasetReader provided by AllenNLP for constituency parsing. It expects the Penn Treebank format.

An example from the Kethu dataset is as follows:

(NP (NN Kera) (SBAR (IN untuk) (S (NP-SBJ (-NONE- *)) (VP (VB amankan) (NP (NP (NN pesta) (NN olahraga)))))))
(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))
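
These bracketed trees use standard Penn Treebank S-expression notation; as a quick sanity check (outside the training pipeline itself), they can be loaded and inspected with NLTK:

from nltk import Tree

# Read the second example tree above into an nltk.Tree object.
ptb = ("(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) "
       "(VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))")
tree = Tree.fromstring(ptb)
print(tree.leaves())   # ['Selama', 'bertahun-tahun', 'monyet', 'mengganggu', 'warga', 'Delhi', '.']
tree.pretty_print()    # draws the constituency structure as ASCII art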

Training Details for Native Models

We currently only have trained models for the Indonesian language using the Kethu dataset.

The official train/test split provided is 925:125 sentences (1030 in total). For validation, we further split the training set into 850:75 (train/val).

When training with the Berkeley Neural Parser architecture, we encountered bugs with a few of the sentences in the dataset. We skipped those sentences, leaving 843 sentences for training. This problem did not occur with the AllenNLP architecture.