Constituency Parsing Module

Constituency parsing is the task of breaking a text down into its constituents, i.e. nested phrases such as noun phrases and verb phrases.

seacorenlp.parsing.constituency

Module for Constituency Parsing

ConstituencyParser()

Base class to instantiate specific ConstituencyParser (AllenNLP Predictor)

Constituency Parsers

Model Performance

| Language   | Package    | Model Name                    | Architecture | Size                       | Dataset         | F1 (%) |
|------------|------------|-------------------------------|--------------|----------------------------|-----------------|--------|
| Indonesian | SEACoreNLP | cp-id-kethu-benepar-xlmr-best | Benepar      | 825.9MB                    | Kethu           | 82.85  |
| Indonesian | SEACoreNLP | cp-id-kethu-xlmr              | AllenNLP     | 15.2MB (Classifier layers) | Kethu           | 77.05  |
| Indonesian | Malaya     | -                             | XLNET        | 498.0MB                    | Augmented Kethu | 83.31  |
| Indonesian | Malaya     | -                             | BERT         | 470.0MB                    | Augmented Kethu | 80.35  |
| Indonesian | Malaya     | -                             | ALBERT       | 180.0MB                    | Augmented Kethu | 79.01  |
| Indonesian | Malaya     | -                             | Tiny-BERT    | 125.0MB                    | Augmented Kethu | 76.79  |
| Indonesian | Malaya     | -                             | Tiny-ALBERT  | 56.7MB                     | Augmented Kethu | 70.84  |

class seacorenlp.parsing.constituency.ConstituencyParser

Base class to instantiate specific ConstituencyParser (AllenNLP Predictor)

Options for model_name:
  • E.g. cp-id-kethu-xlmr

  • Refer to the Constituency Parser performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained ConstituencyParser based on the model name provided

Parameters

model_name (str) – Name of the model

Returns

An AllenNLP Predictor that performs constituency parsing

Return type

Predictor
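
As a usage sketch (assuming the package is installed and the model weights can be downloaded), a natively trained parser can be loaded by one of the model names from the table above:

from seacorenlp.parsing.constituency import ConstituencyParser

# Load the natively trained AllenNLP-based Indonesian model by name
# (see the Model Performance table for the full list of model names).
parser = ConstituencyParser.from_pretrained("cp-id-kethu-xlmr")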

Keyword arguments for ConstituencyParser.from_library

If choosing malaya:
  • model - Transformer model to use (Default = xlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
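
As an illustrative sketch (assuming malaya is installed and its pretrained weights are available), these defaults can be overridden as follows:

from seacorenlp.parsing.constituency import ConstituencyParser

# Request Malaya's ALBERT parser with its quantized transformer;
# omitting both keyword arguments falls back to model="xlnet", quantized=False.
parser = ConstituencyParser.from_library("malaya", model="albert", quantized=True)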

AllenNLP Predictor

The ConstituencyParser instance returned by ConstituencyParser.from_library or ConstituencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the text into sentences and then parses each sentence into an NLTK tree.

class allennlp.predictors.Predictor
predict(text)
Parameters

text (str) – Text to predict on

Returns

List of NLTK Trees, one for each sentence

Return type

List[Tree]
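
A minimal sketch of calling the predictor (the Indonesian input is adapted from the dataset example further below; the exact parse produced depends on the model chosen):

# predict() segments the text into sentences and returns one nltk.Tree per sentence.
trees = parser.predict("Selama bertahun-tahun monyet mengganggu warga Delhi.")
for tree in trees:
    tree.pretty_print()                 # ASCII rendering of the parse tree
    print(tree.label(), tree.leaves())  # root label and token leaves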

Command Line Training

There are two architectures used to train constituency parsers in SEACoreNLP.

Berkeley Neural Parser

The first follows that of the Berkeley Neural Parser. The architecture comprises embeddings followed by a variable number of self-attention layers, and it has so far outperformed AllenNLP’s architecture in our experiments. For more details, please refer to their original paper.

We currently do not provide a way to train models using this architecture as it does not follow the AllenNLP framework. Please refer to Berkeley Neural Parser’s Github for instructions on how to train a constituency parser using their architecture.

AllenNLP Constituency Parser

We provide a CLI for training models using AllenNLP’s own Constituency Parser architecture. The architecture is as follows:

Embeddings > Bi-LSTM Encoder > Bi-directional Span Extraction > Feedforward Neural Network

Please refer to AllenNLP’s original paper “Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (2018)” for more details.

The embeddings can either be trained from scratch or taken from a pretrained transformer model on Huggingface.

To train a model with AllenNLP’s Constituency Parser, the general script to run is as follows:

seacorenlp train --task=constituency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

| Argument        | Details                                                                        | Options                                               |
|-----------------|--------------------------------------------------------------------------------|-------------------------------------------------------|
| model_name      | Name of pre-trained model to use                                               | Huggingface transformer name (e.g. xlm-roberta-base)  |
| use_pretrained  | Set to true if using a pre-trained model                                       | true / false                                          |
| freeze          | Whether to freeze the pre-trained model’s parameters during training           | true / false                                          |
| embedding_dim   | No. of dimensions for word embeddings (if not using a pre-trained model)       | Integer                                               |
| lstm_input_dim  | No. of dimensions of the input to the Bi-LSTM (i.e. the embedding output size) | Integer                                               |
| lstm_hidden_dim | No. of dimensions for the Bi-LSTM’s hidden state                               | Integer                                               |
| lstm_layers     | No. of Bi-LSTM layers to use                                                   | Integer                                               |
| lstm_dropout    | % of dropout between Bi-LSTM layers (applied only if lstm_layers > 1)          | Integer                                               |
| ff_hidden_dim   | No. of dimensions for the final feedforward layer                              | Integer                                               |
| num_epochs      | No. of epochs to train for                                                     | Integer                                               |
| batch_size      | Batch size for training                                                        | Integer                                               |
| patience        | Early stopping patience                                                        | Integer                                               |
| lr              | Learning rate                                                                  | Float (e.g. 1e-5)                                     |
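
Putting this together, a hypothetical invocation that fine-tunes xlm-roberta-base might look like the following (all paths and hyperparameter values are placeholders, and arguments are assumed to be passed as key=value pairs as in the template above):

seacorenlp train --task=constituency \
  train_data_path=data/kethu/train.ptb \
  validation_data_path=data/kethu/val.ptb \
  use_pretrained=true \
  model_name=xlm-roberta-base \
  freeze=false \
  lstm_input_dim=768 \
  lstm_hidden_dim=300 \
  lstm_layers=2 \
  ff_hidden_dim=250 \
  num_epochs=50 \
  batch_size=16 \
  patience=10 \
  lr=1e-5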

DatasetReaders for ConstituencyParsers

When training, the models read in the data using an AllenNLP DatasetReader and take in only a particular data format.

We currently use the default DatasetReader provided by AllenNLP for constituency parsing. It expects the Penn Treebank format.

An example from the Kethu dataset is as follows:

(NP (NN Kera) (SBAR (IN untuk) (S (NP-SBJ (-NONE- *)) (VP (VB amankan) (NP (NP (NN pesta) (NN olahraga)))))))
(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))
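
These bracketed trees use standard Penn Treebank S-expression notation; as a quick sanity check (outside the training pipeline itself), they can be loaded and inspected with NLTK:

from nltk import Tree

# Read the second example tree above into an nltk.Tree object.
ptb = ("(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) "
       "(VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))")
tree = Tree.fromstring(ptb)
print(tree.leaves())   # ['Selama', 'bertahun-tahun', 'monyet', 'mengganggu', 'warga', 'Delhi', '.']
tree.pretty_print()    # draws the constituency structure as ASCII art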

Training Details for Native Models

We currently only have trained models for the Indonesian language using the Kethu dataset.

The official train/test split provided is 925:125 sentences (1030 in total). For validation, we further split the training set into 850:75 (train/val).

When training with the Berkeley Neural Parser architecture, we encountered bugs with a few of the sentences in the dataset. We skipped those sentences, leaving 843 sentences for training. This problem did not occur with the AllenNLP architecture.