Constituency Parsing Module¶
Constituency parsing is the task of breaking a sentence down into its constituents, i.e. the nested phrases that make up its syntactic structure.
seacorenlp.parsing.constituency¶
Module for Constituency Parsing
ConstituencyParser | Base class to instantiate a specific ConstituencyParser (AllenNLP Predictor)
Constituency Parsers¶
Model Performance¶
Language | Package | Model Name | Architecture | Size | Dataset | F1 (%)
---|---|---|---|---|---|---
Indonesian | SEACoreNLP | cp-id-kethu-benepar-xlmr-best | Benepar | 825.9MB | Kethu | 82.85
Indonesian | SEACoreNLP | cp-id-kethu-xlmr | AllenNLP | 15.2MB (Classifier layers) | Kethu | 77.05
Indonesian | Malaya | – | XLNET | 498.0MB | – | 83.31
Indonesian | Malaya | – | BERT | 470.0MB | – | 80.35
Indonesian | Malaya | – | ALBERT | 180.0MB | – | 79.01
Indonesian | Malaya | – | Tiny-BERT | 125.0MB | – | 76.79
Indonesian | Malaya | – | Tiny-ALBERT | 56.7MB | – | 70.84
class seacorenlp.parsing.constituency.ConstituencyParser¶
Base class to instantiate a specific ConstituencyParser (AllenNLP Predictor).
- Options for model_name: e.g. cp-id-kethu-xlmr. Refer to the Model Performance table above for the full list of constituency parser models.
- Options for library_name: malaya (for Indonesian/Malay)
classmethod from_library(library_name, **kwargs)¶
Returns a third-party Predictor based on the name of the library provided. Keyword arguments can be passed as necessary.
- Parameters
  - library_name (str) – Name of third-party library
  - **kwargs – Additional keyword arguments specific to each library
- Return type
  - Predictor
Keyword arguments for ConstituencyParser.from_library¶
- If choosing malaya:
  - model – Transformer model to use (Default = xlnet). Options: bert, tiny-bert, albert, tiny-albert, xlnet
  - quantized – Boolean for whether to use a quantized transformer (Default = False)
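For instance, loading Malaya's parser with these keyword arguments might look like this (a minimal sketch; the particular model and quantized values are just one possible choice):

```python
from seacorenlp.parsing.constituency import ConstituencyParser

# Load Malaya's constituency parser via the classmethod documented above.
# model and quantized are the malaya-specific keyword arguments.
parser = ConstituencyParser.from_library(
    "malaya",
    model="tiny-bert",  # smaller download at some cost in F1 (see table above)
    quantized=True,     # use the quantized variant of the transformer
)
```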
AllenNLP Predictor¶
The ConstituencyParser instance returned by ConstituencyParser.from_library or ConstituencyParser.from_pretrained is an instance of the Predictor class from the AllenNLP library. It first segments the input text into sentences and then parses each sentence into an NLTK tree.
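A minimal end-to-end sketch, assuming from_pretrained accepts a model name from the performance table and that the predictor exposes a predict method returning one NLTK tree per sentence (both are assumptions; check the signatures in your installed version):

```python
from seacorenlp.parsing.constituency import ConstituencyParser

# Model name taken from the Model Performance table above
parser = ConstituencyParser.from_pretrained("cp-id-kethu-xlmr")  # assumed signature

# The text is segmented into sentences, each parsed into an NLTK tree
text = "Monyet mengganggu warga Delhi. Kera amankan pesta olahraga."
for tree in parser.predict(text):  # assumed prediction entry point
    tree.pretty_print()
```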
Command Line Training¶
There are two architectures used to train constituency parsers in SEACoreNLP.
Berkeley Neural Parser¶
The first follows that of the Berkeley Neural Parser. This architecture comprises embeddings followed by a variable number of self-attention layers, and it has so far performed better than AllenNLP's architecture. For more details, please refer to the original paper.
We currently do not provide a way to train models using this architecture, as it does not follow the AllenNLP framework. Please refer to the Berkeley Neural Parser's GitHub repository for instructions on how to train a constituency parser with their architecture.
AllenNLP Constituency Parser¶
We provide a CLI for training models using AllenNLP’s own Constituency Parser architecture. The architecture is as follows:
Embeddings > Bi-LSTM Encoder > Bi-directional Span Extraction > Feedforward Neural Network
Please refer to AllenNLP’s original paper “Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (2018)” for more details.
The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.
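For intuition, the following is a compact, hypothetical PyTorch sketch of this pipeline with embeddings trained from scratch. The real trainer is built on AllenNLP's abstractions and adds proper span enumeration and tree decoding; every name and dimension here is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SpanParserSketch(nn.Module):
    """Embeddings > Bi-LSTM Encoder > Bidirectional Span Extraction > Feedforward."""

    def __init__(self, vocab_size=1000, embedding_dim=100, lstm_hidden_dim=250,
                 lstm_layers=2, ff_hidden_dim=250, num_labels=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.encoder = nn.LSTM(embedding_dim, lstm_hidden_dim,
                               num_layers=lstm_layers, bidirectional=True,
                               batch_first=True)
        # A span (i, j) is represented by concatenating the bidirectional
        # encoder states at its two endpoints: 2 endpoints x 2 directions.
        self.ff = nn.Sequential(
            nn.Linear(4 * lstm_hidden_dim, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, num_labels),  # constituent-label scores
        )

    def forward(self, token_ids, spans):
        # token_ids: (batch, seq_len) long; spans: (batch, num_spans, 2) long indices
        encoded, _ = self.encoder(self.embed(token_ids))  # (batch, seq_len, 2H)
        dim = encoded.size(-1)
        starts = encoded.gather(1, spans[:, :, 0:1].expand(-1, -1, dim))
        ends = encoded.gather(1, spans[:, :, 1:2].expand(-1, -1, dim))
        span_reprs = torch.cat([starts, ends], dim=-1)    # (batch, num_spans, 4H)
        return self.ff(span_reprs)                        # (batch, num_spans, labels)
```

The Benepar architecture described earlier differs mainly in replacing the Bi-LSTM encoder with self-attention layers.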
To train a model with AllenNLP’s Constituency Parser, the general script to run is as follows:
seacorenlp train --task=constituency train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]
The arguments that can/need to be specified for CLI training are detailed in the following table:
Argument | Details | Options
---|---|---
model_name | Name of pre-trained model to use | Huggingface transformer name (e.g. …)
use_pretrained | If using a pre-trained model, set this to True | True/False
freeze | Whether to freeze the pre-trained model’s parameters when training | True/False
embedding_dim | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer
lstm_input_dim | No. of dimensions of input to the Bi-LSTM (i.e. dimensions of the embedding output) | Integer
lstm_hidden_dim | No. of dimensions to use for the Bi-LSTM’s hidden state | Integer
lstm_layers | No. of Bi-LSTM layers to use | Integer
lstm_dropout | % of dropout between Bi-LSTM layers (applied if lstm_layers > 1) | Integer
ff_hidden_dim | No. of dimensions for the final feedforward layer | Integer
num_epochs | No. of epochs to train for | Integer
batch_size | Batch size for training | Integer
patience | Early stopping patience | Integer
lr | Learning rate | Float
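As an illustration, a training run with a pre-trained Huggingface transformer might look like the following (the data paths and all hyperparameter values are made up for the example):

```
seacorenlp train --task=constituency \
    train_data_path=data/kethu_train.ptb \
    validation_data_path=data/kethu_val.ptb \
    model_name=xlm-roberta-base \
    use_pretrained=True \
    freeze=False \
    num_epochs=20 \
    batch_size=16 \
    patience=5 \
    lr=0.001
```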
DatasetReaders for ConstituencyParsers¶
When training, the models read in the data using an AllenNLP DatasetReader and accept only a particular data format. We currently use the default DatasetReader provided by AllenNLP for constituency parsing, which expects the Penn Treebank format. An example from the Kethu dataset is as follows:
(NP (NN Kera) (SBAR (IN untuk) (S (NP-SBJ (-NONE- *)) (VP (VB amankan) (NP (NP (NN pesta) (NN olahraga)))))))
(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))
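Because the parsers also output NLTK trees, these bracketed strings can be inspected directly with NLTK (a small self-contained illustration; only the nltk package is assumed):

```python
from nltk.tree import Tree

# Read one Penn-Treebank-style line from the Kethu data into an NLTK tree
line = ("(S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) "
        "(VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))")
tree = Tree.fromstring(line)

tree.pretty_print()   # ASCII rendering of the constituency structure
print(tree.leaves())  # ['Selama', 'bertahun-tahun', 'monyet', ...]
```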
Training Details for Native Models¶
We currently have trained models only for Indonesian, using the Kethu dataset.
The official train/test split provided is 925:125 sentences (1030 in total). For validation, we further split the train set into 850:75 (train/val).
When training with the Berkeley Neural Parser architecture, we encountered errors with a few of the sentences in the dataset. We skipped those sentences, leaving 843 sentences for training. This problem did not occur with the AllenNLP architecture.