.. currentmodule:: seacorenlp.parsing.constituency

###########################
Constituency Parsing Module
###########################

Constituency parsing is the task of breaking a text down into its constituents, i.e. smaller phrases such as noun phrases and verb phrases.

*******************************
seacorenlp.parsing.constituency
*******************************

.. automodule:: seacorenlp.parsing.constituency

.. autosummary::

    ConstituencyParser

********************
Constituency Parsers
********************

Model Performance
=================

.. csv-table::
    :file: tables/constituency-parsers.csv
    :header-rows: 1

.. autoclass:: ConstituencyParser
    :members: from_library, from_pretrained

Keyword arguments for ConstituencyParser.from_library
=====================================================

If choosing ``malaya``:

* ``model`` - Transformer model to use (Default = ``xlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``ConstituencyParser`` instance returned by ``ConstituencyParser.from_library`` or ``ConstituencyParser.from_pretrained`` is an instance of the ``Predictor`` class from the AllenNLP library. It first segments the text into sentences and then parses each sentence into an NLTK tree.

.. class:: allennlp.predictors.Predictor

    .. method:: predict(text)

        :param text: Text to predict on
        :type text: str
        :return: List of NLTK Trees, one for each sentence
        :rtype: List[Tree]
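The sketch below shows how a parser might be loaded and applied end to end. It assumes that the library name ``malaya`` is passed as the first argument to ``from_library`` and that the keyword arguments listed above are forwarded to the backend; the example sentence is taken from the Kethu excerpt further down this page.

.. code-block:: python

    from seacorenlp.parsing.constituency import ConstituencyParser

    # Load a parser backed by the malaya library; ``model`` and ``quantized``
    # are the keyword arguments described in the section above (assumed here
    # to be passed straight through to the backend).
    parser = ConstituencyParser.from_library("malaya", model="xlnet", quantized=False)

    # ``predict`` segments the text into sentences and returns one NLTK Tree
    # per sentence.
    trees = parser.predict("Selama bertahun-tahun monyet mengganggu warga Delhi.")

    for tree in trees:
        print(tree)  # each element is an nltk.Tree in bracketed form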
*********************
Command Line Training
*********************

There are two architectures used to train constituency parsers in SEACoreNLP.

Berkeley Neural Parser
======================

The first follows that of `Berkeley Neural Parser `_. This architecture consists of embeddings followed by a variable number of self-attention layers, and it has performed better than the AllenNLP architecture in our experiments so far. For more details, please refer to their original `paper `_.

We currently do not provide a way to train models using this architecture as it does not follow the AllenNLP framework. Please refer to Berkeley Neural Parser's `Github `_ for instructions on how to train a constituency parser using their architecture.

AllenNLP Constituency Parser
============================

We provide a CLI for training models using AllenNLP's own Constituency Parser architecture. The architecture is as follows:

Embeddings > Bi-LSTM Encoder > Bi-directional Span Extraction > Feedforward Neural Network

Please refer to AllenNLP's original paper `"Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (2018)" `_ for more details. The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.

To train a model with AllenNLP's Constituency Parser, the general script to run is as follows:

.. code-block:: shell

    seacorenlp train --task=constituency \
                     train_data_path=PATH \
                     validation_data_path=PATH \
                     [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

.. csv-table::
    :file: tables/cp-arguments.csv
    :header-rows: 1

DatasetReaders for ConstituencyParsers
======================================

When training, the models read in the data using an AllenNLP ``DatasetReader`` and accept only a particular data format. We currently use the default ``DatasetReader`` provided by AllenNLP for constituency parsing. It expects the Penn Treebank format. An example from the `Kethu `_ dataset is as follows (a short sketch showing how such trees can be inspected with NLTK is given at the end of this page):

.. code-block::

    (NP (NN Kera) (SBAR (IN untuk) (S (NP-SBJ (-NONE- *)) (VP (VB amankan) (NP (NP (NN pesta) (NN olahraga)))))))
    (S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))

Training Details for Native Models
==================================

We currently only have trained models for the Indonesian language, using the `Kethu `_ dataset. The official train/test split provided is 925 : 125 sentences (1030 in total). For validation, we further split the train set into 850 : 75 (train/val).

When training with the Berkeley Neural Parser architecture, we encountered bugs with a few of the sentences in the dataset. We skipped those sentences, leaving us with 843 sentences for training. This problem was not seen with the AllenNLP architecture.
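As referenced above, the bracketed trees in the Kethu excerpt use the Penn Treebank format that the ``DatasetReader`` expects, and they parse into the same NLTK ``Tree`` type that ``predict`` returns. The snippet below is a minimal sketch for inspecting one such tree; it assumes only that NLTK is installed and uses the second sentence from the excerpt.

.. code-block:: python

    from nltk import Tree

    # Second bracketed sentence from the Kethu excerpt above (Penn Treebank format).
    bracketed = (
        "(S (PP (IN Selama) (NP (NN bertahun-tahun))) "
        "(NP-SBJ (NN monyet)) "
        "(VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))"
    )

    tree = Tree.fromstring(bracketed)
    print(tree.label())    # root label of the tree, here "S"
    print(tree.leaves())   # the sentence tokens
    tree.pretty_print()    # ASCII rendering of the constituency structure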