.. currentmodule:: seacorenlp.parsing.constituency

###########################
Constituency Parsing Module
###########################

Constituency parsing is the task of breaking a text down into its constituents, i.e. smaller phrases such as noun phrases and verb phrases.

*******************************
seacorenlp.parsing.constituency
*******************************

.. automodule:: seacorenlp.parsing.constituency

.. autosummary::

    ConstituencyParser

********************
Constituency Parsers
********************

Model Performance
=================

.. csv-table::
    :file: tables/constituency-parsers.csv
    :header-rows: 1

.. autoclass:: ConstituencyParser
    :members: from_library, from_pretrained

Keyword arguments for ConstituencyParser.from_library
=====================================================

If choosing ``malaya``:

* ``model`` - Transformer model to use (Default = ``xlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``ConstituencyParser`` instance returned by ``ConstituencyParser.from_library`` or ``ConstituencyParser.from_pretrained`` is an instance of the ``Predictor`` class from the AllenNLP library. It first segments the text into sentences and then parses each sentence into an NLTK tree.

.. class:: allennlp.predictors.Predictor

    .. method:: predict(text)

        :param text: Text to predict on
        :type text: str
        :return: List of NLTK Trees, one for each sentence
        :rtype: List[Tree]
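The sketch below shows how a parser might be loaded and applied end to end. It assumes that the library name ``malaya`` is passed as the first argument to ``from_library`` and that the keyword arguments listed above are forwarded to the backend; the example sentence is taken from the Kethu excerpt further down this page.

.. code-block:: python

    from seacorenlp.parsing.constituency import ConstituencyParser

    # Load a parser backed by the malaya library; ``model`` and ``quantized``
    # are the keyword arguments described in the section above (assumed here
    # to be passed straight through to the backend).
    parser = ConstituencyParser.from_library("malaya", model="xlnet", quantized=False)

    # ``predict`` segments the text into sentences and returns one NLTK Tree
    # per sentence.
    trees = parser.predict("Selama bertahun-tahun monyet mengganggu warga Delhi.")

    for tree in trees:
        print(tree)  # each element is an nltk.Tree in bracketed form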
*********************
Command Line Training
*********************

There are two architectures used to train constituency parsers in SEACoreNLP.

Berkeley Neural Parser
======================

The first follows that of `Berkeley Neural Parser `_. This architecture consists of embeddings followed by a variable number of self-attention layers, and it has performed better than the AllenNLP architecture in our experiments so far. For more details, please refer to their original `paper `_.

We currently do not provide a way to train models using this architecture as it does not follow the AllenNLP framework. Please refer to Berkeley Neural Parser's `Github `_ for instructions on how to train a constituency parser using their architecture.

AllenNLP Constituency Parser
============================

We provide a CLI for training models using AllenNLP's own Constituency Parser architecture. The architecture is as follows:

Embeddings > Bi-LSTM Encoder > Bi-directional Span Extraction > Feedforward Neural Network

Please refer to AllenNLP's original paper `"Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (2018)" `_ for more details. The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface.

To train a model with AllenNLP's Constituency Parser, the general script to run is as follows:

.. code-block:: shell

    seacorenlp train --task=constituency \
                     train_data_path=PATH \
                     validation_data_path=PATH \
                     [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

.. csv-table::
    :file: tables/cp-arguments.csv
    :header-rows: 1

DatasetReaders for ConstituencyParsers
======================================

When training, the models read in the data using an AllenNLP ``DatasetReader`` and accept only a particular data format. We currently use the default ``DatasetReader`` provided by AllenNLP for constituency parsing. It expects the Penn Treebank format. An example from the `Kethu `_ dataset is as follows (a short sketch showing how such trees can be inspected with NLTK is given at the end of this page):

.. code-block::

    (NP (NN Kera) (SBAR (IN untuk) (S (NP-SBJ (-NONE- *)) (VP (VB amankan) (NP (NP (NN pesta) (NN olahraga)))))))
    (S (PP (IN Selama) (NP (NN bertahun-tahun))) (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))

Training Details for Native Models
==================================

We currently only have trained models for the Indonesian language, using the `Kethu `_ dataset. The official train/test split provided is 925 : 125 sentences (1030 in total). For validation, we further split the train set into 850 : 75 (train/val).

When training with the Berkeley Neural Parser architecture, we encountered bugs with a few of the sentences in the dataset. We skipped those sentences, leaving us with 843 sentences for training. This problem was not seen with the AllenNLP architecture.
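As referenced above, the bracketed trees in the Kethu excerpt use the Penn Treebank format that the ``DatasetReader`` expects, and they parse into the same NLTK ``Tree`` type that ``predict`` returns. The snippet below is a minimal sketch for inspecting one such tree; it assumes only that NLTK is installed and uses the second sentence from the excerpt.

.. code-block:: python

    from nltk import Tree

    # Second bracketed sentence from the Kethu excerpt above (Penn Treebank format).
    bracketed = (
        "(S (PP (IN Selama) (NP (NN bertahun-tahun))) "
        "(NP-SBJ (NN monyet)) "
        "(VP (VB mengganggu) (NP (NN warga) (NNP Delhi))) (. .))"
    )

    tree = Tree.fromstring(bracketed)
    print(tree.label())    # root label of the tree, here "S"
    print(tree.leaves())   # the sentence tokens
    tree.pretty_print()    # ASCII rendering of the constituency structure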