.. currentmodule:: seacorenlp.tagging.ner

###############################
Named Entity Recognition Module
###############################

Named Entity Recognition (NER) is the task of identifying named entities in the input text.
While a strict definition of a named entity is that of a proper noun, the definition is often
expanded to include other categories that are useful in information extraction. Typical named
entity categories include ``ORG`` (Organization), ``PER`` (Person), ``GPE`` (Geopolitical
Entity) and ``LOC`` (Location).

**********************
seacorenlp.tagging.ner
**********************

.. automodule:: seacorenlp.tagging.ner

.. autosummary::

   NERTagger

***********
NER Taggers
***********

Model Performance
=================

.. csv-table::
   :file: ../intro/tables/ner-taggers.csv
   :header-rows: 1

.. autoclass:: NERTagger
   :members: from_library, from_pretrained

Keyword arguments for NERTagger.from_library
============================================

If choosing ``malaya``:

* ``engine`` - Transformer model to use (Default = ``alxlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``
  * ``alxlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``NERTagger`` instance returned by ``NERTagger.from_library`` or
``NERTagger.from_pretrained`` is an instance of the ``Predictor`` class from the AllenNLP
library.

.. class:: allennlp.predictors.Predictor

   .. method:: predict(text, use_bio_format=True)

      :param text: Text to predict on
      :param use_bio_format: Use BIO format for NER tags (if ``malaya`` is chosen as the library), defaults to ``True``
      :type text: str
      :type use_bio_format: bool, optional
      :return: List of tuples containing each token's text and its predicted NER tag
      :rtype: List[Tuple[str, str]]

*********************
Command Line Training
*********************

The default architecture used for NER models in SEACoreNLP is as follows:

Embeddings > Bi-LSTM Encoder > CRF

The embeddings can be trained from scratch or taken from a pretrained transformer model from
Huggingface. Non-pretrained embeddings can be purely word embeddings or a combination of word
embeddings and character embeddings.

SEACoreNLP provides a CLI for training NER taggers. The general script to run is as follows:

.. code-block:: shell

   seacorenlp train --task=ner \
                    train_data_path=PATH \
                    validation_data_path=PATH \
                    [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following
table:

.. csv-table::
   :file: tables/ner-arguments.csv
   :header-rows: 1

DatasetReaders for NER Taggers
==============================

When training, the models read in the data using an AllenNLP ``DatasetReader`` and take in
only a particular data format. There are currently two dataset readers provided:
``id-nergrit`` and ``th-thainer``. They are similar and are used for the Indonesian
`NERGrit `_ dataset and the Thai `ThaiNER 1.3 `_ dataset respectively.

The data format expected is a ``.txt`` file with two columns, the first being the token and
the second being the NER tag. There should be an empty line separating every sentence. An
example from the `NERGrit `_ dataset is shown here:

.. code-block::

   Obama B-PERSON
   belakangan O
   memicu O
   kontroversi O
   ketika O
   ia O
   meminta O
   Warren B-PERSON

   Tiga O
   hari O
   sebelum O

Training Details for Native Models
==================================

All our native models are trained, validated and tested on the official train/val/test splits
from the dataset providers.
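The two-column format described above (one ``token tag`` pair per line, sentences separated by an empty line) can be parsed with a few lines of plain Python. The helper below is only an illustrative sketch of the file format; it is hypothetical and is not the library's actual AllenNLP ``DatasetReader``:

```python
from typing import List, Tuple

def read_two_column_ner(text: str) -> List[List[Tuple[str, str]]]:
    """Parse two-column token/tag text into sentences of (token, tag) pairs.

    Illustrative sketch only -- the real readers (``id-nergrit``,
    ``th-thainer``) are AllenNLP DatasetReaders, not this function.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line marks the end of the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()
        current.append((token, tag))
    if current:  # flush the last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences

sample = "Obama B-PERSON\nmemicu O\n\nTiga O\nhari O\n"
print(read_two_column_ner(sample))
# [[('Obama', 'B-PERSON'), ('memicu', 'O')], [('Tiga', 'O'), ('hari', 'O')]]
```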
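Since ``predict`` returns a flat list of ``(token, tag)`` pairs in BIO format, downstream code often wants to group those pairs into entity spans. The following helper is a minimal sketch of that post-processing step; it is a hypothetical function written for illustration and is not part of SEACoreNLP or AllenNLP:

```python
from typing import List, Optional, Tuple

def bio_to_spans(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged (token, tag) pairs into (entity_text, label) spans.

    Hypothetical post-processing helper, not part of the library.
    """
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # "B-" opens a new entity; close any entity in progress first
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # "I-" continues the current entity of the same label
            tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the entity in progress
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans

# Input in the shape documented for predict(): List[Tuple[str, str]]
tagged = [("Obama", "B-PERSON"), ("memicu", "O"), ("kontroversi", "O")]
print(bio_to_spans(tagged))  # [('Obama', 'PERSON')]
```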