.. currentmodule:: seacorenlp.tagging.ner

###############################
Named Entity Recognition Module
###############################

Named Entity Recognition (NER) is the task of identifying named entities in the input text.
While a strict definition of a named entity is that of a proper noun, the definition is often
expanded to include other categories that are useful in information extraction. Typical named
entity categories include ``ORG`` (Organization), ``PER`` (Person), ``GPE`` (Geopolitical
Entity) and ``LOC`` (Location).

**********************
seacorenlp.tagging.ner
**********************

.. automodule:: seacorenlp.tagging.ner

.. autosummary::

   NERTagger

***********
NER Taggers
***********

Model Performance
=================

.. csv-table::
   :file: ../intro/tables/ner-taggers.csv
   :header-rows: 1

.. autoclass:: NERTagger
   :members: from_library, from_pretrained

Keyword arguments for NERTagger.from_library
============================================

If choosing ``malaya``:

* ``engine`` - Transformer model to use (Default = ``alxlnet``)

  * ``bert``
  * ``tiny-bert``
  * ``albert``
  * ``tiny-albert``
  * ``xlnet``
  * ``alxlnet``

* ``quantized`` - Boolean for whether to use a quantized transformer (Default = ``False``)

AllenNLP Predictor
==================

The ``NERTagger`` instance returned by ``NERTagger.from_library`` or
``NERTagger.from_pretrained`` is an instance of the ``Predictor`` class from the AllenNLP
library.

.. class:: allennlp.predictors.Predictor

   .. method:: predict(text, use_bio_format=True)

      :param text: Text to predict on
      :param use_bio_format: Use BIO format for NER tags (if ``malaya`` is chosen as the library), defaults to ``True``
      :type text: str
      :type use_bio_format: bool, optional
      :return: List of tuples containing each token's text and its predicted NER tag
      :rtype: List[Tuple[str, str]]

*********************
Command Line Training
*********************

The default architecture used for NER models in SEACoreNLP is as follows:

Embeddings > Bi-LSTM Encoder > CRF

The embeddings can be trained from scratch or taken from a pretrained transformer model from
Huggingface. Non-pretrained embeddings can be purely word embeddings or a combination of word
embeddings and character embeddings.

SEACoreNLP provides a CLI for training NER taggers. The general script to run is as follows:

.. code-block:: shell

   seacorenlp train --task=ner \
                    train_data_path=PATH \
                    validation_data_path=PATH \
                    [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following
table:

.. csv-table::
   :file: tables/ner-arguments.csv
   :header-rows: 1

DatasetReaders for NER Taggers
==============================

When training, the models read in the data using an AllenNLP ``DatasetReader`` and take in
only a particular data format. There are currently two dataset readers provided:
``id-nergrit`` and ``th-thainer``. They are similar and are used for the Indonesian
`NERGrit `_ dataset and the Thai `ThaiNER 1.3 `_ dataset respectively.

The data format expected is a ``.txt`` file with two columns, the first being the token and
the second being the NER tag. There should be an empty line separating every sentence. An
example from the `NERGrit `_ dataset is shown here:

.. code-block::

   Obama B-PERSON
   belakangan O
   memicu O
   kontroversi O
   ketika O
   ia O
   meminta O
   Warren B-PERSON

   Tiga O
   hari O
   sebelum O

Training Details for Native Models
==================================

All our native models are trained, validated and tested on the official train/val/test splits
from the dataset providers.
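The two-column format described above (one ``token tag`` pair per line, sentences separated by an empty line) can be parsed with a few lines of plain Python. The helper below is only an illustrative sketch of the file format; it is hypothetical and is not the library's actual AllenNLP ``DatasetReader``:

```python
from typing import List, Tuple

def read_two_column_ner(text: str) -> List[List[Tuple[str, str]]]:
    """Parse two-column token/tag text into sentences of (token, tag) pairs.

    Illustrative sketch only -- the real readers (``id-nergrit``,
    ``th-thainer``) are AllenNLP DatasetReaders, not this function.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line marks the end of the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()
        current.append((token, tag))
    if current:  # flush the last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences

sample = "Obama B-PERSON\nmemicu O\n\nTiga O\nhari O\n"
print(read_two_column_ner(sample))
# [[('Obama', 'B-PERSON'), ('memicu', 'O')], [('Tiga', 'O'), ('hari', 'O')]]
```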
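Since ``predict`` returns a flat list of ``(token, tag)`` pairs in BIO format, downstream code often wants to group those pairs into entity spans. The following helper is a minimal sketch of that post-processing step; it is a hypothetical function written for illustration and is not part of SEACoreNLP or AllenNLP:

```python
from typing import List, Optional, Tuple

def bio_to_spans(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged (token, tag) pairs into (entity_text, label) spans.

    Hypothetical post-processing helper, not part of the library.
    """
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # "B-" opens a new entity; close any entity in progress first
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # "I-" continues the current entity of the same label
            tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the entity in progress
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans

# Input in the shape documented for predict(): List[Tuple[str, str]]
tagged = [("Obama", "B-PERSON"), ("memicu", "O"), ("kontroversi", "O")]
print(bio_to_spans(tagged))  # [('Obama', 'PERSON')]
```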