Named Entity Recognition Module

Named Entity Recognition (NER) is the task of identifying named entities in input text. While the strict definition of a named entity is a proper noun, the definition is often expanded to include other categories that are useful in information extraction.

Typical named entity categories include ORG (Organization), PER (Person), GPE (Geopolitical Entity) and LOC (Location).

seacorenlp.tagging.ner

Module for Named Entity Recognition

NERTagger()

Base class for instantiating a specific NERTagger (an AllenNLP Predictor)

NER Taggers

Model Performance

Language   | Package     | Model Name                | Architecture                 | Size              | Dataset     | F1 (%)
-----------|-------------|---------------------------|------------------------------|-------------------|-------------|-------
Indonesian | SEACoreNLP  | ner-id-nergrit-xlmr-best  | XLM-R (Base) + Bi-LSTM + CRF | 797.7MB           | NERGrit     | 79.85
Indonesian | SEACoreNLP  | ner-id-nergrit-xlmr       | XLM-R (Base) + Bi-LSTM + CRF | 9.3MB (BiLSTMCRF) | NERGrit     | 75.31
Indonesian | Malaya      | XLNET                     | Transformer Embedding + CRF  | 446.6MB           | Malaya      | 98.73
Indonesian | Malaya      | BERT                      | Transformer Embedding + CRF  | 425.4MB           | Malaya      | 98.54
Indonesian | Malaya      | ALXLNET                   | Transformer Embedding + CRF  | 46.8MB            | Malaya      | 98.34
Indonesian | Malaya      | ALBERT                    | Transformer Embedding + CRF  | 48.6MB            | Malaya      | 96.49
Indonesian | Malaya      | Tiny-BERT                 | Transformer Embedding + CRF  | 57.7MB            | Malaya      | 96.13
Indonesian | Malaya      | Tiny-ALBERT               | Transformer Embedding + CRF  | 22.4MB            | Malaya      | 92.37
Thai       | SEACoreNLP  | ner-th-thainer-xlmr-best  | XLM-R (Base) + Bi-LSTM + CRF | 790.8MB           | ThaiNER 1.3 | 89.49
Thai       | SEACoreNLP  | ner-th-thainer-xlmr       | XLM-R (Base) + Bi-LSTM + CRF | 9.4MB (BiLSTMCRF) | ThaiNER 1.3 | 87.07
Thai       | SEACoreNLP  | ner-th-thainer-scratch    | Embeddings + Bi-LSTM + CRF   | 12.3MB            | ThaiNER 1.3 | 80.11
Thai       | PyThaiNLP   | ThaiNER 1.3               | CRF                          | ?                 | ThaiNER 1.3 | 87.00
Thai       | PyThaiNLP   | WangchanBERTa*            | ?                            | ?                 | ThaiNER 1.3 | 86.49
Thai       | PyThaiNLP   | WangchanBERTa*            | ?                            | ?                 | LST20       | 78.01
Vietnamese | UnderTheSea | ?                         | CRF                          | 172KB             | ?           | ?
Vietnamese | VnCoreNLP*  | ?                         | Dynamic Feature Induction    | 69.5MB            | VLSP 2016   | 88.55

class seacorenlp.tagging.ner.NERTagger

Base class for instantiating a specific NERTagger (an AllenNLP Predictor)

Options for model_name:
  • E.g. ner-th-thainer-xlmr

  • Refer to the NER Taggers performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided.

Parameters

model_name (str) – Name of the model

Return type

Predictor
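
For example, one of the native models listed in the performance table above can be loaded as follows (a minimal sketch):

from seacorenlp.tagging.ner import NERTagger

# Load a natively trained AllenNLP Predictor by model name
# (see the performance table above for the available names)
tagger = NERTagger.from_pretrained("ner-th-thainer-xlmr")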

Keyword arguments for NERTagger.from_library

If choosing malaya:
  • engine - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
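
A minimal sketch of combining these options with NERTagger.from_library:

from seacorenlp.tagging.ner import NERTagger

# Third-party Malaya tagger using a smaller, quantized transformer
tagger = NERTagger.from_library("malaya", engine="tiny-bert", quantized=True)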

AllenNLP Predictor

The tagger returned by NERTagger.from_library or NERTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.

class allennlp.predictors.Predictor
predict(text)
Parameters
  • text (str) – Text to predict on

  • use_bio_format (bool, optional) – Use the BIO format for NER tags (only applies if malaya is chosen as the library); defaults to True

Returns

List of tuples containing the token’s text and its predicted NER tag

Return type

List[Tuple[str, str]]
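
For example (a minimal sketch; the sample sentence is arbitrary and the output shown is illustrative, not actual model output):

from seacorenlp.tagging.ner import NERTagger

tagger = NERTagger.from_pretrained("ner-id-nergrit-xlmr-best")
result = tagger.predict("Obama meminta Warren.")
# result is a list of (token, tag) tuples, e.g.:
# [("Obama", "B-PERSON"), ("meminta", "O"), ("Warren", "B-PERSON"), (".", "O")]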

Command Line Training

The default architecture used for NER in SEACoreNLP models is as follows:

Embeddings > Bi-LSTM Encoder > CRF

The embeddings can either be trained from scratch or taken from a pretrained transformer model on Huggingface. Non-pretrained embeddings can be word embeddings alone or a combination of word embeddings and character embeddings.

SEACoreNLP provides a CLI for training NER taggers. The general script to run is as follows:

seacorenlp train --task=ner train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

Argument               | Details                                                                   | Options
-----------------------|---------------------------------------------------------------------------|------------------------------------------------------
dataset_reader         | DatasetReader to use                                                      | th-thainer / id-nergrit
model_name             | Name of pre-trained model to use                                          | Huggingface transformer name (e.g. xlm-roberta-base)
use_pretrained         | If using a pre-trained model, set this to true                            | true / false
freeze                 | Whether to freeze the pre-trained model's parameters during training      | true / false
embedding_dim          | No. of dimensions for word embeddings (if not using a pre-trained model)  | Integer
use_char_embeddings    | Whether to use CNN character embeddings alongside word embeddings         | true / false
char_embedding_dim     | No. of dimensions for CNN character embeddings                            | Integer
ngram_filter_sizes     | Size of the n-gram filters for CNN character embeddings                   | Integer
num_filters            | No. of n-gram filters to use                                              | Integer
char_cnn_dropout       | Dropout probability for CNN character embeddings                          | Float
transformer_hidden_dim | No. of output dimensions of the pre-trained model (if selected)           | Integer
lstm_hidden_dim        | No. of dimensions for the Bi-LSTM's hidden state                          | Integer
lstm_layers            | No. of Bi-LSTM layers to use                                              | Integer
lstm_dropout           | Dropout probability between Bi-LSTM layers (if more than one layer)       | Float
num_epochs             | No. of epochs to train for                                                | Integer
batch_size             | Batch size for training                                                   | Integer
patience               | Early stopping patience (in epochs)                                       | Integer
lr                     | Learning rate                                                             | Float (e.g. 1e-5)
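
For example, fine-tuning an XLM-R-based tagger on ThaiNER 1.3 might look as follows (a hypothetical invocation: the file paths and hyperparameter values are placeholders, not recommended settings):

seacorenlp train --task=ner \
    train_data_path=data/thainer/train.txt \
    validation_data_path=data/thainer/val.txt \
    dataset_reader=th-thainer \
    use_pretrained=true \
    model_name=xlm-roberta-base \
    freeze=false \
    lstm_hidden_dim=256 \
    num_epochs=10 \
    batch_size=32 \
    patience=3 \
    lr=1e-5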

DatasetReaders for NER Taggers

During training, the models read in data using an AllenNLP DatasetReader and accept only a particular data format.

There are currently two dataset readers provided: id-nergrit and th-thainer. They are similar, and are used for the Indonesian NERGrit dataset and the Thai ThaiNER 1.3 dataset respectively. The expected data format is a .txt file with two columns, the first containing the token and the second its NER tag, with an empty line separating every sentence.

An example from the NERGrit dataset is shown here:

Obama        B-PERSON
belakangan   O
memicu       O
kontroversi  O
ketika       O
ia           O
meminta      O
Warren       B-PERSON

Tiga         O
hari         O
sebelum      O

Training Details for Native Models

All our native models are trained, validated and tested on official train/val/test splits from the dataset providers.