Named Entity Recognition Module

Named Entity Recognition (NER) is the task of identifying named entities in input text. While the strict definition of a named entity is a proper noun, the definition is often expanded to include other categories that are useful in information extraction.

Typical named entity categories include ORG (Organization), PER (Person), GPE (Geopolitical Entity) and LOC (Location).

seacorenlp.tagging.ner

Module for Named Entity Recognition

NERTagger()

Base class for instantiating a specific NERTagger (an AllenNLP Predictor)

NER Taggers

Model Performance

Language   | Package     | Model Name                | Architecture                 | Size              | Dataset     | F1 (%)
-----------|-------------|---------------------------|------------------------------|-------------------|-------------|-------
Indonesian | SEACoreNLP  | ner-id-nergrit-xlmr-best  | XLM-R (Base) + Bi-LSTM + CRF | 797.7MB           | NERGrit     | 79.85
Indonesian | SEACoreNLP  | ner-id-nergrit-xlmr       | XLM-R (Base) + Bi-LSTM + CRF | 9.3MB (BiLSTMCRF) | NERGrit     | 75.31
Indonesian | Malaya      | XLNET                     | Transformer Embedding + CRF  | 446.6MB           | Malaya      | 98.73
Indonesian | Malaya      | BERT                      | Transformer Embedding + CRF  | 425.4MB           | Malaya      | 98.54
Indonesian | Malaya      | ALXLNET                   | Transformer Embedding + CRF  | 46.8MB            | Malaya      | 98.34
Indonesian | Malaya      | ALBERT                    | Transformer Embedding + CRF  | 48.6MB            | Malaya      | 96.49
Indonesian | Malaya      | Tiny-BERT                 | Transformer Embedding + CRF  | 57.7MB            | Malaya      | 96.13
Indonesian | Malaya      | Tiny-ALBERT               | Transformer Embedding + CRF  | 22.4MB            | Malaya      | 92.37
Thai       | SEACoreNLP  | ner-th-thainer-xlmr-best  | XLM-R (Base) + Bi-LSTM + CRF | 790.8MB           | ThaiNER 1.3 | 89.49
Thai       | SEACoreNLP  | ner-th-thainer-xlmr       | XLM-R (Base) + Bi-LSTM + CRF | 9.4MB (BiLSTMCRF) | ThaiNER 1.3 | 87.07
Thai       | SEACoreNLP  | ner-th-thainer-scratch    | Embeddings + Bi-LSTM + CRF   | 12.3MB            | ThaiNER 1.3 | 80.11
Thai       | PyThaiNLP   | ThaiNER 1.3               | CRF                          | ?                 | ThaiNER 1.3 | 87.00
Thai       | PyThaiNLP   | WangchanBERTa*            | ?                            | ?                 | ThaiNER 1.3 | 86.49
Thai       | PyThaiNLP   | WangchanBERTa*            | ?                            | ?                 | LST20       | 78.01
Vietnamese | UnderTheSea | ?                         | CRF                          | 172KB             | ?           | ?
Vietnamese | VnCoreNLP*  | ?                         | Dynamic Feature Induction    | 69.5MB            | VLSP 2016   | 88.55

class seacorenlp.tagging.ner.NERTagger

Base class for instantiating a specific NERTagger (an AllenNLP Predictor)

Options for model_name:
  • E.g. ner-th-thainer-xlmr

  • Refer to the NER Taggers performance table above for the full list

Options for library_name:
  • malaya (For Indonesian/Malay)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

classmethod from_library(library_name, **kwargs)

Returns a third-party Predictor based on the name of the library provided.

Keyword arguments can be passed as necessary.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Predictor

classmethod from_pretrained(model_name)

Returns a natively trained AllenNLP Predictor based on the model name provided.

Parameters

model_name (str) – Name of the model

Return type

Predictor
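
For example, one of the native models listed in the performance table above can be loaded as follows (a minimal sketch):

from seacorenlp.tagging.ner import NERTagger

# Load a natively trained AllenNLP Predictor by model name
# (see the performance table above for the available names)
tagger = NERTagger.from_pretrained("ner-th-thainer-xlmr")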

Keyword arguments for NERTagger.from_library

If choosing malaya:
  • engine - Transformer model to use (Default = alxlnet)

    • bert

    • tiny-bert

    • albert

    • tiny-albert

    • xlnet

    • alxlnet

  • quantized - Boolean for whether to use a quantized transformer (Default = False)
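
A minimal sketch of combining these options with NERTagger.from_library:

from seacorenlp.tagging.ner import NERTagger

# Third-party Malaya tagger using a smaller, quantized transformer
tagger = NERTagger.from_library("malaya", engine="tiny-bert", quantized=True)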

AllenNLP Predictor

The tagger returned by NERTagger.from_library or NERTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.

class allennlp.predictors.Predictor
predict(text)
Parameters
  • text (str) – Text to predict on

  • use_bio_format (bool, optional) – Use the BIO format for NER tags (only applies if malaya is chosen as the library); defaults to True

Returns

List of tuples containing the token’s text and its predicted NER tag

Return type

List[Tuple[str, str]]
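
For example (a minimal sketch; the sample sentence is arbitrary and the output shown is illustrative, not actual model output):

from seacorenlp.tagging.ner import NERTagger

tagger = NERTagger.from_pretrained("ner-id-nergrit-xlmr-best")
result = tagger.predict("Obama meminta Warren.")
# result is a list of (token, tag) tuples, e.g.:
# [("Obama", "B-PERSON"), ("meminta", "O"), ("Warren", "B-PERSON"), (".", "O")]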

Command Line Training

The default architecture used for NER in SEACoreNLP models is as follows:

Embeddings > Bi-LSTM Encoder > CRF

The embeddings can either be trained from scratch or taken from a pretrained transformer model on Huggingface. Non-pretrained embeddings can be word embeddings alone or a combination of word embeddings and character embeddings.

SEACoreNLP provides a CLI for training NER taggers. The general script to run is as follows:

seacorenlp train --task=ner train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]

The arguments that can/need to be specified for CLI training are detailed in the following table:

Argument               | Details                                                                   | Options
-----------------------|---------------------------------------------------------------------------|------------------------------------------------------
dataset_reader         | DatasetReader to use                                                      | th-thainer / id-nergrit
model_name             | Name of pre-trained model to use                                          | Huggingface transformer name (e.g. xlm-roberta-base)
use_pretrained         | If using a pre-trained model, set this to true                            | true / false
freeze                 | Whether to freeze the pre-trained model's parameters during training      | true / false
embedding_dim          | No. of dimensions for word embeddings (if not using a pre-trained model)  | Integer
use_char_embeddings    | Whether to use CNN character embeddings alongside word embeddings         | true / false
char_embedding_dim     | No. of dimensions for CNN character embeddings                            | Integer
ngram_filter_sizes     | Size of the n-gram filters for CNN character embeddings                   | Integer
num_filters            | No. of n-gram filters to use                                              | Integer
char_cnn_dropout       | Dropout probability for CNN character embeddings                          | Float
transformer_hidden_dim | No. of output dimensions of the pre-trained model (if selected)           | Integer
lstm_hidden_dim        | No. of dimensions for the Bi-LSTM's hidden state                          | Integer
lstm_layers            | No. of Bi-LSTM layers to use                                              | Integer
lstm_dropout           | Dropout probability between Bi-LSTM layers (if more than one layer)       | Float
num_epochs             | No. of epochs to train for                                                | Integer
batch_size             | Batch size for training                                                   | Integer
patience               | Early stopping patience (in epochs)                                       | Integer
lr                     | Learning rate                                                             | Float (e.g. 1e-5)
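
For example, fine-tuning an XLM-R-based tagger on ThaiNER 1.3 might look as follows (a hypothetical invocation: the file paths and hyperparameter values are placeholders, not recommended settings):

seacorenlp train --task=ner \
    train_data_path=data/thainer/train.txt \
    validation_data_path=data/thainer/val.txt \
    dataset_reader=th-thainer \
    use_pretrained=true \
    model_name=xlm-roberta-base \
    freeze=false \
    lstm_hidden_dim=256 \
    num_epochs=10 \
    batch_size=32 \
    patience=3 \
    lr=1e-5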

DatasetReaders for NER Taggers

During training, the models read in data using an AllenNLP DatasetReader and accept only a particular data format.

There are currently two dataset readers provided: id-nergrit and th-thainer. They are similar, and are used for the Indonesian NERGrit dataset and the Thai ThaiNER 1.3 dataset respectively. The expected data format is a .txt file with two columns, the first containing the token and the second its NER tag, with an empty line separating every sentence.

An example from the NERGrit dataset is shown here:

Obama        B-PERSON
belakangan   O
memicu       O
kontroversi  O
ketika       O
ia           O
meminta      O
Warren       B-PERSON

Tiga         O
hari         O
sebelum      O

Training Details for Native Models

All our native models are trained, validated and tested on official train/val/test splits from the dataset providers.