Named Entity Recognition Module¶
Named Entity Recognition (NER) is the task of identifying named entities in the input text. While a strict definition of a named entity is that of a proper noun, the definition is often expanded to include other categories that may be useful in information extraction.
Typical named entity categories include ORG (Organization), PER (Person), GPE (Geopolitical Entity) and LOC (Location).
seacorenlp.tagging.ner¶
Module for Named Entity Recognition
NERTagger | Base class to instantiate specific NERTagger (AllenNLP Predictor)
NER Taggers¶
Model Performance¶
| Language | Package | Model Name | Architecture | Size | Dataset | F1 (%) |
|---|---|---|---|---|---|---|
| Indonesian | SEACoreNLP | ner-id-nergrit-xlmr-best | XLM-R (Base) + Bi-LSTM + CRF | 797.7MB | NERGrit | 79.85 |
| | | ner-id-nergrit-xlmr | XLM-R (Base) + Bi-LSTM + CRF | 9.3MB (BiLSTMCRF) | NERGrit | 75.31 |
| | Malaya | XLNET | Transformer Embedding + CRF | 446.6MB | ? | 98.73 |
| | | BERT | Transformer Embedding + CRF | 425.4MB | ? | 98.54 |
| | | ALXLNET | Transformer Embedding + CRF | 46.8MB | ? | 98.34 |
| | | ALBERT | Transformer Embedding + CRF | 48.6MB | ? | 96.49 |
| | | Tiny-BERT | Transformer Embedding + CRF | 57.7MB | ? | 96.13 |
| | | Tiny-ALBERT | Transformer Embedding + CRF | 22.4MB | ? | 92.37 |
| Thai | SEACoreNLP | ner-th-thainer-xlmr-best | XLM-R (Base) + Bi-LSTM + CRF | 790.8MB | ThaiNER 1.3 | 89.49 |
| | | ner-th-thainer-xlmr | XLM-R (Base) + Bi-LSTM + CRF | 9.4MB (BiLSTMCRF) | ThaiNER 1.3 | 87.07 |
| | | ner-th-thainer-scratch | Embeddings + Bi-LSTM + CRF | 12.3MB | ThaiNER 1.3 | 80.11 |
| | PyThaiNLP | ThaiNER 1.3 | CRF | ? | ? | 87.00 |
| | | WangchanBERTa* | ? | ? | ? | 86.49 |
| | | ? | ? | ? | ? | 78.01 |
| Vietnamese | underthesea | ? | CRF | 172KB | ? | ? |
| | | ? | Dynamic Feature Induction | 69.5MB | VLSP 2016 | 88.55 |
- class seacorenlp.tagging.ner.NERTagger¶
  Base class to instantiate a specific NERTagger (AllenNLP Predictor)
  - Options for model_name: e.g. ner-th-thainer-xlmr. Refer to the table containing NER Tagger performance for the full list.
  - Options for library_name:
    - malaya (for Indonesian/Malay)
    - pythainlp (for Thai)
    - underthesea (for Vietnamese)
  - classmethod from_library(library_name, **kwargs)¶
    Returns a third-party Predictor based on the name of the library provided. Keyword arguments can be passed as necessary.
    - Parameters
      - library_name (str) – Name of the third-party library
      - **kwargs – Additional keyword arguments specific to each library
    - Return type
      Predictor
Keyword arguments for NERTagger.from_library¶
- If choosing malaya:
  - engine – Transformer model to use (Default = alxlnet). Options: bert, tiny-bert, albert, tiny-albert, xlnet, alxlnet
  - quantized – Boolean for whether to use a quantized transformer (Default = False)
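Putting the options together, loading taggers might look like the following. This is a usage sketch, not a tested snippet: it assumes that seacorenlp and the underlying third-party packages are installed, and uses only the entry points documented above (from_library with the malaya keyword arguments, and from_pretrained with a model name from the performance table).

```python
from seacorenlp.tagging.ner import NERTagger

# Third-party tagger via malaya, passing the engine/quantized
# keyword arguments listed above
tagger = NERTagger.from_library("malaya", engine="bert", quantized=False)

# Native pretrained model, selected by name from the performance table
thai_tagger = NERTagger.from_pretrained("ner-th-thainer-xlmr")
```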
AllenNLP Predictor¶
The NERTagger instance returned by NERTagger.from_library or NERTagger.from_pretrained is an instance of the Predictor class from the AllenNLP library.
- class allennlp.predictors.Predictor¶
  - predict(text)¶
    - Parameters
      - text (str) – Text to predict on
      - use_bio_format (bool, optional) – Use BIO format for NER tags (if malaya is chosen as library), defaults to True
    - Returns
      List of tuples containing the token’s text and its predicted NER tag
    - Return type
      List[Tuple[str, str]]
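Since predict returns a flat list of (token, tag) tuples in BIO format, downstream code often needs to group those tokens into entity spans. The helper below is a minimal sketch of such post-processing; the tagged input is a hypothetical example of the List[Tuple[str, str]] return value described above, not actual model output.

```python
from typing import List, Optional, Tuple

def extract_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in tagged:
        if tag.startswith("B-"):
            if tokens:  # close the previous entity
                entities.append((" ".join(tokens), current_type))
            tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            tokens.append(token)  # continue the current entity
        else:  # an O tag (or inconsistent I- tag) ends the current entity
            if tokens:
                entities.append((" ".join(tokens), current_type))
            tokens, current_type = [], None
    if tokens:  # flush a trailing entity
        entities.append((" ".join(tokens), current_type))
    return entities

# Hypothetical predict() output in the documented format
tagged = [("Obama", "B-PERSON"), ("meminta", "O"),
          ("Elizabeth", "B-PERSON"), ("Warren", "I-PERSON")]
print(extract_entities(tagged))  # [('Obama', 'PERSON'), ('Elizabeth Warren', 'PERSON')]
```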
Command Line Training¶
The default architecture used for NER in SEACoreNLP models is as follows:
Embeddings > Bi-LSTM Encoder > CRF
The embeddings can be trained from scratch or using a pretrained transformer model from Huggingface. Non-pretrained embeddings can be purely word embeddings or a combination of word embeddings and character embeddings.
SEACoreNLP provides a CLI for training NER taggers. The general script to run is as follows:
seacorenlp train --task=ner train_data_path=PATH validation_data_path=PATH [ARGUMENTS ...]
The arguments that can/need to be specified for CLI training are detailed in the following table:
| Argument | Details | Options |
|---|---|---|
| dataset_reader | DatasetReader to use | id-nergrit, th-thainer |
| model_name | Name of pre-trained model to use | Huggingface transformer name |
| use_pretrained | If using a pre-trained model, set this to True | True/False |
| freeze | Whether to freeze pre-trained model’s parameters when training | True/False |
| embedding_dim | No. of dimensions to use for word embeddings (if not using a pre-trained model) | Integer |
| use_char_embeddings | Whether to use CNN character embeddings alongside word embeddings | True/False |
| char_embedding_dim | No. of dimensions to use for CNN character embeddings | Integer |
| ngram_filter_sizes | Size of the ngram filter for CNN character embeddings | Integer |
| num_filters | No. of ngram filters to use | Integer |
| char_cnn_dropout | % dropout for CNN character embeddings | Float |
| transformer_hidden_dim | No. of output dimensions for pre-trained model (if selected) | Integer |
| lstm_hidden_dim | No. of dimensions to use for the Bi-LSTM’s hidden state | Integer |
| lstm_layers | No. of layers of Bi-LSTM to use | Integer |
| lstm_dropout | % of dropout between Bi-LSTM layers (if lstm_layers > 1) | Float |
| num_epochs | No. of epochs to train for | Integer |
| batch_size | Batch size for training | Integer |
| patience | Early stopping patience | Integer |
| lr | Learning rate | Float |
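For example, fine-tuning a Huggingface transformer on the ThaiNER data might be launched as follows. This is a sketch, not a verified command: the data paths, hyperparameter values and the xlm-roberta-base model name are placeholders, and the argument syntax follows the key=value form shown above.

```shell
seacorenlp train --task=ner \
    train_data_path=data/thainer/train.txt \
    validation_data_path=data/thainer/val.txt \
    dataset_reader=th-thainer \
    use_pretrained=True \
    model_name=xlm-roberta-base \
    lstm_hidden_dim=200 \
    num_epochs=20 \
    batch_size=32 \
    lr=0.001
```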
DatasetReaders for NER Taggers¶
When training, the models read in the data using an AllenNLP DatasetReader and accept only a particular data format. There are currently two dataset readers provided: id-nergrit and th-thainer. They are similar and are used for the Indonesian NERGrit dataset and the Thai ThaiNER 1.3 dataset respectively. The expected data format is a .txt file with two columns, the first being the token and the second being the NER tag, with an empty line separating every sentence.
An example from the NERGrit dataset is shown here:
Obama B-PERSON
belakangan O
memicu O
kontroversi O
ketika O
ia O
meminta O
Warren B-PERSON

Tiga O
hari O
sebelum O
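The two-column layout above can be parsed with a few lines of Python. The function below is only an illustration of the expected format (blank-line sentence boundaries, token and tag per line), not the actual AllenNLP DatasetReader used by the library.

```python
from typing import Iterable, List, Tuple

def read_two_column_ner(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Parse token/tag lines into sentences, splitting on blank lines."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:  # empty line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            token, tag = line.split()
            current.append((token, tag))
    if current:  # the last sentence may not end with a blank line
        sentences.append(current)
    return sentences

sample = """Obama B-PERSON
memicu O

Tiga O
hari O
"""
sentences = read_two_column_ner(sample.splitlines())
print(len(sentences))  # 2
```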
Training Details for Native Models¶
All our native models are trained, validated and tested on official train/val/test splits from the dataset providers.