Segmentation Module

Following AllenNLP, the library on which SEACoreNLP is built, we use the terms Tokenizer and SentenceSplitter for word tokenizers and sentence segmenters, respectively.

seacorenlp.data.tokenizers

Module for segmentation tasks (word tokenization & sentence segmentation).

Tokenizer()

Base class for instantiating a language-specific tokenizer.

SentenceSplitter()

Base class for instantiating a language-specific sentence segmenter.

Tokenizers

Model Performance

Language    Name         Architecture                         Test Dataset  F1 (%)
----------  -----------  -----------------------------------  ------------  ------
Indonesian  Trankit*     SentencePiece + FFNN (XLM-R Large)   UD-ID-GSD     99.89
            Stanza       1D-CNN + Bi-LSTM                     UD-ID-GSD     99.99
            Malaya       Regex                                ?             ?
Thai        PyThaiNLP    Deepcut (CNN + FFNN)                 InterBEST     93.00
                         Attacut (3-layer Dilated CNN)        InterBEST     91.00
                         newmm (Dictionary-based)             InterBEST     67.00
Vietnamese  Trankit*     SentencePiece + FFNN (XLM-R Base)    UD-VI-VTB     95.22
            Stanza       1D-CNN + Bi-LSTM                     UD-VI-VTB     87.25
            VnCoreNLP*   SCRDR (Rule-based)                   VLSP 2013     97.90
            UnderTheSea  CRF + Regex                          ?             ?
            PyVI*        CRF                                  ?             98.50
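
The F1 figures above are segmentation scores. As a rough illustration of how such a score can be computed, the sketch below compares a gold and a predicted segmentation by their character spans. It assumes both segmentations cover the same text with whitespace removed; the function names are illustrative, and the evaluation scripts behind the reported figures may differ.

```python
from typing import List, Set, Tuple


def char_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map each token to its (start, end) character offsets in the joined text."""
    spans, start = set(), 0
    for token in tokens:
        end = start + len(token)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Span-level F1 between a gold and a predicted segmentation (0.0 to 1.0)."""
    gold_spans, pred_spans = char_spans(gold), char_spans(pred)
    correct = len(gold_spans & pred_spans)  # spans with identical boundaries
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["ab", "c", "de"]
pred = ["ab", "cde"]  # one of two predicted spans matches one of three gold spans
print(f"{100 * segmentation_f1(gold, pred):.2f}")  # 40.00
```

Here precision is 1/2 and recall is 1/3, giving an F1 of 0.4, or 40.00 on the percentage scale used in the table.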

class seacorenlp.data.tokenizers.Tokenizer

Base class for instantiating a language-specific tokenizer.

Options for library_name:
  • stanza (For Indonesian and Vietnamese)

  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

Defaults available for the following languages:
  • id: Indonesian

  • ms: Malay

  • th: Thai

  • vi: Vietnamese

classmethod from_default(lang)

Returns a default segmenter based on the language specified.

Parameters

lang (str) – The 2-letter ISO 639-1 code of the desired language

Return type

Union[Tokenizer, SentenceSplitter]
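
A minimal usage sketch of from_default, assuming seacorenlp and the default backend for the chosen language are installed. The helper default_tokenize and the constant SUPPORTED_LANGS are illustrative names, not part of the library; the language codes are the ones listed above.

```python
from typing import List

# Language codes listed above as having defaults (illustrative constant,
# not part of SEACoreNLP).
SUPPORTED_LANGS = {"id", "ms", "th", "vi"}


def default_tokenize(text: str, lang: str) -> List[str]:
    """Tokenize `text` with the default tokenizer for `lang`.

    Hypothetical helper: needs seacorenlp and the default backend for the
    language to be installed.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"No default tokenizer documented for {lang!r}")

    # Imported lazily so the language check works without the library.
    from seacorenlp.data.tokenizers import Tokenizer

    tokenizer = Tokenizer.from_default(lang=lang)
    # tokenize() returns AllenNLP Token objects; .text holds the string.
    return [token.text for token in tokenizer.tokenize(text)]
```

For example, default_tokenize(text, "th") would tokenize Thai text with whichever tokenizer the library selects as the Thai default.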

classmethod from_library(library_name, **kwargs)

Returns a third-party segmenter based on the name of the library provided.

Pass keyword arguments as needed to specify options such as the corpus or engine for the third-party segmenter.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Union[Tokenizer, SentenceSplitter]

Keyword arguments for Tokenizer.from_library

If choosing pythainlp:
  • engine - Tokenizing engine for pythainlp (Default = attacut)

    • attacut - Good balance between accuracy and speed

    • newmm - Dictionary-based; uses Thai Character Clusters and maximal matching, and may produce longer tokens

If choosing stanza:
  • lang - Language for stanza

    • id - Indonesian

    • vi - Vietnamese
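
The keyword arguments above can be sketched as follows. The helpers check_kwargs and load_tokenizer and the table DOCUMENTED_KWARGS are hypothetical names introduced for illustration. The check only covers the values documented on this page; the underlying libraries may accept others. Running load_tokenizer assumes seacorenlp and the chosen library are installed.

```python
from typing import Any, Dict, Set

# Options documented on this page for each library (illustrative table,
# not part of SEACoreNLP). No keyword arguments are documented here for
# underthesea.
DOCUMENTED_KWARGS: Dict[str, Dict[str, Set[str]]] = {
    "pythainlp": {"engine": {"attacut", "newmm"}},
    "stanza": {"lang": {"id", "vi"}},
    "underthesea": {},
}


def check_kwargs(library_name: str, **kwargs: Any) -> Dict[str, Any]:
    """Return kwargs unchanged if they match the options documented above."""
    if library_name not in DOCUMENTED_KWARGS:
        raise ValueError(f"Unknown library: {library_name!r}")
    allowed = DOCUMENTED_KWARGS[library_name]
    for key, value in kwargs.items():
        if key not in allowed:
            raise ValueError(f"{library_name} has no documented option {key!r}")
        if value not in allowed[key]:
            raise ValueError(f"{key}={value!r} is not documented for {library_name}")
    return kwargs


def load_tokenizer(library_name: str, **kwargs: Any):
    """Build a third-party tokenizer after checking the keyword arguments."""
    checked = check_kwargs(library_name, **kwargs)
    # Imported lazily so check_kwargs can be used without the library.
    from seacorenlp.data.tokenizers import Tokenizer

    return Tokenizer.from_library(library_name, **checked)
```

Typical calls would be load_tokenizer("pythainlp", engine="newmm") or load_tokenizer("stanza", lang="vi").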

AllenNLP Tokenizer

The Tokenizer instance returned by Tokenizer.from_library or Tokenizer.from_default is an instance of the Tokenizer class from the AllenNLP library.

class allennlp.data.tokenizers.Tokenizer
tokenize(text)
Parameters

text (str) – Text to be tokenized

Returns

List of AllenNLP tokens

Return type

List[Token]
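
Because tokenize returns Token objects rather than strings, downstream code usually reads the text attribute of each token. The sketch below uses a stand-in class so that it runs without AllenNLP installed; StandInToken and token_texts are illustrative names, and only the text attribute of the real Token class is modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StandInToken:
    """Stand-in for allennlp.data.tokenizers.Token; only `.text` is modelled."""

    text: str


def token_texts(tokens: Iterable) -> List[str]:
    """Extract the surface string from each token-like object."""
    return [token.text for token in tokens]


tokens = [StandInToken("Saya"), StandInToken("makan"), StandInToken("nasi")]
print(token_texts(tokens))  # ['Saya', 'makan', 'nasi']
```

The same helper should work on the list returned by a real tokenizer, since it relies only on the text attribute.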

Sentence Segmenters

Model Performance

Language    Name         Architecture                         Test Dataset  F1 (%)
----------  -----------  -----------------------------------  ------------  ---------
Indonesian  Trankit*     SentencePiece + FFNN (XLM-R Large)   UD-ID-GSD     95.54
            Stanza       1D-CNN + Bi-LSTM                     UD-ID-GSD     93.78
            Malaya       Regex                                ?             ?
Thai        PyThaiNLP    CRFCut (CRF trained on TED dataset)  ORCHID        87.00 [1]
Vietnamese  Trankit*     SentencePiece + FFNN (XLM-R Large)   UD-VI-VTB     96.63
            Stanza       1D-CNN + Bi-LSTM                     UD-VI-VTB     93.15
            UnderTheSea  ?                                    ?             ?

[1] Refer to the original CRFCut GitHub repository for more details on performance when trained and tested on different datasets.

class seacorenlp.data.tokenizers.SentenceSplitter

Base class for instantiating a language-specific sentence segmenter.

Options for library_name:
  • pythainlp (For Thai)

  • underthesea (For Vietnamese)

Defaults available for the following languages:
  • id: Indonesian

  • ms: Malay

  • th: Thai

  • vi: Vietnamese

classmethod from_default(lang)

Returns a default segmenter based on the language specified.

Parameters

lang (str) – The 2-letter ISO 639-1 code of the desired language

Return type

Union[Tokenizer, SentenceSplitter]

classmethod from_library(library_name, **kwargs)

Returns a third-party segmenter based on the name of the library provided.

Pass keyword arguments as needed to specify options such as the corpus or engine for the third-party segmenter.

Parameters
  • library_name (str) – Name of third-party library

  • **kwargs – Additional keyword arguments specific to each library

Return type

Union[Tokenizer, SentenceSplitter]

AllenNLP SentenceSplitter

The SentenceSplitter instance returned by SentenceSplitter.from_library or SentenceSplitter.from_default is an instance of the SentenceSplitter class from the AllenNLP library.

class allennlp.data.tokenizers.SentenceSplitter
split_sentences(text)
Parameters

text (str) – Text to be split into sentences

Returns

List of sentences

Return type

List[str]
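
Since split_sentences returns plain strings and tokenize accepts a string, the two segmenters can be chained. The sketch below shows one way to do this; the function segment and the two stub classes are hypothetical and exist only so that the example runs without SEACoreNLP installed. The stubs split on full stops and whitespace, which is not how the real segmenters work.

```python
from types import SimpleNamespace
from typing import List


def segment(text: str, splitter, tokenizer) -> List[List[str]]:
    """Split `text` into sentences, then each sentence into token strings."""
    return [
        [token.text for token in tokenizer.tokenize(sentence)]
        for sentence in splitter.split_sentences(text)
    ]


class StubSplitter:
    """Stand-in sentence splitter: breaks on full stops only."""

    def split_sentences(self, text: str) -> List[str]:
        return [part.strip() + "." for part in text.split(".") if part.strip()]


class StubTokenizer:
    """Stand-in tokenizer: breaks on whitespace, returns objects with `.text`."""

    def tokenize(self, text: str):
        return [SimpleNamespace(text=word) for word in text.split()]


result = segment("Saya makan nasi. Dia minum teh.", StubSplitter(), StubTokenizer())
print(result)  # [['Saya', 'makan', 'nasi.'], ['Dia', 'minum', 'teh.']]
```

With SEACoreNLP installed, the stubs could presumably be replaced by the objects returned from SentenceSplitter.from_default and Tokenizer.from_default for the same language.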