Segmentation Module

Following the conventions of the AllenNLP library, on which SEACoreNLP is built, we use the terms Tokenizer and SentenceSplitter to refer to word tokenizers and sentence segmenters respectively.

seacorenlp.data.tokenizers

Module for segmentation tasks (word tokenization & sentence segmentation).

`Tokenizer`()	Base class for instantiating a language-specific tokenizer.
`SentenceSplitter`()	Base class for instantiating a language-specific sentence segmenter.

Tokenizers

Model Performance

Language	Name	Architecture	Test Dataset	F1 (%)
Indonesian	Trankit*	SentencePiece + FFNN (XLM-R Large)	UD-ID-GSD	99.89
	Stanza	1D-CNN + Bi-LSTM	UD-ID-GSD	99.99
	Malaya	Regex	?	?
Thai	PyThaiNLP	Deepcut (CNN + FFNN)	InterBEST	93.00
		Attacut (3-layer Dilated CNN)	InterBEST	91.00
		newmm (Dictionary-based)	InterBEST	67.00
Vietnamese	Trankit*	SentencePiece + FFNN (XLM-R Base)	UD-VI-VTB	95.22
	Stanza	1D-CNN + Bi-LSTM	UD-VI-VTB	87.25
	VnCoreNLP*	SCRDR (Rule-based)	VLSP 2013	97.90
	UnderTheSea	CRF + Regex	?	?
	PyVI*	CRF	?	98.50
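
The table does not state how each F1 score was computed. As a rough illustration only, the sketch below uses one common convention for word segmentation: a predicted token counts as correct when its character span exactly matches a gold token's span. The function names `token_spans` and `segmentation_f1` are hypothetical and are not part of SEACoreNLP or any library listed above.

```python
from typing import List, Set, Tuple


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Convert tokens to (start, end) character offsets over their concatenation."""
    spans = set()
    position = 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans


def segmentation_f1(gold: List[str], predicted: List[str]) -> float:
    """Span-level F1 between two segmentations of the same character sequence."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = token_spans(gold)
    predicted_spans = token_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

With gold tokens `["ab", "c", "de"]` and predicted tokens `["ab", "cde"]`, only the span of `"ab"` matches, giving a precision of 1/2, a recall of 1/3 and an F1 of 0.4. The scores in the table may follow a different protocol, so this sketch should not be used to reproduce them.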

class seacorenlp.data.tokenizers.Tokenizer

Base class for instantiating a language-specific tokenizer.

Options for library_name:

stanza (For Indonesian and Vietnamese)
pythainlp (For Thai)
underthesea (For Vietnamese)

Defaults available for the following languages:

id: Indonesian
ms: Malay
th: Thai
vi: Vietnamese

classmethod from_default(lang)

Returns a default segmenter based on the language specified.

Parameters: lang (str) – The 2-letter ISO 639-1 code of the desired language
Return type: Union[Tokenizer, SentenceSplitter]
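
A minimal usage sketch for `from_default`, assuming `seacorenlp` and the default model for the chosen language are installed. The helper `default_tokenizer` and the `SUPPORTED_DEFAULTS` mapping are hypothetical conveniences built from the language list above. They are not part of the SEACoreNLP API.

```python
# Languages for which the documentation lists a default tokenizer.
SUPPORTED_DEFAULTS = {
    "id": "Indonesian",
    "ms": "Malay",
    "th": "Thai",
    "vi": "Vietnamese",
}


def default_tokenizer(lang: str):
    """Return the default SEACoreNLP tokenizer for a 2-letter ISO 639-1 code."""
    if lang not in SUPPORTED_DEFAULTS:
        supported = ", ".join(sorted(SUPPORTED_DEFAULTS))
        raise ValueError(f"No default tokenizer for {lang!r}; choose one of: {supported}")
    # Imported lazily so the language check works even without seacorenlp installed.
    from seacorenlp.data.tokenizers import Tokenizer

    return Tokenizer.from_default(lang)


# Example call (requires seacorenlp and its model files):
#     thai_tokenizer = default_tokenizer("th")
```

Checking the code before calling `from_default` gives a clear error for unsupported languages. How `from_default` itself reacts to an unknown code is not documented here.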

classmethod from_library(library_name, **kwargs)

Returns a third-party segmenter based on the name of the library provided.

Pass keyword arguments as needed to configure the third-party segmenter, for example to select its corpus or engine.

Parameters

library_name (str) – Name of third-party library
**kwargs – Additional keyword arguments specific to each library

Return type

Union[Tokenizer, SentenceSplitter]

Keyword arguments for Tokenizer.from_library

If choosing pythainlp:

engine - Tokenizing engine for pythainlp (Default = attacut)
- attacut - Good balance between accuracy and speed
- newmm - Dictionary-based, uses Thai Character Clusters and maximal matching, may produce longer tokens

If choosing stanza:

lang - Language for stanza
- id - Indonesian
- vi - Vietnamese
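
The options above can be combined as in the following sketch, which assumes `seacorenlp` and the relevant third-party library are installed. The `LIBRARY_CHOICES` mapping and the `third_party_tokenizer` helper are hypothetical. Only the library names and keyword arguments come from the lists above, and calling `from_library("underthesea")` with no extra arguments is an assumption based on no keyword arguments being listed for it.

```python
# One possible library choice per language, using only the documented options.
LIBRARY_CHOICES = {
    "id": ("stanza", {"lang": "id"}),
    "th": ("pythainlp", {"engine": "newmm"}),
    "vi": ("underthesea", {}),  # assumption: no keyword arguments are needed
}


def third_party_tokenizer(lang: str):
    """Build a tokenizer backed by the third-party library chosen for `lang`."""
    if lang not in LIBRARY_CHOICES:
        raise ValueError(f"No library configured for language {lang!r}")
    library_name, kwargs = LIBRARY_CHOICES[lang]
    # Imported lazily so the mapping can be inspected without seacorenlp installed.
    from seacorenlp.data.tokenizers import Tokenizer

    return Tokenizer.from_library(library_name, **kwargs)


# Equivalent direct calls (require the libraries to be installed):
#     Tokenizer.from_library("pythainlp", engine="newmm")
#     Tokenizer.from_library("stanza", lang="id")
```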

AllenNLP Tokenizer

The object returned by Tokenizer.from_library or Tokenizer.from_default is an instance of the Tokenizer class from the AllenNLP library.

class allennlp.data.tokenizers.Tokenizer

tokenize(text)

Parameters: text (str) – Text to be tokenized
Returns: List of AllenNLP tokens
Return type: List[Token]
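
Because `tokenize` returns AllenNLP `Token` objects rather than plain strings, the surface form of each token is read from its `text` attribute. The helper below is a small sketch of that step. The name `token_texts` is hypothetical, and the function works with any object that exposes a compatible `tokenize` method.

```python
from typing import List


def token_texts(tokenizer, text: str) -> List[str]:
    """Tokenize `text` and return the string form of each resulting token."""
    # Each AllenNLP Token stores its surface string in the `text` attribute.
    return [token.text for token in tokenizer.tokenize(text)]


# Example call with a SEACoreNLP tokenizer (requires seacorenlp to be installed):
#     from seacorenlp.data.tokenizers import Tokenizer
#     tokenizer = Tokenizer.from_default("id")
#     words = token_texts(tokenizer, "Saya suka makan nasi goreng.")
```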

Sentence Segmenters

Model Performance

Language	Name	Architecture	Test Dataset	F1 (%)
Indonesian	Trankit*	SentencePiece + FFNN (XLM-R Large)	UD-ID-GSD	95.54
	Stanza	1D-CNN + Bi-LSTM	UD-ID-GSD	93.78
	Malaya	Regex	?	?
Thai	PyThaiNLP	CRFCut (CRF trained on TED dataset)	ORCHID	87.00 [1]
Vietnamese	Trankit*	SentencePiece + FFNN (XLM-R Large)	UD-VI-VTB	96.63
	Stanza	1D-CNN + Bi-LSTM	UD-VI-VTB	93.15
	UnderTheSea	?	?	?

[1] Refer to the original CRFCut GitHub repository for more details on performance when trained and tested on different datasets.

class seacorenlp.data.tokenizers.SentenceSplitter

Base class for instantiating a language-specific sentence segmenter.