Segmentation Module
Following in the footsteps of the AllenNLP library on which SEACoreNLP is built, we use the terms Tokenizer and SentenceSplitter to refer to word tokenizers and sentence segmenters respectively.
seacorenlp.data.tokenizers
Module for segmentation tasks (word tokenization & sentence segmentation).
Class | Description |
---|---|
Tokenizer | Base class to instantiate specific tokenizer for language. |
SentenceSplitter | Base class to instantiate specific sentence segmenter for language. |
Tokenizers
Model Performance
Language | Name | Architecture | Test Dataset | F1 (%) |
---|---|---|---|---|
Indonesian | | SentencePiece + FFNN (XLM-R Large) | | 99.89 |
Indonesian | | 1D-CNN + Bi-LSTM | | 99.99 |
Indonesian | | Regex | ? | ? |
Thai | | Deepcut (CNN + FFNN) | | 93.00 |
Thai | | Attacut (3-layer Dilated CNN) | | 91.00 |
Thai | | newmm (Dictionary-based) | | 67.00 |
Vietnamese | | SentencePiece + FFNN (XLM-R Base) | | 95.22 |
Vietnamese | | 1D-CNN + Bi-LSTM | | 87.25 |
Vietnamese | | SCRDR (Rule-based) | VLSP 2013 | 97.90 |
Vietnamese | | CRF + Regex | ? | ? |
Vietnamese | | CRF | ? | 98.50 |
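The table does not state how the F1 scores were computed. One common convention for word segmentation is to compare predicted and gold tokens by their character spans. The sketch below assumes that convention and is illustrative only; it is not necessarily the evaluation behind the figures above, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map tokens to (start, end) character offsets over the concatenated text."""
    spans = set()
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[str], predicted: List[str]) -> float:
    """Span-level F1: a predicted token counts only if both boundaries match."""
    gold_spans = token_spans(gold)
    predicted_spans = token_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

Under this convention a single merged or split token costs both precision and recall, which is why segmentation scores are usually reported as F1 rather than accuracy.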
class seacorenlp.data.tokenizers.Tokenizer
Base class to instantiate specific tokenizer for language.
- Options for library_name:
  - stanza (For Indonesian and Vietnamese)
  - pythainlp (For Thai)
  - underthesea (For Vietnamese)
- Defaults available for the following languages:
  - id: Indonesian
  - ms: Malay
  - th: Thai
  - vi: Vietnamese
classmethod from_default(lang)
Returns a default segmenter based on the language specified.
- Parameters
  - lang (str) – The 2-letter ISO 639-1 code of the desired language
- Return type
  - Union[Tokenizer, SentenceSplitter]
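As a minimal sketch of how from_default might be used for word tokenization: the helper name default_tokenize, the language table, and the lazy import are assumptions made for illustration. Only Tokenizer.from_default and the four language codes come from this page. The returned object is an AllenNLP tokenizer, so tokenize() yields Token objects with a text attribute.

```python
from typing import List

# Language codes documented above as having a default tokenizer.
DEFAULT_LANGUAGES = {"id": "Indonesian", "ms": "Malay", "th": "Thai", "vi": "Vietnamese"}


def default_tokenize(text: str, lang: str) -> List[str]:
    """Tokenize text with the default tokenizer for lang (hypothetical helper)."""
    if lang not in DEFAULT_LANGUAGES:
        raise ValueError(f"No default tokenizer documented for language code {lang!r}")
    # Imported lazily so this sketch can be loaded without seacorenlp installed.
    from seacorenlp.data.tokenizers import Tokenizer

    tokenizer = Tokenizer.from_default(lang)
    # AllenNLP tokenizers return Token objects; keep only the surface strings.
    return [token.text for token in tokenizer.tokenize(text)]
```

With seacorenlp installed, a call such as default_tokenize("Saya suka makan nasi.", "id") would return the token strings produced by the default Indonesian tokenizer.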
classmethod from_library(library_name, **kwargs)
Returns a third-party segmenter based on the name of the library provided.
Keyword arguments can be passed as necessary to specify the corpora and engines etc. for the third-party segmenter.
- Parameters
  - library_name (str) – Name of third-party library
  - **kwargs – Additional keyword arguments specific to each library
- Return type
  - Union[Tokenizer, SentenceSplitter]
Keyword arguments for Tokenizer.from_library
- If choosing pythainlp:
  - engine - Tokenizing engine for pythainlp (Default = attacut)
    - attacut - Good balance between accuracy and speed
    - newmm - Dictionary-based, uses Thai Character Clusters and maximal matching, may produce longer tokens
- If choosing stanza:
  - lang - Language for stanza
    - id - Indonesian
    - vi - Vietnamese
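The keyword arguments listed above can be collected into named configurations, as in the sketch below. The configuration names and the build_tokenizer helper are hypothetical. The library names and keyword values are the ones documented in this section, and the underthesea entry assumes no extra keyword arguments are required because none are listed.

```python
from typing import Any, Dict, Tuple

# (library_name, kwargs) pairs taken from the options documented above.
EXAMPLE_CONFIGS: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "thai_attacut": ("pythainlp", {"engine": "attacut"}),
    "thai_newmm": ("pythainlp", {"engine": "newmm"}),
    "indonesian_stanza": ("stanza", {"lang": "id"}),
    "vietnamese_stanza": ("stanza", {"lang": "vi"}),
    # Assumption: no keyword arguments are listed for underthesea.
    "vietnamese_underthesea": ("underthesea", {}),
}


def build_tokenizer(config_name: str):
    """Instantiate a third-party tokenizer from a named configuration (hypothetical helper)."""
    library_name, kwargs = EXAMPLE_CONFIGS[config_name]
    # Imported lazily so this sketch can be loaded without seacorenlp installed.
    from seacorenlp.data.tokenizers import Tokenizer

    return Tokenizer.from_library(library_name, **kwargs)
```

Keeping the configurations in one table makes it easy to swap engines, for example when comparing attacut against newmm on the same Thai text.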
AllenNLP Tokenizer
The Tokenizer instance returned by Tokenizer.from_library or Tokenizer.from_default is an instance of the Tokenizer class from the AllenNLP library.
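Because the returned object follows the AllenNLP Tokenizer interface, downstream code only needs its tokenize() method, which returns Token objects carrying a text attribute. The sketch below shows that contract using a whitespace-based stand-in so that it runs without AllenNLP or SEACoreNLP installed. StandInToken, WhitespaceStandIn, and token_texts are hypothetical names, not part of either library.

```python
from typing import List, NamedTuple


class StandInToken(NamedTuple):
    """Hypothetical stand-in for an AllenNLP Token (only the text field is modelled)."""
    text: str


class WhitespaceStandIn:
    """Hypothetical stand-in exposing the same tokenize() shape as an AllenNLP Tokenizer."""

    def tokenize(self, text: str) -> List[StandInToken]:
        return [StandInToken(piece) for piece in text.split()]


def token_texts(tokenizer, text: str) -> List[str]:
    """Return surface strings from any object with an AllenNLP-style tokenize() method."""
    return [token.text for token in tokenizer.tokenize(text)]


print(token_texts(WhitespaceStandIn(), "Saya suka makan nasi"))
```

A tokenizer obtained from Tokenizer.from_default or Tokenizer.from_library could be passed to token_texts in place of the stand-in.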
Sentence Segmenters
Model Performance
Language | Name | Architecture | Test Dataset | F1 (%) |
---|---|---|---|---|
Indonesian | | SentencePiece + FFNN (XLM-R Large) | | 95.54 |
Indonesian | | 1D-CNN + Bi-LSTM | | 93.78 |
Indonesian | | Regex | ? | ? |
Thai | | CRFCut (CRF trained on TED dataset) | ORCHID | 87.00 [1] |
Vietnamese | | SentencePiece + FFNN (XLM-R Large) | | 96.63 |
Vietnamese | | 1D-CNN + Bi-LSTM | | 93.15 |
Vietnamese | | ? | ? | ? |

[1] Refer to the original CRFCut GitHub repository for more details on performance when trained and tested on different datasets.
class seacorenlp.data.tokenizers.SentenceSplitter
Base class to instantiate specific sentence segmenter for language.
- Options for library_name:
  - pythainlp (For Thai)
  - underthesea (For Vietnamese)
- Defaults available for the following languages:
  - id: Indonesian
  - ms: Malay
  - th: Thai
  - vi: Vietnamese
classmethod from_default(lang)
Returns a default segmenter based on the language specified.
- Parameters
  - lang (str) – The 2-letter ISO 639-1 code of the desired language
- Return type
  - Union[Tokenizer, SentenceSplitter]
classmethod from_library(library_name, **kwargs)
Returns a third-party segmenter based on the name of the library provided.
Keyword arguments can be passed as necessary to specify the corpora and engines etc. for the third-party segmenter.
- Parameters
  - library_name (str) – Name of third-party library
  - **kwargs – Additional keyword arguments specific to each library
- Return type
  - Union[Tokenizer, SentenceSplitter]
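The two constructors can be combined behind a single entry point, as in the sketch below. The helper make_sentence_splitter and its validation tables are hypothetical and simply encode the language and library pairings listed for SentenceSplitter above. It assumes no additional keyword arguments are needed for the third-party splitters, since none are documented here.

```python
from typing import Optional

# Pairings documented above for SentenceSplitter.
DEFAULT_LANGUAGES = {"id", "ms", "th", "vi"}
LIBRARY_LANGUAGES = {"pythainlp": {"th"}, "underthesea": {"vi"}}


def make_sentence_splitter(lang: str, library_name: Optional[str] = None):
    """Return a default or third-party sentence splitter (hypothetical helper)."""
    if library_name is None:
        if lang not in DEFAULT_LANGUAGES:
            raise ValueError(f"No default sentence splitter documented for {lang!r}")
    elif lang not in LIBRARY_LANGUAGES.get(library_name, set()):
        raise ValueError(f"{library_name!r} is not documented for language {lang!r}")
    # Imported lazily so this sketch can be loaded without seacorenlp installed.
    from seacorenlp.data.tokenizers import SentenceSplitter

    if library_name is None:
        return SentenceSplitter.from_default(lang)
    return SentenceSplitter.from_library(library_name)
```

Validating the pairing before construction gives a clear error for unsupported combinations, such as requesting pythainlp for Indonesian.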
AllenNLP SentenceSplitter
The SentenceSplitter instance returned by SentenceSplitter.from_library or SentenceSplitter.from_default is an instance of the SentenceSplitter class from the AllenNLP library.
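An AllenNLP SentenceSplitter exposes split_sentences(), which takes a string and returns a list of sentence strings. Combined with a tokenizer, this gives a simple two-stage segmentation pipeline. The sketch below uses regex-based stand-ins so that it runs without either library installed. All class and function names here are hypothetical, and the regexes are only placeholders for the real models.

```python
import re
from typing import List, NamedTuple


class StandInToken(NamedTuple):
    """Hypothetical stand-in for an AllenNLP Token."""
    text: str


class StandInSplitter:
    """Hypothetical stand-in with the same split_sentences() shape as AllenNLP."""

    def split_sentences(self, text: str) -> List[str]:
        # Split on whitespace that follows sentence-final punctuation.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class StandInTokenizer:
    """Hypothetical stand-in with the same tokenize() shape as AllenNLP."""

    def tokenize(self, text: str) -> List[StandInToken]:
        # Words and individual punctuation marks become separate tokens.
        return [StandInToken(t) for t in re.findall(r"\w+|[^\w\s]", text)]


def segment_document(splitter, tokenizer, text: str) -> List[List[str]]:
    """Split text into sentences, then tokenize each sentence."""
    return [
        [token.text for token in tokenizer.tokenize(sentence)]
        for sentence in splitter.split_sentences(text)
    ]


print(segment_document(StandInSplitter(), StandInTokenizer(), "Saya suka nasi. Dia suka mi."))
```

Objects returned by SentenceSplitter.from_default and Tokenizer.from_default could be passed to segment_document in place of the stand-ins.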