.. currentmodule:: seacorenlp.data.tokenizers

###################
Segmentation Module
###################

Following in the footsteps of the AllenNLP library on which SEACoreNLP is built,
we have employed the terms ``Tokenizer`` and ``SentenceSplitter`` to refer to
word tokenizers and sentence segmenters respectively.

**************************
seacorenlp.data.tokenizers
**************************

.. automodule:: seacorenlp.data.tokenizers

.. autosummary::

   Tokenizer
   SentenceSplitter

**********
Tokenizers
**********

Model Performance
=================

.. csv-table::
   :file: ../intro/tables/tokenizers.csv
   :header-rows: 1

.. autoclass:: Tokenizer
   :inherited-members:

Keyword arguments for Tokenizer.from_library
============================================

If choosing ``pythainlp``:

* ``engine`` - Tokenizing engine for pythainlp (Default = ``attacut``)

  * ``attacut`` - Good balance between accuracy and speed
  * ``newmm`` - Dictionary-based, uses Thai Character Clusters and maximal matching, may produce longer tokens

If choosing ``stanza``:

* ``lang`` - Language for stanza

  * ``id`` - Indonesian
  * ``vi`` - Vietnamese

AllenNLP Tokenizer
==================

The ``Tokenizer`` instance returned by ``Tokenizer.from_library`` or
``Tokenizer.from_default`` is an instance of the ``Tokenizer`` class from the
AllenNLP library.

.. class:: allennlp.data.tokenizers.Tokenizer

   .. method:: tokenize(text)

      :param text: Text to be tokenized
      :type text: str
      :return: List of AllenNLP tokens
      :rtype: List[Token]

*******************
Sentence Segmenters
*******************

Model Performance
=================

.. csv-table::
   :file: ../intro/tables/sentence-segmenters.csv
   :header-rows: 1

.. [#f1] Refer to original `CRFCut Github `_ for more details on performance when trained and tested on different datasets.

.. autoclass:: SentenceSplitter
   :inherited-members:

AllenNLP SentenceSplitter
=========================

The ``SentenceSplitter`` instance returned by ``SentenceSplitter.from_library``
or ``SentenceSplitter.from_default`` is an instance of the ``SentenceSplitter``
class from the AllenNLP library.

.. class:: allennlp.data.tokenizers.SentenceSplitter

   .. method:: split_sentences(text)

      :param text: Text to be split into sentences
      :type text: str
      :return: List of sentences
      :rtype: List[str]
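
The two interfaces documented on this page compose naturally: ``split_sentences(text)`` yields a list of strings, and each string can then be passed to ``tokenize(text)``. The sketch below illustrates that flow under stated assumptions. ``ToySentenceSplitter``, ``ToyTokenizer``, ``Token`` and ``segment`` are hypothetical stand-ins written for this example so that it runs without SEACoreNLP or any model download; they only mimic the documented method names and return shapes, and their regex rules are not what any SEACoreNLP model does. In real use, the objects returned by ``SentenceSplitter.from_library`` / ``from_default`` and ``Tokenizer.from_library`` / ``from_default`` would presumably take their place.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Stand-in for an AllenNLP token; only the ``text`` attribute is mimicked."""
    text: str


class ToySentenceSplitter:
    """Hypothetical stand-in exposing the documented ``split_sentences(text)``."""

    def split_sentences(self, text: str) -> List[str]:
        # Naive rule: break after '.', '!' or '?' when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [part for part in parts if part]


class ToyTokenizer:
    """Hypothetical stand-in exposing the documented ``tokenize(text)``."""

    def tokenize(self, text: str) -> List[Token]:
        # Naive rule: runs of word characters, or single punctuation marks.
        return [Token(piece) for piece in re.findall(r"\w+|[^\w\s]", text)]


def segment(text: str, splitter, tokenizer) -> List[List[str]]:
    """Split text into sentences, then tokenize each sentence."""
    return [
        [token.text for token in tokenizer.tokenize(sentence)]
        for sentence in splitter.split_sentences(text)
    ]


result = segment(
    "Saya suka makan nasi. Dia pergi ke pasar!",
    ToySentenceSplitter(),
    ToyTokenizer(),
)
print(result)
```

Because ``segment`` relies only on the two method names, any pair of objects that follows the same interface can be passed in without changing the function.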