##########
Quickstart
##########

SEACoreNLP currently supports six main tasks:

* Word Tokenization
* Sentence Segmentation
* Part-of-speech Tagging
* Named Entity Recognition
* Constituency Parsing
* Dependency Parsing

SEACoreNLP provides classes that can perform each of these tasks.

*******
Classes
*******

Segmentation Tasks
==================

For tokenization and sentence segmentation, we provide the classes ``Tokenizer`` and ``SentenceSplitter``.
As we do not provide natively trained segmenters for now, the only segmenters available are from third-party libraries.

To instantiate a segmenter, use the ``from_library`` method if you have a specific one in mind,
or the ``from_default`` method if you would like to use the default segmenter.

.. code-block:: python

  from seacorenlp.data.tokenizers import Tokenizer, SentenceSplitter

  text = 'ผมอยากกินข้าว'

  # Default Tokenizer
  tokenizer = Tokenizer.from_default('th')
  tokenizer.tokenize(text)
  # Output: [ผม, อยาก, กิน, ข้าว]

  # Specific Tokenizer
  tokenizer = Tokenizer.from_library('pythainlp', engine='newmm')
  tokenizer.tokenize(text)
  # Output: [ผม, อยาก, กินข้าว]

  longer_text = 'Tôi muốn ăn cơm. Chị muốn đi du lịch.'

  # Default SentenceSplitter
  splitter = SentenceSplitter.from_default('vi')
  splitter.split_sentences(longer_text)
  # Output: ['Tôi muốn ăn cơm.', 'Chị muốn đi du lịch.']

Tagging & Parsing Tasks
=======================

For tagging (POS, NER) and parsing (constituency, dependency) tasks, we provide natively trained models as well as third-party models.
Instantiate the relevant class with the ``from_pretrained`` method for a native model, or with the ``from_library`` method for a third-party model.

.. code-block:: python

  from seacorenlp.tagging import POSTagger

  th_text = 'ผมอยากกินข้าว'

  # Native Models
  native_tagger = POSTagger.from_pretrained('pos-th-ud-xlmr')
  native_tagger.predict(th_text)
  # Output: [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]

  # External Models
  # Include keyword arguments as necessary (see respective class documentation)
  external_tagger = POSTagger.from_library('pythainlp', corpus='orchid')
  external_tagger.predict(th_text)
  # Output: [('ผม', 'PPRS'), ('อยาก', 'XVMM'), ('กิน', 'VACT'), ('ข้าว', 'NCMN')]

For the full list of models available, refer to `Model Performance `_.

Here are some examples for each task:

.. code-block:: python

  from seacorenlp.tagging import POSTagger, NERTagger
  from seacorenlp.parsing import ConstituencyParser, DependencyParser

  # POS Tagging
  pos_text = 'Tôi muốn ăn cơm.'
  pos_tagger = POSTagger.from_pretrained('pos-vi-ud-xlmr')
  pos_tagger.predict(pos_text)
  # Output: [('Tôi', 'PROPN'), ('muốn', 'VERB'), ('ăn', 'VERB'), ('cơm', 'NOUN'), ('.', 'PUNCT')]

  # NER
  ner_text = 'Thủ tướng Trung Quốc Ôn Gia Bảo đã đến thăm Việt Nam vào năm 2004.'
  ner_tagger = NERTagger.from_library('underthesea')
  ner_tagger.predict(ner_text)
  # Output:
  # [('Thủ tướng', 'O'),
  #  ('Trung Quốc', 'B-LOC'),
  #  ('Ôn', 'B-PER'),
  #  ('Gia Bảo', 'I-PER'),
  #  ('đã', 'O'),
  #  ('đến', 'O'),
  #  ('thăm', 'O'),
  #  ('Việt Nam', 'B-LOC'),
  #  ('vào', 'O'),
  #  ('năm', 'O'),
  #  ('2004', 'O'),
  #  ('.', 'O')]

  # Constituency Parsing
  const_text = 'Saya pergi ke sekolah'
  const_parser = ConstituencyParser.from_pretrained('cp-id-kethu-benepar-xlmr-best')
  trees = const_parser.predict(const_text)
  print(trees[0])
  # Output:
  # (TOP
  #   (S
  #     (NP-SBJ (PRP Saya))
  #     (VP (VB pergi) (PP (IN ke) (NP (NN sekolah))))))

  # Dependency Parsing
  dep_text = 'Saya pergi ke sekolah'
  dep_parser = DependencyParser.from_pretrained('dp-id-ud-xlmr')
  results = dep_parser.predict(dep_text)
  print(results[0])
  # Output: [('Saya', 2, 'nsubj'), ('pergi', 0, 'root'), ('ke', 4, 'case'), ('sekolah', 2, 'obl')]
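The tagger and parser outputs shown above are plain Python lists of tuples, so they can be post-processed without any extra dependencies. The following is a minimal sketch, not part of SEACoreNLP: ``group_entities`` and ``resolve_heads`` are hypothetical helper names introduced here for illustration. It assumes that NER tags follow the BIO scheme seen in the sample output, that dependency head indices are 1-based with ``0`` marking the root (as the sample output suggests), and that joining entity tokens with a space is acceptable for the language at hand.

```python
from typing import List, Optional, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans.

    Hypothetical helper; assumes tags look like 'O', 'B-LOC', 'I-PER'.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in tagged:
        prefix, _, label = tag.partition('-')
        # An I- tag continues the open entity only if the type matches.
        if prefix == 'I' and current_tokens and label == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that is currently open.
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- starts a new entity; a stray I- is treated as a start as well.
        if prefix in ('B', 'I') and label:
            current_tokens, current_type = [token], label

    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


def resolve_heads(parsed: List[Tuple[str, int, str]]) -> List[Tuple[str, str, str]]:
    """Replace head indices with head words.

    Hypothetical helper; assumes 1-based head indices, with 0 meaning root.
    """
    words = [word for word, _, _ in parsed]
    return [
        (word, 'ROOT' if head == 0 else words[head - 1], relation)
        for word, head, relation in parsed
    ]


# Sample data copied from the NER and dependency parsing outputs above.
ner_output = [
    ('Thủ tướng', 'O'), ('Trung Quốc', 'B-LOC'), ('Ôn', 'B-PER'),
    ('Gia Bảo', 'I-PER'), ('đã', 'O'), ('đến', 'O'), ('thăm', 'O'),
    ('Việt Nam', 'B-LOC'), ('vào', 'O'), ('năm', 'O'), ('2004', 'O'),
    ('.', 'O'),
]
dep_output = [
    ('Saya', 2, 'nsubj'), ('pergi', 0, 'root'),
    ('ke', 4, 'case'), ('sekolah', 2, 'obl'),
]

print(group_entities(ner_output))
# [('Trung Quốc', 'LOC'), ('Ôn Gia Bảo', 'PER'), ('Việt Nam', 'LOC')]

print(resolve_heads(dep_output))
# [('Saya', 'pergi', 'nsubj'), ('pergi', 'ROOT', 'root'),
#  ('ke', 'sekolah', 'case'), ('sekolah', 'pergi', 'obl')]
```

For languages written without spaces between words, such as Thai, the ``' '.join`` call would likely need to be replaced with ``''.join``.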