Packages for Core NLP

This section lists packages (mostly Python) that are useful for core NLP tasks in Southeast Asian languages.

🌏 Multilingual Packages

Name

Organization

Year

id

ms

ta

th

vi

Trankit

University of Oregon

2021

✅

✅

✅

Stanza

Stanford University

2020

✅

✅

✅

UDify

Charles University

2019

✅

✅

✅

✅

Polyglot

Stony Brook University

2013

✅

✅

✅

✅

✅

Legend

id: Indonesian | ms: Malay | ta: Tamil | th: Thai | vi: Vietnamese

Note

Trankit and Stanza are both trained on the Universal Dependencies v2.5 datasets and therefore cover the same languages and tasks. Trankit reports better overall performance than Stanza; see the model performance figures on each project's website.

Polyglot does not cover every task for every language shown above. Check its documentation to see which languages are supported for each task.
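As a rough illustration of how one of the multilingual packages is typically used, here is a minimal sketch with Stanza. It assumes Stanza is installed, that the Vietnamese (`vi`) models can be downloaded, and that only tokenization and part-of-speech tagging are wanted. The function names `format_tagged` and `tag_with_stanza` are made up for this example and are not part of Stanza.

```python
from typing import Iterable, List, Tuple


def format_tagged(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (word, tag) pairs as 'word/TAG' items separated by spaces."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def tag_with_stanza(text: str, lang: str = "vi") -> List[Tuple[str, str]]:
    """Tokenize and POS-tag `text`, returning (word, UPOS) pairs.

    Assumption: `lang` is a language code that Stanza provides models for.
    """
    import stanza  # imported lazily so format_tagged works without Stanza

    stanza.download(lang)  # fetches the default models for `lang`
    nlp = stanza.Pipeline(lang=lang, processors="tokenize,pos")
    doc = nlp(text)
    return [
        (word.text, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling `format_tagged(tag_with_stanza("Tôi là sinh viên."))` would print one `word/TAG` item per token. The exact tags depend on the downloaded model, so no output is shown here.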

Monolingual Packages

Language

Name

Year

tk

ss

pos

ner

cp

dp

Indonesian/Malay

Malaya

2018

✅

✅

✅

✅

✅

✅

Thai

PyThaiNLP

2016

✅

✅

✅

✅

spaCy-Thai

2020

✅

✅

✅

Vietnamese

PhoNLP

2021

✅

✅

✅

UnderTheSea

2017

✅

✅

✅

✅

✅

VnCoreNLP

2018

✅

✅

✅

✅

✅

PyVI

2020

✅

✅

Legend

tk: Tokenization | ss: Sentence Segmentation | pos: Part-of-speech Tagging

ner: Named Entity Recognition | cp: Constituency Parsing | dp: Dependency Parsing
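For the monolingual packages, the tokenization (tk) task might look like the following sketch using PyThaiNLP. It assumes PyThaiNLP is installed and uses its default tokenization engine. The helper names `drop_whitespace_tokens` and `tokenize_thai` are hypothetical and exist only for this example.

```python
from typing import List


def drop_whitespace_tokens(tokens: List[str]) -> List[str]:
    """Remove tokens that are empty or consist only of whitespace."""
    return [token for token in tokens if token.strip()]


def tokenize_thai(text: str) -> List[str]:
    """Split Thai text into word tokens, discarding whitespace tokens."""
    # Imported lazily so the helper above can be used without PyThaiNLP.
    from pythainlp.tokenize import word_tokenize

    return drop_whitespace_tokens(word_tokenize(text))
```

Thai is written without spaces between words, so the tokenizer decides the word boundaries. Any spaces in the input come back as separate tokens, which the helper filters out.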