Packages for CoreNLP
In this section, we have consolidated (mostly Python) packages that are useful for core NLP tasks in Southeast Asian languages.
Multilingual Packages
| Name | Organization | Year | id | ms | ta | th | vi |
|---|---|---|---|---|---|---|---|
| | University of Oregon | 2021 | ✓ | ✓ | ✓ | | |
| | Stanford University | 2020 | ✓ | ✓ | ✓ | | |
| | Charles University | 2019 | ✓ | ✓ | ✓ | ✓ | |
| | Stony Brook University | 2013 | ✓ | ✓ | ✓ | ✓ | ✓ |
Legend
id: Indonesian | ms: Malay | ta: Tamil | th: Thai | vi: Vietnamese
Note
Trankit and Stanza are both trained on the Universal Dependencies v2.5 treebanks and therefore cover the same languages and tasks. Trankit reports better overall performance than Stanza; see the model performance figures on each project's website.
Polyglot does not cover every task for every language shown above. Check its documentation to see which languages are supported for each task.
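As a rough illustration of how one of these multilingual packages might be used, the sketch below runs Stanza's default pipeline on Indonesian text (language code `id`, as in the legend) and prints one line per word. The helper names `annotate` and `to_tsv`, the `Row` alias, and the sample sentence are assumptions made for this example and are not part of Stanza. Which processors are loaded depends on the models Stanza ships for the language, so treat the output fields as indicative only.

```python
from typing import List, Tuple

# (word id, token text, UPOS tag, head id, dependency relation)
Row = Tuple[int, str, str, int, str]


def annotate(text: str, lang: str = "id") -> List[Row]:
    """Run Stanza's default pipeline for `lang` and return one row per word.

    Hypothetical helper: assumes the default models for `lang` include
    POS tagging and dependency parsing.
    """
    import stanza  # third-party (pip install stanza); imported lazily

    stanza.download(lang)        # fetches the default models on first use
    nlp = stanza.Pipeline(lang)  # loads the default processors for the language
    doc = nlp(text)
    return [
        (word.id, word.text, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def to_tsv(rows: List[Row]) -> str:
    """Render rows as tab-separated lines (hypothetical formatting helper)."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


# Example call (needs Stanza installed and a one-time model download):
#   print(to_tsv(annotate("Saya tinggal di Jakarta.")))
```

Word ids and head ids are counted per sentence, so a head id of 0 marks the root of its sentence. Swapping `lang` for another supported code, such as `vi`, should work the same way provided Stanza offers models for that language.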
Monolingual Packages
| Language | Name | Year | tk | ss | pos | ner | cp | dp |
|---|---|---|---|---|---|---|---|---|
| Indonesian/Malay | | 2018 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Thai | | 2016 | ✓ | ✓ | ✓ | ✓ | | |
| | | 2020 | ✓ | ✓ | ✓ | | | |
| Vietnamese | | 2021 | ✓ | ✓ | ✓ | | | |
| | | 2017 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| | | 2018 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| | | 2020 | ✓ | ✓ | | | | |
Legend
tk: Tokenization | ss: Sentence Segmentation | pos: Part-of-speech Tagging
ner: Named Entity Recognition | cp: Constituency Parsing | dp: Dependency Parsing