Packages for CoreNLP

This section consolidates packages (mostly Python) that are useful for core NLP tasks in Southeast Asian languages.

🌏 Multilingual Packages

Name	Organization	Year	id	ms	ta	th	vi
Trankit	University of Oregon	2021	✅		✅		✅
Stanza	Stanford University	2020	✅		✅		✅
UDify	Charles University	2019	✅		✅	✅	✅
Polyglot	Stony Brook University	2013	✅	✅	✅	✅	✅

Legend

id: Indonesian | ms: Malay | ta: Tamil | th: Thai | vi: Vietnamese
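
To check quickly which multilingual packages cover a given set of languages, the table above can be encoded as a plain mapping. This is only a minimal sketch: `MULTILINGUAL_COVERAGE` and `packages_covering` are hypothetical names introduced here for illustration, not part of any listed package, and the data simply mirrors the table.

```python
# Language coverage transcribed from the multilingual table above.
# Both names below are hypothetical helpers, not part of any listed package.
MULTILINGUAL_COVERAGE = {
    "Trankit": {"id", "ta", "vi"},
    "Stanza": {"id", "ta", "vi"},
    "UDify": {"id", "ta", "th", "vi"},
    "Polyglot": {"id", "ms", "ta", "th", "vi"},
}


def packages_covering(*languages):
    """Return the packages whose coverage includes every requested language code."""
    wanted = set(languages)
    return [
        name
        for name, covered in MULTILINGUAL_COVERAGE.items()
        if wanted <= covered  # subset test: all requested codes are covered
    ]


print(packages_covering("th", "vi"))  # ['UDify', 'Polyglot']
```

A lookup like this only reflects language coverage; which tasks each package supports per language still has to be checked in the package's own documentation.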

Note

Trankit and Stanza are both trained on the Universal Dependencies v2.5 datasets and therefore cover the same languages and tasks. According to the model performance figures published on their respective websites, Trankit performs better than Stanza overall.

Polyglot does not cover every task for every language shown above. Please check its documentation to see which languages are supported for each task.

Monolingual Packages

Language	Name	Year	tk	ss	pos	ner	cp	dp
Indonesian/Malay	Malaya	2018	✅	✅	✅	✅	✅	✅
Thai	PyThaiNLP	2016	✅	✅	✅	✅
Thai	spaCy-Thai	2020	✅		✅			✅
Vietnamese	PhoNLP	2021			✅	✅		✅
Vietnamese	UnderTheSea	2017	✅	✅	✅	✅		✅
Vietnamese	VnCoreNLP	2018	✅	✅	✅	✅		✅
Vietnamese	PyVI	2020	✅		✅

Legend

tk: Tokenization | ss: Sentence Segmentation | pos: Part-of-speech Tagging

ner: Named Entity Recognition | cp: Constituency Parsing | dp: Dependency Parsing
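
When choosing a monolingual package, the usual question is which packages for a language support every task a pipeline needs. The sketch below encodes the rows of the monolingual table and filters them by language and required task codes. `Package`, `MONOLINGUAL_PACKAGES`, and `find_packages` are hypothetical names used for illustration only, and the task sets are transcribed from the table, so they should be re-checked against each package's current documentation.

```python
from typing import FrozenSet, List, NamedTuple


class Package(NamedTuple):
    """One row of the monolingual table (hypothetical record type)."""
    language: str
    name: str
    year: int
    tasks: FrozenSet[str]


# Rows transcribed from the monolingual table; task codes follow the legend.
MONOLINGUAL_PACKAGES = [
    Package("Indonesian/Malay", "Malaya", 2018,
            frozenset({"tk", "ss", "pos", "ner", "cp", "dp"})),
    Package("Thai", "PyThaiNLP", 2016, frozenset({"tk", "ss", "pos", "ner"})),
    Package("Thai", "spaCy-Thai", 2020, frozenset({"tk", "pos", "dp"})),
    Package("Vietnamese", "PhoNLP", 2021, frozenset({"pos", "ner", "dp"})),
    Package("Vietnamese", "UnderTheSea", 2017,
            frozenset({"tk", "ss", "pos", "ner", "dp"})),
    Package("Vietnamese", "VnCoreNLP", 2018,
            frozenset({"tk", "ss", "pos", "ner", "dp"})),
    Package("Vietnamese", "PyVI", 2020, frozenset({"tk", "pos"})),
]


def find_packages(language: str, required_tasks) -> List[str]:
    """Return names of packages for `language` that support all required tasks."""
    required = set(required_tasks)
    return [
        pkg.name
        for pkg in MONOLINGUAL_PACKAGES
        # "Indonesian/Malay" matches either "Indonesian" or "Malay".
        if language in pkg.language.split("/") and required <= pkg.tasks
    ]


print(find_packages("Vietnamese", ["ner", "dp"]))
# ['PhoNLP', 'UnderTheSea', 'VnCoreNLP']
```

An empty result, such as `find_packages("Vietnamese", ["cp"])`, means that no package listed in the table is marked as supporting that task for the language.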