Datasets for CoreNLP

This section lists the datasets available for CoreNLP tasks in ASEAN languages. They are grouped by task, with links to the relevant repositories where available.

Part-of-speech Tagging

Language	Dataset	Tagset	Classes	Sentences	Tokens	Domain
Indonesian	POSP	XPOS	26	8400		News
	BaPOS	XPOS	23	10029		News
	IndonesianPOS	XPOS	29	210740	3553580	News
	UD-ID-GSD	UPOS	16	5593	120581	News, Blog
	UD-ID-CSUI	UPOS	17	1030	28117	News
	UD-ID-PUD	UPOS	17	1000	19032	News, Wiki
Thai	LST20	XPOS	16	78931	3163034	News
	UD-TH-PUD	UPOS	15	1000	22322	News, Wiki
Vietnamese	UD-VI-VTB	UPOS	14	3000	43754	News
	VLSP 2013	XPOS		30000		News ++
Tamil	UD_Tamil-TTB	UPOS	14	600	8635	News
	UD_Tamil-MWTT	UPOS	13	534	2536	Grammar Book
Tagalog	UD_Tagalog-TRG	UPOS	13	128	734	Grammar Book
	UD_Tagalog-Ugnayan	UPOS	14	94	1011	Educational Text
Burmese	Asian Language Treebank	NOVA	7	20106		News
Khmer	Asian Language Treebank	NOVA	7	20106		News
Lao	None
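
The Classes, Sentences and Tokens columns for the UD treebanks above can be recomputed from the treebank files. The following is a minimal sketch, assuming the file follows the standard CoNLL-U layout (ten tab-separated columns, `#` comment lines, a blank line between sentences). The function name `conllu_stats` and the inline `SAMPLE` are hypothetical: the sample is a hand-written toy and is not taken from any dataset listed here. Counts can differ from the table depending on how multiword tokens and empty nodes are treated.

```python
def conllu_stats(text):
    """Count sentences, syntactic words, and distinct UPOS tags in CoNLL-U text."""
    sentences = 0
    tokens = 0
    upos_tags = set()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            # Skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        upos_tags.add(cols[3])  # column 4 holds the UPOS tag
    return sentences, tokens, upos_tags


# Hand-written toy input, for illustration only.
SAMPLE = (
    "# text = Saya makan nasi.\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Dia tidur.\n"
    "1\tDia\tdia\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\ttidur\ttidur\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences, tokens, tags = conllu_stats(SAMPLE)
print(sentences, tokens, sorted(tags))
```

For the XPOS column of a treebank, the same loop could read `cols[4]` instead of `cols[3]`. Datasets that are not distributed in CoNLL-U would need their own reader.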

Named Entity Recognition

Language	Dataset	Classes	Format	Sentences	Tokens	Domain
Indonesian	NERGrit	3	BIO	2000	64000
	NERP	5	BIO	8400		News
Thai	LST20	10	BIOE	78931	3164002	News
	ThaiNER	13	BIO	6456
Vietnamese	VLSP 2016	3		19692		News
Malay	Malaya Entities	8	None			News
	Malaya OntoNotes5	20	None			News, Blogs, Speech
Tamil	FIRE 2013
	FIRE 2014			7160	100264	Wiki, Blogs, Forums
	FIRE 2015
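
The Format column above refers to the tagging scheme: in BIO, `B-` marks the first token of an entity, `I-` a continuation and `O` a token outside any entity; BIOE additionally marks the last token of a multi-token entity with `E-`. The following is a minimal sketch of turning either scheme into entity spans. The function name `tags_to_spans`, the `sep` parameter and the example tags are hypothetical, and the `-` separator is an assumption, since individual corpora may join prefix and label differently.

```python
def tags_to_spans(tags, sep="-"):
    """Convert BIO or BIOE tags into (label, start, end) spans, end exclusive."""
    spans = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition(sep)
        continues = prefix in ("I", "E") and name == label
        if not continues:
            # Close any open span before handling the current tag.
            if label is not None:
                spans.append((label, start, i))
            if prefix in ("B", "I", "E"):
                # Open a new span; a stray I-/E- is treated leniently as a start.
                start, label = i, name
            else:
                start, label = None, None
        if prefix == "E" and label is not None:
            # In BIOE, an E- tag closes the span on this token.
            spans.append((label, start, i + 1))
            start, label = None, None
    if label is not None:
        # Close a span that runs to the end of the sequence.
        spans.append((label, start, len(tags)))
    return spans


# Made-up tag sequences, for illustration only.
print(tags_to_spans(["B-PER", "I-PER", "O", "B-LOC"]))
print(tags_to_spans(["B-ORG", "I-ORG", "E-ORG", "O", "B-PER", "E-PER"]))
```

Representing entities as token-offset spans makes datasets with different tagging schemes directly comparable, which helps when combining corpora from the table.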

Constituency Parsing

Language	Dataset	Sentences	Tokens	Domain
Indonesian	Kethu	1030	28117	News
	IDN Treebank	1030	30953	News
	Cendana	552	5850	Chat
	JATI	1253		Dictionary
	TUFS Indonesian Constituency Treebank	1385		Textbook
	ICON	10000	182115	News, Wiki
Vietnamese	Vietnamese Treebank	10200	220000	News
	NIIVTB	20588		News
Burmese	Asian Language Treebank	20106		News

Note

We are not aware of any Thai constituency treebanks. Because the Thai language is more amenable to analysis via dependency grammar, only dependency treebanks are available at the moment. If shallow parsing (chunking) is of interest, it is available in many of the open-source Thai datasets (e.g. LST20, ThaiNER).

Dependency Parsing

Language	Dataset	Sentences	Tokens	Domain
Indonesian	UD-ID-GSD	5593	120581	News, Blog
	UD-ID-CSUI	1030	28117	News
	UD-ID-PUD	1000	19032	News, Wiki
Thai	UD-TH-PUD	1000	22322	News, Wiki
Vietnamese	UD-VI-VTB	3000	43754	News
	VLSP 2020	10000
Tamil	UD_Tamil-TTB	600	8635	News
	UD_Tamil-MWTT	534	2536	Grammar Book
Tagalog	UD_Tagalog-TRG	128	734	Grammar Book
	UD_Tagalog-Ugnayan	94	1011	Educational Text
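
The UD treebanks in this table encode each dependency as a HEAD index and a DEPREL label on the dependent's row. The following is a minimal sketch of reading those arcs for a single sentence, assuming the standard CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and a fully annotated HEAD column. The function name `dependency_arcs` and the toy sentence are hypothetical and not taken from any dataset listed here.

```python
def dependency_arcs(sentence_block):
    """Return (dependent_form, head_form, deprel) triples for one CoNLL-U sentence."""
    rows = []
    for line in sentence_block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            # Skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        rows.append(cols)
    # Map token IDs to surface forms; ID 0 is the artificial root.
    forms = {int(cols[0]): cols[1] for cols in rows}
    forms[0] = "ROOT"
    # Column 7 is HEAD and column 8 is DEPREL (1-based), assumed to be filled in.
    return [(cols[1], forms[int(cols[6])], cols[7]) for cols in rows]


# Hand-written toy sentence, for illustration only.
SENTENCE = (
    "# text = Saya makan nasi.\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for dependent, head, relation in dependency_arcs(SENTENCE):
    print(f"{relation}({head}, {dependent})")
```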

Coreference Resolution

Language	Dataset	Year	Size	Tokens	Mentions	Links	Sources
Indonesian	COIN	2022	2500P	730187	74222	62262	News, Wiki
	Artari et al.	2021	201D	150877	16460	10195	Wiki
	Suherik and Purwarianti	2017	1030S	24992		2304	News

Note

In the Size column: D = Document, P = Paragraph, S = Sentence.
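
Because the Size values combine a count with a unit suffix, they cannot be compared as plain numbers. The following is a minimal sketch of splitting such a value into its count and unit; the function name `parse_size` and the `UNITS` mapping are hypothetical helpers written for this page's notation only.

```python
import re

# Unit suffixes as defined in the note above.
UNITS = {"D": "documents", "P": "paragraphs", "S": "sentences"}


def parse_size(value):
    """Split a Size value such as '2500P' into (count, unit name)."""
    match = re.fullmatch(r"(\d+)([DPS])", value.strip())
    if match is None:
        raise ValueError(f"unrecognised size value: {value!r}")
    return int(match.group(1)), UNITS[match.group(2)]


for size in ("2500P", "201D", "1030S"):
    print(size, "->", parse_size(size))
```

Sizes with different units are still not directly comparable, so the Tokens column is the safer basis for comparing the corpora.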