Datasets for CoreNLP

This section lists the datasets available for CoreNLP tasks in ASEAN languages. They are grouped by task, with links to the relevant repositories where available.

Part-of-speech Tagging

| Language | Dataset | POS | Classes | Sentences | Tokens | Domain |
|---|---|---|---|---|---|---|
| Indonesian | POSP | XPOS | 26 | 8400 | | News |
| | BaPOS | XPOS | 23 | 10029 | | News |
| | IndonesianPOS | XPOS | 29 | 210740 | 3553580 | News |
| | UD-ID-GSD | UPOS | 16 | 5593 | 120581 | News, Blog |
| | UD-ID-CSUI | UPOS | 17 | 1030 | 28117 | News |
| | UD-ID-PUD | UPOS | 17 | 1000 | 19032 | News, Wiki |
| Thai | LST20 | XPOS | 16 | 78931 | 3163034 | News |
| | UD-TH-PUD | UPOS | 15 | 1000 | 22322 | News, Wiki |
| Vietnamese | UD-VI-VTB | UPOS | 14 | 3000 | 43754 | News |
| | VLSP 2013 | XPOS | | 30000 | | News ++ |
| Tamil | UD_Tamil-TTB | UPOS | 14 | 600 | 8635 | News |
| | UD_Tamil-MWTT | UPOS | 13 | 534 | 2536 | Grammar Book |
| Tagalog | UD_Tagalog-TRG | UPOS | 13 | 128 | 734 | Grammar Book |
| | UD_Tagalog-Ugnayan | UPOS | 14 | 94 | 1011 | Educational Text |
| Burmese | Asian Language Treebank | NOVA | 7 | 20106 | | News |
| Khmer | Asian Language Treebank | NOVA | 7 | 20106 | | News |
| Lao | None | | | | | |
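The UD treebanks listed above are distributed in CoNLL-U format, so their Classes, Sentences and Tokens figures can be approximated directly from the files. The following is a minimal sketch, assuming tab-separated 10-column CoNLL-U input; the function name `conllu_stats` and the `SAMPLE` text are hypothetical and only for illustration. The published counts may follow different conventions (for example, in how multiword tokens are counted), so small differences are possible.

```python
from collections import Counter

# Hypothetical two-sentence sample in CoNLL-U format (illustration only).
SAMPLE = (
    "# text = Saya makan nasi.\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Dia tidur.\n"
    "1\tDia\tdia\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\ttidur\ttidur\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def conllu_stats(text):
    """Return (sentence count, token count, Counter of UPOS tags)."""
    sentences = 0
    tokens = 0
    upos = Counter()
    in_sentence = False
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        upos[cols[3]] += 1  # column 4 holds the UPOS tag
    return sentences, tokens, upos


if __name__ == "__main__":
    n_sent, n_tok, tags = conllu_stats(SAMPLE)
    print(n_sent, n_tok, len(tags))
```

To apply this to a real treebank, read the `.conllu` file as UTF-8 text and pass its contents to `conllu_stats`.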

Named Entity Recognition

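| Language | Dataset | Classes | Format | Sentences | Tokens | Domain |
|---|---|---|---|---|---|---|
| Indonesian | NERGrit | 3 | BIO | 2000 | 64000 | |
| | NERP | 5 | BIO | 8400 | | News |
| Thai | LST20 | 10 | BIOE | 78931 | 3164002 | News |
| | ThaiNER | 13 | BIO | 6456 | | |
| Vietnamese | VLSP 2016 | 3 | | 19692 | | News |
| Malay | Malaya Entities | 8 | None | | | News |
| | Malaya OntoNotes5 | 20 | None | | | News, Blogs, Speech |
| Tamil | FIRE 2013 | | | | | |
| | FIRE 2014 | | | 7160 | 100264 | Wiki, Blogs, Forums |
| | FIRE 2015 | | | | | |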
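The Format column refers to token-level tagging schemes: BIO marks the beginning and inside of each entity, and BIOE additionally marks its end. As a rough illustration, the sketch below converts either scheme into entity spans. It assumes labels of the form `B-PER`, `I-PER`, `E-PER` and `O`; the function name `decode_spans`, the `sep` parameter and the example sentence are hypothetical, and individual datasets may use a different separator or label inventory.

```python
def decode_spans(tags, sep="-"):
    """Convert BIO or BIOE tags into (label, start, end) spans, end exclusive."""
    spans = []
    start = None
    label = None

    def close(end):
        # Emit the currently open span, if any, and reset the state.
        nonlocal start, label
        if start is not None:
            spans.append((label, start, end))
        start, label = None, None

    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition(sep)
        if prefix == "O" or not name:
            close(i)  # outside any entity
        elif prefix == "B" or start is None or name != label:
            # A new entity starts here; malformed sequences that begin
            # with I- or E- are treated leniently as a fresh span.
            close(i)
            start, label = i, name
        if prefix == "E" and start is not None:
            close(i + 1)  # BIOE: the E- tag closes the span explicitly
    close(len(tags))  # flush a span that runs to the end of the sentence
    return spans


if __name__ == "__main__":
    tokens = ["Budi", "Santoso", "tiba", "di", "Jakarta"]
    bio = ["B-PER", "I-PER", "O", "O", "B-LOC"]
    for label, start, end in decode_spans(bio):
        print(label, tokens[start:end])
```

Spans are returned as token offsets rather than joined strings, because languages such as Thai do not separate tokens with spaces.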

Constituency Parsing

| Language | Dataset | Sentences | Tokens | Domain |
|---|---|---|---|---|
| Indonesian | Kethu | 1030 | 28117 | News |
| | IDN Treebank | 1030 | 30953 | News |
| | Cendana | 552 | 5850 | Chat |
| | JATI | 1253 | | Dictionary |
| | TUFS Indonesian Constituency Treebank | 1385 | | Textbook |
| | ICON | 10000 | 182115 | News, Wiki |
| Vietnamese | Vietnamese Treebank | 10200 | 220000 | News |
| | NIIVTB | 20588 | | News |
| Burmese | Asian Language Treebank | 20106 | | News |

Note

There are no Thai constituency treebanks (that we are aware of). As the Thai language is more amenable to analysis via dependency grammar, only dependency treebanks are available at the moment. Shallow parsing/chunking is available in many of the open-source Thai datasets if that is of interest (e.g. LST20, ThaiNER).

Dependency Parsing

| Language | Dataset | Sentences | Tokens | Domain |
|---|---|---|---|---|
| Indonesian | UD-ID-GSD | 5593 | 120581 | News, Blog |
| | UD-ID-CSUI | 1030 | 28117 | News |
| | UD-ID-PUD | 1000 | 19032 | News, Wiki |
| Thai | UD-TH-PUD | 1000 | 22322 | News, Wiki |
| Vietnamese | UD-VI-VTB | 3000 | 43754 | News |
| | VLSP 2020 | 10000 | | |
| Tamil | UD_Tamil-TTB | 600 | 8635 | News |
| | UD_Tamil-MWTT | 534 | 2536 | Grammar Book |
| Tagalog | UD_Tagalog-TRG | 128 | 734 | Grammar Book |
| | UD_Tagalog-Ugnayan | 94 | 1011 | Educational Text |
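In the CoNLL-U files used by the UD treebanks, each token line stores the index of its syntactic head in column 7 (HEAD) and the relation label in column 8 (DEPREL). The sketch below shows one way to turn a single sentence into (head, relation, dependent) triples. The function name `read_arcs` and the sample sentence are hypothetical, and the code assumes basic dependencies only, ignoring the enhanced graph in column 9.

```python
# Hypothetical one-sentence sample in CoNLL-U format (illustration only).
SENTENCE = (
    "# text = Saya makan nasi\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)


def read_arcs(sentence_block):
    """Return (head_form, deprel, dependent_form) triples for one sentence."""
    rows = []
    for line in sentence_block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword ranges, empty nodes, unannotated heads
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))

    # Map token index to surface form; index 0 is the artificial root.
    forms = {idx: form for idx, form, _, _ in rows}
    forms[0] = "ROOT"
    return [(forms.get(head, "?"), rel, form) for _, form, head, rel in rows]


if __name__ == "__main__":
    for head, rel, dep in read_arcs(SENTENCE):
        print(f"{head} --{rel}--> {dep}")
```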

Coreference Resolution

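| Language | Dataset | Year | Size | Tokens | Mentions | Links | Sources |
|---|---|---|---|---|---|---|---|
| Indonesian | COIN | 2022 | 2500P | 730187 | 74222 | 62262 | News, Wiki |
| | Artari et al. | 2021 | 201D | 150877 | 16460 | 10195 | Wiki |
| | Suherik and Purwarianti | 2017 | 1030S | 24992 | 2304 | | News |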

Note

D = Document // P = Paragraph // S = Sentence
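When the Size column is processed programmatically, the unit suffix has to be separated from the count. A minimal sketch, assuming every value consists of digits followed by exactly one of the three suffixes defined in the note; the names `UNITS` and `parse_size` are hypothetical.

```python
import re

# Unit suffixes as defined in the note above.
UNITS = {"D": "documents", "P": "paragraphs", "S": "sentences"}


def parse_size(code):
    """Split a size code such as '2500P' into (count, unit name)."""
    match = re.fullmatch(r"(\d+)([DPS])", code.strip())
    if match is None:
        raise ValueError(f"unrecognised size code: {code!r}")
    return int(match.group(1)), UNITS[match.group(2)]


if __name__ == "__main__":
    for code in ["2500P", "201D", "1030S"]:
        print(code, parse_size(code))
```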