What is SEACoreNLP?

SEACoreNLP is an initiative by NLPHub of AI Singapore that aims to provide a one-stop solution for Natural Language Processing (NLP) in Southeast Asia.

The raison d’être of SEACoreNLP lies in the fact that many of the languages used in Southeast Asia do not have adequate NLP resources, be it open-source datasets, models or tools. With demand for such capabilities growing in the industry and no one to supply them, SEACoreNLP hopes to spearhead projects and gather like-minded entities across the region to build a livelier NLP ecosystem for Southeast Asia.

As the name suggests, SEACoreNLP focuses on “core” NLP tasks, such as part-of-speech tagging, syntactic parsing or semantic role labeling, as opposed to higher-level tasks such as machine translation or question answering. This is because we believe that features engineered through such core tasks will be paramount in boosting the performance of downstream models for higher-level tasks, given that the languages of the region are low-resource languages and cannot (as of now) rely on training huge language models with heaps of data.

Our Goals

We hope to accomplish the following:

  • Provide an open-source Python library for core NLP tasks in the official ASEAN languages

  • Provide a one-stop information hub for progress in NLP in Southeast Asia

  • Build high-quality benchmark datasets for core NLP tasks in the relevant languages

  • Improve NLP capabilities for regional languages with core NLP, state-of-the-art models and multilingual pre-trained models

Core NLP Tasks

The core NLP tasks that we aim to cover are as follows:

  • Word Tokenization

  • Sentence Segmentation

  • Lemmatization

  • Part-of-speech Tagging

  • Named Entity Recognition

  • Constituency Parsing

  • Dependency Parsing

  • Coreference Resolution

  • Semantic Role Labeling
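To make the first two tasks in this list concrete, here is a minimal sketch in plain Python of rule-based sentence segmentation and word tokenization. It is not SEACoreNLP's API: the function names `split_sentences` and `tokenize_words` are hypothetical, and the regular-expression rules assume a language that separates words with spaces, such as Indonesian or Malay.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize_words(sentence: str) -> list[str]:
    """Naive word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Illustrative Indonesian input (an assumption, not taken from any SEACoreNLP dataset).
text = "Saya suka makan nasi goreng. Apa kabar?"
for sentence in split_sentences(text):
    print(tokenize_words(sentence))
```

Rules like these break down for a language such as Thai, which does not put spaces between words, so tokenization there typically needs a dedicated tool or a trained model rather than a regular expression.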

Demo

The SEACoreNLP Demo showcases the core NLP tasks listed above.

SEACoreNLP Library

In our SEACoreNLP library, we hope to provide users with an easy way to train, evaluate and perform inference with models for core NLP tasks in ASEAN languages.

Our library is a light wrapper over the AllenNLP library, which is itself built on top of Hugging Face Transformers and PyTorch. We use AllenNLP as a base for development because we believe its framework allows for easy and quick experimentation with different architectures. Furthermore, it already supports all the core NLP tasks that we are aiming to cover.
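As a rough illustration of what a "light wrapper" can look like, the sketch below shows a thin class that selects a backend by language code and exposes a single `predict` method. Every name here (`Tokenizer`, `from_language`, `whitespace_backend`, the `"id"` language code) is hypothetical and chosen for illustration only; none of it is taken from SEACoreNLP or AllenNLP, and the backend is a plain function standing in for a trained model.

```python
from typing import Callable, Dict, List

# Hypothetical backend type: any callable from raw text to a list of tokens.
# A real wrapper would delegate to an underlying model here instead.
Backend = Callable[[str], List[str]]


def whitespace_backend(text: str) -> List[str]:
    """Stand-in backend that splits on whitespace."""
    return text.split()


class Tokenizer:
    """Thin wrapper: looks up a backend by language code and forwards calls to it."""

    # Registry of available backends, keyed by an assumed language code.
    _backends: Dict[str, Backend] = {"id": whitespace_backend}

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    @classmethod
    def from_language(cls, lang: str) -> "Tokenizer":
        if lang not in cls._backends:
            raise ValueError(f"No backend registered for language {lang!r}")
        return cls(cls._backends[lang])

    def predict(self, text: str) -> List[str]:
        # The wrapper adds no logic of its own; it only forwards to the backend.
        return self._backend(text)


tokenizer = Tokenizer.from_language("id")
print(tokenizer.predict("Saya suka makan nasi goreng"))
```

The point of the design is that callers depend only on the wrapper's small surface, so the backend behind a given language can be swapped without changing user code.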

Acknowledgements

We would like to thank the creators of the third-party libraries that we use in our package for their great work in furthering NLP in SEA.

  • malaya: Husein Zolkepli

  • pythainlp: Wannaphong Phatthiyaphaibun and his team

  • attacut: Pattarawat Chormai, Ponrawee Prasertsom and Prof. Attapol Rutherford

  • spacy-thai: Prof. Koichi Yasuoka

  • underthesea: Vu Anh

License

The SEACoreNLP package is released under the GPLv3 license.

Contact

For any collaboration or enquiries, please contact us at seacorenlp@aisingapore.org.