SEACoreNLP (NLP for Southeast Asian Languages)

  • weiqi

    Organizer
    July 5, 2021 at 2:38 pm

    Welcome to the discussion thread for SEACoreNLP. We welcome all discussions related to Natural Language Processing (NLP) for Southeast Asian (SEA) languages.

    What is SEACoreNLP?

    SEACoreNLP aims to be the central hub for Natural Language Processing (NLP) in Southeast Asia. The raison d’être of SEACoreNLP lies in the fact that many of the languages used in Southeast Asia do not have adequate NLP resources, be it open-source datasets, models or tools. With the growing demand for such capabilities in the industry but no one to supply them, SEACoreNLP hopes to lead the way in spearheading projects and gathering like-minded entities across the region to build a livelier NLP ecosystem for Southeast Asia.

  • williamtjhi

    Member
    July 6, 2021 at 1:35 pm

    The main languages of Southeast Asia are Thai, Vietnamese, Malay, Indonesian, Lao, Khmer, Burmese, Tagalog and Tetum. Tamil is also used in Singapore and Malaysia, as are English and Chinese. The latter two, however, are considered high-resource languages and are therefore not a strong focus of the SEACoreNLP project.

    • williamtjhi

      Member
      July 29, 2021 at 10:54 am

      We are trying to consolidate existing public NLP resources and tools for Southeast Asian languages. Below is what we have found so far. The list is certainly not exhaustive, so please reply if you know of any Southeast Asian NLP resources that we have missed. Thank you.

      Vietnamese

      PhoNLP

      PhoBERT

      Trankit

      VnCoreNLP

      Stanza

      NNVLP

      NIIVTB

      VLSP

      Thai

      WangchanBERTa

      PyThaiNLP

      Thai2Fit

      Thai Treebank

      Thaikeras/bert

      Indonesian and Malay

      Malaya

      Stanza

      Trankit

      Stanford Parser

      IndoNLU

      Burmese

      myPOS

      Tagalog

      UDify

      FilipinoSPOST

      Tagalog Transformers

      Filipino ULMFit

      Polyglot

      Tamil

      Stanza

      Trankit

      Khmer

      khPOS

      Lao

      LaoNLP

  • williamtjhi

    Member
    August 5, 2021 at 11:45 am

    Just learned that Prof. Joty has two multilingual papers at the ongoing ACL conference: one on cross-lingual NER (https://aclanthology.org/2021.acl-long.453/) and another on improving zero-shot cross-lingual transfer (https://aclanthology.org/2021.acl-long.154/). Work like this can help with NLP for SEA languages despite their low-resource situation.

  • weiqi

    Organizer
    September 9, 2021 at 12:47 pm

    We are pleased to share that we have just released the beta version of our SEACoreNLP Python package. This early release covers the following languages and tasks:

    Languages

    • Indonesian/Malay
    • Vietnamese
    • Thai

    Tasks

    • Tokenization
    • Sentence Segmentation
    • Part-of-speech Tagging
    • Named Entity Recognition
    • Constituency Parsing
    • Dependency Parsing

    You can install the package with the shell command "pip install seacorenlp". The entry on PyPI can be found at https://pypi.org/project/seacorenlp/
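
    After installing, one quick way to confirm that the package is visible to your Python environment is sketched below. It uses only the Python standard library and does not call any SEACoreNLP API; the helper name installed_version is hypothetical and chosen here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "seacorenlp" is the distribution name used in the pip command above.
    found = installed_version("seacorenlp")
    if found is None:
        print("seacorenlp is not installed; run: pip install seacorenlp")
    else:
        print(f"seacorenlp {found} is installed")
```

    If the check reports that the package is missing, make sure pip and the Python interpreter you are running belong to the same environment.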

    The demo website for the package can also be found at https://seacorenlp.aisingapore.net. The link to the documentation for the package is on the same website.

    We are currently building open-source datasets, not just for the tasks mentioned above but also for other tasks such as Coreference Resolution and Semantic Role Labeling, which currently do not have any data for Southeast Asian languages. Once the datasets are ready, we will train and publish models and add them to our package and demo, so stay tuned for the next release!
