This project was completed as part of the Minor Project submission at Heritage Institute of Technology, under the mentorship of Prof. Sandipan Ganguly (HIT-K).
A library with pre-trained models for POS tagging, word embedding, named entity recognition, FastText, Bengali stopwords, Bengali corpus class recognition, etc.
pip install bnlp_toolkit
Or upgrade:
pip install -U bnlp_toolkit
Raw Text -> Tokenization -> POS Tagging
We first used the Natural Language Toolkit (NLTK) library to define and apply basic POS tagging on an English corpus.
Next, we took a small Bengali corpus and split its sentences into individual words using BasicTokenizer from BNLP, a rule-based approach. We then applied the same tokenizer to two larger Bengali corpora.
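As an illustration of what rule-based word tokenization means, here is a minimal sketch using only the Python standard library. It is not BNLP's BasicTokenizer; the function name `tokenize_words` and the specific rules (Bengali-block runs, ASCII alphanumeric runs, every other symbol as its own token) are assumptions made for this example.

```python
import re

# Bengali Unicode block plus zero-width (non-)joiners that occur inside words.
BENGALI = r"\u0980-\u09FF\u200c\u200d"

# Rule 1: a run of Bengali characters is one token.
# Rule 2: a run of ASCII letters or digits is one token.
# Rule 3: any other non-space character (e.g. the danda "।") is its own token.
TOKEN_PATTERN = re.compile(
    rf"[{BENGALI}]+|[A-Za-z0-9]+|[^\s{BENGALI}A-Za-z0-9]"
)


def tokenize_words(text):
    """Split text into word and punctuation tokens using the rules above."""
    return TOKEN_PATTERN.findall(text)


sample = "আমি বাংলায় গান গাই।"
print(tokenize_words(sample))
```

An explicit Unicode range is used instead of `\w` because Python's `\w` does not match Bengali dependent vowel signs, which would break words apart.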
Next, we used NLTKTokenizer from BNLP, also rule-based, to tokenize the small Bengali corpus in two phases: word tokenization, which splits the text into individual Bengali words, and sentence tokenization, which separates the text into individual sentences. We then applied the same steps to the two larger Bengali corpora.
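The sentence-tokenizing phase can be pictured with a small standard-library sketch. This is not the NLTKTokenizer implementation; `split_sentences` is a hypothetical helper that assumes a sentence ends at a danda (।), question mark, or exclamation mark followed by whitespace.

```python
import re

# Split at whitespace that directly follows a sentence-ending mark.
# The lookbehind keeps the punctuation attached to its sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    parts = SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]


sample = "আমি ভাত খাই। তুমি কি খাও? সে স্কুলে যায়।"
for sentence in split_sentences(sample):
    print(sentence)
```

A rule this simple will mis-split abbreviations or quoted speech, so real tokenizers add further rules on top of it.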
Next, we used SentencePieceTokenizer, an unsupervised learning approach, to tokenize the two Bengali corpora.
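To give a feel for how subword units can be learned from raw text without labels, the toy sketch below learns byte-pair encoding (BPE) merges, one of the model types SentencePiece supports. It is not SentencePiece's implementation and not necessarily the algorithm behind BNLP's pre-trained model; `learn_bpe` and the sample words are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words.

    Each word starts as a sequence of single code points; the most
    frequent adjacent pair is merged into one symbol at every step.
    """
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)

        # Rewrite every word with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["করি", "করে", "করব", "করি"], num_merges=2))
```

Because the sketch splits on code points, Bengali vowel signs begin as separate symbols and are only joined to their consonants if the corpus statistics favor that merge.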
Next, we used the POS function from BNLP with its pre-trained model, which follows a Conditional Random Field (CRF) based approach, to tag the words of a small Bengali corpus and categorize them into parts of speech.
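A CRF tagger scores each tag using features of the current word and its neighbours. The sketch below shows the kind of per-token feature dictionaries that CRF libraries such as sklearn-crfsuite accept as input. The feature names and the helpers `token_features` and `sentence_features` are illustrative assumptions, not the features used by BNLP's pre-trained model.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the word and its context."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],    # leading characters often mark word class
        "suffix2": word[-2:],   # Bengali inflections appear as suffixes
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighbouring words let the model use local context.
    if i > 0:
        features["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1]
    return features


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


for feats in sentence_features(["আমি", "ভাত", "খাই", "।"]):
    print(feats)
```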
Next, we embedded the Bengali words of a corpus using BengaliWord2Vector with the pre-trained model from BNLP, a deep learning approach, to obtain each word's vector, its shape, and its values.
We also found false positives in the results and computed confusion matrices to obtain precision, recall, and F1 scores.
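The metric calculation can be sketched as follows, assuming the gold and predicted tags are available as two aligned lists. The function name `per_tag_scores` and the sample tags are hypothetical and are not results from this project.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} for two aligned tag lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Confusion matrix stored as (gold_tag, predicted_tag) -> count.
    confusion = Counter(zip(gold, predicted))
    tags = sorted(set(gold) | set(predicted))

    scores = {}
    for tag in tags:
        tp = confusion[(tag, tag)]
        # False positives: predicted as `tag` but the gold tag differs.
        fp = sum(confusion[(g, tag)] for g in tags if g != tag)
        # False negatives: gold is `tag` but something else was predicted.
        fn = sum(confusion[(tag, p)] for p in tags if p != tag)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = (precision, recall, f1)
    return scores


# Made-up example tags, for illustration only.
gold_tags = ["NOUN", "VERB", "NOUN", "ADJ"]
predicted_tags = ["NOUN", "NOUN", "NOUN", "ADJ"]
for tag, (p, r, f) in per_tag_scores(gold_tags, predicted_tags).items():
    print(tag, round(p, 3), round(r, 3), round(f, 3))
```

Here a false positive for a tag is any word that received that tag while its gold tag was different, which is what lowers precision.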
Alternatively, follow the DOI link: http://dx.doi.org/10.13140/RG.2.2.35358.41287/1
Subject: Project Technical Report (Publication no. 359257508)
Thank you for visiting.
© Rajdeep Das