SMS Phishing Detection in Sinhala-Language Messages Using Rule-Based Filtering and NLP

Today, people in many parts of the world, including Sri Lanka, prefer SMS over other communication services, largely because most of them own a mobile phone. Because SMS is so widespread, cybercriminals often target it. SMS phishing, or smishing, has become a major threat over the years. In this attack, people receive fraudulent texts designed to steal sensitive details such as passwords or credit card numbers. Many tools exist for spotting phishing attempts written in English, but few are available for Sinhala. This blog outlines how an SMS phishing detection system for Sinhala was developed using rule-based filtering and Natural Language Processing (NLP).

Understanding SMS Phishing

In SMS phishing, scammers create a sense of urgency to convince you to give away private information. Common lures include fake prize notifications, bogus charges, threats of account suspension, and demands for immediate payment. Some messages include links to malicious websites or ask users to reply with confidential information. These attacks are especially risky because they are easy to carry out and victims may not realize what is happening. As more people in Sri Lanka use mobile phones, instances of smishing have risen. Unfortunately, users have few tools that can review and flag suspicious Sinhala messages. There is a real need for tools that identify phishing messages in Sinhala quickly and easily.
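
The warning signs described above (urgency, prize or account lures, embedded links) can be turned into a simple rule-based score. The following is a minimal sketch, not the detection system itself: the term list, the weights, the threshold, and the function names `phishing_score` and `is_suspicious` are all assumptions made for illustration, and the Sinhala keywords are a small unvetted sample.

```python
import re

# Matches http(s) links and bare "www." links.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

# Hypothetical term weights, for illustration only.
SUSPICIOUS_TERMS = {
    "වහාම": 2,     # "immediately" (urgency)
    "ගිණුම": 1,     # "account"
    "මුරපදය": 2,   # "password"
    "දිනුම්": 2,    # "winning"
    "urgent": 2,
    "verify": 1,
}


def phishing_score(message):
    """Return (score, reasons) for one SMS using simple substring rules."""
    text = message.lower()  # lowercases English; Sinhala has no letter case
    score = 0
    reasons = []
    if URL_PATTERN.search(text):
        score += 2
        reasons.append("contains link")
    for term, weight in SUSPICIOUS_TERMS.items():
        if term in text:
            score += weight
            reasons.append("term: " + term)
    return score, reasons


def is_suspicious(message, threshold=3):
    """Flag a message when its rule score reaches the assumed threshold."""
    score, _ = phishing_score(message)
    return score >= threshold
```

For example, `is_suspicious("ඔබගේ ගිණුම වහාම verify කරන්න http://example.com/login")` returns `True` under these assumed weights, while an ordinary message such as `"හෙට හමුවෙමු"` is not flagged. Substring matching is crude, so a real rule set would need a curated vocabulary and tuning against labelled messages.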

Challenges in Sinhala Language SMS Phishing Detection

Sinhala is a morphologically rich, low-resource language, which creates several challenges. Few Sinhala SMS datasets exist, and even fewer focus on phishing. Sinhala sentence structure is complex, which makes text harder to analyze. Text messages are usually written with informal and abbreviated word forms, which are difficult for NLP tools to handle. People also often mix English and Sinhala terms in the same message. As a result, models developed for English or other major languages do not transfer easily to Sinhala.
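
One practical consequence of code-mixing is that a preprocessing step has to know which script each part of a message is written in. The sketch below is an assumption about how such a step could look, not a description of the original system. It relies on the Sinhala Unicode block (U+0D80 to U+0DFF); the function names are hypothetical.

```python
import unicodedata


def normalize_sms(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    Zero-width joiners are deliberately left in place, because Sinhala
    uses them to form some conjunct letters.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def is_sinhala(ch):
    """True if the character lies in the Sinhala Unicode block."""
    return "\u0d80" <= ch <= "\u0dff"


def script_profile(text):
    """Count Sinhala-block characters and ASCII letters in a message."""
    sinhala = sum(1 for ch in text if is_sinhala(ch))
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return {"sinhala": sinhala, "latin": latin}


def is_code_mixed(text):
    """True when a message contains both Sinhala and Latin letters."""
    profile = script_profile(text)
    return profile["sinhala"] > 0 and profile["latin"] > 0
```

A message such as `"ඔබගේ account එක"` is reported as code-mixed, while a purely English or purely Sinhala message is not. This check only looks at the script, so Sinhala typed in Latin letters would be counted as Latin text and would need separate handling.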

How to implement
