iAMStemmer: A Comprehensive Approach of Bahasa Indonesia (Indonesian language) Stemming Algorithm
Ashari Imamuddin, Mufid Junaedi, Mohamad Anas Sobarnas, Iskandar
Computer Science Department, Sekolah Tinggi Teknologi Muhammadiyah Cileungsi
Jl. Anggrek No.25, Perum. PTSC, Cileungsi, Kec. Cileungsi, Bogor, Jawa Barat 16820, Indonesia
ashari[at]sttmcileungsi.ac.id, mufid[at]sttmcileungsi.ac.id, anas[at]sttmcileungsi.ac.id, iskandar[at]sttmcileungsi.ac.id
Abstract
Nazief and Adriani pioneered the confix-stripping stemming algorithm for Bahasa Indonesia (the Indonesian language), which removes affixes (prefixes and suffixes) and then looks up the resulting word in a stem dictionary. However, their algorithm, SNA (the stemmer of Nazief and Adriani), is ambiguous for words that end in the syllables "ku", "mu", and "nya", such as "berlaku" and "sebeku". It assumes that "ku" in both words is a possessive, so it strips too much and produces "berla" and "sebe", which are meaningless. The algorithm also cannot handle the word "seolah-olah", because it treats the syllable "lah" as a particle and removes it. Asian later improved the algorithm by enhancing confix stripping (CSS) for regular repetition words such as "berlari-lari", "bersama-sama", and "terbata-bata". The most recent stemmer was developed by Suhartono. However, these stemmers fail on irregular repetition words such as "menari-nari", "memutar-mutar", and "menyama-nyamakan". We developed iAMStemmer as a comprehensive approach that addresses the shortcomings of the existing algorithms. Our methods are: matching the word that results from affix removal against the dictionary; expanding the set of stems; repeating the stemming for words with two prefixes of equal standing, as in "dikesampingkan" or "diketahui", which carry the prefixes "di" and "ke"; removing the prefix once for words with a doubled prefix, as in "seseorang" or "sesekali" with the doubled prefix "se"; improving the stemming rules for the confixes "meny-" and "memper-"; and enhancing the rules for repetition words. These changes modify the method and add rules to the existing algorithms. Our stemmer reduces the number of under-stemmed and over-stemmed words, improves the handling of repetition words, and raises the stemming success rate by about 3% over the existing algorithm's 93%. In addition, our system produces the root of confixed compound words such as "mempertanggungjawabkan", whose stem is the compound word "tanggung jawab".
Keywords: algorithm, stemming, Indonesian, stemming bahasa Indonesia
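To make the dictionary-first idea in the abstract concrete, the following is a minimal Python sketch. It is not the iAMStemmer implementation: the toy dictionary `STEMS`, the affix lists, the function names, and the way the second half of a repetition word is compared with the first are all assumptions chosen for illustration. It only shows how checking the dictionary after every affix removal can avoid stripping "ku" from "berlaku" while still handling some repetition words.

```python
# Toy stem dictionary (hypothetical; a real dictionary holds thousands of entries).
STEMS = {"laku", "beku", "olah", "lari", "tari", "putar", "sama", "buku", "samping"}

# Affixes covered by this sketch only; real rule sets are considerably larger.
SUFFIXES = ("lah", "kah", "pun", "ku", "mu", "nya", "kan", "an", "i")
PREFIXES = ("ber", "ter", "di", "ke", "se")
# Nasal prefixes that assimilate the first stem consonant: (prefix, restored letter).
NASAL = (("meny", "s"), ("mem", "p"), ("men", "t"))


def strip_prefix(word):
    """Return candidate words obtained by removing one prefix (may be empty)."""
    candidates = []
    for prefix, letter in NASAL:
        if word.startswith(prefix):
            candidates.append(letter + word[len(prefix):])
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    return candidates


def stem_simple(word, depth=0):
    """Stem a non-hyphenated word, consulting the dictionary before each removal."""
    if word in STEMS:
        return word
    if depth >= 3:  # bound the number of successive affix removals
        return None
    # Prefixes are tried first, so "berlaku" yields "laku" instead of "berla".
    for candidate in strip_prefix(word):
        result = stem_simple(candidate, depth + 1)
        if result:
            return result
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            result = stem_simple(word[: -len(suffix)], depth + 1)
            if result:
                return result
    return None


def stem(word):
    """Stem a word; an unresolved word is returned unchanged."""
    word = word.lower()
    if "-" not in word:
        return stem_simple(word) or word
    left, right = word.split("-", 1)
    left_stem = stem_simple(left)
    if left_stem is None:
        return word
    # Assumed heuristic: accept the left stem if the right half has the same stem,
    # or contains the stem minus its first (possibly assimilated) consonant.
    if stem_simple(right) == left_stem or left_stem[1:] in right:
        return left_stem
    return word
```

With this toy dictionary, `stem("berlaku")` returns `"laku"`, `stem("seolah-olah")` returns `"olah"`, and `stem("menari-nari")` returns `"tari"`, while a genuine possessive such as `stem("bukuku")` still reduces to `"buku"`. The results depend entirely on the contents of `STEMS`, which is why a larger stem dictionary matters for this kind of approach.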