การตัดคำภาษาไทยโดยใช้คุณลักษณะ / ไพศาล เจริญพรสวัสดิ์ = Feature-based Thai word segmentation / Paisarn Charoenpornsawat

Author	ไพศาล เจริญพรสวัสดิ์
Title	การตัดคำภาษาไทยโดยใช้คุณลักษณะ / ไพศาล เจริญพรสวัสดิ์ = Feature-based Thai word segmentation / Paisarn Charoenpornsawat
Imprint	2541 [1998]
Connect to	http://cuir.car.chula.ac.th/handle/123456789/11711
Descript	[8], 71 leaves : charts

SUMMARY

Written Thai does not use an explicit delimiter to mark word boundaries. Many natural language processing (NLP) tasks, such as Thai-English machine translation, Thai speech synthesis, and spelling correction, need word boundaries to be known before any further processing, so word segmentation is one of the main problems in NLP. Word segmentation involves two main problems: the ambiguity problem and the unknown word boundary problem, which concerns words that do not appear in the dictionary. Several approaches have been proposed, including longest matching, maximal matching, and the trigram model. However, these approaches cannot achieve high accuracy, because longest matching and maximal matching rely only on heuristics and the trigram model considers only the two preceding context words. The accuracy in solving the ambiguity problem is about 53% for maximal matching and about 73% for the trigram model. This thesis proposes a feature-based approach that uses two machine learning algorithms, RIPPER and Winnow, to solve these problems in Thai word segmentation. A feature can be anything that tests for specific information in the context around the word in question; the features used for both problems are context words and collocations. In the experiment, the system is trained with RIPPER and with Winnow separately on 80% of a part-of-speech tagged corpus, and the remainder is used for testing. The evaluation is divided into two parts: the accuracy in solving the ambiguity problem and the accuracy in solving the unknown word boundary problem. In solving the ambiguity problem, RIPPER and Winnow achieve accuracies of more than 85% and 90%, respectively. In solving the unknown word boundary problem, they achieve accuracies of more than 70% and 80%, respectively. The experimental results show that the feature-based approach outperforms the trigram model and maximal matching, and that Winnow is superior to RIPPER at extracting features from the corpus.
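As an illustration only, the following is a minimal sketch of how a Winnow-style classifier could use context-word features to choose between two segmentations of an ambiguous string. It is not the thesis's implementation: the helper names (`context_features`, `Winnow`), the `<AMB>` placeholder, the threshold and promotion factor, and the four toy training sentences are all assumptions made for this example. The ambiguous string assumed here is ตากลม, which can be read as ตา|กลม (label `True`) or ตาก|ลม (label `False`).

```python
def context_features(words, index, window=2):
    """Binary context-word features: the words within +/- window of position index."""
    lo = max(0, index - window)
    hi = min(len(words), index + window + 1)
    return {f"ctx={words[i]}" for i in range(lo, hi) if i != index}


class Winnow:
    """Winnow2-style mistake-driven classifier over sparse binary features."""

    def __init__(self, threshold=4.0, alpha=2.0):
        self.threshold = threshold  # assumed value, chosen for the toy data
        self.alpha = alpha          # promotion/demotion factor (assumed)
        self.weights = {}           # features not seen yet have weight 1.0

    def score(self, features):
        return sum(self.weights.get(f, 1.0) for f in features)

    def predict(self, features):
        return self.score(features) >= self.threshold

    def update(self, features, label):
        # Weights change only when the current prediction is wrong.
        if self.predict(features) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for f in features:
            self.weights[f] = self.weights.get(f, 1.0) * factor

    def fit(self, examples, epochs=5):
        for _ in range(epochs):
            for features, label in examples:
                self.update(features, label)


# Hypothetical toy sentences; "<AMB>" stands for the ambiguous string.
# True  = segment as ตา|กลม, False = segment as ตาก|ลม.
toy_sentences = [
    (["เธอ", "มี", "<AMB>", "สวย"], 2, True),
    (["เขา", "ออก", "ไป", "<AMB>", "เย็น"], 3, False),
    (["เด็ก", "มี", "<AMB>", "โต"], 2, True),
    (["ฉัน", "ไป", "<AMB>", "ริม", "ทะเล"], 2, False),
]
examples = [(context_features(w, i), y) for w, i, y in toy_sentences]

model = Winnow()
model.fit(examples)
```

After training, `model.predict(context_features(words, index))` returns `True` or `False` for a new context. A context word such as มี ends up with a promoted weight because it co-occurred with a `True` example the model first got wrong, while ไป is demoted. Collocation features, which the abstract also names, could be added to the same feature set in the same way.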

SUBJECT

Sentence parsing
Thai language
Word segmentation
Winnow
RIPPER
Ambiguous words

LOCATION	CALL#	STATUS
Engineering Library : Thesis	วิทยานิพนธ์	LIB USE ONLY