Abstract
The character vocabulary can be very large in non-alphabetic languages such as Chinese and Japanese, which makes neural network models for processing such languages very large. We explored a model for sentiment classification that takes the embeddings of the radicals of Chinese characters, i.e., hanzi in Chinese and kanji in Japanese. Our model is composed of a CNN word feature encoder and a bi-directional RNN document feature encoder. The results achieved are on par with those of character embedding-based models and close to the state-of-the-art word embedding-based models, with a 90% smaller vocabulary and at least 13% and 80% fewer parameters than character embedding-based and word embedding-based models, respectively. The results suggest that the radical embedding-based approach is cost-effective for machine learning on Chinese and Japanese.
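To make the described pipeline concrete, below is a minimal sketch of a radical-embedding model with a CNN word feature encoder and a bi-directional RNN document feature encoder, assuming PyTorch as the framework. All class names, layer choices (GRU, max-pooling), and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: radicals -> embedding -> CNN word encoder
# -> bi-directional RNN document encoder -> sentiment classifier.
# Sizes and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class RadicalSentimentClassifier(nn.Module):
    def __init__(self, num_radicals=300, emb_dim=32, conv_channels=64,
                 rnn_hidden=128, num_classes=2):
        super().__init__()
        # The radical vocabulary is far smaller than a character or word vocabulary.
        self.embedding = nn.Embedding(num_radicals, emb_dim, padding_idx=0)
        # CNN word feature encoder: convolve over the radicals of each word.
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        # Bi-directional RNN document feature encoder over word features.
        self.rnn = nn.GRU(conv_channels, rnn_hidden, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * rnn_hidden, num_classes)

    def forward(self, radical_ids):
        # radical_ids: (batch, num_words, radicals_per_word), integer indices
        b, w, r = radical_ids.shape
        x = self.embedding(radical_ids.view(b * w, r))    # (b*w, r, emb)
        x = self.conv(x.transpose(1, 2))                  # (b*w, channels, r)
        word_feats = x.max(dim=2).values.view(b, w, -1)   # max-pool per word
        doc_feats, _ = self.rnn(word_feats)               # (b, w, 2*hidden)
        return self.classifier(doc_feats[:, -1])          # sentiment logits
```

A usage example would feed a tensor of shape `(batch, num_words, radicals_per_word)` built by decomposing each character of each word into its radicals and padding to fixed lengths.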
Original language | English |
---|---|
Pages (from-to) | 561-573 |
Number of pages | 13 |
Journal | Journal of Machine Learning Research |
Volume | 77 |
Publication status | Published - 2017 |
Event | 9th Asian Conference on Machine Learning, ACML 2017 - Seoul, Republic of Korea. Duration: 2017 Nov 15 → 2017 Nov 17 |
Keywords
- Natural Language Processing
- Sentiment Analysis
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence