Volume 5, Issue 3, June 2016, Page: 42-47
Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus
Seyede Roya Mohammadi, Computer Engineering Department, Alzahra University, Tehran, Iran
Noushin Riahi, Computer Engineering Department, Alzahra University, Tehran, Iran
Received: Mar. 23, 2016; Accepted: Jun. 7, 2016; Published: Jun. 18, 2016
DOI: 10.11648/j.ijiis.20160503.12
Multilingual corpora are the main resources in the fields of language information retrieval. The quality of much research, such as work on machine translation, depends strongly on the quality of these corpora. One type of multilingual corpus is the comparable corpus. In terms of quality, comparable corpora contain a broad range of information, but constructing them poses particular problems, which leave only a small number of aligned pairs in the corpus despite the large size of the underlying dataset. In this paper we present a new method for increasing the quality and quantity of a comparable corpus. We built a Persian-English comparable corpus from two independent news collections: BBC news in English and Hamshahri news in Persian.
Comparable Corpus, Corpus Quality, Hamshahri Corpus, Query, RATF Factor
To cite this article
Seyede Roya Mohammadi, Noushin Riahi, Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus, International Journal of Intelligent Information Systems. Vol. 5, No. 3, 2016, pp. 42-47. doi: 10.11648/j.ijiis.20160503.12
Copyright © 2016 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.