• Priyanka P. Pattnaik

Extractive Odia Text Summarization System: An OCR based Approach

Automatic text summarization is considered a challenging task in natural language processing. In multilingual scenarios, particularly for low-resource, morphologically complex languages, summarization datasets are rare and difficult to construct. In this work, we propose a novel technique that extracts Odia text from image files using optical character recognition (OCR) and summarizes the extracted text using extractive summarization techniques. We also performed a manual evaluation of summary quality to validate our approach. The proposed approach is found suitable for generating summarized Odia text, and the same technique can be extended to extractive summarization for other low-resource languages.

Experimental Setup:


Once the input image has been brought into a suitable shape, the “Tesseract” toolkit performs Odia character extraction. For text summarization, we use “Term Frequency-Inverse Document Frequency” (TF-IDF). The text extracted from the image is first tokenized into sentences, and each sentence is then split into words. To remove uninformative words from the sentences, a stop-word filtering step is applied. Because few stop-word datasets are available for Odia, we built our own. After the stop-words are removed, the “Term Frequency (TF)” of each remaining word is calculated using the formula given in the paper.
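The steps described above (sentence tokenization, word splitting, stop-word filtering, TF-IDF scoring) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word set is a tiny hypothetical placeholder for their custom list, each sentence is treated as a "document" when computing IDF, TF is taken as a word's count divided by the sentence length, and the input is assumed to be text already produced by the OCR step.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder; the authors' own Odia stop-word list is not reproduced here.
STOP_WORDS = {"ଏବଂ", "ଓ", "ଏହି"}


def split_sentences(text):
    # Odia sentences normally end with the danda "।"; ".", "?" and "!" are accepted too.
    parts = re.split(r"[।.?!]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    # Whitespace tokenization followed by stop-word filtering.
    return [w for w in sentence.split() if w not in STOP_WORDS]


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences in which each word occurs.
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Assumed scoring: sum over words of (count / sentence length) * log(n / df).
        score = sum((c / len(d)) * math.log(n / df[w]) for w, c in tf.items()) if d else 0.0
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(ocr_text, k=3)` would return the three sentences with the highest TF-IDF score, kept in document order so that the summary reads naturally. Other weighting or normalization choices are equally plausible; the paper should be consulted for the exact formula.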

Results and Conclusion:

When the proposed technique was applied to the selected data, we obtained the desired summarized text. The extracted Odia text and the generated Odia summaries are shown in the Figure. Because automatic evaluation is difficult in our setting, we relied on human evaluation. We chose four evaluators who can read, write, and understand Odia properly, and set five parameters for the manual evaluation, as listed in the Table. We provided the extracted Odia text and the generated summaries to these four experts. According to their evaluation, all generated summaries are closely related to the extracted Odia text, and hence the summaries stay on topic. The evaluators also went through our evaluation criteria and reported their results in percentile format. The manual evaluation results are shown in the Table.

Link to paper: