Automatic Topic Extraction from Research Articles Using N-gram Analysis

University essay from Göteborgs universitet/Institutionen för data- och informationsteknik

Abstract: Identifying the topic of an article can involve a lot of manual work, and the manual process becomes exhausting for a large volume of articles. To tackle this problem, we propose an automated topic extraction approach that can extract topics for a large number of articles efficiently. To support automatic topic extraction, our research builds on existing N-gram analysis, which only calculates how frequently words appear in a document. In our research, we additionally apply customized filtering standards to improve efficiency and to eliminate as many irrelevant or noncritical phrases as possible. By doing so, we aim to ensure that the final keyphrases selected for each article are unique labels that represent the core idea of that specific article. We choose to focus on research papers within the autonomous vehicle domain because such papers are in high demand in our daily life. Since most research papers are available only in PDF format, we need to convert the PDF files into an editable file type such as TXT. To realize the automation and test our proposed idea, we selected a large number of autonomous vehicle-related articles. We then observe the results and compare them with manual topic extraction results to evaluate our approach.
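
The N-gram frequency counting with filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the rule of dropping N-grams that start or end with a stop word, and the names `tokenize`, `ngrams`, `keep`, and `top_keyphrases` are all assumptions made for illustration, since the abstract does not specify the customized filtering standards. The input is assumed to be plain text already converted from PDF.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the thesis's actual filtering standards
# are not given in the abstract.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "or",
             "to", "is", "are", "with", "by"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keep(gram):
    """Assumed filter: drop n-grams that start or end with a stop word."""
    return gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS

def top_keyphrases(text, n=2, k=3):
    """Count the n-grams that pass the filter; return the k most frequent."""
    tokens = tokenize(text)
    counts = Counter(g for g in ngrams(tokens, n) if keep(g))
    return [(" ".join(g), c) for g, c in counts.most_common(k)]

sample = (
    "Autonomous vehicles rely on sensor fusion. "
    "Sensor fusion combines lidar and radar data. "
    "The autonomous vehicles of the future need robust sensor fusion."
)
result = top_keyphrases(sample, n=2, k=2)
print(result)  # [('sensor fusion', 3), ('autonomous vehicles', 2)]
```

In this sketch the filter runs before counting, so discarded phrases never enter the frequency table. A real pipeline would likely need further rules, for example to avoid N-grams that span sentence boundaries.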
