Abstract Details

ID / Submission Time

1 / 2018-05-16 12:12:44

Title

A Similar Entity Identification Method for Text Big Data Based on Spark Parallel Framework

Keywords

Spark, Text big data, Similar entity identification, Graph theory

Topics and Special Sessions

All Topics

Status

Full paper awaiting review

Authors

Tong YU / Northeast Electric Power University

LI Hongbiao / Northeast Electric Power University

Abstract

To address the problem of similar entity identification in high-dimensional, massive
text data, a method based on the Spark parallel framework is proposed. First, the records
corresponding to entities are converted into Simhash fingerprints (binary strings) using
the Simhash algorithm, which maps high-dimensional text data to low-dimensional Simhash
fingerprints. Second, a Simhash Fingerprint Recognition Strategy (SFRS) based on graph
theory is designed to identify similar Simhash fingerprints and, through them, the
corresponding records, thereby identifying similar entities. Finally, a similar entity
identification algorithm based on SFRS and Spark is proposed and applied to similar entity
identification in high-dimensional, massive text data. A comparative experimental analysis
on text data from UCI shows that the presented method performs well and is applicable to
this task.
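
The pipeline outlined in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not specify how SFRS builds or traverses its graph, so everything below is an assumption: the MD5-based token hash, the 64-bit fingerprint width, the Hamming-distance threshold `max_distance`, and the use of connected components (via union-find) as a stand-in for the graph-based step. The function names are hypothetical, and Spark is deliberately left out so the example runs on its own.

```python
import hashlib
from itertools import combinations


def simhash(text: str, bits: int = 64) -> int:
    """Return a Simhash fingerprint built from whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash; MD5 is an arbitrary choice for this sketch.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similar_groups(records, max_distance=3):
    """Group record indices whose fingerprints are linked by near-duplicates.

    Fingerprints are graph vertices; an edge joins two vertices when their
    Hamming distance is at most `max_distance`. Each connected component is
    reported as one group of similar entities (an assumed stand-in for SFRS).
    """
    fingerprints = [simhash(r) for r in records]
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if hamming(fingerprints[i], fingerprints[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The all-pairs comparison here is quadratic in the number of records; in a Spark setting the fingerprinting and comparison steps would presumably be distributed across partitions, but the abstract gives no detail on how that is done.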