To address the problem of similar entity identification in high-dimensional,
massive text data, a method based on the Spark parallel framework is proposed. Firstly,
the records corresponding to entities are converted into Simhash fingerprints (binary strings)
by the Simhash algorithm, mapping high-dimensional text data to low-dimensional
Simhash fingerprints. Secondly, a Simhash fingerprint recognition strategy
(SFRS) based on graph theory is designed to identify similar Simhash fingerprints
and, through them, the corresponding records, thereby realizing
similar entity identification. Finally, a similar entity identification algorithm
based on SFRS and Spark is proposed and applied to similar entity identification
in high-dimensional, massive text data. A comparative experimental analysis on
text data from UCI is conducted, and the experimental results show the good performance and
applicability of the proposed method.
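
The pipeline summarized above can be illustrated with a minimal single-machine sketch. It is not the SFRS algorithm itself, whose details are not given here. It assumes that two fingerprints count as similar when their Hamming distance is at most a threshold, and that similar entities are the connected components of the resulting similarity graph. The names `simhash`, `hamming`, `group_similar`, and `max_distance`, the 64-bit width, and the use of MD5 as the token hash are all illustrative assumptions. Spark is omitted so that the sketch runs with the standard library alone.

```python
import hashlib
from itertools import combinations

BITS = 64  # assumed fingerprint width; not specified in the text above


def simhash(text):
    """Return a 64-bit Simhash fingerprint of whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Stable per-token hash; MD5 is an arbitrary choice for this sketch.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def group_similar(records, max_distance=3):
    """Group record indices whose fingerprints are linked in the similarity graph.

    An edge joins two records when their Hamming distance is at most
    max_distance (an assumed criterion); groups are connected components,
    found here with a simple union-find.
    """
    fps = [simhash(r) for r in records]
    parent = list(range(len(fps)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; a parallel version would need to
    # partition or block the fingerprints instead.
    for i, j in combinations(range(len(fps)), 2):
        if hamming(fps[i], fps[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(fps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `group_similar(records)` on a list of text records returns lists of record indices, one list per group of mutually linked records. The threshold trades recall for precision, and the value 3 used as the default here is only a placeholder.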