The Framework of Network Public Opinion Monitoring and Analyzing System Based on Semantic Content Identification
Cheng Xian-Yi (1,2), Zhu Ling-ling (1), Zhu Qian (2), Wang Jin (1)
1. School of Computer Science and Technology, Nantong University, Nantong 226019, China
2. School of Computer Science and Communications Engineering, Jiangsu University, Zhenjiang 212013, China
E-mail: [email protected]
doi: 10.4156/jcit.vol5.issue10.7
Abstract
The network has become an important platform for the public to express opinions, discuss public affairs, and participate in economic, social and political life. Because network public opinion spreads at a geometric rate, the government needs to monitor and analyze it in order to manage public opinion information, discover hot spots in time, and correctly guide public opinion trends. Network public opinion monitoring and analysis has therefore become a hot issue in recent years. The main mature technology at present is statistical analysis based on keywords, but there is still much room to improve its effectiveness. This paper describes a framework for a network public opinion monitoring and analyzing system based on semantic content identification, intended to solve some key problems of public opinion monitoring.
Keywords: Network Public Opinion, Natural Language Processing, Semantics
1. Introduction
In recent years, with the development of the Internet and the growing number of netizens, some people disclose and spread sensitive and harmful information through forums, instant messaging, e-mail and so on, which threatens social stability and people's lives and property. On the one hand, national legislation and regulations should pay closer attention to the focuses of public opinion in order to serve the public better; on the other hand, the government bears a vital responsibility for correctly monitoring and guiding sensitive public opinion, so as to protect network users from harmful information and build a harmonious socialist country. According to preliminary statistics from the Internet center, there have been large direct and indirect losses since 1996, as shown in Figure 1. Network public opinion monitoring and analysis has therefore become an urgent and important issue [1].
Figure 1. Statistics of the harm to society caused by the spread of network public opinion (the Y axis shows state losses in millions of dollars)
The most important technologies for network public opinion analysis include text filtering, text classification, clustering, viewpoint tendency recognition, topic tracking, automatic summarization and so on, which have long been of interest to researchers in China and abroad. In order to control information more effectively, this paper describes a framework for a network public opinion monitoring and analyzing system based on semantic content identification.
Journal of Convergence Information Technology, Volume 5, Number 10, December 2010
2. Research Situation
Researchers from DARPA, CMU, the University of Massachusetts and Dragon Systems, Inc. began to define the study of topic detection and tracking and developed TDT. The key technology of this project is content classification of information, which resolves the conflict between processing speed and security monitoring in real-time monitoring and makes such monitoring feasible. Related work abroad includes the PICS of the W3C, which has become a classification standard on the WWW. There are two general international classification standards, RSACi and SafeSurf, both of which conform to PICS. However, this classification technique is used mainly for web page classification and filtering, and foreign policies and standards are, for various reasons, not fully suited to China's national conditions.
In China, the Founder ZhiSi public opinion warning DSS [2], designed by the Founder Technology Research Institute, has been successful. The system achieves automatic real-time monitoring and analysis of massive public opinion data. It lets the government monitor public opinion more effectively than the traditional manual mode, helps strengthen the supervision of Internet information, and plays a role in handling sudden public events on the network. The DSS provides functions including full-text retrieval, automatic classification, automatic clustering, topic detection and tracking, related recommendation and duplicate removal, association and trend analysis, automatic abstracting and keyword extraction, sudden-event analysis, statistics generation and so on.
The Goonie network public opinion and information monitoring system combines Internet search technology, intelligent information processing technology and knowledge management methods. Through automatic collection, automatic classification and aggregation, topic collection and special-topic focusing, it monitors network public opinion, tracks special news, and produces briefings, reports and so on. Goonie can therefore keep track of public opinion, support appropriate guidance of public opinion and provide analysis reports [3].
A framework for a content security monitoring system based on human-computer cooperation was designed in [4]. The framework is hierarchical, with three levels: the data acquisition layer, the content analysis layer and the output layer. Its main function is to examine information by content analysis and identify harmful information; at the same time, it can provide electronic evidence of the misuse of information by recording the source and content of the information and tracking them through effective audit analysis.
Although many domestic organizations are engaged in research on Internet content filtering and are trying to purify the network environment, these techniques are still in their infancy, and there are still deficiencies in "semantic information filtering".
3. System Framework
The purpose of the system is to monitor and report on network public opinion in a large-scale network environment through detection, acquisition, theme management, hot topic and event tracking, experimental monitoring and so on, and to present the analysis results in several forms, such as briefings, reports and charts. The system can therefore keep track of public opinion, support appropriate guidance of public opinion and provide analysis reports. Figure 2 shows the block diagram of the function modules of the monitoring system. Processing has five stages: resource discovery, information selection, pattern discovery, information extraction and public opinion handling.
[Figure 2 (block diagram). Stages, from data source to user: resources discovery (BBS, blog, e-mail, chat room, Web), select information, pattern discovery, information extraction, public opinion handling. Other labels in the diagram: server, client; theme management; warning, filter, counter, monitor, decision; topic search, event search, report, abstract, trend analysis; topic pattern, event pattern, semantic index, tendentious analysis; filter noise, extract the subject, cluster, filter; text purification, text analysis and audit.]
Figure 2. Logical structure of the function modules of the network public opinion monitoring system
Figure 3 shows the system workflow. The system includes the following five databases:
1) Public opinion planning information database: collects the public opinion information required, from online news, BBS, chat rooms, blogs, aggregated news (RSS), etc.
2) Public opinion analysis information database: stores the data obtained through classification and clustering, keyword extraction, duplicate removal and filtering, named entity recognition, semantic computing, etc.; this database is structured.
3) Public opinion database: stores products of public opinion analysis, such as analysis reports, survey reports, experience summaries and related information.
4) Semantic dictionary: ontology knowledge, etc.
5) HNC knowledge: 466 items of sentence knowledge, etc. [5]
[Figure 3 (workflow diagram). Processing chain: public sentiment resources discovery, select information, pattern discovery, information extraction, popular feelings handling. Data stores: public opinion planning database, public opinion analysis information, public opinion products, HNC knowledge, semantic dictionary.]
Figure 3. System workflow of the network public opinion monitoring system
Figure 4 shows the client workflow.
[Figure 4 (flowchart). Show topic list; modify the theme? (yes: theme management; no: theme selection); server-side processing: information extraction, popular feelings information handling (warn, decision, monitor, counter, filter); topic retrieval, event retrieval, summary of public opinion, public report; user: continue or exit.]
Figure 4. Client workflow
Figure 5 shows the data flow of the system. The modules interact in different ways. The resources discovery module and the select information module exchange data through files. The select information module converts text into vectors or an ontology. The pattern discovery module uses GATE to tag named entities, determines the relationships between entities, and then discovers event patterns or topic patterns. The information extraction module mainly performs semantic computing and transforms the patterns into templates, which turn unstructured information into structured information. The public opinion handling module runs queries according to the user's request and presents the results in a suitable form; at the same time, it receives the user's settings and query requests.
[Figure 5 (data flow diagram). Server: network information, resources discovery, select information (analysis of unstructured text; vector, ontology), pattern discovery, information extraction (structured results database). Client: popular feelings handling (user requests, results display).]
Figure 5. Data flow chart of the system
Figure 6 shows the network topology of the whole system. The system can have many users, and each user can connect to a server. Servers can share data and exchange information with each other over the network, and the network connection can be P2P or client-server. The topology will be continually revised and optimized in future work.
[Figure 6 (topology diagram). Labels: network media, law enforcement, special content management, Network Communication Office, administrative, extranet.]
Figure 6. Network topology
4. Working Process
4.1. Resources discovery based on latent semantic analysis
Resources discovery retrieves the required network resources; it is the process of integrating, consolidating and mapping data from different network information sources. Different resources require different retrieval tools and strategies.
Texts from BBS, chat rooms and e-mail are short and informal. First, DTS is used to import and export the documents. Then theme-oriented latent semantic analysis is applied to overcome the problems of algorithms that ignore context and misjudge synonyms, with SVD used for information filtering and noise removal. Based on similarity calculation and clustering analysis of document content, topic drift can be detected effectively and in time, which better meets the requirements of public opinion surveillance.
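The latent semantic analysis step can be illustrated with a minimal sketch. It assumes whitespace-tokenized documents, raw term counts and NumPy; the function name `lsa_similarity` and the parameter `k` are hypothetical and do not come from the system described here. The truncated SVD keeps only the `k` strongest latent dimensions, which is the noise-removal effect mentioned above, and documents are then compared by cosine similarity in that reduced space.

```python
import numpy as np

def lsa_similarity(docs, k=2):
    """Return the cosine similarity matrix of documents in a k-dimensional latent space."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix of raw counts (rows: terms, columns: documents).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    # Thin SVD; keeping the k largest singular values discards minor (noisy) dimensions.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Document coordinates in the latent space: one row per document.
    D = (np.diag(s[:k]) @ Vt[:k]).T
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    D = D / norms
    return D @ D.T

docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile",
    "flood rescue news",
    "flood rescue report",
]
sim = lsa_similarity(docs, k=2)
```

In this toy corpus the first two documents end up close in the latent space even though one says "car" and the other "automobile", because both words co-occur with the same neighbours. The resulting similarity matrix could then feed a clustering step.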
4.2. Select information
The select information stage obtains specialized information from the network through automatic selection and pre-processing. First, noise is filtered out, named entities are recognized, and the subject and the event are extracted. Second, the text is classified, clustered and filtered according to topics or events. Finally, the text is discriminated.
1) Text classification based on semi-supervised learning
A distinctive feature of public opinion information is that the texts are short and the data volume is massive. Traditional text classification algorithms use supervised learning: they learn from samples labeled with predefined category tags and then determine the category of a text from its semantic content. Training a good classifier requires a large number of labeled samples. Unlabeled data is easy to obtain in large quantities, but labeled data is costly and impractical to produce, which creates a bottleneck when traditional text classification is applied to huge amounts of data. We use text classification based on semi-supervised learning to overcome the sparsity of short texts and to improve the accuracy of short text classification. To increase the robustness of the algorithm and to better avoid falling into a local optimum, we integrate the Bagging algorithm into the semi-supervised learning.
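One way to combine Bagging with semi-supervised learning is self-training with a bagged ensemble, sketched below. This is an illustration under stated assumptions, not the classifier used in the system: the choice of self-training, the naive Bayes base learner, and all names and parameters (`train_nb`, `bagged_vote`, `self_train`, `n_models`, `min_agreement`) are hypothetical. An unlabeled text is added to the training set only when a large enough share of the bootstrap-trained models agree on its label.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(samples):
    """Train a multinomial naive Bayes model on (tokens, label) pairs."""
    word_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Return the most probable label for a token list."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label in sorted(label_counts):
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for t in tokens:
            # Laplace smoothing, with one extra slot for unseen words.
            score += math.log((word_counts[label][t] + 1) / (n_words + len(vocab) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

def bagged_vote(labeled, tokens, n_models, rng):
    """Train n_models on bootstrap samples and return (majority label, agreement share)."""
    votes = Counter()
    for _ in range(n_models):
        boot = [rng.choice(labeled) for _ in labeled]
        votes[predict_nb(train_nb(boot), tokens)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_models

def self_train(labeled, unlabeled, n_models=7, min_agreement=0.8, rounds=3, seed=0):
    """Iteratively move confidently voted unlabeled texts into the labeled set."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        accepted, remaining = [], []
        for tokens in pool:
            label, agreement = bagged_vote(labeled, tokens, n_models, rng)
            if agreement >= min_agreement:
                accepted.append((tokens, label))
            else:
                remaining.append(tokens)
        if not accepted:
            break
        labeled.extend(accepted)
        pool = remaining
    return labeled, pool

seed_set = [
    (["flood", "rescue"], "disaster"),
    (["flood", "warning"], "disaster"),
    (["match", "goal"], "sport"),
    (["team", "goal"], "sport"),
]
model = train_nb(seed_set)
grown, leftover = self_train(seed_set, [["flood", "news"], ["goal", "score"]])
```

The agreement threshold trades coverage for safety: a higher value admits fewer pseudo-labeled texts per round but reduces the risk of reinforcing early mistakes.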
2) Bad information detection
Bad information detection is one of the key factors in a system that monitors website content. Traditional network detection systems recognize and filter network information based on keywords only. As a result, if a number of cult sites are to be blocked, pages that criticize the cult will also be filtered out. We therefore propose a method for detecting bad information content based on HNC (Figure 7), which does not rely on keyword matching but judges which text needs to be filtered according to the meaning of its sentences.
4.3. Pattern Discovery
Pattern discovery performs hot topic detection, tracking of events of concern and orientation analysis, by applying data mining and semantic computing to the data from the select information module. This module is the core of the system.
Pattern discovery proceeds as follows:
1) First, we use ICTCLAS, developed at the Chinese Academy of Sciences, for word segmentation and POS tagging, and obtain four tables:
Theme table (ID, title, text, author, time, vector)
Comment table (ID, theme ID, title, text, author, time, tendentious value)
Topic table (ID, keyword group, number of participants, time, polarity, viewpoint opposition, notes)
Topic-theme map table (topic ID, theme ID)
A theme ID is assigned incrementally when a theme is inserted into the database, and when comments are saved the theme ID maintains the correspondence between comments and their theme. The third table holds the basic clustering information, and the fourth table records which themes each cluster (that is, each topic) contains.
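The four tables can be sketched as a relational schema. The example below uses SQLite through Python's standard `sqlite3` module purely for illustration; the database engine, column names (`body`, `posted_at`, `tendency` and so on), column types and helper functions are assumptions and are not specified by the system description.

```python
import sqlite3

SCHEMA = """
CREATE TABLE theme (
    id INTEGER PRIMARY KEY AUTOINCREMENT,   -- assigned incrementally on insert
    title TEXT, body TEXT, author TEXT, posted_at TEXT, vector TEXT
);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    theme_id INTEGER REFERENCES theme(id),  -- links a comment to its theme
    title TEXT, body TEXT, author TEXT, posted_at TEXT, tendency REAL
);
CREATE TABLE topic (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keywords TEXT, participants INTEGER, posted_at TEXT,
    polarity REAL, opposition REAL, notes TEXT
);
CREATE TABLE topic_theme (
    topic_id INTEGER REFERENCES topic(id),
    theme_id INTEGER REFERENCES theme(id),
    PRIMARY KEY (topic_id, theme_id)
);
"""

def create_db(path=":memory:"):
    """Open a database and create the four tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_theme(conn, title, body, author, posted_at):
    """Insert a theme and return its automatically assigned ID."""
    cur = conn.execute(
        "INSERT INTO theme (title, body, author, posted_at) VALUES (?, ?, ?, ?)",
        (title, body, author, posted_at),
    )
    return cur.lastrowid

def add_comment(conn, theme_id, body, author, posted_at, tendency=None):
    """Insert a comment that refers to its theme by theme_id."""
    conn.execute(
        "INSERT INTO comment (theme_id, body, author, posted_at, tendency) "
        "VALUES (?, ?, ?, ?, ?)",
        (theme_id, body, author, posted_at, tendency),
    )

def comments_for_topic(conn, topic_id):
    """Return (body, tendency) for every comment on themes mapped to a topic."""
    rows = conn.execute(
        "SELECT c.body, c.tendency FROM comment c "
        "JOIN topic_theme m ON m.theme_id = c.theme_id "
        "WHERE m.topic_id = ? ORDER BY c.id",
        (topic_id,),
    )
    return rows.fetchall()

conn = create_db()
theme_id = add_theme(conn, "Flood relief", "Post body", "alice", "2010-12-01")
add_comment(conn, theme_id, "Well handled", "bob", "2010-12-02", 1.5)
topic_id = conn.execute(
    "INSERT INTO topic (keywords) VALUES (?)", ("flood relief",)
).lastrowid
conn.execute("INSERT INTO topic_theme VALUES (?, ?)", (topic_id, theme_id))
```

The join in `comments_for_topic` is the lookup that the topic-theme map table exists to support: from a topic, find its themes, and from those themes, find the comments.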
[Figure 7 (algorithm diagram). Processing steps: article pretreatment, sentence analysis, context generation (short-term memory), semantic position judgment, red and black check, network map object position. Knowledge sources: HNC concept knowledge base and word knowledge base, HNC semantic judgment library, red and black objects library. Intermediate results: elements of the text framework, sentence semantic structure. Output, text nature: 1 absolutely black, 2 absolutely red, 3 black, 4 suspicious III, 5 suspicious II, 6 suspicious I, 7 neutral.]
Figure 7. Bad information detection algorithm diagram
2) Tendency Analysis
First, we prepare a tendency dictionary. An initial dictionary is built from the words in How-Net that are manually labeled with polarity and strength, and some common words are then added by hand. Because tendencies must be looked up quickly, the dictionary is stored in a hash table as provided by the Java language.
Next, read the text and process it sentence by sentence: remove the stop words from each sentence, look up each word in the tendency dictionary, and calculate the polarity and strength of each polar word in its context. Then add up all the polarity components and divide by the square root of the number of comments to obtain the tendency value.
Finally, divide the comments into tendency classes according to the distribution of tendency values, and rank them.
3) Hot spot analysis of public opinion: Query the comments from the database according to the topic-theme map table and use them as the basis for ranking hot spots.
The degree of concern for an event is calculated from the viewpoint opposition of the topic together with its comments. Select an initial time point and a unit of time (for example, one day) as the basic accumulation unit. For each time point, calculate the viewpoint opposition of the topic by counting only the comments made before that point; the increase in opposition for that period is then obtained by subtracting the value at the previous time point from the value at the current one. In this way the trend of the event can be obtained.
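The accumulation and differencing steps can be sketched as below. The text does not define how viewpoint opposition is measured, so the `opposition` function here (the share of the minority camp among polar comments) is a hypothetical stand-in, as are the function and parameter names. Only the procedure of counting comments before each time point and subtracting consecutive values follows the description.

```python
from datetime import date, timedelta

def opposition(tendencies):
    """Hypothetical opposition measure: share of the minority camp, in [0, 0.5]."""
    pos = sum(1 for t in tendencies if t > 0)
    neg = sum(1 for t in tendencies if t < 0)
    if pos + neg == 0:
        return 0.0
    return min(pos, neg) / (pos + neg)

def opposition_trend(comments, start, periods, step=timedelta(days=1)):
    """Return (time point, cumulative opposition, increment) for each period.

    comments is a list of (posting date, tendency value) pairs.
    """
    trend, previous = [], 0.0
    for i in range(1, periods + 1):
        point = start + i * step
        # Count only the comments made before this time point.
        seen = [t for d, t in comments if d < point]
        value = opposition(seen)
        # Increment: value at this point minus value at the previous point.
        trend.append((point, value, value - previous))
        previous = value
    return trend

comments = [
    (date(2010, 1, 1), 1.0),
    (date(2010, 1, 1), 1.0),
    (date(2010, 1, 2), -1.0),
    (date(2010, 1, 3), -1.0),
]
trend = opposition_trend(comments, start=date(2010, 1, 1), periods=3)
```

A run of positive increments indicates that disagreement around a topic is growing, which is the signal the hot spot ranking would look for.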
4.4. Information extraction
This module mainly produces structured data and populates several databases for analysis, in order to confirm or explain the patterns that have been mined. GATE [6] can be used for entity recognition, entity-relationship recognition, event recognition, summary generation, etc.
4.5. Popular feelings information handling
1) Warning
The public opinion warning module collects network information, discovers problems and gives feedback. Warning is active: for a given time period, it shows the events related to the theme and the trend of the topic.
2) Filtering
Filtering removes bad information. Network administrators remove negative news through continuous monitoring. Sensitive phrases are collected from different fields, each phrase is assigned a weight, and intelligent software matches sensitive phrases in the text according to these weights. Information whose score exceeds a set threshold is blocked.
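The weighted matching rule can be sketched in a few lines. The phrases, weights and threshold below are invented for illustration, and simple substring counting stands in for whatever matching the intelligent software actually performs.

```python
def sensitivity_score(text, weights):
    """Sum the weights of all sensitive phrases found in the text, counting repeats."""
    lowered = text.lower()
    return sum(weight * lowered.count(phrase) for phrase, weight in weights.items())

def should_block(text, weights, threshold):
    """Block the text when its score exceeds the threshold."""
    return sensitivity_score(text, weights) > threshold

# Hypothetical phrase weights and threshold.
WEIGHTS = {"pyramid scheme": 0.7, "guaranteed profit": 0.5, "lottery": 0.2}
THRESHOLD = 1.0

blocked = should_block("Join this pyramid scheme for guaranteed profit", WEIGHTS, THRESHOLD)
passed = should_block("Lottery results announced", WEIGHTS, THRESHOLD)
```

Using weights and a threshold instead of a plain blacklist means that a single low-weight phrase does not block a text by itself, although the rule is still keyword-based and shares the limits discussed for bad information detection.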
3) Counter
First, the IP address of the source is obtained and locked. Effective attack methods (for example information penetration, virus techniques, advanced hacker attack techniques and so on) can then be used for targeted attacks on hub websites that disseminate unsafe information, which prevents the unsafe information from spreading.
4) Monitoring
After the user enters the start time of monitoring, the system lists all events or topics related to the subject; the user selects the suspected event or topic, and the monitoring module then monitors it continuously. Monitoring differs from early warning in that monitoring is passive surveillance while early warning is active.
5) Decision
A complete decision usually cannot be reached in a single step; decision-making is an iterative process. In this process, policy makers can use human-computer interaction to adjust the parameters of different options and alternatives.
5. Conclusions
Traditional machine learning methods involve a heavy workload, because netizens' comments must be tagged manually to train classifiers. Because comments are relatively short and their emotional vocabulary is broad, this paper applies semantics-based content identification technology to design a framework for a network public opinion analysis and monitoring system. As a next step, we will carry out experiments to show that the system can achieve satisfactory results.
6. References
[1] Li Yonghao. Simulation and analysis of rapid screening algorithms about network hot topics. Computer Communication Laboratory, Beijing Jiaotong University, internal communication documents. 2006: 14-16.
[2] Founder Technology Research Institute. Public opinion on science and technology means to support network monitoring and analysis of unexpected events - Founder ZhiSi public opinion warning DSS. Informatization. 2005: 50-52.
[3] http://www.goonie.cn/news/industrynews/2008/05/2008-05-03122.html
[4] Li Yanling. Security monitoring system framework and its key technology of BBS content. Research Institute of China Electronics. 2007, 2(4): 144-149.
[5] Jin Yaohong. HNC language understanding technology and its applications [M]. Beijing: Science Press. 2006.
[6] http://gate.ac.uk/