The Framework of Network Public Opinion Monitoring and Analyzing System Based on Semantic Content Identification

Cheng Xian-Yi 1,2, Zhu Ling-ling 1, Zhu Qian 2, Wang Jin 1

1. School of Computer Science and Technology, Nantong University, Nantong 226019, China

2. School of Computer Science and Communications Engineering, Jiangsu University, Zhenjiang 212013, China

E-mail: [email protected]

doi:10.4156/jcit.vol5.issue10.7

Abstract

The Internet has become an important platform for the public to express opinions, discuss public affairs, and participate in economic, social, and political life. Because the volume of public opinion spread over the network grows geometrically, the government needs to monitor and analyze network public opinion in order to manage public opinion information, discover hot spots in time, and guide public opinion trends correctly. Network public opinion monitoring and analysis has therefore become a hot issue in recent years. The main mature technology today is keyword-based statistical analysis, but there is still much room to improve its effectiveness. This paper describes a framework for a network public opinion monitoring and analysis system based on semantic content identification, intended to solve some key problems in public opinion analysis.

Keywords: Network Public Opinion, Natural Language Processing, Semantics

1. Introduction

In recent years, with the development of the Internet and the growing number of netizens, some people have disclosed and spread sensitive and harmful information through forums, instant messaging, e-mail, and other channels, which threatens social stability and people's lives and property. On the one hand, national legislation and regulations should pay closer attention to the focal points of public opinion in order to serve the public better; on the other hand, the government bears a vital responsibility for correctly monitoring and guiding sensitive public opinion, so as to protect network users from harmful information and build a harmonious socialist country. According to preliminary statistics from the Internet center, there have been large direct and indirect losses since 1996, as shown in Figure 1. Network public opinion monitoring and analysis has therefore become an urgent and important issue [1].

Figure 1. Statistics of the harm to society caused by the spread of network public opinion (the Y axis shows state losses, in millions of dollars)

The most important technologies for network public opinion analysis include text filtering, text classification, clustering, opinion tendency recognition, topic tracking, and automatic summarization, all of which have long been studied by researchers in China and abroad. To control information more effectively, this paper describes a framework for a network public opinion monitoring and analysis system based on semantic content identification.

Journal of Convergence Information Technology

Volume 5, Number 10. December 2010

2. Research Situation

Researchers from DARPA, CMU, the University of Massachusetts, and Dragon Systems, Inc. defined the topic detection and tracking task and developed TDT. The key technology of this project is content classification of information, which resolves the conflict between processing speed and monitoring safety and thus makes real-time monitoring feasible. Related work abroad includes PICS from the W3C, which has become a classification standard for the WWW. Two general international classification standards, RSACi and SafeSurf, both conform to PICS. However, these classification techniques are aimed at web page classification and filtering, and foreign policies and standards are, for various reasons, not fully suited to China's national conditions.

In China, the Founder ZhiSi public opinion warning DSS [2], designed by the Founder Technology Research Institute, is a successful example. The system achieves automatic real-time monitoring and analysis of massive amounts of public opinion. It lets the government monitor public opinion more effectively than the traditional manual approach, strengthens the supervision of Internet information, and plays a role in handling sudden public events on the network. The DSS provides functions including full-text retrieval, automatic classification, automatic clustering, topic detection and tracking, related recommendation and duplicate removal, association and trend analysis, automatic summarization and keyword extraction, sudden-event analysis, and statistics generation.

The Goonie network public opinion and information monitoring system combines Internet search technology, intelligent information processing technology, and knowledge management methods. Through automatic collection, automatic classification and clustering, topic gathering, and special-topic focusing, it monitors network public opinion and tracks special news, and produces briefings, reports, and similar output. Goonie can therefore help users grasp public opinion, guide it properly, and provide analysis reports [3].

A content security monitoring framework based on human-computer collaboration was designed in [4]. The framework is a hierarchy with three levels: a data acquisition layer, a content analysis layer, and an output layer. Its main function is to examine information by content analysis and identify harmful information; at the same time, by recording the source and content of information and tracking them through effective audit analysis, it can provide electronic evidence of the misuse of information.

Although many domestic organizations are engaged in research on Internet content filtering and are trying to purify the network environment, these techniques are still in their infancy, and there are still deficiencies in semantic information filtering.

3. System Framework

The purpose of the system is to monitor and report on network public opinion in a large-scale network environment through detection, acquisition, theme management, hot topic and event tracking, and monitoring, and to present the analysis results in several forms, such as briefings, reports, and charts. The system can therefore help users grasp public opinion, guide it properly, and provide analysis reports. Figure 2 shows the block diagram of the function modules of the network public opinion monitoring system. There are five stages: resource discovery, information selection, pattern discovery, information extraction, and public opinion handling.


[Figure 2: block diagram. Labels: sources (BBS, blog, e-mail, chat room, web); stages (resources discovery, select information, pattern discovery, information extraction, public opinion handling); resources discovery (digital collection, format shifting, data import/export); select information (filter noise, extract the subject, cluster, filter); pattern discovery (topic pattern, event pattern, semantic index, tendency analysis); information extraction (text purification, text analysis and audit, unstructured text to structured text, structured information and knowledge); public opinion handling (warning, filter, counter, monitor, decision); client and server (theme management, topic search, event search, trend analysis, report, abstract).]

Figure 2. Logic structure of the function modules of the network public opinion monitoring system

Figure 3 shows the system workflow. The system includes the following five databases:

1) Public opinion planning information database: collects the required public opinion information from online news, BBS, chat rooms, blogs, aggregated news (RSS), etc.

2) Public opinion analysis information database: stores the data produced by classification and clustering, keyword extraction, duplicate removal and filtering, named entity recognition, semantic computing, etc. This database is structured.

3) Public opinion database: stores products related to public opinion, such as analysis reports, survey reports, experience summaries, and related information.

4) Semantic dictionary: ontology knowledge, etc.

5) HNC knowledge: 466 items of sentence knowledge, etc. [5].

[Figure 3: workflow diagram. Labels: public sentiment; resources discovery; select information; pattern discovery; information extraction; public opinion handling; public opinion planning database; public opinion analysis information; public opinion products; HNC knowledge; semantic dictionary.]

Figure 3. System workflow of network public opinion monitoring


Figure 4 shows the client workflow.

[Figure 4: flowchart. Labels: show topic list; modify the theme? (yes: theme management; no: theme selection); server-side processing; information extraction; public opinion information handling (warning, decision, monitor, counter, filter); topic retrieval; event retrieval; summary of public opinion; public report; user; continue; exit.]

Figure 4. Client workflow

Figure 5 shows the data flow of the system. The modules interact in different ways. The resources discovery module and the select information module exchange data through files. The select information module converts text into vectors or ontology representations. The pattern discovery module uses GATE to tag named entities, determines the relationships between entities, and then discovers event patterns or topic patterns. The information extraction module mainly performs semantic computing and transforms the patterns into templates, which turn unstructured information into structured information. The public opinion handling module runs queries according to user requests and presents the results to the user in a suitable form; it also receives the user's settings and query requests.

[Figure 5: data flow diagram. Labels: network information; resources discovery; select information; pattern discovery; information extraction; public opinion handling; server; client; analysis of unstructured text; vector, ontology; structured results database; user requests; results display.]

Figure 5. Data flow chart of the system


Figure 6 shows the network topology of the entire system. The system can serve many users, and each user connects to a server. Servers can share data and exchange information with each other over the network, in either a P2P or a client-server arrangement. The topology will be continually modified and optimized in the future.

[Figure 6: topology diagram. Labels: network media; law enforcement; special content management; Network Communication Office; administrative; extranet.]

Figure 6. Network topology

4. Working Process

4.1. Resources discovery based on latent semantic analysis

Resources discovery retrieves the necessary network resources. It is a process of integrating, consolidating, and mapping data according to the different patterns of network information. Different resources require different retrieval tools and strategies.

Text from BBS, chat rooms, and e-mail is short and informal. First, DTS is used to import and export the documents. Then theme-based latent semantic analysis is applied to overcome two weaknesses of keyword-based algorithms, namely ignoring context and misjudging synonyms, while SVD is used to filter information and remove noise. Based on document content similarity calculation and cluster analysis, topic drift can be found effectively and in time, which better meets the requirements of public opinion surveillance.
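The latent semantic step can be illustrated with a minimal sketch. It assumes a plain bag-of-words document-term matrix and finds the leading right singular vectors by power iteration, so that it needs only the standard library; in practice a library SVD routine would normally be used. The function names and the toy corpus are hypothetical and are not taken from the system described in this paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def doc_term_matrix(docs):
    """Bag-of-words counts: one row per document, one column per term."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for w in d.split():
            row[index[w]] += 1.0
        rows.append(row)
    return rows

def top_right_singular_vectors(D, k, iters=300):
    """Leading right singular vectors of D, found by power iteration
    on D^T D with deflation against the vectors already found."""
    n = len(D[0])
    found = []
    for j in range(k):
        v = [1.0 + 0.01 * ((i + j) % 7) for i in range(n)]  # fixed non-zero start
        for _ in range(iters):
            Dv = [dot(row, v) for row in D]                          # D v
            w = [sum(row[c] * x for row, x in zip(D, Dv)) for c in range(n)]  # D^T (D v)
            for u in found:                                          # deflation
                p = dot(w, u)
                w = [wi - p * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:        # rank exhausted: fewer than k directions exist
                return found
            v = [wi / norm for wi in w]
        found.append(v)
    return found

def lsa_project(docs, k=2):
    """Coordinates of each document in the k-dimensional latent space."""
    D = doc_term_matrix(docs)
    V = top_right_singular_vectors(D, k)
    return [[dot(row, v) for v in V] for row in D]

docs = [
    "car engine",
    "automobile motor",
    "car engine automobile motor",
    "rice noodle",
    "soup dumpling",
    "rice noodle soup dumpling",
    "rice noodle soup dumpling",
]
coords = lsa_project(docs, k=2)
raw = doc_term_matrix(docs)
```

In this toy corpus the first two documents share no words, so their raw cosine similarity is zero, yet they co-occur with the same terms in the third document and therefore end up close together in the latent space. Clustering on `coords` rather than on raw counts is one way to reduce the synonym misjudgment mentioned above.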

4.2. Select information

Select information obtains specialized information from the network by automatic selection and pre-processing. First, it filters noise, recognizes named entities, and extracts subjects and events. Second, it classifies, clusters, and filters the text according to topics or events. Finally, it discriminates the text.

1) Text classification based on semi-supervised learning

A distinctive feature of public opinion information is that it consists of short texts in massive quantities. Traditional text classification is supervised learning: it learns from samples labeled with predefined category tags and then determines the category of a text according to its semantic content. It needs a large number of labeled samples to train a good classifier. Unlabeled data is easy to obtain in large quantities, but labeled data is costly and impractical to produce, which creates a bottleneck when traditional text classification is applied to huge amounts of data. We use text classification based on semi-supervised learning to overcome the sparsity of short texts and improve the accuracy of short text classification. To increase the robustness of the algorithm and better avoid falling into local optima, we integrate the Bagging algorithm into semi-supervised learning.
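The paper does not name the semi-supervised algorithm, so the following is only one possible reading: self-training, in which an ensemble trained on bootstrap samples of the labeled texts votes on the unlabeled texts, and texts with a confident vote are added to the labeled set. The word-frequency base classifier, the stratified bootstrap, the `min_share` threshold, and all names and data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train(labeled):
    """labeled: list of (text, label). Returns per-class relative word frequencies."""
    counts = defaultdict(Counter)
    for text, label in labeled:
        counts[label].update(text.split())
    return {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
            for c, cnt in counts.items()}

def predict(model, text):
    """Best-scoring class, or None when the text shares no word with any class."""
    scores = {c: sum(freq.get(w, 0.0) for w in text.split())
              for c, freq in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def bagged_models(labeled, n_models, rng):
    """Train n_models classifiers, each on a stratified bootstrap sample
    (resampling with replacement inside every class)."""
    by_class = defaultdict(list)
    for item in labeled:
        by_class[item[1]].append(item)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(items) for items in by_class.values() for _ in items]
        models.append(train(sample))
    return models

def vote(models, text):
    """Majority label among non-abstaining models, and its share of all models."""
    votes = Counter(p for p in (predict(m, text) for m in models) if p is not None)
    if not votes:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / len(models)

def self_train(labeled, unlabeled, n_models=15, min_share=0.6, max_rounds=5, seed=0):
    """Grow the labeled set with confidently voted pseudo-labels, then
    return the final ensemble and the enlarged labeled set."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        models = bagged_models(labeled, n_models, rng)
        accepted = []
        for text in pool:
            label, share = vote(models, text)
            if label is not None and share >= min_share:
                accepted.append((text, label))
        if not accepted:
            break
        labeled.extend(accepted)
        taken = {text for text, _ in accepted}
        pool = [t for t in pool if t not in taken]
    return bagged_models(labeled, n_models, rng), labeled

labeled = [("match goal", "sports"), ("match team", "sports"),
           ("price market", "economy"), ("price bank", "economy")]
unlabeled = ["match referee goal", "referee penalty match",
             "price inflation market", "inflation tax price"]
models, enlarged = self_train(labeled, unlabeled)
```

Before self-training, a short text such as "referee penalty" shares no word with the four labeled examples, which is the sparsity problem described above. After the unlabeled texts have been pseudo-labeled, the ensemble has seen those words in context and can vote on it.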

2) Bad information detection


Bad information detection is one of the key factors of a website content monitoring system. Traditional network detection systems recognize and filter network information based only on keywords. If a number of cult sites are to be blocked this way, pages that criticize the cult will also be filtered out. We therefore propose a method for detecting harmful content based on HNC (Figure 7), which does not rely on keyword matching but decides which text needs to be filtered according to the meaning of its sentences.

4.3. Pattern Discovery

Pattern discovery performs hot topic detection, tracking of events of concern, and tendency analysis by applying data mining and semantic computing to the data from the select information module. This module is the core of the system.

Pattern discovery proceeds as follows:

1) First, we use ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, to perform word segmentation and POS tagging, and obtain four tables:

Theme table (ID, title, text, author, time, vector)

Comment table (ID, theme ID, title, text, author, time, tendency value)

Topic table (ID, keyword group, number of participants, time, polarity, viewpoint opposition, notes)

Topic-theme map table (topic ID, theme ID)

A theme ID is assigned incrementally when a theme is inserted into the database, and the theme ID is stored with each comment to keep the correspondence between a comment and its theme. The third table holds the basic clustering information; the fourth table records which themes each cluster (topic) contains.
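As an illustration only, the four tables could be laid out as follows in SQLite through Python's standard `sqlite3` module. The paper does not say which database is used; the table names, column names, and column types below are assumptions derived from the field lists above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE themes (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned incrementally on insert
    title     TEXT,
    body      TEXT,
    author    TEXT,
    posted_at TEXT,
    vector    TEXT                                -- serialized feature vector
);
CREATE TABLE comments (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    theme_id  INTEGER NOT NULL REFERENCES themes(id),
    title     TEXT,
    body      TEXT,
    author    TEXT,
    posted_at TEXT,
    tendency  REAL                                -- tendency value of the comment
);
CREATE TABLE topics (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    keywords     TEXT,
    participants INTEGER,
    posted_at    TEXT,
    polarity     REAL,
    opposition   REAL,
    notes        TEXT
);
CREATE TABLE topic_theme (
    topic_id INTEGER NOT NULL REFERENCES topics(id),
    theme_id INTEGER NOT NULL REFERENCES themes(id),
    PRIMARY KEY (topic_id, theme_id)
);
"""

def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_theme(conn, title, body, author, posted_at):
    cur = conn.execute(
        "INSERT INTO themes (title, body, author, posted_at) VALUES (?, ?, ?, ?)",
        (title, body, author, posted_at))
    return cur.lastrowid

def add_comment(conn, theme_id, body, author, posted_at, tendency=None):
    cur = conn.execute(
        "INSERT INTO comments (theme_id, body, author, posted_at, tendency) "
        "VALUES (?, ?, ?, ?, ?)",
        (theme_id, body, author, posted_at, tendency))
    return cur.lastrowid

def add_topic(conn, keywords):
    return conn.execute("INSERT INTO topics (keywords) VALUES (?)", (keywords,)).lastrowid

def link_topic_theme(conn, topic_id, theme_id):
    conn.execute("INSERT INTO topic_theme (topic_id, theme_id) VALUES (?, ?)",
                 (topic_id, theme_id))

def comments_for_topic(conn, topic_id):
    """All comments on the themes that belong to one topic (cluster)."""
    rows = conn.execute(
        "SELECT c.body, c.tendency FROM comments c "
        "JOIN topic_theme m ON m.theme_id = c.theme_id "
        "WHERE m.topic_id = ? ORDER BY c.id", (topic_id,))
    return rows.fetchall()
```

The map table keeps the many-to-many relation between topics and themes separate from the clustering data, so re-clustering only rewrites `topics` and `topic_theme` and leaves themes and comments untouched.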

[Figure 7: algorithm diagram. Processing steps: article preprocessing; sentence analysis; context generation; short-term memory; semantic position judgment; red and black check; network map object position. Knowledge sources: HNC concept knowledge base and word knowledge; HNC semantic judgment library; red and black object library. Intermediate results: elements of the text framework; sentence semantic structure. Output, text nature: 1 absolutely black, 2 absolutely red, 3 black, 4 suspicious III, 5 suspicious II, 6 suspicious I, 7 neutral.]

Figure 7. Bad information detection algorithm diagram

2) Tendency Analysis

First we prepare a tendency dictionary. An initial dictionary is built from the words in HowNet that are manually labeled with polarity and strength, and some common words are then added by hand. Because tendency must be looked up quickly, the dictionary is stored in a hash table provided by the Java language.


Next, we read the text and process it sentence by sentence: remove the stop words from each sentence, look up each word in the tendency dictionary, and compute the polarity and strength of each word in its context. Then we add up all the polarity components and divide the sum by the square root of the number of comments to obtain the tendency value.

Finally, comments are divided into tendency classes and ranked according to the distribution of their tendency values.
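A minimal sketch of this dictionary-based scoring is given below. It uses a Python `dict` in the role of the hash table, a tiny hypothetical English dictionary in place of the HowNet-derived one, and a simple rule that a negator directly before a word flips its polarity as a stand-in for the context calculation. Reading the normalization as "sum over all comments of a theme divided by the square root of the number of comments" is an assumption.

```python
import math
import re

# Hypothetical tendency dictionary: word -> (polarity, strength).
TENDENCY = {
    "good": (1, 1.0), "excellent": (1, 2.0),
    "bad": (-1, 1.0), "terrible": (-1, 2.0),
}
NEGATORS = {"not", "never", "no"}
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "this"}

def sentence_score(sentence):
    """Sum of polarity * strength over the dictionary words of one sentence."""
    words = [w for w in re.findall(r"[a-z']+", sentence.lower())
             if w not in STOP_WORDS]
    score = 0.0
    for i, w in enumerate(words):
        if w in TENDENCY:
            polarity, strength = TENDENCY[w]
            if i > 0 and words[i - 1] in NEGATORS:  # simple context rule
                polarity = -polarity
            score += polarity * strength
    return score

def comment_score(comment):
    """Process a comment sentence by sentence and add up the sentence scores."""
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    return sum(sentence_score(s) for s in sentences)

def theme_tendency(comments):
    """Total polarity of all comments, divided by sqrt(number of comments)."""
    if not comments:
        return 0.0
    return sum(comment_score(c) for c in comments) / math.sqrt(len(comments))

comments = [
    "The service was excellent. The food was not bad.",
    "Terrible service.",
]
value = theme_tendency(comments)
```

Dividing by the square root rather than by the count itself keeps the value from shrinking toward zero as fast as a plain average would when many comments agree, while still damping themes that merely have many comments.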

3) Public opinion hot spot analysis: the comments are queried from the database according to the topic-theme map table, and topics are ranked by their comments to find hot spots.

The concern for an event is calculated from the opposition of views on the topic together with its comments. An initial time point is selected, and a unit of time (for example, a day) is used as the basic accumulation unit. For each time point, the opposition of views on the topic is calculated using only the comments made before that point. The increment in opposition for the period is then obtained by subtracting the value at the previous time point from the value at the current one. In this way the trend of the event can be obtained.
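The accumulation and differencing can be sketched as follows. The paper does not define how opposition is measured, so the formula used here (1 when positive and negative comments are evenly split, 0 when all comments agree) is purely an assumption, as are the function names and the sample data.

```python
def rank_hot_topics(comment_counts):
    """Topic IDs ordered by number of comments, most commented first."""
    return sorted(comment_counts, key=comment_counts.get, reverse=True)

def opposition(tendencies):
    """Assumed measure: 1 - |pos - neg| / (pos + neg), or 0 without polar comments."""
    pos = sum(1 for t in tendencies if t > 0)
    neg = sum(1 for t in tendencies if t < 0)
    if pos + neg == 0:
        return 0.0
    return 1.0 - abs(pos - neg) / (pos + neg)

def opposition_trend(comments, start_day, end_day):
    """comments: list of (day, tendency value).
    Returns (day, cumulative opposition, increment over the previous day)."""
    trend = []
    previous = 0.0
    for day in range(start_day, end_day + 1):
        seen = [t for d, t in comments if d <= day]  # comments up to this day only
        value = opposition(seen)
        trend.append((day, value, value - previous))
        previous = value
    return trend

comments = [(1, 1.0), (1, 0.5), (2, -1.0), (3, -2.0)]
trend = opposition_trend(comments, 1, 3)
```

A run of positive increments indicates that views on the topic are becoming more divided over time, which is the kind of signal the trend analysis is meant to surface.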

4.4. Information extraction

This module mainly produces structured data and builds several databases for analysis, and confirms or explains the patterns that have been mined. GATE [6] can be used for entity recognition, entity-relationship recognition, event recognition, summary generation, etc.

4.5. Public opinion information handling

1) Warning

The public opinion warning module collects network information, discovers problems, and gives feedback. Warning is active: for a given time period, it shows the events related to the theme and the trends of the topics.

2) Filtering

Filtering removes harmful information. The network administrators remove negative information through continuous monitoring. Sensitive phrases are collected from different fields, each phrase is assigned a weight, and intelligent software matches sensitive phrases according to their weights. Information whose score exceeds an established threshold is blocked.
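The weighted matching in the filtering step can be sketched in a few lines. The phrases, weights, and threshold below are hypothetical placeholders, and summing the weight once per occurrence is an assumption about how the score is accumulated.

```python
def sensitivity_score(text, weights):
    """Sum of phrase weight * number of occurrences of the phrase in the text."""
    lowered = text.lower()
    return sum(w * lowered.count(phrase) for phrase, w in weights.items())

def should_block(text, weights, threshold):
    """Block the text when its score exceeds the established threshold."""
    return sensitivity_score(text, weights) > threshold

# Hypothetical phrase list; a real deployment would collect phrases per field.
WEIGHTS = {"scam": 0.5, "free money": 0.8}

message = "Free money! This scam offers free money."
score = sensitivity_score(message, WEIGHTS)
blocked = should_block(message, WEIGHTS, threshold=1.0)
```

Because this is still phrase matching, it shares the limitation discussed for keyword filters: a text that quotes a sensitive phrase in order to criticize it scores the same as one that promotes it.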

3) Counter

First, the IP address of the source is obtained and locked. Effective attack methods (for example information penetration, virus technology, or advanced hacker attack technology) can then be used to carry out targeted attacks on hub websites that disseminate unsafe information. This prevents the unsafe information from spreading.

4) Monitoring

After the user enters the start time of monitoring, the system lists all the events or topics related to the subject. The user selects the suspected event or topic, and the monitoring module then monitors it continuously. Monitoring differs from warning: monitoring is passive surveillance, while warning is active.

5) Decision

A complete decision often cannot be made in one step; it is an iterative process. In this process, policy makers can use human-computer interaction to adjust parameters and compare different options and alternatives.

5. Conclusions

Traditional machine learning methods carry a heavy workload, because netizens' texts must be tagged manually to train classifiers. This paper applies semantics-based content identification technology to design a framework for a network public opinion analysis and monitoring system that suits comments that are relatively short and use a broad emotional vocabulary. As the next step, we will run experiments to show that the system can achieve satisfactory results.


6. References

[1] Li Yonghao. Simulation and analysis of Rapid screening algorithms about network hot topics. Computer Communication Laboratory, Beijing Jiaotong University, internal communication documents. 2006.14-16

[2] Founder Technology Research Institute. Public opinion on science and technology means to support network monitoring and analysis of unexpected events - Founder ZhiSi public opinion warning DSS. Informatization. 2005:50-52

[3] http://www.goonie.cn/news/industrynews/2008/05/2008-05-03122.html

[4] Li Yanling. Security monitoring system framework and its key technology of BBS content. Research Institute of China Electronics. 2007,2(4):144-149

[5] Jin Yaohong. HNC language understanding technology and its applications [M]. Beijing: Science Press. 2006.

[6] http://gate.ac.uk/
