Polarity analysis for sentiment classification

WTF Algorithm

•Threeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee !!!!!! I can’t stop …

•There are three classifiers.

Problem

• Input: 800 positive and 800 negative training reviews, plus 400 test reviews

• Output: the polarity of each of the 400 test reviews

Online Classifier

• Language Model: a statistical model that assigns a probability to a sequence of m words.

• Winnow Classifier: a machine learning algorithm for learning a linear classifier.

• Passive-Aggressive Classifier: a linear classifier trained with the Passive-Aggressive algorithm.
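As a rough illustration of the two linear learners named above, the sketch below shows one online update step for each. It assumes the textbook forms (Winnow with 0/1 features and a multiplicative factor, the basic Passive-Aggressive update with hinge loss). The function names, `alpha = 2.0`, and the threshold values are illustrative assumptions, not taken from the project code.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def winnow_update(w, x, label, threshold, alpha=2.0):
    """One Winnow step. x holds 0/1 features; label is 1 (positive) or 0."""
    predicted = 1 if dot(w, x) > threshold else 0
    if predicted == label:
        return list(w)                      # correct: leave the weights alone
    factor = alpha if label == 1 else 1.0 / alpha
    # Promote (or demote) only the weights of the active features.
    return [wi * factor if xi else wi for wi, xi in zip(w, x)]


def pa_update(w, x, y):
    """One Passive-Aggressive step. y is +1 or -1."""
    loss = max(0.0, 1.0 - y * dot(w, x))    # hinge loss
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)                      # passive: margin already satisfied
    tau = loss / norm_sq                    # aggressive: smallest step that fixes it
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

Both functions return a new weight list, so they can be called once per training example in a shuffled pass.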

From reference paper

• Sieve n-grams by

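• Take special care when evaluating the formula. It involves many multiplications and divisions; Math.log can bring the magnitudes down and prevent possible overflow, but it also introduces serious floating-point error. Plain double arithmetic, used appropriately, is therefore enough, and it does not matter if a NaN turns up.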
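The trade-off in the note above can be seen on a toy product. The sketch below compares direct double-precision multiplication with the log-sum route, and shows one way to make NaN scores harmless by simply dropping them. The factor values and the `drop_nan` helper are assumptions for illustration; the actual sieve formula is not reproduced here.

```python
import math


def product_direct(factors):
    """Multiply the factors directly in double precision."""
    result = 1.0
    for f in factors:
        result *= f
    return result


def product_via_logs(factors):
    """Same product computed as exp(sum(log f)); assumes every f > 0."""
    return math.exp(sum(math.log(f) for f in factors))


def drop_nan(scores):
    """Discard n-grams whose score came out as NaN."""
    return {gram: s for gram, s in scores.items() if not math.isnan(s)}


factors = [3.0, 0.5, 7.0, 0.25]
direct = product_direct(factors)       # exact for these values: 2.625
via_logs = product_via_logs(factors)   # close to 2.625, not guaranteed bit-identical
kept = drop_nan({"good": direct, "odd": float("nan")})
```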

Algorithm－preprocess

global_record(n-grams, score)

for i = 0 to CROSS_VALIDATION_MAX
    shuffle_order(training_data)
    (ttraining, ttest) = split(training_data, 1 : 1)
    Vectraining = n-grams_sieve(ttraining)
    (LM, Winnow, PA) = training(Vectraining)
    P = test(LM, Winnow, PA, Vectraining)
    global_record.add(P, Vectraining)

Each loop iteration takes about one minute. The goal is to pick out features that hold up in general.
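A minimal sketch of the accumulation loop above, under one reading of `add(P, Vectraining)`: every n-gram that survives the sieve in a round is credited with that round's performance P. The `sieve` and `evaluate` callbacks are hypothetical stand-ins for the n-gram sieve and for training plus testing the three classifiers.

```python
import random
from collections import defaultdict


def accumulate_scores(training_data, sieve, evaluate, rounds, seed=0):
    """Repeat shuffle / 1:1 split / sieve / evaluate, summing P per n-gram.

    sieve(t_training) -> iterable of n-grams            (hypothetical callback)
    evaluate(t_training, t_test, ngrams) -> performance (hypothetical callback)
    """
    rng = random.Random(seed)
    data = list(training_data)
    global_record = defaultdict(float)
    for _ in range(rounds):
        rng.shuffle(data)
        half = len(data) // 2
        t_training, t_test = data[:half], data[half:]
        ngrams = list(sieve(t_training))
        p = evaluate(t_training, t_test, ngrams)
        for gram in ngrams:
            global_record[gram] += p        # n-grams that keep recurring keep gaining
    return dict(global_record)
```

N-grams that only appear under one particular split collect little credit, which is the sense in which the loop favors general features.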

Algorithm－preprocess

sort global_record(n-grams, score) in descending order

shuffle_order(training_data)
(ttraining, tretain) = split(training_data, 4 : 1)
N-gramsTable = global_record.sublist(K)
Vectraining = parsingInput(ttraining, N-gramsTable)
Vecretain = parsingInput(tretain, N-gramsTable)
(LM, Winnow, PA) = training(Vectraining)
(LM, Winnow, PA) = retraining(Vecretain)

Train on the general features while guarding against overtraining.

simpleTest(testdata, LM, Winnow, PA)
onlineTest(testdata, LM, Winnow, PA)
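The ranking and the 4 : 1 split in the steps above could look like the following. This is only a sketch: `top_k` and `split_4_to_1` are illustrative names, and the record is assumed to be a plain mapping from n-gram to accumulated score.

```python
def top_k(global_record, k):
    """Return the k n-grams with the highest accumulated score."""
    ranked = sorted(global_record.items(), key=lambda item: item[1], reverse=True)
    return [gram for gram, _ in ranked[:k]]


def split_4_to_1(data):
    """Split into (first 80%, last 20%); shuffle the data beforehand."""
    cut = len(data) * 4 // 5
    return data[:cut], data[cut:]
```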

Algorithm－more classifier

Classifier Vector(x) = (LM_POS, LM_NEG, WINNOW_POS, WINNOW_NEG, PA_POS, PA_NEG)

• LM_POS = 0/1

• LM_NEG = 0/1

• WINNOW_POS = h(x)/selfMAX, h(x) > threshold

• WINNOW_NEG = h(x)/selfMAX, h(x) < threshold

• PA_POS = dot(x, w)/selfMAX, dot(x, w) > 0

• PA_NEG = dot(x, w)/selfMAX, dot(x, w) < 0

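We found that, because each classifier decides by a threshold, its judgment is quite fragile on data that lands close to that threshold.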
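The six components above can be assembled as in the sketch below. Two points are assumptions: `selfMAX` is taken to be the largest absolute output the classifier produced on its own training data, and a component is set to 0 when its condition does not hold. The argument names are illustrative.

```python
def classifier_vector(lm_positive, winnow_h, winnow_threshold, winnow_max,
                      pa_dot, pa_max):
    """Build (LM_POS, LM_NEG, WINNOW_POS, WINNOW_NEG, PA_POS, PA_NEG)."""
    lm_pos = 1.0 if lm_positive else 0.0
    lm_neg = 0.0 if lm_positive else 1.0
    winnow_pos = winnow_h / winnow_max if winnow_h > winnow_threshold else 0.0
    winnow_neg = winnow_h / winnow_max if winnow_h < winnow_threshold else 0.0
    pa_pos = pa_dot / pa_max if pa_dot > 0 else 0.0
    pa_neg = pa_dot / pa_max if pa_dot < 0 else 0.0
    return (lm_pos, lm_neg, winnow_pos, winnow_neg, pa_pos, pa_neg)


vec = classifier_vector(True, 6.0, 4.0, 8.0, -1.5, 3.0)
# vec == (1.0, 0.0, 0.75, 0.0, 0.0, -0.5)
```

Keeping the normalized magnitude, rather than only the 0/1 decision, lets a second-stage classifier see how far each score was from its threshold.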

Why do we need an online classifier?

• Dynamically modifying the definition of the feature vector: difficult!

• From another point of view: we may train the model from viewpoint A, but the test data must be seen from viewpoint B.

Algorithm－online test

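GROUP_SIZE = 50
groups = splitEach(testdata, GROUP_SIZE)
foreach (group : groups)
    appear_set = find(group, N-gramsTable)
    (LM, Winnow, PA) = retrainingLimited(Vectraining, appear_set)
    foreach (data : group)
        classify(LM, Winnow, PA, data)

If every item in a group has the same polarity, pick a few items at random from the training data.

Retraining from the training data using only the features currently visible should, in principle, have the same effect. Depending on the number of iterations, it usually finishes within a few seconds.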
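A compact sketch of the grouping logic above. Documents are assumed to be already reduced to sets of n-grams, and `retrain_limited` and `classify` are hypothetical callbacks standing in for the three classifiers.

```python
def split_each(items, group_size):
    """Cut the test data into consecutive groups of at most group_size."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def find_appearing(group, ngram_table):
    """N-grams from the table that occur in at least one document of the group."""
    table = set(ngram_table)
    seen = set()
    for doc_ngrams in group:
        seen.update(table.intersection(doc_ngrams))
    return seen


def online_test(test_docs, ngram_table, retrain_limited, classify, group_size=50):
    """retrain_limited(appear_set) -> model; classify(model, doc) -> label."""
    labels = []
    for group in split_each(test_docs, group_size):
        appear = find_appearing(group, ngram_table)
        model = retrain_limited(appear)     # retrain on the visible features only
        labels.extend(classify(model, doc) for doc in group)
    return labels
```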

Algorithm－online test fail

GROUP_SIZE = 10
groups = splitEach(testdata, GROUP_SIZE)
foreach (group : groups)
    backtrack over the 2^10 possible solutions
        training & test
        if (better performance)
            record solution

It works, but it is not helpful: the effect is like viewing the world through a crowd of madmen.
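With 10 items per group and two polarities there are 2^10 = 1024 candidate labelings. The sketch below enumerates them exhaustively with `itertools.product` instead of explicit backtracking, which covers the same search space. `evaluate` is a hypothetical callback that would train on the assumed labels and report performance.

```python
from itertools import product


def best_labeling(group, evaluate):
    """Try every +1/-1 assignment for the group and keep the best one."""
    best_labels, best_score = None, float("-inf")
    for labels in product((1, -1), repeat=len(group)):
        score = evaluate(group, labels)
        if score > best_score:              # "if (better performance)"
            best_labels, best_score = labels, score
    return best_labels, best_score
```

Each call to `evaluate` implies a retraining pass, so the cost grows quickly with the group size.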

Vector

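• After selecting the top-K feature n-grams, deciding what weight to place in the perceptron's vector is still difficult. In our experiments, simply using the n-gram appearance count as the attribute weight did not work well. We then tried Math.log(n-gram appearance count), which did not work well either: the differences it produces may be too small relative to floating-point error, since Math.log values are small to begin with, and the weight becomes 0 when the appearance count is 1. Adding an extra base to compensate is also hard to tune.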
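The three weighting options discussed above can be put side by side as follows. This is a sketch only: the scheme names and the default `base = 1.0` are assumptions made for illustration.

```python
import math


def attribute_weight(count, scheme="count", base=1.0):
    """Weight of one n-gram given how often it appears in a document."""
    if count <= 0:
        return 0.0                          # n-gram absent from the document
    if scheme == "count":
        return float(count)
    if scheme == "log":
        return math.log(count)              # 0.0 when count == 1
    if scheme == "log_base":
        return base + math.log(count)       # base keeps count == 1 visible
    raise ValueError("unknown scheme: " + scheme)


def to_vector(doc_counts, ngram_table, scheme="count", base=1.0):
    """doc_counts maps n-gram -> appearance count in one document."""
    return [attribute_weight(doc_counts.get(g, 0), scheme, base) for g in ngram_table]
```

Under the plain `log` scheme an n-gram seen once is indistinguishable from one never seen, which is the problem the extra base is meant to patch.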

Support N-grams Sieve

• AFINN-111.txt: a file containing a list of sentiment scores

• Stop word list: small set, |S| < 20

• Synonymous “Not” list: unused

• Abbreviation list: rules, |R| < 10

• No CRF, no parsing tree, no subjective filter

Abbreviation List

• `can't` = `can not`

• `n't` = ` not`

• `'re` = ` are`

• `'m` = ` am`

• `'s` = ` is`

• `'ve` = ` have`

• `'ll` = ` will`

• `&` = ` and`
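The rules above can be applied as ordered string replacements, as in this sketch. The order matters: `can't` has to be handled before the general `n't` rule. The function name is illustrative, and straight apostrophes are assumed in the input.

```python
ABBREVIATIONS = [
    ("can't", "can not"),   # must come before the general n't rule
    ("n't", " not"),
    ("'re", " are"),
    ("'m", " am"),
    ("'s", " is"),
    ("'ve", " have"),
    ("'ll", " will"),
    ("&", " and"),
]


def expand_abbreviations(text):
    """Apply every rule in order to the whole text."""
    for short, full in ABBREVIATIONS:
        text = text.replace(short, full)
    return text
```

For example, `expand_abbreviations("I can't say it's good")` gives `"I can not say it is good"`.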

Stop word List

• the

• it

• he

• she

• they

• we

• you

• -

• a

• an
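Filtering with the list above is a one-line operation. The sketch below assumes whitespace-separated tokens and case-insensitive matching, neither of which is stated in the slides.

```python
STOP_WORDS = {"the", "it", "he", "she", "they", "we", "you", "-", "a", "an"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = remove_stop_words("The movie is a bore".split())
# tokens == ["movie", "is", "bore"]
```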

Detail – (1)

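• When selecting n-grams with the given formula, the 800 positive and 800 negative reviews yield roughly 50M to 100M distinct n-grams. When sieving with n = 3, the single word bad may be stored as (bad, null, null). Selection must guarantee that high-order n-grams keep a certain share, roughly n-grams : (n+1)-grams = 7 : 3.

• When scoring, give high-order n-grams extra score weight; the following is the allocation used in the program. Also make sure the quota is met.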
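The padded storage described above, where a shorter n-gram fills the unused slots with null, can be sketched like this. `None` plays the role of null, and the function name is illustrative. The quota and score weights are not reproduced.

```python
def padded_ngrams(tokens, n=3):
    """All 1..n-grams of tokens, each stored as an n-tuple padded with None."""
    grams = set()
    for start in range(len(tokens)):
        for order in range(1, n + 1):
            if start + order > len(tokens):
                break
            gram = tuple(tokens[start:start + order])
            grams.add(gram + (None,) * (n - order))
    return grams


single = padded_ngrams(["bad"])
# single == {("bad", None, None)}
```

Fixed-width tuples let unigrams, bigrams, and trigrams share one table and one scoring routine.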

Detail – (2)

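• When selecting the top-K features, the positive and negative feature n-grams must each make up about 50%, and n-grams picked on both sides must be dropped. Removing duplicates that are sub-sequences of other n-grams could also be considered, but so far it has not worked well.

• The chained classifiers must be as diverse as possible; chaining more of them is not automatically better. Diversity can be obtained with fewer training iterations and with shuffled training sequences.

• This kind of algorithm, which chains many classifiers together, can draw on the idea of AdaBoost (Adaptive Boosting).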
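The balanced top-K selection in the first point could be sketched as below: half of the budget goes to each polarity, and any n-gram that lands in both halves is dropped. The score dictionaries and the function name are assumptions for illustration.

```python
def balanced_top_k(pos_scores, neg_scores, k):
    """Pick k // 2 n-grams per polarity, then drop those picked on both sides."""
    half = k // 2
    pos_top = sorted(pos_scores, key=pos_scores.get, reverse=True)[:half]
    neg_top = sorted(neg_scores, key=neg_scores.get, reverse=True)[:half]
    both = set(pos_top) & set(neg_top)
    return ([g for g in pos_top if g not in both],
            [g for g in neg_top if g not in both])


pos = {"movie": 3.0, "good": 2.0, "fine": 1.0}
neg = {"movie": 2.5, "bad": 2.0, "dull": 1.0}
selected = balanced_top_k(pos, neg, 4)
# selected == (["good"], ["bad"])
```

Note that dropping overlaps can leave fewer than K features; refilling from the next-ranked n-grams is one option, not shown here.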

Feature

• Vector = 0

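• Add two attributes that are not among the top features: the scores given by the n-grams found in the pos/neg word weight lists. When quantifying these n-gram scores, ignore the sign of the strength and sum the absolute values, since a positive word and a negative word may combine to express a stronger positive or negative message.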

• Noise

• Implement a classifier that separates subjective from objective text (a subjective classifier)

• Filter! But the LM filter was not helpful.
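For the absolute-value scoring of lexicon n-grams mentioned under Vector = 0, one possible reading is sketched below: the strength of an n-gram is the sum of the absolute lexicon scores of its words. The lexicon values here are toy numbers for illustration only; real scores would be read from AFINN-111.txt.

```python
def ngram_strength(ngram, lexicon):
    """Sum of absolute lexicon scores of the words in an n-gram."""
    return sum(abs(lexicon.get(word, 0)) for word in ngram if word is not None)


# Toy scores, not actual AFINN-111 entries.
toy_lexicon = {"good": 3, "terribly": -3}
strength = ngram_strength(("terribly", "good", None), toy_lexicon)
# strength == 6: the negative and positive words reinforce each other
```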

GitHub

morris821028 / NLP-SentimentClassification

Special thanks: moporgic, irisshu, FlowerHop

https://github.com/morris821028

https://github.com/morris821028/NLP-SentimentClassification

https://github.com/moporgic

https://github.com/irisshu

https://github.com/FlowerHop
