Microsoft Malware Classification Challenge: Introduction to the Top-Ranked Methods (at Kaggle Study Meetup)


佐野正太郎

Agenda

Competition overview

Baseline approach

    Word count & random forest

Top-ranked methods

    Feature extraction

    Feature transformation

    Classifiers

The winning team's model

Competition Overview

Task: malware classification

Input: hex dump and disassembly files

Hex dump (.bytes)

Disassembly (.asm)

[Figure: .bytes and .asm files -> Your Model -> malware class probabilities]

10,868 training samples

1,000 GB in total

9 classes

10,873 test samples

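Output the probability of each class for every sample

Model evaluation: log loss

$$L = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$$

Here N is the number of samples, K is the number of classes, y_{i,k} is 1 if sample i belongs to class k and 0 otherwise, and p_{i,k} is the predicted probability of class k for sample i.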
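As a small illustration of the log loss metric, here is a minimal sketch in plain Python. The function name `multiclass_log_loss` and the clipping bound `eps` are assumptions made for this example and are not taken from the competition's scoring code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices (0 .. K-1), one per sample.
    probs:  list of per-sample probability lists of length K.
    eps:    clipping bound (an assumption) so that log(0) never occurs.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # With one-hot y, only the true-class term of the inner sum survives.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)

# A uniform guess over 9 classes scores log(9), about 2.197.
uniform = [[1.0 / 9] * 9 for _ in range(3)]
print(multiclass_log_loss([0, 4, 8], uniform))
```

Because a confident wrong answer contributes a very large term, clipping the probabilities away from 0 and 1 is a common safeguard.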



Beat the benchmark (~0.182) with RandomForest [4]

Extract word-count features from the hex dump

1 byte = 1 word

Feed the counts directly to a random forest

[Figure: .bytes -> 1-byte word count -> random forest classifier -> malware class probabilities]


Appeared on the forum from the early days of the competition

It was a surprise that one can achieve the accuracy of 0.96 just by using counts of ‘00’, ‘FF’, and ‘??’. [3]
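The 1-byte word count used by the baseline can be sketched as follows. This assumes each line of a .bytes file is an address followed by space-separated two-character tokens, with '??' marking unknown bytes. The helper name `byte_counts` and the sample text are made up for illustration.

```python
from collections import Counter

# Vocabulary: the 256 byte values plus the '??' placeholder (assumed layout).
VOCAB = [f"{b:02X}" for b in range(256)] + ["??"]

def byte_counts(hexdump_text):
    """Return a 257-dimensional count vector for one .bytes-style dump."""
    counts = Counter()
    for line in hexdump_text.splitlines():
        tokens = line.split()
        # The first token on each line is assumed to be the address; skip it.
        counts.update(tok.upper() for tok in tokens[1:])
    return [counts[tok] for tok in VOCAB]

sample = "00401000 00 FF 00 ??\n00401010 8B 00 ?? ??"
vec = byte_counts(sample)
print(vec[0x00], vec[0xFF], vec[256])  # 3 1 3
```

One such vector per sample is the kind of input that the baseline passes to a random forest classifier.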

Feature Extraction

Feature extraction used by the top teams

Word counts from the hex dump

Word counts from the disassembly

Hybrid word counts

File metadata

Texture images

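Word counts from the hex dump

Treat each byte as one word

N-gram models improve performance

The winning team's model used up to 4-grams

Another approach treats each line as one word [2]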
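Byte N-gram counting can be sketched in a few lines. The function name `byte_ngrams` and the token sequence are illustrative assumptions. Real feature sets built this way grow very quickly with N, which is why the later slides discuss feature selection.

```python
from collections import Counter

def byte_ngrams(tokens, n):
    """Count contiguous n-grams over a sequence of byte tokens."""
    # zip over n shifted views of the sequence yields every window of length n.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

tokens = "56 8D 44 24 56 8D".split()
bigrams = byte_ngrams(tokens, 2)
print(bigrams["56 8D"])  # 2

# Features up to 4-grams are the union of the counts for n = 1 .. 4.
up_to_4 = Counter()
for n in range(1, 5):
    up_to_4.update(byte_ngrams(tokens, n))
```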


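Word counts from the disassembly

[Figure labels: header, hex dump code, assembly code]

Count the instructions

Count the segment names

Turn DLL function import information into features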
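Of the three counts listed above, segment-name counting is the simplest to sketch. The example assumes each .asm line starts with a segment name followed by a colon and an address (for example `.text:00401000`). The function name `segment_counts` and the sample lines are hypothetical.

```python
from collections import Counter

def segment_counts(asm_text):
    """Count how many listing lines fall in each segment."""
    counts = Counter()
    for line in asm_text.splitlines():
        # Assumed layout: '<segment>:<address> <rest of line>'.
        prefix, sep, _ = line.partition(":")
        if sep and prefix.strip():
            counts[prefix.strip()] += 1
    return counts

sample = (
    ".text:00401000 56 push esi\n"
    ".text:00401001 8D 44 24 08 lea eax, [esp+8]\n"
    ".idata:0040A000 extrn CreateFileA:dword\n"
)
counts = segment_counts(sample)
print(counts[".text"], counts[".idata"])  # 2 1
```

Instruction and DLL import counts would follow the same pattern, but they need a parser for the rest of the line, which is left out here.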

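Hybrid features

DAF (Derived Assembly Features) [6]

(1) Extract N-gram features from the hex dump

(2) Narrow down (1) by information gain

(3) Extract the assembly instructions that co-occur with (2)

(4) Narrow down (3) by information gain

Instructions are counted as features only where the corresponding hex dump feature is important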

File metadata

Size of the hex dump file

Size of the disassembly file

Compression ratio of the hex dump file

Compression ratio of the disassembly file

etc.
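The size and compression-ratio features listed above could be computed roughly as follows. The choice of zlib and the definition of the ratio as compressed size divided by original size are assumptions; the slides do not say which compressor or definition the teams used.

```python
import zlib

def metadata_features(raw):
    """Size and compression ratio for the raw bytes of one file."""
    size = len(raw)
    compressed = zlib.compress(raw, 9)
    # Ratio near 0: highly repetitive content. Ratio near 1: hard to compress.
    ratio = len(compressed) / size if size else 0.0
    return {"size": size, "compression_ratio": ratio}

padded = metadata_features(b"\x00" * 10000)
print(padded["size"], padded["compression_ratio"] < 0.05)
```

For a file on disk, `raw` would be the result of reading the file in binary mode.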


Texture images

Convert the hex dump to a grayscale image

1 byte = 1 pixel value

Extract suitable image features

The original paper uses GIST features [7]
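The byte-to-pixel mapping can be sketched as below. The fixed row width of 256 and the function name `bytes_to_image` are assumptions for illustration, and the image feature extraction step (GIST in the original paper) is not shown.

```python
def bytes_to_image(raw, width=256):
    """Reshape a byte string into rows of `width` grayscale pixels (0-255)."""
    height = len(raw) // width  # the incomplete last row is dropped
    return [list(raw[r * width:(r + 1) * width]) for r in range(height)]

img = bytes_to_image(bytes(range(256)) * 4, width=256)
print(len(img), len(img[0]), img[0][:3])  # 4 256 [0, 1, 2]
```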

((((((( ;ﾟдﾟ)))))))

Feature Transformation

Feature transformation used by the top teams

TF-IDF

Information gain

Non-negative matrix factorization

Random forest

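TF-IDF

Normalize word frequency by document length

Emphasize words that appear in only a small number of documents

$$\mathrm{tfidf}_{word,doc} = \mathrm{tf}_{word,doc} \cdot \mathrm{idf}_{word}$$

$$\mathrm{tf}_{word,doc} = \frac{\#\{word \text{ in } doc\}}{\sum_{word'} \#\{word' \text{ in } doc\}}$$

$$\mathrm{idf}_{word} = \log \frac{\#\{\text{all docs}\}}{\#\{\text{docs including } word\}}$$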
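A minimal sketch of TF-IDF over per-file word counts, where each "document" is one malware sample. The function name `tfidf` and the toy counts are assumptions for illustration.

```python
import math

def tfidf(docs):
    """docs: list of {word: count} dicts. Returns a list of {word: tf-idf}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = {}
    for counts in docs:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    result = []
    for counts in docs:
        length = sum(counts.values())  # document length used to normalize tf
        result.append({
            w: (c / length) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return result

docs = [{"00": 8, "FF": 2}, {"00": 5, "8B": 5}]
weights = tfidf(docs)
print(weights[0]["00"], round(weights[0]["FF"], 4))  # 0.0 0.1386
```

The byte '00' appears in every document, so its weight drops to zero, while 'FF' appears in only one document and keeps a positive weight.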

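Information gain

The reduction in entropy once a given feature is known

$$\mathrm{Gain}(x) = H(Y) - H(Y \mid x)$$

Simplifying the computation

Word frequency => binary value indicating whether the word appeared

Select features independently for each class

$$\mathrm{Gain}(x) = -\left(\frac{p}{t}\log_2\frac{p}{t} + \frac{n}{t}\log_2\frac{n}{t}\right) + \sum_{v \in \{0,1\}} \frac{t_v}{t}\left(\frac{p_v}{t_v}\log_2\frac{p_v}{t_v} + \frac{n_v}{t_v}\log_2\frac{n_v}{t_v}\right)$$

p: number of positive samples; n: number of negative samples

t: total number of samples

p_v, n_v, t_v: the same counts restricted to the samples where the target feature is fixed to v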
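The binary information gain can be computed directly from the positive, negative, and total sample counts. In this sketch, `present` marks whether the word occurs in each sample and `labels` marks whether each sample belongs to the class under consideration. Both names, and the helper `_plogp`, are assumptions for illustration.

```python
import math

def _plogp(count, total):
    """(count/total) * log2(count/total), with 0 * log 0 taken as 0."""
    if count == 0 or total == 0:
        return 0.0
    frac = count / total
    return frac * math.log2(frac)

def info_gain(present, labels):
    """Information gain of a binary feature for a one-vs-rest class label."""
    t = len(labels)
    p = sum(labels)
    n = t - p
    gain = -(_plogp(p, t) + _plogp(n, t))  # H(Y)
    for v in (0, 1):
        subset = [y for x, y in zip(present, labels) if x == v]
        t_v = len(subset)
        p_v = sum(subset)
        n_v = t_v - p_v
        if t_v:
            # Adds (t_v / t) * (p log p + n log n), i.e. subtracts H(Y | x = v).
            gain += (t_v / t) * (_plogp(p_v, t_v) + _plogp(n_v, t_v))
    return gain

print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0 (feature determines the class)
print(info_gain([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0 (feature is uninformative)
```

Features would then be ranked by this score for each class separately and only the top ones kept.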

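Non-negative matrix factorization

N-gram word counts form a high-dimensional non-negative matrix

Reduce the dimensionality while preserving non-negativity

Factorize a non-negative matrix into a product of non-negative matrices

The example below compresses 5 dimensions to 2

$$\begin{pmatrix} 2 & 3 & 6 & 4 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 4 & 1 & 2 & 3 & 2 \\ 0 & 1 & 2 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 0 & 0 & 1 & 1 \\ 0 & 1 & 2 & 1 & 0 \end{pmatrix}$$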
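The slide's example can be checked numerically, and a factorization can also be found iteratively. The sketch below uses Lee-Seung multiplicative updates for the squared error, which is one common NMF algorithm chosen here as an assumption; the slides do not state which algorithm the teams used. All function names are illustrative.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))

def nmf(v, rank, iters=500, seed=0, eps=1e-9):
    """Factorize V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# The 4 x 5 matrix and its factors from the example in the slide.
V = [[2, 3, 6, 4, 1], [0, 0, 0, 0, 0], [4, 1, 2, 3, 2], [0, 1, 2, 1, 0]]
W_slide = [[1, 3], [0, 0], [2, 1], [0, 1]]
H_slide = [[2, 0, 0, 1, 1], [0, 1, 2, 1, 0]]
print(matmul(W_slide, H_slide) == V)  # True

W, H = nmf(V, rank=2)
print(frob_error(V, W, H))
```

Each row of W is the compressed 2-dimensional representation of the corresponding 5-dimensional row of V.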

Random forest

Used as a feature selection method rather than as a classifier

After training, discard the features with low feature importance

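Classifiers

XGBoost

A fast, feature-rich implementation of gradient boosting

Tree ensemble learning + gradient descent

Weak trees are learned sequentially, in the manner of gradient descent

$$F_t(x) = F_{t-1}(x) - \gamma_t \sum_{i=1}^{n} \nabla_F L\bigl(y_i, F_{t-1}(x_i)\bigr)$$

F_{t-1}: the forest learned up to the previous step

The tree at the next step is fit to the negative gradient at the previous step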
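The idea that each new tree fits the negative gradient of the previous step can be shown with a toy regression example. This is plain gradient boosting with depth-1 trees on one-dimensional inputs and a squared loss, written as an illustrative sketch. It is not XGBoost's algorithm, which adds regularization and second-order information. All names and the learning rate are assumptions.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump on 1-D inputs (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= threshold]
        right = [r for x, r in zip(xs, targets) if x > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def boost(xs, ys, rounds=50, lr=0.3):
    """Return a predictor built from `rounds` sequentially fitted stumps."""
    trees = []

    def predict(x):
        return sum(lr * tree(x) for tree in trees)

    for _ in range(rounds):
        # For L = (y - F)^2 / 2 the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(1), 3), round(model(4), 3))
```

For classification with log loss the same loop applies, with the residual replaced by the gradient of the log loss with respect to the model's score.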

Averaging

Submit the simple average of multiple models' outputs

A geometric mean sometimes improves performance

Averaging multiple different green lines should bring us closer to the black line. [5]
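Both averaging schemes are easy to sketch for the class-probability vectors of a single sample. The function names are assumptions, and renormalizing the geometric mean so that it sums to 1 is a choice made here for illustration.

```python
import math

def arithmetic_mean(preds):
    """preds: list of probability vectors, one per model, for one sample."""
    k = len(preds[0])
    return [sum(p[j] for p in preds) / len(preds) for j in range(k)]

def geometric_mean(preds, eps=1e-15):
    """Geometric mean per class, renormalized to sum to 1."""
    k = len(preds[0])
    raw = [
        math.exp(sum(math.log(max(p[j], eps)) for p in preds) / len(preds))
        for j in range(k)
    ]
    total = sum(raw)
    return [r / total for r in raw]

preds = [[0.9, 0.1], [0.5, 0.5]]
print(arithmetic_mean(preds))
print(geometric_mean(preds))
```

On this input the arithmetic mean is about [0.7, 0.3] and the renormalized geometric mean is about [0.75, 0.25], so the geometric mean leans further toward the confident model.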

Stacking

Train a model that combines the outputs of multiple models

[Figure: stacking diagram with XGBoost, Neural Network, Nearest Neighbors, Random Forest, and Extra Tree models, followed by Averaging]

Summary

Word-count-based feature extraction

Balancing the number of features with information gain and matrix factorization

File metadata

XGBoost

Averaging or Stacking

The Winning Team's Model

[Figure: model diagram. Feature blocks: Opcode 2-gram, Opcode 3-gram, Opcode 4-gram, Header 1-gram, Hexdump 4-gram & Info Gain, DAF 1-gram, DLL 1-gram, Assembly Texture Image, Instruction 1-gram, Hexdump 1-gram. Model blocks: Random Forest, XGBoost, Semi-supervised Learning with Test Dataset, Averaging.]

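Texture image from the disassembly

Turn the byte sequence of the disassembly file into a texture

Use the values of the first 1,000 pixels as features

[Figure: hex dump texture and disassembly texture]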

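Semi-supervised learning that includes the test data

Train a model on the training data (the intermediate model)

Label all of the test data with the intermediate model

Split the labeled test data into several chunks

For each chunk:

    Train the final model on the training data plus the labeled test data outside the target chunk

    Predict the class probabilities of the target chunk with the final model

Merge the results from all chunks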
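The chunked pseudo-labeling procedure above can be sketched end to end. A nearest-centroid classifier stands in for the real model purely to keep the example self-contained. The winning team used XGBoost, and every function name and parameter here is an assumption for illustration.

```python
import math

def fit_centroids(xs, ys):
    """Stand-in classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict_proba(model, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c)) for y, c in model.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def pseudo_label_predict(train_x, train_y, test_x, n_chunks=4):
    # Intermediate model trained on the training data only.
    intermediate = fit_centroids(train_x, train_y)
    # Label every test sample with the intermediate model's top class.
    pseudo = []
    for x in test_x:
        proba = predict_proba(intermediate, x)
        pseudo.append(max(proba, key=proba.get))
    results = [None] * len(test_x)
    for c in range(n_chunks):
        chunk = [i for i in range(len(test_x)) if i % n_chunks == c]
        held_out = set(chunk)
        extra = [i for i in range(len(test_x)) if i not in held_out]
        # Final model: training data plus pseudo-labeled test data outside the chunk.
        final = fit_centroids(train_x + [test_x[i] for i in extra],
                              train_y + [pseudo[i] for i in extra])
        for i in chunk:
            results[i] = predict_proba(final, test_x[i])
    # results now holds the merged predictions in the original test order.
    return results

train_x = [[0.0], [0.2], [5.0], [5.2]]
train_y = [0, 0, 1, 1]
test_x = [[0.1], [4.9], [0.3], [5.1]]
out = pseudo_label_predict(train_x, train_y, test_x, n_chunks=2)
print([max(p, key=p.get) for p in out])  # [0, 1, 0, 1]
```

Holding each chunk out of its own final model keeps a sample's pseudo-label from being used to predict that same sample.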


[Figure: the model diagram shown again, with the Header 1-gram block labeled Segment 1-gram]

Golden Features

Which features were effective?

[Figure: bar chart of log-loss (axis from 0 to 0.016) under Cross Validation, Public Leaderboard, and Private Leaderboard for the feature sets Opcode Count; Opcode Count + Segment Count; Opcode Count + Segment Count + ASM Texture; and All Features]

[Figure labels: Private Leaderboard, Public Leaderboard]

Thank you!

References

1. First place code and documents

https://www.kaggle.com/c/malware-classification/forums/t/13897/first-place-code-and-documents

2. 2nd place code and documentation

https://www.kaggle.com/c/malware-classification/forums/t/13863/2nd-place-code-and-documentation

3. 3rd place code and documentation

https://www.kaggle.com/c/malware-classification/forums/t/14065/3rd-place-code-and-documentation

4. Beat the benchmark (~0.182) with RandomForest

https://www.kaggle.com/c/malware-classification/forums/t/12490/beat-the-benchmark-0-182-with-randomforest

5. Kaggle Ensembling Guide

http://mlwave.com/kaggle-ensembling-guide

6. Masud, M. M., Khan, L., and Thuraisingham, B., “A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables,” Information Systems Frontiers, Vol. 10, No. 1, pp. 33-45 (2008).

7. Nataraj, L., Yegneswaran, V., Porras, P., and Zhang, J., “A Comparative Assessment of Malware Classification Using Binary Texture Analysis and Dynamic Analysis,” Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pp. 21-30 (2011).
