MapReduce: Simplified Data Processing on Large Clusters

abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets.

introduction:

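Google runs many kinds of services, and providing them means processing large amounts of raw data. Because the volume of raw data is usually very large, an important issue is how to process it in an acceptable amount of time across hundreds or thousands of distributed machines.

Solving this issue involves working out how to parallelize the computation, distribute the data, and so on.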

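To deal with this complexity, we designed a new abstraction that lets us express the simple computation we are trying to perform, while the messy details of parallelization, data distribution, and load balancing are hidden inside a library. The idea is simple: first apply a "map" operation to each logical record in the input to compute a set of intermediate key/value pairs, then apply a "reduce" operation to combine all the intermediate values that share the same key.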

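"The major contribution of this work is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations ..."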

(The introduction closes with a brief overview of each section.)

programming model:

The computation takes a set of input key/value pairs and produces another set of key/value pairs as output. In the MapReduce library, the user expresses this with two functions: Map and Reduce.

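#Map: takes an input pair and produces a set of intermediate key/value pairs. The library then groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

#Reduce: accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values.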

example: count the number of times each word occurs in a large collection of documents (each occurrence of a word in a document is emitted with a count of 1).
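The word-count example can be sketched in a few lines of Python. This is only a single-process illustration: the function names `map_fn`, `reduce_fn`, and `word_count` are made up for this sketch, and the grouping step that the real library performs across machines is imitated here with a dictionary.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit (word, 1) for every word occurrence in one document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all the counts emitted for one word."""
    return sum(counts)

def word_count(documents):
    # Group intermediate values by key; in the real system the library does this.
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            grouped[word].append(one)
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = word_count(docs)
```

Here `counts["the"]` is 3, because the word occurs once in the first document and twice in the second.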

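more examples:

distributed grep: map emits a line if it matches the supplied pattern; reduce just copies it to the output

count of URL access frequency: map processes the request logs; reduce adds up the counts for each URL

reverse web-link graph: reduce outputs, for each target URL, the list of pages that link to it

term-vector per host: a term vector summarizes the most important words in a document (or set of documents)

inverted index: reduce outputs, for each word, the list of documents it appears in

distributed sort: map extracts the key from each record; the sorting itself relies on other facilities of the library (partitioning and ordering)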
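As one illustration of these examples, the inverted index can be sketched as follows. The names `index_map`, `index_reduce`, and `build_inverted_index` are hypothetical, and the whole thing runs in one process rather than on a cluster.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Emit (word, doc_id) once for each distinct word in the document."""
    for word in set(text.split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Return the sorted list of documents that contain the word."""
    return sorted(doc_ids)

def build_inverted_index(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, found_in in index_map(doc_id, text):
            grouped[word].append(found_in)
    return {word: index_reduce(word, ids) for word, ids in grouped.items()}

index = build_inverted_index({"d1": "map reduce", "d2": "reduce only"})
```

After this runs, `index["reduce"]` lists both documents and `index["map"]` lists only `d1`.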

Implementation

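Many different implementations of the MapReduce interface are possible; the right choice depends on the machines it runs on. The paper describes the implementation for Google's computing environment (large clusters of PCs connected by switched Ethernet):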

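1. dual-processor x86 machines running Linux, with 2-4 GB of memory

2. 100 Mbps or 1 Gbps networking at the machine level, though the overall bisection bandwidth averages considerably less

3. machine failures are common

4. storage is provided by inexpensive IDE disks attached directly to each machine

5. users submit jobs to a scheduling system (each job consists of a set of tasks), and the scheduler maps the tasks onto available machines in the cluster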

EXEC OVERVIEW:

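Map invocations are distributed by automatically partitioning the input data into M splits.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces.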

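1. The library splits the input into M pieces, typically 16-64 MB each, and then starts many copies of the program on a cluster of machines.

2. One copy is the master, which assigns work to the workers.

3. Each map worker reads its input split, parses out the key/value pairs, and passes each one to the user-defined Map function; the intermediate key/value pairs it produces are buffered in memory.

4. The buffered pairs are periodically written to local disk, partitioned into R regions, and the locations are reported back to the master.

5. Once the master tells a reduce worker where the data is, the worker uses remote procedure calls to read it from the map workers' local disks. After reading all of it, the worker sorts the data so that pairs with the same key are grouped together. (Why is the sort needed? Because many different keys typically map to the same reduce task.)

6. The output of the Reduce function is appended to a final output file (there is one output file per reduce task).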

#Usually these output files are not combined into one; they are passed as input to another MapReduce call, or used by an application that can handle input partitioned across multiple files.
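The flow described in this overview can be imitated in a single process. The sketch below is an illustration under simplifying assumptions, not the paper's implementation: `run_mapreduce` is a made-up name, splits are taken round-robin instead of by size, the "local disk" regions are plain lists, and `zlib.crc32` stands in for the hash function.

```python
import zlib
from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    # Split the input into M pieces (round-robin here; by size in the paper).
    splits = [records[i::M] for i in range(M)]
    # Each map task partitions its intermediate pairs into R regions.
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                r = zlib.crc32(ikey.encode("utf-8")) % R
                regions[m][r].append((ikey, ivalue))
    # Each reduce task reads its region from every map task, sorts by key,
    # groups equal keys, and applies the reduce function.
    outputs = []
    for r in range(R):
        pairs = [pair for m in range(M) for pair in regions[m][r]]
        pairs.sort(key=lambda pair: pair[0])
        outputs.append([
            (key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(pairs, key=lambda pair: pair[0])
        ])
    return outputs  # one output "file" per reduce task

def wc_map(name, text):
    for word in text.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

files = run_mapreduce([("a", "x y x"), ("b", "y z"), ("c", "x")], wc_map, wc_reduce)
```

The result is a list of R output lists, each sorted by key; which words land in which list depends on the hash.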

master data structure

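The master keeps several data structures. For each map and reduce task it stores the task's state (idle, in-progress, or completed) and the identity of the worker machine running it. The master is the conduit through which the locations of intermediate files travel from map tasks to reduce tasks, so for each completed map task it stores the locations and sizes of the R intermediate file regions. This information is updated as map tasks complete.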
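One way to picture the per-task bookkeeping is the small sketch below. The class and field names (`TaskState`, `Task`, `regions`, `complete_map_task`) and the file paths are invented for illustration; the paper does not specify the actual layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    task_id: int
    kind: str                      # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None   # machine running (or that ran) the task
    # For completed map tasks: (location, size) of each of the R regions.
    regions: list = field(default_factory=list)

def complete_map_task(task, worker, regions):
    """Record a finished map task and where its intermediate regions live."""
    task.state = TaskState.COMPLETED
    task.worker = worker
    task.regions = list(regions)

t = Task(task_id=0, kind="map")
complete_map_task(t, "worker-17", [("/local/m0-r0", 1024), ("/local/m0-r1", 2048)])
```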

fault tolerance

--worker failure

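The master pings every worker periodically. If a worker does not respond within a certain amount of time, the master marks it as failed. Any task in progress on that worker is reset to idle, and likewise any map task the worker had completed goes back to the idle state. Once idle, these tasks are eligible for rescheduling on other workers.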

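On a failure, completed map tasks have to be re-executed, because their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, because their output is already stored in the global file system.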

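When a map task is executed first by worker A and then, after A fails, by worker B, all workers running reduce tasks are notified of the re-execution, so any reduce task that has not yet read the data from A knows to read its input from worker B instead.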
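The reset rules for a failed worker can be written down as a small function. This is a sketch with tasks represented as plain dictionaries; the field names and `handle_worker_failure` are assumptions made for the example.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks affected by a failed worker; return the ids to reschedule."""
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        in_progress = task["state"] == "in-progress"
        # Completed map output lives on the failed machine's local disk.
        completed_map = task["state"] == "completed" and task["kind"] == "map"
        if in_progress or completed_map:
            task["state"] = "idle"
            task["worker"] = None
            rescheduled.append(task["id"])
    return rescheduled

tasks = [
    {"id": 0, "kind": "map", "state": "completed", "worker": "A"},
    {"id": 1, "kind": "reduce", "state": "completed", "worker": "A"},
    {"id": 2, "kind": "reduce", "state": "in-progress", "worker": "A"},
    {"id": 3, "kind": "map", "state": "completed", "worker": "B"},
]
reset_ids = handle_worker_failure(tasks, "A")
```

Only tasks 0 and 2 are reset: the completed reduce task keeps its result because its output is already in the global file system, and worker B is unaffected.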

--master failure

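It is easy to have the master write periodic checkpoints of its data structures, so that if the master task dies a new copy can be started from the latest checkpoint. However, there is only a single master, so in the current implementation the MapReduce operation is simply aborted if the master dies; the client can then decide whether to retry.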

--semantics in the presence of failure

Locality

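Network bandwidth is a relatively scarce resource, so we conserve it by taking advantage of the fact that the input data is stored on the local disks of the machines in the cluster (managed by GFS). GFS divides each file into 64 MB blocks and stores copies of each block (typically three) on different machines. The MapReduce master takes this location information into account and tries to schedule each map task on a machine that holds a replica of its input; failing that, it tries a machine on the same network switch as a replica.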
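The preference order (replica on the same machine, then same switch, then anywhere) might be sketched like this. The function `pick_map_worker`, the `switch_of` mapping, and the machine names are hypothetical; the real scheduler is not described at this level of detail.

```python
def pick_map_worker(replica_machines, idle_workers, switch_of):
    """Prefer a worker holding a replica, then one on the same switch, then any."""
    for worker in idle_workers:
        if worker in replica_machines:
            return worker
    replica_switches = {switch_of[m] for m in replica_machines}
    for worker in idle_workers:
        if switch_of[worker] in replica_switches:
            return worker
    return idle_workers[0] if idle_workers else None

switch_of = {"m1": "s1", "m2": "s1", "m3": "s2", "m4": "s2"}
local = pick_map_worker({"m1", "m3"}, ["m2", "m3"], switch_of)
nearby = pick_map_worker({"m1"}, ["m4", "m2"], switch_of)
```

In the first call `m3` holds a replica and is chosen; in the second no idle worker holds one, so `m2` wins because it shares switch `s1` with the replica on `m1`.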

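Task Granularity

Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.

In practice M and R cannot grow without bound: the master has to make O(M+R) scheduling decisions and keep O(M*R) state in memory. In addition, R is usually constrained by the user, since it determines how many output files are produced; M is chosen so that each split is roughly 16-64 MB of input.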

(We often perform MapReduce computations with M=200,000 and R=5,000, using 2,000 worker machines.)
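To get a feel for these bounds, the arithmetic can be worked through with M = 200,000, R = 5,000 and 2,000 workers (the figures reported in the paper). The one-byte-per-pair estimate is the approximate constant the paper gives for the O(M*R) state; the variable names are only for this sketch.

```python
M, R, workers = 200_000, 5_000, 2_000

scheduling_decisions = M + R          # O(M + R) decisions made by the master
state_entries = M * R                 # O(M * R) map-task/reduce-task pairs
state_bytes = state_entries * 1       # roughly one byte per pair
map_tasks_per_worker = M // workers   # many tasks per worker helps load balancing
```

This gives 205,000 scheduling decisions, 10^9 task pairs (about 1 GB of state at one byte each), and 100 map tasks per worker on average.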

Backup Tasks

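A common cause of a longer total running time is a straggler: a machine that takes unusually long to finish one of the last few tasks of a MapReduce operation. Why do stragglers arise?

1. the machine has a bad disk and frequently has to correct errors, which slows reads

2. the scheduling system may have scheduled other tasks on the machine (so the MapReduce code competes for CPU, memory, network bandwidth, and so on)

3. a bug in the machine initialization code disables the processor caches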

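We have a general mechanism to alleviate the straggler problem. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks; a task is marked as completed as soon as either the primary or the backup execution finishes. The mechanism has been tuned so that it typically costs no more than a few percent of additional computational resources.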
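The decision of when to launch backups could look roughly like the sketch below. The 95% `threshold` is an invented parameter for illustration (the notes only say "close to completion"), and `pick_backup_tasks` is a hypothetical name.

```python
def pick_backup_tasks(task_states, threshold=0.95):
    """Return the in-progress task ids to duplicate once the job is nearly done."""
    done = sum(1 for state in task_states.values() if state == "completed")
    if done / len(task_states) < threshold:
        return []  # too early: no backups yet
    return sorted(t for t, state in task_states.items() if state == "in-progress")

states = {i: "completed" for i in range(98)}
states[98] = "in-progress"
states[99] = "in-progress"
backups = pick_backup_tasks(states)
```

With 98 of 100 tasks finished, the two remaining tasks are selected for backup execution; whichever copy of a task finishes first would then mark it as completed.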

Refinement

partitioning function

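The default partitioning function is hash(key) mod R, which tends to give fairly well-balanced partitions. Sometimes a different partitioning function works better. For example, if the output keys are URLs and we want all entries for a single host to end up in the same output file, the partitioning function hash(hostname(urlkey)) mod R is a better choice (the output then does not have to be split by host afterwards).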
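Both partitioning functions are easy to sketch in Python. The names `default_partition` and `host_partition` are made up here, and `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between processes; the paper does not say which hash function is used.

```python
import zlib
from urllib.parse import urlparse

def default_partition(key, R):
    """hash(key) mod R"""
    return zlib.crc32(key.encode("utf-8")) % R

def host_partition(url_key, R):
    """hash(hostname(urlkey)) mod R: all URLs of one host share a partition."""
    host = urlparse(url_key).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % R

R = 8
a = host_partition("http://example.com/a", R)
b = host_partition("http://example.com/b/c", R)
```

Here `a` and `b` are equal because both URLs share the host `example.com`, whereas `default_partition` would generally spread the two URLs across different partitions.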

ordering guarantees

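Guarantee: within a given partition, intermediate key/value pairs are processed in increasing key order.

Benefits: sorted output is easy to produce, and lookups by key are convenient.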

combiner function

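Consider word count: each map task emits many <key,1> records, so a combiner function performs partial merging of these records before they are sent over the network. The combiner function typically uses the same code as the reduce function; the only difference is what happens to the output. Reduce output is written to the final output file, while combiner output is written to an intermediate file that will be sent to a reduce task.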
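The effect of a combiner on word count can be shown with a short sketch. `sum_counts` and `combine_locally` are illustrative names; the point is only that the same function serves as both combiner and reducer.

```python
from collections import defaultdict

def sum_counts(word, counts):
    """Used both as the reduce function and as the combiner."""
    return sum(counts)

def combine_locally(pairs, combiner):
    """Partially merge one map task's output before it crosses the network."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, combiner(key, values)) for key, values in grouped.items()]

raw = [(word, 1) for word in "the cat and the hat and the bat".split()]
combined = combine_locally(raw, sum_counts)
```

The map task would have sent 8 records without the combiner and sends 5 with it, one per distinct word.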

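input and output types

Users can add support for a new input type by providing an implementation of a simple reader interface. A reader does not have to read from a file; it is just as easy to define one that reads records from a database or from data structures mapped in memory.

Output types are handled the same way, so new output types can be added too.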

side-effect

skipping bad records

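Some bugs cause the Map or Reduce function to crash on certain records. The usual fix is to remove the bug, but the bug is not always within our reach (for example, it may sit in a third-party library). Sometimes it is also acceptable to ignore a few records. The paper therefore provides an optional mode that detects which records cause crashes and skips them so that processing can continue. Each worker process installs a signal handler that catches segmentation violations and bus errors. Before invoking a Map or Reduce operation, the library stores the sequence number of the argument in a global variable. If the user code raises a signal, the handler sends that sequence number to the master before the process dies. When the master has seen more than one failure on a particular record, it indicates that the record should be skipped the next time the task is executed.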
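The skip logic can be imitated in one process. Note the substitution: this sketch uses Python exceptions in place of the signal handler, and a return value in place of the message sent to the master; `run_task`, `run_with_skipping`, and `max_failures` are names invented for the example.

```python
def run_task(records, user_fn, skip):
    """Process the records; on a crash, report the failing sequence number."""
    results = []
    for seq, record in enumerate(records):
        if seq in skip:
            continue
        try:
            results.append(user_fn(record))
        except Exception:
            return None, seq  # stands in for the worker's message to the master
    return results, None

def run_with_skipping(records, user_fn, max_failures=1):
    """Play the master: re-run the task, skipping records that fail repeatedly."""
    failures = {}
    skip = set()
    while True:
        results, bad_seq = run_task(records, user_fn, skip)
        if bad_seq is None:
            return results, skip
        failures[bad_seq] = failures.get(bad_seq, 0) + 1
        if failures[bad_seq] > max_failures:
            skip.add(bad_seq)  # seen more than one failure: skip next time

results, skipped = run_with_skipping(["1", "2", "x", "4"], int)
```

The record `"x"` makes `int` fail twice, after which it is skipped and the remaining records are processed normally.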

local execution

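Debugging problems in Map or Reduce functions can be tricky, because the actual computation happens in a distributed system, often on several thousand machines, with work assignments decided dynamically by the master. To help with debugging, profiling, and small-scale testing, we developed an alternative implementation of the MapReduce library that executes all of the work sequentially on one local machine. The computation can be limited to particular map tasks, which makes debugging much easier.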

status information

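The master runs an internal HTTP server and exports a set of status pages. The status pages show the progress of the computation: how many tasks have completed, how many are in progress, and the sizes of the input, intermediate data, and output. They also contain links to the standard error and standard output files generated by each task. Users can use these pages to predict how long the computation will take and whether more resources should be added to it. The pages also help to figure out why a computation is much slower than expected.

In addition, the top-level status page shows which workers have failed and which map and reduce tasks they were processing when they failed. This information is very helpful when diagnosing bugs in user code.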

counters

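The MapReduce library provides a counter facility for counting occurrences of various events. For example, user code may want to count the total number of words processed, or the number of German documents indexed.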

User code creates a named counter object and increments it as appropriate in the Map or Reduce function. The counter values from each worker are periodically propagated to the master (piggybacked on the worker's reply to the master's ping). The master aggregates these values and returns them to the user code when the MapReduce operation completes. The current counter values are also displayed on the master status page while the computation is running. When aggregating, the master eliminates the effect of duplicate executions of the same map or reduce task, so that nothing is counted twice.
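The de-duplication during aggregation might be sketched as follows. The report format (a task id paired with a dictionary of counter values) and the name `aggregate_counters` are assumptions for this example, not the library's interface.

```python
def aggregate_counters(reports):
    """reports: iterable of (task_id, {counter_name: value}) from finished executions."""
    seen = set()
    totals = {}
    for task_id, counters in reports:
        if task_id in seen:
            continue  # duplicate execution (backup or re-run): count it once
        seen.add(task_id)
        for name, value in counters.items():
            totals[name] = totals.get(name, 0) + value
    return totals

reports = [
    ("map-0", {"words": 10}),
    ("map-1", {"words": 7}),
    ("map-0", {"words": 10}),  # same task executed a second time
]
totals = aggregate_counters(reports)
```

The second report for `map-0` is ignored, so `totals["words"]` is 17 rather than 27.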

Some counter values are maintained automatically by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.

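Users find counters useful for sanity-checking the behavior of MapReduce operations. For example, in some MapReduce operations the user code wants to ensure that the number of output pairs exactly equals the number of input pairs (or that the number of documents of a particular kind processed is within a tolerable fraction of the total number of documents processed).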

Performance

In this section we measure performance with two computations running on a large cluster of machines. The first searches through about 1 TB of data for a particular pattern; the second sorts about 1 TB of data.

cluster configuration

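All programs were run on a cluster of about 1,800 machines. Each machine had two 2 GHz Intel Xeon processors, 4 GB of memory, two 160 GB IDE disks, and a 1 Gbps network link. The machines were arranged in a two-level tree-shaped switched network with roughly 100-200 Gbps of aggregate bandwidth. All machines were in the same hosting facility, so the round-trip time between any pair was under 1 ms.

Of the 4 GB of memory, about 1-1.5 GB was reserved for other tasks running on the cluster. Apart from that, the machines were mostly idle while the tests ran.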

grep

The grep program scans about 10^10 100-byte records, looking for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into 15,000 pieces of about 64 MB each (M=15000), and all output goes to a single file (R=1).
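A quick check of the sizes quoted above, as a sketch (the variable names are only for this calculation):

```python
records = 10**10
record_size = 100                        # bytes per record
total_bytes = records * record_size      # about 1 TB in total

M = 15_000
bytes_per_split = total_bytes / M
mib_per_split = bytes_per_split / 2**20  # size of one split in MiB
```

The total is 10^12 bytes, and each of the 15,000 splits comes to just under 64 MiB, consistent with the "about 64 MB" figure.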

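Figure 2 shows the progress of the computation over time. The rate gradually picks up as more machines are assigned to the computation and peaks at over 30 GB/s once 1,764 workers have been assigned. As the map tasks finish, the rate starts to drop, reaching nearly zero at about 80 s. The whole computation takes about 150 s from start to finish, which includes the startup overhead. This overhead comes from propagating the user program to every worker machine, and from interacting with GFS to open the input files and fetch the information needed for the locality optimization.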

sort

The sort program sorts 10^10 100-byte records (about 1 TB of data). A three-line Map function extracts a 10-byte sort key from each text line and emits the key and the original text line as the intermediate key/value pair. A built-in Identity function is used as the Reduce operator; it passes the intermediate key/value pair through unchanged as the output pair. The final sorted output is written to GFS with two-way replication.

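As before, the input data is split into 64 MB pieces, and the sorted output is partitioned into 4,000 files (R=4000). For this benchmark, our partitioning function has built-in knowledge of the distribution of keys.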

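Figure 3 shows a normal execution of the sort program. The top-left graph shows the rate at which input is read. The rate peaks at about 13 GB/s and then drops off quickly, since all map tasks finish by about 200 s. The input rate is lower than for grep because the sort map tasks spend about half their time and I/O bandwidth writing intermediate output to local disk, whereas the intermediate output of grep is negligible in size. The middle-left graph shows the rate at which data is sent over the network from the map tasks to the reduce tasks. This shuffle (the movement of intermediate data) starts as soon as the first map task completes. The first hump in the graph is for the first batch of reduce tasks. At roughly 300 s into the computation, some of this first batch finish and data starts to be shuffled to the remaining reduce tasks. All of the shuffling is done at about 600 s.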

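The bottom-left graph shows the rate at which the reduce tasks write sorted data to the final output files. There is a short delay between the end of the first shuffling period and the start of writing, because the machines are busy sorting the intermediate data. The writes then continue at about 2-4 GB/s for a while, and all of them finish at about 850 s. Including startup overhead, the whole sort takes 891 s, which is close to the best reported benchmark result of 1,057 s.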

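A few things are worth noting. The input rate is higher than the shuffle rate and the output rate because of the locality optimization: most data is read from a local disk and bypasses the bandwidth-constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data. The two copies are how the underlying file system provides reliability and availability. The network bandwidth needed for writing would be lower if the underlying file system used erasure coding instead of replication.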

effect of backup task

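Figure 3(b) shows an execution of the sort program with backup tasks disabled. The execution looks much like (a), except for a very long tail at the end. After 960 s only 5 tasks are still running, but these last five do not finish until 300 s later. The whole computation takes 1,283 s, 44% longer than with backup tasks.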

machine failure

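Figure 3(c) shows an execution of the sort program in which we intentionally killed 200 worker processes. The cluster scheduler immediately restarted new worker processes on those machines (only the processes were killed; the machines themselves were still working).

The worker deaths show up as a negative input rate, because map work that had already been completed disappears and has to be redone. The re-execution happens quickly, and the whole computation finishes in 933 s, only 5% longer than the normal execution.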
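The two percentages quoted for the sort runs follow directly from the running times (891 s normal, 1,283 s without backup tasks, 933 s with 200 killed workers). A small check, with `pct_increase` as a helper name chosen for this sketch:

```python
normal, no_backup, with_failures = 891, 1283, 933  # seconds

def pct_increase(seconds, baseline):
    """Percentage increase over the baseline, rounded to a whole percent."""
    return round(100 * (seconds - baseline) / baseline)

slowdown_without_backup = pct_increase(no_backup, normal)    # 392 / 891
slowdown_with_failures = pct_increase(with_failures, normal) # 42 / 891
```

These come out to 44 and 5 respectively, matching the figures in the text.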

Experience

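We wrote the first version of the MapReduce library in February 2003 and made significant enhancements to it in August 2003, including the locality optimization and dynamic load balancing of task execution across worker machines. Since then we have been pleasantly surprised at how widely applicable MapReduce is. Within Google it has been used in areas including:

large-scale machine learning problems

... (omitted)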

large-scale indexing

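The most significant use of MapReduce so far is a complete rewrite of the production indexing system, which produces the data structures used by the Google web search service. The input to this system is about 20 TB of data, and the indexing process runs as a sequence of five MapReduce operations. Compared with the earlier ad-hoc approach, this has several benefits: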

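1. The indexing code is simpler, smaller, and easier to understand, because fault tolerance, data distribution, and parallelization are already handled inside the MapReduce library.

2. Conceptually unrelated computations can be kept separate instead of being mixed together to avoid extra passes over the data. This makes it much easier to change the indexing process.

3. The indexing process is easier to operate, because most problems are caused by machine failures, slow machines, and network hiccups, and these are already dealt with automatically by the MapReduce library. It is also easy to improve the performance of the indexing process by adding more machines.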

Related work

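Many systems provide restricted programming models and use the restrictions to parallelize the computation automatically. For example, an associative function can be computed over all prefixes of an N-element array in log N time on N processors. MapReduce can be seen as a simplification and distillation of some of these models, based on our experience with large real-world computations. More significantly, we provide a fault-tolerant implementation. In contrast, most parallel processing systems have only been implemented at smaller scales and leave the details of handling machine failures to the programmer.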
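The parallel prefix example mentioned above can be illustrated with a sequential simulation. The sketch below uses the Hillis-Steele scan, which is one well-known way to do this and not something the paper prescribes; each pass of the inner loop consists of independent updates that N processors could perform at the same time, and the number of rounds is about log2 N. The name `prefix_scan` is made up for this example.

```python
import operator

def prefix_scan(values, op=operator.add):
    """Inclusive prefix scan in ceil(log2 N) rounds of independent updates."""
    result = list(values)
    n = len(result)
    step = 1
    rounds = 0
    while step < n:
        prev = result[:]               # every update reads the previous round
        for i in range(step, n):       # each i could run on its own processor
            result[i] = op(prev[i - step], prev[i])
        step *= 2
        rounds += 1
    return result, rounds

sums, rounds = prefix_scan([1, 2, 3, 4, 5, 6, 7, 8])
```

For eight elements the scan needs three rounds and yields the running sums 1, 3, 6, 10, 15, 21, 28, 36. Any associative `op` works, since the operands are always combined in their original order.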

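Bulk Synchronous Programming and some MPI primitives provide higher-level abstractions that make it easier for programmers to write parallel programs. A key difference between these systems and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault tolerance.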

Our locality optimization ...

The backup task mechanism ...

Conclusions

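The MapReduce model has been used successfully for many purposes at Google. We attribute this success to several reasons. First, the model is easy to use, even for programmers with no experience of parallel or distributed systems, because so many details are hidden inside the MapReduce library. Second, a wide variety of problems can be expressed easily as MapReduce computations. For example, MapReduce is used to generate the data for the Google web search service, for sorting, for data mining, and for many other systems. Third, we developed an implementation of MapReduce that runs on clusters of thousands of machines, and this implementation lets us handle the problems we face more efficiently.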

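We learned several things from this work. First, restricting the programming model makes it easy to parallelize and distribute computations, and to make them fault-tolerant. Second, many of the optimizations aim to reduce the amount of data sent across the network. Third, redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.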
