EfficientProcessingofImageProcessing ApplicationsonCPU/GPU · 2020. 10. 10. · GPUs(graphicalprocessingunits)arebecomingpopularto exploitdata-levelparallelism[1]inembarrassinglyparallel

Research ArticleEfficient Processing of Image ProcessingApplications on CPUGPU

Najia Naz1 Abdul Haseeb Malik1 Abu Bakar Khurshid1 Furqan Aziz2 Bader Alouffi3

M Irfan Uddin 4 and Ahmed AlGhamdi5

1Department of Computer Science University of Peshawar Peshawar Pakistan2Department of Computer Science Institute of Management Sciences Peshawar Pakistan3Department of Computer Science College of Computers and Information Technology Taif University Taif 21944 Saudi Arabia4Institute of Computing Kohat University of Science and Technology Kohat Pakistan5Department of Computer Engineering College of Computers and Information Technology Taif University Taif 21944Saudi Arabia

Correspondence should be addressed to M Irfan Uddin irfanuddinkustedupk

Received 16 July 2020 Accepted 27 September 2020 Published 10 October 2020

Academic Editor Jia-Bao Liu

Copyright copy 2020 Najia Naz et al is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Heterogeneous systems have gained popularity due to the rapid growth in data and the need for processing this big data toextract useful information In recent years many healthcare applications have been developed which use machine learningalgorithms to perform tasks such as image classification object detection image segmentation and instance segmentatione increasing amount of big visual data requires images to be processed efficiently It is common that we use heterogeneoussystems for such type of applications as processing a huge number of images on a single PC may take months of com-putation In heterogeneous systems data are distributed on different nodes in the system However heterogeneous systemsdo not distribute images based on the computing capabilities of different types of processors in the node therefore a slowprocessor may take much longer to process an image compared to a faster processor is imbalanced workload distributionobserved in heterogeneous systems for image processing applications is the main cause of inefficient execution In this paperan efficient workload distribution mechanism for image processing applications is introduced e proposed approachconsists of two phases In the first phase image data are divided into an ideal split size and distributed amongst nodes and inthe second phase image data are further distributed between CPU and GPU according to their computation speeds Javabindings for OpenCL are used to configure both the CPU and GPU to execute the program e results have demonstratedthat the proposed workload distribution policy efficiently distributes the images in a heterogeneous system for imageprocessing applications and achieves 50 improvements compared to the current state-of-the-artprogramming frameworks

1 Introduction

GPUs (graphical processing units) are becoming popular toexploit data-level parallelism [1] in embarrassingly parallelapplications [2] because of the SIMD (single instructionmultiple data) [3 4] architecture However task-level par-allelism [5ndash7] is better exploited in general-purpose pro-cessors because of MIMD (multiple instruction multipledata) [8] architecture More recently general-purposeprocessors are combined with GPUs and provide general-

purpose computation accelerated with GPU (commonlyreferred as GPGPU) [9] A heterogeneous cluster has manyCPUs and GPUs and can exploit both task-level and data-level parallelism in applications

e amount of data generated in the form of images andvideos is enormous because of the fact that many surveil-lance cameras smart phones and many other devices thatcapture imagesvideos are installedused everywhere esedevices are constantly recording scenes and can be processedfor different types of computer vision image processing

HindawiMathematical Problems in EngineeringVolume 2020 Article ID 4839876 14 pageshttpsdoiorg10115520204839876

machine learning and data science tasks Images inherentlyhave data-level parallelism [1 10] (ie individual images canbe processed in embarrassingly parallel fashion) because allpixels can be processed independently and therefore GPUsare commonly used for image processing applications Aheterogeneous cluster (ie a cluster containing CPUs andGPUs) can exploit both task-level (across multiple images)and data-level parallelism in images GPUs enable hetero-geneous clusters to accelerate the processing intensive op-erations of images in big data and machine learningapplications

Data processing applications are commonly processed ina cloud environment using the MapReduce [11] parallelprocessing model for efficient execution e scheduler ofthe MapReduce influences the performance in differentapplications when utilized in a heterogeneous cluster edynamic nature of the cluster and the computing workloadaffect the execution time of the application Data locality isan essential part to reduce the total application executiontime and hence improve the overall throughput

In current technology it is very challenging to provide amechanism that can efficiently utilize the computing re-sources in a heterogeneous cluster [12] Traditionally ap-plications that use both CPUs and GPUs are created usinglow level programming languages Limited support isavailable to programmers to efficiently exploit parallelism inthese clusters A heterogeneous cluster normally has nodesof different computational capabilities [13] For instance anode may have 4 CPUs and 1 GPU and another node mayhave 16 CPUs and 8 GPUs erefore static distribution ofworkload amongst these nodes without taking into accounttheir processing capabilities is not justified [14] as one nodewill be overloaded and the other node will remain idle formost of the time [15] Load balancing can be done dy-namically but it requires extra overheads at runtime [16]

In this paper we present a new technique for the efficientdistribution of images in a heterogeneous cluster e goal isto maximize the utilization of the processing resources[17ndash19] (ie both CPUs and GPUs) and throughput Weprovide a programming framework that ensures efficientworkload distribution amongst the nodes by dividing thedata into equal size splits and then distribute the split databetween CPU and GPU cores based on their computationalcapabilities [20] e aim is to achieve maximum resourceutilization gain high performance by dividing the data intoequal size splits allocate the data locally to the computingunits and minimize data migration to GPUs [21]

e rest of the paper is organized as follows Relatedstudies are given in Section 2 In Section 3 we provide abackground to existing work that use CPU and GPU in-tegration to efficiently solve different problems We alsoprovide a brief overview of the Hadoop programmingframework as this framework is used to test the ideaspresented in this paper We highlight limitations in existingstate-of-the-art frameworks and explain the problems due towhich efficient utilization of computing resources is notpossible e details of the proposed framework are given inSection 4 and demonstration of experiments is given inSection 5 We conclude the paper in Section 6

2 Related Work

In [22] image processing for face detection and tracking isperformed using CPU and GPU integration in Hadoopframework and has improved the performance by 25 Inanother research [23] GPU-based Hadoop framework is usedfor evaluation of Canny Edge Detection algorithm using thedefault scheduler of Hadoop for workload distribution andhas demonstrated two times performance improvement thanHIPI-based image processing [24] In [25 26] face detectionin video frames is performed by using CUDA-based Hadoopframework It is observed that actual data processing takes55 of the processor time and the remaining 45 is wastedas idle or busy in performing other management activities

Another framework named SEIP (system for efficientimage processing) [27] ensured high performance for imageprocessing application by applying in-node pipelineframework However during the processing it is observedthat the number of loadstore to the GPU is equal to thenumber of images which results into overhead and per-formance loss in case of processing a large number of smallimages and also there is no policy of workload distributionbetween CPU and GPU In [28] an integrated frameworkbased on Hadoop and GPU has been developed for pro-cessing massive amount of satellite images In order toachieve application performance image data are split intoparts and then each part is allocated to processing units butthere is no support for efficient workload distribution be-tween CPU and GPU

An energy efficient runtime mapping and thread par-titioning approach has been developed for distribution ofconcurrent OpenCL application between CPU and GPUcores and has demonstrated a 32 increase in the systemperformance [29] e feature extraction algorithm SIFT[30] has been developed in OpenCL that distributesworkload on CPU and GPU It is demonstrated that featureswere extracted with more than 30 FPS (frames per second)on full HD images and an average speed up of 269 wasachieved Another OpenCL-based framework single nodevertically scaled system is developed in [31] where multipleGPUs are combined together and treated as a single com-puting device It automatically distributes OpenCL kernelwritten for a single GPU into multiple CUDA kernels atruntime that is executed on multiple (eight) GPUs eexperiment was performed by combining 8 GPUs in a singlenode environment and the performance speedup of 71x wasachieved as compared to the performance of single GPU Anaverage overhead of 048 is reported

In [32] an algorithm is proposed that improves theefficiency of Hadoop clusters e experiments demon-strated that if a process is defined that can handle differentuse-case scenarios the overall cost of computing can bereduced and get benefits from distributed system for fastexecutions A reinforcement learning-based MapReducescheduler is proposed in [33] for heterogeneous environ-ment e system observes the state of task execution andsuggests speculative execution of slow tasks to other freenodes in the cluster for faster execution e proposedapproach does not need any prior knowledge of the

2 Mathematical Problems in Engineering

environment and adapts itself to the heterogeneous envi-ronmente experiments demonstrate that over a few runsthe system can better map the tasks to the available resourcesin a heterogeneous cluster and hence improve the overallperformance of the system e workload partition and taskgranularity for a given application based on machinelearning techniques are given in [34] e machine learningmodel can train a predictive model off-line and then thetrained model can predict the data partition and taskgranularity for any program at runtime e experimentsdemonstrate a 16times average speedup using a single MICOther studies that involve the improvement in parallelcomputation are given in [35ndash38]

In all techniques presented in this section the objective isto improve the performance of execution in a heterogeneousenvironment e current state-of-the-art techniquesdemonstrate that imbalanced distribution of tasks withoutconsidering the underlying computational capabilities re-sults into inefficient execution of the applications eproposed technique in this paper considers the underlyingcomputational capability of the processor and then assignstasks is results into performance improvement as dem-onstrated in Section 5 To the best of our knowledge noprevious studies have addressed the problem in the sameperspective as undertaken by this research

3 Background

In this section existing frameworks are explained along withthe limitations

31 Programming Frameworks Different programmingframeworks such as Hadoop [39] FastFlow [40] OpenMP[41] pthreads [42] OpenCL [43] DirectCompute [44]OpenGL [45] MapReduce [46] and Spark [47] have beendeveloped for the efficient data processing in a heteroge-neous environment Hadoop is becoming very popularbecause it can efficiently process structured [48] semi-structured [49] and unstructured [50] data Different imageprocessing applications such as face and motion detection[51] face tracking [52] extracting text from video frames inan online lecture video [53] video processing for surveil-lance [54] and content-based image retrieval (CBIR) [55]have demonstrated that Hadoop can be efficiently used forimage-based applications

CUDA (compute unified device architecture) [56] is themost commonly used programming language for GPUsdeveloped by Nvidia CUDA integrated with Hadoop en-hances the application throughput by using the distributedcomputing capability of Hadoop and parallel processingcapability of GPU [57] Mars framework [58] which hasbeen used for processing of web documents (searches andlogs) was the first framework which combined GPU withHadoop Some other popular frameworks that integrateGPUs with Hadoop are MAPCG [59] StreamMR [60] andGPMR [61] However these frameworks are developed forsome specific projects and do not improve the performance

of image processing applications in Hadoop Hadoop Imageprocessing interface (HIPI) has been developed that canefficiently process a massive amount of small sized imagesbut has no support of GPU [24]

A heterogeneous Hadooop cluster with nodes equippedwith GPUs is shown in Figure 1 Hadoop is one of thefamous and easy to use platforms which is a loosely coupledarchitecture and provides a distributed environmentHadoop consists of the Hadoop distributed file system(HDFS) [62] and MapReduce [46] programming modelHDFS is an open-source implementation of the Google FileSystem (GFS) It stores the data on different data nodes in acluster HDFS depends on the mechanism of the master-slave architecture e access permission and data service tothe slave nodes are provided by master node also known asname node while the slaves known as data nodes are usedas storage for the HDFS Large files are handled efficiently bydividing them into chunks and then distributed amongstmultiple data nodes On each node to process their localcopies amap processing job is locatede function of namenode is only to keep record of the metadata and log in-formation while Hadoop API is used for the transfer of datato and from HDFS MapReduce is an enhanced approachwhich provides an abstraction for data synchronization loadbalancing and dynamically allocation of tasks to differentcomputing units in a reliable manner

32 Limitations in Existing Systems Some issues that lead toimbalanced workload distribution in a heterogeneous en-vironment are discussed below

321 Data Locality Data locality means that the mapperand data are located on the same node If data and mapperare on the same node than it is easy for a mapper to effi-ciently map the data for computation but if the data are on adifferent node than the mapper then the mapper have toload data from different node over the network to be dis-tributed Suppose there are 50 mappers which try to copydata from other data nodes simultaneously is situationleads to high network congestion which is not desirablebecause the overall performance of the application is af-fectede situation where mapper and data are on the samenode is shown in Figure 2(a) and the situation wheremapper and data are on different nodes is shown inFigure 2(b) For efficient processing of applications aprogramming framework should be able to ensure datalocality When the data stored in HDFS (Hadoop distributedfile system) is distributed amongst the nodes data localityneeds to be handled very carefully e data are divided intosplits and each split is provided to the data node in thecluster for processing e MapReduce job is executed tomap splits to individual mapper that will process theassigned split at means that moving the computationcloser to the data is better than moving the data closer to thecomputation Hence good data locality means good ap-plication performance

Mathematical Problems in Engineering 3

322 Split Size For efficient execution of programs inheterogeneous cluster the programming framework shouldbe able to distribute the data evenly into splits and thendistribute amongst the available nodes One of the maincharacteristics of MapReduce is to divide the whole data intochunksinput splits according to the block size of HDFS Asby default Hadoop block size is 64MB and the issue of datalocality arises when the input split size is larger than theblock size For better performance input split size should beequal to or less than the block size of HDFS

Block size and split size are not the same terms as theblock size is the physical chunk of data stored in diskwhereas input split size is the logical chunk of data withpointers for start and end locations in a blockWhen the splitsize is more or very small then the default block size thenuneven distribution of data happens which leads to issues indata locality and memory wastage

In Figure 3 the scenario of uneven split size is high-lighted Suppose the block size is 64MB and each split size is50MB e first split will easily fit in block 1 but the secondsplit starts after the first split ending point and will not fullyfit in the block 1 so the remaining part of the second splitwill be partially stored in block 1 and partially stored in block2 When the mapper is assigned to block 1 it reads the firstsplit it will not read the second split data as it is not fullyfitted in block 1 and cannot generate any final result of thesecond split data According to [39 63 64] as quoted fromthe book ldquoe logical records that FileInputFormats definedo not usually fit neatly into HDFS blocks For example a

TextInputFormatrsquos logical records are lines which will crossHDFS boundaries more often than not is has no bearingon the functioning of your programmdashlines are not missedor broken for examplemdashbut itrsquos worth knowing about asit does mean that data-local maps (that is maps that arerunning on the same host as their input data) will performsome remote reads e slight overhead this causes is notnormally significantrdquo If a recordfile span across theHDFS boundaries of two nodes then one of the nodes willperform some remote reads to fetch the missing pieceAnd it will read the data and generate final results but withthe overhead of communication between the two nodesHence the communication overhead arises because thedata are not evenly distributed according to the block sizeand most of the time mapper waits for other mappers togenerate the result and then to synchronize with eachother for final result is problem can be solved byarranging the whole data into ideal input split size Bydividing the data into equal and suitable split size that isless than or equal to the default block size the mapper ofeach block will read its data easily and will not wait forother mapper to send data of the split that is partiallystored in different blocks

323 Data Migration and Inefficient Resource UtilizationData migration is the process of transferring data from onenode or processor to another node or processor as shown inFigure 4 Data migration between systems is usually per-formed programmatically to achieve better performance butin heterogeneous systems where a node contains CPU andGPU the GPU being the faster computing processor willcomplete its task quickly and will fetch the data from CPU aslow computing processor Due to this data migration thescheduler will always be busy in managing the tasksscheduling the GPU will be idle and hence performance inapplications is affected

Above are some of the problems that need to be tackledwhile using heterogeneous systems so that applicationperformance can be increased In this paper we will inte-grate CUDA with Hadoop to increase the performance inprocessing images in heterogeneous clusters We will inte-grate the Hadoop platform that is used for distributedprocessing on clusters with libraries that allow code to beexecuted on GPUs

4 Efficient Workload Distribution Based onProcessor Speed

e issues of data locality input split size data migrationand inefficient resource utilization discussed in Section 32lead to imbalanced workload distribution in heterogeneousenvironment which results into inefficient execution ofapplications e proposed framework will distribute thedata in a balanced form amongst the nodes according totheir computing capabilities as shown in Figure 5 edistribution of data in the framework consists of two phasesIn Phase I the data are distributed amongst the nodes and inPhase II the workload is equally distributed between CPU

Master 1 (active)

Daemons running on slavenode 1

Data nodeNodemanager



Node 1

Node 3

GPU

GPU

CPU



Node 4

GPU CPU


NODE 5

GPU CPU

CPU

Master 2 (stand by)


Node 2

GPU CPU

Figure 1 Heterogeneous Hadoop cluster with nodes equippedwith GPUs


and GPU based on their computing capability Both phasesare explained below in detail

41 Phase I Data Distribution amongst Nodes In Hadoopworkload is organized in splits which are then distributedamongst nodes in the cluster to be processed However thisworkload is not evenly distributed and hence processorswith lower processing capabilities are overwhelmed In theproposed framework the input is in the form of images and

is distributed evenly amongst cluster nodes in order to utilizecomputation and memory resources efficiently We havedeveloped a novel distribution policy for the even distri-bution of same size images in splits where a split containsone or more images We are focusing on same sized imagesonly in order to avoid communication overhead whenimages of different sizes are loaded With images of differentsizes the ratio calculations explained in Section 42 areuseless is idea is mainly inspired from arrays wherecontinuous blocks of the same size are gathered together

Node 1

Map and data

Node 3

Node N

(a)

Node 1

Map

Node 3

Node N

Node 4

Data

(b)

Figure 2 (a) Data and Mapper on the same node (b) Data and Mapper on different nodes

50MBsplit size

64MB block size 64MB block size 64MB block size 64MB block size

50MBsplit size

50MBsplit size

50MBsplit size

50MBsplit size

Figure 3 Mismatch between split size and block size


providing simplicity and performance e distributionpolicy does not allow a single image to be distributed tomultiple splits because of data locality issue e ideal splitsize is set according to the size of the images so that an imagedoes not exceed the boundary of that split Multiple imagesevenly grouped together in a split and ready to be distributedamongst nodes is shown in Figure 6 and the ideal split size iscalculated as per the default block size (ie 64MB inHadoop) as shown in Figure 7

To avoid the problem of uneven splits ie when animage is distributed in multiple splits the split size is set verycarefully based on the image sizeWe first measure the size ofone image and then select an input split size where multipleimages of that size will be placed Let I be the ideal input splitsize which need to be computed d be the default split size inHadoop s be the size of one of the input images no be thenumber of images that can be fully accommodated by thedefault input split Ti be the total number of images in adataset and Sn be the total number of splits in which the dataare divided equally no is calculated by dividing d on s ig-noring the fractional part by taking the floor I is computedby multiplying the image size s with no and Sn is computedby dividing Ti on no as shown in equation (1) is equationcalculates an ideal input split size and no image can occuracross two input splits For instance we have an input imageof size 42MB (s) and 64MB is the default input size (d) and

then the ideal input split size I 15 times 42 63MB(n0

lfloor6442rfloor 15) To calculate the number of splits the datashould be arranged in Sn 9015 6 where 90 is the totalnumber of input images So we will have a total of 6 inputsplits each of size 63MB to store 90 input images

n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)

42 Phase II Distribution of Workload on CPU and GPUwithin a Node In a heterogeneous cluster every node isequipped with a GPU which is much faster than the CPUerefore for efficient execution of applications it is im-portant that tasks are distributed based on the computingcapabilities of processors e proposed workload distri-bution scheme for heterogeneous Hadoop cluster is shownin Figure 8 A split is a container of a group of images of thesame sizes For every split the map function is invokedwhich takes the inputlt key valuegt pair where the keycontains log file of images in a split and the value contains

CPU

Images Images Images

GPU


Images

Images

Number of imagesallocated to CPU

Number of images allocated to GPU

Imageaccessing

byGPU

from CPU

Data migration

Figure 4 Data migration between CPU and GPU


contents of images (bytes) e map function processes thesplit and reads each image in the split In the proposedalgorithm the map function takes the split and checks theratio for all the available images in that split so that imagescan be assigned to CPU and GPU according to theircomputing capability Before assigning images to the CPUand GPU a sample image is executed on both the CPU andGPU to find out the execution time of both the processors ona specified algorithm From this execution time the pro-cessing power of each processor is identified and a ratio iscalculated demonstrating the number of images assigned toCPU and GPU

In order to compute the computing capability of pro-cessors in a node we initially assign a raw image to bothCPU and GPU to find the execution time of both processorsLet c be the execution time in CPU and g be the executiontime in GPU to execute an image processing algorithm on araw input image We take the ratio of max(g c) and min(gc) e device with larger execution time (ie slower

processor) is assigned a value 1 and the device with smallerexecution time (ie faster processor) is assigned an integervalue of the execution time of slow processor divided byexecution time of fast processor When the processor fetchesimages we assign nPx number of images to the slowerprocessor and nPy images to the faster processor as shown inthe following equation

x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data

Data division into equal size splits

HDFS

GPU

Idealsplit size

CPU


Number of imagesallocated to GPU

Images Images Images Images

Images

Data distribution with-in thenode

Data distribution among thenodes

Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size

Figure 5 Proposed framework for workload distribution


Suppose n0 is 15 (as explained above) GPU executiontime (g) for a raw image is 25ms and CPU execution (c) is160ms In this example x 1 and y 16025 6 whichmeans that CPU and GPU are assigned images in the ratio of1 6 erefore nPx (1(1 + 6))times 15 2 and nPy (6(6 + 1))times 1513 ie in a single fetch 2 images are assignedto CPU and 13 images are assigned to GPU is noveldistribution of workload ensures that images are distributedbased on the computing capability of the processor whichcan speedup the execution of images in heterogeneousnodese flow chart of the proposed framework is shown inFigure 9

5 Evaluation

In Section 4 the proposed framework is introduced toefficiently handle the imbalanced workload distribution in aheterogeneous Hadoop cluster for image processing appli-cations and the implementation of such policy is evaluatedby using the commodity computer systems accelerated withGPU As the heterogeneous systems increase the

performance of applications by processing amassive amountof data in parallel in the future these heterogeneous systemswill be commonly used and provide efficient execution ofprograms by adopting policies for efficient workload dis-tribution In this section the evaluation of the proposedprogramming framework is discussed e detail of thedataset used for the evaluation is given in Table 1 e testenvironment includes a master node having a corei5 pro-cessor with speed 25GHz 8GB RAM and 4 processingcores Two worker nodes are used each having a CPUcorei3 18GHz 4GB RAM and 4 cores and a GPUNVIDIA802M 64 cores and 2GB RAM

51 Processing Images of Different Sizes e Figure 10(a)shows the average execution time calculated in millisecondsfor edge detection algorithm on four images of different sizeson a CPU and GPU respectively is experiment dem-onstrates that on CPU execution time of the applicationincreases when the size of the image is increased Howeveron GPU the increase in execution time when processing an

Input data

Split 1 Split 2 Split 3 Split 4 Split 5

Figure 6 Grouping of input data to be partitioned in splits

63MBsplit size

64MB block size 64MB block size 64MB block size 64MB block size 64MB block size

63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size

Figure 7 Image data partitioning into ideal split size


application on a very large image is not significant isanalysis demonstrates that GPU is a better choice for pro-cessing larger images

52 Processing Different Number of Images In Figure 10(b)the performance of the application in processing differentnumber of images on a GPU is shown Y-axis shows exe-cution time in millisecond and x-axis shows the number ofimages of different sizes Four images with different reso-lution sizes are grouped together as 1 image 2 images 3images and 4 images For instance when the applicationprocesses a single image of size 1024times 768 the executiontime on GPU is 42milliseconds But in processing fourimages of the same size the GPU takes 51millisecondsSimilarly a single image of size 2560times1440 takes 69ms butprocessing four images of the same size takes 87ms isincrease in execution time is mainly because of the com-munication overhead as each image is migrated from CPUmemory to GPU memory and then the result is written backto the CPU memory

53 Performance Comparison We compare the perfor-mance of our approach ie modified split sizes and opti-mized workload distribution based on the computationcapabilities of processors in a node of heterogeneous systemwith existing state-of-the-art techniques such as HIPI HIPIexecuted on GPU and Hadoop executed on GPU as shownin Figure 10(c) e proposed framework is processing

9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU

Figure 8 Distribution of workload within a node between CPU and GPU according to their computing capabilities

Data division intosplits

Ratio calculation

HDFSfor data storage

Data distributionbetween CPU

and GPU

Output

Operationperformed

Dec

ision

-mak

ing

Output written back to HDFS

Input data

Figure 9 Basic flowchart of the proposed framework


Table 1 Information of images in the dataset

S no Image resolution Total size (MB) Total images Individual image size (MB)1 1024times 768 216 90 242 1600times 900 387 90 433 1920times1080 558 90 624 2560times1440 999 90 111

8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100

1 image 2 images 3 images 4 imagesNumber of images

1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms

Hadoop + GPU Proposedapproach

(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)

Figure 10e execution time of CPU and GPU different number of images and average performance comparison with existing platforms(a) Average execution time in CPU and GPU with different number of images for edge detection algorithm (b) Average execution time ofapplication in processing different number of images on a GPU (c) Average execution time (milliseconds) for proposed framework(Hadoop+GPU) vs existing platforms


images faster than the other state-of-the-art frameworks Forinstance processing 90 images of resolutions 2560times1440each of size 111MB takes 686ms in HIPI but only 191ms inthe proposed framework ie a speedup of 35 is achievedis speedup is possible mainly because of the efficientworkload distribution and the efficient arrangement ofimages in input splits which results into the efficient utili-zation of resources

54 Effect of Assigning Different Loads to CPU and GPUIn the proposed approach we have developed a noveltechnique to compute the ratio of images to be assigned toCPU and GPU in a heterogeneous nodee performance ofthis calculation is shown in Figure 11(a) along with com-parison with other state-of-the-art techniques It is observedthat application processing the dataset of different images ofdifferent sizes executes efficiently on the proposed frame-work For instance the speedup in the proposed approach is

212x compared to HIPI Hence it is proved that by efficientload balancing techniques in heterogeneous systems we canincrease the performance by two times

55 Execution Time of Each Split Input data are divided indifferent input splits and distributed amongst nodes andfrom each split images are accessed and processed eaverage execution time taken by images of sizes (1024times 7681600times 900 1920times1080 and 2560times1440) in an input split isshown in Figure 11(b) for all the four platforms whileprocessing edge detection application It is demonstratedthat the proposed framework processes the images morethan two and half times faster than HIPI

56 Execution Time in Processing the Whole Dataset InFigure 11(c) an average of total execution time for the wholedataset shown in Table 1 is calculated for image processing

9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)

Figure 11 (a) Execution time of a split on a node (b) Execution time calculated for each split (c) Execution time of image dataset processedby edge detection application (d) Execution time of a single image from split to processor


application performing edge detection operation e totalexecution time includes partitioning of image data intosplits distribution amongst the nodes and on each nodefurther distribution between CPU and GPU and afterprocessing the result is written back to the HDFS From thefigure it is clearly shown that by adopting the proposedapproach (Hadoop +GPU) where data is divided into idealsplit size and then distributed to the processors according totheir computing capabilities the total execution time cal-culated in milliseconds is two times less than the otherexisting platforms

57 Minimization of the Communication Overhead by IdealSplit Size Selection e communication overhead of imagesduring processing is shown in Figure 11(d) e access timeof an image in this figure includes the time from split to theassigned processor e execution time taken by a singleimage in recorded From the experiment it is observed thatby arranging the data local to the computing processor theimages can be easily accessed and processed As HIPI has anoverhead of data compression and decompression there-fore the results of both HIPI and HIPI +GPU are relativelysame and high By using Hadoop +GPU the image accesstime is less compared to the HIPI and HIPI +GPU but byovercoming the data locality issue in the proposed frame-work (Hadoop +GPU) the image access time is almost twotimes reduced

6 Conclusion and Future Work

Distributed systems provide parallel frameworks wheremassive amount of data can be efficiently processed Hadoopis one of those frameworks that provides large amount ofdata storage and effective computational capability to handlemassive amount of data in a parallel and distributed mannerby using the cluster of commodity computer systems eseclusters also contain GPUs as they are specialized to handleSIMD efficiently For image processing applications or ap-plications involving the process of big data in differentdomains such as healthcare heterogeneous systems arecommonly used and have shown improvement over singleprocessor and distributed systems However imbalancedworkload distribution between the processor causes datalocality and inefficient workload distribution between slowand fast processors can affect the performance of applica-tions To deal with this imbalanced workload distribution ina heterogeneous cluster this paper has proposed a noveltechnique of dividing data into ideal input splits so that animage is included in one split and does not exceed theboundary of that split is paper also introduces a tech-nique of distributing data as per the computing power of theprocessors in the node is distribution maximizes theutilization of available resources and tackles the issue of datamigration between the fast and slow computing processorse results have demonstrated that the proposed frameworkachieves almost two times improvement compared to thecurrent state-of-the-art programming frameworks eproposed framework provides an efficient mechanism to

compute an ideal split size that is suitable for fixed sizeimages but does not provide support for partitioning var-iable size images into splits In the future we will investigatetechniques that can compute ideal split size for variable sizedimages In the future we will investigate techniques that cancompute ideal split size for variable sized images which willenable us to process images from sources e proposedmodel can be extended for big data applications which haveinherent data-level parallelism in the form of arrays ma-trices images or tables

Data Availability

e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] L Baumstark and L Wills ldquoExposing data-level parallelism insequential image processing algorithmsrdquo in Proceedings of theNinth Working Conference on Reverse Engineering pp 245ndash254 New York NY USA 2002

[2] B Vinter ldquoEmbarrassingly parallel applications on a javaclusterrdquo in Proceedings of the 8th International Conference onHigh-Performance Computing and Networking HPCN Europe2000 Springer-Verlag London UK pp 614ndash617 2000

[3] B Furht SIMD (Single Instruction Multiple Data Processing)Springer US Boston MA USA 2008

[4] I Uddin ldquoMultiple levels of abstractions in the simulation ofmicrothreaded many-core architecturesrdquo Open Journal ofModelling and Simulation vol 3 2015

[5] M Girkar and C D Polychronopoulos ldquoExtracting task-levelparallelismrdquo ACM Transactions on Programming Languagesand Systems vol 17 pp 600ndash634 1995

[6] I Uddin ldquoHigh-level simulation of concurrency operations inmicrothreaded many-core architecturesrdquo GSTF Journal onComputing (JoC) vol 4 p 21 2015

[7] I Uddin ldquoOne-IPC high-level simulation of microthreadedmany-core architecturesrdquo International Journal of HighPerformance Computing Applications vol 4 2015

[8] R Riesen and A B Maccabe MIMD (Multiple InstructionMultiple Data) Machines Springer US Boston MA USA2011

[9] L Hu X Che and S-Q Zheng ldquoA closer look at GPGPUrdquoACM Computing Surveys vol 48 2016

[10] M Fatima ldquoReliable and energy efficient mac mechanism forpatient monitoring in hospitalsrdquo International Journal ofAdvanced Computer Science and Applications vol 9 no 102018

[11] J V Bibal Benifa ldquoPerformance improvement of Mapreducefor heterogeneous clusters based on efficient locality andreplica aware scheduling (ELRAS) strategyrdquoWireless PersonalCommunications vol 95 no 3 pp 2709ndash2733 2017

[12] N Elgendy and A Elragal ldquoBig data analytics a literaturereview paperrdquo in Advances in Data Mining Applications andBeoretical Aspects P Perner Ed pp 214ndash227 SpringerInternational Publishing Berlin Germany 2014


[13] A Khan M A Gul M I Uddin et al ldquoSummarizing onlinemovie reviews a machine learning approach to big dataanalyticsrdquo Scientific Programming vol 2020 pp 1ndash14 2020

[14] F Aziz H Gul I Muhammad and I Uddin ldquoLink predictionusing node information on local pathsrdquo Physica A StatisticalMechanics and its Applications vol 2020 Article ID 1249802020

[15] J Zhu H Jiang J Li E Hardesty K-C Li and Z LildquoEmbedding GPU computations in hadooprdquo InternationalJournal of Networked and Distributed Computing vol 2pp 211ndash220 2017

[16] W Chen S Xu H Jiang et al ldquoGPU computations onhadoop clusters for massive data processingrdquo in Proceedingsof the 3rd International Conference on Intelligent Technologiesand Engineering Systems (ICITES2014) J Juang Ed SpringerInternational Publishing Berlin Germany pp 515ndash521 2016

[17] S Amin M I Uddin S Hassan et al ldquoRecurrent neuralnetworks with TF-IDF embedding technique for detectionand classification in tweets of dengue diseaserdquo IEEE Accessvol 8 pp 131522ndash131533 2020

[18] M I Uddin S A A Shah andM A Al-Khasawneh ldquoA noveldeep convolutional neural network model to monitor peoplefollowing guidelines to avoid COVID-19rdquo Journal of Sensorsvol 2020 pp 1ndash15 2020

[19] Z Ullah A Zeb I Ullah et al ldquoCertificateless proxy reen-cryption scheme (CPRES) based on hyperelliptic curve foraccess control in content-centric network (CCN)rdquo MobileInformation Systems vol 2020 pp 1ndash13 2020

[20] A Khan I Ibrahim M I Uddin et al ldquoMachine learningapproach for answer detection in discussion forums an ap-plication of big data analyticsrdquo Scientific Programmingvol 2020 2020

[21] F Aziz T Ahmad A H Malik M I Uddin S Ahmad andM Sharaf ldquoReversible data hiding techniques with highmessage embedding capacity in imagesrdquo PLoS One vol 15no 1ndash24 2020

[22] B Sharma Rota N Vydyanathan and A Kale ldquoTowards arobust real-time face processing system using CUDA-enabledGPUSrdquo in Proceedings of the 2009 International Conference onHigh Performance Computing (HiPC) pp 368ndash377 NewYork NY USA 2009

[23] H Patel and K Panchal ldquoIncorporating CUDS in hadoopimage processing interface for distributed image processingrdquo inProceedings of the International Journal of Advance Researchand Innovative Ideas pp 93ndash98 New York NY USA 2016

[24] C Sweeney L Liu S Arietta and J Lawrence ldquoHipi ahadoop image processing interface for image-based mapre-duce tasksrdquo 2011

[25] R Malakar and N Vydyanathan ldquoA cuda-enabled hadoopcluster for fast distributed image processingrdquo in Proceedings ofthe 2013 National Conference on Parallel Computing Technol-ogies (PARCOMPTECH) pp 1ndash5 New York NY USA 2013

[26] M I Uddin N Zada F Aziz et al ldquoPrediction of futureterrorist activities using deep neural networksrdquo Complexityvol 2020 pp 1ndash16 2020

[27] T Liu Y Liu Q Li et al ldquoSeip system for efficient imageprocessing on distributed platformrdquo Journal of ComputerScience and Technology vol 30 pp 1215ndash1232 2015

[28] M S Gowda and V G Hulyal ldquoParallel image processingfrom cloud using cuda and hadoop architecture a novelapproachrdquo 2015

[29] A K Singh A Prakash K R Basireddy G V Merrett andB M Al-Hashimi ldquoEnergy-efficient run-time mapping andthread partitioning of concurrent OPENCL applications on

CPU-GPU MPSOCSrdquo ACM Transactions on EmbeddedComputing Systems vol 16 2017

[30] W Burger M J Burge and M J Burge Scale-InvariantFeature Transform (SIFT) Springer London UK 2016

[31] J Kim H Kim J H Lee and J Lee ldquoAchieving a singlecompute device image in OPENCL for multiple GPUSrdquoSIGPLAN Not vol 46 pp 277ndash288 2011

[32] P Dadheech D Goyal S Srivastava and A Kumar ldquoPer-formance improvement of heterogeneous hadoop clustersusing query optimizationrdquo SSRN Electronic Journal vol 462018

[33] N S Naik A Negi and V Sastry ldquoPerformance improve-ment of MapReduce framework in heterogeneous contextusing reinforcement learningrdquo Procedia Computer Sciencevol 50 pp 169ndash175 2015

[34] C Lai Y Chen X Shi M Huang and G Chen ldquoPerformanceimprovement on heterogeneous platforms a machinelearning based approachrdquo in Proceedings of the 2018 Inter-national Conference on Computational Science and Compu-tational Intelligence (CSCI) IEEE Berlin Germany 2018

[35] P Czarnul J Proficz and K Drypczewski ldquoSurvey ofmethodologies approaches and challenges in parallel pro-gramming using high-performance computing systemsrdquoScientific Programming vol 2020 pp 1ndash19 2020

[36] A Rubio-Largo J C Preciado and L Iribarne ldquoData-drivencomputational intelligence for scientific programmingrdquo Sci-entific Programming vol 2019 pp 1ndash4 2019

[37] A Kanan F Gebali A Ibrahim and K F Li ldquoLow-com-plexity scalable architectures for parallel computation ofsimilarity measuresrdquo Scientific Programming vol 2019 2019

[38] W Ma W Yuan and X Hu ldquoImplementation and opti-mization of a CFD solver using overlapped meshes onmultiple MIC coprocessorsrdquo Scientific Programmingvol 2019 pp 1ndash12 2019

[39] T White Hadoop Be Definitive Guide OrsquoReilly Media IncNew York NY USA 1st edition 2009

[40] M Aldinucci M Danelutto P Kilpatrick and M TorquatildquoFastflow high-level and efficient streaming on multi-corerdquoProgramming Multi-core and Many-core Computing Systemsser Parallel and Distributed Computing vol 13 2012

[41] L Dagum and R Menon ldquoOPENMP an industry-standardapi for shared-memory programmingrdquo Computing in Scienceamp Engineering vol 5 pp 46ndash55 1998

[42] B Nichols D Buttlar and J P Farrell Pthreads ProgrammingOrsquoReilly amp Associates Inc Sebastopol CA USA 1996

[43] A Munshi B Gaster T G Mattson J Fung andD Ginsburg OpenCL Programming Guide Addison-WesleyProfessional Boston MA USA 1st edition 2011

[44] J S Harbour D Calkins and R Meuth Multi-Core Pro-gramming with CUDA and OpenCL Course TechnologyPress Boston MA USA 1st edition 2011

[45] D Shreiner and T K O A W Group OpenGL ProgrammingGuide Be Official Guide to Learning OpenGL Versions 30and 31 Addison-Wesley Professional New York NY USA7th edition 2009

[46] J Dean and S Ghemawat ldquoMapreduce simplified dataprocessing on large clustersrdquo Communications of the ACMvol 51 pp 107ndash113 2008

[47] M Zaharia R S Xin P Wendell et al ldquoApache spark aunified engine for big data processingrdquo Communications ofthe ACM vol 59 pp 56ndash65 2016

[48] R V Guha D Brickley and S MacBeth ldquoSchemaorgevolution of structured data on the webrdquo Queue vol 13 p 102015


[49] D Suciu Information Organization and Databases KluwerAcademic Publishers Norwell MA USA 2000

[50] J P Isson Unstructured Data Analytics How to ImproveCustomer Acquisition Customer Retention and Fraud De-tection and Prevention Wiley Publishing New York NYUSA 1st edition 2018

[51] H Tan and L Chen ldquoAn approach for fast and parallel videoprocessing on Apache hadoop clustersrdquo in Proceedings of theIEEE International Conference on Multimedia and Expopp 1ndash6 New York NY USA 2014

[52] C Ryu D Lee M Jang C Kim and E Seo ldquoExtensible videoprocessing framework in Apache hadooprdquo in Proceedings ofthe 2013 IEEE 5th International Conference on Cloud Com-puting Technology and Science pp 305ndash310 New York NYUSA 2013

[53] M Husain A K Sabarad H Hebballi S M Nagaralli andS Shetty ldquoCounting occurrences of textual words in lecturevideo frames using Apache hadoop frameworkrdquo in Pro-ceedings of the 2015 IEEE International Advance ComputingConference (IACC) pp 1144ndash1147 London UK 2015

[54] H Tan and L Chen ldquoAn approach for fast and parallel videoprocessing on Apache hadoop clustersrdquo in Proceedings of the2014 IEEE International Conference on Multimedia and Expo(ICME) pp 1ndash6 London UK 2014

[55] U S N Raju S George V S Praneeth R Deo and P JainldquoContent based image retrieval on hadoop frameworkrdquo inProceedings of the 2015 IEEE International Congress on BigData pp 661ndash664 New Jearsey NJ USA 2015

[56] S Cook CUDA Programming A Developerrsquos Guide to ParallelComputing with GPUs Morgan Kaufmann Publishers IncSan Francisco CA USA 1st edition 2013

[57] Z Wang P Lv and C Zheng ldquoCuda on hadoop a mixedcomputing framework for massive data processingrdquo inFoundations and Practical Applications of Cognitive Systemsand Information Processing F Sun D Hu and H Liu Edspp 253ndash260 Springer Berlin Heidelberg Berlin Heidelberg2014

[58] B He W Fang Q Luo N K Govindaraju and T WangldquoMars a mapreduce framework on graphics processorsrdquo inProceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques PACT rsquo08 ACMNew York NY USA pp 260ndash269 2008

[59] C Hong D ChenW ChenW Zheng andH Lin ldquoMAPCGWriting Parallel Program Portable between CPU and GPUrdquoin Proceedings of the 2010 19th International Conference onParallel Architectures and Compilation Techniques (PACT)pp 217ndash226 New York NY USA 2010

[60] M Elteir H Lin W Feng and T Scogland ldquoStreammr anoptimized mapreduce framework for amd GPUSrdquo in Pro-ceedings of the 2011 IEEE 17th International Conference onParallel and Distributed Systems pp 364ndash371 New York NYUSA 2011

[61] J A Stuart and J D Owens ldquoMulti-GPUmapreduce on GPUclustersrdquo in Proceedings of the 2011 IEEE International Par-allel Distributed Processing Symposium pp 1068ndash1079 BerlinGermany 2011

[62] K Shvachko H Kuang S Radia and R Chansler ldquoehadoop distributed file systemrdquo in Proceedings of the 2010IEEE 26th Symposium on Mass Storage Systems and Tech-nologies (MSST) MSST rsquo10 IEEE Computer Society Wash-ington DC USA pp 1ndash10 2010

[63] I Uddin A Baig and A A Minhas ldquoA controlled envi-ronment model for dealing with smart phone addictionrdquo

International Journal of Advanced Computer Science andApplications vol 9 no 9 2018

[64] S A A Shah I Uddin F Aziz S Ahmad M A Al-Kha-sawneh and M Sharaf ldquoAn enhanced deep neural networkfor predicting workplace absenteeismrdquo Complexity vol 2020pp 1ndash12 2020


machine learning and data science tasks Images inherentlyhave data-level parallelism [1 10] (ie individual images canbe processed in embarrassingly parallel fashion) because allpixels can be processed independently and therefore GPUsare commonly used for image processing applications Aheterogeneous cluster (ie a cluster containing CPUs andGPUs) can exploit both task-level (across multiple images)and data-level parallelism in images GPUs enable hetero-geneous clusters to accelerate the processing intensive op-erations of images in big data and machine learningapplications

Data processing applications are commonly processed ina cloud environment using the MapReduce [11] parallelprocessing model for efficient execution e scheduler ofthe MapReduce influences the performance in differentapplications when utilized in a heterogeneous cluster edynamic nature of the cluster and the computing workloadaffect the execution time of the application Data locality isan essential part to reduce the total application executiontime and hence improve the overall throughput

In current technology it is very challenging to provide amechanism that can efficiently utilize the computing re-sources in a heterogeneous cluster [12] Traditionally ap-plications that use both CPUs and GPUs are created usinglow level programming languages Limited support isavailable to programmers to efficiently exploit parallelism inthese clusters A heterogeneous cluster normally has nodesof different computational capabilities [13] For instance anode may have 4 CPUs and 1 GPU and another node mayhave 16 CPUs and 8 GPUs erefore static distribution ofworkload amongst these nodes without taking into accounttheir processing capabilities is not justified [14] as one nodewill be overloaded and the other node will remain idle formost of the time [15] Load balancing can be done dy-namically but it requires extra overheads at runtime [16]

In this paper we present a new technique for the efficientdistribution of images in a heterogeneous cluster e goal isto maximize the utilization of the processing resources[17ndash19] (ie both CPUs and GPUs) and throughput Weprovide a programming framework that ensures efficientworkload distribution amongst the nodes by dividing thedata into equal size splits and then distribute the split databetween CPU and GPU cores based on their computationalcapabilities [20] e aim is to achieve maximum resourceutilization gain high performance by dividing the data intoequal size splits allocate the data locally to the computingunits and minimize data migration to GPUs [21]

e rest of the paper is organized as follows Relatedstudies are given in Section 2 In Section 3 we provide abackground to existing work that use CPU and GPU in-tegration to efficiently solve different problems We alsoprovide a brief overview of the Hadoop programmingframework as this framework is used to test the ideaspresented in this paper We highlight limitations in existingstate-of-the-art frameworks and explain the problems due towhich efficient utilization of computing resources is notpossible e details of the proposed framework are given inSection 4 and demonstration of experiments is given inSection 5 We conclude the paper in Section 6

2 Related Work

In [22] image processing for face detection and tracking isperformed using CPU and GPU integration in Hadoopframework and has improved the performance by 25 Inanother research [23] GPU-based Hadoop framework is usedfor evaluation of Canny Edge Detection algorithm using thedefault scheduler of Hadoop for workload distribution andhas demonstrated two times performance improvement thanHIPI-based image processing [24] In [25 26] face detectionin video frames is performed by using CUDA-based Hadoopframework It is observed that actual data processing takes55 of the processor time and the remaining 45 is wastedas idle or busy in performing other management activities

Another framework named SEIP (system for efficientimage processing) [27] ensured high performance for imageprocessing application by applying in-node pipelineframework However during the processing it is observedthat the number of loadstore to the GPU is equal to thenumber of images which results into overhead and per-formance loss in case of processing a large number of smallimages and also there is no policy of workload distributionbetween CPU and GPU In [28] an integrated frameworkbased on Hadoop and GPU has been developed for pro-cessing massive amount of satellite images In order toachieve application performance image data are split intoparts and then each part is allocated to processing units butthere is no support for efficient workload distribution be-tween CPU and GPU

An energy efficient runtime mapping and thread par-titioning approach has been developed for distribution ofconcurrent OpenCL application between CPU and GPUcores and has demonstrated a 32 increase in the systemperformance [29] e feature extraction algorithm SIFT[30] has been developed in OpenCL that distributesworkload on CPU and GPU It is demonstrated that featureswere extracted with more than 30 FPS (frames per second)on full HD images and an average speed up of 269 wasachieved Another OpenCL-based framework single nodevertically scaled system is developed in [31] where multipleGPUs are combined together and treated as a single com-puting device It automatically distributes OpenCL kernelwritten for a single GPU into multiple CUDA kernels atruntime that is executed on multiple (eight) GPUs eexperiment was performed by combining 8 GPUs in a singlenode environment and the performance speedup of 71x wasachieved as compared to the performance of single GPU Anaverage overhead of 048 is reported

In [32] an algorithm is proposed that improves theefficiency of Hadoop clusters e experiments demon-strated that if a process is defined that can handle differentuse-case scenarios the overall cost of computing can bereduced and get benefits from distributed system for fastexecutions A reinforcement learning-based MapReducescheduler is proposed in [33] for heterogeneous environ-ment e system observes the state of task execution andsuggests speculative execution of slow tasks to other freenodes in the cluster for faster execution e proposedapproach does not need any prior knowledge of the




3 Background

















Master 1 (active)





Node 1

Node 3

GPU

GPU

CPU



Node 4

GPU CPU


NODE 5

GPU CPU

CPU

Master 2 (stand by)


Node 2

GPU CPU






Node 1

Map and data

Node 3

Node N

(a)

Node 1

Map

Node 3

Node N

Node 4

Data

(b)


50MBsplit size


50MBsplit size

50MBsplit size

50MBsplit size

50MBsplit size







n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)


CPU


GPU


Images

Images



Imageaccessing

byGPU

from CPU

Data migration






x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References








































































3 Background

















Master 1 (active)





Node 1

Node 3

GPU

GPU

CPU



Node 4

GPU CPU


NODE 5

GPU CPU

CPU

Master 2 (stand by)


Node 2

GPU CPU






Node 1

Map and data

Node 3

Node N

(a)

Node 1

Map

Node 3

Node N

Node 4

Data

(b)


50MBsplit size


50MBsplit size

50MBsplit size

50MBsplit size

50MBsplit size







n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)


CPU


GPU


Images

Images



Imageaccessing

byGPU

from CPU

Data migration






x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References














































































Master 1 (active)





Node 1

Node 3

GPU

GPU

CPU



Node 4

GPU CPU


NODE 5

GPU CPU

CPU

Master 2 (stand by)


Node 2

GPU CPU






Node 1

Map and data

Node 3

Node N

(a)

Node 1

Map

Node 3

Node N

Node 4

Data

(b)


50MBsplit size


50MBsplit size

50MBsplit size

50MBsplit size

50MBsplit size







n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)


CPU


GPU


Images

Images



Imageaccessing

byGPU

from CPU

Data migration






x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References









































































Node 1

Map and data

Node 3

Node N

(a)

Node 1

Map

Node 3

Node N

Node 4

Data

(b)


50MBsplit size


50MBsplit size

50MBsplit size

50MBsplit size

50MBsplit size







n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)


CPU


GPU


Images

Images



Imageaccessing

byGPU

from CPU

Data migration






x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References










































































n0 d

s1113888 1113889

I n0 times s

Sn Ti

n0

(1)


CPU


GPU


Images

Images



Imageaccessing

byGPU

from CPU

Data migration






x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References









































































x 1

y max(g c)

min(g c)1113888 1113889

nPx x

x + y1113888 1113889 times n0

nPy y

x + y1113888 1113889 times n0

(2)

Image input data


HDFS

GPU

Idealsplit size

CPU




Images



Image storage

Ratio calculation

Idealsplit size

Idealsplit size

Idealsplit size

Idealsplit size




5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References







































































5 Evaluation




Input data



63MBsplit size


63MBsplit size

63MBsplit size

63MBsplit size

63MBsplit size






9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References









































































9 images in a split

Node

6 images toGPU

3 images toCPU

CPU GPU



Ratio calculation



and GPU

Output

Operationperformed

Dec

ision

-mak

ing


Input data





8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References








































































8142

110

50

236

61

382

69

0

50

100

150

200

250

300

350

400

450

CPU GPU

Ave

rage

exec

utio

n tim

e (m

s)

Platforms

1024 times 7681600 times 900

1920 times 10802560 times 1449

(a)

42 4549 5150

5458 6061 63

677169

7379

87

0

10

20

30

40

50

60

70

80

90

100


1024 times 7681600 times 900

1920 times 10802560 times 1440

Ave

rage

exec

utio

n tim

e (m

s)

(b)

221184

14997

260225

192133

442

371

269

172

686

489

342

191

0

100

200

300

400

500

600

700

800

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

1024 times 7681600 times 900

1920 times 10802560 times 1440

(c)








9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References











































































9984

6840 6776

4690

0

2000

4000

6000

8000

10000

12000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(a)

110

76 75

41

0

20

40

60

80

100

120

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(b)

23419

20270

17354

12036

0

5000

10000

15000

20000

25000

Ave

rage

exec

utio

n tim

e (m

s)

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

(c)

13435 13430

10578

7346

0

2000

4000

6000

8000

10000

12000

14000

16000

HIPI HIPI + GPU

Platforms


(hadoop + GPU)

Ave

rage

exec

utio

n tim

e (m

s)

(d)








Data Availability




References











































































Data Availability




References
















































































































































Documents

EfficientProcessingofImageProcessing ApplicationsonCPU/GPU · 2020. 10. 10. · GPUs(graphicalprocessingunits)arebecomingpopularto exploitdata-levelparallelism[1]inembarrassinglyparallel