Research Article

An Efficient Stream Data Processing Model for Multiuser Cryptographic Service

Li Li,1,2 Fenghua Li,3,4 Guozhen Shi,5 and Kui Geng3

1 College of Communication Engineering, Xidian University, Xi'an 710071, China
2 Department of Electronic and Information Engineering, Beijing Electronics Science and Technology Institute, Beijing 100070, China
3 State Key Laboratory of Information Security, Institute of Information Engineering, CAS, Beijing 100093, China
4 School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
5 Department of Information Security, Beijing Electronic Science and Technology Institute, Beijing 100070, China

Correspondence should be addressed to Fenghua Li; lfh@iie.ac.cn

Received 2 April 2018; Accepted 8 July 2018; Published 31 July 2018

Academic Editor: Jar Ferr Yang

Copyright © 2018 Li Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi, Journal of Electrical and Computer Engineering, Volume 2018, Article ID 3917827, 10 pages, https://doi.org/10.1155/2018/3917827

In view of the demand for high-concurrency massive data encryption and decryption application services in the security field, this paper proposes a dual-channel pipeline parallel data processing model (DPP) according to the characteristics of cryptographic operations and realizes cryptographic operations on cross-data streams with different service requirements in a multiuser environment. By encapsulating cryptographic operation requirements in job packages, the input data flow is divided by the dual-channel mechanism and the parallel scheduling of job packages, which ensures the synchronization between the processing of dependent job packages and parallel packages and hides the processing of independent job packages in the processing of dependent job packages. Prototyping experiments prove that this model can realize the correct and rapid processing of multiservice cross-data streams. Increasing the pipeline depth and improving the processing performance in each stage of the pipeline are the key to improving the system performance.

1. Introduction

With the development of computer and network technology, the large number of users and services in all kinds of business systems brings huge challenges to the data analysis, processing, and storage of those systems. Meanwhile, there is an urgent need for security service capabilities in business systems. Security needs are reflected not only in financial business; big data analysis of user behavior can also easily expose users' personal privacy. The vulnerability of information transmission in the Internet of Things can easily become a security risk in the field of industrial control. Using cryptographic techniques to ensure the security of business and data and to protect user privacy is an urgent task at this stage and in the future. Therefore, it is necessary to study fast cryptographic operations for mass data. Considering the security and high-speed processing requirements, it is urgent to design a parallel system that can meet the requirements of different algorithms and different cryptographic working modes.

As the mainstream of computer architecture research and design, multicore has an irreplaceable role in improving computing performance. A lot of research has been done on the high-speed design and implementation of cryptographic algorithms themselves, as well as on heterogeneous multicore crypto processors. However, there is a lack of research on the high-speed processing of cryptographic services that cross each other in multiuser scenarios. This paper takes the design of a high-performance cryptographic server as its research background. According to the characteristics of cryptographic operations, and under the demand of high-concurrency massive data encryption and decryption application services, an efficient stream data processing model for multiuser cryptographic services is proposed to meet user-differentiated cryptographic service requirements and achieve high-speed cryptographic service performance.



This paper is organized as follows. Section 2 reviews the existing research. Section 3 introduces the thread separation of cryptographic operations based on the characteristics of cryptographic operations in different working modes. In Section 4, the dual-channel pipeline parallel data processing model (DPP) is proposed. Section 5 implements and tests the model.

2. Related Research

As the mainstream of processor architecture development, multicore processors have led to an upsurge of research on parallel processing. The speed of data processing is improved by multicore parallel execution. There are two issues involved here: multithread parallelism and multitask parallelism. For multithreaded tasks, concurrent execution of multiple threads by multicore processors can improve the processing performance. For example, one task can be divided into three threads that are completed in the following order: initialization I, operation C, and result output T. Then we can complete it with three cores, as shown in Figure 1. A single-threaded task is usually ported to multiple cores for execution through automatic parallelization techniques. There are many studies on the automatic parallelization of loops. The traditional loop parallelization methods include DOALL [1, 2] and DOACROSS [3]. When there is no dependency between iterations of the loop, the DOALL method is used to perform the parallelization; when there are dependencies between iterations of the loop, the DOACROSS method is used. This automatic parallelization technique cannot achieve good results for general loops containing complex control flow, recursive data structures, and multiple pointer accesses. For this reason, Ottoni proposed an instruction-level automatic task parallel algorithm called Decoupled Software Pipelining (DSWP) [4]. It divides the loop into two parts, the instruction stream and the data stream, and completes parallelism through synchronization between instructions and data. The implementation flows of DOACROSS and DSWP are shown in Figures 2(a) and 2(b), respectively.

DSWP reduces the interaction between cores compared to DOACROSS. Whether it is DSWP or DOACROSS, there is a synchronous relationship between threads that are automatically parallelized. For example, the execution times of thread I and thread C need to be consistent; otherwise, cores will be left waiting. Therefore, synchronization (or coordination) is a key issue that must be considered for parallelization. Concurrency must achieve high performance without significant overhead. Synchronization between threads often leads to the sequentialization of parallel activities, thereby undermining the potential benefits of concurrent execution. Therefore, the effective use of synchronization and coordination is critical to achieving high performance. One way to achieve this goal is speculative execution, which enables concurrent synchronization through thread speculation or branch prediction [5–8]. Successful speculation reduces the portion of sequential execution, but false speculation increases the revocation and recovery overhead. Moreover, implementing the speculative mechanism in a traditional control-flow architecture requires a large amount of software and hardware overhead.

In multitask parallelism, if each task is a multithreaded task, then, because the tasks are independent and the threads are independent, the processing method is equivalent to that of a multithreaded task. If there are single-threaded tasks, then after they are parallelized into multiple threads, there will be some associated threads in the process of multitask parallel execution, which must be treated differently.

A cryptographic service involves multiple elements. For example, a block cipher service involves the cryptographic algorithm, KEYs, initial vectors, working modes, and so on. Different cryptographic services have different operational elements. The rapid implementation of multiuser cryptographic services belongs to the multitask parallel domain. As a device for realizing multiuser and massive data cryptographic services, a high-performance cryptographic server must achieve the following two points. The first is the correctness of user services: the processing requests of different users cannot be confused. The second is the rapidity of data processing. There have been many studies on the fast implementation of the cryptographic algorithm itself, such as improving the computing performance of block cipher algorithms through pipelining [9–13] and optimizing the key operations of public-key cryptography algorithms to improve the operation speed [14–16]. Some studies also accelerate cryptographic operations through multicore parallelism. For example, [17–19] use GPUs to implement parallel processing of parts of cryptographic algorithms. These research results usually improve the performance of a single cryptographic algorithm. A heterogeneous multicore approach has been adopted to complete the parallel processing of multiple cryptographic algorithms [20–22]. However, no data processing method has been proposed for multiuser cryptographic services in the presence of multiple cryptographic algorithms, multiple KEYs, and multiple data streams.

Figure 1: The execution of a multithreaded task (threads I, C, and T on cores 1 to 3).

Figure 2: (a) DOACROSS. (b) DSWP.


This paper proposes a dual-channel pipeline parallel data processing model (DPP), which includes parallel scheduling, algorithm preprocessing, algorithm operation, and result acquisition. The DPP is designed and implemented in a heterogeneous multicore parallel architecture. This parallel system performs a variety of cryptographic algorithms and supports linear expansion of the algorithm operation units. Each algorithm operation unit adopts dual channels to receive data and realizes parallel operations among multiple tasks.

3. Thread Split

As mentioned above, different cryptographic services have different computing elements, which are expressed as

$$\text{Service} = \{\text{ID}, \text{crypto}, \text{key}, \text{IV}, \text{mode}\}.$$

ID is the service number that is set for multiple users.

Cryptographic operations are usually carried out in blocks, and a user's data can be divided into several blocks. The block size varies with the cryptographic algorithm. Here, UB is used to represent the size of a single block. The research in this paper is based on the following assumptions:

(1) Parallel processing is performed with UB granularity.
(2) Each cryptographic algorithm core completes the operation of one block. The algorithm core adopts the full pipeline design.

The fast implementation of the cryptographic algorithm core is not discussed here. Provided that the interface conditions are met, any full pipeline implementation scheme of a cryptographic algorithm can be applied to the algorithm core in this model. This paper focuses on the parallelism between different blocks of a cryptographic service.

3.1. Symmetric Cryptographic Algorithm. The parallelism between blocks of symmetric cryptographic algorithms must take into account the working mode adopted by the cryptographic algorithm. The commonly used working modes are ECB, CBC, CFB, OFB, CTR [23–27], and so on. Assume that $C_i$ denotes the $i$th ciphertext block, $P_i$ denotes the $i$th plaintext block, Enc denotes the encryption algorithm, Dec denotes the decryption algorithm, key denotes the KEY, IV denotes the initial vector, $n$ is the number of plaintext/ciphertext blocks, $T_i$ is the counter value, which increases by 1 with each block, and $u$ is the length of the last block.

(1) ECB working mode

Encryption:
$$C_i = \mathrm{Enc}(key, P_i), \quad 0 \le i < n.$$

Decryption:
$$P_i = \mathrm{Dec}(key, C_i), \quad 0 \le i < n.$$

(2) CTR working mode

Encryption:
$$\begin{cases} C_i = P_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ C_{n-1} = P_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

Decryption:
$$\begin{cases} P_i = C_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ P_{n-1} = C_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

(3) CBC working mode

Encryption:
$$\begin{cases} C_0 = \mathrm{Enc}(key, IV \oplus P_0), \\ C_i = \mathrm{Enc}(key, C_{i-1} \oplus P_i), & 0 < i < n. \end{cases}$$

Decryption:
$$\begin{cases} P_0 = \mathrm{Dec}(key, C_0) \oplus IV, \\ P_i = \mathrm{Dec}(key, C_i) \oplus C_{i-1}, & 0 < i < n. \end{cases}$$

(4) CFB working mode

Encryption:
$$\begin{cases} C_0 = \mathrm{Enc}(key, IV) \oplus P_0, \\ C_i = \mathrm{Enc}(key, C_{i-1}) \oplus P_i, & 0 < i < n. \end{cases}$$

Decryption:
$$\begin{cases} P_0 = \mathrm{Enc}(key, IV) \oplus C_0, \\ P_i = \mathrm{Enc}(key, C_{i-1}) \oplus C_i, & 0 < i < n. \end{cases}$$

(5) OFB working mode

Encryption:
$$\begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ C_i = S_i \oplus P_i, & 0 \le i < n. \end{cases}$$

Decryption:
$$\begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ P_i = S_i \oplus C_i, & 0 \le i < n. \end{cases}$$

Because there is no dependency between blocks in the ECB and CTR modes, blocks can be processed in parallel. ECB and CTR are therefore parallel modes and are very suitable for parallel processing. The CBC, CFB, and OFB modes have dependencies among blocks, so they are serial modes. When using multicore parallel operations in multiuser scenarios, attention must be paid to coordination and synchronization among blocks.
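The independence of blocks in a parallel mode can be illustrated with a minimal Python sketch of the CTR formulas above. The function `toy_enc` is a hypothetical stand-in for Enc (a byte-wise addition, not a real cipher), and the 4-byte block size, the counter encoding, and all helper names are assumptions made only for this illustration.

```python
from typing import List, Optional, Sequence

BLOCK = 4  # toy block size in bytes (assumption; SM4 uses 16-byte blocks)

def toy_enc(key: int, block: bytes) -> bytes:
    """Hypothetical stand-in for Enc(key, block). NOT a secure cipher."""
    return bytes((b + key) % 256 for b in block)

def counter_block(t0: int, i: int) -> bytes:
    """Counter value T_i = T_0 + i, encoded as one big-endian block."""
    return ((t0 + i) % (1 << (8 * BLOCK))).to_bytes(BLOCK, "big")

def ctr_block(key: int, t0: int, i: int, data: bytes) -> bytes:
    """Process block i alone: data XOR Enc(key, T_i).

    zip() stops at the shorter input, so a short last block uses only
    the leading bytes of the keystream, which plays the role of MSB_u.
    """
    keystream = toy_enc(key, counter_block(t0, i))
    return bytes(d ^ k for d, k in zip(data, keystream))

def ctr_process(key: int, t0: int, blocks: Sequence[bytes],
                order: Optional[Sequence[int]] = None) -> List[bytes]:
    """Encrypt or decrypt all blocks, visiting them in any order."""
    order = range(len(blocks)) if order is None else order
    out: List[bytes] = [b""] * len(blocks)
    for i in order:
        out[i] = ctr_block(key, t0, i, blocks[i])
    return out
```

Because each output block depends only on its own index and input, the visiting order does not affect the result, and the same routine serves for decryption. A serial mode such as CBC offers no such freedom, since block $i$ needs the ciphertext of block $i-1$.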

By analyzing each working mode, we can divide the encryption and decryption operation into three threads. Thread 1 completes the acquisition of the algorithm core input data, thread 2 completes the encryption/decryption operation of a single block, and thread 3 completes the output of the ciphertext/plaintext data. Taking the CBC encryption mode and the OFB decryption mode as examples, the thread splitting is shown in Table 1.

With this way of splitting, the function of thread 2 is relatively simple: it is the cryptographic algorithm operation on one UB. Since encryption and decryption operations usually require multiple rounds of confusion and iterative operations, the operation time of thread 2 is longer than that of thread 1 and thread 3. Taking the SM4 algorithm as an example, each block needs 32 rounds of function operations. In the full pipeline approach, the algorithm architecture is shown in Figure 3, where F0 to F31 represent the 32 rounds of function operations. Different blocks without dependencies can be executed in parallel within it.

For example, service 1 = {ID1, SM4, key1, IV1, CBC} and service 2 = {ID2, SM4, key2, none, ECB} correspond to the data streams data 1 and data 2. Suppose data 1 can be split into $m$ UB blocks and data 2 can be split into $n$ UB blocks, that is,

$$\text{data 1} = \{p_{11}, p_{12}, \ldots, p_{1m}\}, \qquad \text{data 2} = \{p_{21}, p_{22}, \ldots, p_{2n}\}. \tag{1}$$

When $p_{1i}$ is being processed in module $F_k$ ($0 \le k \le 31$), $p_{1(i+1)}$ cannot enter the thread 1 module, but $p_{2j}$ can enter the thread 1 and $F_{k'}$ ($k' < k$) modules. So we can insert the block data of the parallel mode between blocks of the serial working mode and hide the execution time of the parallel mode block data inside that of the serial working mode block data. The thread 2 pipeline depth determines the number of independent blocks that can be inserted between dependent blocks.
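The hiding effect can be made concrete with a small cycle-count model in Python. It is a sketch under simplifying assumptions that are not taken from the paper: one block may enter the pipeline per clock cycle, a block leaves `m` cycles after it enters, and the next dependent block can enter in the same cycle in which its predecessor leaves. The function name `total_cycles` is hypothetical.

```python
def total_cycles(m: int, dependent: int, independent: int) -> int:
    """Cycles until every block has left an m-stage pipeline.

    dependent   -- number of blocks of one serial-mode stream
    independent -- number of parallel-mode blocks from other streams
    """
    cycle = 0
    next_dep_ready = 0   # earliest cycle at which a dependent block may enter
    last_issue = -1      # cycle in which the most recent block entered
    while dependent > 0 or independent > 0:
        if dependent > 0 and cycle >= next_dep_ready:
            # The previous dependent block has left the pipeline.
            dependent -= 1
            next_dep_ready = cycle + m
            last_issue = cycle
        elif independent > 0:
            # Fill the otherwise idle slot with an independent block.
            independent -= 1
            last_issue = cycle
        cycle += 1
    return last_issue + m if last_issue >= 0 else 0
```

Under these assumptions, with `m = 32` and four dependent blocks, `total_cycles(32, 4, 0)` and `total_cycles(32, 4, 93)` are both 128: up to `m - 1` independent blocks fit into each gap between dependent blocks without lengthening the run, whereas a 94th independent block adds one cycle.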

3.2. Hash Algorithm. The Hash function obtains the hash value of the message through multiple iterations over the block data, so the Hash operation has dependencies between blocks. By analysis, the Hash algorithm can also be divided into three threads: message expansion (ME), iterative compression, and hash value output. The message expansion completes the calculation of the parameters required for the iterative compression function, and the hash value output is used for the output of the final result. Taking SM3 as an example, the algorithm architecture is shown in Figure 4. The parallelism of Hash operations can only occur between blocks of different data streams.

4. DPP Parallel Data Processing Model

The parallel computing models based on PRAM, BSP, and LogP [28, 29] are all computing oriented, are not targeted at data processing, and are not suitable for practical applications with massive data. It can be seen from the above description that the cryptographic operation data in the multiuser environment have the following characteristics: (1) different user data streams intersect each other; (2) independent blocks and dependent blocks are interleaved with each other. Therefore, the parallel processing model must have the following two mechanisms: (1) distinguish between different user data streams and (2) distinguish between data stream dependencies. For this purpose, we encapsulate the data stream and add certain attribute information to the block data so as to express its properties, thereby facilitating subsequent parallel processing.

The cryptographic operations on streaming data are data-intensive applications. Their typical feature is that the data value is time-sensitive, so the system requires low latency. When a large amount of multiuser data reaches the system in a continuous, fast, time-varying, and interleaved way, it must be quickly sent to each cryptographic algorithm operation node. Otherwise, data loss may occur due to the limited storage space of the system receiver. Following the MapReduce data stream processing strategy, a specialized module is used to distribute data streams.

The three threads of cryptographic operations in different working modes are implemented by three modules: the preprocessing module, the operation module, and the result output module. The data reorganization module completes the integration of the data stream packages of each service. Figure 5 shows the dual-channel pipeline parallel data processing model proposed in this paper. The data stream processing is divided into six stages: job package encapsulation (PE), parallel scheduling (PS), job package preprocessing (PP), algorithm operation (AO), result output (RO), and data reorganization (DR). The job package encapsulation and data reorganization are completed by the node P0, the parallel scheduling is completed by the node P0, and the job package preprocessing, algorithm operation, and result output are completed by the algorithm operation unit cry_IP. Each algorithm corresponds to a cluster of algorithm operation units. For example, the operation module $\text{cry\_IP}_{i1}$ is an encryption operation unit of algorithm $s_i$, and $\text{cry\_IP}_{i2}$ is a decryption operation unit of algorithm $s_i$. The dual channel is embodied in the two channels for the input and output data of the algorithm operation unit.

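Table 1: Thread split.

| | CBC encryption mode | OFB decryption mode |
|---|---|---|
| Thread 1 | $\mathit{result1} = \mathrm{XOR}(X, P_i)$, where $X = IV$ for $i = 0$ and $X = C_{i-1}$ for $0 < i < n$ | $\mathit{result1}_i = IV$ for $i = 0$ and $\mathit{result1}_i = \mathit{result2}_{i-1}$ for $0 < i < n$ |
| Thread 2 | $\mathit{result2} = \mathrm{Enc}(Key, \mathit{result1})$ | $\mathit{result2}_i = \mathrm{Enc}(Key, \mathit{result1}_i)$ |
| Thread 3 | $C_i = \mathit{result2}$ | $P_i = \mathit{result2}_i\ \mathrm{XOR}\ C_i$, $0 \le i < n$ |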
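The split in Table 1 can be written out as a short Python sketch in which each loop iteration performs the work of thread 1, thread 2, and thread 3 for one block. The cipher `toy_enc` is a hypothetical placeholder for Enc (it is not SM4 and not secure), and the function names are assumptions introduced only for this illustration.

```python
from typing import List, Optional

def toy_enc(key: int, block: bytes) -> bytes:
    """Hypothetical stand-in for Enc(Key, block). NOT a secure cipher."""
    return bytes((b + key) % 256 for b in block)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt_3thread(key: int, iv: bytes, plaintext: List[bytes]) -> List[bytes]:
    ciphertext: List[bytes] = []
    for i, p in enumerate(plaintext):
        x = iv if i == 0 else ciphertext[i - 1]
        result1 = xor(x, p)               # thread 1: build the core input
        result2 = toy_enc(key, result1)   # thread 2: one block operation
        ciphertext.append(result2)        # thread 3: C_i = result2
    return ciphertext

def ofb_decrypt_3thread(key: int, iv: bytes, ciphertext: List[bytes]) -> List[bytes]:
    plaintext: List[bytes] = []
    prev_result2: Optional[bytes] = None
    for i, c in enumerate(ciphertext):
        result1 = iv if i == 0 else prev_result2   # thread 1
        result2 = toy_enc(key, result1)            # thread 2
        plaintext.append(xor(result2, c))          # thread 3: P_i = result2 XOR C_i
        prev_result2 = result2                     # fed back to thread 1
    return plaintext
```

In both functions only thread 2 touches the cipher, which is why the operation module can stay mode-agnostic while the mode-specific logic lives in threads 1 and 3.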

Figure 3: SM4 3-thread algorithm operation (inputs $P_i$, $C_{i-1}$, IV, Key, and Mode; round functions F0 to F31; output $C_i$).

Figure 4: SM3 3-thread algorithm operation (inputs $B_i$, $V_{i-1}$, and IV; message expansion ME producing W; round functions F0 to F63; output $V_i$).


In the DPP model, data processing is performed in units of job packages. No explicit synchronization process is required between job packages. Synchronization is implicit in the algorithm preprocessing. Each job package is completed by a fixed algorithm operation unit, and there is no data interaction between the algorithm operation units during job package processing.

4.1. Encapsulation. The encapsulation is completed by the master node. The format of the encapsulated job package is as follows:

$$P = \{\text{ID}, \text{crypto}, \text{key}, \text{IV}, \text{mode}, \text{No}, \text{flag}, l\}.$$

ID is the service number set for multiple users and is used to distinguish different data streams. Crypto represents the specific demand for the cryptographic algorithm and for the encryption/decryption operation, key is the KEY, IV is the initial vector, and mode is the working mode, which can be used to distinguish the dependency property of the job package. No is the job package serial number, which is used to reassemble the data stream after the algorithm operation. Flag is the tail package identifier, which indicates whether the service data flow has ended. $l$ is the length of the job package and is consistent with the size of the unit block of the cryptographic algorithm.

The flow data received by the system is described as $\text{Task} = \{P_{ij}\}$, where $i$ corresponds to the service number and $j$ corresponds to the sequence number of the job package within the service. We refer to job packages in parallel mode as independent job packages and job packages in serial mode as dependent job packages.
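A job package with the fields listed above can be sketched in Python as follows. The `data` payload field, the numbering of No from 1, the convention that `flag = 1` marks the tail package, and the names `JobPackage` and `encapsulate` are assumptions made for this illustration; the text only specifies the attribute set.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class JobPackage:
    id: int               # service number
    crypto: str           # algorithm and direction, e.g. "SM4-ENC" (assumed encoding)
    key: bytes
    iv: Optional[bytes]   # None for modes without an initial vector
    mode: str             # "ECB", "CTR", "CBC", "CFB", or "OFB"
    no: int               # serial number within the service, starting at 1 (assumption)
    flag: int             # 1 for the tail package, 0 otherwise (assumption)
    l: int                # payload length in bytes
    data: bytes           # block payload (assumed field, not in the listed format)

def encapsulate(service_id: int, crypto: str, key: bytes, iv: Optional[bytes],
                mode: str, data: bytes, block_size: int = 16) -> List[JobPackage]:
    """Split one service's data into unit blocks and wrap each in a job package."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    last = len(blocks) - 1
    return [
        JobPackage(id=service_id, crypto=crypto, key=key, iv=iv, mode=mode,
                   no=j + 1, flag=1 if j == last else 0, l=len(block), data=block)
        for j, block in enumerate(blocks)
    ]
```

For a 40-byte input and 16-byte blocks this yields three packages numbered 1 to 3, with only the last one carrying `flag = 1`.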

4.2. Dual Channel and Parallel Scheduling. According to the thread splitting, under the same algorithm requirements, the difference in package processing between working modes is embodied in the preprocessing module and the result output module. The operation module does not consider the correlation among the job packages and only processes the input job packages. For this reason, it is necessary to distinguish the input data of the preprocessing module and the result output module. We adopt dual channels to receive job packages and to classify independent job packages and dependent job packages.

4.2.1. Dual Receiving Channel of the Preprocessing Module. Channel 1 is used for the transfer of independent job packages. Considering that the first job package of a data stream in the serial mode is not associated with job packages of other data streams, the job package transmitted in channel 1 satisfies the condition

(mode = ECB | CTR) or (mode = CBC | CFB | OFB and No = 1).

Channel 2 is used for the transfer of dependent job packages. The job package that is transmitted in channel 2 satisfies the condition

(mode = CBC | CFB | OFB) and No ≠ 1.

Suppose that the working modes of four services that request the $\text{cry\_IP}_{ab}$ operation are as follows: service 1.mode = service 2.mode = CBC and service 3.mode = service 4.mode = ECB. The job packages on channel 1 and channel 2 after parallel scheduling are shown in Figure 6.
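The two channel conditions can be expressed as a small Python predicate and applied to the four-service example. The arrival order of the job packages below is an arbitrary assumption, and `input_channel` is a hypothetical name; only the classification rule comes from the text.

```python
PARALLEL_MODES = {"ECB", "CTR"}
SERIAL_MODES = {"CBC", "CFB", "OFB"}

def input_channel(mode: str, no: int) -> int:
    """Return 1 or 2, the preprocessing input channel for a job package."""
    if mode in PARALLEL_MODES or (mode in SERIAL_MODES and no == 1):
        return 1   # independent package, or first package of a serial stream
    if mode in SERIAL_MODES:
        return 2   # dependent package (No != 1)
    raise ValueError(f"unknown working mode: {mode}")

# Services 1 and 2 use CBC; services 3 and 4 use ECB.
modes = {1: "CBC", 2: "CBC", 3: "ECB", 4: "ECB"}
# (service, No) pairs in an assumed arrival order.
arrivals = [(2, 1), (4, 1), (3, 2), (3, 1), (1, 1), (1, 2), (2, 2)]

channels = {1: [], 2: []}
for service, no in arrivals:
    channels[input_channel(modes[service], no)].append(f"P{service}{no}")
print(channels)
```

Only P12 and P22, the non-first packages of the two CBC services, end up on channel 2; every other package is free to enter through channel 1.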

The selection of the channel is determined by the control signal $S_i$, and the default selection is channel 1, that is, $S_i = 0$. cnt is used to record the execution time of the dependent job package in the algorithm operation unit. When the module cnt senses that a dependent job package is input to the preprocessing module, counting starts. It is assumed that the algorithm operation unit needs $m$ clock cycles to complete the operation of the dependent job package. When cnt = $m$, the counter is cleared (cnt = 0) and $S_i$ is set to 1 to select channel 2 and input the next dependent job package. In other cases, $S_i = 0$ and channel 1 is selected. The state flow diagram is shown in Figure 7.
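The control logic described above can be sketched as a three-state machine in Python. This is one possible reading of the description, not the hardware design itself: the state names S0, S1, and S2 follow Figure 7, while the exact cycle at which counting begins and the names `next_state` and `si_trace` are assumptions.

```python
from typing import List, Tuple

def next_state(state: str, cnt: int, m: int, dependent_input: bool) -> Tuple[str, int]:
    """One clock step of the channel-selection control.

    S0: idle, cnt = 0, Si = 0 (channel 1 selected)
    S1: counting, Si = 0 (channel 1 selected)
    S2: cnt cleared, Si = 1 (channel 2 selected)
    """
    if state == "S0":
        return ("S1", 0) if dependent_input else ("S0", 0)
    if state == "S1":
        cnt += 1
        return ("S2", 0) if cnt == m else ("S1", cnt)
    # state == "S2": the next dependent package, if any, restarts the count.
    return ("S1", 0) if dependent_input else ("S0", 0)

def si_trace(m: int, dependent_inputs: List[bool]) -> List[int]:
    """Value of Si in each cycle, given whether a dependent package enters in it."""
    state, cnt = "S0", 0
    trace = []
    for dep in dependent_inputs:
        trace.append(1 if state == "S2" else 0)
        state, cnt = next_state(state, cnt, m, dep)
    return trace
```

With `m = 3` and dependent packages entering in cycles 0 and 4, `si_trace` returns `[0, 0, 0, 0, 1, 0, 0, 0, 1]`: channel 2 is opened for exactly one cycle each time the counter reaches `m`, and channel 1 stays selected otherwise.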

4.2.2. Dual Receiving Channel of the Result Output Module. Channel 3 is used for the transfer of independent job packages. The job package that is transmitted in channel 3 satisfies the condition

(mode = ECB | CTR) or (mode = CBC | CFB | OFB and flag = 1).

Channel 4 is used for the transfer of dependent job packages. The job package that is transmitted in channel 4 satisfies the condition

(mode = CBC | CFB | OFB) and flag ≠ 1.

Channel 4 provides support for storing intermediate states in serial mode. Since the result of the tail package does not need to be used as an intermediate state, the tail package's result in serial mode is also output through channel 3.

The choice of channel is determined by the control signal $S_o$, and channel 3 is the default, that is, $S_o = 0$. The channel selection control signal is the same as that of the preprocessing module. When cnt = $m$, $S_o$ is set to 1 and channel 4 is selected. In other cases, $S_o = 0$ and channel 3 is selected. When $S_o = 1$, the job package transmitted by channel 4 has the same ID as the job package received by the preprocessing module, as shown in Figure 8.
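The output-side conditions differ from the input side only in that they test the tail flag rather than the serial number. A minimal Python sketch, assuming that `flag = 1` marks the tail package and using the hypothetical name `output_channel`:

```python
def output_channel(mode: str, flag: int) -> int:
    """Return 3 or 4, the result output channel for a processed job package."""
    serial = mode in {"CBC", "CFB", "OFB"}
    if serial and flag != 1:
        return 4   # intermediate state, kept for the next dependent package
    if serial or mode in {"ECB", "CTR"}:
        return 3   # independent result, or tail package of a serial stream
    raise ValueError(f"unknown working mode: {mode}")

# A three-package CBC stream: two intermediate results, then the tail.
routes = [output_channel("CBC", flag) for flag in (0, 0, 1)]
print(routes)
```

The two intermediate CBC results are routed to channel 4 and the tail result to channel 3, matching the statement that a tail package never serves as an intermediate state.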

Figure 5: Dual-channel pipeline parallel data processing model (DPP), with stages PE, PS, PP, AO, RO, and DR, algorithm operation units cry_IP1 to cry_IPp, and timing labels Tpar, L, Ts, Tr, Tc, Tf, and Tz.


4.2.3. Parallel Scheduling. The process of parallel scheduling is as follows:

Step 1: Determine the algorithm operation unit according to crypto.
Step 2: Select the input channel of the preprocessing module of the algorithm operation unit according to mode and No.

This scheduling method realizes fast transfer of incoming data streams and continuous processing of job packages. The use of dual channels reduces the interaction between modules and hides the processing time of independent job packages in the processing time of dependent job packages, facilitating the parallel execution of job packages.

4.2.4. Data Processing Steps

Step 1: The algorithm application process splits the data to be processed, adds attribute information, and encapsulates it as job packages.
Step 2: The algorithm operation unit is determined according to the crypto field in the job package, and the input channel of its preprocessing module is selected according to mode and No. The job package is sent to the corresponding input channel.
Step 3: The preprocessing module obtains the input data of the algorithm operation module, namely, data, KEY, and IV, according to the package fields mode and No.
Step 4: The algorithm operation module performs pipeline processing on the received data and sends the result to the receiving channel of the result output module according to mode and flag.
Step 5: The result output module outputs the received job package to the result receiving process and determines whether to feed the job package back to the preprocessing module according to the mode and flag of the job package.
Step 6: The result receiving process recombines the received job packages based on ID.
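Step 6 can be illustrated with a short Python sketch that regroups results arriving in arbitrary order. The tuple layout `(ID, No, flag, data)`, the numbering of No from 1, the use of `flag = 1` to detect the tail package, and the name `reorganize` are assumptions for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def reorganize(results: Iterable[Tuple[int, int, int, bytes]]) -> Dict[int, bytes]:
    """Reassemble completed data streams from (ID, No, flag, data) results.

    A stream is returned only once its tail package (flag == 1) and every
    package numbered 1..tail have been received.
    """
    blocks: Dict[int, Dict[int, bytes]] = defaultdict(dict)
    tail_no: Dict[int, int] = {}
    for service_id, no, flag, data in results:
        blocks[service_id][no] = data
        if flag == 1:
            tail_no[service_id] = no
    complete: Dict[int, bytes] = {}
    for service_id, last in tail_no.items():
        received = blocks[service_id]
        if all(n in received for n in range(1, last + 1)):
            complete[service_id] = b"".join(received[n] for n in range(1, last + 1))
    return complete
```

Grouping by ID keeps the services apart, and ordering by No restores each stream even when independent packages from other services were processed in between.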

4.3. Parallel Execution Time. Assume that P0 is the job package encapsulation and reorganization node, and Pa, Pb, and Pc are algorithm operation nodes. $T_{par}$ is the package encapsulation time, and $T_z$ is the data reorganization time. $g$ is the communication interval, that is, the minimum time interval at which node P0 continuously transmits and receives job packages. The reciprocal of $g$ corresponds to the communication bandwidth. $L$ is the maximum communication delay, which is the time taken to transmit a job package from node P0 to the scheduling node. $T_s$ indicates the parallel scheduling time of job packages, $T_r$ indicates the job package preprocessing time, $T_c$ indicates the job package operation time, and $T_f$ indicates the job package output time. $w$ represents the computational load, which is the set of job packages. $m$ is the pipeline depth of the algorithm operation module. The message delivery process of the job package on DPP is shown in Figure 9.

Figure 6: Parallel scheduling and dual receiving channel of preprocessing module.

Figure 7: Dual-channel selection control of preprocessing module (states S0, S1, and S2).

Figure 8: Dual-channel control of output module.

The continuous sending and receiving of messages needs to satisfy the conditions

$$L + T_s + T_r + \frac{T_c}{m} \le g + L + T_s + T_r, \qquad g \ge \frac{T_c}{m}. \tag{2}$$

Conclusion 1. The system communication bandwidth can be improved in two ways: increasing the operation speed of the algorithm operation module and increasing its pipeline depth.
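The step from the sending condition to the bound on $g$ is a cancellation of the terms common to both sides. The numbers in the second line are illustrative assumptions, not measurements from the prototype.

```latex
\begin{aligned}
L + T_s + T_r + \frac{T_c}{m} \le g + L + T_s + T_r
  &\iff g \ge \frac{T_c}{m}, \\
T_c = 320\ \text{ns},\quad m = 32
  &\implies g \ge 10\ \text{ns}, \qquad \frac{1}{g} \le 10^{8}\ \text{job packages per second}.
\end{aligned}
```

Halving $T_c$ or doubling $m$ halves the smallest admissible $g$, which is the sense in which both routes raise the achievable bandwidth $1/g$.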

If the $w$ job packages come from $\alpha$ data streams, consider two scenarios:

(1) Each data stream adopts a parallel mode, that is, the job packages are all independent. The load processing time is as follows:

$$T = T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L. \tag{3}$$

(2) Each data stream adopts a serial mode, so that job packages of the same service data stream are mutually dependent and job packages of different service data streams are mutually independent. Assume that the maximum number of job packages in a single service data stream is $w'$. In the extreme case, the first job package of this data stream appears after those of the other service data streams, and the operation time of the other service data streams is then hidden within that of the longest service data stream. The load processing time is as follows:

$$T = T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{4}$$

For a data stream that mixes the serial and parallel modes, the pipeline design of the algorithm operation module allows independent job packages to be executed in parallel while dependent job packages are being processed, so the execution time of the independent job packages is hidden in the execution time of the dependent job packages. Therefore, the execution time $T$ of the multitask mixed-mode data stream is bounded as follows:

$$T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L \le T \le T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{5}$$

Conclusion 2. The execution time of mixed cross-data streams is bounded by the execution time of the data stream with the most job packages. With the pipeline depth held constant, improving the processing performance of each module in the pipeline is the key to improving the overall processing performance.
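The two bounds of (5) can be evaluated directly. The sketch below is illustrative only: the `Timing` container and all numeric values are hypothetical, and it assumes that $n$ in (3) denotes the number of job packages in the load.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Model parameters of Section 4.3, all in the same (arbitrary) time unit."""
    t_par: float  # package encapsulation time
    g: float      # communication interval
    t_s: float    # parallel scheduling time
    t_r: float    # preprocessing time
    t_c: float    # operation time
    t_f: float    # output time
    t_z: float    # data reorganization time
    L: float      # maximum communication delay


def t_all_independent(t: Timing, n: int) -> float:
    """Equation (3): every job package is independent."""
    return t.t_par + (n - 1) * t.g + t.t_s + t.t_r + t.t_c + t.t_f + t.t_z + 2 * t.L


def t_all_dependent(t: Timing, w: int, w_max: int) -> float:
    """Equation (4): w packages in total, w_max in the longest dependent stream."""
    per_package = t.t_s + t.t_r + t.t_c + t.t_f
    return t.t_par + (w - w_max - 1) * t.g + w_max * per_package + t.t_z + 2 * t.L


if __name__ == "__main__":
    timing = Timing(t_par=1, g=1, t_s=1, t_r=1, t_c=32, t_f=1, t_z=1, L=2)
    lower = t_all_independent(timing, n=100)
    upper = t_all_dependent(timing, w=100, w_max=20)
    print(f"{lower} <= T <= {upper}")
```

With these hypothetical values the upper bound is dominated by the term $w'(T_s + T_r + T_c + T_f)$, which is why the longest dependent stream limits the mixed workload.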

5. Implementation and Testing

5.1. Hardware Implementation. We prototyped the model to verify its validity. The architecture is shown in Figure 10. The cipher server management system receives the multiuser cryptographic service data streams, encapsulates them into job packages, and reorganizes the operation results. The cryptographic algorithm operations are performed by a crypto machine acting as a coprocessor. The crypto machine is built on a Xilinx XC7K325T FPGA, which contains the parallel scheduling module and the SM3 and SM4 cryptographic algorithm cores.

The hardware implementation block diagram of the cipher machine is shown in Figure 11. The cipher machine uses a PCIe interface. It receives, by DMA, the job packages split by the algorithm application process of the cipher server management system and stores them in DOWN_FIFO in the downlink data storage area.

Parallel scheduling module PSCHEDULE. It determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No fields, and transfers the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2.

Preprocessing module IP_CTRL. It acquires the algorithm core input data, including the IV, the KEY, and the result of the preceding dependent job package, and calculates the input data for the cryptographic cores.

Operation modules SM4 and SM3. The cryptographic cores perform the algorithm operations on the input data in a pipelined way and send the results to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Figure 9: Message delivery on the DPP model.

Figure 10: Multiuser cryptographic service.

Result output module UP_CTRL. If IP_CTRL supplies an ID number, the data with the same ID number are extracted from RAM, and the output result result′ is calculated, fed back to IP_CTRL, and written to the output FIFOo at the same time. If IP_CTRL supplies no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are returned to the result reception process of the cipher server management system through the interface module in DMA mode.

5.2. Test. The test environment is as follows. The main frequency of the heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325T FPGA is 160 MHz, and the interface with the upper application is PCIe 2.0 ×8.

Test 1. SM4 CBC encryption of a 4000 MB file. The test operation takes 114.390935 s to complete, so the data stream processing rate is 4000 × 8 / 114.390935 = 279.742446 Mbps.

Test 2. Tests 2.1 to 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. The files use different IVs and KEYs and different working modes. The end time of the processing of each file is shown in Table 2. For each data set, the maximum end time is taken as the total time. The data flow processing rate is derived from the following formula:

$$\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}. \tag{6}$$

Figure 11: Hardware architecture of the cipher machine.

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time (s) per file, in file order | Processing rate |
|---|---|---|---|---|
| 2.1 | 1–4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8 / 13.5175 s = 1.89 Gbps |
| 2.1 | 5–8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| 2.2 | 1–4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8 / 12.5052 s = 2.05 Gbps |
| 2.2 | 5–8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| 2.3 | 1–4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8 / 4.9895 s = 5.13 Gbps |
| 2.3 | 5–8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| 2.4 | 1–4 | SM3, SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8 / 13.1909 s = 1.94 Gbps |
| 2.4 | 5–8 | SM4 CBC, SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |
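The reported rates can be reproduced from formula (6). The following sketch recomputes them from the total data size and the maximum end time of each test; the function name is hypothetical, and it assumes the paper's convention of 1 MB = 8 Mbit and 1 Gbps = 1000 Mbps.

```python
def processing_rate_mbps(size_mb: float, total_time_s: float) -> float:
    """Formula (6): data size in Mbit divided by the total time in seconds."""
    return size_mb * 8 / total_time_s


# Maximum end time (s) of each test, i.e. the total time of the data set.
MAX_END_TIME = {
    "Test 2.1": 13.5175,
    "Test 2.2": 12.5052,
    "Test 2.3": 4.9895,
    "Test 2.4": 13.1909,
}

if __name__ == "__main__":
    # Test 1: a single 4000 MB file.
    print("Test 1", processing_rate_mbps(4000, 114.390935), "Mbps")
    # Tests 2.1 to 2.4: eight files of 400 MB each, 3200 MB in total.
    for name, seconds in MAX_END_TIME.items():
        print(name, processing_rate_mbps(3200, seconds) / 1000, "Gbps")
```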

Analysis. Because Test 1 has only one file in CBC mode, its job packages are interrelated and are all executed serially. In Test 2.1, the packages of each file are interrelated, but the files are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, four files are in the ECB working mode, and the operation time of the independent job packages can be hidden within the operation time of the dependent packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3 is the highest. Test 2.4 has two files with independent job packages and six files with dependent packages, but they are allocated to two algorithm units, so the operation rate is close to that of Test 2.1.

Test 3. Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to $j$ files. If $N_i$ represents the number of job packages in file $i$, then $\sum_{i=1}^{j} N_i = 10000$. For $j$ = 10, 20, 30, and 40, the files use the ECB or CBC encryption mode. We change the number of files in CBC encryption mode and compare the completion time of the data flow in the single-channel architecture and the dual-channel architecture. The data flow processing time is averaged over several runs, and the comparison result is shown in Figure 12. "Single 0" means that all files use ECB mode and the system adopts the single-channel architecture. "Dual 50" indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.
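A workload of this shape can be generated as follows. This is a hedged sketch, not the authors' test generator: the function names, the seeding, and the uniform random assignment are assumptions made for illustration.

```python
import random
from typing import List, Set


def assign_packages(total: int, files: int, seed: int = 0) -> List[int]:
    """Randomly assign `total` job packages to `files` files; returns N_1..N_j."""
    rng = random.Random(seed)
    counts = [0] * files
    for _ in range(total):
        counts[rng.randrange(files)] += 1
    return counts


def pick_cbc_files(files: int, percent: int, seed: int = 0) -> Set[int]:
    """Choose which file indices use CBC mode; the remaining files use ECB."""
    rng = random.Random(seed)
    return set(rng.sample(range(files), files * percent // 100))


if __name__ == "__main__":
    for j in (10, 20, 30, 40):
        counts = assign_packages(10000, j)
        cbc = pick_cbc_files(j, 50)
        print(j, sum(counts), len(cbc))
```

By construction the counts always sum to the requested total, which mirrors the constraint $\sum_{i=1}^{j} N_i = 10000$.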

As can be seen from Figure 12, when the data flow consists only of independent job packages, the algorithm operation unit benefits from the pipeline design in both cases, so the processing rate under the dual channel is close to that under the single channel. As the number of associated job packages in the data flow increases, the advantage of the dual channel in data processing rate gradually appears, and as the number of files in the data stream increases, this advantage becomes more obvious.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model, DPP, to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing between independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can process multiservice and personalized cross-data streams correctly and rapidly. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

Figure 12: Processing rate comparison of dual channel and single channel (multiple files, SM4 ECB and CBC cross encryption; horizontal axis: file number; vertical axis: processing rate in Gbps).

References

[1] Y. Song and Z. Li, "Applying array contraction to a sequence of DOALL loops," in Proceedings of the International Conference on Parallel Processing (ICPP'04), vol. 1, pp. 46–53, Montreal, Canada, August 2004.

[2] G. Elsesser, V. Ngo, S. Bhattacharya, and W. T. Tsai, "Load balancing of DOALL loops in the Perfect Club," in Proceedings of the Seventh International Parallel Processing Symposium, pp. 129–133, Newport, CA, USA, April 1993.

[3] D.-K. Chen and P.-C. Yew, "Statement re-ordering for DOACROSS loops," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 24–28, Raleigh, NC, USA, August 1994.

[4] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, "Automatic thread extraction with decoupled software pipelining," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp. 105–118, Barcelona, Spain, November 2005.

[5] V. Krishnan and J. Torrellas, "A chip-multiprocessor architecture with speculative multithreading," IEEE Transactions on Computers, vol. 48, no. 9, pp. 866–880, 1999.

[6] A. S. Rajam, L. E. Campostrini, J. M. M. Caamaño, and P. Clauss, "Speculative runtime parallelization of loop nests: towards greater scope and efficiency," in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 245–254, Hyderabad, India, May 2015.

[7] S. Aldea, A. Estebanez, D. R. Llanos, and A. Gonzalez-Escribano, "An OpenMP extension that supports thread-level speculation," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 78–91, 2016.

[8] J. Salamanca, J. N. Amaral, and G. Araujo, "Evaluating and improving thread-level speculation in hardware transactional memories," in Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 586–595, Chicago, IL, USA, May 2016.

[9] Z. Ying and B. Qinghai, "The scheme for improving the efficiency of block cipher algorithm," in Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), pp. 824–826, Ottawa, ON, Canada, September 2014.

[10] P. Kitsos and A. N. Skodras, "An FPGA implementation and performance evaluation of the seed block cipher," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP), pp. 1–5, Corfu, Greece, July 2011.

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, "ELmD: a pipelineable authenticated encryption and its hardware implementation," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, "AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique," in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, "Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA," in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, "An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography," in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, "A novel high performance ECC processor architecture with two staged multiplier," in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, "Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller," in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, "High speed implementation of symmetric block cipher on GPU," in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, "Implementation and evaluation of different parallel designs of AES using CUDA," in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, "Accelerating NTRU based homomorphic encryption using GPUs," in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, "An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study," in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., "Heterogeneous multicore SoC with SiP for secure multimedia applications," IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, "Evaluating time and throughput at different modes of operation in AES algorithm," in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, "Summary of topological study of chaotic CBC mode of operation," in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, "A high speed implementation counter mode cryptography using hardware parallelism," in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, "Analysis of the statistical cipher feedback mode of block ciphers," IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, "A study on encryption algorithms and modes for disk encryption," in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, "Research on task parallel programming model," Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.


This paper is organized as follows. Section 2 reviews the existing research. Section 3 introduces the thread separation of cryptographic operations based on the characteristics of cryptographic operations in different working modes. In Section 4, the dual-channel pipeline parallel data processing model DPP is proposed. Section 5 implements and tests the model.

2. Related Research

As the mainstream of processor architecture development, multicore processors have led to an upsurge of research on parallel processing. The speed of data processing is improved by multicore parallel execution. Two issues are involved here: multithread parallelism and multitask parallelism. For multithreaded tasks, concurrent execution of multiple threads by multicore processors can improve the processing performance. For example, one task can be divided into three threads that complete in the following order: initialization I, operation C, and results output T. The task can then be completed with three cores, as shown in Figure 1. A single-threaded task is usually ported to multiple cores for execution through automatic parallelization techniques. There are many studies on the automatic parallelization of loops. The traditional loop parallelization methods include DOALL [1, 2] and DOACROSS [3]. When there is no dependency between iterations of the loop, the DOALL method is used to parallelize it; when there are dependencies between iterations of the loop, the DOACROSS method is used. These automatic parallelization techniques cannot achieve good results for general loops containing complex control flow, recursive data structures, and multiple pointer accesses. For this reason, Ottoni et al. proposed an instruction-level automatic task parallel algorithm called Decoupled Software Pipelining (DSWP) [4]. It divides the loop into two parts, the instruction stream and the data stream, and achieves parallelism through synchronization between instructions and data. The implementation flows of DOACROSS and DSWP are shown in Figures 2(a) and 2(b), respectively.

DSWP reduces the interaction between cores compared to DOACROSS. Whether DSWP or DOACROSS is used, there is a synchronization relationship between the threads that are automatically parallelized. For example, the execution times of thread I and thread C need to be consistent; otherwise, cores are left waiting. Therefore, synchronization (or coordination) is a key issue that must be considered for parallelization. Concurrency must achieve high performance without significant overhead. Synchronization between threads often leads to the sequentialization of parallel activities, thereby undermining the potential benefits of concurrent execution. Therefore, the effective use of synchronization and coordination is critical to achieving high performance. One way to achieve this goal is speculative execution, which enables concurrent synchronization through thread speculation or branch prediction [5–8]. Successful speculation reduces the portion of sequential execution, but false speculation increases the revocation and recovery overhead. In addition, implementing the speculative mechanism in a traditional control flow architecture requires a large amount of software and hardware overhead.

In multitask parallelism, if each task is a multithreaded task, then, because the tasks are independent of each other and the threads are independent of each other, the processing method is equivalent to that of a multithreaded task. If there are single-threaded tasks, then after they are parallelized into multiple threads there will be some associated threads in the process of multitask parallel execution, and these must be treated differently.

A cryptographic service involves multiple elements. For example, a block cipher service involves the cryptographic algorithm, KEYs, initial vectors, working modes, and so on. Different cryptographic services have different operational elements. The rapid implementation of multiuser cryptographic services belongs to the multitask parallel domain. As a device for realizing multiuser and massive data cryptographic services, a high-performance cryptographic server must achieve the following two points. The first is the correctness of user services: the processing requests of different users must not be confused. The second is the rapidity of data processing. There are many studies on the fast implementation of the cryptographic algorithm itself, such as improving the computing performance of block cipher algorithms through pipelining [9–13] and optimizing the key operations of public-key cryptography algorithms to improve the operation speed [14–16]. Some studies also accelerate cryptographic operations through multicore parallelism. For example, the works in [17–19] use a GPU to implement parallel processing of part of the cryptographic algorithms. These research results usually improve the performance of a single cryptographic algorithm. The works in [20–22] adopt a heterogeneous multicore approach to complete the parallel processing of multiple cryptographic algorithms. However, no data processing method has been proposed for multiuser cryptographic services in the presence of multiple cryptographic algorithms, multiple KEYs, and multiple data streams.

Figure 1: The execution of multithreaded task.

Figure 2: (a) DOACROSS. (b) DSWP.

This paper proposes a dual-channel pipeline parallel data processing model (DPP), which includes parallel scheduling, algorithm preprocessing, algorithm operation, and result acquisition. The DPP is designed and implemented in a heterogeneous multicore parallel architecture. This parallel system performs a variety of cryptographic algorithms and supports linear expansion of the algorithm operation units. Each algorithm operation unit adopts dual channels to receive data and realizes parallel operations among multiple tasks.

3. Thread Split

As mentioned above, different cryptographic services have different computing elements, which is expressed as

$$\text{Service} = \{ID, crypto, key, IV, mode\}.$$

ID is the service number that is set for multiple users.

Cryptographic operations are usually carried out in blocks, and the user's data can be divided into several blocks. The size of the block varies with the cryptographic algorithm. Here, UB is used to represent the size of a single block. The research in this paper is based on the following assumptions:

(1) Parallel processing is performed with UB granularity.

(2) Each cryptographic algorithm core completes the operation of one block. The algorithm core adopts the full pipeline design.

The fast implementation of the cryptographic algorithm core is not discussed here. Provided that the interface conditions are met, any full pipeline implementation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. This paper focuses on the parallelism between different blocks of a cryptographic service.

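3.1. Symmetric Cryptographic Algorithm. The parallelism between blocks of a symmetric cryptographic algorithm depends on the working mode adopted by the algorithm. The commonly used working modes are ECB, CBC, CFB, OFB, CTR [23–27], and so on. Assume that $C_i$ denotes the $i$th ciphertext block, $P_i$ denotes the $i$th plaintext block, Enc denotes the encryption algorithm, Dec denotes the decryption algorithm, $key$ denotes the KEY, $IV$ denotes the initial vector, $n$ is the number of plaintext/ciphertext blocks, $T_i$ is the counter value, which increases by 1 with each block, and $u$ is the length of the last block.

(1) ECB working mode:

$$\text{Encryption: } C_i = \mathrm{Enc}(key, P_i), \quad 0 \le i < n.$$

$$\text{Decryption: } P_i = \mathrm{Dec}(key, C_i), \quad 0 \le i < n.$$

(2) CTR working mode:

$$\text{Encryption: } \begin{cases} C_i = P_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ C_{n-1} = P_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

$$\text{Decryption: } \begin{cases} P_i = C_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ P_{n-1} = C_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

(3) CBC working mode:

$$\text{Encryption: } \begin{cases} C_0 = \mathrm{Enc}(key, IV \oplus P_0), \\ C_i = \mathrm{Enc}(key, C_{i-1} \oplus P_i), & 0 < i < n. \end{cases}$$

$$\text{Decryption: } \begin{cases} P_0 = \mathrm{Dec}(key, C_0) \oplus IV, \\ P_i = \mathrm{Dec}(key, C_i) \oplus C_{i-1}, & 0 < i < n. \end{cases}$$

(4) CFB working mode:

$$\text{Encryption: } \begin{cases} C_0 = \mathrm{Enc}(key, IV) \oplus P_0, \\ C_i = \mathrm{Enc}(key, C_{i-1}) \oplus P_i, & 0 < i < n. \end{cases}$$

$$\text{Decryption: } \begin{cases} P_0 = \mathrm{Enc}(key, IV) \oplus C_0, \\ P_i = \mathrm{Enc}(key, C_{i-1}) \oplus C_i, & 0 < i < n. \end{cases}$$

(5) OFB working mode:

$$\text{Encryption: } \begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ C_i = S_i \oplus P_i, & 0 \le i < n. \end{cases}$$

$$\text{Decryption: } \begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ P_i = S_i \oplus C_i, & 0 \le i < n. \end{cases}$$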

Because there is no dependency between blocks in the ECB and CTR modes, their blocks can be processed in parallel. ECB and CTR are therefore parallel modes and are very suitable for parallel processing. In the CBC, CFB, and OFB modes, each block depends on the result of the previous block, so CBC, CFB, and OFB are serial modes. When multicore parallel operations are used in multiuser scenarios, attention must be paid to coordination and synchronization among blocks.
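The difference between parallel and serial modes can be made concrete with a small sketch. The block function below is a toy, insecure stand-in for a real cipher such as SM4 and is purely an assumption for illustration; only the chaining structure of the modes follows the definitions in this section.

```python
from typing import List

BLOCK = 16  # block size UB in bytes


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key: bytes, block: bytes) -> bytes:
    """Toy invertible block function (NOT secure): XOR with key, rotate left one byte."""
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]


def dec(key: bytes, block: bytes) -> bytes:
    """Inverse of enc: rotate right one byte, then XOR with key."""
    mixed = block[-1:] + block[:-1]
    return xor(mixed, key)


def ecb_encrypt(key: bytes, blocks: List[bytes]) -> List[bytes]:
    # Every block is independent, so the list could be processed in any order.
    return [enc(key, p) for p in blocks]


def cbc_encrypt(key: bytes, iv: bytes, blocks: List[bytes]) -> List[bytes]:
    out, prev = [], iv
    for p in blocks:  # block i needs the ciphertext of block i-1
        prev = enc(key, xor(prev, p))
        out.append(prev)
    return out


def cbc_decrypt(key: bytes, iv: bytes, blocks: List[bytes]) -> List[bytes]:
    out, prev = [], iv
    for c in blocks:
        out.append(xor(dec(key, c), prev))
        prev = c
    return out


def ofb_crypt(key: bytes, iv: bytes, blocks: List[bytes]) -> List[bytes]:
    """OFB encryption and decryption are the same keystream XOR."""
    out, s = [], iv
    for b in blocks:  # keystream block S_i needs S_{i-1}
        s = enc(key, s)
        out.append(xor(s, b))
    return out
```

Changing the first plaintext block alters every later CBC ciphertext block but leaves the other ECB ciphertext blocks untouched, which is the dependency that the scheduling in the DPP model has to respect.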

By analyzing each working mode, we can divide the encryption and decryption operation into three threads. Thread 1 acquires the input data of the algorithm core, thread 2 performs the encryption/decryption operation on a single block, and thread 3 outputs the ciphertext/plaintext data. Taking the CBC encryption mode and the OFB decryption mode as examples, the thread splitting is shown in Table 1.

With this split, the function of thread 2 is relatively simple: it is the cryptographic algorithm operation on one UB. Since encryption and decryption operations usually require multiple rounds of confusion and iterative operations, the operation time of thread 2 is longer than that of thread 1 and thread 3. Taking the SM4 algorithm as an example, each block needs 32 rounds of function operations. In the full pipeline approach, the algorithm architecture is shown in Figure 3, where F0 to F31 represent the 32 rounds of function operations. Different blocks without dependencies can be executed in parallel within the pipeline.

For example, suppose service 1 = {ID1, SM4, key1, IV1, CBC} and service 2 = {ID2, SM4, key2, none, ECB} correspond to the data streams data 1 and data 2, where data 1 can be split into $m$ UB blocks and data 2 can be split into $n$ UB blocks. That is,

$$\text{data 1} = \{p_{11}, p_{12}, \ldots, p_{1m}\}, \qquad \text{data 2} = \{p_{21}, p_{22}, \ldots, p_{2n}\}. \tag{1}$$

When $p_{1i}$ is being operated on in module $F_k$ ($0 \le k \le 31$), $p_{1(i+1)}$ cannot enter the thread 1 module, but $p_{2j}$ can enter the thread 1 module and the modules $F_{k'}$ ($k' < k$). We can therefore insert block data of the parallel mode between blocks of the serial working mode and hide the execution time of the parallel mode block data inside that of the serial working mode block data. The pipeline depth of thread 2 determines the number of independent blocks that can be inserted between dependent blocks.
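The hiding effect can be illustrated with a simplified cycle-level model. The sketch below is an assumption-laden abstraction, not the hardware scheduler: it considers only the thread 2 pipeline, admits at most one block per cycle, lets a dependent block enter only after its predecessor has left the pipeline, and ignores the time of threads 1 and 3.

```python
def finish_cycle(dep_blocks: int, indep_blocks: int, depth: int) -> int:
    """Cycle at which the last block leaves a `depth`-stage pipeline.

    dep_blocks   -- blocks of one serial-mode stream (each waits for its predecessor)
    indep_blocks -- blocks of parallel-mode streams (may enter in any free cycle)
    """
    cycle = 0
    dep_ready_at = 0  # earliest cycle at which the next dependent block may enter
    last_exit = 0
    while dep_blocks or indep_blocks:
        if dep_blocks and cycle >= dep_ready_at:
            dep_blocks -= 1
            dep_ready_at = cycle + depth
            last_exit = max(last_exit, cycle + depth)
        elif indep_blocks:
            indep_blocks -= 1
            last_exit = max(last_exit, cycle + depth)
        # If neither branch fires, the entry slot stays empty for this cycle.
        cycle += 1
    return last_exit


if __name__ == "__main__":
    depth = 32  # SM4: 32 round functions F0..F31
    print(finish_cycle(4, 0, depth))   # serial stream alone
    print(finish_cycle(4, 93, depth))  # independent blocks fill the idle slots
    print(finish_cycle(4, 94, depth))  # one block more than the idle slots
```

In this model, four dependent blocks leave 3 × 31 = 93 idle entry slots, so up to 93 independent blocks complete without extending the total time.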

3.2. Hash Algorithm. A Hash function obtains the hash value of a message through multiple iterations over the block data, so the Hash operation has dependencies between blocks. By similar analysis, a Hash algorithm can also be divided into three threads: message expansion (ME), iterative compression, and hash value output. The message expansion calculates the parameters required by the iterative compression function, and the hash value output produces the final result. Taking SM3 as an example, the algorithm architecture is shown in Figure 4. The parallelism of Hash operations can only occur between blocks of different data streams.

4. DPP Parallel Data Processing Model

The parallel computing models based on PRAM, BSP, and LogP [28, 29] are all computing oriented; they are not tailored to data processing and are not suitable for practical applications involving massive data. It can be seen from the above description that the cryptographic operation data in a multiuser environment have the following characteristics: (1) different user data streams are interleaved with each other, and (2) independent blocks and dependent blocks are interleaved with each other. Therefore, the parallel processing model must have the following two mechanisms: (1) distinguishing between different user data streams and (2) distinguishing between data stream dependencies. For this purpose, we encapsulate the data stream and add certain attribute information to the block data to express its properties, thereby facilitating subsequent parallel processing.

The cryptographic operations of streaming data are data-intensive applications. Their typical feature is that the data value is time-sensitive, so the system requires low latency. When a large amount of multiuser data reaches the system in a continuous, fast, time-varying, and interleaved way, it must be quickly sent to each cryptographic algorithm operation node. Otherwise, data loss may occur because of the limited storage space of the system receiver. Following the MapReduce data stream processing strategy, a specialized module is used to distribute the data streams.

The three threads of cryptographic operations of different working modes are implemented by three modules: the preprocessing module, the operation module, and the result output module. The data reorganization module completes the integration of the data stream packages of each service. Figure 5 shows the dual-channel pipeline parallel data processing model proposed in this paper. The data stream processing is divided into six stages: job package encapsulation (PE), parallel scheduling (PS), job package preprocessing (PP), algorithm operation (AO), result output (RO), and data reorganization (DR). The job package encapsulation and data reorganization are completed by the node P0, the parallel scheduling is completed by the node P0, and the job package preprocessing, algorithm operation, and result output are completed by the algorithm operation unit cry_IP. Each algorithm corresponds to a cluster of algorithm operation units. For example, the operation module cry_IPi1 is an encryption operation unit of algorithm si, and cry_IPi2 is a decryption operation unit of algorithm si. The dual channel is embodied in the two channels for the input data and the two channels for the output data of each algorithm operation unit.

Table 1: Thread split.

| Thread | CBC encryption mode | OFB decryption mode |
|---|---|---|
| Thread 1 | $result1 = \mathrm{XOR}(X, P_i)$, where $X = IV$ for $i = 0$ and $X = C_{i-1}$ for $0 < i < n$ | $result1 = IV$ for $i = 0$ and $result1 = result2_{i-1}$ for $0 < i < n$ |
| Thread 2 | $result2 = \mathrm{Enc}(Key, result1)$ | $result2_i = \mathrm{Enc}(Key, result1_i)$ |
| Thread 3 | $C_i = result2$ | $P_i = result2_i \ \mathrm{XOR}\ C_i$, $0 \le i < n$ |

Figure 3: SM4 3-thread algorithm operation.

Figure 4: SM3 3-thread algorithm operation.

4 Journal of Electrical and Computer Engineering

In the DPP model, data processing is performed in units of job packages. No explicit synchronization process is required between job packages; synchronization is implicit in the algorithm preprocessing. Each job package is completed by a fixed algorithm operation unit, and there is no data interaction between the algorithm operation units during job package processing.

4.1. Encapsulation. The encapsulation is completed by the master node. The format of the encapsulated job package is as follows:

$$P = \{ID, crypto, key, IV, mode, No, flag, l\}$$

ID is the service number set for multiple users and is used to distinguish different data streams. crypto represents the required cryptographic algorithm and the encryption or decryption operation, key is the key, IV is the initial vector, and mode is the working mode, which can be used to distinguish the dependency property of the job package. No is the job package serial number, which is used to reassemble the data stream after the algorithm operation. flag is the tail package identifier, which indicates whether the service data flow has ended. l is the length of the job package and is consistent with the size of the unit block of the cryptographic algorithm.
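The job package format can be pictured as a small record type. The following is a minimal sketch, not the authors' implementation: the field names follow the format above, while the `data` payload field, the `encapsulate` helper, the `"SM4-enc"` encoding of crypto, and the 16-byte default block size are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PARALLEL_MODES = {"ECB", "CTR"}          # blocks are independent
SERIAL_MODES = {"CBC", "CFB", "OFB"}     # each block depends on its predecessor


@dataclass(frozen=True)
class JobPackage:
    """One job package P = {ID, crypto, key, IV, mode, No, flag, l} plus its data block."""
    ID: int                 # service number, distinguishes data streams
    crypto: str             # algorithm and direction, e.g. "SM4-enc" (hypothetical encoding)
    key: bytes
    IV: Optional[bytes]     # None for modes that need no initial vector
    mode: str               # working mode, one of PARALLEL_MODES or SERIAL_MODES
    No: int                 # serial number within the stream, starting at 1
    flag: int               # 1 marks the tail package of the stream
    l: int                  # payload length, equal to the unit block size
    data: bytes = b""       # assumed payload field, not part of the listed format


def encapsulate(ID, crypto, key, IV, mode, payload, block_size=16):
    """Split a payload into unit blocks and wrap each one as a job package."""
    if len(payload) == 0 or len(payload) % block_size != 0:
        raise ValueError("payload must be a non-empty multiple of the block size")
    blocks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
    return [
        JobPackage(ID, crypto, key, IV, mode, No=n, flag=int(n == len(blocks)),
                   l=block_size, data=block)
        for n, block in enumerate(blocks, start=1)
    ]
```

Padding of a final short block is left out of the sketch; a real encapsulation step would have to handle it according to the chosen working mode.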

The flow data received by the system is described as $Task = \{P_{ij}\}$, where i corresponds to the service number and j corresponds to the sequence number of the job package within the service. We refer to job packages in parallel mode as independent job packages and job packages in serial mode as dependent job packages.

4.2. Dual Channel and Parallel Scheduling. According to the thread splitting, under the same algorithm requirements the difference in package processing between working modes is embodied in the preprocessing module and the result output module. The operation module does not consider the correlation among job packages and only processes the input job packages. For this reason, it is necessary to distinguish the input data of the preprocessing module and of the result output module. We adopt dual channels to receive job packages and to separate independent job packages from dependent job packages.

4.2.1. Dual Receiving Channel of the Preprocessing Module. Channel 1 is used for the transfer of independent job packages. Since the first job package of a data stream in serial mode has no predecessor and is not associated with job packages of other data streams, the job packages transmitted in channel 1 satisfy the condition

(mode = ECB|CTR) or (mode = CBC|CFB|OFB and No = 1).

Channel 2 is used for the transfer of dependent job packages. The job packages transmitted in channel 2 satisfy the condition

(mode = CBC|CFB|OFB) and No ≠ 1.

Suppose that the working modes of four services that request the cry_IPab operation are as follows: service 1.mode = service 2.mode = CBC and service 3.mode = service 4.mode = ECB. The job packages on channel 1 and channel 2 after parallel scheduling are shown in Figure 6.
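The two channel conditions amount to a small classification function. The sketch below is illustrative only: the function name `input_channel` and the arrival order are hypothetical, and the mode assignment (services 1 and 2 in CBC, services 3 and 4 in ECB) is the reading of the example that matches the channel contents listed for Figure 6.

```python
PARALLEL_MODES = {"ECB", "CTR"}
SERIAL_MODES = {"CBC", "CFB", "OFB"}


def input_channel(mode: str, no: int) -> int:
    """Return 1 for an independent job package and 2 for a dependent one."""
    if mode in PARALLEL_MODES or (mode in SERIAL_MODES and no == 1):
        return 1
    if mode in SERIAL_MODES:
        return 2
    raise ValueError(f"unknown working mode: {mode}")


# Assumed reading of the example: services 1 and 2 use CBC, services 3 and 4 use ECB.
service_mode = {1: "CBC", 2: "CBC", 3: "ECB", 4: "ECB"}
# (service, No) pairs in an arbitrary interleaved arrival order (hypothetical).
arrivals = [(1, 1), (3, 1), (2, 1), (3, 2), (2, 2), (4, 1), (1, 2)]

channels = {1: [], 2: []}
for service, no in arrivals:
    channels[input_channel(service_mode[service], no)].append(f"P{service}{no}")

print(channels)
```

With these inputs, channel 1 receives P11, P21, P31, P32, and P41, and channel 2 receives P12 and P22, which is the same split shown in Figure 6.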

The channel is selected by the control signal Si, and the default selection is channel 1, that is, Si = 0. A counter cnt records the execution time of the dependent job package in the algorithm operation unit. When cnt senses that the preprocessing module has input a dependent job package, counting starts. Assume that the algorithm operation unit needs m clock cycles to complete the operation of a dependent job package. When cnt = m, the counter is cleared (cnt = 0) and Si is set to 1 to select channel 2 and input the next dependent job package. In all other cases, Si = 0 and channel 1 is selected. The state flow diagram is shown in Figure 7.
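The counter-driven selection can be sketched as a three-state machine. This is one possible reading of the description and of Figure 7, not the hardware design itself: the state names S0, S1, and S2 come from the figure, while the function name, the per-cycle input list, and the return from S2 to S0 when no serial package follows are assumptions. Serial packages arriving while the counter is running are ignored in this simplified model.

```python
def select_signal(serial_inputs, m):
    """Yield (state, cnt, Si) for each clock cycle.

    serial_inputs[t] is True when the package entering the preprocessing
    module in cycle t belongs to a serial mode (CBC, CFB, or OFB).
    """
    state, cnt = "S0", 0
    for is_serial in serial_inputs:
        if state == "S0":                 # idle: cnt = 0, Si = 0 (channel 1)
            cnt, si = 0, 0
            nxt = "S1" if is_serial else "S0"
        elif state == "S1":               # counting: cnt = cnt + 1, Si = 0
            cnt, si = cnt + 1, 0
            nxt = "S2" if cnt == m else "S1"
        else:                             # S2: cnt = 0, Si = 1 (channel 2)
            cnt, si = 0, 1
            nxt = "S1" if is_serial else "S0"
        yield state, cnt, si
        state = nxt


# A dependent package enters in cycle 0; the unit needs m = 3 cycles per package.
trace = list(select_signal([True, False, False, False, True, False, False, False, False], m=3))
print([si for _, _, si in trace])
```

In this trace Si is raised only in the cycle after the counter reaches m, so channel 1 stays selected for the intervening cycles and independent job packages can be issued during them.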

4.2.2. Dual Receiving Channel of the Result Output Module. Channel 3 is used for the transfer of independent job packages. The job packages transmitted in channel 3 satisfy the condition

(mode = ECB|CTR) or (mode = CBC|CFB|OFB and flag = 1).

Channel 4 is used for the transfer of dependent job packages. The job packages transmitted in channel 4 satisfy the condition

(mode = CBC|CFB|OFB) and flag ≠ 1.

Channel 4 provides support for storing intermediate states in serial mode. Since the result of the tail package does not need to be used as an intermediate state, the tail package's result in serial mode is also output through channel 3.

The channel is selected by the control signal So, and channel 3 is the default, that is, So = 0. The channel selection control is the same as that of the preprocessing module: when cnt = m, So is set to 1 and channel 4 is selected; in all other cases, So = 0 and channel 3 is selected. When So = 1, the job package transmitted by channel 4 has the same ID as the job package received by the preprocessing module, as shown in Figure 8.

Figure 5: Dual-channel pipeline parallel data processing model (DPP).


4.2.3. Parallel Scheduling. The process of parallel scheduling is as follows:

Step 1. Determine the algorithm operation unit according to crypto.
Step 2. Select the input channel of the preprocessing module of the algorithm operation unit according to mode and No.

This scheduling method realizes fast transfer of incoming data streams and continuous processing of job packages. The use of dual channels reduces the interaction between modules and hides the processing time of independent job packages within the processing time of dependent job packages, which facilitates the parallel execution of job packages.

4.2.4. Data Processing Steps

Step 1. The algorithm application process splits the data to be processed, adds attribute information, and encapsulates it as job packages.
Step 2. The algorithm operation unit is determined according to the crypto field in the job package, and the input channel of its preprocessing module is selected according to mode and No. The job package is sent to the corresponding input channel.
Step 3. The preprocessing module obtains the input data of the algorithm operation module, namely data, KEY, and IV, according to the mode and No fields of the package.
Step 4. The algorithm operation module performs pipeline processing on the received data and sends the result to the receiving channel of the result output module according to mode and flag.
Step 5. The result output module outputs the received job package to the result receiving process and determines whether to feed the job package back to the preprocessing module according to the mode and flag of the job package.
Step 6. The result receiving process recombines the received job packages based on ID.
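Step 6 can be illustrated with a short sketch of the reorganization. It assumes that results arrive as `(ID, No, data)` tuples in arbitrary order and that No gives the position within a stream, as described in Section 4.1; the function name `reorganize` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def reorganize(results):
    """Group (ID, No, data) result tuples by ID and join each stream in No order."""
    streams = defaultdict(list)
    for ID, no, data in results:
        streams[ID].append((no, data))
    return {
        ID: b"".join(data for _, data in sorted(parts, key=lambda part: part[0]))
        for ID, parts in streams.items()
    }


# Results of two interleaved services, received out of order.
received = [(2, 1, b"a"), (1, 2, b"y"), (1, 1, b"x"), (2, 2, b"b")]
print(reorganize(received))
```

Because each result carries its own ID and No, the receiving process needs no knowledge of which algorithm operation unit or channel produced it.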

4.3. Parallel Execution Time. Assume that P0 is the job package encapsulation and reorganization node and that Pa, Pb, and Pc are algorithm operation nodes. Tpar is the package encapsulation time, and Tz is the data reorganization time. g is the communication interval, that is, the minimum time interval at which node P0 continuously transmits and receives job packages; the reciprocal of g corresponds to the communication bandwidth. L is the maximum communication delay, which is the time taken to transmit a job package from node P0 to the scheduling node. Ts indicates the parallel scheduling time of job packages, Tr the job package preprocessing time, Tc the job package operation time, and Tf the job package output time. The calculated load is the set of job packages, and m is the pipeline depth of the algorithm operation module. The message delivery process of job packages on DPP is shown in Figure 9.

The continuous sending and receiving of messages must satisfy the condition

$$L + T_s + T_r + \frac{T_c}{m} \le g + L + T_s + T_r, \quad \text{that is,} \quad g \ge \frac{T_c}{m}. \tag{2}$$

Figure 6: Parallel scheduling and dual receiving channel of preprocessing module.

Figure 7: Dual-channel selection control of preprocessing module.

Figure 8: Dual-channel control of output module.

Conclusion 1. The system communication bandwidth can be improved in two ways: increasing the operation speed of the algorithm operation module (reducing Tc) and increasing its pipeline depth m.
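A small numeric illustration of condition (2) may help. The sketch below uses the 160 MHz clock reported in Section 5.2 and the 32 SM4 rounds from Section 3.1; that each round takes exactly one clock cycle is an assumption made only for this example, and the function name `min_interval` is hypothetical.

```python
def min_interval(t_c: float, m: int) -> float:
    """Smallest communication interval g allowed by condition (2): g >= Tc / m."""
    if m < 1:
        raise ValueError("pipeline depth must be at least 1")
    return t_c / m


clock_hz = 160e6              # main frequency reported in Section 5.2
t_c = 32 / clock_hz           # assumption: 32 SM4 rounds, one clock cycle each
for depth in (1, 8, 32):
    g = min_interval(t_c, depth)
    print(f"m={depth:2d}  g >= {g * 1e9:6.2f} ns  bandwidth <= {1 / g / 1e6:6.1f} M packages/s")
```

Under these assumptions the lower bound on g falls from 200 ns with no pipelining to 6.25 ns with a 32-stage pipeline, which is the effect Conclusion 1 describes.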

If the job packages of the load come from α data streams, consider two scenarios:

(1) Each data stream adopts a parallel mode, that is, the job packages are all independent. The load processing time is

$$T = T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L, \tag{3}$$

where n is the number of job packages.

(2) Each data stream adopts a serial mode, so that job packages of the same service data stream are mutually dependent and job packages of different service data streams are mutually independent. Assume that the maximum number of job packages in one service is w′. In the extreme case, the first job package of this data stream appears after those of the other service data streams, and the operation time of the other service data streams is then hidden within that of the longest service data stream. The load processing time is

$$T = T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L, \tag{4}$$

where w is the total number of job packages.

For a data stream that mixes serial and parallel modes, the pipeline design of the algorithm operation module allows independent job packages to be executed in parallel while dependent job packages are being processed, so the execution time of independent job packages is hidden within the execution time of the dependent job packages. Therefore, the execution time T of the multitask mixed-mode data stream satisfies

$$T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L \le T \le T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{5}$$
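The two bounds can be evaluated directly from equations (3) and (4). The sketch below is illustrative: the timing values are made-up unit costs rather than measurements, and the names `Timing`, `all_parallel_time`, and `all_serial_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    t_par: float   # job package encapsulation time
    t_s: float     # parallel scheduling time
    t_r: float     # preprocessing time
    t_c: float     # algorithm operation time
    t_f: float     # result output time
    t_z: float     # data reorganization time
    L: float       # maximum communication delay
    g: float       # communication interval


def all_parallel_time(t: Timing, n: int) -> float:
    """Equation (3): n independent job packages."""
    return t.t_par + (n - 1) * t.g + t.t_s + t.t_r + t.t_c + t.t_f + t.t_z + 2 * t.L


def all_serial_time(t: Timing, w: int, w_max: int) -> float:
    """Equation (4): w job packages, the longest stream holding w_max of them."""
    per_package = t.t_s + t.t_r + t.t_c + t.t_f
    return t.t_par + (w - w_max - 1) * t.g + w_max * per_package + t.t_z + 2 * t.L


# Hypothetical unit costs (arbitrary time units), 1000 packages, longest stream 400.
timing = Timing(t_par=1, t_s=1, t_r=1, t_c=32, t_f=1, t_z=1, L=2, g=1)
lower = all_parallel_time(timing, n=1000)
upper = all_serial_time(timing, w=1000, w_max=400)
print(lower, upper)
```

With these numbers the upper bound is dominated by the term for the longest stream, w′(Ts + Tr + Tc + Tf), which is why the number of job packages in the longest serial stream governs the worst case.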

Conclusion 2. The execution time of mixed cross-data streams is bounded by the execution time of the data stream with the most job packages. With the pipeline depth held constant, improving the processing performance of each module in the pipeline is the key to improving overall processing performance.

5. Implementation and Testing

5.1. Hardware Implementation. We prototyped the model to verify its validity. The architecture is shown in Figure 10. The cipher server management system receives the multiuser cryptographic service data streams, encapsulates the job packages, and reorganizes the data of the operation results. The cryptographic algorithm operations are performed by a cipher machine acting as a coprocessor. The cipher machine is built on a Xilinx XC7K325T FPGA, which includes the parallel scheduling module and the SM3 and SM4 cryptographic algorithm cores.

The hardware implementation block diagram of the cipher machine is shown in Figure 11. The cipher machine adopts the PCIe interface, receives by DMA the job packages split by the algorithm application process of the cipher server management system, and stores them in DOWN_FIFO in the downlink data storage area.

Parallel scheduling module PSCHEDULE: determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No fields, and transfers the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2.

Preprocessing module IP_CTRL: acquires the algorithm core input data, including IV, KEY, and the result of the preceding dependent job package, and calculates the input data for the cryptographic cores.

Operation modules SM4 and SM3: the cryptographic cores perform algorithm operations on the input data in a pipelined way and send the result of the operation to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Figure 10: Multiuser cryptographic service.

Figure 9: Message delivery on the DPP model.


Result output module UP_CTRL: if IP_CTRL provides an ID number, the data with the same ID number are extracted from RAM, and the output result result′ is calculated, fed back to IP_CTRL, and written to the output FIFOo at the same time. If IP_CTRL provides no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are returned to the result reception process of the cipher server management system through the interface module in DMA mode.

5.2. Test. The test environment is as follows. The main frequency of the heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325T FPGA is 160 MHz, and the interface with the upper application is PCIe 2.0 × 8.

Test 1. SM4 CBC encryption of a 4000 MB file. The test operation ends after 114.390935 s, so the data stream processing rate is 4000 × 8 / 114.390935 = 279.742446 Mbps.

Test 2. Test 2.1 to Test 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. These files use different IVs and KEYs in different working modes. The end time of each file's processing is shown in Table 2. For the data set, the maximum end time is the total time taken. The data flow processing rate is derived from the following formula:

Figure 11: Hardware architecture of the cipher machine.

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time (s), in file order | Processing rate |
|---|---|---|---|---|
| Test 2.1 | Files 1–4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8 / 13.5175 s = 1.89 Gbps |
| | Files 5–8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| Test 2.2 | Files 1–4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8 / 12.5052 s = 2.05 Gbps |
| | Files 5–8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| Test 2.3 | Files 1–4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8 / 4.9895 s = 5.13 Gbps |
| | Files 5–8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| Test 2.4 | Files 1–4 | SM3, SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8 / 13.1909 s = 1.94 Gbps |
| | Files 5–8 | SM4 CBC, SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |


$$\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}. \tag{6}$$
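The reported rates can be rechecked with equation (6). In the sketch below the end times are read as seconds with four decimal places (six for Test 1), an interpretation that reproduces the rates stated in the text; the function name `rate_mbps` is hypothetical, and 1 MB is counted as 8 Mbit, following the arithmetic of Test 1.

```python
def rate_mbps(size_mb: float, total_time_s: float) -> float:
    """Equation (6) with the size given in MB (1 MB counted as 8 Mbit, as in Test 1)."""
    return size_mb * 8 / total_time_s


# Maximum end time per test from Table 2, in seconds.
max_end_time_s = {"Test 2.1": 13.5175, "Test 2.2": 12.5052,
                  "Test 2.3": 4.9895, "Test 2.4": 13.1909}

print(f"Test 1: {rate_mbps(4000, 114.390935):.2f} Mbps")
for name, t in max_end_time_s.items():
    # Eight files of 400 MB each give a 3200 MB data set.
    print(f"{name}: {rate_mbps(8 * 400, t) / 1000:.2f} Gbps")
```

The results agree with the reported figures: about 279.74 Mbps for Test 1 and 1.89, 2.05, 5.13, and 1.94 Gbps for Tests 2.1 to 2.4.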

Analysis. Because Test 1 has only one file in CBC mode, its job packages are interrelated and are all executed serially. In Test 2.1, the packages within each file are interrelated, but the files are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, four files use the ECB working mode, and the operation time of their independent job packages can be hidden within the operation time of the dependent job packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3, in which all files use ECB, is the highest. Test 2.4 has two files with independent job packages and six files with dependent job packages, but they are allocated to two algorithm units, so the processing rate is close to that of Test 2.1.

Test 3. Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to j files. If $N_i$ represents the number of job packages in file i, then $\sum_{i=1}^{j} N_i = 10000$. With j set to 10, 20, 30, and 40, each file uses either the ECB or the CBC encryption mode. We vary the number of files in CBC encryption mode and compare the completion time of the data flow under the single-channel and dual-channel architectures. Each configuration is run several times and the data flow processing time is averaged; the comparison result is shown in Figure 12. Single 0 means that all files use ECB mode and the system adopts the single-channel architecture. Dual 50 indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.

As can be seen from Figure 12, when the data flow consists only of independent job packages, the pipelined algorithm operation unit keeps the processing rate under the dual channel close to that under the single channel. As the share of dependent job packages in the data flow increases, the processing rate advantage of the dual channel gradually appears, and as the number of files in the data stream increases, the advantage becomes more obvious.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model, DPP, to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing between independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can process multiservice, personalized cross-data streams correctly and rapidly. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

References

[1] Y. Song and Z. Li, "Applying array contraction to a sequence of DOALL loops," in Proceedings of the International Conference on Parallel Processing (ICPP'04), vol. 1, pp. 46–53, Montreal, Canada, August 2004.

[2] G. Elsesser, V. Ngo, S. Bhattacharya, and W. T. Tsai, "Load balancing of DOALL loops in the Perfect Club," in Proceedings of the 1993 Proceedings Seventh International Parallel Processing Symposium, pp. 129–133, Newport, CA, USA, April 1993.

Figure 12: Processing rate comparison of dual channel and single channel (multiple files, SM4 ECB and CBC cross encryption; x-axis: file number; y-axis: processing rate in Gbps; series: Single 0, Single 50, Single 100, Dual 0, Dual 50, Dual 100).


[3] D.-K. Chen and P.-C. Yew, "Statement re-ordering for DOACROSS loops," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 24–28, Raleigh, NC, USA, August 1994.

[4] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, "Automatic thread extraction with decoupled software pipelining," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp. 105–118, Barcelona, Spain, November 2005.

[5] V. Krishnan and J. Torrellas, "A chip-multiprocessor architecture with speculative multithreading," IEEE Transactions on Computers, vol. 48, no. 9, pp. 866–880, 1999.

[6] A. S. Rajam, L. E. Campostrini, J. M. M. Caamaño, and P. Clauss, "Speculative runtime parallelization of loop nests: towards greater scope and efficiency," in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 245–254, Hyderabad, India, May 2015.

[7] S. Aldea, A. Estebanez, D. R. Llanos, and A. Gonzalez-Escribano, "An OpenMP extension that supports thread-level speculation," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 78–91, 2016.

[8] J. Salamanca, J. N. Amaral, and G. Araujo, "Evaluating and improving thread-level speculation in hardware transactional memories," in Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 586–595, Chicago, IL, USA, May 2016.

[9] Z. Ying and B. Qinghai, "The scheme for improving the efficiency of block cipher algorithm," in Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), pp. 824–826, Ottawa, ON, Canada, September 2014.

[10] P. Kitsos and A. N. Skodras, "An FPGA implementation and performance evaluation of the seed block cipher," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP), pp. 1–5, Corfu, Greece, July 2011.

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, "ELmD: a pipelineable authenticated encryption and its hardware implementation," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, "AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique," in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, "Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA," in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, "An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography," in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, "A novel high performance ECC processor architecture with two staged multiplier," in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, "Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller," in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, "High speed implementation of symmetric block cipher on GPU," in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, "Implementation and evaluation of different parallel designs of AES using CUDA," in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, "Accelerating NTRU based homomorphic encryption using GPUs," in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, "An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study," in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., "Heterogeneous multicore SoC with SiP for secure multimedia applications," IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, "Evaluating time and throughput at different modes of operation in AES algorithm," in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, "Summary of topological study of chaotic cbc mode of operation," in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, "A high speed implementation counter mode cryptography using hardware parallelism," in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, "Analysis of the statistical cipher feedback mode of block ciphers," IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, "A study on encryption algorithms and modes for disk encryption," in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, "Research on task parallel programming model," Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.


This paper proposes a dual-channel pipeline parallel data processing model (DPP), which includes parallel scheduling, algorithm preprocessing, algorithm operation, and result acquisition. The DPP is designed and implemented in a heterogeneous multicore parallel architecture. This parallel system performs a variety of cryptographic algorithms and supports linear expansion of the algorithm operation units. Each algorithm operation unit adopts dual channels to receive data and realizes parallel operations among multiple tasks.

3. Thread Split

As mentioned above, different cryptographic services have different computing elements, which are expressed as

$$Service = \{ID, crypto, key, IV, mode\}.$$

ID is the service number that is set for multiple users. Cryptographic operations are usually carried out in blocks, and a user's data can be divided into several blocks. The size of the block varies with the cryptographic algorithm. Here, UB is used to represent the size of a single block. The research in this paper is based on the following assumptions:

(1) Parallel processing is performed at UB granularity.
(2) Each cryptographic algorithm core completes the operation of one block, and the algorithm core adopts a full pipeline design.

The fast implementation of the cryptographic algorithm core is not discussed here. Provided that the interface conditions are met, any full pipeline implementation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. This paper focuses on the parallelism between different blocks of cryptographic services.

3.1. Symmetric Cryptographic Algorithm. The parallelism between blocks of symmetric cryptographic algorithms must take into account the working mode adopted by the cryptographic algorithm. The commonly used working modes are ECB, CBC, CFB, OFB, CTR [23–27], and so on. Assume that $C_i$ denotes the ith ciphertext block, $P_i$ denotes the ith plaintext block, Enc denotes the encryption algorithm, Dec denotes the decryption algorithm, key denotes the key, IV denotes the initial vector, n is the number of plaintext/ciphertext blocks, $T_i$ is the counter value, which increases by 1 with each block, and u is the length of the last block. The symbol $\oplus$ denotes bitwise XOR.

(1) ECB working mode

Encryption: $C_i = \mathrm{Enc}(key, P_i)$, $0 \le i < n$.

Decryption: $P_i = \mathrm{Dec}(key, C_i)$, $0 \le i < n$.

(2) CTR working mode

Encryption:
$$\begin{cases} C_i = P_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ C_{n-1} = P_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

Decryption:
$$\begin{cases} P_i = C_i \oplus \mathrm{Enc}(key, T_i), & i = 0, 1, 2, \ldots, n-2, \\ P_{n-1} = C_{n-1} \oplus \mathrm{MSB}_u(\mathrm{Enc}(key, T_{n-1})). \end{cases}$$

(3) CBC working mode

Encryption:
$$\begin{cases} C_0 = \mathrm{Enc}(key, IV \oplus P_0), \\ C_i = \mathrm{Enc}(key, C_{i-1} \oplus P_i), & 0 < i < n. \end{cases}$$

Decryption:
$$\begin{cases} P_0 = \mathrm{Dec}(key, C_0) \oplus IV, \\ P_i = \mathrm{Dec}(key, C_i) \oplus C_{i-1}, & 0 < i < n. \end{cases}$$

(4) CFB working mode

Encryption:
$$\begin{cases} C_0 = \mathrm{Enc}(key, IV) \oplus P_0, \\ C_i = \mathrm{Enc}(key, C_{i-1}) \oplus P_i, & 0 < i < n. \end{cases}$$

Decryption:
$$\begin{cases} P_0 = \mathrm{Enc}(key, IV) \oplus C_0, \\ P_i = \mathrm{Enc}(key, C_{i-1}) \oplus C_i, & 0 < i < n. \end{cases}$$

(5) OFB working mode

Encryption:
$$\begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ C_i = S_i \oplus P_i, & 0 \le i < n. \end{cases}$$

Decryption:
$$\begin{cases} S_0 = \mathrm{Enc}(key, IV), \\ S_i = \mathrm{Enc}(key, S_{i-1}), & 0 < i < n, \\ P_i = S_i \oplus C_i, & 0 \le i < n. \end{cases}$$

Because there is no dependency between blocks in the ECB and CTR modes, blocks can be processed in parallel. ECB and CTR are therefore parallel modes and are well suited to parallel processing. The CBC, CFB, and OFB modes have dependencies among blocks, so they are serial modes. When using multicore parallel operations in multiuser scenarios, attention must be paid to coordination and synchronization among blocks.

By analyzing each working mode, we can divide the encryption and decryption operation into three threads. Thread 1 acquires the input data of the algorithm core, thread 2 completes the encryption/decryption operation of a single block, and thread 3 outputs the ciphertext/plaintext data. Taking the CBC encryption mode and the OFB decryption mode as examples, the thread splitting is shown in Table 1.
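The three-thread split for CBC encryption and OFB decryption can be traced in a few lines of code. This is a sequential sketch of the data flow only, not a parallel implementation: `enc` is a stand-in keyed hash used purely so that the example runs without a cipher library (it is neither SM4 nor secure, and it is not invertible, so CBC decryption is not shown), and all function names are hypothetical.

```python
import hashlib

BLOCK = 16


def enc(key: bytes, block: bytes) -> bytes:
    """Stand-in for the block cipher core (NOT SM4, not secure): a truncated keyed hash."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key, iv, plaintext_blocks):
    ciphertext, prev = [], iv
    for p in plaintext_blocks:
        result1 = xor(prev, p)        # thread 1: X = IV for i = 0, else C_{i-1}
        result2 = enc(key, result1)   # thread 2: single-block cipher operation
        ciphertext.append(result2)    # thread 3: C_i = result2
        prev = result2
    return ciphertext


def ofb_decrypt(key, iv, ciphertext_blocks):
    plaintext, prev = [], iv
    for c in ciphertext_blocks:
        result1 = prev                      # thread 1: IV for i = 0, else result2_{i-1}
        result2 = enc(key, result1)         # thread 2: single-block cipher operation
        plaintext.append(xor(result2, c))   # thread 3: P_i = result2 XOR C_i
        prev = result2
    return plaintext
```

In both functions the value `prev` is what ties block i to block i − 1: thread 1 cannot start on the next block of the same stream until thread 2 has finished the current one, which is the dependency that makes these modes serial.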

With this split, the function of thread 2 is relatively simple: it is the cryptographic algorithm operation on one UB. Since encryption and decryption operations usually require multiple rounds of confusion and iterative operations, the operation time of thread 2 is longer than that of thread 1 and thread 3. Taking the SM4 algorithm as an example, each block needs 32 rounds of function operations. In the full pipeline approach, the algorithm architecture is shown in Figure 3, where F0 to F31 represent the 32 rounds of function operations. Different blocks without dependencies can be executed in parallel within it.

For example, service 1 {ID1, SM4, key1, IV1, CBC} and service 2 {ID2, SM4, key2, none, ECB} correspond to the data streams data 1 and data 2. Suppose that data 1 can be split into m UB blocks and data 2 can be split into n UB blocks, that is,

$$
\begin{aligned}
\text{data 1} &= \{p_{11}, p_{12}, \ldots, p_{1m}\}, \\
\text{data 2} &= \{p_{21}, p_{22}, \ldots, p_{2n}\}.
\end{aligned} \tag{1}
$$

While $p_{1i}$ is being processed in module $F_k$ ($0 \le k \le 31$), $p_{1(i+1)}$ cannot enter the thread 1 module, but $p_{2j}$ can enter the thread 1 and $F_{k'}$ ($k' < k$) modules. We can therefore insert block data of the parallel mode between blocks of the serial working mode and hide the execution time of the parallel mode block data inside that of the serial working mode block data. The thread 2 pipeline depth determines the number of independent blocks that can be inserted between dependent blocks.
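The hiding effect described above can be checked with a small cycle-level simulation. This is a simplified sketch, not the paper's hardware: it assumes that one block enters the pipeline per clock cycle, that a block leaves `depth` cycles after it enters, that a dependent block may enter only once its predecessor has left, and that the latencies of thread 1 and thread 3 are ignored. The function name `last_exit_cycle` is hypothetical.

```python
def last_exit_cycle(depth: int, n_dependent: int, n_independent: int) -> int:
    """Return the cycle at which the last block leaves a pipeline of `depth` stages.

    Dependent blocks form one serial chain; independent blocks fill the idle slots.
    """
    cycle = 0
    next_dep_ready = 0  # earliest cycle at which the next dependent block may enter
    last_exit = 0
    while n_dependent or n_independent:
        if n_dependent and cycle >= next_dep_ready:
            # The previous dependent block has left: issue the next one.
            n_dependent -= 1
            next_dep_ready = cycle + depth
            last_exit = cycle + depth
        elif n_independent:
            # Otherwise use this slot for an independent block.
            n_independent -= 1
            last_exit = cycle + depth
        # If neither branch fires, the slot stays empty (a pipeline bubble).
        cycle += 1
    return last_exit
```

With a depth of 32, four dependent blocks alone finish at cycle 128, and adding up to 3 × 31 = 93 independent blocks does not change that figure in this model, because they occupy slots that would otherwise be bubbles.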

3.2. Hash Algorithm. The hash function obtains the hash value of a message through multiple iterations over the block data, so the hash operation has dependencies between blocks. By analysis, the hash algorithm can also be divided into 3 threads: message expansion (ME), iterative compression, and hash value output. Message expansion calculates the parameters required by the iterative compression function, and the hash value output delivers the final result. Taking SM3 as an example, the algorithm architecture is shown in Figure 4. Parallelism in hash operations can only occur between blocks of different data streams.

4. DPP Parallel Data Processing Model

The parallel computing models based on PRAM, BSP, and LogP [28, 29] are all computing oriented, lack pertinence to data processing, and are not suitable for practical applications with massive data. It can be seen from the above description that the cryptographic operation data in the multiuser environment have the following characteristics:

(1) Different user data streams intersect each other.
(2) Independent blocks and dependent blocks are interleaved with each other.

Therefore, the parallel processing model must have the following two mechanisms: (1) distinguish between different user data streams and (2) distinguish between data stream dependencies. For this purpose, we encapsulate the data stream and add certain attribute information to the block data to express its properties, thereby facilitating subsequent parallel processing.

The cryptographic operations on streaming data are data-intensive applications. Their typical feature is that the data value is time-sensitive, so the system requires low latency. When a large amount of multiuser data reaches the system in a continuous, fast, time-varying, and interleaved way, it must be quickly sent to each cryptographic algorithm operation node. Otherwise, data loss may occur because of the limited storage space of the system receiver. Following the MapReduce data stream processing strategy, a specialized module is used to distribute the data streams.

The three threads of cryptographic operations in different working modes are implemented by three modules: the preprocessing module, the operation module, and the result output module. The data reorganization module integrates the data stream packages of each service. Figure 5 shows the dual-channel pipeline parallel data processing model proposed in this paper. The data stream processing is divided into six stages: job package encapsulation (PE), parallel scheduling (PS), job package preprocessing (PP), algorithm operation (AO), result output (RO), and data reorganization (DR). Job package encapsulation and data reorganization are completed by the node P0, parallel scheduling is completed by the node P0, and job package preprocessing, algorithm operation, and result output are completed by the algorithm operation unit cry_IP. Each algorithm corresponds to a cluster of algorithm operation units. For example, the operation module cry_IPi1 is an encryption operation unit of algorithm si, and cry_IPi2 is a decryption operation unit of algorithm si. The dual channel is embodied in the two channels for the input data and the two channels for the output data of the algorithm operation unit.

Table 1: Thread split.

| | CBC encryption mode | OFB decryption mode |
|---|---|---|
| Thread 1 | result 1 = XOR(X, $P_i$), where X = IV for i = 0 and X = $C_{i-1}$ for 0 < i < n | result 1 = IV for i = 0; result 1 = result $2_{i-1}$ for 0 < i < n |
| Thread 2 | result 2 = Enc(Key, result 1) | result $2_i$ = Enc(Key, result $1_i$) |
| Thread 3 | $C_i$ = result 2 | $P_i$ = result 2 XOR $C_i$, 0 ≤ i < n |
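The split in Table 1 can be written out as a short Python sketch in which each loop iteration performs the work of thread 1, thread 2, and thread 3 in turn. The sketch is sequential and only shows the data flow between the three steps; `enc` is a hypothetical toy permutation standing in for SM4, and the function names are illustrative.

```python
MASK = (1 << 128) - 1  # blocks are modelled as 128-bit integers


def enc(key: int, block: int) -> int:
    """Hypothetical stand-in for Enc(Key, .): XOR with the key, then rotate left by 7."""
    x = (block ^ key) & MASK
    return ((x << 7) | (x >> 121)) & MASK


def cbc_encrypt_threads(key: int, iv: int, plaintext: list) -> list:
    """CBC encryption following the left column of Table 1."""
    cipher = []
    for i, p in enumerate(plaintext):
        x = iv if i == 0 else cipher[i - 1]
        result1 = x ^ p              # thread 1: build the core input
        result2 = enc(key, result1)  # thread 2: single-block operation
        cipher.append(result2)       # thread 3: C_i = result 2
    return cipher


def ofb_decrypt_threads(key: int, iv: int, ciphertext: list) -> list:
    """OFB decryption following the right column of Table 1."""
    plain = []
    prev_result2 = None
    for i, c in enumerate(ciphertext):
        result1 = iv if i == 0 else prev_result2  # thread 1: feed back result 2 of block i - 1
        result2 = enc(key, result1)               # thread 2: single-block operation
        plain.append(result2 ^ c)                 # thread 3: P_i = result 2 XOR C_i
        prev_result2 = result2
    return plain
```

In both columns only thread 1 and thread 3 differ between modes, while thread 2 is the same call, which is what lets one pipelined core serve every working mode.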

Figure 3: SM4 3-thread algorithm operation.

Figure 4: SM3 3-thread algorithm operation.


In the DPP model, data processing is performed in units of job packages. No explicit synchronization process is required between job packages; synchronization is implicit in the algorithm preprocessing. Each job package is completed by a fixed algorithm operation unit, and there is no data interaction between the algorithm operation units during job package processing.

4.1. Encapsulation. The encapsulation is completed by the master node. The format of the encapsulated job package is as follows:

$$
P = \{ID, crypto, key, IV, mode, No, flag, l\}.
$$

ID is the service number set for multiuser operation and is used to distinguish different data streams. crypto represents the specific demand for cryptographic algorithms and encryption/decryption operations. key is the key, IV is the initial vector, and mode is the working mode, which can be used to distinguish the dependency property of the job package. No is the job package serial number, which is used to reassemble the data stream after the algorithm operation. flag is the tail package identifier, which indicates whether the service data flow has ended. l is the length of the job package and is consistent with the size of the unit block of the cryptographic algorithm.

The flow data received by the system is described as Task = {$P_{ij}$}, where i corresponds to the service number and j corresponds to the sequence number of the job package within that service. We refer to job packages in parallel mode as independent job packages and to job packages in serial mode as dependent job packages.
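A minimal Python sketch of the job package and of the splitting step is given below. The field names follow the format P above; everything else is an assumption made for illustration: the `data` payload field, the `"SM4-ENC"` style of the crypto string, numbering No from 1, setting flag to 1 on the tail package, and the helper name `encapsulate`.

```python
from dataclasses import dataclass
from typing import List, Optional

SERIAL_MODES = {"CBC", "CFB", "OFB"}  # dependent job packages
PARALLEL_MODES = {"ECB", "CTR"}       # independent job packages


@dataclass
class JobPackage:
    ID: int               # service number
    crypto: str           # algorithm and operation, e.g. "SM4-ENC" (assumed encoding)
    key: bytes
    IV: Optional[bytes]   # None when the mode needs no IV
    mode: str
    No: int               # serial number within the service, assumed to start at 1
    flag: int             # 1 marks the tail package (assumed encoding)
    l: int                # package length, equal to the unit block size
    data: bytes           # payload block (assumed field, not named in P)

    @property
    def dependent(self) -> bool:
        return self.mode in SERIAL_MODES


def encapsulate(service_id: int, crypto: str, key: bytes, iv: Optional[bytes],
                mode: str, data: bytes, block_size: int = 16) -> List[JobPackage]:
    """Split a data stream into unit blocks and wrap each one as a job package."""
    if not data or len(data) % block_size:
        raise ValueError("data must be a non-empty multiple of the block size")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [
        JobPackage(service_id, crypto, key, iv, mode,
                   No=n, flag=int(n == len(blocks)), l=block_size, data=block)
        for n, block in enumerate(blocks, start=1)
    ]
```

For instance, a 48-byte CBC stream yields three dependent job packages numbered 1 to 3, with flag set only on the last one.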

4.2. Dual Channel and Parallel Scheduling. According to the thread splitting, under the same algorithm requirements the difference in package processing between working modes is embodied in the preprocessing module and the result output module. The operation module does not consider the correlation among job packages and only processes the input job packages. For this reason, it is necessary to distinguish the input data of the preprocessing module and of the result output module. We adopt dual channels to receive job packages and to separate independent job packages from dependent job packages.

4.2.1. Dual Receiving Channel of the Preprocessing Module. Channel 1 is used for the transfer of independent job packages. Considering that the first job package of a data stream in serial mode is not associated with the job packages of other data streams, the job packages transmitted in channel 1 satisfy the condition

(mode = ECB | CTR) or (mode = CBC | CFB | OFB and No = 1).

Channel 2 is used for the transfer of dependent job packages. The job packages transmitted in channel 2 satisfy the condition

(mode = CBC | CFB | OFB) and No ≠ 1.

Suppose that the working modes of four services that request the cry_IPab operation are as follows: service 1.mode = service 2.mode = CBC and service 3.mode = service 4.mode = ECB. The job packages on channel 1 and channel 2 after parallel scheduling are shown in Figure 6.
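The two channel conditions reduce to a small predicate, sketched here in Python. The service modes match the four-service example above; the arrival order of the job packages is an assumption chosen for illustration, and the helper names are hypothetical.

```python
SERIAL_MODES = {"CBC", "CFB", "OFB"}
PARALLEL_MODES = {"ECB", "CTR"}


def input_channel(mode: str, no: int) -> int:
    """Return the receiving channel (1 or 2) of the preprocessing module."""
    if mode in PARALLEL_MODES:
        return 1
    if mode in SERIAL_MODES:
        # Only the first package of a serial stream is treated as independent.
        return 1 if no == 1 else 2
    raise ValueError(f"unknown mode: {mode}")


def schedule(arrivals, service_modes):
    """Distribute (service, No) pairs over the two channels in arrival order."""
    channels = {1: [], 2: []}
    for service, no in arrivals:
        channels[input_channel(service_modes[service], no)].append(f"P{service}{no}")
    return channels


service_modes = {1: "CBC", 2: "CBC", 3: "ECB", 4: "ECB"}
# Assumed arrival order of the interleaved job packages.
arrivals = [(1, 1), (3, 1), (3, 2), (4, 1), (2, 1), (2, 2), (1, 2)]
channels = schedule(arrivals, service_modes)
```

Under this assumed arrival order, channel 1 receives P11, P31, P32, P41, and P21, and channel 2 receives P22 and P12.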

The selection of the channel is determined by the control signal $S_i$, and the default selection is channel 1, that is, $S_i = 0$. cnt is used to record the execution time of the dependent job package in the algorithm operation unit. When module cnt senses that the preprocessing module inputs a dependent job package, the counting starts. It is assumed that the algorithm operation unit needs m clock cycles to complete the operation of the dependent job package. When cnt = m, the counter is cleared (cnt = 0) and $S_i$ is set to 1 to select channel 2 and input the next dependent job package. In other cases, $S_i = 0$ and channel 1 is selected. The state flow diagram is shown in Figure 7.
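One possible reading of this control logic is sketched below as a three-state machine in Python. The state names S0, S1, and S2 follow Figure 7, but the exact cycle timing is an assumption of this sketch (S1 is held for m cycles with cnt running from 1 to m, and S2 lasts one cycle), as are the function names.

```python
def next_state(state: str, cnt: int, m: int, dependent_input: bool):
    """Return (next_state, next_cnt) after one clock cycle."""
    if state in ("S0", "S2"):
        # A dependent package entering the preprocessing module starts the count.
        return ("S1", 1) if dependent_input else ("S0", 0)
    # State S1: keep counting until the operation unit has finished (cnt = m).
    if cnt == m:
        return ("S2", 0)
    return ("S1", cnt + 1)


def si(state: str) -> int:
    """Channel select signal: 1 selects channel 2, 0 selects channel 1."""
    return 1 if state == "S2" else 0


def trace(m: int, cycles: int, dependent_pending: int) -> list:
    """Record Si per cycle while `dependent_pending` dependent packages wait to be input."""
    state, cnt, out = "S0", 0, []
    for _ in range(cycles):
        out.append(si(state))
        takes = state in ("S0", "S2") and dependent_pending > 0
        if takes:
            dependent_pending -= 1
        state, cnt = next_state(state, cnt, m, takes)
    return out
```

With m = 4 and two pending dependent packages, the select signal stays at 0 while the unit is busy and pulses to 1 once per completed dependent package, so channel 1 can feed independent packages in all the other cycles.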

4.2.2. Dual Receiving Channel of the Result Output Module. Channel 3 is used for the transfer of independent job packages. The job packages transmitted in channel 3 satisfy the condition

(mode = ECB | CTR) or (mode = CBC | CFB | OFB and flag = 1).

Channel 4 is used for the transfer of dependent job packages. The job packages transmitted in channel 4 satisfy the condition

(mode = CBC | CFB | OFB) and flag ≠ 1.

Channel 4 provides support for storing intermediate states in serial mode. Since the result of the tail package does not need to be used as an intermediate state, the tail package's result in serial mode is also output through channel 3.

The choice of channel is determined by the control signal $S_o$, and channel 3 is the default, that is, $S_o = 0$. The channel selection control signal is the same as that of the preprocessing module. When cnt = m, $S_o$ is set to 1 and channel 4 is selected. In other cases, $S_o = 0$ and channel 3 is selected. When $S_o = 1$, the job package transmitted by channel 4 has the same ID as the job package received by the preprocessing module, as shown in Figure 8.

Figure 5: Dual-channel pipeline parallel data processing model (DPP).


4.2.3. Parallel Scheduling. The process of parallel scheduling is as follows:

Step 1. Determine the algorithm operation unit according to crypto.
Step 2. Select the input channel of the preprocessing module of the algorithm operation unit according to mode and No.

This scheduling method realizes fast transfer of incoming data streams and continuous processing of job packages. The use of dual channels reduces the interaction between modules and hides the processing time of independent job packages in the processing time of dependent job packages, facilitating the parallel execution of job packages.

4.2.4. Data Processing Steps

Step 1. The algorithm application process splits the data to be processed, adds attribute information, and encapsulates it as job packages.
Step 2. The algorithm operation unit is determined according to the crypto field in the job package, and the input channel of its preprocessing module is selected according to mode and No. The job package is sent to the corresponding input channel.
Step 3. The preprocessing module obtains the input data of the algorithm operation module, namely, data, KEY, and IV, according to the mode and No fields of the package.
Step 4. The algorithm operation module performs pipeline processing on the received data and sends the result to the receiving channel of the result output module according to mode and flag.
Step 5. The result output module outputs the received job package to the result receiving process and determines whether to feed the job package back to the preprocessing module according to the mode and flag of the job package.
Step 6. The result receiving process recombines the received job packages based on ID.
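The steps above can be sketched functionally in Python. This is a software illustration only, not the hardware data path: `enc` is a hypothetical toy permutation rather than SM4, only ECB and CBC encryption are covered, job packages are reduced to (ID, No, flag, block) tuples, and the `state` dictionary plays the role that channel 4 plays for intermediate results. Packages of one dependent stream are assumed to arrive in order of No.

```python
from collections import defaultdict

MASK = (1 << 128) - 1  # blocks are modelled as 128-bit integers


def enc(key: int, block: int) -> int:
    """Hypothetical stand-in for the block cipher: XOR with the key, then rotate left by 7."""
    x = (block ^ key) & MASK
    return ((x << 7) | (x >> 121)) & MASK


def cbc_reference(key: int, iv: int, blocks: list) -> list:
    """Plain per-stream CBC encryption, used to check the interleaved result."""
    out, prev = [], iv
    for p in blocks:
        prev = enc(key, prev ^ p)
        out.append(prev)
    return out


def process_interleaved(packages, services):
    """packages: (ID, No, flag, block) tuples in arrival order.
    services: ID -> (key, iv, mode) with mode "ECB" or "CBC"."""
    state = {}                   # last result per dependent stream (intermediate state)
    results = defaultdict(dict)  # ID -> {No: output block}
    for sid, no, flag, block in packages:
        key, iv, mode = services[sid]
        if mode == "ECB":
            out = enc(key, block)               # independent package
        else:
            prev = iv if no == 1 else state[sid]
            out = enc(key, prev ^ block)        # dependent package
            if flag == 1:
                state.pop(sid, None)            # tail result is not fed back
            else:
                state[sid] = out
        results[sid][no] = out
    # Data reorganization: regroup by ID and order by No.
    return {sid: [r[n] for n in sorted(r)] for sid, r in results.items()}
```

Interleaving the packages of a CBC service with those of an ECB service gives, after regrouping by ID, the same ciphertext as encrypting each stream on its own.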

4.3. Parallel Execution Time. Assume that P0 is the job package encapsulation and reorganization node and that Pa, Pb, and Pc are algorithm operation nodes. $T_{par}$ is the package encapsulation time, and $T_z$ is the data reorganization time. g is the communication interval, that is, the minimum time interval at which node P0 continuously transmits and receives job packages; the reciprocal of g corresponds to the communication bandwidth. L is the maximum communication delay, which is the time taken to transmit a job package from node P0 to the scheduling node. $T_s$ indicates the parallel scheduling time of job packages, $T_r$ the job package preprocessing time, $T_c$ the job package operation time, and $T_f$ the job package output time. w represents the computational load, which is the set of job packages. m is the pipeline depth of the algorithm operation module. The message delivery process of the job package on DPP is shown in Figure 9.

The continuous sending and receiving of messages needs to meet the conditions

$$
\begin{aligned}
L + T_s + T_r + \frac{T_c}{m} &\le g + L + T_s + T_r, \\
g &\ge \frac{T_c}{m}.
\end{aligned} \tag{2}
$$

Figure 6: Parallel scheduling and dual receiving channel of preprocessing module.

Figure 7: Dual-channel selection control of preprocessing module (states S0: cnt = 0, Si = 0; S1: cnt = cnt + 1, Si = 0; S2: cnt = 0, Si = 1).

Figure 8: Dual-channel control of output module.

Conclusion 1. The system communication bandwidth can be improved in two ways: increasing the operation speed of the algorithm operation module and increasing its pipeline depth.

If the w job packages come from α data streams, consider two scenarios:

(1) Each data stream adopts a parallel mode, that is, the job packages are all independent. The load processing time is as follows:

$$
T = T_{par} + (n-1)g + T_s + T_r + T_c + T_f + T_z + 2L. \tag{3}
$$

(2) Each data stream adopts a serial mode, so that job packages of the same service data stream are mutually dependent and job packages of different service data streams are mutually independent. Assume that the maximum number of job packages in one service is $w'$. In the extreme case, the first job package of this data stream appears after the other service data streams, and the operation time of the other service data streams is then hidden within that of the longest service data stream. The load processing time is as follows:

$$
T = T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{4}
$$

For a data stream that mixes serial and parallel modes, the pipeline design of the algorithm operation module allows independent job packages to be executed in parallel while dependent job packages are being processed, so the execution time of the independent job packages is hidden in the execution time of the dependent job packages. Therefore, the execution time T of the multitask mixed-mode data stream satisfies

$$
\begin{aligned}
T_{par} + (n-1)g + T_s + T_r + T_c + T_f + T_z + 2L \le T \le{} & T_{par} + (w - w' - 1)g \\
& + w'(T_s + T_r + T_c + T_f) + T_z + 2L.
\end{aligned} \tag{5}
$$

Conclusion 2. The execution time of mixed cross-data streams is bounded by the execution time of the data stream with the most job packages. For a constant pipeline depth, improving the processing performance of each module in the pipeline is the key to improving the processing method.
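The bounds in formulas (2) to (5) can be evaluated numerically with a few lines of Python. The function names and all numeric values below are hypothetical and chosen only to exercise the formulas (times are in arbitrary clock cycles, and the same package count is used for n and w); they are not measurements from this paper.

```python
def min_interval(t_c: float, m: int) -> float:
    """Smallest communication interval g allowed by formula (2): g >= Tc / m."""
    return t_c / m


def t_all_independent(n, g, t_par, t_s, t_r, t_c, t_f, t_z, L):
    """Formula (3): every job package is independent."""
    return t_par + (n - 1) * g + t_s + t_r + t_c + t_f + t_z + 2 * L


def t_all_dependent(w, w_max, g, t_par, t_s, t_r, t_c, t_f, t_z, L):
    """Formula (4): every stream is serial; w_max is the largest per-service package count."""
    return t_par + (w - w_max - 1) * g + w_max * (t_s + t_r + t_c + t_f) + t_z + 2 * L


# Hypothetical parameters, in clock cycles.
params = dict(t_par=5, t_s=1, t_r=1, t_c=32, t_f=1, t_z=5, L=2)
g = min_interval(params["t_c"], m=32)   # a 32-stage pipeline allows g = 1
lower = t_all_independent(n=100, g=g, **params)
upper = t_all_dependent(w=100, w_max=40, g=g, **params)
```

With these values the all-independent load needs 148 cycles and the all-dependent load 1473 cycles, so a mixed load of the same size falls between the two, as formula (5) states.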

5. Implementation and Testing

5.1. Hardware Implementation. We prototyped the model to verify its validity. The architecture is shown in Figure 10. The cipher server management system completes the reception of multiuser cryptographic service data streams, the encapsulation of job packages, and the data reorganization of the operation results. The cryptographic algorithm operation is performed by a crypto machine acting as a coprocessor. The crypto machine is designed using a Xilinx XC7K325T FPGA, which includes the parallel scheduling module and the SM3 and SM4 cryptographic algorithm cores.

The hardware implementation block diagram of the cipher machine is shown in Figure 11. The cipher machine adopts the PCIe interface, receives by DMA the job packages split by the algorithm application process of the cipher server management system, and stores them in DOWN_FIFO in the downlink data storage area.

Parallel scheduling module PSCHEDULE: determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No fields, and transfers the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2.

Preprocessing module IP_CTRL: acquires the algorithm core input data, including IV, KEY, and the result of the preceding dependent job package, and calculates the input data for the cryptographic cores.

Operation modules SM4 and SM3: the cryptographic cores perform algorithm operations on the input data in a pipelined way and send the result of the operation to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Figure 10: Multiuser cryptographic service.

Figure 9: Message delivery on the DPP model.


Result output module UP_CTRL: if IP_CTRL gives the ID number, the data with the same ID number are extracted from RAM, and the output result result′ is calculated, fed back to IP_CTRL, and output to the output FIFOo at the same time. If IP_CTRL outputs no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are fed back to the result reception process of the cipher server management system through the interface module in DMA mode.

5.2. Test. The test environment is as follows: the main frequency of the heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325T FPGA is 160 MHz, and the interface with the upper application is PCIe 2.0 × 8.

Test 1. SM4 CBC encryption of a 4000 MB file. The operation takes 114.390935 s, so the data stream processing rate is 4000 × 8/114.390935 = 279.742446 Mbps.

Test 2. Tests 2.1 to 2.4 use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. These files use different IVs and keys in different working modes. The end time of each file's processing is shown in Table 2. For the data set, the maximum end time is the total time taken. The data flow processing rate is derived from formula (6).

Figure 11: Hardware architecture of partem.

Table 2: Cipher operation time under cross files.

| Test | | File 1 | File 2 | File 3 | File 4 | File 5 | File 6 | File 7 | File 8 | Processing rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.1 | Mode | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | |
| 2.1 | Time (s) | 13.5175 | 13.4912 | 13.4994 | 13.4831 | 13.4788 | 13.4872 | 13.4950 | 13.5014 | 3200 × 8/13.5175 s = 1.89 Gbps |
| 2.2 | Mode | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | SM4 CBC | SM4 CBC | SM4 CBC | SM4 CBC | |
| 2.2 | Time (s) | 12.5029 | 12.4946 | 12.4988 | 12.4784 | 12.4825 | 12.4906 | 12.4864 | 12.5052 | 3200 × 8/12.5052 s = 2.05 Gbps |
| 2.3 | Mode | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | SM4 ECB | |
| 2.3 | Time (s) | 4.3340 | 4.3166 | 4.0397 | 4.1721 | 4.3433 | 4.9895 | 4.1908 | 4.3623 | 3200 × 8/4.9895 s = 5.13 Gbps |
| 2.4 | Mode | SM3 | SM3 | SM4 ECB | SM4 ECB | SM4 CBC | SM4 CBC | SM4 OFB | SM4 OFB | |
| 2.4 | Time (s) | 12.8909 | 12.4151 | 13.0394 | 13.1713 | 12.7424 | 12.6119 | 13.1909 | 12.4842 | 3200 × 8/13.1909 s = 1.94 Gbps |


$$
\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}. \tag{6}
$$
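Formula (6) can be applied directly to the reported times. The short Python check below uses only figures stated in Test 1 and Table 2; the function name `rate_mbps` is illustrative, and the conversion follows the arithmetic used in the text (size in MB × 8, with 1 Gbps taken as 1000 Mbps).

```python
def rate_mbps(size_mb: float, seconds: float) -> float:
    """Formula (6) with the size given in MB: rate in Mbps = size * 8 / total time."""
    return size_mb * 8 / seconds


# Test 1: one 4000 MB file.
test1_mbps = rate_mbps(4000, 114.390935)

# Test 2: eight 400 MB files (3200 MB in total); the total time is the maximum end time.
max_end_time_s = {"2.1": 13.5175, "2.2": 12.5052, "2.3": 4.9895, "2.4": 13.1909}
test2_gbps = {name: round(rate_mbps(3200, t) / 1000, 2)
              for name, t in max_end_time_s.items()}
```

The computed values agree with the rates listed for Test 1 and in Table 2.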

Analysis. Because Test 1 has only one file in CBC mode, its job packages are interrelated and are all executed serially. Although the packages within each file are interrelated, the files of Test 2.1 are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, 4 files are in the ECB working mode, and the operation time of the independent job packages can be hidden within that of the dependent job packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3 is the highest. Test 2.4 has 2 files with independent job packages and 6 files with dependent job packages, but they are allocated to two algorithm units, so the operation rate is close to that of Test 2.1.

Test 3. Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to j files. If $N_i$ represents the number of job packages in file i, then $\sum_{i=1}^{j} N_i = 10000$. j takes the values 10, 20, 30, and 40, and each file adopts the ECB or CBC encryption mode. We change the number of files in CBC encryption mode and compare the completion time of the data flow in the single-channel and dual-channel architectures. The data flow processing time is averaged over several runs, and the comparison result is shown in Figure 12. "Single 0%" means that all files use ECB mode and the system adopts the single-channel architecture, "Dual 50%" indicates that 50% of the files in the data stream use CBC mode and the system uses the dual-channel architecture, and so on.

As can be seen from Figure 12, when the data flow consists only of independent packages, the processing rate under the dual channel is close to that under the single channel, because the algorithm operation unit adopts the pipeline design. As the number of associated job packages in the data flow increases, the advantage of the dual channel in data processing rate gradually appears, and it becomes more obvious as the number of files in the data stream increases.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model, DPP, to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing between independent job packages and data streams. It hides the processing of independent job packages in the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can realize correct and rapid processing of multiservice and personalized cross-data streams. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

References

[1] Y. Song and Z. Li, "Applying array contraction to a sequence of DOALL loops," in Proceedings of the International Conference on Parallel Processing (ICPP'04), vol. 1, pp. 46–53, Montreal, Canada, August 2004.

[2] G. Elsesser, V. Ngo, S. Bhattacharya, and W. T. Tsai, "Load balancing of DOALL loops in the Perfect Club," in Proceedings of the 1993 Proceedings Seventh International Parallel Processing Symposium, pp. 129–133, Newport, CA, USA, April 1993.

Figure 12: Processing rate comparison of dual channel and single channel (multiple files, SM4 ECB and CBC cross encryption; x-axis: file number, 10 to 40; y-axis: processing rate in Gbps; series: Single 0, Single 50, Single 100, Dual 0, Dual 50, Dual 100).


[3] D. K. C. Ding-Kai Chen and P. C. Y. Pen-Chung Yew, "Statement re-ordering for DOACROSS loops," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 24–28, Raleigh, NC, USA, August 1994.

[4] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, "Automatic thread extraction with decoupled software pipelining," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp. 105–118, Barcelona, Spain, November 2005.

[5] V. Krishnan and J. Torrellas, "A chip-multiprocessor architecture with speculative multithreading," IEEE Transactions on Computers, vol. 48, no. 9, pp. 866–880, 1999.

[6] A. S. Rajam, L. E. Campostrini, J. M. M. Caamaño, and P. Clauss, "Speculative runtime parallelization of loop nests: towards greater scope and efficiency," in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 245–254, Hyderabad, India, May 2015.

[7] S. Aldea, A. Estebanez, D. R. Llanos, and A. Gonzalez-Escribano, "An OpenMP extension that supports thread-level speculation," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 78–91, 2016.

[8] J. Salamanca, J. N. Amaral, and G. Araujo, "Evaluating and improving thread-level speculation in hardware transactional memories," in Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 586–595, Chicago, IL, USA, May 2016.

[9] Z. Ying and B. Qinghai, "The scheme for improving the efficiency of block cipher algorithm," in Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), pp. 824–826, Ottawa, ON, Canada, September 2014.

[10] P. Kitsos and A. N. Skodras, "An FPGA implementation and performance evaluation of the seed block cipher," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP), pp. 1–5, Corfu, Greece, July 2011.

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, "ELmD: a pipelineable authenticated encryption and its hardware implementation," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, "AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique," in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, "Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA," in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, "An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography," in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, "A novel high performance ECC processor architecture with two staged multiplier," in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, "Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller," in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, "High speed implementation of symmetric block cipher on GPU," in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, "Implementation and evaluation of different parallel designs of AES using CUDA," in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, "Accelerating NTRU based homomorphic encryption using GPUs," in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, "An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study," in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., "Heterogeneous multicore SoC with SiP for secure multimedia applications," IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, "Evaluating time and throughput at different modes of operation in AES algorithm," in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, "Summary of topological study of chaotic cbc mode of operation," in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, "A high speed implementation counter mode cryptography using hardware parallelism," in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, "Analysis of the statistical cipher feedback mode of block ciphers," IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, "A study on encryption algorithms and modes for disk encryption," in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, "Research on task parallel programming model," Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.

10 Journal of Electrical and Computer Engineering

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: AnEfficientStreamDataProcessingModelforMultiuser ...2018/04/02  · mentation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. is paper focuses

function operations Different blocks without dependenciescan be executed in parallel within it

For example, service 1 {ID1, SM4, key1, IV1, CBC} and service 2 {ID2, SM4, key2, none, ECB} correspond to the data streams data 1 and data 2. Suppose data 1 can be split into m UB blocks and data 2 can be split into n UB blocks, that is,

$$\text{data 1} = \{p_{11}, p_{12}, \ldots, p_{1m}\}, \qquad \text{data 2} = \{p_{21}, p_{22}, \ldots, p_{2n}\}. \tag{1}$$

When $p_{1i}$ is being operated on in module $F_k$ ($0 \le k \le 31$), $p_{1(i+1)}$ cannot enter the thread 1 module, but $p_{2j}$ can enter the thread 1 module and the modules $F_{k'}$ ($k' < k$). We can therefore insert block data of the parallel mode between blocks of the serial working mode and hide the execution time of the parallel-mode blocks inside that of the serial-mode blocks. The thread 2 pipeline depth determines the number of independent blocks that can be inserted between dependent blocks.

3.2. Hash Algorithm. The Hash function obtains the hash value of a message through multiple iterations over the block data, so the Hash operation has dependencies between blocks. Our analysis shows that the Hash algorithm can also be divided into 3 threads: message expansion (ME), iterative compression, and hash value output. The message expansion computes the parameters required by the iterative compression function, and the hash value output produces the final result. Taking SM3 as an example, the algorithm architecture is shown in Figure 4. The parallelism of Hash operations can only occur between blocks of different data streams.

4. DPP Parallel Data Processing Model

The parallel computing models based on PRAM, BSP, and LogP [28, 29] are all computing oriented; they lack pertinence to data processing and are not suitable for practical applications involving massive data. It can be seen from the above description that the cryptographic operation data in the multiuser environment have the following characteristics:

(1) Different user data streams intersect each other. (2) Independent blocks and dependent blocks are interleaved with one another. Therefore, the parallel processing model must have the following two mechanisms: (1) distinguish between different user data streams and (2) distinguish between data stream dependencies. For this purpose, we encapsulate the data stream and add certain attribute information to the block data so as to express its properties, thereby facilitating subsequent parallel processing.

The cryptographic operations of streaming data are data-intensive applications. Their typical feature is that the data value is time-sensitive, so the system requires low latency. When a large amount of multiuser data reaches the system in a continuous, fast, time-varying, and interleaved way, it must be quickly sent to each cryptographic algorithm operation node. Otherwise, data loss may occur due to the limited storage space of the system receiver. Following the MapReduce data stream processing strategy, a specialized module is used to distribute data streams.

The three threads of cryptographic operations of different working modes are implemented by three modules: the preprocessing module, the operation module, and the result output module. The data reorganization module completes the integration of the data stream packages of each service. Figure 5 shows the dual-channel pipeline parallel data processing model proposed in this paper. The data stream processing is divided into six stages: job package encapsulation (PE), parallel scheduling (PS), job package preprocessing (PP), algorithm operation (AO), result output (RO), and data reorganization (DR). The job package encapsulation and data reorganization are completed by the node P0, the parallel scheduling is completed by the node P0, and the job package preprocessing, algorithm operation, and result output are completed by the algorithm operation unit cry_IP. Each algorithm corresponds to a cluster of algorithm operation units. For example, the operation module cry_IPi1 is an encryption operation unit of algorithm si, and cry_IPi2 is a decryption operation unit of algorithm si. The dual channel is embodied in the two channels of input and output data of the algorithm operation unit.

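Table 1: Thread split.

| Thread | CBC encryption mode | OFB decryption mode |
|---|---|---|
| Thread 1 | $\text{result 1} = \mathrm{XOR}(X, P_i)$, where $X = IV$ for $i = 0$ and $X = C_{i-1}$ for $0 < i < n$ | $\text{result 1}_i = IV$ for $i = 0$ and $\text{result 1}_i = \text{result 2}_{i-1}$ for $0 < i < n$ |
| Thread 2 | $\text{result 2} = \mathrm{Enc}(Key, \text{result 1})$ | $\text{result 2}_i = \mathrm{Enc}(Key, \text{result 1}_i)$ |
| Thread 3 | $C_i = \text{result 2}$ | $P_i = \text{result 2}\ \mathrm{XOR}\ C_i$, $0 \le i < n$ |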
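The thread split in Table 1 can be illustrated with a minimal Python sketch. It assumes a toy, insecure stand-in for the block cipher (`toy_enc`, a hypothetical helper that is not SM4) and 4-byte blocks, and it runs the three threads one after another for each block. The names `cbc_encrypt` and `ofb_decrypt` are illustrative only and do not come from the paper.

```python
from typing import List

BLOCK = 4  # toy block size in bytes (assumption; SM4 uses 16-byte blocks)


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_enc(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for Enc(Key, .): XOR with the key, then rotate.

    This is NOT SM4 and offers no security; it only makes the data flow visible.
    """
    mixed = xor(key, block)
    return mixed[1:] + mixed[:1]


def cbc_encrypt(key: bytes, iv: bytes, blocks: List[bytes]) -> List[bytes]:
    out = []
    prev = iv                            # X = IV for i = 0
    for p in blocks:
        result1 = xor(prev, p)           # thread 1: result 1 = XOR(X, P_i)
        result2 = toy_enc(key, result1)  # thread 2: result 2 = Enc(Key, result 1)
        out.append(result2)              # thread 3: C_i = result 2
        prev = result2                   # X = C_{i-1} for the next block
    return out


def ofb_decrypt(key: bytes, iv: bytes, blocks: List[bytes]) -> List[bytes]:
    out = []
    result1 = iv                         # thread 1: result 1 = IV for i = 0
    for c in blocks:
        result2 = toy_enc(key, result1)  # thread 2: result 2_i = Enc(Key, result 1_i)
        out.append(xor(result2, c))      # thread 3: P_i = result 2 XOR C_i
        result1 = result2                # thread 1 input for the next block
    return out
```

In both modes, thread 1 of block i + 1 needs the thread 2 output of block i. This is the dependency that prevents consecutive blocks of one serial-mode stream from sharing the pipeline.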

Figure 3: SM4 3-thread algorithm operation.

Figure 4: SM3 3-thread algorithm operation.

In the DPP model, data processing is performed in units of job packages. No explicit synchronization process is required between job packages; synchronization is implicit in the algorithm preprocessing. Each job package is completed by a fixed algorithm operation unit, and there is no data interaction between the algorithm operation units during job package processing.

4.1. Encapsulation. The encapsulation is completed by the master node. The format of the encapsulated job package is as follows:

P = {ID, crypto, key, IV, mode, No, flag, l}

ID is the service number set for multiuser operation and is used to distinguish different data streams. crypto represents the specific demand for the cryptographic algorithm and the encryption/decryption operation, key is the key, IV is the initial vector, and mode is the working mode, which can be used to distinguish the dependency property of the job package. No is the job package serial number, which is used to reassemble the data stream after the algorithm operation. flag is the tail package identifier, which indicates whether the service data flow has ended. l is the length of the job package and is consistent with the size of the unit block of the cryptographic algorithm.

The flow data received by the system is described as Task = {Pij}, where i corresponds to the service number and j corresponds to the sequence number within the service. We refer to job packages in parallel mode as independent job packages and job packages in serial mode as dependent job packages.
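As a rough illustration of the job package format, the following Python sketch models a package as a small record. Several details are assumptions rather than statements from the paper: the `payload` field carrying the block data, the string encoding of `crypto`, serial numbers starting at 1, and `flag == 1` marking the tail package. The helper `encapsulate` is hypothetical, and padding of a short final block is omitted.

```python
from dataclasses import dataclass
from typing import List

SERIAL_MODES = {"CBC", "CFB", "OFB"}   # dependent job packages
PARALLEL_MODES = {"ECB", "CTR"}        # independent job packages


@dataclass(frozen=True)
class JobPackage:
    id: int         # ID: service number
    crypto: str     # algorithm and operation, e.g. "SM4-ENC" (hypothetical encoding)
    key: bytes
    iv: bytes
    mode: str       # working mode
    no: int         # No: serial number, assumed to start at 1
    flag: int       # tail identifier, assumed to be 1 for the last package
    payload: bytes  # block data (assumption: carried alongside the attributes)

    @property
    def l(self) -> int:
        """Length of the job package data."""
        return len(self.payload)

    @property
    def is_dependent(self) -> bool:
        return self.mode in SERIAL_MODES


def encapsulate(service_id: int, crypto: str, key: bytes, iv: bytes,
                mode: str, data: bytes, block_size: int = 16) -> List[JobPackage]:
    """Split data into unit blocks and wrap each block in a JobPackage."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [
        JobPackage(service_id, crypto, key, iv, mode, no, int(no == len(chunks)), chunk)
        for no, chunk in enumerate(chunks, start=1)
    ]
```

Each package carries enough attribute information to be routed and later reassembled without reference to the other packages of its stream.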

4.2. Dual Channel and Parallel Scheduling. According to the thread splitting, under the same algorithm requirements, the differences in package processing between working modes are embodied in the preprocessing module and the result output module. The operation module does not consider the correlation among job packages and only processes the input job packages. For this reason, it is necessary to distinguish the input data of the preprocessing module and of the result output module. We adopt dual channels to receive job packages and classify them into independent job packages and dependent job packages.

4.2.1. Dual Receiving Channel of the Preprocessing Module. Channel 1 is used for the transfer of independent job packages. Considering that the first job package of a data stream in the serial mode is not associated with job packages of other data streams, the job package transmitted in channel 1 satisfies the condition

(mode = ECB|CTR) or (mode = CBC|CFB|OFB and No = 1).

Channel 2 is used for the transfer of dependent job packages. The job package that is transmitted in channel 2 satisfies the condition

(mode = CBC|CFB|OFB) and No ≠ 1.
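The two channel conditions can be written as a single routing function. The sketch below is only an illustration; the function name `select_input_channel` is hypothetical, and it assumes that serial numbers start at 1.

```python
SERIAL_MODES = {"CBC", "CFB", "OFB"}
PARALLEL_MODES = {"ECB", "CTR"}


def select_input_channel(mode: str, no: int) -> int:
    """Return 1 or 2, the preprocessing-module channel for a job package."""
    if mode in PARALLEL_MODES:
        return 1                      # independent job package
    if mode in SERIAL_MODES:
        return 1 if no == 1 else 2    # first package of a serial stream uses channel 1
    raise ValueError(f"unknown working mode: {mode}")
```

Under these conditions, only the second and later packages of a serial-mode stream ever wait in channel 2.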

Suppose that the working modes of four services that request the cry_IPab operation are as follows: service 1.mode = service 2.mode = CBC and service 3.mode = service 4.mode = ECB. The job packages on channel 1 and channel 2 after parallel scheduling are shown in Figure 6.

The selection of the channel is determined by the control signal Si, and the default selection is channel 1, that is, Si = 0. The counter cnt records the execution time of the dependent job package in the algorithm operation unit. When the cnt module senses that the preprocessing module inputs a dependent job package, counting starts. It is assumed that the algorithm operation unit needs m clock cycles to complete the operation of a dependent job package. When cnt = m, the counter is cleared (cnt = 0) and Si is set to 1 to select channel 2 and input the next dependent job package. In other cases, Si = 0 and channel 1 is selected. The state flow diagram is shown in Figure 7.
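The channel selection described above can be simulated with a small state machine. The following Python sketch is a simplified model under stated assumptions: one package is issued per clock cycle, only one serial-mode stream is in flight, both channels are filled in advance, and the state names S0, S1, and S2 follow Figure 7. The exact cycle at which cnt is compared with m is an assumption, and the function name `schedule` is hypothetical.

```python
from collections import deque
from typing import List, Tuple

Package = Tuple[str, bool]  # (name, is_dependent)


def schedule(channel1: List[Package], channel2: List[Package], m: int) -> List[str]:
    """Simulate the Si selector and return package names in issue order."""
    if m < 1:
        raise ValueError("pipeline depth m must be at least 1")
    ch1, ch2 = deque(channel1), deque(channel2)
    state, cnt = "S0", 0
    order = []
    while ch1 or ch2:
        # Outputs of the current state.
        if state == "S1":
            cnt, si = cnt + 1, 0    # counting, channel 1 selected
        elif state == "S2":
            cnt, si = 0, 1          # counter cleared, channel 2 selected
        else:
            cnt, si = 0, 0          # idle default, channel 1 selected

        source = ch2 if si == 1 else ch1
        pkg = source.popleft() if source else None
        if pkg is not None:
            order.append(pkg[0])

        # Next-state logic.
        if state == "S1":
            state = "S2" if cnt == m else "S1"
        elif pkg is not None and pkg[1]:
            state = "S1"            # a dependent package entered: start counting
        else:
            # Assumption: if only channel 2 still holds packages, serve it next.
            state = "S2" if (not ch1 and ch2) else "S0"
    return order
```

With m = 3, channel 1 holding P11 (dependent), P21, P22, P23, P24 and channel 2 holding P12, P13, the sketch issues P11, P21, P22, P23, P12, P24, P13: the independent packages fill the cycles during which the next dependent package must wait.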

4.2.2. Dual Receiving Channel of the Result Output Module. Channel 3 is used for the transfer of independent job packages. The job package that is transmitted in channel 3 satisfies the condition

(mode = ECB|CTR) or (mode = CBC|CFB|OFB and flag = 1).

Channel 4 is used for the transfer of dependent job packages. The job package that is transmitted in channel 4 satisfies the condition

(mode = CBC|CFB|OFB) and flag ≠ 1.

Channel 4 provides support for storing intermediate states in serial mode. Since the result of the tail package does not need to be used as an intermediate state, the tail package's result in serial mode is also output through channel 3.

The choice of channel is determined by the control signal So, and channel 3 is the default, that is, So = 0. The channel selection control signal is the same as that of the preprocessing module. When cnt = m, So is set to 1 and channel 4 is selected. In other cases, So = 0 and channel 3 is selected. When So = 1, the job package transmitted by channel 4 has the same ID as the job package received by the preprocessing module, as shown in Figure 8.
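On the output side, the routing decision also determines whether a result has to be kept as an intermediate state. The sketch below is a hedged illustration: the name `route_result` is hypothetical, `flag == 1` is assumed to mark the tail package, and the feedback decision is inferred from the role of channel 4 as the store for intermediate states.

```python
SERIAL_MODES = {"CBC", "CFB", "OFB"}
PARALLEL_MODES = {"ECB", "CTR"}


def route_result(mode: str, flag: int):
    """Return (channel, feed_back) for a result leaving the operation module."""
    if mode in SERIAL_MODES and flag != 1:
        return 4, True     # intermediate state needed by the next dependent package
    if mode in SERIAL_MODES or mode in PARALLEL_MODES:
        return 3, False    # independent result, or tail package of a serial stream
    raise ValueError(f"unknown working mode: {mode}")
```

Only results routed to channel 4 are fed back to the preprocessing module in this model; everything else leaves through channel 3.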

Figure 5: Dual-channel pipeline parallel data processing model (DPP).

4.2.3. Parallel Scheduling. The process of parallel scheduling is as follows:

Step 1. Determine the algorithm operation unit according to crypto.

Step 2. Select the input channel of the preprocessing module of the algorithm operation unit according to mode and No.

This scheduling method realizes fast transfer of incoming data streams and continuous processing of job packages. The use of dual channels reduces the interaction between modules and hides the processing time of independent job packages in the processing time of dependent job packages, facilitating the parallel execution of job packages.

4.2.4. Data Processing Steps

Step 1. The algorithm application process splits the data to be processed, adds attribute information, and encapsulates it as job packages.

Step 2. The algorithm operation unit is determined according to the crypto field in the job package, and the input channel of its preprocessing module is selected according to mode and No. The job package is sent to the corresponding input channel.

Step 3. The preprocessing module obtains the input data of the algorithm operation module, namely, data, KEY, and IV, according to the mode and No fields of the package.

Step 4. The algorithm operation module performs pipeline processing on the received data and sends the result to the receiving channel of the result output module according to mode and flag.

Step 5. The result output module outputs the received job package to the result receiving process and determines whether to feed the job package back to the preprocessing module according to the mode and flag of the job package.

Step 6. The result receiving process recombines the received job packages based on ID.
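The recombination in Step 6 can be sketched as follows. This is a minimal illustration, assuming that results arrive as `(service_id, no, flag, data)` tuples in arbitrary order, that serial numbers start at 1, and that `flag == 1` marks the tail package; the function name `reorganize` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def reorganize(results: Iterable[Tuple[int, int, int, bytes]]) -> Dict[int, bytes]:
    """Group results by service ID and join them in serial-number order.

    Only streams whose tail package and all earlier packages have arrived
    are returned.
    """
    by_id = defaultdict(dict)
    tail_no = {}
    for service_id, no, flag, data in results:
        by_id[service_id][no] = data
        if flag == 1:
            tail_no[service_id] = no

    streams = {}
    for service_id, last in tail_no.items():
        parts = by_id[service_id]
        if all(n in parts for n in range(1, last + 1)):
            streams[service_id] = b"".join(parts[n] for n in range(1, last + 1))
    return streams
```

Because every package carries its own ID and No, the result receiving process does not need the packages of a stream to arrive in order or contiguously.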

4.3. Parallel Execution Time. Assume that P0 is the job package encapsulation and reorganization node, and Pa, Pb, and Pc are algorithm operation nodes. Tpar is the package encapsulation time, and Tz is the data reorganization time. g is the communication interval, that is, the minimum time interval at which node P0 continuously transmits and receives job packages. The reciprocal of g corresponds to the communication bandwidth. L is the maximum communication delay, which is the time taken to transmit a job package from node P0 to the scheduling node. Ts indicates the parallel scheduling time of job packages, Tr indicates the job package preprocessing time, Tc indicates the job package operation time, and Tf indicates the job package output time. w represents the calculated load, which is the set of job packages. m is the pipeline depth of the algorithm operation module. The message delivery process of the job package on DPP is shown in Figure 9.

Figure 6: Parallel scheduling and dual receiving channel of preprocessing module.

Figure 7: Dual-channel selection control of preprocessing module.

Figure 8: Dual-channel control of output module.

The continuous sending and receiving of messages needs to meet the conditions

$$L + T_s + T_r + \frac{T_c}{m} \le g + L + T_s + T_r, \qquad g \ge \frac{T_c}{m}. \tag{2}$$

Conclusion 1. The system communication bandwidth can be improved in two ways: increasing the operation speed of the algorithm operation module and increasing its pipeline depth.

If w job packages come from α data streams, consider two scenarios:

(1) Each data stream adopts a parallel mode, that is, the job packages are all independent. The load processing time is as follows:

$$T = T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L. \tag{3}$$

(2) Each data stream adopts a serial mode, so that job packages of the same service data stream are mutually dependent and job packages of different service data streams are mutually independent. Assume that the maximum number of job packages of a single service is w′. In the extreme case, the first job package of this data flow appears after the other service data flows, and then the operation time of the other service data flows is hidden within that of the longest service data flow. The load processing time is as follows:

$$T = T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{4}$$

For a data stream that mixes the serial and parallel modes, the pipeline design of the algorithm operation module allows independent job packages to be executed in parallel while dependent job packages are being processed, so the execution time of independent job packages is hidden in the execution time of the dependent job packages. Therefore, the execution time T of the multitask mixed-mode data stream is bounded as follows:

$$T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L \le T \le T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{5}$$

Conclusion 2. The execution time of mixed cross-data streams is bounded by the execution time of the data stream with the most job packages. On the premise of constant pipeline depth, improving the processing performance of each module in the pipeline is the key to improving the processing method.
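The timing expressions in equations (2) to (4) are easy to evaluate numerically. The Python sketch below transcribes them directly; the function names and the sample values in the usage note are hypothetical, and all times are taken to be in the same arbitrary unit.

```python
def min_interval(tc: float, m: int) -> float:
    """Smallest communication interval g allowed by equation (2): g >= Tc / m."""
    return tc / m


def t_all_independent(n, g, tpar, ts, tr, tc, tf, tz, l):
    """Equation (3): every job package is independent."""
    return tpar + (n - 1) * g + ts + tr + tc + tf + tz + 2 * l


def t_all_dependent(w, w_max, g, tpar, ts, tr, tc, tf, tz, l):
    """Equation (4): every stream is serial; w_max plays the role of w'."""
    return tpar + (w - w_max - 1) * g + w_max * (ts + tr + tc + tf) + tz + 2 * l
```

As an illustration with made-up values Tpar = 2, g = 1, Ts = Tr = Tf = 1, Tc = 8, Tz = 2, and L = 3, ten independent packages take 30 time units by equation (3), while ten packages whose longest serial stream has four packages take 59 time units by equation (4). Treating n and w as the same count here is an assumption of the example. A mixed-mode load falls between these two bounds, as equation (5) states.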

5. Implementation and Testing

5.1. Hardware Implementation. We prototyped the model to verify its validity. The architecture is shown in Figure 10. The cipher server management system completes the reception of multiuser cryptographic service data streams, the encapsulation of job packages, and the data reorganization of the operation results. The cryptographic algorithm operation is performed by a cipher machine acting as a coprocessor. The cipher machine is designed using a Xilinx XC7K325t FPGA, which includes the parallel scheduling module and the SM3 and SM4 cryptographic algorithm cores.

The hardware implementation block diagram of the cipher machine is shown in Figure 11. The cipher machine adopts the PCIe interface, receives via DMA the job packages split by the algorithm application process of the cipher server management system, and stores them in DOWN_FIFO in the downlink data storage area.

Parallel scheduling module PSCHEDULE. Determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No fields, and realizes the transfer of the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2 in the model.

Preprocessing module IP_CTRL. Acquires the algorithm core input data, including the IV, the KEY, and the result of the preceding dependent job package, and calculates the input data to the cryptographic cores.

Operation modules SM4 and SM3. The cryptographic cores perform algorithm operations on the input data in a pipelined way and send the result of the operation to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Result output module UP_CTRL. If IP_CTRL gives the ID number, the data with the same ID number are extracted from RAM, and the output result result′ is calculated, fed back to IP_CTRL, and output to the output FIFOo at the same time. If IP_CTRL outputs no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are fed back to the result reception process of the cipher server management system through the interface module in DMA mode.

Figure 9: Message delivery on the DPP model.

Figure 10: Multiuser cryptographic service.

5.2. Test. The test environment is as follows. The main frequency of the heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325t FPGA is 160 MHz, and the interface with the upper application is PCIe 2.0 × 8.

Test 1. SM4 CBC encryption of a 4000 MB file. The test operation takes 114.390935 s to finish, so the data stream processing rate is 4000 × 8/114.390935 = 279.742446 Mbps.

Test 2. Tests 2.1 to 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. These files use different IVs and KEYs in different working modes. The end time of the processing of each file is shown in Table 2. For the data set, the maximum end time is the total time taken. The data flow processing rate is derived from the following formula:

$$\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}. \tag{6}$$

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time per file (s) | Processing rate |
|---|---|---|---|---|
| 2.1 | Files 1 to 4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8/13.5175 s = 1.89 Gbps |
| 2.1 | Files 5 to 8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| 2.2 | Files 1 to 4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8/12.5052 s = 2.05 Gbps |
| 2.2 | Files 5 to 8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| 2.3 | Files 1 to 4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8/4.9895 s = 5.13 Gbps |
| 2.3 | Files 5 to 8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| 2.4 | Files 1 to 4 | SM3, SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8/13.1909 s = 1.94 Gbps |
| 2.4 | Files 5 to 8 | SM4 CBC, SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |

Figure 11: Hardware architecture of the cipher machine.
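Equation (6) can be checked against the reported figures with a few lines of Python. The sketch assumes that file sizes are given in megabytes, that one byte is 8 bits, and that 1 Gbps equals 1000 Mbps, which is what the reported rates imply; the names `rate_mbps` and `table2_total_time` are illustrative.

```python
def rate_mbps(size_mb: float, seconds: float) -> float:
    """Equation (6) with the data size given in megabytes (8 bits per byte)."""
    return size_mb * 8 / seconds


# Total time of each test in Table 2 (the maximum end time over the eight files).
table2_total_time = {"2.1": 13.5175, "2.2": 12.5052, "2.3": 4.9895, "2.4": 13.1909}

# Eight 400 MB files give 3200 MB per test; convert Mbps to Gbps.
rates_gbps = {test: rate_mbps(3200, t) / 1000 for test, t in table2_total_time.items()}
```

Rounded to two decimals, these values agree with the processing rates listed in Table 2, and `rate_mbps(4000, 114.390935)` agrees with the rate reported for Test 1.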

Analysis. Because Test 1 has only one file, in CBC mode, its job packages are interrelated and are all executed serially. In Test 2.1, although the packages of each file are interrelated, the files are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, 4 files are in the ECB working mode, and the operation time of their independent job packages can be hidden within the operation time of the dependent packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3 is the highest. Test 2.4 has 2 files with independent job packages and 6 files with dependent packages, but they are allocated across two algorithm units, so its operation rate is close to that of Test 2.1.

Test 3. Comparison of the processing rates of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to j files. If $N_i$ represents the number of job packages in file i, then $\sum_{i=1}^{j} N_i = 10000$. For j = 10, 20, 30, and 40, the ECB or CBC encryption mode is adopted. We change the number of files in CBC encryption mode and compare the completion time of the data flow in the single-channel architecture and the dual-channel architecture. The data flow processing time is averaged over several runs, and the comparison result is shown in Figure 12. Single 0% means that all files use ECB mode and the system adopts the single-channel architecture. Dual 50% indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.

As can be seen from Figure 12, when the data flow is an independent data flow, the algorithm operation unit adopts the pipeline design, so the processing rate under the dual channel is close to that under the single channel. As the number of associated job packages in the data flow increases, the advantage of the dual channel in data processing rate gradually emerges, and as the number of files in the data stream increases, this advantage becomes more obvious.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model, DPP, to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing between independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments prove that the system under this model can realize correct and rapid processing of multiservice and personalized cross-data streams. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

Figure 12: Processing rate comparison of dual channel and single channel (multiple files, SM4 ECB and CBC cross encryption). Horizontal axis: file number (10, 20, 30, 40); vertical axis: processing rate (Gbps); series: Single 0%, Single 50%, Single 100%, Dual 0%, Dual 50%, Dual 100%.

References

[1] Y. Song and Z. Li, “Applying array contraction to a sequence of DOALL loops,” in Proceedings of the International Conference on Parallel Processing (ICPP’04), vol. 1, pp. 46–53, Montreal, Canada, August 2004.

[2] G. Elsesser, V. Ngo, S. Bhattacharya, and W. T. Tsai, “Load balancing of DOALL loops in the Perfect Club,” in Proceedings of the 1993 Proceedings Seventh International Parallel Processing Symposium, pp. 129–133, Newport, CA, USA, April 1993.

[3] D.-K. Chen and P.-C. Yew, “Statement re-ordering for DOACROSS loops,” in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 24–28, Raleigh, NC, USA, August 1994.

[4] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, “Automatic thread extraction with decoupled software pipelining,” in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’05), pp. 105–118, Barcelona, Spain, November 2005.

[5] V. Krishnan and J. Torrellas, “A chip-multiprocessor architecture with speculative multithreading,” IEEE Transactions on Computers, vol. 48, no. 9, pp. 866–880, 1999.

[6] A. S. Rajam, L. E. Campostrini, J. M. M. Caamaño, and P. Clauss, “Speculative runtime parallelization of loop nests: towards greater scope and efficiency,” in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 245–254, Hyderabad, India, May 2015.

[7] S. Aldea, A. Estebanez, D. R. Llanos, and A. Gonzalez-Escribano, “An OpenMP extension that supports thread-level speculation,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 78–91, 2016.

[8] J. Salamanca, J. N. Amaral, and G. Araujo, “Evaluating and improving thread-level speculation in hardware transactional memories,” in Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 586–595, Chicago, IL, USA, May 2016.

[9] Z. Ying and B. Qinghai, “The scheme for improving the efficiency of block cipher algorithm,” in Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), pp. 824–826, Ottawa, ON, Canada, September 2014.

[10] P. Kitsos and A. N. Skodras, “An FPGA implementation and performance evaluation of the seed block cipher,” in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP), pp. 1–5, Corfu, Greece, July 2011.

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, “ELmD: a pipelineable authenticated encryption and its hardware implementation,” IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, “AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique,” in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, “Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA,” in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, “An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography,” in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, “A novel high performance ECC processor architecture with two staged multiplier,” in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, “Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller,” in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, “High speed implementation of symmetric block cipher on GPU,” in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, “Implementation and evaluation of different parallel designs of AES using CUDA,” in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, “Accelerating NTRU based homomorphic encryption using GPUs,” in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, “An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study,” in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., “Heterogeneous multicore SoC with SiP for secure multimedia applications,” IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, “A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms,” in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, “Evaluating time and throughput at different modes of operation in AES algorithm,” in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, “Summary of topological study of chaotic cbc mode of operation,” in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, “A high speed implementation counter mode cryptography using hardware parallelism,” in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, “Analysis of the statistical cipher feedback mode of block ciphers,” IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, “A study on encryption algorithms and modes for disk encryption,” in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, “Research on task parallel programming model,” Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.


In the DPP model data processing is performed in unitsof job packages No explicit synchronization process is re-quired between job packages Synchronization is implicit inthe algorithm preprocessing Each job package is completedby a fixed algorithm operation unit and there is no datainteraction between the algorithm operation units in the jobpackage processing

41 Encapsulation ampe encapsulation is completed by themaster node ampe format of the encapsulated job package isas follows

P IDcryptokeyIVmodeNoflagl

ID is the service number set for multiuser and is used todistinguish different data streams Crypto represents thespecific demand for cryptographic algorithms and encryptiondecryption operations key is the KEY IV is the initial vectorand mode is the working mode which can be used todistinguish the dependency property of the job packageNo is the job package serial number which is used toreassemble the data stream after the algorithm operationFlag is the tail package identifier which indicates whetherthe service data flow is end l is the length of the job packageand is consistent with the size of the unit block of thecryptographic algorithm

The flow data received by the system is described as $Task = \{P_{ij}\}$, where $i$ corresponds to the service number and $j$ corresponds to the sequence number of the job package within that service. We refer to job packages in parallel mode as independent job packages and to job packages in serial mode as dependent job packages.
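The package format above can be sketched as a small record type. This is only an illustration, not the authors' implementation: the field names mirror the fields of P, while the `data` payload field, the string encoding of `mode`, the `"SM4-ENC"` label, and the helper `is_dependent` are assumptions introduced here.

```python
from dataclasses import dataclass

# Working modes as plain strings (an assumed encoding for this sketch).
PARALLEL_MODES = {"ECB", "CTR"}
SERIAL_MODES = {"CBC", "CFB", "OFB"}


@dataclass(frozen=True)
class JobPackage:
    id: int            # service number, distinguishes data streams
    crypto: str        # algorithm and operation (hypothetical label format)
    key: bytes
    iv: bytes
    mode: str          # one of PARALLEL_MODES or SERIAL_MODES
    no: int            # serial number within the stream, starting at 1
    flag: int          # 1 marks the tail package of the stream
    l: int             # package length in bytes (one cipher block)
    data: bytes = b""  # payload (assumed field, not listed in P)

    def is_dependent(self) -> bool:
        """Serial-mode packages depend on the previous package of their stream."""
        return self.mode in SERIAL_MODES


p = JobPackage(id=1, crypto="SM4-ENC", key=bytes(16), iv=bytes(16),
               mode="CBC", no=2, flag=0, l=16, data=bytes(16))
print(p.is_dependent())  # True
```

Keeping the dependency test on the package itself means later routing code only needs `mode`, `no`, and `flag`, which matches how the model uses these fields.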

4.2. Dual Channel and Parallel Scheduling. According to the thread splitting under the same algorithm requirements, the difference in package processing between working modes is embodied in the preprocessing module and the result output module. The operation module does not consider the correlation among job packages and only processes the job packages it receives. For this reason, it is necessary to distinguish the input data of the preprocessing module and of the result output module. We adopt dual channels to receive job packages and to separate independent job packages from dependent job packages.

4.2.1. Dual Receiving Channel of the Preprocessing Module. Channel 1 is used for the transfer of independent job packages. Because the first job package of a serial-mode data stream has no predecessor in its own stream and is not associated with job packages of other data streams, it is also transmitted in channel 1. The job package transmitted in channel 1 therefore satisfies the condition

$$(\mathrm{mode} = \mathrm{ECB} \mid \mathrm{CTR}) \ \text{or} \ (\mathrm{mode} = \mathrm{CBC} \mid \mathrm{CFB} \mid \mathrm{OFB} \ \text{and} \ \mathrm{No} = 1).$$

Channel 2 is used for the transfer of dependent job packages. The job package transmitted in channel 2 satisfies the condition

$$(\mathrm{mode} = \mathrm{CBC} \mid \mathrm{CFB} \mid \mathrm{OFB}) \ \text{and} \ \mathrm{No} \neq 1.$$

Suppose that the working modes of four services that request the cry_IPab operation are as follows: service 1.mode = service 2.mode = CBC and service 3.mode = service 4.mode = ECB. The job packages on channel 1 and channel 2 after parallel scheduling are shown in Figure 6.
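The two channel conditions can be written as a small routing function. The sketch below assumes that services 1 and 2 use CBC and services 3 and 4 use ECB, and that packages arrive in the order listed in `arrivals`. The function name `input_channel` and the arrival order are assumptions chosen for illustration.

```python
PARALLEL_MODES = {"ECB", "CTR"}
SERIAL_MODES = {"CBC", "CFB", "OFB"}


def input_channel(mode: str, no: int) -> int:
    """Return 1 or 2 according to the preprocessing-module channel conditions."""
    if mode in PARALLEL_MODES or (mode in SERIAL_MODES and no == 1):
        return 1
    if mode in SERIAL_MODES:  # serial mode and No != 1
        return 2
    raise ValueError(f"unknown mode: {mode}")


# Services 1 and 2 use CBC, services 3 and 4 use ECB.
service_mode = {1: "CBC", 2: "CBC", 3: "ECB", 4: "ECB"}
# (service i, serial number j) pairs in an assumed arrival order.
arrivals = [(1, 1), (3, 1), (3, 2), (4, 1), (2, 1), (2, 2), (1, 2)]

channels = {1: [], 2: []}
for i, j in arrivals:
    channels[input_channel(service_mode[i], j)].append(f"P{i}{j}")

print(channels[1])  # ['P11', 'P31', 'P32', 'P41', 'P21']
print(channels[2])  # ['P22', 'P12']
```

Only the second and later packages of the CBC streams end up on channel 2; every other package, including the first package of each CBC stream, travels on channel 1.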

The selection of the channel is determined by the control signal $S_i$; the default selection is channel 1, that is, $S_i = 0$. The counter cnt records the execution time of the dependent job package in the algorithm operation unit. When cnt senses that the preprocessing module inputs a dependent job package, counting starts. Assume that the algorithm operation unit needs $m$ clock cycles to complete the operation on a dependent job package. When cnt = $m$, the counter is cleared (cnt = 0) and $S_i$ is set to 1 to select channel 2 and input the next dependent job package. In all other cases, $S_i = 0$ and channel 1 is selected. The state flow diagram is shown in Figure 7.
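The counter-driven selection can be simulated with a three-state machine. This is a rough sketch under stated assumptions, not the authors' hardware logic: the state names S0, S1, and S2 follow the labels visible in Figure 7, but the exact cycle on which counting begins, the return to S0 when no serial-mode package follows, and the fact that inputs are ignored while counting are all assumptions of this model.

```python
SERIAL_MODES = {"CBC", "CFB", "OFB"}


def select_signal(inputs, m):
    """Return the value of S_i for each clock cycle.

    inputs[t] is the mode of the package entering the preprocessing
    module in cycle t, or None when no package enters.
    """
    state, cnt, signal = "S0", 0, []
    for mode in inputs:
        serial = mode in SERIAL_MODES
        if state == "S0":        # idle: cnt = 0, S_i = 0
            cnt, s_i = 0, 0
            nxt = "S1" if serial else "S0"
        elif state == "S1":      # counting: cnt = cnt + 1, S_i = 0
            cnt, s_i = cnt + 1, 0
            nxt = "S2" if cnt == m else "S1"
        else:                    # S2: cnt = 0, S_i = 1 (channel 2 selected)
            cnt, s_i = 0, 1
            nxt = "S1" if serial else "S0"
        signal.append(s_i)
        state = nxt
    return signal


modes = ["CBC", "ECB", "ECB", "ECB", "CBC", "ECB", "ECB", "ECB", "CBC"]
print(select_signal(modes, m=3))  # [0, 0, 0, 0, 1, 0, 0, 0, 1]
```

With m = 3, channel 2 is opened once every four cycles in this model, and channel 1 fills the remaining cycles with independent packages.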

4.2.2. Dual Receiving Channel of the Result Output Module. Channel 3 is used for the transfer of independent job packages. The job package transmitted in channel 3 satisfies the condition

$$(\mathrm{mode} = \mathrm{ECB} \mid \mathrm{CTR}) \ \text{or} \ (\mathrm{mode} = \mathrm{CBC} \mid \mathrm{CFB} \mid \mathrm{OFB} \ \text{and} \ \mathrm{flag} = 1).$$

Channel 4 is used for the transfer of dependent job packages. The job package transmitted in channel 4 satisfies the condition

$$(\mathrm{mode} = \mathrm{CBC} \mid \mathrm{CFB} \mid \mathrm{OFB}) \ \text{and} \ \mathrm{flag} \neq 1.$$

Channel 4 provides support for storing intermediate states in serial mode. Since the result of the tail package does not need to be used as an intermediate state, the tail package's result in serial mode is also output through channel 3.

The choice of channel is determined by the control signal $S_o$, and channel 3 is the default, that is, $S_o = 0$. The channel selection control signal is the same as that of the preprocessing module: when cnt = $m$, $S_o$ is set to 1 and channel 4 is selected; in all other cases, $S_o = 0$ and channel 3 is selected. When $S_o = 1$, the job package transmitted by channel 4 has the same ID as the job package received by the preprocessing module, as shown in Figure 8.
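The output-side routing mirrors the input side, with `flag` taking the place of `No`. The following sketch assumes that `flag` equals 1 for the tail package and 0 otherwise; the function name `output_channel` is hypothetical.

```python
PARALLEL_MODES = {"ECB", "CTR"}
SERIAL_MODES = {"CBC", "CFB", "OFB"}


def output_channel(mode: str, flag: int) -> int:
    """Return 3 or 4 according to the result-output-module channel conditions."""
    if mode in PARALLEL_MODES or (mode in SERIAL_MODES and flag == 1):
        return 3  # final results: independent packages and serial tail packages
    if mode in SERIAL_MODES:
        return 4  # intermediate state needed by the next package of the stream
    raise ValueError(f"unknown mode: {mode}")


# A three-package CBC stream: only the tail package leaves through channel 3.
print([output_channel("CBC", flag) for flag in (0, 0, 1)])  # [4, 4, 3]
# An ECB stream always uses channel 3.
print([output_channel("ECB", flag) for flag in (0, 0, 1)])  # [3, 3, 3]
```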

Figure 5: Dual-channel pipeline parallel data processing model (DPP).

4.2.3. Parallel Scheduling. The process of parallel scheduling is as follows:

Step 1. Determine the algorithm operation unit according to crypto.

Step 2. Select the input channel of the preprocessing module of the algorithm operation unit according to mode and No.

This scheduling method realizes fast transfer of incoming data streams and continuous processing of job packages. The use of dual channels reduces the interaction between modules and hides the processing time of independent job packages in the processing time of dependent job packages, which facilitates the parallel execution of job packages.

4.2.4. Data Processing Steps

Step 1. The algorithm application process splits the data to be processed, adds attribute information, and encapsulates it as job packages.

Step 2. The algorithm operation unit is determined according to the crypto field in the job package, and the input channel of its preprocessing module is selected according to mode and No. The job package is sent to the corresponding input channel.

Step 3. The preprocessing module obtains the input data of the algorithm operation module, namely, data, KEY, and IV, according to the mode and No fields of the package.

Step 4. The algorithm operation module performs pipeline processing on the received data and sends the result to the receiving channel of the result output module according to mode and flag.

Step 5. The result output module outputs the received job package to the result receiving process and determines whether to feed the job package back to the preprocessing module according to the mode and flag of the job package.

Step 6. The result receiving process recombines the received job packages based on ID.
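The six steps can be traced end to end in software with a toy cipher. The sketch below is a functional illustration only: `toy_block_encrypt` is an XOR stand-in and is not SM4, packages are plain dictionaries, input lengths are assumed to be multiples of the block size, and the loop processes packages sequentially, so the pipeline parallelism of the hardware is not modeled.

```python
from collections import defaultdict

BLOCK = 16  # block size in bytes (128-bit blocks, as in SM4)


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Stand-in for a real cipher core: XOR with the key. NOT secure."""
    return bytes(b ^ k for b, k in zip(block, key))


def encapsulate(sid, data, key, iv, mode):
    """Step 1: split a stream into job packages carrying the fields of P."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [dict(ID=sid, key=key, IV=iv, mode=mode, No=n + 1,
                 flag=int(n == len(blocks) - 1), data=blk)
            for n, blk in enumerate(blocks)]


def process(packages):
    """Steps 2 to 5: preprocess, operate on, and output each package."""
    state = {}    # stand-in for channel 4: last ciphertext per stream ID
    results = []
    for p in packages:
        if p["mode"] == "ECB":                  # independent package
            x = p["data"]
        else:                                   # CBC: dependent package
            prev = p["IV"] if p["No"] == 1 else state[p["ID"]]
            x = bytes(a ^ b for a, b in zip(p["data"], prev))
        c = toy_block_encrypt(x, p["key"])
        if p["mode"] == "CBC" and p["flag"] != 1:
            state[p["ID"]] = c                  # fed back for the next package
        results.append((p["ID"], p["No"], c))
    return results


def reassemble(results):
    """Step 6: regroup results by ID and order them by No."""
    streams = defaultdict(list)
    for sid, no, c in results:
        streams[sid].append((no, c))
    return {sid: b"".join(c for _, c in sorted(parts))
            for sid, parts in streams.items()}


key1, key2 = bytes(range(16)), bytes(range(16, 32))
iv = bytes([0xAA]) * 16
a = encapsulate(1, b"A" * 48, key1, iv, "CBC")
b = encapsulate(2, b"B" * 32, key2, iv, "ECB")
# Interleave the two streams as they might arrive at the scheduler.
mixed = [a[0], b[0], a[1], b[1], a[2]]
out = reassemble(process(mixed))
print(len(out[1]), len(out[2]))  # 48 32
```

The per-stream `state` dictionary plays the role of the feedback path: a CBC package reads the previous ciphertext of its own stream, while ECB packages never touch it and can be slotted in between.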

4.3. Parallel Execution Time. Assume that $P_0$ is the job package encapsulation and reorganization node, and $P_a$, $P_b$, and $P_c$ are algorithm operation nodes. $T_{par}$ is the package encapsulation time, and $T_z$ is the data reorganization time. $g$ is the communication interval, that is, the minimum time interval at which node $P_0$ continuously transmits and receives job packages. The reciprocal of $g$ corresponds to the communication bandwidth. $L$ is the maximum communication delay, which is the time taken to transmit a job package from node $P_0$ to the scheduling node. $T_s$ indicates the parallel scheduling time of job packages, $T_r$ indicates the job package preprocessing time, $T_c$ indicates the job package operation time, and $T_f$ indicates the job package output time. $w$ represents the calculated load, which is the set of job packages. $m$ is the pipeline depth of the algorithm operation module. The message delivery process of the job package on DPP is shown in Figure 9.

Figure 6: Parallel scheduling and dual receiving channel of preprocessing module.

Figure 7: Dual-channel selection control of preprocessing module.

Figure 8: Dual-channel control of output module.

The continuous sending and receiving of messages must satisfy the condition

$$\begin{aligned} L + T_s + T_r + \frac{T_c}{m} &\le g + L + T_s + T_r, \\ g &\ge \frac{T_c}{m}. \end{aligned} \tag{2}$$

Conclusion 1. The system communication bandwidth can be improved in two ways: increasing the operation speed of the algorithm operation module and increasing its pipeline depth.
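The bound g ≥ T_c/m can be tabulated to show how the two levers in Conclusion 1 act. The operation time of 200 ns used below is a purely hypothetical value chosen for illustration, and the helper names are assumptions.

```python
def min_interval(t_c: float, m: int) -> float:
    """Smallest communication interval allowed by the bound g >= T_c / m."""
    return t_c / m


def max_package_rate(t_c: float, m: int) -> float:
    """Upper limit on job packages per second, the reciprocal of the minimum g."""
    return 1.0 / min_interval(t_c, m)


t_c = 200e-9  # hypothetical operation time of 200 ns per job package
for m in (1, 8, 32):
    print(f"m={m:2d}: g >= {min_interval(t_c, m) * 1e9:.2f} ns, "
          f"at most {max_package_rate(t_c, m) / 1e6:.1f} M packages/s")
```

Halving T_c or doubling m has the same effect on the bound, which is why both appear as options in Conclusion 1.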

If $w$ job packages come from $\alpha$ data streams, consider two scenarios:

(1) Each data stream adopts a parallel mode, that is, the job packages are all independent. The load processing time is as follows:

$$T = T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L. \tag{3}$$

(2) Each data stream adopts a serial mode, so that job packages of the same service data stream are mutually dependent and job packages of different service data streams are mutually independent. Assume that the maximum number of job packages in a single service is $w'$. In the extreme case, the first job package of this data stream appears after the other service data streams, and the operation time of the other service data streams is then hidden within that of the longest service data stream. The load processing time is as follows:

$$T = T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{4}$$

For a data stream that mixes serial and parallel modes, the pipeline design of the algorithm operation module allows independent job packages to be executed in parallel while dependent job packages are being processed, so the execution time of the independent job packages is hidden in the execution time of the dependent job packages. Therefore, the execution time $T$ of the multitask mixed-mode data stream is bounded as follows:

$$T_{par} + (n - 1)g + T_s + T_r + T_c + T_f + T_z + 2L \le T \le T_{par} + (w - w' - 1)g + w'(T_s + T_r + T_c + T_f) + T_z + 2L. \tag{5}$$

Conclusion 2. The execution time of mixed cross-data streams is limited by the execution time of the data stream with the most job packages. With the pipeline depth held constant, improving the processing performance of each module in the pipeline is the key to speeding up processing.
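The two bounds in equation (5) can be evaluated numerically. In the sketch below, all timing values are hypothetical clock-cycle counts, and taking n in equation (3) equal to the total number of job packages w is an assumption of this example rather than something the text states.

```python
def t_all_independent(n, g, t_par, t_s, t_r, t_c, t_f, t_z, L):
    """Equation (3): every job package is independent."""
    return t_par + (n - 1) * g + t_s + t_r + t_c + t_f + t_z + 2 * L


def t_all_dependent(w, w_max, g, t_par, t_s, t_r, t_c, t_f, t_z, L):
    """Equation (4): every stream is serial; w_max is w', the longest stream."""
    return (t_par + (w - w_max - 1) * g
            + w_max * (t_s + t_r + t_c + t_f) + t_z + 2 * L)


# Hypothetical timings in clock cycles.
params = dict(g=1, t_par=10, t_s=2, t_r=2, t_c=32, t_f=2, t_z=10, L=5)
lower = t_all_independent(n=1000, **params)
upper = t_all_dependent(w=1000, w_max=400, **params)
print(lower, upper)  # 1067 15829
```

The upper bound is dominated by the term w'(T_s + T_r + T_c + T_f), which is the quantity Conclusion 2 points to: the longest serial stream sets the execution time.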

5. Implementation and Testing

5.1. Hardware Implementation. We prototyped the model to verify its validity. The architecture is shown in Figure 10. The cipher server management system receives the multiuser cryptographic service data streams, encapsulates the job packages, and reorganizes the data of the operation results. The cryptographic algorithm operation is performed by a cipher machine acting as a coprocessor. The cipher machine is built on a Xilinx XC7K325T FPGA, which includes the parallel scheduling module and the SM3 and SM4 cryptographic algorithm cores.

The hardware implementation block diagram of the cipher machine is shown in Figure 11. The cipher machine adopts the PCIe interface, receives by DMA the job packages split by the algorithm application process of the cipher server management system, and stores them in DOWN_FIFO in the downlink data storage area.

Parallel scheduling module PSCHEDULE: determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No fields, and transfers the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2.

Preprocessing module IP_CTRL: acquires the algorithm core input data, including IV, KEY, and the result of the preceding dependent job package, and computes the input data for the cryptographic cores.

Operation modules SM4 and SM3: the cryptographic cores perform algorithm operations on the input data in a pipelined way and send the result of the operation to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Figure 9: Message delivery on the DPP model.

Figure 10: Multiuser cryptographic service.

Result output module UP_CTRL: if IP_CTRL gives an ID number, the data with the same ID number are extracted from RAM, the output result result′ is calculated, and it is fed back to IP_CTRL and written to the output FIFOo at the same time. If IP_CTRL gives no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are fed back to the result reception process of the cipher server management system through the interface module in DMA mode.

5.2. Test. The test environment is as follows: the main frequency of the heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325T FPGA is 160 MHz, and the interface with the upper application is PCIe 2.0 × 8.

Test 1. SM4 CBC encryption of a 4000 MB file. The test operation takes 114.390935 s, so the data stream processing rate is 4000 × 8 / 114.390935 = 279.742446 Mbps.

Test 2. Tests 2.1 to 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. These files use different IVs and KEYs in different working modes. The end time of processing for each file is shown in Table 2. For each data set, the maximum end time is taken as the total time. The data flow processing rate is derived from the following formula:

$$\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}. \tag{6}$$

Figure 11: Hardware architecture of the cipher machine.

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time (s), in file order | Processing rate |
|---|---|---|---|---|
| Test 2.1 | 1 to 4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8 / 13.5175 s = 1.89 Gbps |
| Test 2.1 | 5 to 8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| Test 2.2 | 1 to 4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8 / 12.5052 s = 2.05 Gbps |
| Test 2.2 | 5 to 8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| Test 2.3 | 1 to 4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8 / 4.9895 s = 5.13 Gbps |
| Test 2.3 | 5 to 8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| Test 2.4 | 1 to 4 | SM3; SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8 / 13.1909 s = 1.94 Gbps |
| Test 2.4 | 5 to 8 | SM4 CBC; SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |

Analysis. Because Test 1 has only one file in CBC mode, its job packages are interrelated and are all executed serially. Although the packages of each file are interrelated, the files of Test 2.1 are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, four files are in ECB mode, and the operation time of the independent job packages can be hidden within the operation time of the dependent packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3, in which all eight files are in ECB mode, is the highest. Test 2.4 has two files with independent job packages and six files with dependent packages, but they are allocated to two algorithm units, so the processing rate is close to that of Test 2.1.
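The reported rates can be re-derived from the rate formula, using the total data size of each test and its maximum end time with the decimal points restored. The short check below assumes that 1 MB corresponds to 8 Mbit, as the paper's own arithmetic does; the function name `rate_mbps` is hypothetical.

```python
def rate_mbps(size_mb: float, seconds: float) -> float:
    """Processing rate in Mbps for size_mb megabytes handled in the given time."""
    return size_mb * 8 / seconds


# (total size in MB, maximum end time in s) for each test.
tests = {
    "Test 1": (4000, 114.390935),
    "Test 2.1": (3200, 13.5175),
    "Test 2.2": (3200, 12.5052),
    "Test 2.3": (3200, 4.9895),
    "Test 2.4": (3200, 13.1909),
}

for name, (size, t) in tests.items():
    print(f"{name}: {rate_mbps(size, t) / 1000:.2f} Gbps")
```

Test 2.1 comes out close to 6.8 times faster than Test 1 even though both use CBC only, which is the gain from interleaving eight independent streams.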

Test 3. Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to $j$ files. If $N_i$ represents the number of job packages in file $i$, then $\sum_{i=1}^{j} N_i = 10000$. For $j$ = 10, 20, 30, and 40, each file is encrypted in ECB or CBC mode. We change the number of files in CBC mode and compare the completion time of the data flow in the single-channel architecture and the dual-channel architecture. Each test is run several times and the average data flow processing time is taken; the comparison result is shown in Figure 12. Single 0% means that all files use ECB mode and the system adopts the single-channel architecture. Dual 50% indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.

As can be seen from Figure 12, when the data flow consists only of independent job packages, the processing rate under the dual channel is close to that under the single channel, because the algorithm operation unit adopts the pipeline design. As the share of associated job packages in the data flow increases, the advantage of the dual channel in data processing rate gradually appears, and it becomes more obvious as the number of files in the data stream increases.
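A workload of the kind used in Test 3 can be generated as follows. This is an illustrative sketch, not the authors' test harness: the uniform random assignment, the fixed seed, and the function name `make_workload` are assumptions.

```python
import random


def make_workload(total: int, j: int, cbc_fraction: float, seed: int = 0):
    """Assign `total` job packages at random to j files and pick their modes.

    Returns a list of (mode, N_i) pairs, one per file.
    """
    rng = random.Random(seed)
    counts = [0] * j
    for _ in range(total):
        counts[rng.randrange(j)] += 1        # each package lands in a random file
    n_cbc = round(j * cbc_fraction)          # number of files that use CBC
    modes = ["CBC"] * n_cbc + ["ECB"] * (j - n_cbc)
    return list(zip(modes, counts))


workload = make_workload(total=10000, j=10, cbc_fraction=0.5)
print(sum(n for _, n in workload))               # 10000
print(sum(1 for mode, _ in workload if mode == "CBC"))  # 5
```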

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model (DPP) to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing among independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can process multiservice, personalized cross-data streams correctly and rapidly. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

Figure 12: Processing rate comparison of dual channel and single channel (multiple files, SM4 ECB and CBC cross encryption; horizontal axis: file number, 10 to 40; vertical axis: processing rate in Gbps; series: Single 0%, Single 50%, Single 100%, Dual 0%, Dual 50%, Dual 100%).

References

[1] Y. Song and Z. Li, "Applying array contraction to a sequence of DOALL loops," in Proceedings of the International Conference on Parallel Processing (ICPP'04), vol. 1, pp. 46–53, Montreal, Canada, August 2004.

[2] G. Elsesser, V. Ngo, S. Bhattacharya, and W. T. Tsai, "Load balancing of DOALL loops in the Perfect Club," in Proceedings of the 1993 Proceedings Seventh International Parallel Processing Symposium, pp. 129–133, Newport, CA, USA, April 1993.

[3] D.-K. Chen and P.-C. Yew, "Statement re-ordering for DOACROSS loops," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 24–28, Raleigh, NC, USA, August 1994.

[4] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, "Automatic thread extraction with decoupled software pipelining," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp. 105–118, Barcelona, Spain, November 2005.

[5] V. Krishnan and J. Torrellas, "A chip-multiprocessor architecture with speculative multithreading," IEEE Transactions on Computers, vol. 48, no. 9, pp. 866–880, 1999.

[6] A. S. Rajam, L. E. Campostrini, J. M. M. Caamaño, and P. Clauss, "Speculative runtime parallelization of loop nests: towards greater scope and efficiency," in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 245–254, Hyderabad, India, May 2015.

[7] S. Aldea, A. Estebanez, D. R. Llanos, and A. Gonzalez-Escribano, "An OpenMP extension that supports thread-level speculation," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 78–91, 2016.

[8] J. Salamanca, J. N. Amaral, and G. Araujo, "Evaluating and improving thread-level speculation in hardware transactional memories," in Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 586–595, Chicago, IL, USA, May 2016.

[9] Z. Ying and B. Qinghai, "The scheme for improving the efficiency of block cipher algorithm," in Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), pp. 824–826, Ottawa, ON, Canada, September 2014.

[10] P. Kitsos and A. N. Skodras, "An FPGA implementation and performance evaluation of the seed block cipher," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP), pp. 1–5, Corfu, Greece, July 2011.

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, "ELmD: a pipelineable authenticated encryption and its hardware implementation," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, "AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique," in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, "Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA," in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, "An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography," in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, "A novel high performance ECC processor architecture with two staged multiplier," in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, "Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller," in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, "High speed implementation of symmetric block cipher on GPU," in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, "Implementation and evaluation of different parallel designs of AES using CUDA," in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, "Accelerating NTRU based homomorphic encryption using GPUs," in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, "An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study," in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., "Heterogeneous multicore SoC with SiP for secure multimedia applications," IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, "Evaluating time and throughput at different modes of operation in AES algorithm," in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, "Summary of topological study of chaotic cbc mode of operation," in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, "A high speed implementation counter mode cryptography using hardware parallelism," in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, "Analysis of the statistical cipher feedback mode of block ciphers," IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, "A study on encryption algorithms and modes for disk encryption," in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, "Research on task parallel programming model," Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.


423 Parallel Scheduling ampe process of parallel schedulingis as follows

Step 1 Determine the algorithm operation unitaccording to cryptoStep 2 Select the input channel of the preprocessingmodule of the algorithm operation unit according tomode and No

ampis scheduling method realizes fast transfer of in-coming data streams and continuous processing of jobpackages ampe use of dual channels reduces the interactionbetween modules and hides the processing time of in-dependent job packages in the processing time of dependentjob packages facilitating the parallel execution of jobpackages

424 Data Processing Steps

Step 1 ampe algorithm application process splits the datato be processed adds attribute information and en-capsulates it as job package

Step 2 ampe algorithm operation unit is determinedaccording to the crypto field in the job package and theinput channel of its preprocessing module is selectedaccording to mode and No ampe job package is sent tothe corresponding input channelStep 3 ampe preprocessing module obtains the inputdata of the algorithm operation module namely dataKEY and IV according to the package field of modeand NoStep 4 ampe algorithm operation module performspipeline processing on the received data and sends theresult to the receiving channel of the result outputmodule according to mode and flagStep 5 ampe result output module outputs the receivedjob package to the result receiving process and de-termines whether to feed the job package back to thepreprocessing module according to mode and flag ofthe job packageStep 6 ampe result receiving process recombines thereceived job package based on ID

43 Parallel Execution Time Assume that P0 is the jobpackage encapsulation and reorganization node and PaPb and Pc are algorithm operation nodes Tpar is thepackage encapsulation time and Tz is the data re-organization time g is the communication interval thatis the minimum time interval during which node P0continuously transmits and receives job packages ampereciprocal of g corresponds to the communication band-width L is the maximum communication delay which isthe time taken to transmit a job package from node P0 tothe scheduling node Ts indicates the parallel schedulingtime of job packages Tr indicates the job package pre-processing time Tc indicates the job package operationtime and Tf indicates the job package output time g

represents the calculated load which is the set of jobpackages m is the algorithm operation module pipelinedepth ampe message delivery process of the job package onDPP is shown in Figure 9

ampe continuous sending and receiving of messages needsto meet the conditions

cnt = 0Si = 0

S0

cnt = cnt + 1Si = 0

cnt = 0Si = 1

S1

S2

Mode = CBC|CFB|OFB

Mode = CBC|CFB|OFBcnt = mcnt ne m

Figure 7 Dual-channel selection control of preprocessing module

P22PP AO

Channel 3

C11

C31C32C41

C21RO

C21

P22middotID

01

So

cnt

Channel 4

C21 C41 C32 C31

Figure 8 Dual-channel control of output module

PS

cyp_IPcd

hellip

hellip

hellip

cntMode

Channel 1

Channel 2

Channel 1

Channel 2

P11P31P32P41P21

P11P31P32P41P21

P22P12

P22P12

cyp_IPab

Si

Si

cntMode

01

01

Figure 6 Parallel scheduling and dual receiving channel of preprocessing module

6 Journal of Electrical and Computer Engineering

L + Ts + Tr +Tc

mleg + L + Ts + Tr

ggeTc

m

(2)

Conclusion 1 ampe system communication bandwidth can beimproved by two ways increasing the operation speed of thealgorithm operation module and increasing the pipelinedepth of it

If g job packages come from α data streams consider twoscenarios

(1) Each data stream adopts a parallel mode that is jobpackages are all independent ampe load processingtime is as follows

T Tpar +(nminus 1)g + Ts + Tr + Tc + Tf + Tz + 2L

(3)

(2) Each data stream adopts a parallel mode so that jobpackages of the same service data stream are mutuallydependent and job packages of different service datastreams are mutually independent Assume that themaximum number of service job packages iswprime in theextreme case the first job packet of this data flowappears after other service data flows and then theoperation time of other service data flows is hiddenduring that of the longest service data flow ampe loadprocessing time is as follows

T Tpar + wminuswprime minus 1( 1113857g + wprime Ts + Tr + Tc + Tf( 1113857 + Tz + 2L

(4)

For the data stream mixed in the serialparallel modedue to the pipeline design of the algorithm operationmodule in the process of dependent job packages the in-dependent job packages can be executed in parallel so theexecution time of independent job packages is hidden in theexecution time of the dependent job package amperefore theexecution time Tof the multitask mixed mode data stream isas follows

T Tpar +(nminus 1)g + Ts + Tr + Tc + Tf + Tz + 2LleTleTpar

+ wminuswprime minus 1( 1113857g + wprime Ts + Tr + Tc + Tf( 1113857 + Tz + 2L

(5)

Conclusion 2 ampe execution time of mixed cross-datastreams is limited to the execution time of the datastreams with the most job packages On the premise ofconstant pipeline depth improving the processing per-formance of each module in the pipeline is the key toimprove the processing method

5 Implementation and Testing

51 Hardware Implementation We prototyped the modelto verify its validityampe architecture is shown in Figure 10ampe cipher server management system completes thereception of multiuser cryptographic service data streamsthe encapsulation of job packages and the data reorga-nization service of the operation results ampe crypto-graphic algorithm operation is performed by a cryptomachine as a coprocessor ampe crypto machine is designedusing Xilinx XC7K325t FPGA which includes parallelscheduling module SM3 and SM4 cryptographic algo-rithm cores

ampe hardware implementation block diagram of thecipher machine is shown in Figure 11 ampe cipher machineadopts the PCIe interface and receives the job package splitby the algorithm application process of the cipher servermanagement system in the way of DMA and stores them inDOWN_FIFO in the downlink data storage area

Parallel scheduling module PSCHEDULE: determines the algorithm core according to the crypto field of the job package, determines the receiving FIFO according to the mode and No. fields, and transfers the job package. FIFO1 corresponds to channel 1 in the model, and FIFO2 corresponds to channel 2.

Preprocessing module IP_CTRL: acquires the input data of the algorithm core, including the IV, the KEY, and the result of the preceding dependent job package, and computes the input data for the cryptographic cores.

Operation modules SM4 and SM3: the cryptographic cores perform the algorithm operations on the input data in a pipelined manner and send the results to uFIFO or RAM. uFIFO corresponds to channel 3 in the model, and RAM corresponds to channel 4.

Figure 10: Multiuser cryptographic service. Labels: client, cipher service management system, cipher machine, cloud data center.

Figure 9: Message delivery on the DPP model. Labels: P0, Pa, Pb, Pc, Tpar, g, L, Ts, Tr, Tc, Tf, Tz, Tcm.

Result output module UP_CTRL: if IP_CTRL supplies an ID number, the data with the same ID number are extracted from RAM, the output result′ is calculated, and it is fed back to IP_CTRL and written to the output FIFOo at the same time. If IP_CTRL supplies no ID number, the data in uFIFO are extracted, result′ is calculated, and it is sent to FIFOo. The data in FIFOo are returned to the result reception process of the cipher server management system through the interface module in DMA mode.
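The data path through the modules above can be sketched in software. The following Python model is an illustration only, not the FPGA implementation: the field name stream_id, the rule that SM3, CBC, and OFB packages are dependent, the assignment of dependent packages to FIFO1 and independent ones to FIFO2, the fixed IV of 0, and the toy_core function (which is not a real cipher) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobPackage:
    crypto: str     # algorithm core, e.g. "SM4" or "SM3"
    mode: str       # working mode, e.g. "ECB" or "CBC"
    stream_id: int  # hypothetical ID linking a package to its data stream
    no: int         # sequence number within the stream
    data: int       # toy payload: one byte instead of a real data block


@dataclass
class Lane:
    """One algorithm core with its input FIFOs and result stores."""
    fifo1: deque = field(default_factory=deque)  # channel 1 (assumed: dependent)
    fifo2: deque = field(default_factory=deque)  # channel 2 (assumed: independent)
    ufifo: deque = field(default_factory=deque)  # channel 3: results without ID
    ram: dict = field(default_factory=dict)      # channel 4: last result per ID


def is_dependent(pkg):
    # Assumption: hashing and chained modes need the previous package's result.
    return pkg.crypto == "SM3" or pkg.mode in {"CBC", "OFB"}


def schedule(pkg, lanes):
    """PSCHEDULE-like step: core chosen by crypto, FIFO chosen by dependency."""
    lane = lanes[pkg.crypto]
    (lane.fifo1 if is_dependent(pkg) else lane.fifo2).append(pkg)


def toy_core(x):
    """Stand-in for a cipher core. This is NOT a real cipher."""
    return (x * 31 + 7) % 256


def drain(lane):
    """Alternate between the FIFOs and collect (stream_id, no, result)."""
    fifo_o = []
    while lane.fifo1 or lane.fifo2:
        if lane.fifo1:
            pkg = lane.fifo1.popleft()
            prev = lane.ram.get(pkg.stream_id, 0)   # IV fixed to 0 in this toy
            result = toy_core(pkg.data ^ prev)      # CBC-style chaining
            lane.ram[pkg.stream_id] = result        # fed back for the next package
            fifo_o.append((pkg.stream_id, pkg.no, result))
        if lane.fifo2:
            pkg = lane.fifo2.popleft()
            lane.ufifo.append(toy_core(pkg.data))   # no ID, result goes to uFIFO
            fifo_o.append((pkg.stream_id, pkg.no, lane.ufifo.popleft()))
    return fifo_o


lanes = {"SM4": Lane(), "SM3": Lane()}
for pkg in [JobPackage("SM4", "CBC", 1, 0, 1),
            JobPackage("SM4", "ECB", 2, 0, 1),
            JobPackage("SM4", "CBC", 1, 1, 2)]:
    schedule(pkg, lanes)
results = drain(lanes["SM4"])
print(results)
```

In this toy run the second CBC package of stream 1 uses the stored result of the first one, while the ECB package of stream 2 is processed between them without touching RAM.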

5.2. Test. The test environment is as follows. The heterogeneous multicore parallel processing system implemented on the Xilinx XC7K325T FPGA runs at a main frequency of 160 MHz, and the interface with the upper-layer application is PCIe 2.0 × 8.

Test 1: SM4 CBC encryption of a 4000 MB file. The operation takes 114.390935 s, so the data stream processing rate is 4000 × 8/114.390935 = 279.742446 Mbps.

Test 2: Tests 2.1 to 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. The files use different IVs and KEYs and different working modes. The end time of processing for each file is shown in Table 2. For each data set, the maximum end time is taken as the total time. The data flow processing rate is derived from equation (6).

Figure 11: Hardware architecture. Block labels: PCI_E interface, DMA_ctrl, PD_TLP_CTRL, PU_TLP_CTRL, DOWN_FIFO, PSHEDULE; for each of the three algorithm cores (SM4, SM4, SM3): FIFO1, FIFO2, IP FIFO, IP_CTRL, uFIFO, RAM, UP_CTRL, FIFOo; output: Result.

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time (s), in file order | Processing rate |
|------|-------|------|-------------------------|-----------------|
| 2.1 | File 1 to File 4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8/13.5175 s = 1.89 Gbps |
| 2.1 | File 5 to File 8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| 2.2 | File 1 to File 4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8/12.5052 s = 2.05 Gbps |
| 2.2 | File 5 to File 8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| 2.3 | File 1 to File 4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8/4.9895 s = 5.13 Gbps |
| 2.3 | File 5 to File 8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| 2.4 | File 1 to File 4 | SM3, SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8/13.1909 s = 1.94 Gbps |
| 2.4 | File 5 to File 8 | SM4 CBC, SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |

$$
\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}}
\tag{6}
$$
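The arithmetic behind equation (6) can be checked with a few lines of Python. The sketch below uses the file sizes and end times reported for Test 1 and in Table 2. The function name rate_mbps is a hypothetical helper, and the conversions 1 MB = 8 Mbit and 1 Gbps = 1000 Mbps are assumptions chosen to match the calculations quoted in the text.

```python
def rate_mbps(size_mb, total_time_s):
    """Equation (6), with the size given in MB (1 MB taken as 8 Mbit)."""
    return size_mb * 8 / total_time_s


# Test 1: one 4000 MB file.
test1_mbps = rate_mbps(4000, 114.390935)

# Tests 2.1 to 2.4: eight 400 MB files each; end times (s) from Table 2.
end_times = {
    "Test 2.1": [13.5175, 13.4912, 13.4994, 13.4831,
                 13.4788, 13.4872, 13.4950, 13.5014],
    "Test 2.2": [12.5029, 12.4946, 12.4988, 12.4784,
                 12.4825, 12.4906, 12.4864, 12.5052],
    "Test 2.3": [4.3340, 4.3166, 4.0397, 4.1721,
                 4.3433, 4.9895, 4.1908, 4.3623],
    "Test 2.4": [12.8909, 12.4151, 13.0394, 13.1713,
                 12.7424, 12.6119, 13.1909, 12.4842],
}

# The total time of a data set is the maximum end time over its files.
rates_gbps = {
    name: round(rate_mbps(8 * 400, max(times)) / 1000, 2)
    for name, times in end_times.items()
}

print(round(test1_mbps, 3))
print(rates_gbps)
```

Taking the maximum end time as the total time means that the slowest file determines the reported rate of each data set.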

Analysis. Because Test 1 has only one file, in CBC mode, its job packages are interrelated and are all executed serially. In Test 2.1, the packages within each file are interrelated, but the files are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, four files are in the ECB working mode, and the operation time of their independent job packages can be hidden within the operation time of the dependent packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3, in which all files use ECB mode, is the highest. Test 2.4 has two files with independent job packages and six files with dependent packages, but they are distributed across two algorithm units, so its processing rate is close to that of Test 2.1.
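The hiding effect described in this analysis can be illustrated with a toy cycle-level model. The Python sketch below is not the timing model of the prototype. It assumes a single pipelined core that accepts at most one job package per cycle, a fixed latency of depth cycles per package, round-robin selection among streams, and that a dependent stream may issue its next package only after the previous one has finished. The function name completion_cycles is hypothetical.

```python
def completion_cycles(streams, depth):
    """Cycles until the last result leaves a pipelined core.

    streams: list of (num_packages, dependent) pairs.
    depth:   pipeline latency of one package, in cycles.
    At most one package is issued per cycle, in round-robin order. A dependent
    stream may issue its next package only after the previous one has finished.
    """
    remaining = [n for n, _ in streams]
    ready_at = [0] * len(streams)  # earliest cycle at which each stream may issue
    cycle, finish, start = 0, 0, 0
    while any(remaining):
        for k in range(len(streams)):
            i = (start + k) % len(streams)
            if remaining[i] and ready_at[i] <= cycle:
                remaining[i] -= 1
                done = cycle + depth
                finish = max(finish, done)
                if streams[i][1]:
                    ready_at[i] = done  # next package must wait for this result
                start = (i + 1) % len(streams)
                break
        cycle += 1
    return finish


serial_only = completion_cycles([(8, True)], depth=4)                # one CBC-like stream
parallel_only = completion_cycles([(8, False)], depth=4)             # one ECB-like stream
mixed = completion_cycles([(8, True), (8, False)], depth=4)          # both interleaved
interleaved_serial = completion_cycles([(4, True)] * 8, depth=4)     # eight CBC-like streams
single_long_serial = completion_cycles([(32, True)], depth=4)        # same load, one stream
print(serial_only, parallel_only, mixed, interleaved_serial, single_long_serial)
```

In this model the mixed workload finishes in the same 32 cycles as the dependent stream alone, because the independent packages fill pipeline slots that would otherwise stay idle. Likewise, 32 dependent packages spread over eight interleaved streams finish in 35 cycles, compared with 128 cycles when they form a single stream, which mirrors the relation between Test 2.1 and Test 1.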

Test 3: Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to j files. If $N_i$ represents the number of job packages in file i, then $\sum_{i=1}^{j} N_i = 10000$. j takes the values 10, 20, 30, and 40, and each file uses either the ECB or the CBC encryption mode. We vary the number of files in CBC encryption mode and compare the completion time of the data flow under the single-channel and dual-channel architectures. Each configuration is run several times and the data flow processing time is averaged. The comparison result is shown in Figure 12. "Single 0%" means that all files use ECB mode and the system adopts the single-channel architecture, "Dual 50%" indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.
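A workload of the kind used in Test 3 could be generated as in the Python sketch below. It is a sketch under assumptions that the text does not state: each job package is assigned to a file uniformly at random, the CBC files are chosen at random, and the function name make_workload and its seed parameter are hypothetical.

```python
import random


def make_workload(total_packages, num_files, cbc_fraction, seed=0):
    """Return (counts, modes): packages per file and the mode of each file."""
    rng = random.Random(seed)
    counts = [0] * num_files
    for _ in range(total_packages):
        counts[rng.randrange(num_files)] += 1   # uniform random assignment
    num_cbc = round(num_files * cbc_fraction)   # e.g. 50% of the files
    cbc_files = set(rng.sample(range(num_files), num_cbc))
    modes = ["CBC" if i in cbc_files else "ECB" for i in range(num_files)]
    return counts, modes


# One configuration of Test 3: j = 20 files, 50% of them in CBC mode.
counts, modes = make_workload(10000, 20, 0.5, seed=1)
print(sum(counts), modes.count("CBC"), modes.count("ECB"))
```

By construction the package counts always sum to the requested total, so the constraint on the $N_i$ holds for every value of j.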

As can be seen from Figure 12, when the data flow contains only independent job packages, the processing rate under the dual channel is close to that under the single channel, because the algorithm operation unit adopts a pipelined design. As the proportion of associated job packages in the data flow increases, the processing-rate advantage of the dual channel gradually emerges, and it becomes more pronounced as the number of files in the data stream increases.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model (DPP) to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing among independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can process multiservice, personalized cross-data streams correctly and rapidly. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).

Figure 12: Processing rate comparison of dual channel and single channel. Panel title: multiple files, SM4 ECB and CBC cross encryption. x-axis: file number (10, 20, 30, 40); y-axis: processing rate (Gbps, 0 to 6); series: Single 0%, Single 50%, Single 100%, Dual 0%, Dual 50%, Dual 100%.




Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: AnEfficientStreamDataProcessingModelforMultiuser ...2018/04/02  · mentation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. is paper focuses

Result output module UP_CTRL: If IP_CTRL supplies an ID number, the data with the same ID number are extracted from RAM, and the output result′ is calculated, fed back to IP_CTRL, and written to the output FIFOo at the same time. If IP_CTRL supplies no ID number, the data in uFIFO are extracted, and result′ is calculated and sent to FIFOo. The data in FIFOo are returned to the result reception process of the cipher server management system through the interface module in DMA mode.

5.2. Test. The test environment is as follows. The heterogeneous multicore parallel processing system is implemented on a Xilinx XC7K325T FPGA with a main frequency of 160 MHz, and the interface with the upper application is PCIe 2.0 ×8.

Test 1. SM4 CBC encryption of a 4000 MB file. The operation completes in 114.390935 s, so the data stream processing rate is 4000 × 8/114.390935 = 279.742446 Mbps.

Test 2. Tests 2.1 to 2.4 each use eight 400 MB files, and the job packages of the eight files enter the cipher machine in an interleaved manner. The files use different IVs and keys in different working modes. The end time of each file's processing is shown in Table 2. For the data set, the maximum end time is taken as the total time. The data flow processing rate is derived from equation (6).

Figure 11: Hardware architecture of the system. (Blocks shown: PCI_E interface, DMA_ctrl, PD_TLP_CTRL, PU_TLP_CTRL, DOWN_FIFO, PSHEDULE, and three algorithm units (two SM4 and one SM3), each with FIFO1, FIFO2, IP FIFO, IP_CTRL, uFIFO, RAM, UP_CTRL, and FIFOo; the units deliver Result outputs.)

Table 2: Cipher operation time under cross files.

| Test | Files | Mode | Time (s), in file order | Processing rate |
|------|-------|------|-------------------------|-----------------|
| 2.1 | 1–4 | SM4 CBC | 13.5175, 13.4912, 13.4994, 13.4831 | 3200 × 8/13.5175 s = 1.89 Gbps |
| 2.1 | 5–8 | SM4 CBC | 13.4788, 13.4872, 13.4950, 13.5014 | |
| 2.2 | 1–4 | SM4 ECB | 12.5029, 12.4946, 12.4988, 12.4784 | 3200 × 8/12.5052 s = 2.05 Gbps |
| 2.2 | 5–8 | SM4 CBC | 12.4825, 12.4906, 12.4864, 12.5052 | |
| 2.3 | 1–4 | SM4 ECB | 4.3340, 4.3166, 4.0397, 4.1721 | 3200 × 8/4.9895 s = 5.13 Gbps |
| 2.3 | 5–8 | SM4 ECB | 4.3433, 4.9895, 4.1908, 4.3623 | |
| 2.4 | 1–4 | SM3, SM4 ECB | 12.8909, 12.4151, 13.0394, 13.1713 | 3200 × 8/13.1909 s = 1.94 Gbps |
| 2.4 | 5–8 | SM4 CBC, SM4 OFB | 12.7424, 12.6119, 13.1909, 12.4842 | |


$$\text{rate (bps)} = \frac{\text{size of data flow (bit)}}{\text{total time (s)}} \tag{6}$$
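The rate formula can be checked against the reported figures with a few lines of Python. This is a minimal sketch: the function name `processing_rate_mbps` is hypothetical, the per-file end times are read with the decimal point restored (for example 13.5175 s), and 1 MB is taken as 8 Mb and 1 Gbps as 1000 Mbps, which is what the reported rates imply.

```python
def processing_rate_mbps(file_sizes_mb, end_times_s):
    """Data flow processing rate in Mbps, following equation (6).

    The size of the data flow is the sum of all file sizes (MB -> Mb),
    and the total time is the maximum end time over all files.
    """
    total_megabits = sum(file_sizes_mb) * 8
    total_time_s = max(end_times_s)
    return total_megabits / total_time_s


# End times of the eight interleaved SM4 CBC files (first block of Table 2).
cbc_end_times = [13.5175, 13.4912, 13.4994, 13.4831,
                 13.4788, 13.4872, 13.4950, 13.5014]

rate = processing_rate_mbps([400] * 8, cbc_end_times)
print(f"{rate / 1000:.2f} Gbps")  # prints "1.89 Gbps"
```

Passing a single 4000 MB file with an end time of 114.390935 s to the same function gives about 279.74 Mbps, matching the Test 1 figure.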

Analysis. Because Test 1 has only one file in CBC mode, its job packages are interrelated and are all executed serially. In Test 2.1, the packages within each file are interrelated, but the files are independent of each other, so the data flow processing rate of Test 2.1 is higher than that of Test 1. In Test 2.2, 4 files are in the ECB working mode, and the operation time of the independent job packages can be hidden within the operation time of the dependent packages, so the data flow processing rate of Test 2.2 is higher than that of Test 2.1. Similarly, the data processing rate of Test 2.3 is the highest. Test 2.4 has 2 files with independent job packages and 6 files with dependent packages, but they are allocated to two algorithm units, so the operation rate is close to that of Test 2.1.
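The hiding effect described in this analysis can be illustrated with a toy cycle-count model. It is only a sketch under stated assumptions, not the scheduler of the prototype: the function `simulate`, the labels `"dep"` and `"ind"`, the round-robin issue order, the one-package-per-cycle issue rate, and the pipeline depth of 16 are all hypothetical. A dependent (for example CBC) package may be issued only after the previous package of the same file has left the pipeline, while an independent (for example ECB) package may be issued at any time.

```python
def simulate(files, depth):
    """Return the cycle at which the last job package leaves the pipeline.

    files: list of (kind, n_packages), kind is "dep" (chained packages)
           or "ind" (independent packages).
    depth: pipeline latency of the algorithm unit, in cycles.
    """
    remaining = [n for _, n in files]
    ready_at = [0] * len(files)   # earliest cycle the next package may issue
    cycle, last_done, nxt = 0, 0, 0
    while any(remaining):
        # Round-robin: issue at most one package per cycle.
        for k in range(len(files)):
            i = (nxt + k) % len(files)
            if remaining[i] and ready_at[i] <= cycle:
                remaining[i] -= 1
                done = cycle + depth
                last_done = max(last_done, done)
                if files[i][0] == "dep":
                    ready_at[i] = done   # successor waits for this result
                nxt = i + 1
                break
        cycle += 1   # an idle cycle (pipeline bubble) if nothing was ready
    return last_done


all_dep = simulate([("dep", 4)] * 8, depth=16)
mixed = simulate([("dep", 4)] * 4 + [("ind", 4)] * 4, depth=16)
all_ind = simulate([("ind", 4)] * 8, depth=16)
print(all_dep, mixed, all_ind)  # prints "71 67 47"
```

In this model the independent packages fill cycles that would otherwise be bubbles while dependent packages wait, so the mixed workload finishes earlier than the all-dependent one and the all-independent workload finishes first, which is the same ordering the analysis reports for the interleaved tests.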

Test 3. Processing rate comparison of dual channel and single channel. The total number of job packages is 10000, and they are randomly assigned to $j$ files. If $N_i$ represents the number of job packages in file $i$, then $\sum_{i=1}^{j} N_i = 10000$. $j$ takes the values 10, 20, 30, and 40, and each file adopts the ECB or CBC encryption mode. The number of files in CBC encryption mode is varied, and the completion time of the data flow is compared between the single-channel and dual-channel architectures. Each configuration is run several times and the data flow processing time is averaged; the comparison result is shown in Figure 12. "Single 0" means that all files use ECB mode and the system adopts the single-channel architecture. "Dual 50" indicates that 50% of the files in the data stream use CBC mode and the system adopts the dual-channel architecture, and so on.
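A workload of this shape could be generated as follows. This is a hedged sketch rather than the authors' test harness: the function name `make_workload`, the fixed seed, and the choice to draw the file of each package uniformly at random are assumptions made for illustration.

```python
import random


def make_workload(total_packages, n_files, cbc_fraction, seed=0):
    """Randomly assign job packages to files and pick a mode per file.

    Returns a list of (mode, N_i) pairs with sum(N_i) == total_packages.
    """
    rng = random.Random(seed)
    counts = [0] * n_files
    for _ in range(total_packages):
        counts[rng.randrange(n_files)] += 1   # package goes to a random file

    n_cbc = round(cbc_fraction * n_files)     # e.g. 0.5 -> half the files
    modes = ["CBC"] * n_cbc + ["ECB"] * (n_files - n_cbc)
    rng.shuffle(modes)
    return list(zip(modes, counts))


# The "50" configuration with j = 10 files.
workload = make_workload(10000, 10, 0.5)
print(sum(n for _, n in workload))  # prints "10000"
```

The same call with `n_files` set to 20, 30, or 40 and `cbc_fraction` set to 0, 0.5, or 1 covers the configurations named in the test description.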

As can be seen from Figure 12, when the data flow consists of independent job packages, the algorithm operation unit adopts the pipeline design, so the processing rate under the dual channel is close to that under the single channel. As the number of associated job packages in the data flow increases, the advantage of the dual channel in data processing rate gradually appears, and as the number of files in the data stream increases, this advantage becomes more obvious.

6. Conclusion

Based on the characteristics of cryptographic operations, this paper proposes a dual-channel pipeline parallel data processing model (DPP) to implement cryptographic operations for cross-data streams with different service requirements in a multiuser environment. The model ensures synchronization between dependent job packages and parallel processing between independent job packages and data streams. It hides the processing of independent job packages within the processing of dependent job packages to improve the processing speed of cross-data streams. Prototype experiments show that a system built on this model can process multiservice and personalized cross-data streams correctly and rapidly. Increasing the depth of the cryptographic algorithm pipeline and improving the processing performance of each module in the pipeline can improve the overall performance of the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515).


Figure 12: Processing rate comparison of dual channel and single channel. (Plot title: multiple files, SM4 ECB and CBC cross encryption. Horizontal axis: file number (10, 20, 30, 40). Vertical axis: processing rate (Gbps), 0 to 6. Series: Single 0, Single 50, Single 100, Dual 0, Dual 50, Dual 100.)



Page 9: AnEfficientStreamDataProcessingModelforMultiuser ...2018/04/02  · mentation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. is paper focuses

rate (bps) size of data flow (bit)the total time (s)

(6)

Analysis Because Test 1 has only one file in the CBCmode the job packages are interrelated and all are executedserial Although packages of each file are interrelated thefiles of Test 21 are independent of each other so the dataflow processing rate of Test 21 is higher than that of Test 1In Test 22 4 files are in the ECB work mode and theindependent job packages operation time can be hiddenwithin the operation time of the dependency packages sothe data flow processing rate of Test 22 is higher than thatof the Test 21 Similarly the data processing rate of Test 23is the highest Test 24 has 2 files with independent jobpackages and 6 files with dependent packages but they areallocated in two algorithm units so the operation rate isclose to Test 21

Test 3 Processing rate compare of dual channel and singlechannel ampe total amount of job packages is 10000and they are randomly assigned to j files If Ni representsthe number of job packages in filei 1113936

ji1Ni 10000 If j

is 10 20 30 40 the ECB or CBC encryption mode isadopted Change the number of files in CBC encryptionmode and compare the completion time of data flowin single-channel architecture and dual-channel archi-tecture ampe average value of data flow processing time isrun several times and the comparison result is shown inFigure 12 Single 0means that all files use ECBmode andthe system adopts single-channel architecture Dual 50indicates that 50 of the files in the data stream useCBC mode and the system is dual-channel architectureand so on

As can be seen from Figure 12 when the data flow is anindependent data flow the algorithm operation unit adoptsthe pipeline design so the processing rate under the dualchannel is close to the processing rate under the singlechannel with the increase of the associated job packages inthe data flow the advantage of the data processing rate ofdual channel is gradually displayed and with the increase ofthe number of files in the data stream the advantage of thedata processing rate is more obvious

6 Conclusion

Based on the characteristics of cryptographic operationsthis paper proposes a dual-channel pipeline parallel dataprocessing model DPP to implement cryptographic op-erations for cross-data streams with different servicerequirements in a multiuser environment ampe modelensures synchronization between dependent job packagesand parallel processing between independent job packagesand data streams It hides the processing of independentjob packages in the process of dependent job packages toimprove the processing speed of cross-data streamsPrototype experiments prove that the system under thismodel can realize correct and rapid processing of multi-service and personalized cross-data streams Increasingthe depth of the cryptographic algorithm pipeline and

improving the processing performance of each module inthe pipeline can improve the overall performance of thesystem

Data Availability

ampe data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

ampe authors declare that they have no conflicts of interest

Acknowledgments

ampis work was supported by the National Key RampD Programof China (no 2017YFB0802705) and the National NaturalScience Foundation of China (no 61672515)

References

[1] Y Song and Z Li ldquoApplying array contraction to a sequenceof DOALL loopsrdquo in Proceedings of the International Con-ference on Parallel Processing (ICPPrsquo04) vol 1 pp 46ndash53Montreal Canada August 2004

[2] G Elsesser V Ngo S Bhattacharya and W T Tsai ldquoLoadbalancing of DOALL loops in the Perfect Clubrdquo in Proceedingsof the 1993 Proceedings Seventh International Parallel ProcessingSymposium pp 129ndash133 Newport CA USA April 1993

10 20 30 40File number

0

1

2

3

4

5

6

Proc

essin

g ra

te (G

bps)

Single 0Single 50Single 100

Dual 0Dual 50Dual 100

Multiple files sm4 ECB and CBC cross encryption

Figure 12 Processing rate comparison of dual channel and singlechannel

Journal of Electrical and Computer Engineering 9

[3] D K C Ding-Kai Chen and P C Y Pen-Chung YewldquoStatement re-ordering for DOACROSS loopsrdquo in Pro-ceedings of the 1994 Internatonal Conference on ParallelProcessing vol 2 pp 24ndash28 Raleigh NC USA August 1994

[4] G Ottoni R Rangan A Stoler and D I August ldquoAutomaticthread extraction with decoupled software pipeliningrdquo inProceedings of the 38th Annual IEEEACM InternationalSymposium on Microarchitecture (MICROrsquo05) pp 105ndash118Barcelona Spain November 2005

[5] V Krishnan and J Torrellas ldquoA chip-multiprocessor archi-tecture with speculative multithreadingrdquo IEEE Transactionson Computers vol 48 no 9 pp 866ndash880 1999

[6] A S Rajam L E Campostrini J M M Caamantildeo andP Clauss ldquoSpeculative runtime parallelization of loop neststowards greater scope and efficiencyrdquo in Proceedings of the2015 IEEE International Parallel and Distributed ProcessingSymposium Workshop (IPDPSW) pp 245ndash254 HyderabadIndia May 2015

[7] S Aldea A Estebanez D R Llanos and A Gonzalez-Escribano ldquoAn OpenMP extension that supports thread-levelspeculationrdquo IEEE Transactions on Parallel and DistributedSystems vol 27 no 1 pp 78ndash91 2016

[8] J Salamanca J N Amaral and G Araujo ldquoEvaluating andimproving thread-level speculation in hardware transactionalmemoriesrdquo in Proceedings of the 2016 IEEE InternationalParallel and Distributed Processing Symposium (IPDPS)pp 586ndash595 Chicago IL USA May 2016

[9] Z Ying and B Qinghai ldquoampe scheme for improving the ef-ficiency of block cipher algorithmrdquo in Proceedings of the 2014IEEE Workshop on Advanced Research and Technology inIndustry Applications (WARTIA) pp 824ndash826 Ottawa ONCanada September 2014

[10] P Kitsos and A N Skodras ldquoAn FPGA implementationand performance evaluation of the seed block cipherrdquo inProceedings of the 2011 17th International Conference onDigital Signal Processing (DSP) pp 1ndash5 Corfu Greece July2011

[11] L Bossuet N Datta C Mancillas-Lopez and M NandildquoELmD a pipelineable authenticated encryption and itshardware implementationrdquo IEEE Transactions on Computersvol 65 no 11 pp 3318ndash3331 2016

[12] P U Deshpande and S A Bhosale ldquoAES encryption enginesof many core processor arrays on FPGA by using parallelpipeline and sequential techniquerdquo in Proceedings of the 2015International Conference on Energy Systems and Applicationspp 75ndash80 Pune India October 2015

[13] T Kryjak and M Gorgon ldquoPipeline implementation of the128-bit block cipher CLEFIA in FPGArdquo in Proceedings of the2009 International Conference on Field Programmable Logicand Applications pp 373ndash378 Prague Czech RepublicAugust 2009

[14] S Lin S He X Guo and D Guo ldquoAn efficient algorithm forcomputing modular division over GF(2m) in elliptic curvecryptographyrdquo in Proceedings of the 2017 11th IEEE In-ternational Conference on Anti-counterfeiting Security andIdentification (ASID) pp 179ndash182 Xiamen China 2017

[15] K M John and S Sabi ldquoA novel high performance ECCprocessor architecture with two staged multiplierrdquo in Pro-ceedings of the 2017 IEEE International Conference on Elec-trical Instrumentation and Communication Engineering(ICEICE) pp 1ndash5 Karur India April 2017

[16] M S Albahri M Benaissa and Z U A Khan ldquoParallelimplementation of ECC point multiplication on a homoge-neous multi-core microcontrollerrdquo in Proceedings of the 2016

12th International Conference on Mobile Ad-Hoc and SensorNetworks (MSN) pp 386ndash389 Hefei China December 2016

[17] W K Lee B M Goi R C W Phan and G S Poh ldquoHighspeed implementation of symmetric block cipher on GPUrdquo inProceedings of the 2014 International Symposium on IntelligentSignal Processing and Communication Systems (ISPACS)pp 102ndash107 Kuching Malaysia December 2014

[18] J Ma X Chen R Xu and J Shi ldquoImplementation andevaluation of different parallel designs of AES using CUDArdquoin Proceedings of the 2017 IEEE Second International Con-ference on Data Science in Cyberspace (DSC) pp 606ndash614Shenzhen China June 2017

[19] W Dai Y Doroz and B Sunar ldquoAccelerating NTRU basedhomomorphic encryption using GPUsrdquo in Proceedings of the2014 IEEE High Performance Extreme Computing Conference(HPEC) pp 1ndash6 Waltham MA USA September 2014

[20] G Barlas A Hassan and Y A Jundi ldquoAn analytical approachto the design of parallel block cipher encryptiondecryptiona CPUGPU case studyrdquo in Proceedings of the 2011 19thInternational Euromicro Conference on Parallel Distributedand Network-Based Processing pp 247ndash251 Ayia NapaCyprus February 2011

[21] H Kondo S Otani M Nakajima et al ldquoHeterogeneousmulticore SoC with SiP for secure multimedia applicationsrdquoIEEE Journal of Solid-State Circuits vol 44 no 8 pp 2251ndash2259 2009

[22] S Wang J Han Y Li Y Bo and X Zeng ldquoA 920 MHz quad-core cryptography processor accelerating parallel task pro-cessing of public-key algorithmsrdquo in Proceedings of the IEEE2013 Custom Integrated Circuits Conference pp 1ndash4 San JoseCA USA September 2013

[23] M Alfadel E S M El-Alfy and K M A Kamal ldquoEvaluatingtime and throughput at different modes of operation in AESalgorithmrdquo in Proceedings of the 2017 8th InternationalConference on Information Technology (ICIT) pp 795ndash801Amman Jordan May 2017

[24] A Abidi S Tawbi C Guyeux B Bouallegue andM Machhout ldquoSummary of topological study of chaotic cbcmode of operationrdquo in Proceedings of the 2016 IEEE IntlConference on Computational Science and Engineering (CSE)and IEEE Intl Conference on Embedded and UbiquitousComputing (EUC) and 15th Intl Symposium on DistributedComputing and Applications for Business Engineering(DCABES) pp 436ndash443 Paris France August 2016

[25] S Najjar-Ghabel S Yousefi andM Z Lighvan ldquoA high speedimplementation counter mode cryptography using hardwareparallelismrdquo in Proceedings of the 2016 Eighth InternationalConference on Information and Knowledge Technology (IKT)pp 55ndash60 Hamedan Iran September 2016

[26] H M Heys ldquoAnalysis of the statistical cipher feedback modeof block ciphersrdquo IEEE Transactions on Computers vol 52no 1 pp 77ndash92 2003

[27] M A Alomari K Samsudin and A R Ramli ldquoA study onencryption algorithms and modes for disk encryptionrdquo inProceedings of the 2009 International Conference on SignalProcessing Systems pp 793ndash797 Singapore 2009

[28] L Wang H M Cui L Chen and X B Feng ldquoResearch ontask parallel programming modelrdquo Journal of Softwarevol 24 no 1 pp 77ndash90 2013

[29] K Huang G C Fox and J J DongarraDistributed and CloudComputing From Parallel Processing to the Internet of JingsMorgan Kaufmann Burlington MA USA 2011

10 Journal of Electrical and Computer Engineering

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: AnEfficientStreamDataProcessingModelforMultiuser ...2018/04/02  · mentation scheme of the cryptographic algorithm can be applied to the algorithm core in this model. is paper focuses

[3] D K C Ding-Kai Chen and P C Y Pen-Chung YewldquoStatement re-ordering for DOACROSS loopsrdquo in Pro-ceedings of the 1994 Internatonal Conference on ParallelProcessing vol 2 pp 24ndash28 Raleigh NC USA August 1994

[4] G Ottoni R Rangan A Stoler and D I August ldquoAutomaticthread extraction with decoupled software pipeliningrdquo inProceedings of the 38th Annual IEEEACM InternationalSymposium on Microarchitecture (MICROrsquo05) pp 105ndash118Barcelona Spain November 2005

[5] V Krishnan and J Torrellas ldquoA chip-multiprocessor archi-tecture with speculative multithreadingrdquo IEEE Transactionson Computers vol 48 no 9 pp 866ndash880 1999

[6] A S Rajam L E Campostrini J M M Caamantildeo andP Clauss ldquoSpeculative runtime parallelization of loop neststowards greater scope and efficiencyrdquo in Proceedings of the2015 IEEE International Parallel and Distributed ProcessingSymposium Workshop (IPDPSW) pp 245ndash254 HyderabadIndia May 2015

[7] S Aldea A Estebanez D R Llanos and A Gonzalez-Escribano ldquoAn OpenMP extension that supports thread-levelspeculationrdquo IEEE Transactions on Parallel and DistributedSystems vol 27 no 1 pp 78ndash91 2016

[8] J Salamanca J N Amaral and G Araujo ldquoEvaluating andimproving thread-level speculation in hardware transactionalmemoriesrdquo in Proceedings of the 2016 IEEE InternationalParallel and Distributed Processing Symposium (IPDPS)pp 586ndash595 Chicago IL USA May 2016

[9] Z Ying and B Qinghai ldquoampe scheme for improving the ef-ficiency of block cipher algorithmrdquo in Proceedings of the 2014IEEE Workshop on Advanced Research and Technology inIndustry Applications (WARTIA) pp 824ndash826 Ottawa ONCanada September 2014

[10] P Kitsos and A N Skodras ldquoAn FPGA implementationand performance evaluation of the seed block cipherrdquo inProceedings of the 2011 17th International Conference onDigital Signal Processing (DSP) pp 1ndash5 Corfu Greece July2011

[11] L. Bossuet, N. Datta, C. Mancillas-Lopez, and M. Nandi, “ELmD: a pipelineable authenticated encryption and its hardware implementation,” IEEE Transactions on Computers, vol. 65, no. 11, pp. 3318–3331, 2016.

[12] P. U. Deshpande and S. A. Bhosale, “AES encryption engines of many core processor arrays on FPGA by using parallel, pipeline and sequential technique,” in Proceedings of the 2015 International Conference on Energy Systems and Applications, pp. 75–80, Pune, India, October 2015.

[13] T. Kryjak and M. Gorgon, “Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA,” in Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, pp. 373–378, Prague, Czech Republic, August 2009.

[14] S. Lin, S. He, X. Guo, and D. Guo, “An efficient algorithm for computing modular division over GF(2^m) in elliptic curve cryptography,” in Proceedings of the 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), pp. 179–182, Xiamen, China, 2017.

[15] K. M. John and S. Sabi, “A novel high performance ECC processor architecture with two staged multiplier,” in Proceedings of the 2017 IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE), pp. 1–5, Karur, India, April 2017.

[16] M. S. Albahri, M. Benaissa, and Z. U. A. Khan, “Parallel implementation of ECC point multiplication on a homogeneous multi-core microcontroller,” in Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), pp. 386–389, Hefei, China, December 2016.

[17] W. K. Lee, B. M. Goi, R. C. W. Phan, and G. S. Poh, “High speed implementation of symmetric block cipher on GPU,” in Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 102–107, Kuching, Malaysia, December 2014.

[18] J. Ma, X. Chen, R. Xu, and J. Shi, “Implementation and evaluation of different parallel designs of AES using CUDA,” in Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), pp. 606–614, Shenzhen, China, June 2017.

[19] W. Dai, Y. Doroz, and B. Sunar, “Accelerating NTRU based homomorphic encryption using GPUs,” in Proceedings of the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, September 2014.

[20] G. Barlas, A. Hassan, and Y. A. Jundi, “An analytical approach to the design of parallel block cipher encryption/decryption: a CPU/GPU case study,” in Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 247–251, Ayia Napa, Cyprus, February 2011.

[21] H. Kondo, S. Otani, M. Nakajima et al., “Heterogeneous multicore SoC with SiP for secure multimedia applications,” IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2251–2259, 2009.

[22] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, “A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms,” in Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pp. 1–4, San Jose, CA, USA, September 2013.

[23] M. Alfadel, E. S. M. El-Alfy, and K. M. A. Kamal, “Evaluating time and throughput at different modes of operation in AES algorithm,” in Proceedings of the 2017 8th International Conference on Information Technology (ICIT), pp. 795–801, Amman, Jordan, May 2017.

[24] A. Abidi, S. Tawbi, C. Guyeux, B. Bouallegue, and M. Machhout, “Summary of topological study of chaotic cbc mode of operation,” in Proceedings of the 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES), pp. 436–443, Paris, France, August 2016.

[25] S. Najjar-Ghabel, S. Yousefi, and M. Z. Lighvan, “A high speed implementation counter mode cryptography using hardware parallelism,” in Proceedings of the 2016 Eighth International Conference on Information and Knowledge Technology (IKT), pp. 55–60, Hamedan, Iran, September 2016.

[26] H. M. Heys, “Analysis of the statistical cipher feedback mode of block ciphers,” IEEE Transactions on Computers, vol. 52, no. 1, pp. 77–92, 2003.

[27] M. A. Alomari, K. Samsudin, and A. R. Ramli, “A study on encryption algorithms and modes for disk encryption,” in Proceedings of the 2009 International Conference on Signal Processing Systems, pp. 793–797, Singapore, 2009.

[28] L. Wang, H. M. Cui, L. Chen, and X. B. Feng, “Research on task parallel programming model,” Journal of Software, vol. 24, no. 1, pp. 77–90, 2013.

[29] K. Huang, G. C. Fox, and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Burlington, MA, USA, 2011.

