
The need for modern computing paradigm: Science applied to computing

János Végh
Kalimános BT, Debrecen, Hungary
Vegh.Janos@gmail.com
ORCID: 0000-0002-3247-7810

Alin Tisan
Royal Holloway, University of London, UK
Department of Electronic Engineering
[email protected]
ORCID: 0000-0002-8280-2722

Abstract—More than a hundred years ago, classical physics was at the height of its power, with only a few unexplained phenomena; those few, however, led to a revolution and to the development of modern physics. The breakthrough became possible by studying nature under extreme conditions, which finally led to the understanding of relativistic and quantum behavior. Today, computing is in a similar position: it is a sound success story with exponentially growing utilization but, as it moves towards extreme utilization conditions, with a growing number of difficulties and unexpected issues that cannot be explained within the classic computing paradigm. The paper draws attention to the fact that under extreme conditions computing may behave differently than under normal conditions, and points out that certain unnoticed or neglected features enable both the explanation of the new phenomena and the enhancement of some computing features. Moreover, an implementation idea for a new, modern computing paradigm is proposed.

Index Terms—modern computing paradigm, performance limitation, efficiency, parallelized computing, supercomputing, high performance computing, distributed computing, Amdahl's Law

I. INTRODUCTION

Initially, computers were constructed with the goal of reducing the computation time of rather complex but sequential mathematical tasks [1]–[4]. Today a computer is deployed for radically different goals, where the mathematical computation is in most cases just a very small part of the task. The majority of the non-computational activities are reduced to, or imitated by, some kind of computation, because this is the only activity a computer can perform.

Because of this, the current computer architecture (derived from the 70-year-old 'classic' paradigm) is not really suitable for the goals it is expected to serve: to handle varying degrees of parallelism in a dynamic way; to work in multi-tasking mode; to build systems comprising a great many processors with good efficacy; to cooperate with a large number of other processors (both on-chip and off-chip); to monitor many external actions and react to them in real time; to operate 'big data' processing systems; etc. Not only is the efficacy of computing very low because of the various performance losses [5], but more and more limitations come to light [6].

Project no. 125547 has been implemented with the support provided by the National Research, Development and Innovation Fund of Hungary, financed under the K funding scheme. The ERC-ECAS support of project 886183 is also acknowledged.

Not only does computing quickly reach its performance limits under extreme conditions [6], [7], but this performance degrades even faster in systems comprising parallelized sequential processors [8]. The very rigid HW-SW separation leads to the phenomenon known as priority inversion [9]; very large rigid architectures suffer from frequent component errors; complex cyber-physical systems must be equipped with excessive computing facilities to provide real-time operation; delivering large amounts of data from the big storage centres to the places where the processing capacity is concentrated constitutes a major challenge; etc. All these issues have a common reason [10]: the classical computing paradigm, which reflects the state of the art of computing 70 years ago. Computing needs renewal [11]. Not only the materials and methods of gate handling [12] but also our thinking must be rebooted.

It was discovered early that the age of conventional architectures was over [13], [14]; the only question that remained open was whether the game is over, too [15]. When it comes to parallel computing, today's technology is able to deliver not only many, but "too many" cores [16]. Their computing performance, however, does not increase linearly with the number of cores [17] ("a trend that can't go on ad infinitum" [18]), and only a few of them can be utilized simultaneously [19]. Consequently, it is unreasonable to use a high number of cores for such extreme system operation [20]. Approaching the limits of the classic paradigm of computing has rearranged the technology ranking [21]. "New ways of exploiting the silicon real estate need to be explored" [22], and a modern computing paradigm is needed indeed.

One of the major issues is that "being able to run software developed for today's computers on tomorrow's has reined in the cost of software development. But this guarantee between computer designers and software designers has become a barrier to a new computing era." [21] A new computing paradigm that is an extension rather than a replacement of the old paradigm has better chances of being accepted, since it provides a smooth transition from the age of old computing to the age of modern computing.


II. ANALOGIES WITH THE CLASSIC VERSUS MODERN PHYSICS

The case of computing is very much analogous to the case of classic physics versus modern physics (relativistic and quantum). Given the world we live in, it is rather counter-intuitive to accept that, as we move towards unusual conditions, the addition of speeds behaves differently, energy becomes discontinuous, and the momentum and the position of a particle cannot be measured accurately at the same time. In normal cases there is no notable difference whether or not the non-classical principles are applied. However, as we get farther from everyday conditions, the difference becomes more considerable and even leads to phenomena one can never experience under the usual, everyday conditions. The analogies are not meant to imply a direct correspondence between certain physical and computing phenomena. Rather, the paper draws attention both to the fact that under extreme conditions qualitatively different behavior may be encountered, and to the fact that scrutinizing certain, formerly unnoticed or neglected aspects enables the new phenomena to be explained. Unlike in nature, the technical implementation of the critical points can be changed, and through this the behavior of computing systems can also be changed. In this paper, only two of the affected important areas of computing are touched upon in more detail: parallel processing and multitasking.

A. Analogy with special relativity

In the above sense, there is an important difference between the operation and the performance of a single-processor system and those of parallelized but sequentially working computer systems. As long as just a few (thousands of) single processors are aggregated into a large computer system, the resulting performance corresponds (approximately) to the sum of the single-processor performance values, similarly to the classic rule of adding speeds (see also Table I). However, when assembling larger computing systems (and approaching, with their performance, the "speed of light" of computing systems, in the range of millions of processors), the nominal performance starts to deviate from the experienced payload performance: the phenomenon known as efficiency appears. In addition, there is more than just one efficiency [23]: the measurable efficiency depends on the method of measurement (the utilized benchmark program). Actually, the simple Amdahl's Law results in a two-dimensional surface¹, shown in Figure 1. By changing the implementation method, the efficacy at a large number of cores can be an order of magnitude higher.

What makes the analogy really relevant is shown in Table I: although the functional form is different, the modern approach to computing (see the model) introduces a correction term that remains close to unity until an extremely large number of cores is approached; then the performance functions saturate. The phenomenon is surprisingly similar to increasing the speed of a body with constant acceleration (see Figure 5).

¹ As can be seen from Figure 6, at an extremely high number of cores (1 − α) also depends on the number of cores.

TABLE I
THE "ADDING SPEEDS" ANALOGIES BETWEEN THE CLASSIC AND MODERN ARTS OF SCIENCE AND COMPUTING, RESPECTIVELY

             Physics (adding of speeds)                          |  Computing (adding of performance)
Classic      v(t) = t * a                                        |  Perf_total(N) = N * Perf_single
Modern       v(t) = (t * a) / sqrt(1 + (t * a / (c/n))^2)        |  Perf_total(N) = (N * Perf_single) / (N * (1 - α) + α)   [24]
             (relativistic)                                      |
Notation     c = speed of light, t = time,                       |  N = number of cores,
             a = acceleration, n = optical density               |  Perf_single = single-processor performance, α = parallelism
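To make the parallel concrete, the two formulas of Table I can be evaluated side by side. The sketch below is only illustrative; the single-processor performance of 1 Gflop/s and the (1 − α) value of 3.3e-8 (a Taihulight-like HPL figure quoted later in the text) are example inputs, not measured data:

    import math

    # Left column of Table I: a body under constant acceleration a approaches,
    # but never exceeds, c/n.
    def relativistic_speed(t, a=9.81, c=3.0e8, n=1.0):
        return t * a / math.sqrt(1.0 + (t * a / (c / n)) ** 2)

    # Right column of Table I: N cores approach, but never exceed,
    # Perf_single / (1 - alpha).
    def payload_performance(n_cores, alpha, perf_single=1e9):
        return n_cores * perf_single / (n_cores * (1.0 - alpha) + alpha)

    for t in (1e5, 1e6, 1e7, 1e8):                 # seconds
        print(f"t = {t:7.0e} s   v = {relativistic_speed(t):.3e} m/s")

    alpha = 1.0 - 3.3e-8   # example (1 - alpha), cf. the HPL line of Fig. 4
    for n in (1e3, 1e5, 1e7, 1e9):                 # number of cores
        print(f"N = {n:7.0e}   payload = {payload_performance(n, alpha):.3e} flop/s")

In both columns the output first grows linearly with t (respectively N) and then saturates toward the respective "speed of light".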

Fig. 1. The efficiency surface corresponding to the "modern paradigm" (see Table I): the dependence of E_HPL and E_HPCG on (1 − α_eff^HPL) and the number of cores N, together with the measured HPL and HPCG efficiencies of the TOP5 supercomputers of 2018.11 (Summit, Sierra, Taihulight, Tianhe-2, K computer). Recall that "this decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [25].

At low time (or speed) values there is no noticeable difference between the speeds calculated with or without the relativistic correction, but as the speed approaches the speed of light, the speed of the accelerated body saturates. Whether it is formulated as the mass increasing or as time slowing down, the essence is the same: when approaching the (specific) limits, the behavior of the subject drastically changes.

Performance measurements are simple time measurements: a standardized set of machine instructions is executed (a large number of times) and the known number of operations is divided by the measurement time; this holds both for single-processor and for distributed, parallelized sequential systems. In the latter case, however, the joint work must also be organized, which requires extra machine instructions and extra execution time. As Fig. 2 shows, in the parallel operating mode both the software and the hardware contribute to the execution time, i.e. both must be considered in Amdahl's Law.

This is the origin of the efficiency: one of the processors orchestrates the joint operation while the others are waiting.


Fig. 2. A general model of distributed parallel processing (α = Payload/Total): each of the processes P0–P4 comprises Access Initiation, Software Pre, OS Pre, the payload Process, OS Post, Software Post and Access Termination phases; the rest of the Total (Extended) time is spent just waiting. For better visibility, the lengths of the boxes are not proportional to the time the corresponding action needs. The inset compares the parameters of a 1993 system (N_cores = 10^3, (1 − α) = 10^-3, R_Max/R_Peak = 1/(N×(1−α)+α) = 0.5) with those of the 2018 Sunway/Taihulight (N_cores = 1.06×10^7, (1 − α) = 3.3×10^-8, R_Max/R_Peak = 0.74), for a total of 10^13 clock cycles. The data are taken from [25] and [26].

After reaching a certain number of processors there is no further increase in the payload fraction when adding more processors: the first fellow processor has already finished its task and is waiting while the last one is still waiting for the start command. The physical size of the computing system also matters: a processor connected by a cable tens of meters long to the first one must spend several hundred clock cycles waiting (not to mention geographically distributed computer systems, such as some clouds, connected through general-purpose networks). Detailed calculations are given in [27].
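As a rough, hedged illustration of the cable-length effect (the 30 m cable length, the ~2×10^8 m/s signal propagation speed and the 3 GHz clock rate are assumed example values, not taken from [27]):

$t_{prop} \approx \frac{30\,\mathrm{m}}{2\times10^{8}\,\mathrm{m/s}} = 150\,\mathrm{ns} \approx 450$ clock cycles at 3 GHz,

i.e. several hundred cycles pass before a signal even arrives, independently of any software overhead.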

The phenomenon itself has been known for decades [25]: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle. ... In fact, there comes a point when using more processors ... actually increases the execution time rather than reducing it." Since then, however, the theory has been almost forgotten, mainly due to the quick development of parallelization technology and the increase of single-processor performance.

During the past quarter of a century, the proportion of the contributions has changed considerably: today the number of processors is thousands of times higher than a quarter of a century ago, and the growing physical size and the higher processing speed have increased the role of the propagation delay. As a result of the technical development, the same phenomenon has returned, in a technically different form, at a much higher number of processors.

Due to this, the computing performance cannot be increased above the level defined by the parallelization technology and the number of processors (this is similar to stating that an object moving at the speed of light cannot be further accelerated). Exceeding a certain computing performance (using the classic paradigm and its classic implementation) is prohibited by the laws of nature.

The "speed of light" limit is specific to the different architectures, but the higher the achieved performance, the harder it is to increase it. It appears that in the feasibility studies an analysis of whether some inherent performance bound exists remained out of sight, whether in the USA [28], [29], in the EU [30], in Japan [31] or in China [32].

In Fig. 3, the development of the payload performance of some top supercomputers is depicted as a function of the year of construction. In Fig. 4, the dependence of payload performance on nominal performance is depicted for different benchmark types, specifically for the commonly utilized High Performance Linpack (HPL) and High Performance Conjugate Gradients (HPCG) benchmarks. All diagram lines clearly show the signs of saturation [33] (or rooflines [34]): the more inter-processor communication is needed for solving the task, the more "dense" is the environment and the smaller is the specific "speed of light". The third "roofline" is rather a guess [35] for the case of processor-based brain simulation, where communication higher by orders of magnitude is required. The payload performance of processor-based Artificial Intelligence tasks, given the amount of communication involved, is expected to lie between the latter two rooflines, closer to the HPCG level.


Fig. 3. The R_Max payload performance (in petaFLOPS) as a function of the year of construction (2010–2020) for different configurations (Summit, Sierra, Trinity, Taihulight, Tianhe-2A, K computer, Gyoukou, Piz Daint). The performances of all configurations seem to have their individual saturation value. The later a supercomputer appears in the competition, the smaller is the performance ratio with respect to its predecessor; the higher its rank, the harder it is to improve its performance.


Supercomputers are computing systems stretched to the limit [29], [36]. They are a reflection of engineering perfection which, in the case of the Summit supercomputer, showed an increase of 17% in computing performance when adding 5% more cores (and after fine-tuning following its quick startup) in its first half year after its appearance, and another 3.5% increase in performance when adding 0.7% more cores after another half year. The one-time appearance of Gyoukou is mystifying: it could catch slot #4 on the list using just 12% of its 20M cores (and was withdrawn immediately), although its explicit ambition was to be #1. This example shows that the lack of understanding of the behaviour of computing systems under extreme conditions led to false presumptions of fraud when reporting computing performances [37]. Another champion candidate, the supercomputer Aurora [38], was retargeted after years of building, just weeks before its planned startup. In November 2017 Intel announced that Aurora had been shifted to 2021; as part of the announcement, Knights Hill was canceled, to be replaced by a "new platform and new microarchitecture specifically designed for exascale". The lesson learned was that a specific design is needed for exascale.

As Figure 4 shows (and as is discussed in detail in [8]), the deviation from the linear dependence is unnoticeable at low performance values: here the "classic speed addition" is valid. At extremely large performance values, however, the dependence is strongly non-linear and specific to the measurement conditions: here the performance limits manifest themselves. The large scatter of the measured data is generated by the variance of different processor and connection types, design ideas, manufacturers, etc. However, the diagram clearly reflects the tendency described by the theory.

Fig. 4. The payload performance R_Max as a function of the nominal performance R_Peak (both in exaFLOPS), at different (1 − α_eff) values. The figure displays the measured values derived using the HPL (empty marks) and HPCG (filled marks) benchmarks for the TOP15 supercomputers (as of June 2019). The diagram lines marked as HPL and HPCG correspond to the behavior of the supercomputer Taihulight at (1 − α_eff) values of 3.3×10^-8 and 2.4×10^-5, respectively. The uncorrected values of the new supercomputers Summit and Sierra are shown as diamonds, and the same values corrected for single-processor performance are shown as rectangles. The black dots mark the performance data of the supercomputers JUQUEEN and K as of June 2014, for the HPL and HPCG benchmarks, respectively. The saturation effect can be observed for both HPL and HPCG benchmarks. The shaded area only highlights the nonlinearity. The red dot denotes the performance value of the system used by [39]. The dashed line shows the plausible saturation performance value of the brain simulation.

B. Analogy with general relativity

The mentioned decrease of the system's performance manifests itself in the appearance of the performance wall [8], [34], a new limitation due to parallelized sequential computing. To draw a parallel with modern physics, it is known that objects with extremely large masses behave differently from how we know they should behave under 'normal' conditions. In other words, the behavior of large-scale 'matter' largely deviates from that of small-scale 'matter'. One analogy to physics is 'dark silicon' [19], whose behaviour resembles that of 'dark matter': the (silicon) cores are there and usable, but (because of the thermal dissipation) a large number of cores behaves differently.

Moreover, parallel computing introduced "dark performance". Due to the classic computing principles, the first core must speak to all fellow cores, and this non-parallelizable fraction of the time increases with the number of cores [34], [40]. The result is that the top supercomputers show an efficacy of only around 1% when solving real-life tasks. Another analogy can be made with the "gravitational collapse": a "communicational collapse" is demonstrated in Fig. 5.(a) of [41], showing that at an extremely large number of cores, excessive communication intensity leads to unexpected and drastic changes of the network latency.
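For a rough orientation only (the pairing of N ≈ 10^7 cores with the HPCG-like value (1 − α_eff) ≈ 2.4×10^-5 quoted in the caption of Fig. 4 is an illustrative combination, not a measured data point):

$E = \frac{R_{Max}}{R_{Peak}} = \frac{1}{N(1-\alpha)+\alpha} \approx \frac{1}{10^{7}\cdot 2.4\times10^{-5}+1} \approx 0.4\,\%,$

which is the order of magnitude of the real-life efficacy quoted above.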



Fig. 5. How the speed of a body accelerated by g depends on the time, in the relativistic approach (curves v(t) for n = 1 and n = 2; time in seconds, speed in m/s); see also Table I. Compare to Figs. 4 and 6.

The propagation time of signals is also very similar to the finite propagation speed of physical fields, and even the latency time of the interfaces can be paired with the creation and attenuation of the physical carriers. Zero-time on/off signals are possible both in classical physics and in classical computing, while in the corresponding modern counterparts time must be accounted for to create, transfer and detect signals; the effect is noticed as the time delay through wires (in this extended sense) growing compared to the time of gating [7].

Figure 6 introduces a further analogy. The non-parallelizable fraction (denoted in the figure by α_eff^X) of the computing task comprises components of different origin. As already discussed, and as was noticed decades ago, "the inherent communication-to-computation ratio in a parallel application is one of the important determinants of its performance on any architecture" [25], suggesting that communication can be a dominant contribution to a system's performance. Figure 6.A displays the case of minimum communication, and Figure 6.B a moderately increased one (corresponding to real-life supercomputer tasks). As the nominal performance increases linearly while the efficiency decreases exponentially with the number of cores, at some critical value an inflection point occurs and the resulting performance starts to decrease. The resulting large non-parallelizable fraction strongly decreases the efficacy (or, in other words, the performance gain or speedup) of the system [8], [43]. The effect was noticed early [25], under different technical conditions, but was forgotten due to the successes of the development of parallelization technology.

Figure 6.A illustrates the behavior measured with the HPL benchmark. The looping contribution becomes remarkable around 0.1 Eflops and breaks down the payload performance when approaching 1 Eflops (see also Fig. 1 in [25]).

Fig. 6. Contributions (1 − α_eff^X) to (1 − α_eff^total) and the maximum payload performance R_Max of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz) as a function of the nominal performance R_Peak (in Eflop/s). The blue diagram line refers to the right-hand scale (R_Max values); all others (the (1 − α_eff^X) contributions: α_SW, α_OS, α_eff) refer to the left scale. Panel A shows the HPL-like case, panel B the HPCG-like case and panel C the neural-network (NN) case. The figure purely illustrates the concepts; the displayed numbers are only roughly similar to the real ones. The black dots mark the HPL and HPCG performance, respectively, of the computer used in works [39], [42].


In Figure 6.B the behavior measured with the HPCG benchmark is displayed. In this case the contribution of the application (brown line) is much higher, while the looping contribution (thin green line) is the same as above. Consequently, the achievable payload performance is lower and the breakdown of the performance is also softer.
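To make the shape of the curves in Fig. 6 tangible, the sketch below evaluates the scaling law of Table I with a non-parallelizable fraction that has a constant part plus a part growing with the number of cores (cf. footnote 1). The functional form and all coefficients are assumptions chosen only for illustration, not the authors' fitted model:

    # Illustrative only: a constant (SW + OS) contribution plus a contribution
    # growing with N makes R_Max first grow, then saturate, then break down.
    def one_minus_alpha(n_cores, fixed=2e-8, per_core=1e-15):
        # fixed and per_core are hypothetical coefficients, chosen to make
        # the shape visible around 10^7..10^8 cores.
        return fixed + per_core * n_cores

    def r_max(n_cores, perf_single=1e9):
        """Payload performance of N cores (modern scaling law of Table I)."""
        na = one_minus_alpha(n_cores)
        return n_cores * perf_single / (n_cores * na + (1.0 - na))

    if __name__ == "__main__":
        for n in (1e5, 1e6, 1e7, 1e8, 1e9):
            print(f"N = {n:8.0e}   R_Max = {r_max(n):.3e} flop/s")

With these example numbers the payload performance peaks at a few Pflop/s around 10^7–10^8 cores and then decreases, qualitatively reproducing the breakdown discussed above.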

C. Analogy with quantum physics

Electronic computers are clock-driven systems, i.e. no action can happen in a time period shorter than the length of one clock period. The typical value of that "quantum of time" today is in the nanosecond range, so in everyday practice time seems to be continuous and the "quantal nature of time" cannot be noticed. Some (sequential) non-payload fragment of the total time is always present in parallelized sequential systems: it cannot be smaller than the ratio of the length of two clock periods to the total measurement time, since forking and joining the other threads cannot be shorter than one clock period. Unfortunately, the technical implementation needs a time about ten thousand times longer [44], [45].

The total time of the performance measurement is large (typically hours) but finite, so the non-parallelizable fraction is small but finite. As discussed, the latter increases with the number of cores in the system. Because of this, in contrast with the statement in [25] that "the serial fraction ... is a diminishing function of the problem size", at a sufficiently large number of cores the serial fraction even starts to dominate. Because of Amdahl's Law, the absolute value of the computing performance of parallelized systems inherently has an upper limit, and the efficiency is the lower, the higher the number of aggregated processing units.
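A back-of-the-envelope bound, assuming round example values (a 1 ns clock period and a one-hour measurement): the "quantum of time" alone sets

$(1-\alpha)_{min} \approx \frac{2 \times 1\,\mathrm{ns}}{3600\,\mathrm{s}} \approx 5.6\times10^{-13},$

and the roughly 10^4-fold overhead of the technical implementation mentioned above raises this floor to the order of 10^-8, the same order as the best (1 − α_eff) values quoted for supercomputers in Fig. 4.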

Processor-based brain simulation provides "experimental evidence" [35] that time in computing shows quantal behavior, analogously with energy in physics. When simulating neurons using processors, the ratio of the simulated (biological) time and the processor time used to simulate the biological effect may differ considerably, so, to avoid working with "signals from the future", periodic synchronization is required, which introduces a special "biological clock cycle". The role of this clock period is the same as that of the clock signal in clocked digital electronics: whatever happens within this period happens "at the same time"².

Brain simulation (and, on a somewhat smaller scale, artificial neural computing) requires intensive data exchange between the parallel threads: the neurons are expected to tell the result of their neural calculations periodically to thousands of fellow neurons. The commonly used 1 ms "grid time" is, however, 10^6 times longer than the 1 ns clock cycle common in digital electronics [39]. Correspondingly, its influence on the performance is noticeable.

² This periodic synchronization will be a limiting factor in the large-scale utilization of processor-based artificial neural chips [46], although, thanks to the ca. thousand times higher "single-processor performance", only when approaching the computing capacity of (part of) the brain.

Figure 6.C demonstrates what happens if the clock cycle is 5000 times longer than in Figure 6.B: it causes a drastic decrease in the achievable performance and strongly shifts the performance breakdown toward lower nominal performance values. As shown, the "quantal nature of time" in computing changes the behavior of the performance drastically.

In addition, the thousands of times more communication contributes considerably to the non-payload, sequential-only fraction, so it further degrades the efficacy of the computing system. What is worse, the neurons are expected to send their messages at the end of the grid time period, causing a huge burst of messages.

Not only is the achievable performance lower by orders of magnitude, but the "communicational collapse" (see also [25]) also occurs at an orders-of-magnitude lower nominal performance. This is the reason why less than one percent of the planned capacity can be achieved even by custom-built large-scale ANN simulators [47]. Similarly, the SW- and HW-based simulators show the same limitation [35], [39]. This is why only a few tens of thousands of neurons can be simulated on processor-based brain simulators (including both the many-thread software simulators and the purpose-built brain simulator) [39]. The memory of extremely large supercomputers can be populated with objects simulating neurons [48], but as soon as they need to communicate, the task collapses, as predicted in Fig. 6. This is indirectly underpinned [42] by the facts that a different handling of the threads changes the efficacy sensitively and that the time required for more detailed simulation increases non-linearly [35], [39].

D. Analogy with the interactions of particles

In classic computing, the processors' ability to communicate with each other is not a native feature: in the Single Processor Approach, questions like sending messages to and receiving them from some other party, as well as sharing resources, make no sense at all (as no other party exists); messaging is imitated, very ineffectively, by SW in the layer between the HW and the real SW. This feature alone prevents building exascale supercomputers [8]: after reaching a critical number of processors, adding more processors leads to a decreasing resulting performance [8], [35], as experienced in [25], [42], and caused demonstrative failures such as the cases of Gyoukou, Aurora and SpiNNaker. This critical number (using the present technology and implementation) is under 10M cores; the only exception (as of the end of 2019) is Taihulight, because it uses a slightly different computing principle with its cooperating processors [49].

Owing to the laws of parallel computing, the actual behavior of computing systems differs from that expected on the basis of classical computing the more, the more communication takes place [34]. Similarly, in physics, the behavior of an atom is strongly changed by the interaction (communication) with other particles.

This phenomenon cannot be explained in the "classic computing" frame. The limits of single-processor performance enforced by the laws of nature [7] are topped by the limitations of parallel computing [8], [34].


They are further limited by introducing the "biological clock period" [35]. Notice that these contributions compete with each other; the actual circumstances decide which of them will dominate. Their effect, however, is very similar: according to Amdahl, whatever is not parallel is qualified as sequential.

E. Analogy with the uncertainty principle

Even the quantum-physical uncertainty principle, which states that (unlike in classical physics) one cannot measure certain pairs of physical properties of a particle (like position and momentum) accurately at the same time, has its counterpart in computing. Using registers (and caches and pipelines), one can perform computations at a much higher speed, but to service an interrupt one has to save/restore registers and renew the cache content. The case is similar with accelerators: copying data from one memory to another or dealing with coherence increases latency. That is, one cannot have low latency and high performance at the same time using the same processor design principles. The same processor design principles cannot be equally good for preparing high-performance single-thread applications and high-performance parallelized sequential systems.

III. THE CLASSIC VERSUS MODERN PARADIGM

Today we have extremely inexpensive (and, at the same time, extremely complex and powerful) processors around (a "free resource" [47]), and we have come to the age when no additional reasonable functionality can be implemented in processors by adding more transistors; the over-engineered processors optimized for the single-processor regime do not enable reducing the clock period [50]. The computing power hidden in many-core processors cannot be utilized effectively for payload work because of the "power wall" (partly because of the improper working regime [51]): we have arrived at the age of "dark silicon" [19], and we have "too many" processors [16] around. Supercomputers face critical efficiency and performance issues; real-time (especially cyber-physical) systems experience serious predictability, latency and throughput issues; in summary, the computing performance (without changing the present paradigm) has reached its technological bounds. Computing needs renewal [11]. Our proposal, the Explicitly Many-Processor Approach (EMPA) [52], is to introduce a new computing paradigm and, through it, to reshape the way in which computers are designed and used today.

A. Overview of the modern paradigm

The new paradigm is based on making fine distinctions at specific points which are also present in the old paradigm. Those points, however, must be scrutinized in all occurring cases, asking whether and for how long they can be neglected. These points are:

• consider explicitly that not only one processor (aka Central Processing Unit) exists, i.e.
  – the processing capability (akin to the data storage capability) is one of the resources rather than a central singleton
  – not necessarily the same processing unit (out of the several identical ones) is used to solve all parts of the problem
  – a kind of redundancy (an easy method of replacing a flawed processing unit) is provided through using virtual processing units (mainly to increase the mean time between technical errors), like [53], [54]
  – different processors can and must cooperate in solving a task, i.e. direct data and control exchange among the processing units is made possible; the ability to communicate with other processing units, similar to [49], is a native feature
  – flexibility for making ad-hoc assemblies for more efficient processing is provided
  – the large number of processors is used for unusual tasks, such as replacing memory operations with using additional processors
• the misconception of the segregated computer components is reinterpreted
  – the efficacy of utilization of the several processors is increased by using multi-port memories (similar to [55])
  – a "memory only" concept (somewhat similar to that in [56]) is introduced (as opposed to the "registers only" concept), using purpose-oriented, optionally distributed, partly local memory banks
  – the principle of locality is introduced into memory handling at the hardware level, through introducing hierarchic buses
• the misconception of the "sequential only" execution [57] is reinterpreted
  – von Neumann required "proper sequencing" only for a single processing unit; this is extended to several processing units
  – the tasks are broken into reasonably sized and logically interconnected fragments, rather than being unreasonably fragmented by the scheduler
  – the "one-processor-one-process" principle remains valid for the task fragments, but not necessarily for the complete task
  – the fragments can be executed in parallel if both data dependence and hardware availability enable it (another kind of asynchronous computing [58])
• a closer hardware/software cooperation is elaborated
  – the hardware and software only exist together (akin to "stack memory")
  – when a hardware unit has no duty, it can sleep ("does not exist" and does not take power)
  – the overwhelming part of the duties of synchronization, scheduling, etc. of the OS are taken over by the hardware
  – the compiler helps the processor with compile-time information, and the processor is able to adapt (configure) itself to the task depending on the actual hardware availability


  – strong support for multi-threading and resource sharing, as well as low real-time latency, is provided at the processor level
  – the internal latency of assembled large-scale systems is much reduced, while their performance is considerably enhanced
  – the task fragments are able to return control voluntarily, without the intervention of the operating system (OS), enabling more effective and simpler operating systems to be implemented

B. Details of the concept

Our proposal introduces a new concept that permits working with virtual processors at the programming level and mapping them to physical cores at run-time, i.e. letting the computing system adapt itself to the task. A major idea of EMPA (for an early and less mature version see [52]) is the use of the quasi-thread (QT) as the atomic unit of processing, which comprises both the HW (the physical core) and the SW (the code fragment running on the core). This idea was derived with the best features of both the HW core and the SW thread in mind. In analogy with physics, the QTs have a "dual nature": in the HW world of "classic computing" they are represented as a 'core', in the SW world as a 'thread'. However, they are the same entity in the sense of 'modern computing'. The terms 'core' and 'thread' are borrowed from conventional computing, but in 'modern computing' they can actually exist only together, in a time-limited way³. EMPA is a new computing paradigm that needs a new underlying architecture, rather than a new kind of parallel processing running on a conventional architecture, so it can be reasonably compared to the terms and ideas used in conventional computing only in a very limited way, although many of its ideas and solutions are adapted from 'classic computing'.

The executable task is broken into reasonably sized and loosely dependent QTs. (The QTs can optionally be nested into each other, akin to subroutines.) In EMPA, for every new QT a new independent processing unit (PU) is also implied; the internals (PC and part of the registers) are set up properly, and they execute their task independently⁴ (but under the supervision of the processor comprising the cores).

In other words: the processing capacity is considered a resource in the same sense as the memory is considered a storage resource. This approach enables the programmer to work with virtual processors (mapped to physical PUs by the computer at run-time).

³ Akin to dynamic variables on the stack: their lifetime is limited to the period when the HW and SW are properly connected. The physical memory is always there, but it is "stack memory" only when properly handled by the HW/SW components.

⁴ Although the idea of executing a single-thread task "in pieces" may look strange at first, actually the same happens when the OS schedules/blocks a task. The key differences are that in EMPA it is not the same processor that is used, the QTs are cut into fragments in a reasonable way (preventing issues like priority inversion [9]), and the QTs can be processed at the same time as long as their mathematical dependence and the actual HW resource availability enable it.

The programmer can use the quick resource, the PUs, to replace uses of the slow resource, the memory (say, hiring a quick processor from a core pool can be competitive with saving and restoring registers in the slow memory, for example when making a recursive call). The third major idea is that the PUs can cooperate in various ways, including data and control synchronization, as well as outsourcing part of the received job (received as a nested QT) to a helper core. An obvious example is to outsource the housekeeping activity of loop organization to a helper core: counting, addressing, comparing, etc. can be done by a helper core, while the main calculation remains with the originally delegated core. As the mapping to physical cores occurs at runtime, the processor can (depending on the actual HW availability) eliminate the (maybe temporarily) denied cores, as well as adapt the resource needs of the task (requested by the compiler) to the actual computing resource availability.

The processor has an additional control layer for organizing the joint work of its cores. The cores have just a few extra communication signals and are able to execute both conventional and so-called meta-instructions (for configuring the architecture). The latter are executed in a co-processor style: when finding a meta-instruction, the core notifies the processor, which suspends the conventional operation of the core, controls the execution of the meta-instruction (utilizing the resources of the core, providing helper cores and handling the connections between the cores as requested), then resumes core operation.

The processor needs to find the needed PUs (cores), and the processing ability has to accommodate to the task: quickly, flexibly, effectively and inexpensively. A kind of 'on-demand' computing that works 'as-a-service'. This is a task not only for the processor: the complete computing system must participate, and for that goal the complete computing stack must be rebuilt.

Behind the former attempts to optimize code execution inside the processor there was no established theory, and they actually could achieve only moderate success, because in SPA the processor works in real time and has not enough resources, knowledge and time to discover those options completely [59]. In classic computing, the compiler can find out everything about enhancing the performance, but has no information about the actual run-time HW availability; furthermore, it has no way to tell its findings to the processor. The processor has the HW availability information, but has to "reinvent the wheel" with respect to enhancing performance, in real time. In EMPA, the compiler puts its findings into the executable code in the form of meta-instructions ("configware"), and the actual core executes them with the assistance of the new control layer of the processor. The processor can choose from those options, considering the actual HW availability, in a style 'if NeededNumberOfResourcesAvailable then Method1 else Method2', maybe nested into one another.
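As a toy illustration only (this is not the EMPA implementation; the names, the core-pool abstraction and the two 'methods' are hypothetical), the compile-time choice described above can be pictured as an executable carrying alternative plans, with the selection made against the actual resource availability:

    # Toy sketch of the "if NeededNumberOfResourcesAvailable then Method1
    # else Method2" idea. All names and structures are hypothetical
    # illustrations, not EMPA APIs.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Plan:
        needed_pus: int            # processing units the plan would like to use
        run: Callable[[], str]     # the code-fragment variant to execute

    def method1() -> str:
        return "executed with a helper core doing the loop housekeeping"

    def method2() -> str:
        return "executed on a single processing unit (fallback)"

    def choose_and_run(plans: List[Plan], available_pus: int) -> str:
        """Pick the first plan whose resource request fits the availability."""
        for plan in plans:                 # plans are ordered by preference
            if plan.needed_pus <= available_pus:
                return plan.run()
        raise RuntimeError("no feasible plan")

    # 'Configware' prepared at compile time: preferred plan first, fallback second.
    plans = [Plan(needed_pus=2, run=method1), Plan(needed_pus=1, run=method2)]

    print(choose_and_run(plans, available_pus=4))   # -> helper-core variant
    print(choose_and_run(plans, available_pus=1))   # -> single-PU fallback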


C. Some advantages of EMPA

The approach results in several considerable advantages, but the page limit forces us to mention just a few:

• as a new QT receives a new Processing Unit (PU), there is no need to save/restore registers and the return address (less memory utilization and fewer instruction cycles)
• the OS can receive its own PU, which is initialized in kernel mode and can promptly (i.e. without the need for a context change) service the requests from the requestor core
• for resource sharing, a PU can temporarily be delegated to protect a critical section; the next call to run the code fragment with the same offset will be delayed until the processing by the first PU terminates
• the processor can natively accommodate to the variable need for parallelization
• the cores that are actually out of use wait in a low-energy-consumption mode
• the hierarchic core-to-core communication greatly increases the memory throughput
• the asynchronous-style computing [60] largely reduces the loss due to the gap [61] between the speed of the processor and that of the memory
• the direct core-to-core connection (more dynamic than in [49]) greatly enhances efficacy in large systems [62]
• the thread-like feature to fork() and the hierarchic buses change the dependence on the number of cores from linear to logarithmic [8], enabling truly exa-scale supercomputers to be built; see the estimate below
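As a rough illustration of what the change from linear to logarithmic dependence means (N = 10^6 cores is an arbitrary example):

$N = 10^{6} \quad\text{(linear growth of the overhead)} \quad\text{vs.}\quad \log_{2} N \approx 20 \quad\text{(logarithmic growth)},$

i.e. the overhead factor differs by more than four orders of magnitude.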

The very first version of EMPA [11] was implemented in the form of a simple (practically untimed) simulator [63]; now an advanced (Transaction-Level Modelled) simulator is being prepared in SystemC. The initial version adapted Y86 cores [64], the new one RISC-V cores. Partial solutions are also modeled in FPGA.

SUMMARY

Today's computing is more and more typically utilized under extreme conditions: providing extremely low latency in interrupt-driven systems, extremely large computing performance in parallelly working systems, relatively high performance on a complex computer system for a long time, or servicing requests in an energy-aware mode. To some measure, these activities are solved under the umbrella of the old paradigm for non-extreme-scale systems. Those experiences, however, must be reviewed when working with extremely large systems, because the scaling is as nonlinear as the experienced phenomena. Consequently, by scrutinizing the details of the basic principles of computing, a "modern computing paradigm" can be constructed that is able to explain the new extreme-condition phenomena on the one hand, and enables computing systems with much more advantageous features to be built on the other.

REFERENCES

[1] J. P. Eckert and J. W. Mauchly, "Automatic High-Speed Computing: A Progress Report on the EDVAC," Moore School Library, University of Pennsylvania, Philadelphia, Tech. Rep. Report of Work under Contract No. W-670-ORD-4926, Supplement No 4, September 1945.

[2] J. von Neumann, "First Draft of a Report on the EDVAC," http://www.wiley.com/legacy/wileychi/wang_archi/supp/appendix_a.pdf, 1945.

[3] M. R. Williams, "The Origins, Uses, and Fate of the EDVAC," IEEE Ann. Hist. Comput., vol. 15, no. 1, pp. 22–38, Jan. 1993. [Online]. Available: http://dx.doi.org/10.1109/85.194089

[4] M. D. Godfrey and D. F. Hendry, "The Computer as von Neumann Planned It," IEEE Annals of the History of Computing, vol. 15, no. 1, pp. 11–21, 1993.

[5] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, "Understanding sources of inefficiency in general-purpose chips," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 37–47. [Online]. Available: http://doi.acm.org/10.1145/1815961.1815968

[6] US DOE Office of Science, "Report of a Roundtable Convened to Consider Neuromorphic Computing Basic Research Needs," https://science.energy.gov/~/media/bes/pdf/reports/2016/NCFMtSA_rpt.pdf, 2015.

[7] I. Markov, "Limits on fundamental limits to computation," Nature, vol. 512(7513), pp. 147–154, 2014.

[8] J. Vegh, J. Vasarhelyi, and D. Drotos, "The performance wall of large parallel computing systems," in Lecture Notes in Networks and Systems 68. Springer, 2019, pp. 224–237.

[9] O. Babaoglu, K. Marzullo, and F. B. Schneider, "A formalization of priority inversion," Real-Time Systems, vol. 5, no. 4, pp. 285–303, Oct 1993. [Online]. Available: https://doi.org/10.1007/BF01088832

[10] S(o)OS project, "Resource-independent execution support on exa-scale systems," http://www.soos-project.eu/index.php/related-initiatives, 2010.

[11] J. Vegh, Renewing computing paradigms for more efficient parallelization of single-threads, ser. Advances in Parallel Computing. IOS Press, 2018, vol. 29, ch. 13, pp. 305–330.

[12] P. Cadareanu et al., "Rebooting our computing models," in Design, Automation and Test in Europe Conference and Exhibition, 2019, pp. 1469–1476.

[13] G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," in AFIPS Conference Proceedings, vol. 30, 1967, pp. 483–485.

[14] M. D. Godfrey, "Innovation in Computational Architecture and Design," ICL Technical Journal, vol. 5, pp. 18–31, 1986.

[15] S. H. Fuller and L. I. Millett, Eds., The Future of Computing Performance: Game Over or Next Level? National Academies Press, Washington, 2011.

[16] A. Mendelson, "How many cores are too many cores?," 2007.

[17] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, vol. 41, no. 7, pp. 33–38, 2008.

[18] M. Feldman, "Exascale Is Not Your Grandfather's HPC," https://www.nextplatform.com/2019/10/22/exascale-is-not-your-grandfathers-hpc/, 2019.

[19] M. Shafique and S. Garg, "Computing in the dark silicon era: Current trends and research challenges," IEEE Design and Test, vol. 34, no. 2, pp. 8–23, Apr. 2017.

[20] T. Ungerer, "Multi-core execution of hard real-time applications supporting analyzability," IEEE Micro, vol. 99, pp. 66–75, 2010.

[21] T. M. Conte, E. P. Debenedictis, R. S. Williams, and M. D. Hill, "Challenges to Keeping the Computer Industry Centered in the US," https://arxiv.org/abs/1706.10267, 2017.

[22] J. A. Chandy and J. Singaraju, "Hardware parallelism vs. software parallelism," in Proceedings of the First USENIX Conference on Hot Topics in Parallelism, ser. HotPar'09. Berkeley, CA, USA: USENIX Association, 2009, pp. 2–2.

[23] IEEE Spectrum, "Two Different Top500 Supercomputing Benchmarks Show Two Different Top Supercomputers," https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers, 2017.

[24] J. Vegh, P. Molnar, and J. Vasarhelyi, "A figure of merit for describing the performance of scaling of parallelization," CoRR, vol. abs/1606.02686, 2016. [Online]. Available: http://arxiv.org/abs/1606.02686


[25] J. P. Singh, J. L. Hennessy, and A. Gupta, "Scaling parallel programs for multiprocessors: Methodology and examples," Computer, vol. 26, no. 7, pp. 42–50, Jul. 1993.

[26] J. Dongarra, "Report on the Sunway TaihuLight System," University of Tennessee Department of Electrical Engineering and Computer Science, Tech. Rep. UT-EECS-16-742, June 2016. [Online]. Available: http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

[27] J. Vegh and P. Molnar, "How to measure perfectness of parallelization in hardware/software systems," in 18th Internat. Carpathian Control Conf. ICCC, 2017, pp. 394–399.

[28] US Government NSA and DOE, "A Report from the NSA-DOE Technical Meeting on High Performance Computing," https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf, December 2016.

[29] R. F. Service, "Design for U.S. exascale computer takes shape," Science, vol. 359, pp. 617–618, 2018.

[30] European Commission, "Implementation of the Action Plan for the European High-Performance Computing strategy," http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269, 2016.

[31] Extremetech, "Japan Tests Silicon for Exascale Computing in 2021," https://www.extremetech.com/computing/272558-japan-tests-silicon-for-exascale-computing-in-2021, 2018.

[32] Liao, Xiang-ke, Lu, Kai, Yang, Can-qun, Li, Jin-wen, Yuan, Yuan, Lai, Ming-che, Huang, Li-bo, Lu, Ping-jing, Fang, Jian-bin, Ren, Jing, and Shen, Jie, "Moving from exascale to zettascale computing: challenges and techniques," Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 10, pp. 1236–1244, Oct 2018. [Online]. Available: https://doi.org/10.1631/FITEE.1800494

[33] P. J. Denning and T. Lewis, "Exponential Laws of Computing Growth," Communications of the ACM, pp. 54–65, Jan. 2017.

[34] J. Vegh, "The performance wall of parallelized sequential computing: the roofline of supercomputer performance gain," Parallel Computing, in review, http://arxiv.org/abs/1908.02280, 2019.

[35] J. Vegh, "How Amdahl's Law limits the performance of large artificial neural networks: (Why the functionality of full-scale brain simulation on processor-based simulators is limited)," Brain Informatics, vol. 6, pp. 1–11, 2019.

[36] K. Bourzac, "Stretching supercomputers to the limit," Nature, vol. 551, pp. 554–556, 2017.

[37] The Japan Times, "Chief of firm behind world's fourth-fastest supercomputer arrested in Tokyo for alleged fraud," https://www.japantimes.co.jp/news/2017/12/05/national/crime-legal/chief-firm-behind-worlds-fourth-fastest-supercomputer-arrested-tokyo-alleged-fraud/#.WmQ-KXRG3CI, 2017.

[38] Inside HPC, "Is Aurora Morphing into an Exascale AI Supercomputer?," https://insidehpc.com/2017/06/told-aurora-morphing-novel-architecture-ai-supercomputer/, 2017.

[39] S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B. Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, "Performance Comparison of the Digital Neuromorphic Hardware SpiNNaker and the Neural Network Simulation Software NEST for a Full-Scale Cortical Microcircuit Model," Frontiers in Neuroscience, vol. 12, p. 291, 2018.

[40] J. Vegh, The performance wall of the parallelized sequential computing – Can parallelization save the (computing) world?, 1st ed. Lambert Academic Publishing, 2019.

[41] S. Moradi and R. Manohar, "The impact of on-chip communication on memory technologies for neuromorphic systems," Journal of Physics D: Applied Physics, vol. 52, no. 1, p. 014003, Oct. 2018.

[42] T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, "Constructing Neuronal Network Models in Massively Parallel Environments," Frontiers in Neuroinformatics, vol. 11, p. 30, 2017.

[43] J. Vegh, "How Amdahl's law restricts supercomputer applications and building ever bigger supercomputers," CoRR, vol. abs/1708.01462, 2018. [Online]. Available: http://arxiv.org/abs/1708.01462

[44] D. Tsafrir, "The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)," in Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007, pp. 3–3.

[45] F. M. David, J. C. Carlyle, and R. H. Campbell, "Context switch overheads for linux on arm platforms," in Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1281700.1281703

[46] M. Davies et al., "Loihi: A Neuromorphic Manycore Processor with On-Chip Learning," IEEE Micro, vol. 38, pp. 82–99, 2018.

[47] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, "Overview of the SpiNNaker System Architecture," IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.

[48] S. Kunkel, M. Schmidt, J. M. Eppler, H. E. Plesser, G. Masumoto, J. Igarashi, S. Ishii, T. Fukai, A. Morrison, M. Diesmann, and M. Helias, "Spiking network simulation code for petascale computers," Frontiers in Neuroinformatics, vol. 8, p. 78, 2014.

[49] F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, "Cooperative computing techniques for a deeply fused and heterogeneous many-core processor architecture," Journal of Computer Science and Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.

[50] M. Schlansker and B. Rau, "EPIC: Explicitly Parallel Instruction Computing," Computer, vol. 33, no. 2, pp. 37–45, Feb 2000.

[51] L. A. Barroso and U. Hölzle, "The Case for Energy-Proportional Computing," Computer, vol. 40, pp. 33–37, 2007.

[52] J. Vegh, "Introducing the Explicitly Many-Processor Approach," Parallel Computing, vol. 75, pp. 28–40, 2018.

[53] ARM. (2011) big.LITTLE technology. [Online]. Available: https://developer.arm.com/technologies/big-little

[54] J. Cong et al., "Accelerating Sequential Applications on CMPs Using Core Spilling," Parallel and Distributed Systems, vol. 18, pp. 1094–1107, 2007.

[55] Cypress, "CY7C026A: 16K x 16 Dual-Port Static RAM," http://www.cypress.com/documentation/datasheets/cy7c026a-16k-x-16-dual-port-static-ram, 2015.

[56] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad memory: Design alternative for cache on-chip memory in embedded systems," in Proceedings of the Tenth International Symposium on Hardware/Software Codesign, ser. CODES '02. New York, NY, USA: ACM, 2002, pp. 73–78. [Online]. Available: http://doi.acm.org/10.1145/774789.774805

[57] J. Backus, "Can Programming Languages Be Liberated from the von Neumann Style? A Functional Style and its Algebra of Programs," Communications of the ACM, vol. 21, pp. 613–641, 1978.

[58] P. Gohil, J. Horn, J. He, A. Papageorgiou, and C. Poole, "IBM CICS Asynchronous API: Concurrent Processing Made Simple," http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf, 2017.

[59] D. W. Wall, "Limits of instruction-level parallelism," New York, NY, USA, pp. 176–188, Apr. 1991. [Online]. Available: http://doi.acm.org/10.1145/106974.106991

[60] IBM, "IBM CICS Asynchronous API: Concurrent Processing Made Simple," http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf, 2019.

[61] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, "Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications?," Commun. ACM, vol. 58, no. 5, pp. 77–86, Apr. 2015. [Online]. Available: http://doi.acm.org/10.1145/2742910

[62] Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, "Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer," ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp. 11:1–11:20, Mar. 2018.

[63] J. Vegh, "EMPAthY86: A cycle accurate simulator for Explicitly Many-Processor Approach (EMPA) computer," Jul. 2016. [Online]. Available: https://github.com/jvegh/EMPAthY86

[64] R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective. Pearson, 2014.